
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DATA SCIENCE)

Academic Year 2025-2026

BIG DATA ANALYTICS
LAB MASTER MANUAL

Prepared By:
K. Keerthi Reddy, Assistant Professor, CSD
List of Experiments

1. Create a Hadoop cluster

2. Implement a simple map-reduce job that builds an inverted index on the set of input documents
(Hadoop)
3. Process big data in HBase

4. Store and retrieve data in Pig

5. Perform data analysis using MongoDB

6. Using Power Pivot (Excel), perform the following on any dataset: a. Big Data Analytics, b. Big
Data Charting
Hadoop installation steps
Prerequisite
1. $ sudo apt update
2. Install Java JDK 11
sudo apt install openjdk-11-jdk

Note: To check the installed JDK, run ls /usr/lib/jvm in the terminal and confirm that java-11-openjdk-amd64 is listed.

Installation
2. In your home directory, type ls -a and find the .bashrc hidden file.
3. open .bashrc file with nano or vi editor.
4. paste the following commands in .bashrc file for path setting
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-11-openjdk-amd64/bin
export HADOOP_HOME=~/hadoop-3.2.3/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh

5. Apply the changes by executing the source ~/.bashrc command.


6. Install SSH (Secure Shell, a protocol used to securely connect to a remote server/system;
it transfers data in encrypted form):
sudo apt-get install ssh
7. Download the Hadoop tar file (for example hadoop-3.2.3) from: hadoop.apache.org
8. extract the tar file:

tar -zxvf ~/Downloads/hadoop-3.2.3.tar.gz (Extract the tar file)

9. Go to the configuration directory: cd hadoop-3.2.3/etc/hadoop


10. open hadoop-env.sh
sudo nano hadoop-env.sh (file opening command)
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 (set the path for JAVA_HOME)
11. open core-site.xml
sudo nano core-site.xml (file opening command) paste the following code
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

<property>
<name>hadoop.proxyuser.dataflair.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.dataflair.hosts</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.server.hosts</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.server.groups</name>
<value>*</value>
</property>
</configuration>

12. open hdfs-site.xml


sudo nano hdfs-site.xml (file opening command) paste the following code
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

13. open mapred-site.xml


sudo nano mapred-site.xml (file opening command) paste the following code
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
<name>mapreduce.application.classpath</name>

<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>

14. open yarn-site.xml


sudo nano yarn-site.xml (file opening command) paste the following code
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

15. Establish the secured connection with SSH by executing the following commands:
ssh localhost
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

16. Format the NameNode of HDFS (this initializes the file system):

hadoop-3.2.3/bin/hdfs namenode -format


17. type the command: export PDSH_RCMD_TYPE=ssh

18. Starting Apache Hadoop 3 launches the following daemons:


● NameNode
● DataNode
● Secondary Name Node
● Resource Manager
● Node Manager
19. Change to the directory that contains the daemon start/stop scripts: cd ~/hadoop-3.2.3/sbin/

20. ./start-dfs.sh (It will start the NameNode, DataNode and SecondaryNameNode.)

21. ./start-yarn.sh (It will start the YARN ResourceManager and NodeManagers.)

22. ./start-all.sh (Starts all daemons at once; an alternative to running the two scripts above.)


23. some hdfs commands
$ hadoop fs -mkdir /user (create a directory)
$ hadoop fs -mkdir /user/aadi (create a sub-directory)
$ hadoop fs -put demo.csv /user/aadi (copy data from local to hdfs)
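
A few more commonly used HDFS commands, shown with the same example paths for illustration:
$ hadoop fs -ls /user/aadi (list the contents of an HDFS directory)
$ hadoop fs -cat /user/aadi/demo.csv (print a file stored in HDFS)
$ hadoop fs -get /user/aadi/demo.csv ~/demo_copy.csv (copy a file from HDFS back to the local file system)
$ hadoop fs -rm -r /user/aadi (remove a directory and its contents from HDFS)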

24. to check the started daemons: jps

25. You can also access the NameNode and YARN Resource Manager through browsers
(Google Chrome/Mozilla Firefox). The Hadoop NameNode UI runs on default port 9870. Run
http://localhost:9870/ in the browser.

26. We can get information about the cluster and all applications from the YARN ResourceManager
UI on port 8088 (http://localhost:8088/); the NodeManager UI on port 8042 (http://localhost:8042/)
shows the containers running on that node.

27. To get details of a Hadoop DataNode you can access port 9864. Run http://localhost:9864/ in
the browser.

28. stop-all.sh (to stop all daemons)


1. Create a Hadoop Cluster
Creating a Hadoop cluster involves several steps, which typically include setting up a cluster of machines,
installing Hadoop, and configuring the necessary components to work together as a distributed system. Here's a
high-level overview of how to set up a Hadoop cluster:

Prerequisites
 Hardware: Multiple machines (physical or virtual) to serve as nodes in your cluster. At a minimum,
you’ll need:
o 1 NameNode (Master node)
o 2 or more DataNodes (Worker nodes)
 Software: A supported version of Hadoop, Java (typically OpenJDK or Oracle JDK), and a Linux-based
operating system (like Ubuntu or CentOS).
 Network Configuration: All machines need to be on the same network and able to communicate with
each other.
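
For example, a minimal /etc/hosts mapping that every node could share might look like this (the hostnames and private IP addresses below are hypothetical):

192.168.1.10  namenode
192.168.1.11  datanode1
192.168.1.12  datanode2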

Steps to Set Up a Hadoop Cluster


1. Set Up Java
Hadoop requires Java to run. You need to install Java on all nodes of the cluster.

 Install Java (Ubuntu example):

sudo apt update
sudo apt install openjdk-8-jdk

 Verify Java Installation:

java -version
2. Create a Hadoop User
It's best practice to run Hadoop as a non-root user. You’ll create a Hadoop user to avoid running the processes
as root.

sudo useradd -m hadoop
sudo passwd hadoop
3. Download Hadoop
 Download the Hadoop binary distribution from the official Apache Hadoop website:
https://hadoop.apache.org/releases.html For example:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.x.x/hadoop-3.x.x.tar.gz
tar -xvzf hadoop-3.x.x.tar.gz

 Move Hadoop to a directory:


sudo mv hadoop-3.x.x /opt/hadoop
4. Set Environment Variables
You’ll need to configure the environment variables to point to your Hadoop installation.

 Edit the .bashrc file of the Hadoop user (use the user you created earlier):

sudo nano /home/hadoop/.bashrc

 Add the following environment variables:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

 Reload the .bashrc file:

source /home/hadoop/.bashrc
5. Configure Hadoop Cluster
 core-site.xml (Hadoop Core Settings) This file contains the configuration for the Hadoop cluster’s core
settings. You will configure properties like the default filesystem URI.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode_host:9000</value>
</property>
</configuration>

 hdfs-site.xml (HDFS Configuration) This file contains the settings for HDFS storage.

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/hdfs/datanode</value>
</property>
</configuration>

 mapred-site.xml (MapReduce Configuration) This file contains the MapReduce settings.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

 yarn-site.xml (YARN Configuration) This file contains the settings for YARN, which handles job
scheduling.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager_host:8032</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanager_host:8025</value>
</property>
</configuration>
6. Format the NameNode
Before starting the Hadoop cluster, you must format the NameNode to initialize the HDFS file system.

hdfs namenode -format
7. Start Hadoop Cluster
You can now start the Hadoop daemons on the cluster. You’ll typically use the following commands:

 Start the HDFS daemons:

start-dfs.sh

 Start the YARN daemons:

start-yarn.sh
8. Verify the Cluster
 To verify that your Hadoop cluster is running properly, you can visit the web UI:
o HDFS Web UI: http://<namenode_host>:9870 (port 50070 on Hadoop 2.x)
o YARN ResourceManager UI: http://<resourcemanager_host>:8088
9. Add Nodes to the Cluster
 For each DataNode, add its hostname or IP address to the workers file (named slaves in Hadoop 2.x)
located in the $HADOOP_HOME/etc/hadoop/ directory; a sketch of this file is shown after this list.
 On each DataNode, configure the Hadoop environment as above. The worker daemons can then be
started from the master node by running:

start-dfs.sh
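
For illustration, a workers file listing two hypothetical worker hostnames contains one entry per line:

datanode1
datanode2

If a single DataNode needs to be brought up by hand on a worker machine, Hadoop 3 also provides the command hdfs --daemon start datanode, run on that node.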
10. Test the Cluster
Once the cluster is up and running, you can run some test jobs. For example:

hadoop fs -mkdir /user/hadoop/test
hadoop fs -put localfile.txt /user/hadoop/test/
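
You can also run one of the MapReduce examples bundled with Hadoop to confirm that YARN is working; a sketch, assuming the examples jar that ships with your Hadoop 3.x.x release:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.x.x.jar wordcount /user/hadoop/test /user/hadoop/test-out
hadoop fs -cat /user/hadoop/test-out/part-r-00000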
Conclusion
By following these steps, you will have a working Hadoop cluster. You can scale this setup by adding more
DataNodes or configuring high-availability setups, depending on your needs.

2. Implement a simple map-reduce job that builds an inverted index on the set of input documents (Hadoop)

Inverted Index Java Program


import java.io.IOException;
import java.util.StringTokenizer;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    /*
     * Mapper class. It extends Hadoop's Mapper class and maps input key/value pairs
     * to a set of intermediate (output) key/value pairs. The input key is an Object
     * and the input value is a Text (one line of a document). The output key is a
     * Text (the word) and the output value is a Text (the DocID),
     * e.g. [word<Text> DocID<Text>] <-> [aspect 5722018411].
     */
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, Text> {

        /*
         * Text is a Hadoop-specific datatype used instead of Java's String so that
         * values can be serialized efficiently between the map and reduce phases.
         */
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {

            // Split the line into the DocID and the actual text (separated by a tab).
            String docId = value.toString().substring(0, value.toString().indexOf("\t"));
            String valueRaw = value.toString().substring(value.toString().indexOf("\t") + 1);

            // Tokenize the text using space, "'" and "-" characters as delimiters.
            StringTokenizer itr = new StringTokenizer(valueRaw, " '-");

            // Iterate through all the words in the line and form the key/value pairs.
            while (itr.hasMoreTokens()) {
                // Remove special characters and normalize to lower case.
                word.set(itr.nextToken().replaceAll("[^a-zA-Z]", "").toLowerCase());
                if (!word.toString().isEmpty()) {
                    /*
                     * Send (word, DocID) to the output collector (Context), which in turn
                     * passes the output to the Reducer, e.g.:
                     *   'word1' 5722018411
                     *   'word1' 6722018415
                     *   'word2' 6722018415
                     */
                    context.write(word, new Text(docId));
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, Text, Text, Text> {

        /*
         * The reduce method collects the Mapper output and, for each word, counts how
         * many times the word occurred in each document.
         */
        public void reduce(Text key, Iterable<Text> values,
                           Context context
                           ) throws IOException, InterruptedException {

            HashMap<String, Integer> map = new HashMap<String, Integer>();

            // Iterate through all DocIDs emitted for this word and count the
            // occurrences per document.
            for (Text val : values) {
                if (map.containsKey(val.toString())) {
                    map.put(val.toString(), map.get(val.toString()) + 1);
                } else {
                    map.put(val.toString(), 1);
                }
            }

            // Build the posting list "DocID:count DocID:count ..." for this word.
            StringBuilder docValueList = new StringBuilder();
            for (String docId : map.keySet()) {
                docValueList.append(docId + ":" + map.get(docId) + " ");
            }
            context.write(key, new Text(docValueList.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenizerMapper.class);
        // Uncomment to use a combiner; the Mapper and Reducer input/output types
        // must match in that case.
        // job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Inverted Index Program Execution Steps

1. Create a working directory (e.g. /home/invertedindex):
   a. Create a sub-directory inputdata for the input documents.
   b. Create another sub-directory container (used to hold the compiled class files for the jar).
   c. Create the InvertedIndex.java file in the working directory.

2. Set the Hadoop classpath: export HADOOP_CLASSPATH=$(hadoop classpath)

3. Check it: echo $HADOOP_CLASSPATH

4. Copy the contents of the inputdata directory into HDFS:
   a. hadoop fs -mkdir /inverted
   b. hadoop fs -mkdir /inverted/input
   c. hadoop fs -put /home/invertedindex/inputdata/word1.txt /inverted/input
   d. hadoop fs -put /home/invertedindex/inputdata/word2.txt /inverted/input

5. Compile the java file (the generated class files are stored in container):
   javac -classpath ${HADOOP_CLASSPATH} -d '/home/invertedindex/container' '/home/invertedindex/InvertedIndex.java'

6. Create the jar file: jar -cvf fr.jar -C container/ .

7. Execute the program: hadoop jar /home/invertedindex/fr.jar InvertedIndex /inverted/input /inverted/output

8. To see the output: hadoop fs -cat /inverted/output/part-r-00000
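
For reference, each input line is expected to contain a document ID, a tab character, and the document text. A hypothetical word1.txt and the kind of line the job produces for one word are shown below (the DocIDs and text are made up for illustration; the order of postings in the output may vary):

doc01	big data analytics with hadoop
doc02	hadoop builds an inverted index with map reduce

Output line in part-r-00000 for the word "hadoop":
hadoop	doc01:1 doc02:1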


3. PROCESS BIG DATA IN HBASE
HBase is a distributed NoSQL database that can handle massive amounts of data in
real-time. Here are the steps to process big data in HBase on Ubuntu:

1. Install HBase:
● Download the latest version of HBase from the official
website: https://hbase.apache.org/downloads.html
● Extract the downloaded package to a directory of your choice using the command: tar
-xvf hbase-<version>.tar.gz
● Set the HBASE_HOME environment variable to the extracted directory by
adding the following line to your ~/.bashrc file: export
HBASE_HOME=/path/to/hbase-<version>
● Reload the ~/.bashrc file using the command: source ~/.bashrc

2. Configure HBase:
● Navigate to the conf directory within the HBase installation directory using
the command: cd $HBASE_HOME/conf
● Edit the hbase-site.xml file using a text editor such as nano or vim:
nano hbase-site.xml
● Add the following properties to the file:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///path/to/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/path/to/zookeeper-data</value>
</property>
</configuration>
● Save and close the file.

3. Start HBase:
● Navigate to the bin directory within the HBase installation directory: cd
$HBASE_HOME/bin
● Start HBase using the command: ./start-hbase.sh
● Verify that HBase is running by accessing the HBase shell using the command:
./hbase shell. If successful, you should see a prompt that looks like this:
hbase(main):001:0>

4. Create a table:
● In the HBase shell, create a table using the command: create '<table-name>', '<column-family>'.
Replace <table-name> with the name of your table and <column-family> with the name of your
column family.
● For example, to create a table called users with a column family called personal,
use the command: create 'users', 'personal'.
5. Add data to the table:
● In the HBase shell, add data to the table using the command: put '<table-name>',
'<row-key>', '<column-family>:<column-name>', '<value>'. Replace <row-key>
with a unique identifier for the row, <column-family> with the name of your
column family, <column-name> with the name of your column, and <value> with
the value to be stored.
● For example, to add a user with the name "John Doe" and age 30 to the users table,
use the command: put 'users', 'user1', 'personal:name', 'John Doe' and put
'users', 'user1', 'personal:age', '30'.

6. Query data from the table:


● In the HBase shell, query data from the table using the command: get
'<table-name>', '<row-key>'. Replace <row-key> with the unique identifier for the
row.
● For example, to retrieve the data for the user with the row key user1 in the users
table, use the command: get 'users', 'user1'.
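
A few other HBase shell commands that are useful while working with the users table from the example above (a brief sketch):
list                                        (list all tables)
describe 'users'                            (show the column families and settings of a table)
scan 'users'                                (print every row in the table)
count 'users'                               (count the number of rows)
delete 'users', 'user1', 'personal:age'     (delete a single cell)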

7. Stop HBase:
● To stop HBase, navigate to the bin directory within the HBase installation and run: ./stop-hbase.sh
4. STORE AND RETRIEVE DATA IN PIG

Step 1: Login into Ubuntu

Step 2: Go to https://pig.apache.org/releases.html and copy the path of the latest version of
Pig that you want to install. Run the following command to download Apache Pig in
Ubuntu:

$ wget https://dlcdn.apache.org/pig/pig-0.16.0/pig-0.16.0.tar.gz

Step 3: To untar pig-0.16.0.tar.gz file run the following command:

$ tar xvzf pig-0.16.0.tar.gz


Step 4: To create a pig folder and move pig-0.16.0 to the pig folder, execute the following
command:
$ sudo mv /home/hdoop/pig-0.16.0 /home/hdoop/pig

Step 5: Now open the .bashrc file to edit the path and variables/settings for pig. Run the
following command:
$ sudo nano .bashrc

Add the below given to .bashrc file at the end and save the file.

#PIG settings
export PIG_HOME=/home/hdoop/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$PIG_HOME/conf:$HADOOP_INSTALL/etc/hadoop/
export PIG_CONF_DIR=$PIG_HOME/conf
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PIG_CLASSPATH=$PIG_CONF_DIR:$PATH
#PIG settings end

Step 6: Run the following command to make the changes effective in the .bashrc file:

$ source .bashrc
Step 7: To start all Hadoop daemons, navigate to the hadoop-3.2.1/sbin folder and run the
following commands:

$ ./start-dfs.sh
$ ./start-yarn.sh
$ jps

Step 8: Now you can launch pig by executing the following command:
$ pig

Step 9: Now you are in pig and can perform your desired tasks on pig. You can come out
of the pig by the quit command:

> quit;

Step 10: Store data in Pig:


● Load data into Pig using the LOAD command. For example, to load a CSV file called
data.csv, use the command: data = LOAD '/path/to/data.csv' USING PigStorage(',')
AS (col1:int, col2:chararray, col3:float);
● The PigStorage(',') parameter specifies that the file is comma-separated. The AS
clause specifies the schema for the data.
● Store the data using the STORE command. For example, to store the data in a Hadoop
Distributed File System (HDFS) directory called /output, use the command: STORE
data INTO '/output';

Step 11: Retrieve data from Pig:

● Load the data from the HDFS directory using the LOAD command. For example, to load
the data from the /output directory, use the command: data = LOAD '/output' USING
PigStorage();
● The PigStorage() parameter specifies that the file is tab-separated.
● Display the data using the DUMP command. For example, to display the data, use
the command: DUMP data;
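
Beyond loading and storing, relations can be transformed before they are stored. A minimal sketch using the data relation and the column names declared in the LOAD statement above (the filter threshold is arbitrary):

filtered = FILTER data BY col1 > 100;
grouped = GROUP filtered BY col2;
counts = FOREACH grouped GENERATE group, COUNT(filtered);
DUMP counts;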
Step 12: Exit Pig:
● To exit Pig, type quit or press Ctrl + D.
Apache Cassandra (supplementary): start the Cassandra service with:

sudo systemctl start cassandra

Check the status of the service again. It should change to active.
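
The status can be checked at any time with the standard systemd command:

sudo systemctl status cassandra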


To restart the service, use the restart command:

sudo systemctl restart cassandra

To stop the Cassandra service, enter:

sudo systemctl stop cassandra

The status shows inactive after using the stop command.

Optional: Start Apache Cassandra Service Automatically on Boot
When you turn off or reboot your system, the Cassandra service switches to inactive.
To start Cassandra automatically after booting up, use the following command:

sudo systemctl enable cassandra

Now, if your system reboots, the Cassandra service is enabled automatically.

STEP 4: Configure Apache Cassandra


You may want to change the Cassandra configuration settings depending on your
requirements. The default configuration is sufficient if you intend to use Cassandra on a
single node. If using Cassandra in a cluster, you can customize the main settings using
the cassandra.yaml file. Before editing, make a backup of the configuration file:

sudo cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.backup

We used the /etc/cassandra directory as a destination for the backup, but you can change
the path as you see fit.
Rename Apache Cassandra Cluster
Use a text editor of your choice to open the cassandra.yaml file (we will be using nano):

sudo nano /etc/cassandra/cassandra.yaml

Find the line that reads cluster_name: The default name is Test Cluster. That is the first
change you want to make when you start working with Cassandra.

If you do not want to make more changes, exit and save the file.

Add IP Addresses of Cassandra Nodes


Another thing that you must add to the cassandra.yaml if you are running a cluster is the IP
address of every node.
Open the configuration file and, under the seed_provider section, find the seeds entry.
Add the IP address of every node in your cluster, separating the entries with a comma after
every address.

STEP 5: Test Cassandra Command-Line Shell


The Cassandra software package comes with its command-line tool (CLI). This tool uses
Cassandra Query Language (CQL) for communication.
To start a new shell, open the terminal and type:

cqlsh

A shell loads showing the connection to the default cluster. If you had changed
the cluster_name parameter, it will show the one you defined in the configuration file. The
example above is the default connection to the localhost.
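
As a quick check, a few CQL statements can be executed at the cqlsh prompt (a minimal sketch; the keyspace, table, and values are hypothetical):

CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE users (id int PRIMARY KEY, name text, city text);
INSERT INTO users (id, name, city) VALUES (1, 'John Doe', 'Hyderabad');
SELECT * FROM users;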
Experiment 4: Store and retrieve data in Pig
Aim: To perform storing and retrieval of big data using Apache Pig
Resources: Apache Pig
Theory:
Pig is a platform that works with large data sets for the purpose of analysis. The
Pig dialect is called Pig Latin, and the Pig Latin commands get compiled into
MapReduce jobs that can be run on a suitable platform, like Hadoop.
Apache Pig is a platform for analyzing large data sets that consists of a high-
level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial parallelization, which
in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that


produces sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has the
following key properties:

• Ease of programming. It is trivial to achieve parallel execution of


simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are explicitly
encoded as data flow sequences, making them easy to write, understand,
and maintain.
• Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically, allowing
the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do special-purpose
processing.
• Pig Latin – Relational Operations
• The following table describes the relational operators of Pig Latin.

Operator                Description

Loading and Storing
LOAD                    To load the data from the file system (local/HDFS) into a relation.
STORE                   To save a relation to the file system (local/HDFS).

Filtering
FILTER                  To remove unwanted rows from a relation.
DISTINCT                To remove duplicate rows from a relation.
FOREACH, GENERATE       To generate data transformations based on columns of data.
STREAM                  To transform a relation using an external program.

Grouping and Joining
JOIN                    To join two or more relations.
COGROUP                 To group the data in two or more relations.
GROUP                   To group the data in a single relation.
CROSS                   To create the cross product of two or more relations.

Sorting
ORDER                   To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT                   To get a limited number of tuples from a relation.

Combining and Splitting
UNION                   To combine two or more relations into a single relation.
SPLIT                   To split a single relation into two or more relations.

Diagnostic Operators
DUMP                    To print the contents of a relation on the console.
DESCRIBE                To describe the schema of a relation.
EXPLAIN                 To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE              To view the step-by-step execution of a series of statements.

For the given Student dataset and Employee dataset, perform Relational
operations like Loading, Storing, and Diagnostic Operations (Dump, Describe,
Illustrate & Explain) in the Hadoop Pig framework using Cloudera.
Student ID First Name Age City CGPA
001 Jagruthi 21 Hyderabad 9.1
002 Praneeth 22 Chennai 8.6
003 Sujith 22 Mumbai 7.8
004 Sreeja 21 Bengaluru 9.2
005 Mahesh 24 Hyderabad 8.8
006 Rohit 22 Chennai 7.8
007 Sindhu 23 Mumbai 8.3

Employee ID   Name       Age   City
001           Angelina   22    LosAngeles
002           Jackie     23    Beijing
003           Deepika    22    Mumbai
004           Pawan      24    Hyderabad
005           Rajani     21    Chennai
006           Amitabh    22    Mumbai

Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record in an individual line, with the
entities separated by a delimiter (",").
In the local file system, create an input file student_data.txt containing the data shown below.
In the local file system, create an input file employee_data.txt containing the data shown below.

student_data.txt:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3

employee_data.txt:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai

Step-3: Move the files from the local file system to HDFS using the put (or) copyFromLocal
command and verify using the -cat command.
To get the path of the file student_data.txt, type the command: readlink -f student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/employee_data.txt /bdalab/pigdir/
Step-4: Apply Relational Operator – LOAD to load the data from the file student_data.txt
into Pig by executing the following Pig Latin statement in the Grunt shell.
Relational Operators are NOT case sensitive.
$ pig => will direct to the grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') AS (
id:int, name:chararray, age:int, city:chararray, cgpa:double );
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING
PigStorage(',') AS ( id:int, name:chararray, age:int, city:chararray );

Step-5: Apply Relational Operator – STORE to store each relation in its own HDFS output
directory under "/bdalab/pigdir/pig_output/" as shown below (each STORE must write to a
directory that does not already exist).

grunt> STORE student INTO '/bdalab/pigdir/pig_output/student' USING PigStorage(',');

grunt> STORE employee INTO '/bdalab/pigdir/pig_output/employee' USING PigStorage(',');

Step-6: Verify the stored data as shown below

$ hdfs dfs -ls /bdalab/pigdir/pig_output/student/

$ hdfs dfs -cat /bdalab/pigdir/pig_output/student/part-m-00000

Step-7: Apply Relational Operator – Diagnostic Operator – DUMP to print the contents
of a relation.

grunt> Dump student

grunt> Dump employee


Step-8: Apply Relational Operator – Diagnostic Operator – DESCRIBE to view the schema
of a relation.
grunt> Describe student
grunt> Describe employee

Step-9: Apply Relational Operator – Diagnostic Operator – EXPLAIN to display the logical,
physical, and MapReduce execution plans of a relation using the Explain operator.
grunt> Explain student
grunt> Explain employee

Step-10: Apply Relational Operator – Diagnostic Operator – ILLUSTRATE to
give the step-by-step execution of a sequence of statements.
grunt> Illustrate student
grunt> Illustrate employee

EXPERIMENT 5: Perform data analysis using MongoDB

To perform data analysis using MongoDB, there are several steps and tools you can leverage. MongoDB is a
NoSQL database, which stores data in a flexible, JSON-like format. This section walks through the process of
using MongoDB for data analysis.

Steps for Data Analysis with MongoDB:


1. Set up MongoDB:
o You need a MongoDB instance running. You can either install MongoDB locally or use a cloud-
based solution like MongoDB Atlas.
2. Connect to MongoDB:
o Use MongoDB's official client libraries (e.g., PyMongo for Python) to connect to your MongoDB
instance.
3. Load Data into MongoDB:
o If you already have data in a CSV, JSON, or another format, you can import it into MongoDB
using the mongoimport command or through an application (a sample command is shown after this list).
4. Data Exploration & Analysis:
o Once you have the data, you can query and perform data analysis with MongoDB's aggregation
framework or by using MongoDB queries to filter and analyze the data.
5. Visualization (Optional):
o For a better understanding of the data, you may want to visualize the results using libraries like
Matplotlib, Seaborn, or other visualization tools.
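
For step 3, a typical mongoimport invocation looks like the following (a sketch; the file names are hypothetical, and the database/collection names match the ones used in the Python example below):

mongoimport --db your_database_name --collection your_collection_name --file transactions.json --jsonArray
mongoimport --db your_database_name --collection your_collection_name --type csv --headerline --file transactions.csv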

Example Process of Data Analysis with MongoDB


The following is an example of querying and analyzing data from MongoDB using Python (PyMongo).

1. Set Up MongoDB with Python


First, install PyMongo to interact with MongoDB:

pip install pymongo
2. Connecting to MongoDB
from pymongo import MongoClient

# Connect to MongoDB (replace with your own MongoDB URI if using a cloud instance)
client = MongoClient("mongodb://localhost:27017/")
# Select the database and collection you want to work with
db = client['your_database_name']
collection = db['your_collection_name']
3. Example Data Query
Assume you have a collection of data, such as customer transactions. You can perform some basic queries:

a. Find all documents:


# Query all documents
all_documents = collection.find()
for document in all_documents:
    print(document)
b. Query documents with a filter:
# Query documents based on specific filter
query = {"customer_id": "1234"}
filtered_documents = collection.find(query)

for document in filtered_documents:
    print(document)
c. Aggregation for analysis:
MongoDB's aggregation framework allows you to perform complex queries like group, sort, and calculate
averages.

Example: Calculate the total amount spent by each customer.

pipeline = [
{"$group": {"_id": "$customer_id", "total_spent": {"$sum": "$amount"}}},
{"$sort": {"total_spent": -1}} # Sort by total spent in descending order
]

aggregation_result = collection.aggregate(pipeline)

for result in aggregation_result:
    print(result)

4. Data Visualization
Once you have the data, you can use Python libraries like matplotlib or seaborn for visualization.

Example: Visualizing total spending by customers.

import matplotlib.pyplot as plt

# Get data for plotting (you can adjust this to match the structure of your query)
customer_data = list(collection.aggregate(pipeline))

# Prepare data for plotting


customer_ids = [item['_id'] for item in customer_data]
total_spent = [item['total_spent'] for item in customer_data]

# Create a bar plot


plt.bar(customer_ids, total_spent)
plt.xlabel('Customer ID')
plt.ylabel('Total Spent')
plt.title('Total Spending by Customer')
plt.xticks(rotation=45)
plt.show()

5. Advanced Analysis
If you're working with more complex data, MongoDB's aggregation framework supports operations such as:

 $match – Filters documents by criteria.


 $group – Groups documents and performs operations like sum, average, etc.
 $project – Specifies which fields to include or exclude.
 $sort – Sorts the data.
 $limit – Limits the number of results.
 $unwind – Deconstructs an array field.

Example: Aggregating Data by Date


If you have transaction data with a date field, you might want to group transactions by month:

from datetime import datetime

pipeline = [
{"$project": {
"month": {"$month": "$transaction_date"},
"year": {"$year": "$transaction_date"},
"amount": 1
}},
{"$group": {
"_id": {"month": "$month", "year": "$year"},
"total_spent": {"$sum": "$amount"}
}},
{"$sort": {"_id.year": 1, "_id.month": 1}} # Sort by year and month
]

aggregation_result = collection.aggregate(pipeline)

for result in aggregation_result:
    print(f"Year: {result['_id']['year']}, Month: {result['_id']['month']}, Total Spent: {result['total_spent']}")

Conclusion
These are the basic steps you can follow to perform data analysis using MongoDB. You can customize the
queries, aggregations, and visualizations depending on the specific analysis you're interested in.


6. Using Power Pivot (Excel) perform the following on any dataset
a) Big Data Analytics
b) Big Data Charting

Power Pivot is a powerful data analysis tool that allows you to analyze large datasets, create
relationships between tables, and build sophisticated data models. To perform Big Data
Analytics using Power Pivot.
Steps to create a pivot table:
Gather the data: This may involve collecting data from multiple sources, cleaning and
organizing it, and storing it in a format that can be analyzed.
Import your data into Power Pivot: Power Pivot allows you to import data from a variety of
sources, including Excel spreadsheets, SQL Server databases, and other data sources. You
can import data directly into Power Pivot or create a data connection to an external data
source.
Define your data model: Once you have imported your data, you can create relationships
between tables and define your data model. This may involve creating calculated columns,
measures, and hierarchies.
Analyze your data: Power Pivot allows you to perform a wide range of data analysis tasks,
including filtering, sorting, and grouping data, as well as creating pivot tables and charts.
we can get the dataset from:
https://mail.google.com/mail/u/0?ui=2&ik=3a4aada490&attid=0.1&permmsgid=msg-f:1763
039101482923603&th=187792b8c9037653&view=att&disp=safe&realattid=f_lg1v80fj0
Step 1: Open Excel, click on Insert and choose PivotTable. In the Select Source dialog, click OK.
Step 2: Choose the required filters and then click OK.
The output will appear as given below.
Visualize your data: Use charts, graphs, or other visual aids to help you interpret the data and
communicate your findings to others. Power Pivot allows you to create a wide range of charts
and visualizations, including pivot charts, line charts, bar charts, and scatter plots.

To chart Big Data Analytics results in Power Pivot, you can follow these general steps:
Create a pivot chart: Power Pivot allows you to create pivot charts based on your data model.
To create a pivot chart, simply select the data you want to use and click the PivotChart button
on the Insert tab.
Customize your chart: Once you have created your pivot chart, you can customize it by
changing the chart type, adding, or removing chart elements, and formatting the chart to your
liking.
We are creating the chart for the above dataset.
Step 1: Click on Insert, choose Charts, and click on the chart type you want.

Then click Finish to get the output.


The output for the chosen chart is shown below.

Performing statistical analysis on Excel data.


Step 1: Click on Data, choose Statistics, and then choose an option from the list below.

For Sampling, select the data, click on Data, choose Statistics and then Sampling. Choose the cell
to put the result in and then click OK; the output will be printed in the chosen cell.
For Descriptive Statistics, repeat the same process.

The output will be printed as shown below.

For ANOVA, the output is printed as shown below.

For Correlation, the output is printed as shown below.

For Covariance, the output is printed as shown below.
