BDA Lab Manual (Updated)
Prepared By:
K. Keerthi Reddy, Assistant Professor, CSD
List of Experiments
2. Implement a simple map-reduce job that builds an inverted index on the set of input documents
(Hadoop)
3. Process big data in HBase
6. Using Power Pivot (Excel), perform the following on any dataset: a. Big Data Analytics b. Big Data Charting
Hadoop installation steps
Prerequisite
1. $ sudo apt update
2. Install Java (OpenJDK 11, which matches the JAVA_HOME path used below)
sudo apt install openjdk-11-jdk
Installation
2. Type ls -la to find the hidden .bashrc file.
3. Open the .bashrc file with the nano or vi editor.
4. Paste the following lines into the .bashrc file to set the required paths:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-11-openjdk-amd64/bin
export HADOOP_HOME=~/hadoop-3.2.3/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh
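After saving .bashrc, apply the changes to the current shell session (the same step is used again in the multi-node setup later in this manual):
$ source ~/.bashrc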
In core-site.xml ($HADOOP_HOME/etc/hadoop/core-site.xml), add the proxy-user properties:
<configuration>
<property>
<name>hadoop.proxyuser.dataflair.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.dataflair.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.server.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.server.groups</name>
<value>*</value>
</property>
</configuration>
In mapred-site.xml, add:
<configuration>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
In yarn-site.xml, add:
<configuration>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
15. Establish a password-less SSH connection. First check that you can connect with: ssh localhost
If a password or passphrase is requested, execute the following commands and try again:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
20. ./start-dfs.sh (It will start the NameNode, DataNode and SecondaryNameNode.)
21. ./start-yarn.sh (It will start the YARN ResourceManager and NodeManagers.)
25. You can also access the NameNode and YARN Resource Manager through a browser
(Google Chrome/Mozilla Firefox). The Hadoop NameNode web UI runs on default port 9870:
open http://localhost:9870/ in the browser.
26. Information about the node and the applications running on it is available from the
NodeManager web UI on port 8042: open http://localhost:8042/ in the browser.
27. To get details of a Hadoop DataNode, access port 9864: open http://localhost:9864/ in
the browser.
Prerequisites
Hardware: Multiple machines (physical or virtual) to serve as nodes in your cluster. At a minimum,
you’ll need:
o 1 NameNode (Master node)
o 2 or more DataNodes (Worker nodes)
Software: A supported version of Hadoop, Java (typically OpenJDK or Oracle JDK), and a Linux-based
operating system (like Ubuntu or CentOS).
Network Configuration: All machines need to be on the same network and able to communicate with
each other.
1. Install Java
sudo apt update
sudo apt install openjdk-8-jdk
Verify the installation:
java -version
2. Create a Hadoop User
It's best practice to run Hadoop as a non-root user. You’ll create a Hadoop user to avoid running the processes
as root.
sudo useradd -m hadoop
sudo passwd hadoop
3. Download Hadoop
Download the Hadoop binary distribution from the official Apache Hadoop website:
https://hadoop.apache.org/releases.html For example:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.x.x/hadoop-3.x.x.tar.gz
tar -xvzf hadoop-3.x.x.tar.gz
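The environment variables in the next step assume Hadoop lives under /opt/hadoop, so move the extracted directory there and give it to the hadoop user (substitute your actual version number for 3.x.x):
sudo mv hadoop-3.x.x /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop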
4. Set Up Environment Variables
Edit the .bashrc file of the Hadoop user (the user you created earlier):
sudo nano /home/hadoop/.bashrc
Add the following lines at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Then apply the changes:
source /home/hadoop/.bashrc
5. Configure Hadoop Cluster
core-site.xml (Hadoop Core Settings) This file contains the configuration for the Hadoop cluster’s core
settings. You will configure properties like the default filesystem URI.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode_host:9000</value>
</property>
</configuration>
hdfs-site.xml (HDFS Configuration) This file contains the settings for HDFS storage.
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop/hdfs/datanode</value>
</property>
</configuration>
mapred-site.xml (MapReduce Settings) This file tells MapReduce to use YARN as its framework.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml (YARN Configuration) This file contains the settings for YARN, which handles job
scheduling.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager_host:8032</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanager_host:8025</value>
</property>
</configuration>
6. Format the NameNode
Before starting the Hadoop cluster, you must format the NameNode to initialize the HDFS file system.
hdfs namenode -format
7. Start Hadoop Cluster
You can now start the Hadoop daemons on the cluster. You’ll typically use the following commands:
start-dfs.sh
start-yarn.sh
8. Verify the Cluster
To verify that your Hadoop cluster is running properly, you can visit the web UI:
o HDFS Web UI: http://<namenode_host>:9870 (port 50070 on older Hadoop 2.x releases)
o YARN ResourceManager UI: http://<resourcemanager_host>:8088
9. Add Nodes to the Cluster
For each DataNode, you need to modify the workers file (named slaves in Hadoop 2.x) located in the
$HADOOP_HOME/etc/hadoop/ directory to include the hostnames or IP addresses of your worker nodes.
Each DataNode needs the same Hadoop installation and configuration. After updating the workers file,
restart HDFS from the NameNode so the new DataNodes join the cluster:
start-dfs.sh
10. Test the Cluster
Once the cluster is up and running, you can run some test jobs. For example:
hadoop fs -mkdir -p /user/hadoop/test
hadoop fs -put localfile.txt /user/hadoop/test/
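To confirm the upload, list the directory:
hadoop fs -ls /user/hadoop/test/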
Conclusion
By following these steps, you will have a working Hadoop cluster. You can scale this setup by adding more
DataNodes or configuring high-availability setups, depending on your needs.
import java.io.IOException;
import java.util.HashSet;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
//import org.apache.hadoop.mapreduce.MapContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InvertedIndex {
/*
This is the Mapper class. It extends Hadoop's Mapper class and maps input key/value
pairs to a set of intermediate (output) key/value pairs. Here the input key is an
Object and the input value is a Text line. The output key is a Text (the word) and
the output value is a Text (the DocID): [word<Text> DocID<Text>] <-> [aspect 5722018411]
*/
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text>{
/*
Hadoop-supported datatypes: IntWritable and Text are Hadoop-specific types used to
handle numbers and strings in a Hadoop environment, in place of Java's Integer and
String. In the classic WordCount example, 'one' holds the count 1 emitted for each
word during the Map process; the inverted index emits the document ID instead, so
'one' is left commented out below.
*/
//private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
/*
The map method is called once per input line. Assumption (not stated in the original
hand-out): each line starts with the document ID, followed by a tab and the document text.
*/
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
String[] parts = value.toString().split("\t", 2);
String DocId = parts[0];
String value_raw = (parts.length > 1) ? parts[1] : "";
// Reading the text and tokenizing by using space, "'", and "-" characters as delimiters.
StringTokenizer itr = new StringTokenizer(value_raw, " '-");
// Iterating through all the words available in that line and forming the key/value pair.
while (itr.hasMoreTokens()) {
// Remove special characters
word.set(itr.nextToken().replaceAll("[^a-zA-Z]", "").toLowerCase());
if (!word.toString().isEmpty()) {
/*
Sending to the output collector (Context), which in turn passes the output to the Reducer.
The output is as follows:
'word1' 5722018411
'word1' 6722018415
'word2' 6722018415
*/
context.write(word, new Text(DocId));
}
}
}
}
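The listing above only shows the mapper. Below is a minimal sketch of a matching reducer and driver; the class name DocIdReducer, the use of a HashSet to de-duplicate document IDs (it relies on the java.util.HashSet import added at the top), and the comma-separated output format are our own choices rather than part of the original hand-out. The final closing brace completes the InvertedIndex class.
public static class DocIdReducer
     extends Reducer<Text, Text, Text, Text> {
  private Text result = new Text();
  public void reduce(Text key, Iterable<Text> values, Context context
                     ) throws IOException, InterruptedException {
    // Collect the distinct document IDs in which this word appears.
    HashSet<String> docIds = new HashSet<String>();
    for (Text val : values) {
      docIds.add(val.toString());
    }
    // Emit: word -> comma-separated list of document IDs, e.g. "aspect  5722018411,6722018415"
    result.set(String.join(",", docIds));
    context.write(key, result);
  }
}
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "inverted index");
  job.setJarByClass(InvertedIndex.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setReducerClass(DocIdReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Package the class into a jar and run it with, for example, hadoop jar invertedindex.jar InvertedIndex <input path> <output path> (the jar name here is just an example).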
1. Install HBase:
● Download the latest version of HBase from the official
website: https://hbase.apache.org/downloads.html
● Extract the downloaded package to a directory of your choice using the command: tar
-xvf hbase-<version>.tar.gz
● Set the HBASE_HOME environment variable to the extracted directory by
adding the following line to your ~/.bashrc file: export
HBASE_HOME=/path/to/hbase-<version>
● Reload the ~/.bashrc file using the command: source ~/.bashrc
2. Configure HBase:
● Navigate to the conf directory within the HBase installation directory using
the command: cd $HBASE_HOME/conf
● Edit the hbase-site.xml file using a text editor such as nano or vim:
nano hbase-site.xml
● Add the following properties to the file:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///path/to/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/path/to/zookeeper-data</value>
</property>
</configuration>
● Save and close the file.
3. Start HBase:
● Navigate to the bin directory within the HBase installation directory: cd
$HBASE_HOME/bin
● Start HBase using the command: ./start-hbase.sh
● Verify that HBase is running by accessing the HBase shell using the command:
./hbase shell. If successful, you should see a prompt that looks like this:
hbase(main):001:0>
4. Create a table:
● In the HBase shell, create a table using the command: create '<table-
name>', '<column-family>'. Replace <table-name> with the name of your
table and
<column-family> with the name of your column family.
● For example, to create a table called users with a column family called personal,
use the command: create 'users', 'personal'.
5. Add data to the table:
● In the HBase shell, add data to the table using the command: put '<table-name>',
'<row-key>', '<column-family>:<column-name>', '<value>'. Replace <row-key>
with a unique identifier for the row, <column-family> with the name of your
column family, <column-name> with the name of your column, and <value> with
the value to be stored.
● For example, to add a user with the name "John Doe" and age 30 to the users table,
use the command: put 'users', 'user1', 'personal:name', 'John Doe' and put
'users', 'user1', 'personal:age', '30'.
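6. Retrieve data from the table:
● In the HBase shell, read a single row using the get command: get '<table-name>', '<row-key>'. For example, get 'users', 'user1' returns all columns stored for that row.
● To read every row in a table, use the scan command. For example: scan 'users'.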
7. Stop HBase:
● To stop HBase, navigate to the bin directory within the HBase installation directory and run: ./stop-hbase.sh
3. STORE AND RETRIEVE DATA IN PIG
$ wget https://dlcdn.apache.org/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 5: Now open the .bashrc file to edit the path and variables/settings for pig. Run the
following command:
$ sudo nano .bashrc
Add the following lines at the end of the .bashrc file (adjust the path to where you extracted Pig) and save the file:
export PIG_HOME=~/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
Step 6: Run the following command to make the changes effective in the .bashrc file:
$ source .bashrc
Step 7: To start all Hadoop daemons, navigate to the Hadoop sbin folder (e.g. hadoop-3.2.3/sbin) and run:
./start-dfs.sh
./start-yarn.sh
Step 8: Now you can launch pig by executing the following command:
$ pig
Step 9: Now you are in pig and can perform your desired tasks on pig. You can come out
of the pig by the quit command:
> quit;
● Load the data from the HDFS directory using the LOAD command. For example, to load
the data from the /output directory, use the command: data = LOAD '/output' USING
PigStorage();
● The PigStorage() parameter specifies that the file is tab-separated.
● Display the data using the DUMP command. For example, to display the data, use
the command: DUMP data;
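● To store a relation back into HDFS, use the STORE command. For example (the output path /pig_output here is only an illustration): STORE data INTO '/pig_output' USING PigStorage(','); Pig writes the result as part files inside that directory.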
2. Exit Pig:
● To exit Pig, type quit or press Ctrl + D.
sudo systemctl start cassandra
We used the /etc/cassandra directory as a destination for the backup, but you can change
the path as you see fit.
Rename Apache Cassandra Cluster
Use a text editor of your choice to open the cassandra.yaml file (we will be using nano):
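On a package-based install the file is typically under /etc/cassandra (the same directory referenced above for the backup):
sudo nano /etc/cassandra/cassandra.yaml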
Find the line that reads cluster_name: The default name is Test Cluster. That is the first
change you want to make when you start working with Cassandra.
If you do not want to make more changes, exit and save the file.
cqlsh
A shell loads showing the connection to the default cluster. If you had changed
the cluster_name parameter, it will show the one you defined in the configuration file. The
example above is the default connection to the localhost.
Experiment: 3 Store and retrieve data in Pig
Aim: To perform storing and retrieval of big data using Apache Pig.
Resources: Apache Pig
Theory:
Pig is a platform that works with large data sets for the purpose of analysis. The
Pig dialect is called Pig Latin, and the Pig Latin commands get compiled into
MapReduce jobs that can be run on a suitable platform, like Hadoop.
Apache Pig is a platform for analyzing large data sets that consists of a high-
level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial parallelization, which
in turn enables them to handle very large data sets.
Operator Description
LOAD Loads data from the file system (local/HDFS) into a relation.
FILTER Filtering: selects the tuples of a relation that satisfy a condition.
ORDER BY Sorting: sorts a relation by one or more fields.
DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE Diagnostic operators: print a relation, show its schema, and explain or illustrate its execution plan.
For the given Student dataset and Employee dataset, perform Relational
operations like Loading, Storing, Diagnostic Operations (Dump, Describe,
Illustrate & Explain) in Hadoop Pig framework using Cloudera
Student ID First Name Age City CGPA
001 Jagruthi 21 Hyderabad 9.1
002 Praneeth 22 Chennai 8.6
003 Sujith 22 Mumbai 7.8
004 Sreeja 21 Bengaluru 9.2
005 Mahesh 24 Hyderabad 8.8
006 Rohit 22 Chennai 7.8
007 Sindhu 23 Mumbai 8.3
Employee ID Name Age City
001 Angelina 22 LosAngeles
002 Jackie 23 Beijing
003 Deepika 22 Mumbai
004 Pawan 24 Hyderabad
005 Rajani 21 Chennai
006 Amitabh 22 Mumbai
Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record on an individual line, with the entities separated by a delimiter (",").
In the local file system, create an input file student_data.txt and an input file employee_data.txt containing the data shown below.
student_data.txt:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3
employee_data.txt:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai
Step-3: Move the file from the local file system to HDFS using the put (or copyFromLocal)
command and verify using the -cat command.
To get the absolute path of the file student_data.txt, type the command: readlink -f student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/employee_data.txt /bdalab/pigdir/
Step-4: Apply Relational Operator – LOAD to load the data from the file student_data.txt
into Pig by executing the following Pig Latin statement in the Grunt shell.
Relational Operators are NOT case sensitive.
$ pig => will direct to grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') as (
id:int, name:chararray, age:int, city:chararray, cgpa:double );
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING
PigStorage(',') as ( id:int, name:chararray, age:int, city:chararray );
Step-5: Apply Relational Operator – STORE to Store the relation in the HDFS directory
“/pig_output/” as shown below.
grunt> STORE student INTO '/bdalab/pigdir/pig_output/student' USING PigStorage(',');
grunt> STORE employee INTO '/bdalab/pigdir/pig_output/employee' USING PigStorage(',');
Step-7: Apply Relational Operator – Diagnostic Operator – DUMP to print the contents
of the relation.
grunt> Dump student
grunt> Dump employee
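Step-8: Apply Relational Operator – Diagnostic Operator – DESCRIBE to view the schema of a relation.
grunt> Describe student
grunt> Describe employee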
Step-9: Apply Relational Operator – Diagnostic Operator – EXPLAIN to display the logical,
physical, and MapReduce execution plans of a relation using the Explain operator.
grunt> Explain student
grunt> Explain employee
Step-9: Apply Relational Operator – Diagnostic Operator – ILLUSTRATE to
give the step-by-step execution of a sequence of statements.
grunt> Illustrate student
grunt> Illustrate employee
To perform data analysis using MongoDB, there are several steps and tools you can leverage. MongoDB is a
NoSQL database that stores data in a flexible, JSON-like format. This section walks through the process of
using MongoDB for data analysis.
1. Install the PyMongo Driver
pip install pymongo
2. Connecting to MongoDB
from pymongo import MongoClient
# Connect to MongoDB (replace with your own MongoDB URI if using a cloud instance)
client = MongoClient("mongodb://localhost:27017/")
# Select the database and collection you want to work with
db = client['your_database_name']
collection = db['your_collection_name']
3. Example Data Query
Assume you have a collection of customer transactions. The aggregation pipeline below, for example, totals the amount spent per customer:
pipeline = [
{"$group": {"_id": "$customer_id", "total_spent": {"$sum": "$amount"}}},
{"$sort": {"total_spent": -1}} # Sort by total spent in descending order
]
aggregation_result = collection.aggregate(pipeline)
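A plain filter query can also be run with find() and count_documents(); the sketch below is illustrative only, and the field names customer_id and amount follow the hypothetical transaction documents assumed above:
# Find all transactions above 100 for one (hypothetical) customer
for doc in collection.find({"customer_id": "C001", "amount": {"$gt": 100}}):
    print(doc)
# Count how many transactions exceed that amount
print(collection.count_documents({"amount": {"$gt": 100}}))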
4. Data Visualization
Once you have the data, you can use Python libraries like matplotlib or seaborn for visualization.
import matplotlib.pyplot as plt
# Get data for plotting (you can adjust this to match the structure of your query)
customer_data = list(collection.aggregate(pipeline))
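As a minimal sketch (reusing the per-customer pipeline defined earlier, so _id holds the customer ID and total_spent the summed amount), the totals can be drawn as a bar chart:
# Extract customer IDs and totals from the aggregation result
customers = [str(doc["_id"]) for doc in customer_data]
totals = [doc["total_spent"] for doc in customer_data]
# Plot total spend per customer as a bar chart
plt.bar(customers, totals)
plt.xlabel("Customer ID")
plt.ylabel("Total spent")
plt.title("Total spend per customer")
plt.tight_layout()
plt.show()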
5. Advanced Analysis
If you're working with more complex data, MongoDB's aggregation framework supports operations such as date extraction and multi-level grouping. For example, to total spending per month:
from datetime import datetime
pipeline = [
{"$project": {
"month": {"$month": "$transaction_date"},
"year": {"$year": "$transaction_date"},
"amount": 1
}},
{"$group": {
"_id": {"month": "$month", "year": "$year"},
"total_spent": {"$sum": "$amount"}
}},
{"$sort": {"_id.year": 1, "_id.month": 1}} # Sort by year and month
]
aggregation_result = collection.aggregate(pipeline)
Conclusion
These are the basic steps you can follow to perform data analysis using MongoDB. You can customize the
queries, aggregations, and visualizations depending on the specific analysis you're interested in.
Power Pivot is a powerful data analysis tool that allows you to analyze large datasets, create
relationships between tables, and build sophisticated data models. To perform Big Data
Analytics using Power Pivot.
Steps to create a pivot table:
Gather the data: This may involve collecting data from multiple sources, cleaning and
organizing it, and storing it in a format that can be analyzed.
Import your data into Power Pivot: Power Pivot allows you to import data from a variety of
sources, including Excel spreadsheets, SQL Server databases, and other data sources. You
can import data directly into Power Pivot or create a data connection to an external data
source.
Define your data model: Once you have imported your data, you can create relationships
between tables and define your data model. This may involve creating calculated columns,
measures, and hierarchies.
Analyze your data: Power Pivot allows you to perform a wide range of data analysis tasks,
including filtering, sorting, and grouping data, as well as creating pivot tables and charts.
we can get the dataset from:
https://mail.google.com/mail/u/0?ui=2&ik=3a4aada490&attid=0.1&permmsgid=msg-f:1763
039101482923603&th=187792b8c9037653&view=att&disp=safe&realattid=f_lg1v80fj0
Step 1: Open Excel, click Insert, and choose PivotTable. In the source-selection dialog, click OK.
Step 2: Choose the required filters and fields, then click OK.
The output appears as shown below.
Visualize your data: Use charts, graphs, or other visual aids to help you interpret the data and
communicate your findings to others. Power Pivot allows you to create a wide range of charts
and visualizations, including pivot charts, line charts, bar charts, and scatter plots.
To chart Big Data Analytics results in Power Pivot, you can follow these general steps:
Create a pivot chart: Power Pivot allows you to create pivot charts based on your data model.
To create a pivot chart, simply select the data you want to use and click the PivotChart button
on the Insert tab.
Customize your chart: Once you have created your pivot chart, you can customize it by
changing the chart type, adding or removing chart elements, and formatting the chart to your
liking.
We are creating the chart for the above dataset.
Step 1: Click Insert, choose Charts, and pick the chart type you want.
For Sampling, select the input data, then go to the Data tab and choose Data Analysis > Sampling (part of the Analysis ToolPak add-in). Choose the cell where the result should be placed and click OK; the output is printed in the chosen cell.
For Descriptive Statistics, repeat the same process.
For ANOVA, the output is printed as shown.
For Correlation, the output is printed as shown.
For Covariance, the output is printed as shown.