Lab Report
of
BIG DATA ANALYTICS
Submitted by:
Md. Khalid
192211009
M.Tech CSE (Analytics), 1st Semester
INDEX
1. Hadoop Installation
3. HDFS Commands
4. MongoDB Installation and Commands
5. Hive Installation
6. Pig Installation
7. MapReduce
Experiment 1
Lab-1
Hadoop Installation
Introduction
Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving
massive amounts of data and computation.
Components of Hadoop:
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that provides high throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Installation:
Step 1: Download the Java 8 Package. Save this file in your home directory.
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths.
Command: gedit .bashrc
export HADOOP_HOME="$HOME/hadoop-2.7.7"
export HADOOP_CONF_DIR="$HOME/hadoop-2.7.7/etc/hadoop"
export HADOOP_MAPRED_HOME="$HOME/hadoop-2.7.7"
export HADOOP_COMMON_HOME="$HOME/hadoop-2.7.7"
export HADOOP_HDFS_HOME="$HOME/hadoop-2.7.7"
export YARN_HOME="$HOME/hadoop-2.7.7"
export PATH="$PATH:$HOME/hadoop-2.7.7/bin"
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export PATH="$PATH:JAVA_HOME/bin"
core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings of Hadoop core, such as the I/O settings that are common to HDFS and MapReduce.
Command: gedit core-site.xml
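The file contents are not reproduced above; a minimal pseudo-distributed example, assuming the NameNode listens on localhost at port 9000 (host and port here are assumptions), would be:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- tells every daemon and client where the HDFS NameNode runs -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>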
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the
replication factor and block size of HDFS.
Command: gedit hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside configuration tag:
mapred-site.xml contains the configuration settings of the MapReduce application, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc.
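The property itself is not reproduced above; a typical minimal mapred-site.xml (often created by copying mapred-site.xml.template), which simply tells MapReduce jobs to run on YARN, is:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>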
Step 10: Edit yarn-site.xml and edit the property mentioned below inside configuration tag:
yarn-site.xml contains the configuration settings of the ResourceManager and NodeManager, such as the memory available to applications and the auxiliary services (for example, the shuffle service) required by MapReduce.
Command: gedit yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run Hadoop like Java home path, etc.
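The exact line is not shown above; assuming the same Oracle Java 8 location used in .bashrc, the entry added to hadoop-env.sh would be:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle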
Step 13: Once the NameNode is formatted, go to the hadoop-2.7.7/sbin directory and start all the daemons.
Command: cd hadoop-2.7.7/sbin
You can either start all the daemons with a single command or start them individually.
For starting dfs daemon:
Command: ./start-dfs.sh
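To bring up the YARN daemons as well and confirm that everything is running, the following commands (a sketch, run from the same sbin directory) can be used:

./start-yarn.sh   # starts the ResourceManager and NodeManager daemons
jps               # lists running Java processes; NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager should appear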
Step 14: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
Experiment 2
Lab 3
HDFS Commands
Introduction:
HDFS is the primary or major component of the Hadoop ecosystem which is responsible for storing large data sets of structured or
unstructured data across various nodes and thereby maintaining the metadata in the form of log files.
The HDFS commands are as follows:
1). ls: Lists all the files and directories at the given path. Use -ls -R (recursive) when you want the full hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Execution:
bin/hdfs dfs -ls /user
2). mkdir: Creates a directory. In HDFS there is no home directory by default, so let's first create one.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Execution:
bin/hdfs dfs -mkdir /user/khalid
3). touchz: Creates an empty file at the given HDFS path.
Syntax:
bin/hdfs dfs -touchz <file path>
Execution:
bin/hdfs dfs -touchz /user/khalid/tst.txt
4). copyFromLocal: Copies files/folders from the local file system to HDFS. This is one of the most frequently used commands. Local file system means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Execution:
bin/hdfs dfs -copyFromLocal /home/nitdelhipc22/Downloads/hadoopfile.doc /user/khalid/
5). copyToLocal: Copies files/folders from HDFS to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <src(on hdfs)> <local dest>
Execution:
bin/hdfs dfs -copyToLocal /user/khalid/tst.txt /home/nitdpc22
6). cp: Copies a file from one HDFS directory to another. Multiple files can also be copied with this command.
Syntax:
bin/hdfs dfs -cp <srcfile(on hdfs)> <dest(hdfs)>
Execution:
bin/hdfs dfs -cp /user/khalid/tst.txt /user/demo
7). moveFromLocal: Moves a file from the local file system to HDFS (the local copy is removed).
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Execution:
bin/hdfs dfs -moveFromLocal /home/nitdpc22/csa.txt /user/khalid
8). moveToLocal: Moves a file from HDFS to the local file system.
Syntax:
bin/hdfs dfs -moveToLocal <src(on hdfs)> <local dest>
Execution:
bin/hdfs dfs -moveToLocal /user/khalid/tst.txt /home/nitdpc22
9). mv: Moves a file from one HDFS directory to another. It also allows moving multiple files.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Execution:
bin/hdfs dfs -mv /user/khalid/tst.txt /user/demo
10). du: Shows the disk usage, in bytes, of the files and directories under the given path.
Syntax:
bin/hdfs dfs -du <path(on hdfs)>
Execution:
bin/hdfs dfs -du /user/khalid
11). text: Takes a source file and outputs the file in text format.
Syntax:
bin/hdfs dfs -text <src(on hdfs)>
Execution:
bin/hdfs dfs -text /user/khalid/tst.txt
12). count: Counts the number of directories, files, and bytes under the paths that match the specified file pattern.
Syntax:
bin/hdfs dfs -count <src(on hdfs)>
Execution:
bin/hdfs dfs -count /user
Experiment 3
Lab-4
MongoDB Installation and Commands
Introduction:
MongoDB is a cross-platform, document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas.
MongoDB Installation:
STEP 1: Importing the public key.
Commands:
sudo apt-key adv --keyserver
hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
STEP 2: Create the MongoDB source list file.
Commands:
echo "deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
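The remaining installation steps are not reproduced here; a typical continuation for the 3.4 repository added above is sketched below:

sudo apt-get update                  # refresh the package index with the new repository
sudo apt-get install -y mongodb-org  # install the MongoDB server, shell and tools
sudo service mongod start            # start the MongoDB daemon
mongo                                # open the mongo shell to run the queries below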
MongoDB Commands:
The find() method: To query and display all documents from a collection.
Syntax:
db.COLLECTION_NAME.find()
Execution:
user.mycollection.find()
The pretty() method: To display the results in a formatted way.
Syntax:
db.COLLECTION_NAME.find().pretty()
Execution:
user.mycollection.find().pretty()
1). Equality: To match documents where the key equals the given value.
Syntax:
{<key>:<value>}
Execution:
user.mycollection.find({"id":190}).pretty()
2). Less Than: To check the less than condition.
Syntax:
{<key>:{$lt:<value>}}
Execution:
user.mycollection.find({"id":{$lt:150}}).pretty()
3). Less Than Equals: To check the less than and equals condition.
Syntax:
{<key>:{$lte:<value>}}
Execution:
user.mycollection.find({"id":{$lte:150}}).pretty()
4). Greater Than: To check the greater than condition.
Syntax:
{<key>:{$gt:<value>}}
Execution:
user.mycollection.find({"id":{$gt:150}}).pretty()
5). Greater Than Equals: To check the greater than and equals condition.
Syntax:
{<key>:{$gte:<value>}}
Execution:
user.mycollection.find({"id":{$gte:150}}).pretty()
6). Not Equals: To check the not equals condition.
Syntax:
{<key>:{$ne:<value>}}
Execution:
user.mycollection.find({"id":{$ne:150}}).pretty()
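The queries above assume a collection named mycollection that already contains documents with an id field. A minimal sketch for creating such data from the mongo shell, where the variable user is just a handle to a (hypothetical) database named user obtained with db.getSiblingDB, might be:

var user = db.getSiblingDB("user")                        // hypothetical database name
user.mycollection.insert({ "id": 190, "name": "khalid" }) // illustrative documents
user.mycollection.insert({ "id": 120, "name": "demo" })
user.mycollection.find({ "id": { $gt: 150 } }).pretty()   // matches only the document with id 190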
Experiment 4
Lab 5
HIVE Installation
Introduction:
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a
SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Installation:
STEP 1: Download Hive tar.
Command: wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
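The remaining setup steps are not shown above; a typical continuation is to extract the archive and point HIVE_HOME at it in .bashrc (the paths below assume the tar was downloaded to the home directory):

tar -xzf apache-hive-2.1.0-bin.tar.gz
export HIVE_HOME="$HOME/apache-hive-2.1.0-bin"
export PATH="$PATH:$HIVE_HOME/bin"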
Hive Commands:
1).Create database:
Syntax:
CREATE DATABASE [IF NOT EXISTS] <database_name>
Execution:
CREATE DATABASE IF NOT EXISTS khalid
2).Drop:
Syntax:
DROP DATABASE [IF EXISTS] database_name
Execution:
DROP DATABASE IF EXISTS khalid
3).Create Table:
Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
Execution:
CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
4).Load Data:
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
Execution:
LOAD DATA LOCAL INPATH 'home/user/khalid.txt' OVERWRITE INTO TABLE employee
5).Select Data:
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
Execution:
SELECT * FROM employee
Execution:
SELECT * FROM employee WHERE salary > 5000
Execution:
SELECT * FROM employee ORDER BY salary
Experiment 5
Lab 6
PIG INSTALLATION
Introduction:
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.
Installation:
STEP 1: Download the Apache Pig tar file.
http://mirrors.estointernet.in/apache/pig/pig-0.16.0/
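As with Hive, the extraction and environment setup are not reproduced; a sketch, assuming the downloaded file pig-0.16.0.tar.gz sits in the home directory, is:

tar -xzf pig-0.16.0.tar.gz
export PIG_HOME="$HOME/pig-0.16.0"
export PATH="$PATH:$PIG_HOME/bin"
pig -x local    # start the Grunt shell in local mode to try the commands below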
Pig Commands:
1).Load Data:
Syntax:
relation_name = LOAD 'info' [USING FUNCTION] [AS SCHEMA];
Execution:
A = LOAD 'user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
2).Distinct Operator:
Execution:
A = LOAD 'user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
Result = DISTINCT A;
3).ForEach operator:
Execution:
A = LOAD 'user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
fe = FOREACH A GENERATE a1, a2;
4).Group operator:
Execution:
A = LOAD 'user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
GroupByA1 = GROUP A BY a1;
5).Order by operator:
Execution:
A = LOAD 'user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
Result = ORDER A BY a1 DESC;
6).Show data:
Command:
Dump Result;
Experiment 6
Lab 7,8
MapReduce
Introduction:
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel,
distributed algorithm on a cluster. A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a
reduce method, which performs a summary operation.
Algorithm:
Generally the MapReduce paradigm is based on sending the computation to where the data resides.
MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage:
Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.
Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that
comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the
cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the
Hadoop server.
Record Reader
This is the first phase of MapReduce where the Record Reader reads every line from the input text file as text and yields output as key-
value pairs.
Output − Forms the key-value pairs. The following is the set of expected key-value pairs.
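The sample input and its key-value pairs are not reproduced above. Judging from the word counts listed at the end of this section, the input text and the Record Reader output (byte offset of each line as key, the line itself as value) would be roughly:

Input text:
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance

Record Reader output:
<0, What do you mean by Object>
<27, What do you know about Java>
<55, What is Java Virtual Machine>
<84, How Java enabled High Performance>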
Map Phase
The Map phase takes input from the Record Reader, processes it, and produces the output as another set of key-value pairs.
Input − The following key-value pair is the input taken from the Record Reader.
The Map phase reads each key-value pair, divides each word from the value using StringTokenizer, treats each word as key and the
count of that word as value. The following code snippet shows the Mapper class and the map function.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);   // count emitted with every word
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // split the input line into words and emit (word, 1) for each token
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens())
        {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
Combiner Phase
The Combiner phase takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection
pairs.
Input − The following key-value pair is the input taken from the Map phase.
The Combiner phase reads each key-value pair, combines the common words as key and values as collection. Usually, the code and
operation for a Combiner is similar to that of a Reducer. Following is the code snippet for Mapper, Combiner and Reducer class
declaration.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
Reducer Phase
The Reducer phase takes each key-value collection pair from the Combiner phase, processes it, and passes the output as key-value pairs. Note that the Combiner functionality is the same as that of the Reducer.
Input − The following key-value pair is the input taken from the Combiner phase.
The Reducer phase reads each key-value pair. Following is the code snippet for the Reducer (the same class also serves as the Combiner).
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        // sum the counts of every occurrence of the key and emit (word, total)
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Record Writer
This is the last phase of MapReduce where the Record Writer writes every key-value pair from the Reducer phase and sends the output
as text.
Input − Each key-value pair from the Reducer phase along with the Output format.
Output − It gives you the key-value pairs in text format. Following is the expected output.
What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as Combiner): sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Save the above program as WordCount.java. The compilation and execution of the program are given below.
Follow the steps given below to compile and execute the above program.
Step 1 − Use the following command to create a directory to store the compiled java classes.
$ mkdir units
Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. You can download the jar
from mvnrepository.com.
Step 3 − Use the following commands to compile the WordCount.java program and to create a jar for the program.
Step 5 − Use the following command to copy the input file named input.txt in the input directory of HDFS.
Step 6 − Use the following command to verify the files in the input directory.
Step 7 − Use the following command to run the Word count application by taking input files from the input directory.
Wait for a while till the job finishes. After execution, the output shows the number of input splits, Map tasks, and Reducer tasks.
Step 8 − Use the following command to verify the resultant files in the output folder.
Step 9 − Use the following command to see the output in Part-00000 file. This file is generated by HDFS.
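The individual commands for Steps 3 to 9 are not reproduced above; a plausible end-to-end sequence, assuming a class directory named units, a jar named units.jar, and HDFS directories /input_dir and /output_dir (all of these names are illustrative), is:

javac -classpath "$(hadoop classpath)" -d units WordCount.java   # Step 3: compile against the installed Hadoop's jars
jar -cvf units.jar -C units/ .                                   # Step 3: package the compiled classes
hadoop fs -mkdir /input_dir                                      # create the HDFS input directory
hadoop fs -put input.txt /input_dir                              # Step 5: copy the input file to HDFS
hadoop fs -ls /input_dir                                         # Step 6: verify the input files
hadoop jar units.jar WordCount /input_dir /output_dir            # Step 7: run the word count job
hadoop fs -ls /output_dir                                        # Step 8: verify the output files
hadoop fs -cat /output_dir/part-*                                # Step 9: view the result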
What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1