
NATIONAL INSTITUTE OF TECHNOLOGY DELHI

COMPUTER SCIENCE AND ENGINEERING

Lab Report
of
BIG DATA ANALYTICS

Submitted by:
MD. khalid
192211009
M.Tech-CSE (Analytics), 1st Semester
INDEX

1. Introduction to Hadoop and its Installation

2. Hadoop Installation (Edit Configuration Files)

3. Introduction to HDFS and its Commands

4. MongoDB Installation and Commands

5. Hive Installation and Commands

6. Pig Installation and Commands

7. MapReduce
Experiment 1

Lab-1

Hadoop Installation
Introduction
Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving
massive amounts of data and computation.
Components of Hadoop:
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that provides high throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Installation:
Step 1: Download the Java 8 Package. Save this file in your home directory.

Step 2: Extract the Java Tar File.


Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Step 3: Download the Hadoop 2.7.3 Package.


Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Step 4: Extract the Hadoop tar File.


Command: tar -xvf hadoop-2.7.3.tar.gz

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open .bashrc file. Now, add Hadoop and Java Path.
Command: gedit .bashrc

Put the lines below in the .bashrc file:

export HADOOP_HOME="$HOME/hadoop-2.7.3"
export HADOOP_CONF_DIR="$HOME/hadoop-2.7.3/etc/hadoop"
export HADOOP_MAPRED_HOME="$HOME/hadoop-2.7.3"
export HADOOP_COMMON_HOME="$HOME/hadoop-2.7.3"
export HADOOP_HDFS_HOME="$HOME/hadoop-2.7.3"
export YARN_HOME="$HOME/hadoop-2.7.3"
export PATH="$PATH:$HOME/hadoop-2.7.3/bin"

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export PATH="$PATH:$JAVA_HOME/bin"

Now, save the bash file and close it.


For applying all these changes to the current Terminal, execute the source command.
Command: source .bashrc
To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the Terminal, execute
the java -version and hadoop version commands.
Command: java -version
hadoop version
Lab2
Hadoop Installation (Contd.)

Step 6: Edit the Hadoop Configuration files.


Command: cd hadoop-2.7.3/etc/hadoop/
Step 7: Open core-site.xml and edit the property mentioned below inside configuration tag:

core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It contains configuration settings of Hadoop core such as
I/O settings that are common to HDFS & MapReduce.
Command: gedit core-site.xml

Configure file with the configuration given below:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:

hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the
replication factor and block size of HDFS.
Command: gedit hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside configuration tag:

mapred-site.xml contains configuration settings of the MapReduce application, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available to a process, etc.

Command: gedit mapred-site.xml
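Note: a fresh Hadoop 2.7.3 extraction ships only a template for this file, so it may first have to be created from the template (run inside hadoop-2.7.3/etc/hadoop/):

cp mapred-site.xml.template mapred-site.xml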


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside configuration tag:

yarn-site.xml contains configuration settings for the ResourceManager and NodeManager, such as application memory management size, the operations needed on programs and algorithms, etc.
Command: gedit yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:

hadoop-env.sh contains the environment variables that are used in the script to run Hadoop like Java home path, etc.

Command: gedit hadoop-env.sh
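For example, the Java path line inside hadoop-env.sh can be set to the same JDK location used in .bashrc above (adjust the path if your JDK is installed elsewhere):

export JAVA_HOME=/usr/lib/jvm/java-8-oracle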

Step 12: Go to Hadoop home directory and format the NameNode.


Command: cd hadoop-2.7.3

bin/hadoop namenode -format

Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the daemons.

Command: cd hadoop-2.7.3/sbin

Either you can start all daemons with a single command or do it individually.
For starting dfs daemon:

Command: ./start-dfs.sh
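The single command mentioned above is ./start-all.sh (still shipped with Hadoop 2.7.x, though marked deprecated). To bring up the YARN daemons separately and confirm which daemons are running, the following can be used; jps is the JDK utility that lists the running Java processes:

Command: ./start-yarn.sh
jps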

Step 14: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
Experiment 2
Lab 3
HDFS Commands
Introduction:
HDFS is the primary or major component of the Hadoop ecosystem which is responsible for storing large data sets of structured or
unstructured data across various nodes and thereby maintaining the metadata in the form of log files.
HDFS commands are as follow:

(Execution is in folder hadoop-2.7.3)

1). ls: This command is used to list all the files. Use -ls -R for a recursive listing; it is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>

Execution:
bin/hdfs dfs -ls /user

2).mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

Execution:
bin/hdfs dfs -mkdir /user/khalid

3).touchz: It creates an empty file.


Syntax:
bin/hdfs dfs -touchz <file_path>

Execution:
bin/hdfs dfs -touchz /user/khalid/tst.txt

4).copyFromLocal : To copy files/folders from local file system to hdfs store. This is the most important command. Local filesystem
means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

Execution:
bin/hdfs dfs -copyFromLocal /home/nitdelhipc22/Downloads/hadoopfile.doc /user/khalid/

5).copyToLocal: To copy files/folders from hdfs store to local file system.


Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

Execution:
bin/hdfs dfs -copyToLocal /user/khalid/tst.txt /home/nitdpc22

6).cp: It copies a file from one directory to another. Multiple files can also be copied with this command.
Syntax:
bin/hdfs dfs -cp <srcfile(on hdfs)> <dest(hdfs)>

Execution:
bin/hdfs dfs -cp /user/khalid/tst.txt /user/demo

7).moveFromLocal: This command will move file from local to hdfs.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>

Execution:
bin/hdfs dfs -moveFromLocal /home/nitdpc22/csa.txt /user/khalid

8).moveToLocal: This command will move file from hdfs to local.


Syntax:
bin/hdfs dfs -moveToLocal <src(on hdfs)> <local dest>

Execution:
bin/hdfs dfs -moveToLocal /user/khalid/tst.txt /home/nitdpc22

9).mv: This command will move a file from one directory to another. It also allows moving multiple files.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dst(on hdfs)>

Execution:
bin/hdfs dfs -mv /user/khalid/tst.txt /user/demo

10).du: It will give the size of each file in a directory.


Syntax:
bin/hdfs dfs -du <dirName>

Execution:
bin/hdfs dfs -du /user/khalid

11).text: This command takes a source file and outputs the file in text format.
Syntax:
bin/hdfs dfs -text <src(on hdfs)>

Execution:
bin/hdfs dfs -text /user/khalid/tst.txt

12) .count:HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file
pattern.
Syntax:
bin/hdfs dfs -count <src(on hdfs)>

Execution:
bin/hdfs dfs -count /user
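A full list of the supported file system commands, along with per-command usage, can be printed from the same folder; for example:

bin/hdfs dfs -help
bin/hdfs dfs -help count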
Experiment 3
Lab-4
MongoDB Installation and Commands

Introduction:
MongoDB is a cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like
documents with optional schemas.

MongoDB Installation:
STEP 1: Import the Public Key
Commands:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6

STEP 2: Create a List File for MongoDB


Create a mongodb-org-3.4.list file inside the /etc/apt/sources.list.d location. Use the following command.

Commands:
echo "deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse"
sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list

STEP 3: Update the Local Package Index


Command:
sudo apt-get update

STEP 4: Install Mongo DB


Command:
sudo apt-get install -y mongodb-org

STEP 5: Start Mongo DB


Command:
sudo service mongodb start
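To confirm that the daemon actually started before opening the shell, its status can be checked (the exact output wording depends on the Ubuntu version):

sudo service mongodb status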

STEP 6: Run Mongo DB


Command:
mongo

Commands for Mongo DB:


1).insert():To insert data into MongoDB collection.
Command:
db.COLLECTION_NAME.insert(document)
Execution:
user.mycollection.insert({
title: 'MongoDB Overview',
description: 'MongoDB is no sql database',
by: 'tutorials point',
url: 'http://www.tutorialspoint.com',
tags: ['mongodb', 'database', 'NoSQL'],
likes: 100
})

2). find(): This will display all documents in an unstructured way.


Command:
db.COLLECTION_NAME.find()

Execution:
user.mycollection.find()

3). pretty(): This displays the results in a formatted way.


Command:
db.mycol.find().pretty()

Execution:
user.mycollection.find().pretty()

Condition related commands:

1). Equality: To check the equality condition.


Syntax:
{<key>:<value>}

Execution:
user.mycollection.find({"id":190}).pretty()

2). Less Than: To check less than condition.


Syntax:
{<key>:{$lt:<value>}}

Execution:
user.mycollection.find({"id":{$lt:150}}).pretty()

3).Less Than Equals: To check less than and equals condition.


Syntax:
{<key>:{$lte:<value>}}

Execution:
user.mycollection.find({"id":{$lte:150}}).pretty()

4).Greater Than: To check greater than condition.


Syntax:
{<key>:{$gt:<value>}}

Execution:
user.mycollection.find({"id":{$gt:150}}).pretty()

5). Greater Than Equals: To check the greater than and equals condition.
Syntax:
{<key>:{$gte:<value>}}

Execution:
user.mycollection.find({"id":{$gte:150}}).pretty()

6).Not Equals: To check the not equals condition.


Syntax:
{<key>:{$ne:<value>}}

Execution:
user.mycollection.find({"id":{$ne:150}}).pretty()
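These condition operators can also be combined in a single query. For example, a range check on the same id field (the bounds 100 and 200 are only illustrative):

user.mycollection.find({"id":{$gt:100, $lt:200}}).pretty()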

Experiment 4
Lab 5

HIVE Installation
Introduction:
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a
SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Installation:
STEP 1: Download Hive tar.
Command: wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz

STEP 2: Extracting and verifying Hive Archive


Command:
tar zxvf apache-hive-2.1.0-bin.tar.gz
ls

STEP 3: Open bashrc file


Command :
gedit ~/.bashrc

STEP 4: Provide the following HIVE_HOME path.


export HIVE_HOME=/home/user/apache-hive-2.1.0-bin
export PATH=$PATH:/home/user/apache-hive-2.1.0-bin/bin

STEP 5:Update the environment variable.


source ~/.bashrc

STEP 6: Start the hive by providing the following command.


hive

Hive Commands:
1).Create database:
Syntax:
CREATE DATABASE [IF NOT EXISTS] <database name>

Execution:
CREATE DATABASE IF NOT EXISTS khalid;

2).Drop:
Syntax:
 DROP DATABASE [IF EXISTS] database_name

Execution:
DROP DATABASE IF EXISTS khalid;

3).Create Table:
Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

[(col_name data_type [COMMENT col_comment], ...)]


[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Execution:
CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

4).Load Data:
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename

Execution:
LOAD DATA LOCAL INPATH '/home/user/khalid.txt' OVERWRITE INTO TABLE employee;
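For the load to parse correctly, the input file must match the delimiters declared for the table, i.e. tab-separated fields and newline-terminated rows. A purely illustrative khalid.txt could therefore look like this (columns separated by tab characters):

1201	Gopal	45000	Technical manager
1202	Manisha	45000	Proof reader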

5).Select Data:
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference

Execution:
SELECT * FROM employee

6).Select with where condition:


Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]

Execution:
SELECT * FROM employee where salary>5000

7).Select with order by:


Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[ORDER BY col_list]

Execution:
Select * from employee order by salary
Experiment 5
Lab 6
PIG INSTALLATION

Introduction:
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.

Installation:
STEP 1: Download the Apache Pig tar file.
http://mirrors.estointernet.in/apache/pig/pig-0.16.0/

STEP 2: Unzip the downloaded tar file.


Command: tar -xvf pig-0.16.0.tar.gz

STEP 3: Open the bashrc file.


Command: gedit ~/.bashrc

STEP 4: Provide the following PIG_HOME path.


export PIG_HOME=/home/user/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH="$HADOOP_CONF_DIR"

STEP 5: Update the environment variable


Command: source ~/.bashrc

STEP 6: Test the installation by typing the following at the command prompt.


Command: pig -h

STEP 7: Start the pig in MapReduce mode.


Command: pig
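Pig can also be started in local mode (running against the local file system instead of HDFS), which is convenient for quick testing:

Command: pig -x local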

Pig Commands:
1).Load Data:
Syntax:
relation = LOAD 'file_path' [USING function] [AS schema];

Execution:
A = LOAD '/user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
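The commands below assume pload.txt has already been copied to HDFS and that its rows match the declared schema: four integer columns separated by commas, for example (values are illustrative):

1,2,3,4
5,6,7,8
9,10,11,12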

2).Distinct Operator:
Execution:
A = LOAD '/user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
Result = DISTINCT A;

3).ForEach operator:
Execution:
A = LOAD '/user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
fe = FOREACH A GENERATE a1, a2;

4).Group operator:
Execution:
A = LOAD '/user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
Result = GROUP A BY a1;

5).Order by operator:
Execution:
A = LOAD '/user/khalid/pload.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int,a4:int);
Result = ORDER A BY a1 DESC;

6).Show data:
Command:
Dump Result;

Experiment 6
Lab 7,8
MapReduce
Introduction:
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel,
distributed algorithm on a cluster. A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a
reduce method, which performs a summary operation.

Algorithm:
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage:

Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.

Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that
comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the
cluster between the nodes.

Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the
Hadoop server.

Map-Reduce Implementation for counting the number of words in a program.

Consider the following text in a text file input.txt for MapReduce.

What do you mean by Object


What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance

Record Reader
This is the first phase of MapReduce where the Record Reader reads every line from the input text file as text and yields output as key-
value pairs.

Input − Line by line text from the input file.

Output − Forms the key-value pairs. The following is the set of expected key-value pairs.

<1, What do you mean by Object>


<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>

Map Phase
The Map phase takes input from the Record Reader, processes it, and produces the output as another set of key-value pairs.

Input − The following key-value pair is the input taken from the Record Reader.

<1, What do you mean by Object>


<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>

The Map phase reads each key-value pair, divides each word from the value using StringTokenizer, treats each word as key and the
count of that word as value. The following code snippet shows the Mapper class and the map function.

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>


{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}

Output − The expected output is as follows −

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>


<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

Combiner Phase
The Combiner phase takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection
pairs.

Input − The following key-value pair is the input taken from the Map phase.

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>


<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

The Combiner phase reads each key-value pair, combines the common words as key and values as collection. Usually, the code and
operation for a Combiner is similar to that of a Reducer. Following is the code snippet for Mapper, Combiner and Reducer class
declaration.

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

Output − The expected output is as follows −

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>


<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Reducer Phase
The Reducer phase takes each key-value collection pair from the Combiner phase, processes it, and passes the output as key-value
pairs. Note that the Combiner functionality is the same as the Reducer's.

Input − The following key-value pair is the input taken from the Combiner phase.

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>


<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

The Reducer phase reads each key-value pair. Following is the code snippet for the Reducer (the same IntSumReducer class also serves as the Combiner).

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>


{
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

Output − The expected output from the Reducer phase is as follows −

<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>


<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Record Writer
This is the last phase of MapReduce where the Record Writer writes every key-value pair from the Reducer phase and sends the output
as text.

Input − Each key-value pair from the Reducer phase along with the Output format.
Output − It gives you the key-value pairs in text format. Following is the expected output.

What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1

Java program for the implementation of word count using MapReduce

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {


public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>


{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Save the above program as WordCount.java. The compilation and execution of the program is given below.

Compilation and Execution


Let us assume we are in the home directory of Hadoop user (for example, /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1 − Use the following command to create a directory to store the compiled java classes.

$ mkdir units

Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. You can download the jar
from mvnrepository.com.

Let us assume the downloaded folder is /home/hadoop/.

Step 3 − Use the following commands to compile the WordCount.java program and to create a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units WordCount.java


$ jar -cvf units.jar -C units/ .
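Since this report installs Hadoop 2.7.3, an alternative to downloading the old core jar is to compile against the classpath of the installed distribution (assuming HADOOP_HOME is set as in the .bashrc edited earlier):

$ javac -classpath "$($HADOOP_HOME/bin/hadoop classpath)" -d units WordCount.java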

Step 4 − Use the following command to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5 − Use the following command to copy the input file named input.txt in the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/input.txt input_dir

Step 6 − Use the following command to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7 − Use the following command to run the Word count application by taking input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar WordCount input_dir output_dir

Wait for a while till the job finishes. After execution, the output shows the number of input splits, Map tasks, and Reducer tasks.

Step 8 − Use the following command to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9 − Use the following command to see the output in the part-00000 file. This file is generated by the MapReduce job.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000


Following is the output generated by the MapReduce program.

What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1
