Practical-1
Aim :- Make a single node cluster in Hadoop.
Solution:-
To install Hadoop, you should have Java version 1.8 installed on your system.
Check your Java version with the following command at the command prompt:
java -version
Extract the downloaded Hadoop archive to a folder and create a user variable named HADOOP_HOME whose value is the path of that folder. Likewise, create a new user variable with the name JAVA_HOME and, as its value, the path of the Java (JDK) installation directory.
Now we need to add the Hadoop bin directory and the Java bin directory to the system Path variable.
Click on New and add the bin directory paths of Hadoop and Java to it.
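For reference, a typical set of values looks like the following (the exact paths are assumptions; adjust them to wherever you extracted Hadoop and installed the JDK):

JAVA_HOME   = C:\Java\jdk1.8.0_202
HADOOP_HOME = C:\hadoop
Path        = ...;%JAVA_HOME%\bin;%HADOOP_HOME%\bin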
Configurations
Now we need to edit some configuration files located in the etc\hadoop folder of the Hadoop installation directory. The files that need to be edited are described below.
Edit the file core-site.xml in the etc\hadoop directory and copy the required XML property into the <configuration> section of the file.
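The exact property from the original screenshot is not reproduced here; on a single-node cluster it is typically the default file system URI, for example (the localhost host name and port 9000 below are assumed values, adjust as needed):

<property>
  <!-- Default file system used by HDFS clients; host and port are assumed values -->
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>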
Edit the file hdfs-site.xml and add below property in the configuration.
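Again, the property from the screenshot is not reproduced; a commonly used single-node configuration sets the replication factor to 1 and points the NameNode and DataNode at the folders created in the step below (the C:\hadoop\data\dfs paths are assumptions):

<property>
  <!-- Single-node cluster: keep only one replica of each block -->
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <!-- Assumed path of the namenode folder created below -->
  <name>dfs.namenode.name.dir</name>
  <value>C:\hadoop\data\dfs\namenode</value>
</property>
<property>
  <!-- Assumed path of the datanode folder created below -->
  <name>dfs.datanode.data.dir</name>
  <value>C:\hadoop\data\dfs\datanode</value>
</property>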
Edit the file yarn-site.xml and add below property in the configuration.
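As the original screenshot is not reproduced, the following are standard single-node values rather than the exact ones used:

<property>
  <!-- Enables the shuffle service that MapReduce jobs need -->
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>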
Create a folder named dfs inside the data folder, and inside dfs create two folders named 'datanode' and 'namenode'.
Hadoop needs Windows-specific binaries (such as winutils.exe) that do not come with the default Hadoop download; you can download them from GitHub.
Verify the installation by running:
hadoop version
Since this command does not throw an error and successfully shows the Hadoop version, Hadoop has been installed successfully on the system.
Format the NameNode before starting the cluster:
hdfs namenode -format
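After formatting, the HDFS and YARN daemons can be started from the sbin folder. A minimal sketch, assuming Hadoop is installed at C:\hadoop:

C:\hadoop\sbin>start-dfs.cmd
C:\hadoop\sbin>start-yarn.cmd

This starts the NameNode, DataNode, ResourceManager and NodeManager; the NameNode web UI (http://localhost:9870 on Hadoop 3.x, or http://localhost:50070 on Hadoop 2.x) then shows the cluster and NameNode information referred to below.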
Cluster
Namenode information:
Conclusion:
Thus, in this Practical we learnt about installing and setting up the Hadoop framework and
making a single node cluster in Hadoop.
PRACTICAL-2
AIM:- Run a word count program in Hadoop on a 250 MB dataset.
SOLUTION:
1. Create a text file with some content. We'll pass this file as input to
the wordcount MapReduce job for counting words.
C:\file1.txt:- This is the 250 MB data file that we will use for the word count problem.
Prerequisite: Hadoop must already be installed (see Practical-1).
2. Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt')
to be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
3. Copy the text file (say 'file1.txt') from the local disk to the newly created 'input'
directory in HDFS.
C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input
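4. Run the word count MapReduce job on the 'input' directory, writing the results to 'output'. The exact jar name depends on the installed Hadoop version; a sketch using the bundled examples jar (the 3.2.1 version number here is an assumption) is:

C:\hadoop>bin\yarn jar share\hadoop\mapreduce\hadoop-mapreduce-examples-3.2.1.jar wordcount input output

The counters printed at the end of the job include the following: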
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=145
CPU time spent (ms)=1418
Physical memory (bytes) snapshot=368246784
Virtual memory (bytes) snapshot=513716224
Total committed heap usage (bytes)=307757056
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=55
File Output Format Counters
Bytes Written=59
5. Check output.
C:\hadoop>bin\hdfs dfs -cat output/*
Example 1
Hadoop 2
Install 1
Mapreduce 1
Run 1
Wordcount 1
Practical-3
Aim: Understand the logs generated by MapReduce program.
Solution:
Job Client
Job Client is used by the user to facilitate execution of the MapReduce job.
When a user writes a MapReduce job, they will typically invoke the job client in
their main class to configure and launch the job. In this example, we will be
using SleepJob to cause mapper tasks to sleep for an extended period of time.
When the job is submitted, the job client first copies the job resources into a
.staging directory in HDFS:
/user/gpadmin/.staging/job_1389385968629_0025/appTokens
/user/gpadmin/.staging/job_1389385968629_0025/job.jar
/user/gpadmin/.staging/job_1389385968629_0025/job.split
/user/gpadmin/.staging/job_1389385968629_0025/job.splitmetainfo
/user/gpadmin/.staging/job_1389385968629_0025/job.xml
/user/gpadmin/.staging/job_1389385968629_0025/job_1389385968629_0025_1.jhist
/user/gpadmin/.staging/job_1389385968629_0025/job_1389385968629_0025_1_conf.xml
After .staging is created, job client will submit the job to the resource manager
service (application manager port 8032). Then job client will continue to
monitor the execution of the job and report back to the console with the
progress of the map and reduce containers. That is why you see the "map 5%
reduce 0%" while the job is running. Once the job completes, job client will
return some statistics about the job that it collected during execution.
Remember that job client gets map and reduce container statuses from the
Application Master directly. We will talk a bit more about that later but for
now here is an example of running the sleep job, so it hangs for a really long
time while we observe the map containers execute.
[gpadmin@hdm1 ~]$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient.jar sleep -m 3 -r 1 -mt 6000000
Note: You can kill the MapReduce job using the following command:
[root@hdw3 yarn]# yarn application -kill application_1389385968629_0025
Output:
Application Master
Once the application manager service has decided to start running the job,
it chooses one of the NodeManagers to launch the MapReduce application
master class, which is called org.apache.hadoop.mapreduce.v2.app.MRAppMaster.
The application master will be launched on one of the NodeManager
servers running in the environment. The NodeManager selected by the
resource manager is largely dependent on the available resources within the
cluster. The NodeManager service will generate shell scripts in the local
application cache, which are used to execute the application master
container. The container's local application cache then contains files such as:
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/jobSubmitDir
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/jobSubmitDir/job.split
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/jobSubmitDir/appTokens
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/jobSubmitDir/job.splitmetainfo
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/job.xml
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/.default_container_executor.sh.crc
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/launch_container.sh
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/tmp
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/.container_tokens.crc
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/container_tokens
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/job.jar
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/default_container_executor.sh
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/.launch_container.sh.crc
The container executor class running in the NodeManager service will then use
launch_container.sh to execute the Application Master class. As shown below, all
stdout and stderr logs are redirected to ${yarn.nodemanager.log-dirs}, which is
defined in yarn-site.xml.
[gpadmin@hdw3 yarn]# tail -1 nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000001/launch_container.sh
Once launched, the Application Master will issue resource allocation requests
for the map and reduce containers in the queue to the ResourceManager
service. When the resource manager determines that there are enough
resources on the cluster to grant the allocation request, it will inform the
Application Master, which then asks the chosen NodeManager to launch the
map or reduce container.
The container executor class in the NodeManager will do the same for a map or
reduce container as it did with the Application Master class. All files and shell
scripts will be added into the container's application cache within the nm-local-dir:
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/job.xml
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/.default_container_executor.sh.crc
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/launch_container.sh
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/tmp
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/.job.xml.crc
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/.container_tokens.crc
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/container_tokens
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/job.jar
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/default_container_executor.sh
nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/.launch_container.sh.crc
Note: job.jar is only a soft link that points to the actual job.jar in the
application's filecache directory. This is how YARN handles the distributed cache
for containers:
[root@hdw1 yarn]# ls -l nm-local-dir/usercache/gpadmin/appcache/application_1389385968629_0025/container_1389385968629_0025_01_000003/
total 96
Note: By setting the following parameter, the above container launch scripts and
user cache will remain on the system for a specified period of time; otherwise
these files are deleted after the application completes.
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>10000000</value>
</property>
During run time you will see all the container logs in ${yarn.nodemanager.log-dirs}:
[root@hdw3 yarn]# find userlogs/ -print
userlogs/
userlogs/application_1389385968629_0025
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000001
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000001/stdout
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000001/stderr
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000001/syslog
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000002
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000002/stdout
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000002/stderr
userlogs/application_1389385968629_0025/container_1389385968629_0025_01_000002/syslog
Once the job has completed, the NodeManager will keep the logs for each
container for ${yarn.nodemanager.log.retain-seconds}, which is 10800 seconds
(3 hours) by default, and delete them once they have expired. But if
${yarn.log-aggregation-enable} is enabled, then the NodeManager will immediately
concatenate all of the container logs into one file, upload them into HDFS under
${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>,
and delete them from the local userlogs directory. Log aggregation is enabled
by default in PHD and it makes log collection convenient.
/yarn/apps/gpadmin/logs/application_1389385968629_0025/
Found 3 items
/yarn/apps/gpadmin/logs/application_1389385968629_0025/hdw1.hadoop.local_30825
-rw-r----- 3 gpadmin hadoop 5378 2014-02-01 16:54 /yarn/apps/gpadmin/logs/application_1389385968629_0025/hdw2.hadoop.local_36429
/yarn/apps/gpadmin/logs/applica
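When log aggregation is enabled, the aggregated files above can also be read back conveniently with the yarn logs command, for example (using the application ID from this run):

[gpadmin@hdm1 ~]$ yarn logs -applicationId application_1389385968629_0025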
PRACTICAL-4
AIM: Run two different Datasets/Different size of Datasets on Hadoop and
Compare the Logs
Solution:
Step 1: Adding the input dataset files to HDFS
Step 2: Running the job on Dataset 1
Step 3: Running the job on Dataset 2
A sketch of the commands corresponding to these steps is given below.
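A minimal sketch of the three steps, assuming the word count example job from Practical-2 and two placeholder input files dataset1.txt and dataset2.txt (the jar version number is also an assumption):

$ hdfs dfs -mkdir -p input
$ hdfs dfs -put dataset1.txt dataset2.txt input
$ hadoop jar hadoop-mapreduce-examples-3.2.1.jar wordcount input/dataset1.txt output1
$ hadoop jar hadoop-mapreduce-examples-3.2.1.jar wordcount input/dataset2.txt output2

The counters and logs produced by each run (see Practical-3) can then be compared for the two datasets, for example map input records, bytes read, and elapsed CPU and GC time.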
Practical-5
Aim: Develop a MapReduce application to sort a given file or perform aggregation on
some parameter.
Solution:
Steps to run a Map Reduce Application
1. Create the following files:
A. SalesDriver.java
package com.kamlesh;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SalesDriver
{
public static void main(String[] args)
{
JobClient my_client = new JobClient();
JobConf job_conf = new JobConf(SalesDriver.class);
job_conf.setJobName("SalePerCountry");
job_conf.setOutputKeyClass(Text.class);
job_conf.setOutputValueClass(IntWritable.class);
job_conf.setMapperClass(SalesMapper.class);
job_conf.setReducerClass(SalesReducer.class);
job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));
my_client.setConf(job_conf);
try
{
JobClient.runJob(job_conf);
}
catch (Exception e)
{
e.printStackTrace();
} } }
B. SalesMapper.java
package com.kamlesh;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
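Only the imports of SalesMapper.java are reproduced above; the class body itself appeared as a screenshot. A minimal sketch of a body consistent with the driver configuration (Text keys, IntWritable values) is given here, assuming the country name is the 8th comma-separated field of each SalesJan2009.csv record:

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        // Each input line is one CSV record; field index 7 (an assumption) holds the country name.
        String[] fields = value.toString().split(",");
        output.collect(new Text(fields[7]), one);
    }
}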
C. SalesReducer.java
package com.kamlesh;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
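As with the mapper, only the imports are reproduced above; a minimal sketch of a reducer body that sums the per-country counts emitted by the mapper would be:

public class SalesReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int frequencyForCountry = 0;
        // Add up the 1s emitted by the mapper for this country key.
        while (values.hasNext())
        {
            frequencyForCountry += values.next().get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}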
Next, create the required jar file Hadoop_Aggregation.jar by exporting the project
from the workspace.
4. Start the Hadoop DFS daemons and the Hadoop MapReduce/YARN daemons.
Start the Hadoop daemons using the commands $ start-dfs.sh and $ start-yarn.sh.
After successfully starting the required daemons, we check the input files
using the command $ hdfs dfs -ls /user/kamlesh/input.
Since the required file is not present, we copy the required file, SalesJan2009.csv,
into HDFS using the command $ hdfs dfs -put /home/kamlesh/Desktop/SalesJan2009.csv /user/kamlesh/input.
Now run the aggregation MapReduce application. Here we use the command:
$ hadoop jar /home/kamlesh/Desktop/Hadoop_Aggregation.jar com.kamlesh.SalesDriver /user/kamlesh/input/SalesJan2009.csv /user/kamlesh/output/CountryAndProducts
7. Getting Results
First check the names of the resultant files using the command $ hdfs dfs -ls /user/kamlesh/output.
Then copy the resultant file to the local file system using the command
$ hdfs dfs -get /user/kamlesh/output/CountryAndProducts/part-0000 /home/kamlesh/Desktop/Results.
8. Output
9. Stop Daemons
Finally, after the output has been obtained, stop the Hadoop daemons (the HDFS
daemons and the YARN daemons) using the commands $ stop-dfs.sh and $ stop-yarn.sh.
Conclusion:
In this practical we performed sorting and aggregation on the dataset using a
MapReduce application.
Practical-6
Solution:
Website: Data Portal Of India
URL: https://data.gov.in/
Dataset 2: Weather
This dataset describes the rainfall that occurred during the hot-weather season,
by district, in Tamil Nadu during 2016-17.
URL: https://tn.data.gov.in/resources/rainfall-occurred-during-hot-weather-season-districts-tamil-nadu-2016-17#web_catalog_tabs_block_10