UNIT-II
HADOOP
1. HISTORY OF HADOOP
2. THE HADOOP DISTRIBUTED FILE SYSTEM
3. COMPONENTS OF HADOOP
4. ANALYZING THE DATA WITH HADOOP
5. SCALING OUT
6. HADOOP STREAMING
7. DESIGN OF HDFS
8. JAVA INTERFACES TO HDFS BASICS
Technical terms:
Term: Scaling out
Literal meaning: To exit a position by selling in increments as the price of the stock climbs.
Technical meaning: To allow Hadoop to move the MapReduce computation to each machine hosting a part of the data.
1. HISTORY OF HADOOP
Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.
There are mainly two problems with big data:
1. The first is to store such a huge amount of data.
2. The second is to process that stored data.
Traditional approaches such as RDBMSs are not sufficient, due to the heterogeneity of the data.
Hadoop came as the solution to the problem of big data, i.e. storing and processing big data, with some extra capabilities.
History of Hadoop:
Hadoop began with Doug Cutting and Mike Cafarella in 2002, when they started working on the Apache Nutch project.
Apache Nutch was a project to build a search engine system that could index one billion pages.
After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive.
So they realized that their project architecture would not be capable of working with the billions of pages on the web.
They were therefore looking for a feasible solution that could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
The origin of the name of Hadoop:
Doug Cutting separated the distributed computing parts from Nutch and formed a new project, Hadoop. (He gave it the name Hadoop because it was the name of a yellow toy elephant owned by his son, and it was easy to pronounce and was a unique word.)
Year    Event
2024    Announcement of Apache Hadoop 3.4.0
Hadoop is an Apache project; all components are available under the Apache open-source license.
Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and
MapReduce).
HBase was originally developed at Powerset, now a department at Microsoft.
Hive was originated and developed at Facebook.
Pig, ZooKeeper, and Chukwa were originated and developed at Yahoo!
Avro was originated at Yahoo! and is being co-developed with Cloudera.
2. THE HADOOP DISTRIBUTED FILE SYSTEM
Some common HDFS file system shell commands:
o hadoop fs -rmr <arg>
Recursively deletes the directory identified by arg and all of its contents.
o put <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS.
o copyFromLocal <localSrc> <dest>
Identical to -put.
o moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already
exists at path, unless the file is already size 0.
o stat [format] <path>
Prints information about path. The format is a string which accepts the file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
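These shell commands have direct counterparts in the Java FileSystem API covered later in this unit. The class below is only an illustrative sketch of that correspondence; the paths are made up and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // -put / -copyFromLocal: copy a local file into HDFS
        fs.copyFromLocalFile(new Path("localfile.txt"), new Path("/user/training/localfile.txt"));
        // -cat: stream a file's contents to stdout
        IOUtils.copyBytes(fs.open(new Path("/user/training/localfile.txt")), System.out, 4096, false);
        // -touchz: create a zero-length file
        fs.createNewFile(new Path("/user/training/empty.txt"));
        // -rmr: delete a path recursively
        fs.delete(new Path("/user/training/empty.txt"), true);
    }
}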
3. COMPONENTS OF HADOOP
(https://www.ques10.com/p/2798/what-are-the-different-components-of-hadoop-fram-1/)
Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each providing
computation and storage.
Rather than rely on hardware to deliver high-availability, the framework itself is
designed to detect and handle failures at the application layer, thus delivering a highly-
available service on top of a cluster of computers, each of which may be prone to
failures.
1. Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that provides high-throughput access to data.
It provides a limited interface for managing the file system to allow it to scale and
provide high throughput.
HDFS creates multiple replicas of each data block and distributes them on computers
throughout a cluster to enable reliable and rapid access.
The main components of HDFS are as described below:
Name Node is the master of the system. It maintains the name system (directories and
files) and manages the blocks which are present on the Data Nodes.
Data Nodes are the slaves which are deployed on each machine and provide the actual
storage. They are responsible for serving read and write requests for the clients.
Secondary Name Node is responsible for performing periodic checkpoints. In the
event of Name Node failure, you can restart the NameNode using the checkpoint.
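This division of responsibilities can be seen from a client program: file metadata comes from the NameNode, while the block data itself is stored and served by DataNodes. The following is only a minimal sketch (the file path is illustrative) using the FileSystem API covered later in this unit:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/training/sample.txt");            // illustrative path
        FileStatus status = fs.getFileStatus(file);                    // file metadata (kept by the NameNode)
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());  // which DataNodes hold each block
        for (BlockLocation block : blocks) {
            // each block is replicated on several DataNodes
            System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
        }
    }
}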
2. MapReduce
MapReduce is a framework for performing distributed data processing using the
MapReduce programming paradigm.
In the MapReduce paradigm, each job has a user-defined map phase (a parallel, share-nothing processing of the input), followed by a user-defined reduce phase in which the output of the map phase is aggregated.
HDFS is the storage system for both input and output of the MapReduce jobs.
The main components of MapReduce:
JobTracker is the master of the system; it manages the jobs and resources in the cluster (the TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the actual data being processed, i.e. on the TaskTracker which is running on the same DataNode as the underlying block.
TaskTrackers are the slaves which are deployed on each machine. They are
responsible for running the map and reduce tasks as instructed by the JobTracker.
JobHistoryServer is a daemon that serves historical information about completed applications. Typically, the JobHistoryServer can be co-deployed with the JobTracker, but it is recommended to run it as a separate daemon.
3. YARN Component
The acronym YARN stands for “Yet Another Resource Negotiator”.
This Hadoop component manages the cluster's resources so that data processing jobs run effectively.
The data processing jobs (such as MapReduce jobs) run on the Hadoop cluster simultaneously, and each of them needs some resources to complete its task.
These resources include RAM, network bandwidth, and CPU.
To process job requests and manage cluster resources, YARN consists of the following components:
1. Resource Manager
The Resource Manager serves as the master and assigns resources.
2. Node Manager
The Node Manager aids the Resource Manager and is responsible for handling the nodes and resource usage on the nodes.
3. Application Manager
The Application Manager is responsible for requesting containers from the Node Manager. Once the Node Manager gets the resources, it sends them to the Resource Manager.
4. Container
A container is a collection of physical resources that carries out the actual processing of the data.
4. ANALYZING THE DATA WITH HADOOP
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job.
After some local, small-scale testing, we will be able to run it on a cluster of machines.
MapReduce works by breaking the processing into two phases:
1. Map phase
2. Reduce phase
Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.
The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC (National Climatic Data Center) data.
We choose a text input format that gives us each line in the dataset as a text value.
The key is the offset of the beginning of the line from the beginning of the file, but as
we have no need for this, we ignore it.
To visualize the way the map works, consider the following sample lines of input
data (some unused columns have been dropped to fit the page, indicated by
ellipses):
0067011990999991950051507004…9999999N9+00001+99999999999…
0043011990999991950051512004…9999999N9+00221+99999999999…
0043011990999991950051518004…9999999N9–00111+99999999999…
0043012650999991949032412004…0500001N9+01111+99999999999…
0043012650999991949032418004…0500001N9+00781+99999999999…
These lines are presented to the map function as the key-value pairs:
(0,0067011990999991950051507004…9999999N9+00001+99999999999…)
(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)
(212, 0043011990999991950051518004…9999999N9–00111+99999999999…)
(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature from each line, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function.
This processing sorts and groups the key-value pairs by key. So, continuing the example, the reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
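Expressed against Hadoop's Java MapReduce API, the map and reduce functions for this example might look like the sketch below. The class names, the substring offsets used to pull out the year and the temperature, and the missing-value sentinel are illustrative assumptions, not part of these notes:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Map: (line offset, line of NCDC data) -> (year, air temperature)
    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;   // assumed sentinel for a missing reading

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);     // assumed position of the year
            String temp = line.substring(87, 92);     // assumed position of the signed temperature
            int airTemperature =
                    Integer.parseInt(temp.startsWith("+") ? temp.substring(1) : temp);
            if (airTemperature != MISSING) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reduce: (year, [temperatures]) -> (year, maximum temperature)
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());   // pick the maximum reading per year
            }
            context.write(key, new IntWritable(maxValue));
        }
    }
}

A driver class would then set these as the job's mapper and reducer classes and submit the job to the cluster.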
5. SCALING OUT
MapReduce works for large inputs.
To scale out, we need to store the data in a distributed filesystem, typically HDFS, to
allow Hadoop to move the MapReduce computation to each machine hosting a part of
the data.
Data Flow:
A MapReduce job is a unit of work that the client wants to be performed.
It consists of the input data, the MapReduce program, and configuration information.
Hadoop runs the job by dividing it into tasks, of which there are two types:
Map tasks
Reduce tasks
There are two types of nodes that control the job execution process:
1. Job tracker:
The job tracker coordinates the job run.
It is a Java application whose main class is JobTracker.
It coordinates all the jobs run on the system by scheduling tasks to run on task trackers.
2. Task tracker:
Task trackers run tasks and send progress reports to the job tracker, which keeps a record of the overall progress of each job.
If a task fails, the job tracker can reschedule it on a different task tracker.
Input Splits:
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Locality optimization:
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be stored on a single node.
The dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.
The number of reduce tasks is not governed by the size of the input, but is specified independently.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition.
The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
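To illustrate what bucketing keys by a hash function means, the sketch below re-creates the behaviour of the default hash partitioner (Hadoop ships its own HashPartitioner; this class is only for illustration):

import org.apache.hadoop.mapreduce.Partitioner;

// Buckets keys by hash code: the sign bit is masked off so the result is
// non-negative, and the remainder modulo the number of reduce tasks selects
// the partition for the record.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}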
This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle”, as each reduce task is fed by many map tasks.
6. HADOOP STREAMING
Hadoop Streaming is another feature/utility of the Hadoop ecosystem that enables users to execute a MapReduce job with executable scripts as the Mapper and the Reducer.
Hadoop Streaming is often confused with real-time streaming, but it’s simply a utility
that runs an executable script in the MapReduce framework.
The executable script may contain code to perform real-time ingestion of data.
The basic functionality of the Hadoop Streaming utility is that it executes the supplied Mapper and Reducer scripts, creates a MapReduce job, submits the MapReduce job to the cluster, and monitors the job until it completes.
(For comparison, Kafka is Java-based, offers high throughput of data, is scalable and configurable, can work without Hadoop, and supports Java and Python.)
The Mapper reads the input data from the Input Reader/Format in the form of key-value pairs, maps them according to the logic written in the code, and then passes them to the Reduce stream, which performs data aggregation and releases the data to the output.
Syntax:
The general syntax for submitting a MapReduce job using Hadoop Streaming, with a Mapper and Reducer defined in Python, is shown below.
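A typical invocation looks like the following; the location of the streaming jar varies by installation, and the input path and script names here are illustrative (the output path matches the /user/output location mentioned later in this section):

hadoop jar /path/to/hadoop-streaming.jar \
    -input /user/input \
    -output /user/output \
    -mapper mapper.py \
    -reducer reducer.py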
Where:
input = Input location for Mapper from where it can read input
output = Output location for the Reducer to store the output
mapper = The executable file of the Mapper
reducer = The executable file of the Reducer
We use Python to write the Mapper (which reads the input from a file or stdin) and the Reducer (which writes the output to the output location as specified).
The Hadoop Streaming utility creates a MapReduce job, submits the job to the cluster,
and monitors the job until completion.
Depending upon the input file size, the Hadoop Streaming process launches a number of Mapper tasks (one per input split), each as a separate process.
The Mapper task then reads the input (which is in the form of key-value pairs as
converted by the input reader format), applies the logic from the code (which is
basically to map the input), and passes the result to Reducers.
Once all the mapper tasks are successful, the MapReduce job launches the Reducer task
(one, by default, unless otherwise specified) as a separate process.
In the Reducer phase, data aggregation occurs and the Reducer emits the output in the
form of key-value pairs.
import sys

# Word Count Example - reducer: reads "word<TAB>count" lines from the mapper
# on stdin (sorted by key), sums the counts for each word, and prints
# "word<TAB>total".
current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the total for the previous word
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the total for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Once the job is successful, you will see some part files in the /user/output location.
7. DESIGN OF HDFS
8. JAVA INTERFACES TO HDFS BASICS
1. Create FileSystem:
FileSystem fs = FileSystem.get(new Configuration());
If you run it with the yarn command, a DistributedFileSystem (HDFS) instance will be created.
It utilizes the fs.default.name property from the configuration.
Recall that the Hadoop framework loads core-site.xml, which sets this property to HDFS (hdfs://localhost:8020).
2. Open an InputStream for a Path:
FSDataInputStream input = fs.open(fileToRead);
3. Copy bytes from the InputStream to an OutputStream (here stdout) using IOUtils:
IOUtils.copyBytes(input, System.out, 4096, false);
4. Close Stream
finally { IOUtils.closeStream(input); }
Example of ReadFile.java (main method only; imports omitted; the HDFS path is illustrative):

public static void main(String[] args) throws IOException {
    Path fileToRead = new Path("/training/playArea/readme.txt");   // illustrative path
    FileSystem fs = FileSystem.get(new Configuration());           // 1. create FileSystem
    FSDataInputStream input = null;
    try {
        input = fs.open(fileToRead);                                // 2. open an InputStream
        IOUtils.copyBytes(input, System.out, 4096, false);          // 3. copy bytes to stdout
    } finally {
        IOUtils.closeStream(input);                                 // 4. close the stream
    }
}
Output:
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.ReadFile
Hello from readme.txt
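A minimal sketch of a WriteToFile program that would produce the output shown below (the path and the message are taken from that output; everything else is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path toWrite = new Path("/training/playArea/writeMe.txt");   // path shown in the output below
        FSDataOutputStream out = fs.create(toWrite);                 // creates (or overwrites) the file
        out.write("Hello HDFS! Elephants are awesome!\n".getBytes("UTF-8"));
        out.close();                                                 // flush and close the stream
    }
}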
Output:
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.WriteToFile
$ hdfs dfs -cat /training/playArea/writeMe.txt
Hello HDFS! Elephants are awesome!
Append:
Appends to the end of an existing file. Append support is optional for a concrete FileSystem implementation; HDFS supports it.
There is no support for writing in the middle of a file.
Example:
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.append(toHdfs, 4096, new Progressable() {  // toHdfs: Path to an existing file
    @Override
    public void progress() { System.out.print("."); }  // called back as data is written
});
out.write("more data".getBytes());
out.close();
FileSystem: mkdirs
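A minimal sketch of using FileSystem's mkdirs call, with a hypothetical directory path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkDir {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path newDir = new Path("/training/playArea/newDir");   // hypothetical directory
        boolean created = fs.mkdirs(newDir);                    // also creates any missing parent directories
        System.out.println("Created " + newDir + ": " + created);
    }
}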