
Jaipur Engineering College And Research Centre

Subject Name with code: Big Data Analytics (BDA) (7AID4-01)

Presented by: Ms. Ati Garg

Assistant Professor, AI&DS

Branch and SEM - AI&DS/ VII SEM

Department of

Artificial Intelligence and Data Science

UNIT-2

Introduction to MapReduce
Outline

A Weather Dataset

Understanding Hadoop API for Map Reduce Framework

Basic Program of Hadoop Map Reduce

Driver code

Mapper code

Reducer code

Record Reader

Combiner

Partitioner

UNIT 2

INTRODUCTION TO MapReduce


HADOOP FRAMEWORK
Distributed File Systems - Large-Scale File System Organization – HDFS concepts –
MapReduce Execution, Algorithms using MapReduce, Matrix-Vector Multiplication – Hadoop
YARN
2.1 Distributed File Systems
Most computing is done on a single processor, with its main memory, cache, and local disk (a
compute node). In the past, applications that called for parallel processing, such as large
scientific calculations, were done on special-purpose parallel computers with many processors
and specialized hardware. However, the prevalence of large-scale Web services has caused more
and more computing to be done on installations with thousands of compute nodes operating more
or less independently. In these installations, the compute nodes are commodity hardware, which
greatly reduces the cost compared with special-purpose parallel machines. These new computing
facilities have given rise to a new generation of programming systems. These systems take
advantage of the power of parallelism and at the same time avoid the reliability problems that
arise when the computing hardware consists of thousands of independent components, any of
which could fail at any time.
Distributed File System Requirements:
In a DFS, the first requirements are access transparency and location transparency. Further requirements such as performance, scalability, concurrency control, fault tolerance and security emerged and were met in the later phases of DFS development.
 Access transparency: Client programs should be unaware of the distribution of files.
 Location transparency: Client program should see a uniform namespace. Files should
be able to be relocated without changing their path name.
 Mobility transparency: Neither client programs nor system administration tables in the
client nodes should be changed when files are moved, either automatically or by the
system administrator.
 Performance transparency: Client programs should continue to perform well while the
load on the service varies within a specified range.
 Scaling transparency: An increase in the size of storage and the network should be
transparent.

The following are the characteristics of these computing installations and the specialized file
systems that have been developed to take advantage of them.

Physical Organization of Compute Nodes
The new parallel-computing architecture, sometimes called cluster computing, is organized as
follows. Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack
are connected by a network, typically gigabit Ethernet. There can be many racks of compute
nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-
rack communication is somewhat greater than the intra rack Ethernet, but given the number of
pairs of nodes that might need to communicate between racks, this bandwidth may be essential.
However, there may be many more racks and many more compute nodes per rack. It is a
fact of life that components fail, and the more components, such as compute nodes and
interconnection networks, a system has, the more frequently something in the system will not be
working at any given time. Some important calculations take minutes or even hours on thousands
of compute nodes. If we had to abort and restart the computation every time one component
failed, then the computation might never complete successfully.
The solution to this problem takes two forms:
1. Files must be stored redundantly. If we did not duplicate the file at several compute nodes,
then if one node failed, all its files would be unavailable until the node is replaced. If we did not
back up the files at all, and the disk crashes, the files would be lost forever.

2. Computations must be divided into tasks, such that if any one task fails to execute to
completion, it can be restarted without affecting other tasks. This strategy is followed by the
map-reduce programming system

Fig: architecture of a large-scale computing system

2.2 Large-Scale File-System Organization
To exploit cluster computing, files must look and behave somewhat differently from the
conventional file systems found on single computers. This new file system, often called a
distributed file system or DFS (although this term had other meanings in the past), is typically
used as follows.
 Files can be enormous, possibly a terabyte in size. If you have only small files, there is no
point using a DFS for them.
 Files are rarely updated. Rather, they are read as data for some calculation, and possibly
additional data is appended to files from time to time.
 Files are divided into chunks, which are typically 64 megabytes in size. Chunks are
replicated, perhaps three times, at three different compute nodes. Moreover, the nodes
holding copies of one chunk should be located on different racks, so we don’t lose all
copies due to a rack failure. Normally, both the chunk size and the degree of replication
can be decided by the user (see the configuration sketch after this list).
 To find the chunks of a file, there is another small file called the master node or name
node for that file. The master node is itself replicated, and a directory for the file system
as a whole knows where to find its copies. The directory itself can be replicated, and all
participants using the DFS know where the directory copies are.
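As an illustration (this sketch is not part of the original notes), a Hadoop client can choose the chunk (block) size and replication factor through configuration properties; dfs.blocksize and dfs.replication are standard HDFS settings, and the cluster-wide defaults normally live in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // request 128 MB blocks and 3 replicas for files written with this configuration
        conf.set("dfs.blocksize", "134217728");   // 128 * 1024 * 1024 bytes
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size:  " + fs.getDefaultBlockSize(new Path("/")));
        System.out.println("Default replication: " + fs.getDefaultReplication(new Path("/")));
    }
}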

A Weather Dataset


MapReduce is a programming model for data processing. The model is simple, yet not too
simple to express useful programs in. Hadoop can run MapReduce programs written in various
languages; in this chapter, we shall look at the same program expressed in Java, Ruby, Python,
and C++. Most important, MapReduce programs are inherently parallel, thus putting very large-
scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce
comes into its own for large datasets, so let’s start by looking at one.

For our example, we will write a program that mines weather data. Weather sensors collecting
data every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with MapReduce, since it is semi structured and record-oriented.

Data Format
The data we will use is from the National Climatic Data Center (NCDC,
http://www.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which
each line is a record. The format supports a rich set of meteorological elements, many of which
are optional or with variable data lengths. For simplicity, we shall focus on the basic elements,
such as temperature, which are always present and are of fixed width. Example shows a sample
line with some of the salient fields highlighted. The line has been split into multiple lines to show
each field: in the real file, fields are packed into one line with no delimiters.

Analyzing the Data with UNIX Tools


What’s the highest recorded global temperature for each year in the dataset?
We will answer this first without using Hadoop, as this information will provide a
performance baseline, as well as a useful means to check our results. The
classic tool for processing line-oriented data is awk. Example is a small script to
calculate the maximum temperature for each year.

The script loops through the compressed year files, first printing the year, and
then processing each file using awk. The awk script extracts two fields from the
data: the air temperature and the quality code. The air temperature value is
turned into an integer by adding 0. Next, a test is applied to see if the
temperature is valid (the value 9999 signifies a missing value in the NCDC
dataset) and if the quality code indicates that the reading is not suspect or
erroneous. If the reading is OK, the value is compared with the maximum value
seen so far, which is updated if a new maximum is found. The END block is
executed after all the lines in the file have been processed, and it prints the
maximum value.

The temperature values in the source file are scaled by a factor of 10, so this
works out as a maximum temperature of 31.7°C for 1901 (there were very few
readings at the beginning of the century, so this is plausible). The complete run
for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large
instance.
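For comparison, the sequential analysis described above can also be sketched in plain Java (this sketch is not from the original notes): it assumes the standard NCDC fixed-width layout (year in characters 16–19, air temperature in characters 88–92 including the sign, quality code in character 93) and reads uncompressed records from standard input.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class MaxTemperatureSequential {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> maxByYear = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String year = line.substring(15, 19);                  // assumed NCDC offsets
            int airTemp = Integer.parseInt(line.substring(87, 92).trim());
            String quality = line.substring(92, 93);
            // 9999 signifies a missing value; quality codes 0, 1, 4, 5, 9 are acceptable
            if (airTemp != 9999 && quality.matches("[01459]")) {
                Integer max = maxByYear.get(year);
                if (max == null || airTemp > max) {
                    maxByYear.put(year, airTemp);
                }
            }
        }
        // temperatures are scaled by a factor of 10, e.g. 317 means 31.7 degrees Celsius
        for (Map.Entry<String, Integer> e : maxByYear.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}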

To speed up the processing, we need to run parts of the program in parallel. In


theory, this is straightforward: we could process different years in different
processes, using all the available hardware threads on a machine. There are a
few problems with this, however.

First, dividing the work into equal-size pieces isn’t always easy or obvious. In
this case, the file size for different years varies widely, so some processes will
finish much earlier than others. Even if they pick up further work, the whole run
is dominated by the longest file. A better approach, although one that requires
more work, is to split the input into fixed-size chunks and assign each chunk to a
process.

Second, combining the results from independent processes may need further
processing. In this case, the result for each year is independent of other years
and may be combined by concatenating all the results, and sorting by year. If
using the fixed-size chunk approach, the combination is more delicate. For this
example, data for a particular year will typically be split into several chunks,
each processed independently. We’ll end up with the maximum temperature for
each chunk, so the final step is to look for the highest of these maximums, for
each year.

Third, you are still limited by the processing capacity of a single machine. If the
best time you can achieve is 20 minutes with the number of processors you
have, then that’s it. You can’t make it go faster. Also, some datasets grow
beyond the capacity of a single machine. When we start using multiple
machines, a whole host of other factors come into play, mainly falling in the
category of coordination and reliability. Who runs the overall job? How do we
deal with failed processes?

So, though it’s feasible to parallelize the processing, in practice it’s messy. Using
a framework like Hadoop to take care of these issues is a great help.

A Weather Dataset

A weather dataset of this kind provides short-term (for example, 3-day) forecasts of temperature, precipitation and wind. Forecasts have to be provided for several regions in the country. Short-term weather forecasts are relevant for the general public to plan activities, while also needing to be reliable. Typical elements include:

 Temperature extremes
 Temperature average
 Wind speed
 Wind direction
 Precipitation Amount
 Precipitation Probability
 Forecast for current day and four following days

How Map Reduce Works

Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the
input data-set into independent chunks which are processed by the map tasks in a completely
parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The
framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running
on the same set of nodes. This configuration allows the framework to effectively schedule tasks
on the nodes where data is already present, resulting in very high aggregate bandwidth across the
cluster.

The MapReduce framework consists of a single master Job Tracker and one slave Task
Tracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on
the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as
directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions
via implementations of appropriate interfaces and/or abstract-classes. These, and other job
parameters, comprise the job configuration. The Hadoop job client then submits the job
(jar/executable etc.) and configuration to the Job Tracker which then assumes the responsibility
of distributing the software/configuration to the slaves, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java, MapReduce applications need not be
written in Java.

 Hadoop Streaming is a utility which allows users to create and run jobs with any executables
(e.g. shell utilities) as the mapper and/or the reducer.
 Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI
based).

Inputs and Outputs


The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the Writable
Comparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job:

(Input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
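As a concrete illustration (added here for clarity, not part of the original notes), the exception-count example later in this unit uses types along these lines:

(Input) <LongWritable offset, Text log line> -> map -> <Text exception, IntWritable 1> -> combine -> <Text exception, IntWritable partial count> -> reduce -> <Text exception, IntWritable total count> (output)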

MapReduce - User Interfaces


This section provides a reasonable amount of detail on every user-facing aspect of the
MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the Javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial.

Let us first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods. We will then discuss other core interfaces including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter and others.

Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.

Payload
Applications typically implement the Mapper and Reducer interfaces to provide
the map and reduce methods. These form the core of the job.

Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input records. A given
input pair may map to zero or many output pairs.

The Hadoop MapReduce framework spawns one map task for each Input Split generated by
the Input Format for the job.

Overall, Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup.

Output pairs do not need to be of the same types as input pairs. A given input pair may map to
zero or many output pairs. Output pairs are collected with calls to OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages and
update Counters, or just indicate that they are alive.

All intermediate values associated with a given output key are subsequently grouped by the
framework, and passed to the Reducer(s) to determine the final output. Users can control the
grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions
is the same as the number of reduce tasks for the job. Users can control which keys (and hence
records) go to which Reducer by implementing a custom Partitioner.

Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform
local aggregation of the intermediate outputs, which helps to cut down the amount of data
transferred from the Mapper to the Reducer.

The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the JobConf.
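To make this concrete, here is a minimal sketch of an old-API (org.apache.hadoop.mapred) Mapper for the weather example discussed earlier; it assumes the NCDC fixed-width offsets and is an illustration rather than code taken from the original notes.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);                        // assumed NCDC offsets
        int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            // one intermediate (year, temperature) pair per valid record
            output.collect(new Text(year), new IntWritable(airTemperature));
        }
    }
}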

How Many Maps?


The number of maps is usually driven by the total size of the inputs, that is, the total number of
blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10 TB of input data and have a block size of 128 MB, you will end up with about 82,000 maps (10 TB / 128 MB = 81,920 splits), unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.

Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.

The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).

Overall, Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup.
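A matching old-API Reducer sketch for the same weather example (again an illustration, not code from the original notes) receives each year together with all of its temperatures and keeps the maximum:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int maxValue = Integer.MIN_VALUE;
        // all values for one year arrive grouped together; keep the largest
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}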

Reducer has 3 primary phases: shuffle, sort and reduce.

Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the
relevant partition of the output of all the mappers, via HTTP.

Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate a secondary sort on values.

Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.

The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages and
update Counters, or just indicate that they are alive.

The output of the Reducer is not sorted.

How Many Reduces?


The right number of reduces seems to be 0.95 or 1.75 multiplied by (number of nodes * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but it also improves load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.

It is also legal to set the number of reduces to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

Partitioner
Partitioner partitions the key space. Partitioner controls the partitioning of the keys of the
intermediate map-outputs. The key (or a subset of the key) is used to derive the partition,
typically by a hash function. The total number of partitions is the same as the number of reduce
tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and
hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.

Reporter
Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

Mapper and Reducer implementations can use the Reporter to report progress or just indicate that
they are alive. In scenarios where the application takes a significant amount of time to process
individual key/value pairs, this is crucial since the framework might assume that the task has
timed-out and kill that task. Another way to avoid this is to set the configuration
parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs).

Output Collector – OutputCollector is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

Map

 The input data is first split into smaller blocks. Each block is then assigned to a mapper
for processing.
 For example, if a file has 100 records to be processed, 100 mappers can run together to
process one record each. Or maybe 50 mappers can run together to process two records
each. The Hadoop framework decides how many mappers to use, based on the size of the
data to be processed and the memory block available on each mapper server.

Reduce

 After all the mappers complete processing, the framework shuffles and sorts the results
before passing them on to the reducers. A reducer cannot start while a mapper is still in
progress. All the map output values that have the same key are assigned to a single
reducer, which then aggregates the values for that key.

Combine and Partition

 There are two intermediate steps between Map and Reduce.

 Combine is an optional process. The combiner is a reducer that runs individually on each
mapper server. It reduces the data on each mapper further to a simplified form before
passing it downstream.
 This makes shuffling and sorting easier as there is less data to work with. Often, the
combiner class is set to the reducer class itself, because the reduce function is
commutative and associative. However, if needed, the combiner can be a separate
class as well.

Partition is the process that translates the <key, value> pairs resulting from mappers to another
set of <key, value> pairs to feed into the reducer. It decides how the data has to be presented to
the reducer and also assigns it to a particular reducer.

The default partitioner computes a hash value for the key produced by the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer.
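The sketch below mirrors the behaviour of the default hash-based partitioner described above; it is written out only as an illustration (the partition is the key's hash, made non-negative, modulo the number of reduce tasks).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashLikePartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // a pure hash partitioner needs no configuration
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative,
        // then take the remainder by the number of reduce tasks
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}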

A MapReduce Example
Consider an ecommerce system that receives a million requests every day to process payments.
There may be several exceptions thrown during these requests such as “payment declined by a
payment gateway,” “out of inventory,” and “invalid address.” A developer wants to analyze last
four days’ logs to understand which exception is thrown how many times.

Example Use Case


The objective is to isolate use cases that are most prone to errors, and to take appropriate action.
For example, if the same payment gateway is frequently throwing an exception, is it because of
an unreliable service or a badly written interface? If the “out of inventory” exception is thrown
often, does it mean the inventory calculation service has to be improved, or does the inventory
stocks need to be increased for certain products?
The developer can ask relevant questions and determine the right course of action. To perform
this analysis on logs that are bulky, with millions of records, MapReduce is an apt programming
model. Multiple mappers can process these logs simultaneously: one mapper could process a
day’s log or a subset of it based on the log size and the memory block available for processing in
the mapper server.

Map
For simplification, let’s assume that the Hadoop framework runs just four mappers. Mapper 1,
Mapper 2, Mapper 3, and Mapper 4.
The value input to the mapper is one record of the log file. The key could be a text string such as
“file name + line number.” The mapper, then, processes each record of the log file to produce
key value pairs. Here, we will just use filler for the value as ‘1.’ The output from the mappers
looks like this:

Mapper 1 -> <Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>,
<Exception A, 1>
Mapper 2 -> <Exception B, 1>, <Exception B, 1>, <Exception A, 1>, <Exception A, 1>
Mapper 3 -> <Exception A, 1>, <Exception C, 1>, <Exception A, 1>, <Exception B, 1>,
<Exception A, 1>
Mapper 4 -> <Exception B, 1>, <Exception C, 1>, <Exception C, 1>, <Exception A, 1>
Assuming that there is a combiner running on each mapper—Combiner 1 … Combiner 4—that
calculates the count of each exception (which is the same function as the reducer), the input to
Combiner 1 will be:
<Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>, <Exception A, 1>

Combine
The output of Combiner 1 will be:
<Exception A, 3>, <Exception B, 1>, <Exception C, 1>
The output from the other combiners will be:
Combiner 2: <Exception A, 2> <Exception B, 2>
Combiner 3: <Exception A, 3> <Exception B, 1> <Exception C, 1>
Combiner 4: <Exception A, 1> <Exception B, 1> <Exception C, 2>

Partition
After this, the partitioner allocates the data from the combiners to the reducers. The data is also
sorted for the reducer.
The input to the reducers will be as below:
Reducer 1: <Exception A> {3,2,3,1}
Reducer 2: <Exception B> {1,2,1,1}
Reducer 3: <Exception C> {1,1,2}
If there were no combiners involved, the input to the reducers will be as below:
Reducer 1: <Exception A> {1,1,1,1,1,1,1,1,1}
Reducer 2: <Exception B> {1,1,1,1,1}
Reducer 3: <Exception C> {1,1,1,1}

Here, the example is a simple one, but when there are terabytes of data involved, the combiner's reduction in the amount of data transferred over the network is significant.

Reduce
Now, each reducer just calculates the total count of the exceptions as:
Reducer 1: <Exception A, 9>
Reducer 2: <Exception B, 5>
Reducer 3: <Exception C, 4>
The data shows that Exception A is thrown more often than others and requires more attention.
When there are more than a few weeks’ or months’ of data to be processed together, the potential
of the MapReduce program can be truly exploited.

How to Implement MapReduce


MapReduce programs are not just restricted to Java. They can also be written in C, C++, Python,
Ruby, Perl, etc. Here is what the main function of a typical MapReduce job looks like:
public static void main(String[] args) throws Exception {

    JobConf conf = new JobConf(ExceptionCount.class);

    conf.setJobName("exception count");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setCombinerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
The parameters—MapReduce class name, Map, Reduce and Combiner classes, input and output types, input and output file paths—are all defined in the main function. The Mapper class extends MapReduceBase and implements the Mapper interface. The Reducer class extends MapReduceBase and implements the Reducer interface.
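For completeness, hedged sketches of the Map and Reduce classes referenced by this driver might look like the following; the log-line layout (exception name as the first tab-separated field) is an assumption made for illustration, since the original notes do not show these classes.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ExceptionCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // assumed log layout: "<exception name>\t<rest of the message>"
            String[] fields = value.toString().split("\t", 2);
            if (fields.length > 0 && !fields[0].isEmpty()) {
                output.collect(new Text(fields[0]), ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int count = 0;
            while (values.hasNext()) {
                count += values.next().get();   // sum the 1s; also usable as the combiner
            }
            output.collect(key, new IntWritable(count));
        }
    }
}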

Map reduce problem
 Specifically, for MapReduce, Talend Studio makes it easier to create jobs that can run on
the Hadoop cluster and to set parameters such as the mapper and reducer class and the
input and output formats.
 Once such a job is built (as a Hadoop job), it can be deployed as a service, executable, or
stand-alone job that runs natively on the big data cluster. It spawns one or more Hadoop
MapReduce jobs that, in turn, execute the MapReduce algorithm.
 Before running a MapReduce job, the Hadoop connection needs to be configured. For
more details on how to use Talend for setting up MapReduce jobs, refer to these tutorials.
 Leveraging Map Reduce To Solve Big Data Problems
 The Map Reduce programming paradigm can be used with any complex problem that can
be solved through parallelization.
 A social media site could use it to determine how many new sign-ups it received over the
past month from different countries, to gauge its increasing popularity among different
geographies. A trading firm could perform its batch reconciliations faster and also
determine which scenarios often cause trades to break. Search engines could determine
page views, and marketers could perform sentiment analysis using Map Reduce.
 To learn more about Map Reduce and experiment with use cases like the ones listed
above, download a trial version of Talend Studio today.

Understanding the MapReduce Job Life Cycle

This section briefly sketches the life cycle of a MapReduce job and the roles of the primary
actors in the life cycle. The full life cycle is much more complex. For details, refer to the
documentation for your Hadoop distribution or the Apache Hadoop MapReduce documentation.

Though other configurations are possible, a common Hadoop cluster configuration is a single
master node where the Job Tracker runs, and multiple worker nodes, each running a Task
Tracker. The Job Tracker node can also be a worker node.

When the user submits a MapReduce job to Hadoop:

1. The local Job Client prepares the job for submission and hands it off to the Job Tracker.
2. The Job Tracker schedules the job and distributes the map work among the Task Trackers for
parallel processing.
3. Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the
Task Trackers.
4. As map results become available, the Job Tracker distributes the reduce work among the Task
Trackers for parallel processing.
5. Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives
progress information from the Task Trackers.

All map tasks do not have to complete before reduce tasks begin running. Reduce tasks can
begin as soon as map tasks begin completing. Thus, the map and reduce steps often overlap.

Job Client

The Job Client prepares a job for execution. When you submit a MapReduce job to Hadoop, the
local Job Client:

1. Validates the job configuration.


2. Generates the input splits. See How Hadoop Partitions Map Input Data.
3. Copies the job resources (configuration, job JAR file, input splits) to a shared location, such as
an HDFS directory, where it is accessible to the Job Tracker and Task Trackers.
4. Submits the job to the Job Tracker.

Job Tracker

The Job Tracker is responsible for scheduling jobs, dividing a job into map and reduce tasks,
distributing map and reduce tasks among worker nodes, task failure recovery, and tracking the
job status. Job scheduling and failure recovery are not discussed here; see the documentation for
your Hadoop distribution or the Apache Hadoop MapReduce documentation.

When preparing to run a job, the Job Tracker:

1. Fetches input splits from the shared location where the Job Client placed the information.
2. Creates a map task for each split.
3. Assigns each map task to a Task Tracker (worker node). The Job Tracker monitors the health of
the Task Trackers and the progress of the job. As map tasks complete and results become
available, the Job Tracker:

1. Creates reduce tasks up to the maximum enabled by the job configuration.


2. Assigns each map result partition to a reduce task.
3. Assigns each reduce task to a Task Tracker.

A job is complete when all map and reduce tasks successfully complete, or, if there is no reduce
step, when all map tasks successfully complete.

Task Tracker

A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker.
Often, the Task Tracker runs on the associated worker node, but it is not required to be on the
same host.

1. Fetches job resources locally.


2. Spawns a child JVM on the worker node to execute the map or reduce task.
3. Reports status to the Job Tracker.

The task spawned by the Task Tracker runs the job's map or reduce functions.

Map Task

The Hadoop MapReduce framework creates a map task to process each input split. The map
task:

1. Uses the Input Format to fetch the input data locally and create input key-value pairs.
2. Applies the job-supplied map function to each key-value pair.
3. Performs local sorting and aggregation of the results.
4. If the job includes a Combiner, runs the Combiner for further aggregation.
5. Stores the results locally, in memory and on the local file system.
6. Communicates progress and status to the Task Tracker.

Map task results undergo a local sort by key to prepare the data for consumption by reduce tasks.
If a Combiner is configured for the job, it also runs in the map task. A Combiner consolidates the
data in an application-specific way, reducing the amount of data that must be transferred to
reduce tasks. For example, a Combiner might compute a local maximum value for a key and
discard the rest of the values. The details of how map tasks manage, sort, and shuffle results are
not covered here. See the documentation for your Hadoop distribution or the Apache Hadoop
MapReduce documentation.

When a map task notifies the Task Tracker of completion, the Task Tracker notifies the Job
Tracker. The Job Tracker then makes the results available to reduce tasks.

Reduce Task

The reduce phase aggregates the results from the map phase into final results. Usually, the final
result set is smaller than the input set, but this is application dependent. The reduction is carried
out by parallel reduce tasks. The reduce input keys and values need not have the same type as the
output keys and values. The reduce phase is optional. You may configure a job to stop after the
map phase completes. For details, see Configuring a Map-Only Job.

Reduce is carried out in three phases, copy, sort, and merge. A reduce task:

1. Fetches job resources locally.


2. Enters the copy phase to fetch local copies of all the assigned map results from the map worker
nodes.
3. When the copy phase completes, executes the sort phase to merge the copied results into a single
sorted set of (key, value-list) pairs.
4. When the sort phase completes, executes the reduce phase, invoking the job-supplied reduce
function on each (key, value-list) pair.
5. Saves the final results to the output destination, such as HDFS.

The input to a reduce function is key-value pairs where the value is a list of values sharing the
same key. For example, if one map task produces a key-value pair (eat, 2) and another map task
produces the pair (eat, 1), then these pairs are consolidated into (eat, (2, 1)) for input to the
reduce function. If the purpose of the reduce phase is to compute a sum of all the values for each
key, then the final output key-value pair for this input is (eat, 3). For a more complete example,
see Example: Calculating Word Occurrences.

Output from the reduce phase is saved to the destination configured for the job, such as HDFS or
Mark Logic Server. Reduce tasks use an Output Format subclass to record results. The Hadoop
API provides Output Format subclasses for using HDFS as the output destination. The Mark
Logic Connector for Hadoop provides Output Format subclasses for using a Mark Logic Server
database as the destination. For a list of available subclasses, see Output Format Subclasses. The
connector also provides classes for defining key and value types; see Mark Logic-Specific Key
and Value Types.

Features of map reduce

1. Scalability

Apache Hadoop is a highly scalable framework. This is because of its ability to store and distribute huge data sets across plenty of servers. These servers are inexpensive and can operate in parallel. We can easily scale the storage and computation power by adding servers to the cluster. Hadoop MapReduce programming enables organizations to run applications on large sets of nodes, which can involve the use of thousands of terabytes of data.

2. Flexibility

MapReduce programming enables companies to access new sources of data. It enables companies to operate on different types of data. It allows enterprises to access structured as well as unstructured data, and to derive significant value by gaining insights from multiple sources of data. Additionally, the MapReduce framework provides support for multiple languages and for data from sources ranging from email and social media to clickstreams. MapReduce processes data as simple key-value pairs and thus supports data types including metadata, images, and large files. Hence, MapReduce is more flexible in dealing with such data than a traditional DBMS.

3. Security and Authentication

The MapReduce programming model uses the HBase and HDFS security platforms, which allow only authenticated users to operate on the data. Thus, it protects against unauthorized access to system data and enhances system security.

4. Cost-effective solution

Hadoop’s scalable architecture with the MapReduce programming framework allows the storage
and processing of large data sets in a very affordable manner.

5. Fast

Hadoop uses a distributed storage method called the Hadoop Distributed File System, which basically implements a mapping system for locating data in a cluster. The tools that are used for data processing, such as MapReduce programs, are generally located on the very same servers, which allows for faster processing of data.

So, even if we are dealing with large volumes of unstructured data, Hadoop MapReduce takes just minutes to process terabytes of data, and it can process petabytes of data in just an hour.

6. Simple model of programming

Amongst the various features of Hadoop MapReduce, one of the most important features is that
it is based on a simple programming model. Basically, this allows programmers to develop the
MapReduce programs which can handle tasks easily and efficiently.

The MapReduce programs can be written in Java, which is not very hard to pick up and is also
used widely. So, anyone can easily learn and write MapReduce programs and meet their data
processing needs.

7. Parallel Programming

One of the major aspects of the working of MapReduce programming is its parallel processing. It
divides the tasks in a manner that allows their execution in parallel.
The parallel processing allows multiple processors to execute these divided tasks. So the entire
program is run in less time.

8. Availability and resilient nature

Whenever the data is sent to an individual node, the same set of data is forwarded to some other
nodes in a cluster. So, if any particular node suffers from a failure, then there are always other
copies present on other nodes that can still be accessed whenever needed. This assures high
availability of data.

One of the major features offered by Apache Hadoop is its fault tolerance. The Hadoop MapReduce framework has the ability to quickly recognize faults that occur and then apply a quick, automatic recovery solution. This feature makes it a game-changer in the world of big data processing.

Basic program map reduce

The data set that forms the basis of our programming exercise contains a total of 10 fields of information in each line. Our programming objective uses only the first and fourth fields, which are arbitrarily called "year" and "delta" respectively. We will ignore all the other fields of data. We will need to define a Mapper class, a Reducer class and a Driver class.

How many Map task in Hadoop?

The number of map tasks depends on the total number of blocks of the input files. In MapReduce, the right level of parallelism seems to be around 10-100 maps per node, although it has been set as high as 300 maps for CPU-light map tasks.

For example, if we have a block size of 128 MB and we expect 10 TB of input data, this produces about 82,000 maps.

Hence the number of maps depends on the Input Format.

Mappers = (total data size) / (input split size)

Example – data size is 1 TB, input split size is 100 MB:

Mappers = (1,000,000 MB) / (100 MB) = 10,000

Input to mapper class

Output of the mapper class

Design mapper class
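Since the original screenshot of the mapper is not reproduced here, the following is a plausible sketch (new API, with assumed whitespace-separated fields) that emits the first field ("year") as the key and the fourth field ("delta") as a Text value:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DeltaMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // only the 1st ("year") and 4th ("delta") of the 10 fields are used
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length >= 4) {
            context.write(new Text(fields[0]), new Text(fields[3]));
        }
    }
}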

Input reducer class

Output reducer class

Design reducer class
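Again, as the reducer screenshot is not reproduced, here is a hedged sketch; it assumes the job computes the maximum delta per year and emits it as a FloatWritable, matching the output types discussed in the driver section below:

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DeltaReducer extends Reducer<Text, Text, Text, FloatWritable> {

    @Override
    protected void reduce(Text year, Iterable<Text> deltas, Context context)
            throws IOException, InterruptedException {
        float max = Float.NEGATIVE_INFINITY;
        boolean seen = false;
        for (Text delta : deltas) {
            try {
                max = Math.max(max, Float.parseFloat(delta.toString()));
                seen = true;
            } catch (NumberFormatException e) {
                // skip malformed delta values
            }
        }
        if (seen) {
            context.write(year, new FloatWritable(max));
        }
    }
}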

Driver implemented 1

Driver implements 2
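The driver screenshots are likewise summarised here by a hedged sketch that matches the description that follows: job output types Text and FloatWritable, an explicit map output value type of Text, a synchronous launch via waitForCompletion(true), and a main() method that delegates to ToolRunner.run().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeltaDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "max delta per year");
        job.setJarByClass(DeltaDriver.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(DeltaMapper.class);
        job.setReducerClass(DeltaReducer.class);

        // job-level (reducer) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        // the mapper's output value type differs, so it must be set explicitly
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // true = verbose progress on the terminal; blocks until the job finishes
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new DeltaDriver(), args));
    }
}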

In the Driver class, we also define the types for the output key and value in the job as Text and FloatWritable respectively. If the mapper and reducer classes do NOT use the same output key and value types, we must specify them for the mapper. In this case, the output value type of the mapper is Text, while the output value type of the reducer is FloatWritable.

There are two ways to launch the job – synchronously and asynchronously. The job.waitForCompletion() call launches the job synchronously: the driver code will block, waiting for the job to complete at this line. The true argument informs the framework to write verbose output to the controlling terminal of the job.

The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job. We then call the static ToolRunner.run() method.

Build and Execute a Simple Map Reduce Program

You have to compile the three classes and place the compiled classes into a directory called
“classes”. Use the jar command to put the mapper and reducer classes into a jar file the path
to which is included in the class path when you build the driver. After you build the driver,
the driver class is also added to the existing jar file.

Make sure that you delete the reduce output directory before you execute the Map Reduce
program.

Build jar

Launch the hadoop job

Examine output

Driver code

In the driver class, we set the configuration of our MapReduce job to run in Hadoop.

We specify the name of the job, the data type of input/output of the mapper and reducer.

We also specify the names of the mapper and reducer classes.

The path of the input and output folder is also specified.

The method setInputFormatClass() is used for specifying how a Mapper will read the input data, or what the unit of work will be. Here, we have chosen TextInputFormat so that a single line is read by the mapper at a time from the input text file.

The main () method is the entry point for the driver. In this method, we instantiate a new
Configuration object for the job.

Learn the Concept of Key-Value Pair in Hadoop MapReduce


1. Objective

In this MapReduce tutorial, we are going to learn the concept of a key-value pair in Hadoop. The key-value pair is the record entity that a MapReduce job receives for execution. By default, the RecordReader uses TextInputFormat for converting data into key-value pairs. Here we will learn what a key-value pair in MapReduce is, how key-value pairs are generated in Hadoop using the Input Split and RecordReader, and on what basis the generation of key-value pairs in Hadoop MapReduce takes place. We will also see a Hadoop key-value pair example in this tutorial.



What is a key-value pair in Hadoop?

Apache Hadoop is used mainly for data analysis. We look at statistical and logical techniques in data analysis to describe, illustrate and evaluate data. Hadoop deals with structured, unstructured and semi-structured data. In Hadoop, when the schema is static we can directly work on the columns instead of keys and values, but when the schema is not static, we work on keys and values. Keys and values are not intrinsic properties of the data; they are chosen by the user analyzing the data.
MapReduce is the core component of Hadoop which provides data processing. Hadoop
MapReduce is a software framework for easily writing an application that processes the vast
amount of structured and unstructured data stored in the Hadoop Distributed File System
(HDFS). MapReduce works by breaking the processing into two phases: Map phase and Reduce
phase. Each phase has key-value as input and output. Learn more about how data flows in
Hadoop MapReduce?
3. Key-value pair Generation in MapReduce

Let us now learn how key-value pairs are generated in Hadoop MapReduce. In the MapReduce process, before passing the data to the mapper, the data must first be converted into key-value pairs, as the mapper only understands key-value pairs of data.

Key-value pairs in Hadoop MapReduce are generated as follows:


 Input Split – It is the logical representation of data. The data to be processed by an
individual Mapper is presented by the Input Split. Learn MapReduce Input Split in detail.
 RecordReader – It communicates with the Input Split and converts the split into records, which are in the form of key-value pairs suitable for reading by the mapper. By default, the RecordReader uses TextInputFormat for converting data into key-value pairs. The RecordReader communicates with the Input Split until the reading of the file is completed. Learn more about the MapReduce record reader in detail.
In MapReduce, map function processes a certain key-value pair and emits a certain number of
key-value pairs and the Reduce function processes values grouped by the same key and emits
another set of key-value pairs as output. The output types of the Map should match the input
types of the Reduce as shown below:

 Map: (K1, V1) -> list (K2, V2)

 Reduce: (K2, list(V2)) -> list (K3, V3)
4. On what basis is a key-value pair generated in Hadoop?

Generation of a key-value pair in Hadoop depends on the data set and the required output. In general, the key-value pair is specified in 4 places: Map input, Map output, Reduce input and Reduce output.

a. Map Input

Map-input by default will take the line offset as the key and the content of the line will be the
value as Text. By using custom Input Format we can modify them.
b. Map Output
Map basic responsibility is to filter the data and provide the environment for grouping of data
based on the key.

 Key – It will be the field/ text/ object on which the data has to be grouped and aggregated on
the reducer side.
 Value – It will be the field/ text/ object which is to be handled by each individual reduce
method.
c. Reduce Input
The output of Map is the input for reduce, so it is same as Map-Output.

d. Reduce Output
It depends on the required output.

MapReduce key-value pair Example

Suppose the content of the file which is stored in HDFS is "John is Mark Joey is John". Using the Input Format, we define how this file will be split and read. By default, the RecordReader uses TextInputFormat to convert this file into key-value pairs.
 Key – It is the offset of the beginning of the line within the file.
 Value – It is the content of the line, excluding line terminators.
From the above content of the file:

 Key is 0
 Value is "John is Mark Joey is John".
Hadoop Input Format, Types of Input Format in MapReduce
1. Objective

Hadoop Input Format checks the input specification of the job. The Input Format splits the input file into Input Splits and assigns each split to an individual Mapper. In this Hadoop Input Format tutorial, we will learn what Input Format in Hadoop MapReduce is, the different methods to get the data to the mapper, and the different types of Input Format in Hadoop, such as File Input Format, Text Input Format, Key Value Text Input Format, etc. We will also see what the default Input Format in Hadoop is.


What is Hadoop Input Format?

How the input files are split up and read in Hadoop is defined by the Input Format. A Hadoop
Input Format is the first component in Map-Reduce; it is responsible for creating the input splits
and dividing them into records. If you are not familiar with the MapReduce job flow, follow our Hadoop MapReduce data flow tutorial for more understanding.
Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. Although the format of these files is arbitrary, line-based log files and binary formats can be used. Using the Input Format we define how these input files are split and read. The Input Format class is one of the fundamental classes in the Hadoop MapReduce framework, which provides the following functionality:
 The files or other objects that should be used for input are selected by the Input Format.
 Input Format defines the Data splits, which defines both the size of individual Map
tasks and its potential execution server.
 Input Format defines the RecordReader, which is responsible for reading actual records
from the input files.
How do we get the data to the mapper?

We have two methods to get the data to the mapper in MapReduce: getSplits() and createRecordReader(), as shown below:

public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}

Types of Input Format in MapReduce


Let us now see what are the types of Input Format in Hadoop?

File Input Format in Hadoop

It is the base class for all file-based Input Formats. Hadoop File Input Format specifies input
directory where data files are located. When we start a Hadoop job, File Input Format is provided
with a path containing files to read. File Input Format will read all files and divides these files
into one or more Input Splits.

Text Input Format

It is the default Input Format of MapReduce. Text Input Format treats each line of each input file
as a separate record and performs no parsing. This is useful for unformatted data or line-based
records like log files.

 Key – It is the byte offset of the beginning of the line within the file (not whole file just one
split), so it will be unique if combined with the file name.
 Value – It is the contents of the line, excluding line terminators.
Key Value Text Input Format

It is similar to Text Input Format as it also treats each line of input as a separate record. While Text Input Format treats the entire line as the value, Key Value Text Input Format breaks the line itself into a key and a value separated by a tab character ('\t'). Here the key is everything up to the tab character, while the value is the remaining part of the line after the tab character.

Sequence File Input Format

Hadoop Sequence File Input Format is an Input Format which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Here both the key and the value are user-defined.

Sequence File As Text Input Format

Hadoop Sequence File as Text Input Format is another form of Sequence File Input Format which converts the sequence file keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This Input Format makes sequence files suitable input for streaming.
Sequence File as Binary Input Format

Hadoop Sequence File as Binary Input Format is a Sequence File Input Format using which
we can extract the sequence file’s keys and values as an opaque binary object.

N Line Input Format

Hadoop N Line Input Format is another form of Text Input Format where the keys are byte
offset of the line and values are contents of the line. Each mapper receives a variable number of
lines of input with Text Input Format and Key Value Text Input Format and the number depends
on the size of the split and the length of the lines. And if we want our mapper to receive a fixed
number of lines of input, then we use N Line Input Format.
N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives exactly one line of input. If N=2, then each split contains two lines. One mapper will receive the first two key-value pairs and another mapper will receive the second two key-value pairs.
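As an illustration (not from the original notes), N can be set on a new-API job roughly as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSetupExample {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline example");
        job.setInputFormatClass(NLineInputFormat.class);
        // each mapper receives exactly two input lines (N = 2)
        NLineInputFormat.setNumLinesPerSplit(job, 2);
        return job;
    }
}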

DB Input Format

Hadoop DB Input Format is an Input Format that reads data from a relational database using JDBC. As it doesn't have partitioning capabilities, we need to be careful not to swamp the database by reading it with too many mappers. So it is best for loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the key is LongWritable while the value is DBWritable.

Input Split in Hadoop MapReduce – Hadoop MapReduce Tutorial

In this Hadoop MapReduce tutorial, we will provide a detailed description of Input Split in Hadoop. In this blog, we will try to answer what a Hadoop Input Split is, why Input Split is needed in MapReduce, how Hadoop performs Input Split, and how to change the split size in Hadoop. We will also learn the difference between Input Split and blocks in HDFS.

What is Input Split in Hadoop?

Input Split in Hadoop MapReduce is the logical representation of data. It describes a unit of
work that contains a single map task in a MapReduce program.
Hadoop Input Split represents the data which is processed by an individual Mapper. The split is
divided into records. Hence, the mapper process each record (which is a key-value pair).
MapReduce Input Split length is measured in bytes and every Input Split has storage locations
(hostname strings). The MapReduce system uses the storage locations to place map tasks as close to the split's data as possible. Map tasks are processed in the order of the size of the splits so that the largest one gets processed first (a greedy approximation algorithm); this is done to minimize the job runtime. The important thing to notice is that the Input Split does not contain the input data; it is just a reference to the data.
As a user, we don’t need to deal with Input Split directly; because they are created by an Input
Format (Input Format creates the Input split and divides into records). File Input Format, by
default, breaks a file into 128MB chunks (same as blocks in HDFS) and by
setting mapred.min.split.size parameter in mapred-site.xml we can control this value or by
overriding the parameter in the Job object used to submit a particular MapReduce job. We can
also control how the file is broken up into splits, by writing a custom Input Format.
3. How to change split size in Hadoop?

Input Split in Hadoop is user defined. The user can control the split size according to the size of the data in the MapReduce program. Thus the number of map tasks is equal to the number of Input Splits.

The client (running the job) calculates the splits for a job by calling getSplits(); the splits are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster. The map task then passes the split to the createRecordReader() method on the Input Format to get a RecordReader for the split, and the RecordReader generates records (key-value pairs), which it passes to the map function.
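As a hedged illustration of the split-size knobs, the new-API driver fragment below uses the FileInputFormat helper methods; the 256 MB and 512 MB figures are only examples, not recommendations.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// ... inside the driver ...
// raise the minimum split size to 256 MB so fewer, larger splits (and map tasks) are created
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
// cap the maximum split size if smaller splits (more map tasks) are wanted instead
FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);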
Hadoop RecordReader – How RecordReader Works in Hadoop?
Hadoop RecordReader Tutorial – Objective

In this Hadoop RecordReader tutorial, we are going to discuss an important concept of Hadoop MapReduce: the RecordReader. The MapReduce RecordReader in Hadoop takes the byte-oriented view of input provided by the Input Split and presents a record-oriented view to the Mapper. It uses the data within the boundaries that were created by the Input Split and creates key-value pairs. This section will cover what a RecordReader in Hadoop is, how the Hadoop RecordReader works, the types of Hadoop RecordReader (Sequence File Record Reader and Line RecordReader), and the maximum size of a record in Hadoop.

Hadoop MapReduce RecordReader

What is Hadoop Record Reader?

To understand record reader in Hadoop, we need to understand the Hadoop data flow. Map
Reduce has a simple model of data processing. Inputs and Outputs for the map and reduce
functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the
following general form:

Hadoop Record Reader and its types.
 map: (K1, V1) → list(K2, V2)
 reduce: (K2, list(V2)) → list(K3, V3)
Now, before processing, the framework needs to know which data to process; this is achieved with the Input Format class. Input Format is the class which selects the file from HDFS that should be input to the map function. An Input Format is also responsible for creating the Input Splits and dividing them into records. The data is divided into a number of splits (typically 64 MB or 128 MB) in HDFS. This is called an input split, which is the input that is processed by a single map.
The Input Format class calls the getSplits() function and computes splits for each file, and then sends them to the JobTracker, which uses their storage locations to schedule map tasks to process them on the TaskTrackers. The map task then passes the split to the createRecordReader() method on the Input Format in the TaskTracker to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
The Hadoop RecordReader uses the data within the boundaries that are created by the input split and creates key-value pairs for the mapper. The "start" is the byte position in the file where the RecordReader should start generating key/value pairs and the "end" is where it should stop reading records.

How Hadoop RecordReader works?

Let us now see the working of RecordReader in Hadoop.


A RecordReader is little more than an iterator over records; the map task uses each record to generate a key-value pair, which is passed to the map function. We can see this in the Mapper's run() method:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

After running setup(), nextKeyValue() is called repeatedly on the context to populate the key and value objects for the mapper. The key and value are retrieved from the RecordReader by way of the context and passed to the map() method to do its work. The input to the map function, which is a key-value pair (K, V), gets processed as per the logic in the map code. When the reader reaches the end of its records, the nextKeyValue() method returns false.
A RecordReader usually stays within the boundaries created by the input split to generate key-value pairs, but this is not mandatory. A custom implementation can even read data beyond the input split, though this is not encouraged.

Types of Hadoop RecordReader in MapReduce

The RecordReader instance is defined by the Input Format. By default, Text Input Format is used for converting data into key-value pairs. Text Input Format provides two types of RecordReaders:

I. Line RecordReader

Line RecordReader in Hadoop is the default RecordReader that Text Input Format provides. It treats each line of the input file as a value, with the associated key being the byte offset of the line. The Line RecordReader always skips the first line in a split (or part of it) if it is not the first split, and it reads one line past the split boundary at the end (if data is available, i.e., it is not the last split).

ii. Sequence File Record Reader

It reads data specified by the header of a sequence file.

Maximum size for a Single Record

There is a maximum size allowed for a single record to be processed. This value can be set using the parameter below:

conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

A line with a size greater than this maximum value (the default is 2,147,483,647) will be ignored.

Hadoop Partitioner – Internals of MapReduce Partitioner
Hadoop Partitioner / MapReduce Partitioner

In this MapReduce tutorial, our objective is to discuss the Hadoop Partitioner.

The Partitioner in MapReduce controls the partitioning of the keys of the intermediate mapper output. A hash function on the key (or a subset of the key) is used to derive the partition. The total number of partitions depends on the number of reduce tasks. Here we will also learn why a Hadoop partitioner is needed, what the default Hadoop partitioner is, how many partitioners are required in Hadoop, and what poor partitioning in Hadoop means, along with ways to overcome MapReduce poor partitioning.

Hadoop Partitioner – Internals of MapReduce Partitioner

What is Hadoop Partitioner?

Partitioning of the keys of the intermediate map output is controlled by the Partitioner. A hash function on the key (or a subset of the key) is used to derive the partition. Each mapper's output is partitioned according to the key value: records having the same key value go into the same partition (within each mapper), and then each partition is sent to a reducer. The Partitioner class determines which partition a given (key, value) pair will go to. The partition phase takes place after the map phase and before the reduce phase. Let's move ahead with the need for the Hadoop Partitioner.

Need of Hadoop Map Reduce Partitioner?


A MapReduce job takes an input data set and produces a list of key-value pairs as the result of the map phase, in which the input data is split and each map task processes its split and outputs a list of key-value pairs. The output from the map phase is then sent to the reduce tasks, which run the user-defined reduce function on the map outputs. But before the reduce phase, the map output is partitioned on the basis of the key and sorted.
This partitioning specifies that all the values for each key are grouped together, and makes sure that all the values of a single key go to the same reducer, thus allowing even distribution of the map output over the reducers.
Partitioner in Hadoop MapReduce redirects the mapper output to the reducer by determining
which reducer is responsible for the particular key.

Default MapReduce Partitioner

The default partitioner in Hadoop MapReduce is the HashPartitioner, which computes a hash value for the key and assigns the partition based on this result.
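For illustration, the default HashPartitioner boils down to a getPartition() of roughly the following form (a sketch of the idea, not a copy of the Hadoop source; the class name is made up):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // mask off the sign bit so the result is non-negative, then take it modulo the reducer count
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}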

How many Partitioners are there in Hadoop?

The total number of partitions in Hadoop is equal to the number of reducers, i.e. the Partitioner divides the data according to the number of reducers, which is set by the JobConf.setNumReduceTasks() method. Thus, the data from a single partition is processed by a single reducer. The partitioner is created only when there are multiple reducers.

Poor Partitioning in Hadoop MapReduce

If one key appears in the input data far more often than any other key, two mechanisms are used to send data to partitions:

 The key appearing most often is sent to one partition.
 All the other keys are sent to partitions according to their hashCode().
However, if the hashCode() method does not uniformly distribute the other keys' data over the partition range, then data will not be evenly sent to the reducers. Poor partitioning of the data means that some reducers will have more data input than others, i.e. they will have more work to do than other reducers. So the entire job will wait for one reducer to finish its extra-large share of the load.

How to overcome poor partitioning in MapReduce?
To overcome poor partitioning in Hadoop MapReduce, we can create a custom partitioner, which allows us to share the workload uniformly across different reducers.
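A minimal sketch of such a custom partitioner, assuming Text keys and IntWritable values and a hypothetical hot key "the" that we want to isolate in its own partition (the class name and hot key are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 1) {
      return 0;                               // only one reducer, nothing to balance
    }
    if ("the".equals(key.toString())) {
      return 0;                               // send the hot key to its own reducer
    }
    // spread all other keys over the remaining reducers
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
  }
}

It would then be registered in the driver with job.setPartitionerClass(HotKeyPartitioner.class).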

Hadoop Combiner – Best Explanation to MapReduce Combiner

Hadoop Combiner / MapReduce Combiner

The Hadoop Combiner, also known as a "Mini-Reducer", summarizes the Mapper output records with the same key before passing them to the Reducer. In this section on the MapReduce combiner we are going to answer what a Hadoop combiner is, how a MapReduce program behaves with and without a combiner, and the advantages and disadvantages of the combiner in Hadoop.

Hadoop Combiner – Best Explanation to MapReduce Combiner


What is Hadoop Combiner?

When we run a MapReduce job on a large dataset, large chunks of intermediate data are generated by the Mapper, and this intermediate data is passed to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a function known as the Hadoop Combiner that plays a key role in reducing this network congestion.
We have already seen what the mapper and the reducer are in Hadoop MapReduce. Now we move to the next step and learn about the Hadoop MapReduce Combiner.

The combiner in MapReduce is also known as a 'Mini-reducer'. The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the mapper and before the Reducer, and its use is optional.
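In a word-count-style job the reducer class can usually double as the combiner, since summing counts is associative and commutative. A hedged driver fragment (the mapper and reducer class names are hypothetical):

// ... inside the driver, assuming IntSumReducer is the job's reducer class ...
job.setMapperClass(TokenizerMapper.class);   // hypothetical mapper
job.setCombinerClass(IntSumReducer.class);   // run the reducer logic locally on each mapper's output
job.setReducerClass(IntSumReducer.class);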

How does MapReduce Combiner work?

Let us now see the working of the Hadoop combiner in MapReduce, and how things change when a combiner is used compared to when no combiner is used.

MapReduce program without Combiner

MapReduce Combiner: MapReduce program without combiner


In the above diagram, no combiner is used. The input is split across two mappers and 9 keys are generated by the mappers. Now we have 9 intermediate key/value pairs; the mappers send this data directly to the reducer, and sending data to the reducer consumes network bandwidth. It will take more time to transfer the data to the reducer if the size of the data is big.
Now, if we use a Hadoop combiner between the mapper and the reducer, the combiner aggregates the intermediate data (9 key/value pairs) before sending it to the reducer and generates 4 key/value pairs as output.

MapReduce program with Combiner in between Mapper and Reducer

MapReduce Combiner: MapReduce program with combiner
The reducer now needs to process only the 4 key/value pairs generated by the 2 combiners. Thus the reducer executes only 4 times to produce the final output, which increases the overall performance.

Advantages of MapReduce Combiner

As we have discussed what is Hadoop MapReduce Combiner in detail, now we will discuss
some advantages of Mapreduce Combiner.

 Hadoop Combiner reduces the time taken for data transfer between mapper and reducer.
 It decreases the amount of data that needs to be processed by the reducer.
 The Combiner improves the overall performance of the reducer.

Disadvantages of Hadoop Combiner in MapReduce

There are also some disadvantages of the Hadoop Combiner. Let's discuss them one by one:

 MapReduce jobs cannot depend on the Hadoop combiner's execution, because there is no guarantee that it will run.
 Hadoop may store the key-value pairs in the local file system and run the combiner later, which causes expensive disk I/O.
Shuffling and Sorting in Hadoop MapReduce

In Hadoop, the process by which the intermediate output from the mappers is transferred to the reducers is called shuffling. Each reducer gets one or more keys and their associated values, depending on the number of reducers. The intermediate key-value pairs generated by the mapper are sorted automatically by key. In this section, we will discuss shuffling and sorting in Hadoop MapReduce in detail.
Here we will learn what sorting in Hadoop is, what shuffling in Hadoop is, what the purpose of the shuffle and sort phase in MapReduce is, how the MapReduce shuffle works and how the MapReduce sort works.

Shuffling and Sorting in Hadoop MapReduce

What is Shuffling and Sorting in Hadoop MapReduce?

Before we start with Shuffle and Sort in MapReduce, let us revise the other phases of MapReduce, like the Mapper, Reducer, Combiner, Partitioner and Input Format. The shuffle phase in Hadoop transfers the map output from the Mappers to a Reducer in MapReduce. The sort phase in MapReduce covers the merging and sorting of the map outputs. Data from the mappers are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key. The shuffle and sort phases in Hadoop occur simultaneously and are done by the MapReduce framework.

Shuffling in MapReduce

The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input. The MapReduce shuffle phase is necessary for the reducers; otherwise, they would not have any input (or input from every mapper). As shuffling can start even before the map phase has finished, this saves some time and completes the tasks in less time.

Sorting in MapReduce

The keys generated by the mapper are automatically sorted by the MapReduce framework, i.e. before the reducer starts, all intermediate key-value pairs generated by the mapper get sorted by key (not by value). Values passed to each reducer are not sorted; they can be in any order.
Sorting in Hadoop helps the reducer easily distinguish when a new reduce task should start. This saves time for the reducer. The reducer starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes key-value pairs as input and generates key-value pairs as output.

Note that shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).

Secondary Sorting in MapReduce

If we want to sort the reducer's values, the secondary sorting technique is used, as it enables us to sort the values (in ascending or descending order) passed to each reducer.
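A hedged sketch of the usual secondary-sort recipe in the driver, assuming a user-written composite key plus three helper classes (all class names here are hypothetical and would have to be implemented separately):

// ... inside the driver ...
// the map output key is a composite of (natural key, value-to-sort-by)
job.setMapOutputKeyClass(CompositeKey.class);
// partition on the natural key only, so all records for a key reach the same reducer
job.setPartitionerClass(NaturalKeyPartitioner.class);
// sort comparator orders by natural key, then by the secondary field
job.setSortComparatorClass(CompositeKeySortComparator.class);
// grouping comparator groups reducer input by the natural key alone
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);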

Hadoop Output Format – Types of Output Format in Map reduce

Hadoop Output Format

The Hadoop Output Format checks the output specification of the job. It determines how the RecordWriter implementation is used to write the output to output files. In this section, we are going to see what the Hadoop Output Format is, what the Hadoop RecordWriter is, and how the RecordWriter is used in Hadoop.

Hadoop Output Format – Types of Output Format in Map reduce


Hadoop Output Format

As we saw above, the Hadoop RecordWriter takes the output data from the Reducer and writes this data to output files. The way these output key-value pairs are written in output files by the RecordWriter is determined by the Output Format. The Output Format and Input Format functions are alike. Output Format instances provided by Hadoop are used to write to files on HDFS or the local disk. The Output Format describes the output specification for a MapReduce job. On the basis of the output specification:
 The MapReduce job checks that the output directory does not already exist.
 The Output Format provides the RecordWriter implementation to be used to write the output files of the job. Output files are stored in a file system.
The FileOutputFormat.setOutputPath() method is used to set the output directory. Every Reducer writes a separate file in a common output directory.

Types of Hadoop Output Formats

There are various types of Hadoop Output Format. Let us see some of them
below:

Types of Hadoop Output Formats
I. Text Output Format

The default Hadoop reducer Output Format in MapReduce is Text Output Format, which writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, since Text Output Format turns them into strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be changed using the mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat can be used for reading these output text files, since it breaks lines into key-value pairs based on a configurable separator.
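For example, the separator could be switched from a tab to a comma with a one-line configuration change (a sketch; the property name is the one mentioned above):

// ... inside the driver, before the job is submitted ...
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");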
ii. Sequence File Output Format

It is an Output Format which writes sequence files for its output. It is a useful intermediate format between MapReduce jobs, since it rapidly serializes arbitrary data types to the file; the corresponding Sequence File Input Format will deserialize the file into the same types and present the data to the next mapper in the same manner as it was emitted by the previous reducer. Sequence files are also compact and readily compressible. Compression is controlled by the static methods on Sequence File Output Format.

iii. Sequence File as Binary Output Format


It is another form of Sequence File Output Format, which writes keys and values to a sequence file in binary format.

iv. Map File Output Format
It is another form of File Output Format in Hadoop, which is used to write the output as map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.

v. Multiple Outputs
It allows writing data to files whose names are derived from the output keys and values, or in
fact from an arbitrary string.

vi. Lazy Output Format


Sometimes File Output Format will create output files even if they are empty. Lazy Output Format is a wrapper Output Format which ensures that an output file is created only when the first record is emitted for a given partition.
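A minimal sketch of wrapping Text Output Format with Lazy Output Format in the driver:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// ... inside the driver ...
// output files are only created for partitions that actually emit at least one record
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);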

Vii. DB Output Format


DB Output Format in Hadoop is an Output Format for writing to relational databases and to HBase. It sends the reduce output to a SQL table. It accepts key-value pairs where the key has a type extending DBWritable.

Map Reduce Input Split vs. HDFS Block in Hadoop



MapReduce Input Split & HDFS Block – Introduction

Block in HDFS

A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks. In the same way, HDFS stores each file as blocks. The Hadoop framework is responsible for distributing the data blocks across multiple nodes.
Input Split in Hadoop
The data to be processed by an individual Mapper is represented by an Input Split. The split is divided into records, and each record (which is a key-value pair) is processed by the map. The number of map tasks is equal to the number of Input Splits.
Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. The Input Format is used to define how these input files are split and read; it is responsible for creating the Input Splits.

MapReduce Input Split vs. Blocks in Hadoop

Let’s discuss feature wise comparison between MapReduce Input Split vs. Blocks-

I. Input Split vs. Block Size in Hadoop

 Block – The default size of an HDFS block is 128 MB, which we can configure as per our requirements. All blocks of a file are the same size except the last block, which can be the same size or smaller. The files are split into 128 MB blocks and then stored in the Hadoop file system.
 Input Split – By default, the split size is approximately equal to the block size. Input Split is user defined, and the user can control the split size based on the size of the data in the MapReduce program.
ii. Data Representation in Hadoop Blocks vs. Input Split

 Block – It is the physical representation of data. It contains a minimum amount of data that
can be read or write.
 Input Split – It is the logical representation of data present in the block. It is used during
data processing in MapReduce program or other processing techniques. Input Split doesn’t
contain actual data, but a reference to the data.

iii. Example of Block vs. Input Split in Hadoop

Input Split vs. Block


Consider an example where we need to store a file in HDFS. HDFS stores files as blocks. A block is the smallest unit of data that can be stored or retrieved from disk, and the default size of a block is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. Suppose we have a file of 130 MB; HDFS will break this file into 2 blocks.
Now, if we want to perform a MapReduce operation directly on the blocks, it will not process correctly, because the second block is incomplete (a record may be split across the block boundary). Input Split solves this problem: an Input Split forms a logical grouping of blocks as a single unit, because the Input Split includes the location of the next block and the byte offset of the data needed to complete the record.
Map Only Job in Hadoop MapReduce with example
In Hadoop, a map-only job is one in which the mapper does all the work, no work is done by the reducer, and the mapper's output is the final output. In this section on map-only jobs in Hadoop MapReduce, we will learn about the MapReduce process, the need for map-only jobs in Hadoop, and how to set the number of reducers to 0 for a Hadoop map-only job.
We will also learn the advantages of a map-only job in Hadoop MapReduce and look at processing in Hadoop without any reducer, along with a MapReduce example.

Map Only Job in Hadoop MapReduce with example

What is Map Only Job in Hadoop MapReduce?

Hadoop MapReduce – Map Only job


MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Two important tasks done by the MapReduce algorithm are the Map task and the Reduce task. The Hadoop Map phase takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Hadoop Reduce phase takes the output from the map as input, combines those data tuples based on the key, and accordingly modifies the value of the key.
From the word-count example, we can say that there are two sets of parallel processes, map and reduce. In the map process, the input is first split to distribute the work among all the map nodes, as shown in the figure, and then each word is identified and mapped to the number 1; these pairs are called tuples (key-value pairs).
In the first mapper node three words, lion, tiger, and river, are passed. Thus the output of that node will be three key-value pairs with three different keys and the value set to 1, and the same process is repeated for all nodes. These tuples are then passed to the reducer nodes, where the partitioner comes into action. It carries out shuffling so that all tuples with the same key are sent to the same node. Thus, in the reduce process what basically happens is an aggregation of values, or rather an operation on values, that share the same key.

Now, let us consider a scenario where we just need to perform the operation and no aggregation is required. In such a case, we prefer a 'map-only job' in Hadoop. In a Hadoop map-only job, the map does all the work with its Input Split and no work is done by the reducer. Here the map output is the final output.
How to avoid Reduce Phase in Hadoop?

We can achieve this by calling job.setNumReduceTasks(0) in the driver configuration, as in the sketch below. This sets the number of reducers to 0, and thus only the mappers do the complete task.
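A minimal map-only driver sketch under these assumptions (the class name MyMapper and the argument paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "map-only example");
    job.setJarByClass(MapOnlyDriver.class);
    job.setMapperClass(MyMapper.class);          // hypothetical mapper class
    job.setNumReduceTasks(0);                    // 0 reducers: mapper output is written straight to HDFS
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}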
Advantages of Map only job in Hadoop

In between the map and reduce phases there is a sort and shuffle phase. Sort and shuffle are responsible for sorting the keys in ascending order and then grouping values based on the same keys. This phase is very expensive, and if the reduce phase is not required we should avoid it, as avoiding the reduce phase eliminates the sort and shuffle phase as well. This also avoids network congestion: in shuffling, the output of the mapper travels to the reducer, and when the data size is huge, large amounts of data need to travel to the reducer.
The output of the mapper is normally written to local disk before being sent to the reducer, but in a map-only job this output is written directly to HDFS. This further saves time and reduces cost as well.
Also, there is no need for a partitioner or combiner in a Hadoop map-only job, which makes the process fast.
Data locality in Hadoop: The Most Comprehensive Guide
Data Locality in Hadoop – Objective

In Hadoop, Data locality is the process of moving the computation close to where the actual data
resides on the node, instead of moving large data to computation. This minimizes network
congestion and increases the overall throughput of the system. This feature of Hadoop we will
discuss in detail in this tutorial. We will learn what is data locality in Hadoop, data locality
definition, how Hadoop exploits Data Locality, what is the need of Hadoop Data Locality,
various types of data locality in Hadoop MapReduce, Data locality optimization in Hadoop and
various advantages of Hadoop data locality.

Data locality in Hadoop: The Most Comprehensive Guide

The Concept of Data locality in Hadoop

Let us understand the Data Locality concept and what Data Locality in MapReduce is.
The major drawback of Hadoop was cross-switch network traffic due to the huge volume of data. To overcome this drawback, Data Locality in Hadoop came into the picture. Data locality in MapReduce refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to the computation. This minimizes network congestion and increases the overall throughput of the system. In Hadoop, datasets are stored in HDFS. Datasets are divided into blocks and stored across the data nodes in the Hadoop cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the data nodes on which data related to the MapReduce job is available.

Data Locality in Hadoop – MapReduce

Requirements for Data locality in MapReduce

Our system architecture needs to satisfy the following conditions, in order to get the benefits of
all the advantages of data locality:

 First of all, the cluster should have the appropriate topology, and the Hadoop code must have the ability to read data locally.
 Second, Hadoop must be aware of the topology of the nodes where tasks are executed, and Hadoop must know where the data is located.

Categories of Data Locality in Hadoop

Below are the various categories in which Data Locality in Hadoop is categorized:

I. Data local data locality in Hadoop

When the data is located on the same node as the mapper working on the data it is known as data
local data locality. In this case, the proximity of data is very near to computation. This is the
most preferred scenario.

ii. Intra-Rack data locality in Hadoop

It is not always possible to execute the mapper on the same data node as the data due to resource constraints. In such a case, it is preferred to run the mapper on a different node but on the same rack.
iii. Inter-Rack data locality in Hadoop

Sometimes it is not possible to execute the mapper on a different node in the same rack due to resource constraints. In such a case, we execute the mapper on nodes in different racks. This is the least preferred scenario.

Hadoop Data Locality Optimization

Data locality is a main advantage of Hadoop MapReduce, since map code is executed on the same data node where the data resides. However, this is not always true in practice, due to various reasons like speculative execution in Hadoop, heterogeneous clusters, data distribution and placement, and the data layout and input splitter.
Challenges become more prevalent in large clusters, because the more data nodes and data there are, the less the locality. In larger clusters, some nodes are newer and faster than others, creating a data-to-compute ratio that is out of balance; thus, large clusters tend not to be completely homogeneous. In speculative execution, even though the data might not be local, computing power is used anyway. The root cause also lies in the data layout/placement and the input splitter used. Non-local data processing puts a strain on the network, which creates a problem for scalability. Thus the network becomes the bottleneck.
We can improve data locality by first detecting which jobs have a data locality problem or degrade over time. Solving the problem is more complex and involves changing the data placement and data layout, using a different scheduler, or simply changing the number of mapper and reducer slots for a job. Then we have to verify whether a new execution of the same workload has a better data locality ratio.

Advantages of Hadoop Data locality

There are two benefits of data Locality in MapReduce. Let’s discuss them one by one-

I. Faster Execution

In data locality, the program is moved to the node where the data resides instead of moving large data to the program's node; this makes Hadoop faster. Because the size of the program is always smaller than the size of the data, moving the data would be the network-transfer bottleneck.

ii. High Throughput

Data locality increases the overall throughput of the system.

Speculative Execution in Hadoop MapReduce


In this big data Hadoop tutorial, we are going to learn about Hadoop speculative execution. Apache Hadoop does not fix or diagnose slow-running tasks. Instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup (the backup task is called a speculative task). This process is called speculative execution in Hadoop.
In this section we will discuss speculative execution – a key feature of Hadoop that improves job efficiency – the need for speculative execution in Hadoop, whether speculative execution is always helpful or whether we need to turn it off, and how we can disable speculative execution in Hadoop.

Speculative Execution in Hadoop MapReduce
What is Speculative Execution in Hadoop?



In Hadoop, MapReduce breaks jobs into tasks and these tasks run in parallel rather than sequentially, thus reducing overall execution time. This model of execution is sensitive to slow tasks (even if they are few in number) as they slow down the overall execution of a job.
There may be various reasons for the slowdown of tasks, including hardware degradation or software misconfiguration, but it may be difficult to detect the causes since the tasks still complete successfully, although more time is taken than expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect them and runs backup tasks for them. This is called speculative execution in Hadoop, and the backup tasks are called speculative tasks.

How Speculative Execution works in Hadoop?

First, all the tasks for the job are launched in Hadoop MapReduce. Speculative tasks are then launched for those tasks that have been running for some time (at least one minute) and have not made as much progress, on average, as the other tasks in the job. The speculative task is killed if the original task completes before it; on the other hand, the original task is killed if the speculative task finishes first.

Is Speculative Execution Beneficial?

Hadoop MapReduce speculative execution is beneficial in some cases: in a Hadoop cluster with hundreds of nodes, problems like hardware failure or network congestion are common, so running a parallel or duplicate task is better than waiting for the problematic task to complete. But if two duplicate tasks are launched at about the same time, it is a waste of cluster resources.

How to Enable or Disable Speculative Execution?

Speculative execution is a MapReduce job optimization technique in Hadoop that is enabled by default. You can disable speculative execution for mappers and reducers in mapred-site.xml as shown below:

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>

What is the need to turn off Speculative Execution?

The main purpose of speculative execution is to reduce job execution time; however, cluster efficiency is affected by the duplicate tasks. Since redundant tasks are executed in speculative execution, this can reduce overall throughput. For this reason, some cluster administrators prefer to turn off speculative execution in Hadoop.

Hadoop Counters the Most Complete Guide to MapReduce Counters


Hadoop Counters

In this MapReduce Hadoop Counters tutorial, we provide a detailed description of MapReduce Counters in Hadoop. The tutorial covers an introduction to Hadoop MapReduce counters and the types of Hadoop Counters, such as built-in counters and user-defined counters. In this Hadoop counters tutorial, we will also discuss the File Input Format and File Output Format counters of Hadoop MapReduce.

Hadoop Counters

What is Hadoop MapReduce?

Before we start with Hadoop Counters, let us first see the overview of Hadoop MapReduce.
MapReduce is the core component of Hadoop which provides data processing. MapReduce
works by breaking the processing into two phases; Map phase and Reduce phase. The map is the
first phase of processing, where we specify all the complex logic/business rules/costly code,
whereas the Reduce phase is the second phase of processing, where we specify light-weight
processing like aggregation/ summation.

In Hadoop, the MapReduce framework has certain elements such as Counters, Combiners, and Partitioners, which play a key role in improving the performance of data processing.

What are Hadoop Counters?

Hadoop Counters provide a way to measure the progress or the number of operations that occur within a map/reduce job. Counters in Hadoop MapReduce are a useful channel for gathering statistics about the MapReduce job, for quality control or for application-level statistics. They are also useful for problem diagnosis.
Counters represent Hadoop global counters, defined either by the MapReduce framework or by applications. Each Hadoop counter is named by an "Enum" and has a long for the value. Counters are bunched into groups, each comprising counters from a particular Enum class.
Hadoop Counters validate that:
 The correct number of bytes was read and written.
 The correct number of tasks was launched and successfully ran.
 The amount of CPU and memory consumed is appropriate for our job and cluster nodes.

Types of Hadoop MapReduce Counters

There are basically 2 types of MapReduce Counters:

 Built-In Counters in MapReduce


 User-Defined Counters/Custom counters in MapReduce
Let's discuss these types of counters in Hadoop MapReduce:

Built-In Counters in MapReduce

Hadoop maintains some built-in counters for every job, and these report various metrics. For example, there are counters for the number of bytes and records, which allow us to confirm that the expected amount of input is consumed and the expected amount of output is produced.

Hadoop Counters are divided into groups, and there are several groups for the built-in counters. Each group either contains task counters (which are updated as a task progresses) or job counters (which are updated as a job progresses). The groups for the Hadoop built-in counters are:

MapReduce Task Counter in Hadoop

The Hadoop task counters collect specific information (like the number of records read and written) about tasks during their execution. For example, the MAP_INPUT_RECORDS counter is a task counter which counts the input records read by each map task.

Hadoop task counters are maintained by each task attempt and periodically sent to the application master so they can be globally aggregated.

File System Counters

The file system counters in Hadoop MapReduce gather information like the number of bytes read and written via the file system. Below are the names and descriptions of the file system counters:
 File system bytes read – the number of bytes read from the file system by map and reduce tasks.
 File system bytes written – the number of bytes written to the file system by map and reduce tasks.
File Input Format Counters in Hadoop

File Input Format counters in Hadoop MapReduce gather information about the number of bytes read by map tasks via the File Input Format.
File Output Format counters in MapReduce
File Output Format counters in Hadoop MapReduce gather information about the number of bytes written by map tasks (for map-only jobs) or reduce tasks via the File Output Format.

MapReduce Job Counters

MapReduce job counters measure job-level statistics, not values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map tasks that were launched over the course of a job (including tasks that failed). The application master maintains the MapReduce job counters, so these counters don't need to be sent across the network, unlike all other counters, including user-defined ones.

User-Defined Counters/Custom Counters in Hadoop MapReduce

In addition to the MapReduce built-in counters, MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer. For example, in Java, an 'enum' is used to define counters. A job may define an arbitrary number of 'enums', each with an arbitrary number of fields. The name of the enum is the group name, and the enum's fields are the counter names.
a. Dynamic Counters in Hadoop MapReduce

A Java enum's fields are defined at compile time, so we cannot create new counters at runtime using enums. To do so, we use dynamic counters in Hadoop MapReduce, which are not defined at compile time using a Java enum.
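A short sketch of both styles inside a mapper (the enum, group, counter and helper-method names are made up for illustration):

// enum-based (static) counter: the enum name is the group, its fields are the counter names
enum RecordQuality { MALFORMED, WELL_FORMED }

// ... inside the mapper's map() method ...
if (isMalformed(value)) {                                    // isMalformed() is a hypothetical helper
  context.getCounter(RecordQuality.MALFORMED).increment(1);
} else {
  context.getCounter(RecordQuality.WELL_FORMED).increment(1);
}

// dynamic counter: group and counter names are plain strings chosen at runtime
context.getCounter("RecordQualityByValue", value.toString()).increment(1);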

Hadoop Optimization Job Optimization & Performance Tuning


Hadoop Optimization Tutorial

This tutorial on Hadoop optimization explains Hadoop cluster optimization and MapReduce job optimization techniques that will help you optimize MapReduce job performance and ensure the best performance for your Hadoop cluster.

Hadoop Optimization | Job Optimization & Performance Tuning

Hadoop Optimization or Job Optimization Techniques

There are various ways to improve Hadoop performance. Let's discuss each of them one by one:

I. Proper configuration of your cluster

 Mount DFS and MapReduce storage with the -noatime option. This disables access-time tracking and can improve I/O performance.
 Avoid RAID on TaskTracker and DataNode machines; it generally reduces performance.
 Make sure you have configured mapred.local.dir and dfs.data.dir to point to one directory on each of your disks, to ensure that all of your I/O capacity is used.
 Ensure that you have smart monitoring of the health status of your disk drives. This is one of the best practices for Hadoop MapReduce performance tuning. MapReduce jobs are fault tolerant, but dying disks can cause performance to degrade as tasks must be re-executed.
 Monitor the graph of swap usage and network usage with software like Ganglia or the Hadoop monitoring metrics. If you see swap being used, reduce the amount of RAM allocated to each task in mapred.child.java.opts.
ii. LZO compression usage

Using LZO compression is always a good idea for intermediate data. Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little CPU overhead, it saves time by reducing the amount of disk I/O during the shuffle.
To enable LZO compression, set mapred.compress.map.output to true. This is one of the most important Hadoop optimization techniques.
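A hedged configuration sketch, assuming the external hadoop-lzo library (which provides the com.hadoop.compression.lzo.LzoCodec class) is installed on the cluster; the property names are the older mapred.* ones used above:

import org.apache.hadoop.conf.Configuration;

// ... in the driver, before submitting the job ...
Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.compress.map.output", true);         // compress intermediate map output
conf.set("mapred.map.output.compression.codec",
         "com.hadoop.compression.lzo.LzoCodec");             // codec supplied by the hadoop-lzo package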
iii. Proper tuning of the number of MapReduce tasks

 If each task takes 30-40 seconds or more, then reduce the number of tasks. Starting a mapper or reducer process involves the following: first you need to start the JVM (the JVM is loaded into memory), then you need to initialize the JVM, and after processing (mapper/reducer) you need to de-initialize the JVM. All these JVM steps are costly. Now consider a case where a mapper runs a task for just 20-30 seconds; for this we still have to start/initialize/stop a JVM, which might take a considerable amount of time. It is recommended to run each task for at least 1 minute.
 If a job has more than 1 TB of input, you should consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks will be smaller. You can change the block size of existing files with a command such as: hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-large-blocks
 As long as each task runs for at least 30-40 seconds, you should increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
 Don't schedule too many reduce tasks – for most jobs, the number of reduce tasks should be equal to or a bit less than the number of reduce slots in the cluster.
iv. Combiner between mapper and reducer

If your algorithm involves computing aggregates of any sort, it is suggested to use a Combiner to perform some aggregation before the data hits the reducer. The MapReduce framework runs the combiner intelligently to reduce the amount of data that has to be written to disk and transferred between the Map and Reduce stages of computation.

v. Usage of most appropriate and compact writable type for data

New big data users, or users switching from Hadoop Streaming to Java MapReduce, often use the Text writable type unnecessarily. Although Text can be convenient, it is inefficient to convert numeric data to and from UTF-8 strings, and this can actually make up a significant portion of CPU time. Whenever dealing with non-textual data, consider using the binary Writables like IntWritable, FloatWritable, etc.

vi. Reuse of Writables

One of the common mistakes that many MapReduce users make is to allocate a new Writable object for every output from a mapper or reducer. For example, a word-count mapper might be implemented as:

public void map(...) {
  for (String word : words) {
    output.collect(new Text(word), new IntWritable(1));
  }
}

66
This implementation causes the allocation of thousands of short-lived objects. While the Java garbage collector does a reasonable job of dealing with this, it is more efficient to write:

class MyMapper ... {
  Text wordText = new Text();
  IntWritable one = new IntWritable(1);

  public void map(...) {
    ...
    for (String word : words) {
      wordText.set(word);
      output.collect(wordText, one);   // reuse the same Writable objects for every output record
    }
  }
}

Hadoop MapReduce Performance Tuning Best Practices


MapReduce Performance Tuning Tutorial

Performance tuning in Hadoop helps in optimizing Hadoop cluster performance. This section on Hadoop MapReduce performance tuning provides ways to improve your Hadoop cluster performance and get the best results from your MapReduce programming. It covers concepts like memory tuning in Hadoop, map disk spill in Hadoop, tuning mapper tasks, and speculative execution, along with other related concepts for Hadoop MapReduce performance tuning.

Hadoop MapReduce Performance Tuning Best Practices

Hadoop MapReduce Performance Tuning

Hadoop performance tuning helps in optimizing Hadoop cluster performance so that it provides the best results for Hadoop programming. To perform it, you need to repeat the process given below until the desired output is achieved in an optimal way:
Run Job –> Identify Bottleneck –> Address Bottleneck.
The first step in Hadoop performance tuning is to run the Hadoop job, identify the bottlenecks, and address them using the methods below to get the highest performance. You repeat these steps until the desired level of performance is achieved.
Tips for Hadoop MapReduce Performance Tuning

Here we are going to discuss the ways to improve the Hadoop MapReduce performance tuning.
We have classified these ways into two categories.
 Hadoop run-time parameters based performance tuning.
 Hadoop application-specific performance tuning.
Let’s discuss how to improve the performance of Hadoop cluster on the basis of these two
categories.

I. Tuning Hadoop Run-time Parameters

Hadoop provides many options on CPU, memory, disk, and network for performance tuning. Most Hadoop tasks are not CPU bound; what matters most is optimizing the usage of memory and disk spills. Let us get into the details of tuning Hadoop run-time parameters.

a. Memory Tuning

The most general and common rule for memory tuning in MapReduce performance tuning is: use as much memory as you can without triggering swapping. The parameter for task memory is mapred.child.java.opts, which can be set in your configuration file.
You can also monitor memory usage on the servers using Ganglia, Cloudera Manager, or Nagios for better memory performance.
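A small illustrative fragment; the 2 GB heap value is an arbitrary example, not a recommendation:

// ... in the driver (or the equivalent property in mapred-site.xml) ...
conf.set("mapred.child.java.opts", "-Xmx2048m");   // JVM options (here: 2 GB heap) for each map/reduce task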
b. Minimize the Map Disk Spill

Disk I/O is usually the performance bottleneck in Hadoop. There are a lot of parameters you can tune for minimizing spilling, such as:
 Compression of the mapper output
 Usage of about 70% of heap memory in the mapper for the spill buffer
c. Tuning Mapper Tasks

The number of mapper tasks is set implicitly, unlike reducer tasks. The most common Hadoop performance tuning approach for the mapper is controlling the number of mappers and the size of each job. When dealing with large files, Hadoop splits the file into smaller chunks so that the mappers can run on them in parallel. However, initializing a new mapper task usually takes a few seconds, which is also an overhead to be minimized. The suggestions for this are:
 Reuse JVMs for tasks.
 Aim for map tasks running 1-3 minutes each. If the average mapper running time is less than one minute, increase mapred.min.split.size to allocate fewer mappers per slot and thus reduce the mapper initialization overhead.
 Use CombineFileInputFormat for a bunch of smaller files.
ii. Tuning Application Specific Performance
a. Minimize your Mapper Output

Minimizing the mapper output can improve performance a lot, as the shuffle phase is sensitive to disk I/O, network I/O, and memory.
To achieve this, the suggestions are:

 Filter the records on the mapper side instead of the reducer side.
 Use minimal data to form your map output key and map output value in MapReduce.
 Compress the mapper output.
b. Balancing Reducer’s loading

Unbalanced reducer tasks create another performance issue. Some reducers take most of the output from the mappers and run extremely long compared to the other reducers.
The methods to balance the load are:

 Implement a better hash function in the Partitioner class.
 Write a preprocessing job to separate the keys using MultipleOutputs, then use another MapReduce job to process the special keys that cause the problem.

c. Reduce Intermediate data with Combiner in Hadoop

Implement a combiner to reduce data which enables faster data transfer.

d. Speculative Execution

When tasks take a long time to finish execution, they affect the MapReduce job as a whole. This problem is solved by the approach of speculative execution: backing up slow tasks on alternate machines.
To enable speculative execution, set the configuration parameters 'mapred.map.tasks.speculative.execution' and 'mapred.reduce.tasks.speculative.execution' to true (as shown in the mapred-site.xml example earlier). This can reduce the job execution time if task progress is slow due to memory unavailability.
This was all about Hadoop MapReduce job optimization and performance tuning.
