Big-Data Unit 2
Outline
A Weather Dataset
Driver code
Mapper code
Reducer code
Record Reader
Combiner
Partitioner
UNIT 2
The following are the characteristics of large-scale cluster-computing installations and of the specialized file systems that have been developed to take advantage of them.
Physical Organization of Compute Nodes
The new parallel-computing architecture, sometimes called cluster computing, is organized as follows. Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than the intra-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.
However, there may be many more racks and many more compute nodes per rack. It is a
fact of life that components fail, and the more components, such as compute nodes and
interconnection networks, a system has, the more frequently something in the system will not be
working at any given time. Some important calculations take minutes or even hours on thousands
of compute nodes. If we had to abort and restart the computation every time one component
failed, then the computation might never complete successfully.
The solution to this problem takes two forms:
1. Files must be stored redundantly. If we did not duplicate the file at several compute nodes,
then if one node failed, all its files would be unavailable until the node is replaced. If we did not
back up the files at all, and the disk crashes, the files would be lost forever.
2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the map-reduce programming system.
2.2 Large-Scale File-System Organization
To exploit cluster computing, files must look and behave somewhat differently from the
conventional file systems found on single computers. This new file system, often called a
distributed file system or DFS (although this term had other meanings in the past), is typically
used as follows.
Files can be enormous, possibly a terabyte in size. If you have only small files, there is no
point using a DFS for them.
Files are rarely updated. Rather, they are read as data for some calculation, and possibly
additional data is appended to files from time to time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are
replicated, perhaps three times, at three different compute nodes. Moreover, the nodes
holding copies of one chunk should be located on different racks, so we don’t lose all
copies due to a rack failure. Normally, both the chunk size and the degree of replication
can be decided by the user.
To find the chunks of a file, there is another small file called the master node or name
node for that file. The master node is itself replicated, and a directory for the file system
as a whole knows where to find its copies. The directory itself can be replicated, and all
participants using the DFS know where the directory copies are.
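In HDFS (the Hadoop distributed file system used later in this unit), for example, a client can pick both values when a file is created. The following is only a minimal sketch with illustrative numbers; in practice the cluster-wide defaults dfs.replication and dfs.blocksize are usually left in place.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    short replication = 3;               // keep three copies of every chunk
    long blockSize = 64L * 1024 * 1024;  // 64 MB chunks, as in the text above
    int bufferSize = 4096;

    // create() lets the caller choose replication and block (chunk) size per file
    FSDataOutputStream out = fs.create(new Path("/data/example.txt"),
        true, bufferSize, replication, blockSize);
    out.writeUTF("hello distributed file system");
    out.close();
  }
}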
A Weather Dataset
MapReduce is a programming model for data processing. The model is simple, yet not too
simple to express useful programs in. Hadoop can run MapReduce programs written in various
languages; in this chapter, we shall look at the same program expressed in Java, Ruby, Python,
and C++. Most important, MapReduce programs are inherently parallel, thus putting very large-
scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce
comes into its own for large datasets, so let’s start by looking at one.
For our example, we will write a program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we shall focus on the basic elements, such as temperature, which are always present and are of fixed width. In a sample record, the salient fields are packed into one line with no delimiters.
A simple (non-Hadoop) analysis script loops through the compressed year files, first printing the year, and then processing each file using awk. The awk script extracts two fields from the
data: the air temperature and the quality code. The air temperature value is
turned into an integer by adding 0. Next, a test is applied to see if the
temperature is valid (the value 9999 signifies a missing value in the NCDC
dataset) and if the quality code indicates that the reading is not suspect or
erroneous. If the reading is OK, the value is compared with the maximum value
seen so far, which is updated if a new maximum is found. The END block is
executed after all the lines in the file have been processed, and it prints the
maximum value.
The temperature values in the source file are scaled by a factor of 10, so this
works out as a maximum temperature of 31.7°C for 1901 (there were very few
readings at the beginning of the century, so this is plausible). The complete run
for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large
Instance.
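For reference, the per-file logic described above can be sketched in plain Java; the fixed-width character positions for the temperature and quality-code fields are taken from the NCDC record layout and should be treated as illustrative rather than authoritative.

import java.io.BufferedReader;
import java.io.FileReader;

// Serial (non-Hadoop) sketch of the max-temperature scan over one year's records.
public class MaxTemperatureSerial {
  public static void main(String[] args) throws Exception {
    int max = Integer.MIN_VALUE;
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = in.readLine()) != null) {
        // temperature is stored in tenths of a degree, possibly with a leading sign
        int temp = Integer.parseInt(line.substring(87, 92).replace("+", ""));
        String quality = line.substring(92, 93);
        // 9999 means missing; quality codes 0,1,4,5,9 mean the reading is usable
        if (temp != 9999 && quality.matches("[01459]")) {
          max = Math.max(max, temp);
        }
      }
    }
    System.out.println("max temperature (tenths of a degree C): " + max);
  }
}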
Parallelizing this processing by hand, however, raises several problems. First, dividing the work into equal-size pieces isn’t always easy or obvious. In
this case, the file size for different years varies widely, so some processes will
finish much earlier than others. Even if they pick up further work, the whole run
is dominated by the longest file. A better approach, although one that requires
more work, is to split the input into fixed-size chunks and assign each chunk to a
process.
Second, combining the results from independent processes may need further
processing. In this case, the result for each year is independent of other years
and may be combined by concatenating all the results, and sorting by year. If
using the fixed-size chunk approach, the combination is more delicate. For this
example, data for a particular year will typically be split into several chunks,
each processed independently. We’ll end up with the maximum temperature for
each chunk, so the final step is to look for the highest of these maximums, for
each year.
Third, you are still limited by the processing capacity of a single machine. If the
best time you can achieve is 20 minutes with the number of processors you
have, then that’s it. You can’t make it go faster. Also, some datasets grow
beyond the capacity of a single machine. When we start using multiple
machines, a whole host of other factors come into play, mainly falling in the
category of coordination and reliability. Who runs the overall job? How do we
deal with failed processes?
So, though it’s feasible to parallelize the processing, in practice it’s messy. Using
a framework like Hadoop to take care of these issues is a great help.
A Weather Dataset
Short-term (3-day) forecasts of temperature, precipitation and wind have to be provided for several regions in the country. Such forecasts are relevant for the general public to plan activities, and they also need to be reliable. The forecast elements include:
Temperature extremes
Temperature average
Wind speed
Wind direction
Precipitation Amount
Precipitation Probability
Forecast for current day and four following days
Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the
input data-set into independent chunks which are processed by the map tasks in a completely
parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce
tasks. Typically both the input and the output of the job are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running
on the same set of nodes. This configuration allows the framework to effectively schedule tasks
on the nodes where data is already present, resulting in very high aggregate bandwidth across the
cluster.
The MapReduce framework consists of a single master Job Tracker and one slave Task
Tracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on
the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as
directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce functions
via implementations of appropriate interfaces and/or abstract-classes. These, and other job
parameters, comprise the job configuration. The Hadoop job client then submits the job
(jar/executable etc.) and configuration to the Job Tracker which then assumes the responsibility
of distributing the software/configuration to the slaves, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.
Hadoop Streaming is a utility which allows users to create and run jobs with any executables
(e.g. shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (not JNI based).
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
The following sections describe each part of a MapReduce job in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial.
Let us first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods. We will then discuss other core interfaces including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter and others.
Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.
Payload
Applications typically implement the Mapper and Reducer interfaces to provide
the map and reduce methods. These form the core of the job.
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input records. A given
input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Overall, Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup.
Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs. Output pairs are collected with calls to OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
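As an illustration, a Mapper written against the old org.apache.hadoop.mapred interfaces for the max-temperature example used earlier in this unit might look roughly like this; it is a sketch, not the exact code from these notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Maps one NCDC record to (year, temperature); invalid readings are dropped.
public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92).replace("+", ""));
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}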
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed, and the CompressionCodec to be used, via the JobConf.
The right level of parallelism for maps seems to be around 10–100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with about 82,000 maps (10 TB / 128 MB ≈ 81,920), unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).
Overall, Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup.
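A matching Reducer, again sketched in the old mapred API and only for illustration, picks the maximum of the grouped values:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reduces (year, [temperatures]) to (year, maximum temperature).
public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}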
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the
relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.
Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are sorted, these can be used in conjunction to simulate a secondary sort on values.
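As a sketch, the two settings are wired up on the JobConf as below; CompositeKeyComparator and NaturalKeyGroupingComparator are hypothetical, application-supplied RawComparator implementations, named here only to show which call controls which behaviour.

// Hypothetical secondary-sort wiring on an org.apache.hadoop.mapred.JobConf.
conf.setOutputKeyComparatorClass(CompositeKeyComparator.class);            // full sort order of the composite keys
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class); // how keys are grouped for each reduce() call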
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the file system via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and
update Counters, or just indicate that they are alive.
Partitioner
Partitioner partitions the key space. Partitioner controls the partitioning of the keys of the
intermediate map-outputs. The key (or a subset of the key) is used to derive the partition,
typically by a hash function. The total number of partitions is the same as the number of reduce
tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and
hence the record) is sent to for reduction.
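For illustration, a custom Partitioner in the old mapred API might look like this sketch; it routes records by the hash of the key, which is essentially what the default partitioner does.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class YearPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // no configuration needed for this simple example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // mask off the sign bit so the partition number is never negative
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}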
HashPartitioner is the default Partitioner.
Reporter
Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
Mapper and Reducer implementations can use the Reporter to report progress or just indicate that
they are alive. In scenarios where the application takes a significant amount of time to process
individual key/value pairs, this is crucial since the framework might assume that the task has
timed-out and kill that task. Another way to avoid this is to set the configuration
parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs).
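Both options can be sketched as follows; this is an illustrative fragment, not tied to any particular job in these notes.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

public class KeepAlive {
  // Call from inside a long-running map()/reduce() loop: tells the framework "still alive".
  static void heartbeat(Reporter reporter) {
    reporter.progress();
  }

  // Or, in the driver, raise (or disable) the task timeout altogether.
  static void disableTimeout(JobConf conf) {
    conf.setLong("mapred.task.timeout", 0L);  // 0 means no timeout
  }
}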
OutputCollector – OutputCollector is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.
Map
The input data is first split into smaller blocks. Each block is then assigned to a mapper
for processing.
For example, if a file has 100 records to be processed, 100 mappers can run together to
process one record each. Or maybe 50 mappers can run together to process two records
each. The Hadoop framework decides how many mappers to use, based on the size of the
data to be processed and the memory block available on each mapper server.
Reduce
After all the mappers complete processing, the framework shuffles and sorts the results
before passing them on to the reducers. A reducer cannot start while a mapper is still in
progress. All the map output values that have the same key are assigned to a single
reducer, which then aggregates the values for that key.
Combine is an optional process. The combiner is a reducer that runs individually on each
mapper server. It reduces the data on each mapper further to a simplified form before
passing it downstream.
This makes shuffling and sorting easier as there is less data to work with. Often, the combiner class is set to the reducer class itself, which is possible when the reduce function is commutative and associative. However, if needed, the combiner can be a separate class as well.
Partition is the process that translates the <key, value> pairs resulting from mappers to another
set of <key, value> pairs to feed into the reducer. It decides how the data has to be presented to
the reducer and also assigns it to a particular reducer.
The default partitioner computes a hash value for the key produced by the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer.
A MapReduce Example
Consider an ecommerce system that receives a million requests every day to process payments.
There may be several exceptions thrown during these requests such as “payment declined by a
payment gateway,” “out of inventory,” and “invalid address.” A developer wants to analyze last
four days’ logs to understand which exception is thrown how many times.
Map
For simplification, let’s assume that the Hadoop framework runs just four mappers. Mapper 1,
Mapper 2, Mapper 3, and Mapper 4.
The value input to the mapper is one record of the log file. The key could be a text string such as “file name + line number.” The mapper then processes each record of the log file to produce key-value pairs, using ‘1’ as a filler value. The output from the mappers looks like this:
Mapper 1 -> <Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>,
<Exception A, 1>
Mapper 2 -> <Exception B, 1>, <Exception B, 1>, <Exception A, 1>, <Exception A, 1>
Mapper 3 -> <Exception A, 1>, <Exception C, 1>, <Exception A, 1>, <Exception B, 1>,
<Exception A, 1>
Mapper 4 -> <Exception B, 1>, <Exception C, 1>, <Exception C, 1>, <Exception A, 1>
Assuming that there is a combiner running on each mapper—Combiner 1 … Combiner 4—that
calculates the count of each exception (which is the same function as the reducer), the input to
Combiner 1 will be:
<Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>, <Exception A, 1>
Combine
The output of Combiner 1 will be:
<Exception A, 3>, <Exception B, 1>, <Exception C, 1>
The output from the other combiners will be:
Combiner 2: <Exception A, 2> <Exception B, 2>
Combiner 3: <Exception A, 3> <Exception B, 1> <Exception C, 1>
Combiner 4: <Exception A, 1> <Exception B, 1> <Exception C, 2>
Partition
After this, the partitioner allocates the data from the combiners to the reducers. The data is also
sorted for the reducer.
The input to the reducers will be as below:
Reducer 1: <Exception A> {3,2,3,1}
Reducer 2: <Exception B> {1,2,1,1}
Reducer 3: <Exception C> {1,1,2}
If there were no combiners involved, the input to the reducers will be as below:
Reducer 1: <Exception A> {1,1,1,1,1,1,1,1,1}
Reducer 2: <Exception B> {1,1,1,1,1}
Reducer 3: <Exception C> {1,1,1,1}
Here, the example is a simple one, but when there are terabytes of data involved, the bandwidth saved by the combiner is significant.
Reduce
Now, each reducer just calculates the total count of the exceptions as:
Reducer 1: <Exception A, 9>
Reducer 2: <Exception B, 5>
Reducer 3: <Exception C, 4>
The data shows that Exception A is thrown more often than others and requires more attention.
When weeks’ or months’ worth of data has to be processed together, the potential of the MapReduce program can be truly exploited.
The parameters (the MapReduce class name, the Map, Reduce and Combiner classes, the input and output types, and the input and output file paths) are all defined in the main function. The Mapper class extends MapReduceBase and implements the Mapper interface. The Reducer class extends MapReduceBase and implements the Reducer interface.
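For concreteness, a driver along those lines, sketched in the old JobConf/JobClient API and reusing the mapper and reducer sketched earlier, could look like this (input and output paths come from the command line):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Driver: wires the mapper, combiner and reducer together and submits the job.
public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);  // combiner reuses the reducer
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);  // blocks until the job finishes
  }
}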
Map reduce problem
Specifically, for MapReduce, Talend Studio makes it easier to create jobs that can run on the Hadoop cluster and to set parameters such as the mapper and reducer classes and the input and output formats. Once created (as a Hadoop job), it can be deployed as a service, executable, or stand-alone job that runs natively on the big data cluster. It spawns one or more Hadoop MapReduce jobs that, in turn, execute the MapReduce algorithm.
Before running a MapReduce job, the Hadoop connection needs to be configured. For more details on how to use Talend for setting up MapReduce jobs, refer to the Talend tutorials.
Leveraging Map Reduce To Solve Big Data Problems
The Map Reduce programming paradigm can be used with any complex problem that can
be solved through parallelization.
A social media site could use it to determine how many new sign-ups it received over the
past month from different countries, to gauge its increasing popularity among different
geographies. A trading firm could perform its batch reconciliations faster and also
determine which scenarios often cause trades to break. Search engines could determine
page views, and marketers could perform sentiment analysis using Map Reduce.
This section briefly sketches the life cycle of a MapReduce job and the roles of the primary
actors in the life cycle. The full life cycle is much more complex. For details, refer to the
documentation for your Hadoop distribution or the Apache Hadoop MapReduce documentation.
Though other configurations are possible, a common Hadoop cluster configuration is a single
master node where the Job Tracker runs, and multiple worker nodes, each running a Task
Tracker. The Job Tracker node can also be a worker node.
1. The local Job Client prepares the job for submission and hands it off to the Job Tracker.
2. The Job Tracker schedules the job and distributes the map work among the Task Trackers for
parallel processing.
3. Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the
Task Trackers.
4. As map results become available, the Job Tracker distributes the reduce work among the Task
Trackers for parallel processing.
5. Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives
progress information from the Task Trackers.
All map tasks do not have to complete before reduce tasks begin running. Reduce tasks can
begin as soon as map tasks begin completing. Thus, the map and reduce steps often overlap.
Job Client
The Job Client prepares a job for execution. When you submit a MapReduce job to Hadoop, the local Job Client validates the job configuration, computes the input splits, copies the job resources to a shared location accessible to the Job Tracker, and submits the job to the Job Tracker.
Job Tracker
The Job Tracker is responsible for scheduling jobs, dividing a job into map and reduce tasks,
distributing map and reduce tasks among worker nodes, task failure recovery, and tracking the
job status. Job scheduling and failure recovery are not discussed here; see the documentation for
your Hadoop distribution or the Apache Hadoop MapReduce documentation. When a job is submitted, the Job Tracker:
1. Fetches input splits from the shared location where the Job Client placed the information.
2. Creates a map task for each split.
3. Assigns each map task to a Task Tracker (worker node).
The Job Tracker monitors the health of the Task Trackers and the progress of the job. As map tasks complete and results become available, the Job Tracker creates reduce tasks and assigns them to Task Trackers in the same way.
A job is complete when all map and reduce tasks successfully complete, or, if there is no reduce
step, when all map tasks successfully complete.
Task Tracker
A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker.
Often, the Task Tracker runs on the associated worker node, but it is not required to be on the
same host.
The tasks spawned by the Task Tracker run the job’s map or reduce functions.
Map Task
The Hadoop MapReduce framework creates a map task to process each input split. The map
task:
1. Uses the Input Format to fetch the input data locally and create input key-value pairs.
2. Applies the job-supplied map function to each key-value pair.
3. Performs local sorting and aggregation of the results.
4. If the job includes a Combiner, runs the Combiner for further aggregation.
5. Stores the results locally, in memory and on the local file system.
6. Communicates progress and status to the Task Tracker.
Map task results undergo a local sort by key to prepare the data for consumption by reduce tasks.
If a Combiner is configured for the job, it also runs in the map task. A Combiner consolidates the
data in an application-specific way, reducing the amount of data that must be transferred to
reduce tasks. For example, a Combiner might compute a local maximum value for a key and
discard the rest of the values. The details of how map tasks manage, sort, and shuffle results are
not covered here. See the documentation for your Hadoop distribution or the Apache Hadoop
MapReduce documentation.
When a map task notifies the Task Tracker of completion, the Task Tracker notifies the Job
Tracker. The Job Tracker then makes the results available to reduce tasks.
Reduce Task
The reduce phase aggregates the results from the map phase into final results. Usually, the final
result set is smaller than the input set, but this is application dependent. The reduction is carried
out by parallel reduce tasks. The reduce input keys and values need not have the same type as the
output keys and values. The reduce phase is optional. You may configure a job to stop after the
map phase completes. For details, see Configuring a Map-Only Job.
Reduce is carried out in three phases: copy, sort, and merge. A reduce task copies the relevant map outputs, merge-sorts them, applies the job-supplied reduce function, and writes the results to the configured output destination.
The input to a reduce function is key-value pairs where the value is a list of values sharing the
same key. For example, if one map task produces a key-value pair (eat, 2) and another map task
produces the pair (eat, 1), then these pairs are consolidated into (eat, (2, 1)) for input to the
reduce function. If the purpose of the reduce phase is to compute a sum of all the values for each
key, then the final output key-value pair for this input is (eat, 3). For a more complete example,
see Example: Calculating Word Occurrences.
Output from the reduce phase is saved to the destination configured for the job, such as HDFS or MarkLogic Server. Reduce tasks use an OutputFormat subclass to record results. The Hadoop API provides OutputFormat subclasses for using HDFS as the output destination. The MarkLogic Connector for Hadoop provides OutputFormat subclasses for using a MarkLogic Server database as the destination. For a list of available subclasses, see Output Format Subclasses. The connector also provides classes for defining key and value types; see MarkLogic-Specific Key and Value Types.
1. Scalability
Apache Hadoop is a highly scalable framework, because of its ability to store and distribute huge data sets across plenty of servers. These servers are inexpensive and can operate in parallel, and we can easily scale the storage and computation power by adding servers to the cluster. Hadoop MapReduce programming enables organizations to run applications across large sets of nodes, involving the use of thousands of terabytes of data.
3. Security and authentication
The MapReduce programming model uses HBase and HDFS security platform that allows access
only to the authenticated users to operate on the data. Thus, it protects unauthorized access to
system data and enhances system security.
4. Cost-effective solution
Hadoop’s scalable architecture with the MapReduce programming framework allows the storage
and processing of large data sets in a very affordable manner.
5. Fast
Hadoop uses a distributed storage method, the Hadoop Distributed File System, which basically implements a mapping system for locating data in a cluster. The tools used for data processing, such as MapReduce programming, are generally located on the very same servers, which allows for faster processing of data.
So, even if we are dealing with large volumes of unstructured data, Hadoop MapReduce takes just minutes to process terabytes of data, and it can process petabytes of data in just an hour.
6. Simple model of programming
Amongst the various features of Hadoop MapReduce, one of the most important is that it is based on a simple programming model. Basically, this allows programmers to develop MapReduce programs that can handle tasks easily and efficiently.
The MapReduce programs can be written in Java, which is not very hard to pick up and is also
used widely. So, anyone can easily learn and write MapReduce programs and meet their data
processing needs.
7. Parallel Programming
One of the major aspects of the working of MapReduce programming is its parallel processing. It
divides the tasks in a manner that allows their execution in parallel.
The parallel processing allows multiple processors to execute these divided tasks. So the entire
program is run in less time.
8. Availability and resilient nature
Whenever the data is sent to an individual node, the same set of data is forwarded to some other nodes in the cluster. So, if any particular node suffers a failure, there are always other copies present on other nodes that can still be accessed whenever needed. This assures high availability of data.
One of the major features offered by Apache Hadoop is its fault tolerance. The Hadoop MapReduce framework has the ability to quickly recognize faults that occur and then apply a quick and automatic recovery solution. This feature makes it a game-changer in the world of big data processing.
The data set that is the basis for our programming exercise (shown in the original slide image) has a total of 10 fields of information in each line. Our programming objective uses only the first and fourth fields, which are arbitrarily called "year" and "delta" respectively; we will ignore all the other fields. We will need to define a Mapper class, a Reducer class and a Driver class.
The number of map tasks depends on the total number of blocks of the input files. In MapReduce, the right level of parallelism for maps seems to be around 10–100 maps per node, although up to 300 maps have been used for CPU-light map tasks.
For example, if we have a block size of 128 MB and expect 10 TB of input data, the job produces about 82,000 maps (10 TB / 128 MB ≈ 81,920).
(Figure: Reducer class code)
(Figure: Driver class code, parts 1 and 2)
In the Driver class, we also define the types for the output key and value in the job as Text and FloatWritable respectively. If the mapper and reducer classes do NOT use the same output key and value types, we must specify them for the mapper explicitly. In this case, the output value type of the mapper is Text, while the output value type of the reducer is FloatWritable.
There are two ways to launch the job – synchronously and asynchronously. The job's waitForCompletion() method launches the job synchronously: the driver code blocks, waiting for the job to complete at this line. The true argument informs the framework to write verbose output to the controlling terminal of the job.
The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job. We then call the static ToolRunner.run() method.
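A driver matching this description might look roughly as follows. This is a hypothetical sketch in the newer org.apache.hadoop.mapreduce API; the class names DeltaDriver, DeltaMapper and DeltaReducer are placeholders, not the code from the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeltaDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "year delta");
    job.setJarByClass(DeltaDriver.class);

    job.setInputFormatClass(TextInputFormat.class);   // one line per map() call
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(DeltaMapper.class);            // placeholder mapper class
    job.setReducerClass(DeltaReducer.class);          // placeholder reducer class

    // Mapper and reducer emit different value types, so both must be declared.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);

    // Launch synchronously; 'true' asks for verbose progress output.
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new DeltaDriver(), args));
  }
}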
You have to compile the three classes and place the compiled classes into a directory called “classes”. Use the jar command to put the mapper and reducer classes into a jar file, the path to which is included in the classpath when you build the driver. After you build the driver, the driver class is also added to the existing jar file.
Make sure that you delete the reduce output directory before you execute the MapReduce program.
Build jar
Examine output
Driver code
In the driver class, we set the configuration of our MapReduce job to run in Hadoop. We specify the name of the job and the data types of the input/output of the mapper and reducer.
The method setInputFormatClass() is used to specify how a Mapper will read the input data, i.e., what the unit of work will be. Here, we have chosen TextInputFormat so that the mapper reads a single line at a time from the input text file.
The main () method is the entry point for the driver. In this method, we instantiate a new
Configuration object for the job.
In this MapReduce tutorial, we are going to learn the concept of a key-value pair in Hadoop. The key-value pair is the record entity that a MapReduce job receives for execution. By default, the RecordReader uses TextInputFormat for converting data into key-value pairs. Here we will learn what a key-value pair in MapReduce is, how key-value pairs are generated in Hadoop using the InputSplit and RecordReader, and on what basis key-value pairs are generated in Hadoop MapReduce. We will also see a Hadoop key-value pair example in this tutorial.
Apache Hadoop is used mainly for data analysis, in which we apply statistical and logical techniques to describe, illustrate and evaluate data. Hadoop deals with structured, unstructured and semi-structured data. When the schema is static we can work directly on the columns; when the schema is not static, we work on keys and values. Keys and values are not intrinsic properties of the data; they are chosen by the user analyzing the data.
MapReduce is the core component of Hadoop that provides data processing. Hadoop MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). MapReduce works by breaking the processing into two phases: the Map phase and the Reduce phase. Each phase has key-value pairs as input and output.
3. Key-value pair Generation in MapReduce
Let us now see how key-value pairs are generated in Hadoop MapReduce. In the MapReduce process, before passing the data to the mapper, the data must first be converted into key-value pairs, as the mapper only understands key-value pairs of data.
a. Map Input
Map input by default takes the line offset as the key and the content of the line as the value (as Text). These defaults can be modified by using a custom InputFormat.
b. Map Output
The map's basic responsibility is to filter the data and to provide the environment for grouping the data based on the key.
Key – It will be the field/ text/ object on which the data has to be grouped and aggregated on
the reducer side.
Value – It will be the field/ text/ object which is to be handled by each individual reduce
method.
c. Reduce Input
The output of Map is the input for reduce, so it is same as Map-Output.
d. Reduce Output
It depends on the required output.
Suppose the content of a file stored in HDFS is “John is Mark Joey is John”. Using the InputFormat, we define how this file will be split and read. By default, the RecordReader uses TextInputFormat to convert this file into key-value pairs.
Key – It is offset of the beginning of the line within the file.
Value – It is the content of the line, excluding line terminators.
For the above content of the file:
Key is 0
Value is “John is Mark Joey is John”.
Hadoop Input Format, Types of Input Format in MapReduce
1. Objective
Hadoop InputFormat checks the input specification of the job. The InputFormat splits the input file into InputSplits and assigns each split to an individual Mapper. In this Hadoop InputFormat tutorial, we will learn what InputFormat is in Hadoop MapReduce, the different methods used to get the data to the mapper, and the different types of InputFormat in Hadoop, such as FileInputFormat, TextInputFormat, KeyValueTextInputFormat, etc. We will also see what the default InputFormat in Hadoop is.
How the input files are split up and read in Hadoop is defined by the InputFormat. A Hadoop InputFormat is the first component in MapReduce; it is responsible for creating the input splits and dividing them into records.
Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. Although the format of these files is arbitrary, line-based log files and binary formats can be used. Using the InputFormat we define how these input files are split and read. The InputFormat class is one of the fundamental classes in the Hadoop MapReduce framework, and it provides the following functionality:
The files or other objects that should be used for input are selected by the InputFormat.
The InputFormat defines the data splits, which determine both the size of individual map tasks and their potential execution servers.
The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
How do we get the data to the mapper?
We have two methods to get the data to the mapper in MapReduce: getSplits() and createRecordReader(), as shown below:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
Types of Input Format
Let us now see the types of InputFormat in Hadoop.
File Input Format
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies the input directory where the data files are located. When we start a Hadoop job, FileInputFormat is provided with a path containing the files to read. FileInputFormat will read all the files and divide them into one or more InputSplits.
Text Input Format
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files.
Key – It is the byte offset of the beginning of the line within the file (not whole file just one
split), so it will be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators.
Key Value Text Input Format
It is similar to TextInputFormat as it also treats each line of input as a separate record. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line itself into key and value at a tab character (‘\t’). Here the key is everything up to the tab character, while the value is the remaining part of the line after the tab character.
Sequence File Input Format
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Here both key and value are user-defined.
Sequence File as Text Input Format
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which converts the sequence file keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This InputFormat makes sequence files suitable input for streaming.
Sequence File as Binary Input Format
Hadoop Sequence File as Binary Input Format is a Sequence File Input Format using which
we can extract the sequence file’s keys and values as an opaque binary object.
N Line Input Format
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, depending on the size of the split and the length of the lines. If we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.
N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives exactly one line of input. If N=2, then each split contains two lines: one mapper will receive the first two key-value pairs and another mapper will receive the next two key-value pairs.
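Assuming the newer mapreduce API, N can be set per job roughly like this (the value 100 is only an example):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineConfig {
  // Sketch: give every mapper exactly 100 input lines.
  static void useNLineInput(Job job) {
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 100);
    // same effect as setting mapreduce.input.lineinputformat.linespermap = 100
  }
}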
DB Input Format
Hadoop DBInputFormat is an InputFormat that reads data from a relational database using JDBC. As it doesn’t have partitioning capabilities, we need to be careful not to swamp the database we are reading from with too many mappers. So it is best for loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the key is LongWritable while the value is DBWritable.
In this Hadoop MapReduce tutorial, we will provide a detailed description of the InputSplit in Hadoop. In this blog, we will try to answer what a Hadoop InputSplit is, why InputSplits are needed in MapReduce, how Hadoop performs the InputSplit, and how to change the split size in Hadoop. We will also learn the difference between InputSplits and blocks in HDFS.
What is Input Split in Hadoop?
Input Split in Hadoop MapReduce is the logical representation of data. It describes a unit of
work that contains a single map task in a MapReduce program.
A Hadoop InputSplit represents the data which is processed by an individual Mapper. The split is divided into records, and the mapper processes each record (which is a key-value pair).
The MapReduce InputSplit length is measured in bytes, and every InputSplit has storage locations (hostname strings). The MapReduce system uses the storage locations to place map tasks as close to the split's data as possible. Map tasks are processed in the order of the size of the splits so that the largest one gets processed first (a greedy approximation algorithm); this is done to minimize the job runtime. The important thing to notice is that the InputSplit does not contain the input data; it is just a reference to the data.
As a user, we don’t need to deal with InputSplits directly, because they are created by an InputFormat (the InputFormat creates the InputSplits and divides them into records). FileInputFormat, by default, breaks a file into 128 MB chunks (the same as blocks in HDFS); by setting the mapred.min.split.size parameter in mapred-site.xml we can control this value, or we can override the parameter in the Job object used to submit a particular MapReduce job. We can also control how the file is broken up into splits by writing a custom InputFormat.
3. How to change split size in Hadoop?
Input Split in Hadoop is user defined. User can control split size according to the size of data in
MapReduce program. Thus the number of map tasks is equal to the number of Input Splits.
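One way to control the split size from code is sketched below, using the newer mapreduce API; the 256 MB figure is only an example. Because the split size is computed as max(minSize, min(maxSize, blockSize)), pinning both bounds to the same value fixes the split size.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
  // Sketch: make each split (and hence each map task) cover 256 MB.
  static void useLargerSplits(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}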
The client (running the job) can calculate the splits for a job by calling getSplits(); the splits are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster. The map task then passes the split to the createRecordReader() method on the InputFormat to get a RecordReader for the split, and the RecordReader generates records (key-value pairs), which it passes to the map function.
Hadoop RecordReader – How RecordReader Works in Hadoop?
Hadoop RecordReader Tutorial – Objective
In this Hadoop RecordReader tutorial, we are going to discuss an important concept of Hadoop MapReduce: the RecordReader. The MapReduce RecordReader in Hadoop takes the byte-oriented view of input provided by the InputSplit and presents a record-oriented view to the Mapper. It uses the data within the boundaries that were created by the InputSplit and creates key-value pairs. This blog will answer what a RecordReader in Hadoop is, how the Hadoop RecordReader works, the types of Hadoop RecordReader (SequenceFileRecordReader and LineRecordReader), and the maximum size of a record in Hadoop.
To understand the RecordReader in Hadoop, we need to understand the Hadoop data flow. MapReduce has a simple model of data processing: the inputs and outputs of the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Before processing, the framework needs to know which data to process; this is achieved with the InputFormat class. The InputFormat is the class which selects the files from HDFS that should be input to the map function. An InputFormat is also responsible for creating the InputSplits and dividing them into records. The data is divided into splits (typically 64 MB or 128 MB) in HDFS; such an input split is the input processed by a single map.
The InputFormat class calls the getSplits() function and computes the splits for each file, and then sends them to the Job Tracker, which uses their storage locations to schedule map tasks to process them on the Task Trackers. The map task then passes the split to the createRecordReader() method on the InputFormat in the Task Tracker to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
Hadoop RecordReader uses the data within the boundaries that are being created by the input
split and creates Key-value pairs for the mapper. The “start” is the byte position in the file where
the RecordReader should start generating key/value pairs and the “end” is where it should stop
reading records. In Hadoop RecordReader, the data is loaded from its source and then the data is
converted into key-value pairs suitable for reading by the Mapper.
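Only the tail of the original listing survives in these notes; it belongs to the run() method of the newer-API Mapper, whose standard form (recent Hadoop versions wrap the loop in a try/finally) is:

// Standard run() loop of org.apache.hadoop.mapreduce.Mapper, shown for reference.
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}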
After running setup(), nextKeyValue() is called repeatedly on the context to populate the key and value objects for the mapper. The key and value are retrieved from the RecordReader by way of the context and passed to the map() method to do its work. The input to the map function, a key-value pair (K, V), gets processed as per the logic mentioned in the map code. When the reader reaches the end of the records, the nextKeyValue() method returns false.
A RecordReader usually stays within the boundaries created by the input split to generate key-value pairs, but this is not mandatory. A custom implementation can even read data outside of the input split, though this is not encouraged.
The RecordReader instance to use is defined by the InputFormat. By default, TextInputFormat is used for converting data into key-value pairs. The two commonly used RecordReaders are the LineRecordReader (provided by TextInputFormat) and the SequenceFileRecordReader (provided by SequenceFileInputFormat):
I. Line RecordReader
The LineRecordReader in Hadoop is the default RecordReader that TextInputFormat provides; it treats each line of the input file as a new value, and the associated key is the byte offset. The LineRecordReader always skips the first line in the split (or part of it) if it is not the first split, and it reads one line beyond the boundary of the split at the end (if data is available, i.e., it is not the last split).
There is a maximum size allowed for a single record to be processed. This value can be set using the parameter below.
conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
A line with a size greater than this maximum value (default is 2,147,483,647) will be ignored.
Hadoop Partitioner – Internals of MapReduce Partitioner
Hadoop Partitioner / MapReduce Partitioner
Partitioning of the keys of the intermediate map output is controlled by the Partitioner. The key (or a subset of the key) is used to derive the partition, typically by a hash function. Each mapper's output is partitioned according to the key: records having the same key value go into the same partition (within each mapper), and then each partition is sent to a reducer. The Partitioner class determines which partition a given (key, value) pair will go to. The partition phase takes place after the map phase and before the reduce phase. Let us move ahead with the need for the Hadoop Partitioner.
The default Hadoop partitioner in Hadoop MapReduce is the HashPartitioner, which computes a hash value for the key and assigns the partition based on this result.
The total number of partitioners that run in Hadoop is equal to the number of reducers, i.e., the Partitioner will divide the data according to the number of reducers, which is set by the JobConf.setNumReduceTasks() method. Thus, the data from a single partition is processed by a single reducer. The partitioner is created only when there are multiple reducers.
If one key appears in the input data far more often than any other key, all of its records go to the same partition, overloading a single reducer; this is known as poor partitioning.
How to overcome poor partitioning in MapReduce?
To overcome poor partitioning in Hadoop MapReduce, we can create a custom partitioner, which allows the workload to be shared uniformly across the different reducers.
The Hadoop Combiner is also known as a “mini-reducer”; it summarizes the Mapper output records with the same key before they are passed to the Reducer. In this tutorial on the MapReduce combiner, we are going to answer what a Hadoop combiner is, how a MapReduce program behaves with and without a combiner, and the advantages and disadvantages of the combiner in Hadoop.
When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate data, and this intermediate data is passed on to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a function known as the Hadoop Combiner that plays a key role in reducing this network congestion.
We have already seen what the mapper and the reducer are in Hadoop MapReduce. The next step is to learn about the Hadoop MapReduce combiner.
The combiner in MapReduce is also known as ‘Mini-reducer’. The primary job of Combiner is to
process the output data from the Mapper, before passing it to Reducer. It runs after the mapper
and before the Reducer and its use is optional.
Let us now see the working of the Hadoop combiner in MapReduce, and how things change when a combiner is used compared to when it is not.
MapReduce Combiner: MapReduce program with combiner
In the example shown in the figure, the reducer now needs to process only 4 key-value pairs, generated by the 2 combiners. The reducer therefore gets executed only 4 times to produce the final output, which increases the overall performance.
Having discussed what the Hadoop MapReduce combiner is in detail, we will now discuss some advantages of the MapReduce combiner.
Hadoop Combiner reduces the time taken for data transfer between mapper and reducer.
It decreases the amount of data that needed to be processed by the reducer.
The Combiner improves the overall performance of the reducer.
5. Disadvantages of Hadoop combiner in MapReduce
There are also some disadvantages of the Hadoop Combiner. Let’s discuss them one by one:
MapReduce jobs cannot depend on the Hadoop combiner being executed, because there is no guarantee that it will run.
Hadoop may store the intermediate key-value pairs in the local file system and run the combiner later, which causes expensive disk I/O.
Shuffling and Sorting in Hadoop MapReduce
In Hadoop, the process by which the intermediate output from the mappers is transferred to the reducers is called shuffling. Each reducer gets one or more keys and their associated values, depending on the number of reducers. The intermediate key-value pairs generated by the mapper are sorted automatically by key. In this blog, we will discuss shuffling and sorting in Hadoop MapReduce in detail.
Here we will learn what is sorting in Hadoop, what is shuffling in Hadoop, what is the purpose of
Shuffling and sorting phase in MapReduce, how MapReduce shuffle works and how MapReduce
sort works.
Before we start with shuffle and sort in MapReduce, let us revise the other phases of MapReduce: the mapper, reducer, combiner, partitioner and input format. The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer in MapReduce. The sort phase in MapReduce covers the merging and sorting of the map outputs. Data from the mapper are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key. The shuffle and sort phases in Hadoop occur simultaneously and are done by the MapReduce framework.
Shuffling in MapReduce
The process of transferring data from the mappers to the reducers is known as shuffling, i.e., the process by which the system performs the sort and transfers the map output to the reducer as input. The MapReduce shuffle phase is thus necessary for the reducers; otherwise, they would not have any input (or input from every mapper). As shuffling can start even before the map phase has finished, this saves some time and completes the job in less time.
Sorting in MapReduce
The keys generated by the mapper are automatically sorted by MapReduce Framework, i.e.
before starting of reducer, all intermediate key-value pairs in MapReduce that are generated by
mapper get sorted by key and not by value. Values passed to each reducer are not sorted; they
can be in any order.
Sorting in Hadoop helps the reducer to easily distinguish when a new reduce task should start. This saves time for the reducer: the reducer starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes key-value pairs as input and generates key-value pairs as output.
Note that shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
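Configuring a map-only job is a one-line setting (sketch, newer API):

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyConfig {
  // Zero reducers: shuffle and sort are skipped and map output goes straight to the OutputFormat.
  static void makeMapOnly(Job job) {
    job.setNumReduceTasks(0);
  }
}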
If we want to sort reducer’s values, then the secondary sorting technique is used as it enables us
to sort the values (in ascending or descending order) passed to each reducer.
The Hadoop OutputFormat checks the output specification of the job. It determines how the RecordWriter implementation is used to write the output to the output files. In this blog, we are going to see what the Hadoop OutputFormat is, what the Hadoop RecordWriter is, and how the RecordWriter is used in Hadoop.
The Hadoop RecordWriter takes output data from the Reducer and writes this data to output files. The way these output key-value pairs are written to the output files by the RecordWriter is determined by the OutputFormat. The OutputFormat and InputFormat functions are alike. OutputFormat instances provided by Hadoop are used to write to files on HDFS or the local disk. OutputFormat describes the output specification for a MapReduce job. On the basis of the output specification:
The MapReduce job checks that the output directory does not already exist.
Output Format provides the Record Writer implementation to be used to write the output
files of the job. The output files are stored in a file system.
The FileOutputFormat.setOutputPath() method is used to set the output directory. Every reducer
writes a separate file in a common output directory.
There are various types of Hadoop Output Format. Let us see some of them
below:
Types of Hadoop Output Formats
i. Text Output Format
The default Hadoop reducer Output Format is TextOutputFormat, which writes (key, value)
pairs on individual lines of text files. Its keys and values can be of any type, since
TextOutputFormat turns them into strings by calling toString() on them. Each key-value pair is
separated by a tab character, which can be changed using the
mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat is used for
reading these output text files back, since it breaks lines into key-value pairs based on a
configurable separator.
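For illustration, a minimal driver fragment (the job name and output path are assumptions) that keeps the default TextOutputFormat but changes the separator to a comma and sets the output directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class TextOutputConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Separate key and value by a comma instead of the default tab.
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "text output example");
        job.setOutputFormatClass(TextOutputFormat.class); // the default, shown explicitly
        // Every reducer writes its own part-r-xxxxx file under this directory.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        // ... set mapper, reducer and input path, then submit the job as usual.
      }
    }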
ii. Sequence File Output Format
SequenceFileOutputFormat writes sequence files as its output. It is a useful intermediate format
between chained MapReduce jobs, because it rapidly serializes arbitrary data types to the file,
and the corresponding SequenceFileInputFormat will deserialize the file into the same types and
present the data to the next mapper in the same manner as it was emitted by the previous
reducer. Sequence files are also compact and readily compressible; compression is controlled by
the static methods on SequenceFileOutputFormat.
iv. Map File Output Format
MapFileOutputFormat is another form of FileOutputFormat, which writes the output as map
files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits
keys in sorted order.
v. Multiple Outputs
It allows writing data to files whose names are derived from the output keys and values, or in
fact from an arbitrary string.
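A hedged sketch of a reducer that uses MultipleOutputs to name its output files after the key (word-count-style types are assumed for the example; the driver still sets the job's normal output format and path):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class MultiOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private MultipleOutputs<Text, IntWritable> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        // The third argument is the base name of the output file, derived here from
        // the key itself, e.g. counts for "apple" go to files named apple-r-xxxxx.
        mos.write(key, new IntWritable(sum), key.toString());
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
      }
    }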
Block in HDFS
A block is a contiguous location on the hard drive where data is stored. In general, a file system
stores data as a collection of blocks. In the same way, HDFS stores each file as blocks, and
HDFS distributes the blocks of a file across multiple nodes.
Input Split in Hadoop
The data to be processed by an individual Mapper is represented by an Input Split. The split is
divided into records, and each record (a key-value pair) is processed by the map function. The
number of map tasks is equal to the number of Input Splits.
Initially, the data for a MapReduce task is stored in input files, and the input files typically reside
in HDFS. The Input Format defines how these input files are split and read, and it is responsible
for creating the Input Splits.
Let’s discuss a feature-wise comparison between MapReduce Input Splits and blocks.
i. Size of HDFS Block vs. Input Split
Block – The default size of an HDFS block is 128 MB, which we can configure as per our
requirement. All blocks of a file are the same size except the last block, which can be the
same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop
file system.
Input Split – By default, the split size is approximately equal to the block size. The Input Split
is user defined, and the user can control the split size in the MapReduce program based on the
size of the data (a short sketch follows the next comparison point).
ii. Data Representation in Hadoop Blocks vs. Input Split
Block – It is the physical representation of data. It contains the minimum amount of data that
can be read or written.
Input Split – It is the logical representation of the data present in the blocks. It is used during
data processing in a MapReduce program or other processing techniques. An Input Split doesn’t
contain the actual data, only a reference to the data.
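Following up on the point that the split size is user controllable, here is a minimal sketch of bounding the split size from the driver (the 256 MB figure is only an example value):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfig {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size example");
        // Ask for splits of roughly 256 MB instead of one split per 128 MB block;
        // this changes only the logical splits, not the physical blocks on disk.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        // ... set input path, mapper, etc., then submit as usual.
      }
    }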
iii. Example of Block vs. Input Split in Hadoop
Suppose a file’s records cross block boundaries, so that a record which begins near the end of
the first block continues into the second block. Now, if we try to run the MapReduce operation
on the blocks directly, it will not work, because the second block is incomplete. This problem is
solved by the Input Split: the Input Split forms a logical grouping over the blocks, because it
includes the location of the next block and the byte offset of the data needed to complete the
record.
Map Only Job in Hadoop MapReduce with example
In Hadoop, a Map-Only job is one in which the mapper does all the work, no work is done by
the reducer, and the mapper’s output is the final output. In this section on Map-Only jobs in
Hadoop MapReduce, we will learn about the MapReduce process, the need for Map-Only jobs
in Hadoop, and how to set the number of reducers to 0 for a Hadoop Map-Only job.
We will also learn the advantages of Map-Only jobs in Hadoop MapReduce and how processing
works in Hadoop without a reducer, along with a MapReduce example without any reducer.
What is Map Only Job in Hadoop MapReduce?
Now, let us consider a scenario where we just need to perform an operation and no aggregation
is required. In such a case, we prefer a Map-Only job in Hadoop. In a Hadoop Map-Only job,
the map does all the work on its Input Split and no work is done by the reducer. Here the map
output is the final output.
How to avoid Reduce Phase in Hadoop?
We can achieve this by calling job.setNumReduceTasks(0) in the driver’s configuration. This
sets the number of reducers to 0, and thus the mapper alone performs the complete task.
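A minimal sketch of a Map-Only driver (it uses the identity Mapper purely for illustration and assumes the default TextInputFormat, hence the LongWritable/Text output types; substitute your own mapper class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map only job");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(Mapper.class);          // identity mapper, for illustration only
        job.setOutputKeyClass(LongWritable.class); // key type produced by TextInputFormat
        job.setOutputValueClass(Text.class);       // value type produced by TextInputFormat
        job.setNumReduceTasks(0);                  // no reducers: map output goes straight to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }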
Advantages of Map only job in Hadoop
Between the map and reduce phases there is a sort and shuffle phase, which is responsible for
sorting the keys in ascending order and then grouping values by key. This phase is very
expensive, and if the reduce phase is not required we should avoid it, since avoiding the reduce
phase eliminates the sort and shuffle phase as well. This also avoids network congestion: during
shuffling the output of the mapper travels to the reducer, and when the data size is huge, a large
amount of data needs to travel to the reducer.
The output of the mapper is normally written to local disk before being sent to the reducer, but
in a Map-Only job this output is written directly to HDFS. This further saves time and reduces
cost as well.
Also, there is no need for a partitioner or a combiner in a Hadoop Map-Only job, which makes
the process faster.
Data locality in Hadoop: The Most Comprehensive Guide
1. Data Locality in Hadoop – Objective
In Hadoop, data locality is the process of moving the computation close to where the actual data
resides on the node, instead of moving large data to the computation. This minimizes network
congestion and increases the overall throughput of the system. We will discuss this feature of
Hadoop in detail in this section: what data locality in Hadoop is, how Hadoop exploits data
locality, why data locality is needed, the various types of data locality in Hadoop MapReduce,
data locality optimization in Hadoop, and the advantages of Hadoop data locality.
Let us understand the concept of data locality and what data locality means in MapReduce.
A major drawback of early cluster computing was cross-switch network traffic caused by the
huge volume of data. To overcome this drawback, data locality in Hadoop came into the picture.
Data locality in MapReduce refers to the ability to move the computation close to where the
actual data resides on the node, instead of moving large data to the computation. This minimizes
network congestion and increases the overall throughput of the system. In Hadoop, datasets are
stored in HDFS; they are divided into blocks and stored across the data nodes of the Hadoop
cluster. When a user runs a MapReduce job, the MapReduce code is sent to the data nodes that
hold data relevant to the job.
Requirements for Data Locality in MapReduce
To get all the benefits of data locality, the system architecture needs to satisfy the following
conditions:
First of all, the cluster should have an appropriate topology, and the Hadoop code must be able
to read data locally.
Second, Hadoop must be aware of the topology of the nodes where tasks are executed, and
Hadoop must know where the data is located.
Below are the various categories into which data locality in Hadoop is divided:
i. Data local data locality in Hadoop
When the data is located on the same node as the mapper working on it, this is known as data
local data locality. In this case the data is very close to the computation. This is the most
preferred scenario.
ii. Intra-rack data locality in Hadoop
It is not always possible to execute the mapper on the same node as its data, due to resource
constraints. In such a case it is preferred to run the mapper on a different node, but on the same
rack.
iii. Inter-rack data locality in Hadoop
Sometimes it is not even possible to execute the mapper on a different node in the same rack,
due to resource constraints. In such a case we execute the mapper on a node on a different rack.
This is the least preferred scenario.
Data locality is one of the main advantages of Hadoop MapReduce, as the map code is executed
on the same data node where the data resides. In practice, however, this is not always achievable,
for reasons such as speculative execution in Hadoop, heterogeneous clusters, data distribution
and placement, and the data layout and Input Splitter.
These challenges become more prevalent in large clusters, because the more data nodes and
data there are, the lower the locality tends to be. In larger clusters some nodes are newer and
faster than others, which puts the data-to-compute ratio out of balance; thus large clusters tend
not to be completely homogeneous. With speculative execution, a task may run on a node and
use its computing power even though the data is not local there. The root cause also lies in the
data layout and placement and the Input Splitter used. Non-local data processing puts a strain
on the network, which creates a problem for scalability; thus the network becomes the
bottleneck.
We can improve data locality by first detecting which jobs have a data locality problem or
whose locality degrades over time. Solving the problem is more complex: it may involve
changing the data placement and data layout, using a different scheduler, or simply changing
the number of mapper and reducer slots for a job. We then have to verify whether a new
execution of the same workload has a better data locality ratio.
There are two main benefits of data locality in MapReduce. Let’s discuss them one by one:
i. Faster Execution
With data locality, the program is moved to the node where the data resides instead of moving
large data to the node, which makes Hadoop faster. Because the size of the program is always
much smaller than the size of the data, moving the data, not the program, is the network-transfer
bottleneck.
ii. High Throughput
Since data does not have to travel across the network for every task, network congestion is
minimized and the overall throughput of the system increases.
Speculative Execution in Hadoop MapReduce
What is Speculative Execution in Hadoop?
Speculative execution is Hadoop’s way of dealing with slow-running tasks: the framework
launches a duplicate (backup) copy of a slow task on another node, and whichever copy finishes
first is used while the other is killed.
How does Speculative Execution work in Hadoop?
First, all the tasks for the job are launched in Hadoop MapReduce. Speculative tasks are then
launched for those tasks that have been running for some time (at least one minute) and have
not made much progress, on average, compared with the other tasks of the job. The speculative
task is killed if the original task completes before it; conversely, the original task is killed if the
speculative task finishes first.
What is the need to turn off Speculative Execution?
The main goal of speculative execution is to reduce job execution time; however, cluster
efficiency is affected because of the duplicate tasks. Since redundant tasks are executed during
speculative execution, overall throughput can decrease. For this reason, some cluster
administrators prefer to turn speculative execution off in Hadoop.
In this section on Hadoop MapReduce counters, we provide a detailed description of MapReduce
counters in Hadoop: an introduction to Hadoop MapReduce counters and the types of Hadoop
counters, such as built-in counters and user-defined counters. We will also discuss the File Input
Format and File Output Format counters of Hadoop MapReduce.
Hadoop Counters
What is Hadoop MapReduce?
Before we start with Hadoop Counters, let us first see the overview of Hadoop MapReduce.
MapReduce is the core component of Hadoop that provides data processing. MapReduce works
by breaking the processing into two phases: the map phase and the reduce phase. The map is the
first phase of processing, where we specify all the complex logic, business rules and costly code,
whereas the reduce phase is the second phase of processing, where we specify lightweight
processing such as aggregation and summation.
Hadoop Counters Explained: Hadoop counters provide a way to measure the progress or the
number of operations that occur within a MapReduce job. Counters in Hadoop MapReduce are a
useful channel for gathering statistics about the job, for quality control or for application-level
statistics. They are also useful for problem diagnosis.
Counters represent Hadoop global counters, defined either by the MapReduce framework or by
applications. Each Hadoop counter is named by an enum and has a long value. Counters are
bunched into groups, each comprising the counters from a particular enum class.
Hadoop Counters validate that:
The correct number of bytes was read and written.
The correct number of tasks was launched and successfully ran.
The amount of CPU and memory consumed is appropriate for our job and cluster nodes.
Built-In Counters in MapReduce
Hadoop maintains some built-in counters for every job, and these report various metrics. For
example, there are counters for the number of bytes and records, which allow us to confirm that
the expected amount of input is consumed and the expected amount of output is produced.
Hadoop counters are divided into groups, and there are several groups of built-in counters. Each
group contains either task counters (which are updated as a task progresses) or job counters
(which are updated as a job progresses). The groups of built-in Hadoop counters are described
below.
Task Counters in MapReduce
Task counters collect specific information (such as the number of records read and written)
about tasks during their execution. For example, MAP_INPUT_RECORDS is a task counter
which counts the input records read by each map task.
Task counters are maintained by each task attempt and periodically sent to the application
master so that they can be globally aggregated.
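For illustration, a small helper (a sketch, not tied to any particular job in the text) showing how a driver can read one of these built-in task counters after the job has finished:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class CounterReport {
      // Call this after job.waitForCompletion(true) has returned.
      static void printMapInputRecords(Job job) throws IOException {
        Counters counters = job.getCounters();
        long mapInputRecords =
            counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        System.out.println("Map input records: " + mapInputRecords);
      }
    }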
File System Counters in MapReduce
File system counters in Hadoop MapReduce gather information such as the number of bytes
read and written through the file system. The main file system counters are:
File system bytes read – the number of bytes read from the file system by map and reduce
tasks.
File system bytes written – the number of bytes written to the file system by map and reduce
tasks.
File Input Format Counters in Hadoop
File Input Format counters in Hadoop MapReduce gather information about the number of
bytes read by map tasks via the File Input Format.
File Output Format Counters in MapReduce
File Output Format counters in Hadoop MapReduce gather information about the number of
bytes written by map tasks (for map-only jobs) or reduce tasks via the File Output Format.
MapReduce Job Counters
MapReduce job counters measure job-level statistics, not values that change while a task is
running. For example, TOTAL_LAUNCHED_MAPS counts the number of map tasks that were
launched over the course of a job (including tasks that failed). The application master maintains
the MapReduce job counters, so these counters don’t need to be sent across the network, unlike
all other counters, including user-defined ones.
User-Defined Counters in MapReduce
In addition to the built-in counters, MapReduce allows user code to define a set of counters,
which are then incremented as desired in the mapper or reducer. For example, in Java an enum
is used to define counters. A job may define an arbitrary number of enums, each with an
arbitrary number of fields. The name of the enum is the group name, and the enum’s fields are
the counter names.
a. Dynamic Counters in Hadoop MapReduce
A Java enum’s fields are defined at compile time, so we cannot create new counters at runtime
using enums. To do that we use dynamic counters in Hadoop MapReduce, i.e. counters that are
not defined at compile time by a Java enum.
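A hedged sketch of a mapper that uses both styles; the enum, the group and counter names, and the tab-separated record format are all assumptions made for the example:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

      // User-defined counters: the enum name is the group, its fields are the counters.
      enum RecordQuality { VALID, MALFORMED }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
          context.getCounter(RecordQuality.MALFORMED).increment(1);
          return;
        }
        context.getCounter(RecordQuality.VALID).increment(1);

        // Dynamic counter: group and counter names are plain strings chosen at
        // runtime, so new counters appear without changing the enum.
        String firstField = line.split("\t", 2)[0];
        context.getCounter("FirstFieldValues", firstField).increment(1);

        context.write(new Text(firstField), new LongWritable(1));
      }
    }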
This section on Hadoop optimization explains Hadoop cluster optimization and MapReduce job
optimization techniques that help you improve MapReduce job performance and get the best
performance out of your Hadoop cluster.
Hadoop Optimization or Job Optimization Techniques
There are various ways to improve Hadoop optimization. Let’s discuss each of them one by one.
i. Proper configuration of your cluster
Mount the DFS and MapReduce storage with the noatime option. This disables access-time
tracking and can improve I/O performance.
Avoid RAID on TaskTracker and data node machines; it generally reduces performance.
Make sure you have configured mapred.local.dir and dfs.data.dir to point to one directory on
each of your disks, to ensure that all of your I/O capacity is used.
Ensure that you have smart monitoring of the health status of your disk drives. This is one of
the best practices for Hadoop MapReduce performance tuning. MapReduce jobs are fault
tolerant, but dying disks can cause performance to degrade as tasks must be re-executed.
Monitor the graph of swap usage and network usage with software like Ganglia or Hadoop’s
own monitoring metrics. If you see swap being used, reduce the amount of RAM allocated to
each task in mapred.child.java.opts.
ii. LZO compression usage
Using LZO compression is always a good idea for intermediate data. Almost every Hadoop job
that generates a non-negligible amount of map output will benefit from intermediate data
compression with LZO. Although LZO adds a little CPU overhead, it saves time by reducing the
amount of disk I/O during the shuffle.
To enable LZO compression, set mapred.compress.map.output to true. This is one of the most
important Hadoop optimization techniques.
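A minimal sketch of how these settings might be applied in a driver's configuration; the old-style property names follow the one used above, and the codec class assumes the separate hadoop-lzo library is installed on the cluster:

    import org.apache.hadoop.conf.Configuration;

    public class MapOutputCompression {
      static Configuration withLzoMapOutput() {
        Configuration conf = new Configuration();
        // Compress the intermediate map output.
        conf.setBoolean("mapred.compress.map.output", true);
        // Use the LZO codec from the hadoop-lzo library (assumed to be installed).
        conf.set("mapred.map.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        return conf;
      }
    }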
iii. Proper tuning of the number of MapReduce tasks
If each task takes less than 30-40 seconds, reduce the number of tasks. Starting a mapper or
reducer process involves the following: first the JVM must be started (loaded into memory),
then the JVM must be initialized, and after the processing (map/reduce) the JVM must be
de-initialized. All these JVM operations are costly. Now consider a case where a mapper runs a
task for just 20-30 seconds, and for this we have to start, initialize and stop a JVM, which might
take a considerable amount of time. It is recommended that each task run for at least 1 minute.
If a job has more than 1 TB of input, consider increasing the block size of the input dataset to
256 MB or even 512 MB so that the number of tasks is smaller. You can change the block size
of existing files with a command of the form:
hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks
So long as each task runs for at least 30-40 seconds, you should increase the number of
mapper tasks to some multiple of the number of mapper slots in the cluster.
Don’t schedule too many reduce tasks – for most jobs, the number of reduce tasks should be
equal to or a bit less than the number of reduce slots in the cluster.
iv. Combiner between mapper and reducer
If your algorithm involves computing aggregates of any sort, it is suggested to use a Combiner
to perform some aggregation before the data hits the reducer. The MapReduce framework runs
the combiner intelligently to reduce the amount of data that has to be written to disk and
transferred between the map and reduce stages of the computation; a minimal sketch follows.
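As a minimal sketch (WordCountReducer is a hypothetical reducer whose reduce() simply sums the counts, which makes it safe to reuse as a combiner), the combiner is registered in the driver like this:

    // In the driver, after the mapper has been set:
    job.setCombinerClass(WordCountReducer.class);  // hypothetical reducer reused as combiner
    job.setReducerClass(WordCountReducer.class);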
v. Usage of the most appropriate and compact Writable type for data
New big data users, or users switching from Hadoop Streaming to Java MapReduce, often use
the Text writable type unnecessarily. Although Text can be convenient, converting numeric data
to and from UTF-8 strings is inefficient and can actually make up a significant portion of CPU
time. Whenever dealing with non-textual data, consider using binary Writables such as
IntWritable, FloatWritable, etc.
vi. Reuse Writables
One of the common mistakes that many MapReduce users make is to allocate a new Writable
object for every output from a mapper or reducer. For example, a word-count mapper might be
implemented as:
public void map(...) {
  ...
  for (String word : words) {
    output.collect(new Text(word), new IntWritable(1));
  }
}
This implementation causes the allocation of thousands of short-lived objects. While the Java
garbage collector does a reasonable job of dealing with this, it is more efficient to write:
class MyMapper ... {
  Text wordText = new Text();
  IntWritable one = new IntWritable(1);
  public void map(...) {
    ...
    for (String word : words) {
      wordText.set(word);
      output.collect(wordText, one);
    }
  }
}
Performance tuning in Hadoop helps in optimizing Hadoop cluster performance. This section
on Hadoop MapReduce performance tuning provides ways to improve your Hadoop cluster
performance and get the best results from your MapReduce programming. It covers concepts
such as memory tuning in Hadoop, map-side disk spill, tuning mapper tasks, and speculative
execution, along with other related concepts for Hadoop MapReduce performance tuning.
Hadoop MapReduce Performance Tuning
Hadoop performance tuning helps you optimize your Hadoop cluster’s performance and get
better results from your Hadoop programming. To do this, repeat the process below until the
desired output is achieved in an optimal way:
Run Job –> Identify Bottleneck –> Address Bottleneck.
The first step in Hadoop performance tuning is to run the Hadoop job, identify the bottlenecks,
and address them using the methods below to get the highest performance. Repeat this cycle
until the desired level of performance is achieved.
Tips for Hadoop MapReduce Performance Tuning
Here we are going to discuss ways to improve Hadoop MapReduce performance. We have
classified these ways into two categories:
Hadoop run-time parameter based performance tuning.
Hadoop application-specific performance tuning.
Let’s discuss how to improve the performance of the Hadoop cluster on the basis of these two
categories.
i. Tuning Hadoop Run-time Parameters
Hadoop provides many options relating to CPU, memory, disk, and network for performance
tuning. Most Hadoop tasks are not CPU bound; what matters most is optimizing the usage of
memory and disk spills. Let us get into the details of tuning these Hadoop run-time parameters.
a. Memory Tuning
The most general and common rule for memory tuning in MapReduce performance tuning is:
use as much memory as you can without triggering swapping. The parameter for task memory
is mapred.child.java.opts, which can be put in your configuration file.
You can also monitor memory usage on the servers using Ganglia, Cloudera Manager, or Nagios
for better memory performance.
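For illustration, a minimal sketch of setting this parameter programmatically (the 1 GB heap is only an example value; tune it for your own nodes):

    import org.apache.hadoop.conf.Configuration;

    public class TaskMemoryConfig {
      static Configuration withTaskHeap() {
        Configuration conf = new Configuration();
        // Give every task JVM a 1 GB heap; lower it if you see swapping on the nodes.
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        return conf;
      }
    }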
b. Minimize the Map Disk Spill
Disk I/O is usually the performance bottleneck in Hadoop. There are a lot of parameters you can
tune to minimize spilling, such as:
Compression of the mapper output.
Using about 70% of heap memory in the mapper for the spill buffer.
c. Tuning Mapper Tasks
Unlike the number of reducer tasks, the number of mapper tasks is set implicitly. The most
common Hadoop performance tuning approach for the mapper is to control the number of
mappers and the size of each job. When dealing with large files, Hadoop splits the file into
smaller chunks so that the mappers can run on them in parallel. However, initializing a new
mapper task usually takes a few seconds, which is also an overhead to be minimized. Suggestions
for this are:
Reuse the JVM across tasks.
Aim for map tasks running 1-3 minutes each. If the average mapper running time is less than
one minute, increase mapred.min.split.size to allocate fewer mappers per slot and thus reduce
the mapper-initialization overhead.
Use CombineFileInputFormat for bunches of smaller files.
ii. Tuning Application-Specific Performance
a. Minimize your Mapper Output
Minimizing the mapper output can improve performance a great deal, because the mapper
output is sensitive to disk I/O and network I/O and to memory pressure during the shuffle phase.
To achieve this, filter out unwanted records as early as possible in the mapper, emit only the
fields the reducer actually needs, and compress the mapper output (as discussed above).
b. Balancing the Reducers’ Load
Unbalanced reduce tasks create another performance issue: some reducers receive most of the
mapper output and run extremely long compared with the other reducers. A better partitioning
of keys across the reducers, and reducing the intermediate data with a combiner (see below),
help to spread and shrink this load.
c. Reduce Intermediate Data with a Combiner in Hadoop
As discussed earlier, a combiner performs partial aggregation on the map side, which reduces
the amount of intermediate data that has to be shuffled to the reducers.
d. Speculative Execution
When tasks take a long time to finish executing, they delay the whole MapReduce job. This
problem is addressed by speculative execution, which backs up slow tasks on alternate
machines.
To enable speculative execution, set the configuration parameters
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to true
(on Hadoop 2 and later the equivalents are mapreduce.map.speculative and
mapreduce.reduce.speculative). This reduces the job execution time when task progress is slow,
for example because of memory unavailability.
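A minimal sketch of toggling these properties in code (the pre-YARN property names used in the text are shown):

    import org.apache.hadoop.conf.Configuration;

    public class SpeculativeExecutionConfig {
      static Configuration withSpeculation(boolean enabled) {
        Configuration conf = new Configuration();
        // Enable or disable speculative execution for map and reduce tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", enabled);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", enabled);
        return conf;
      }
    }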
This was all about Hadoop MapReduce performance tuning.