MapReduce Unit3
Introduction
MapReduce is a popular programming model for processing large
datasets in parallel across a distributed cluster of computers.
Developed by Google, it has become an essential component of the
Hadoop ecosystem, enabling efficient data processing and analysis.
A common use case is text analysis: tasks such as sentiment analysis, topic modeling,
or keyword extraction can be performed efficiently with MapReduce, enabling
organizations to derive insights from unstructured textual data.
MapReduce Alternatives and Complementary Technologies
While MapReduce has proven effective for many data processing tasks, other
technologies have emerged to address specific needs or offer improved performance
in certain scenarios:
Apache Spark
Apache Spark is a fast, in-memory data processing engine that provides an
alternative to MapReduce for certain use cases. Spark's Resilient Distributed Datasets
(RDDs) enable more efficient iterative processing, making it particularly suitable for
machine learning and graph processing tasks.
Apache Flink
Apache Flink is a stream-processing framework that offers low-latency, high-
throughput data processing. While MapReduce focuses on batch processing, Flink's
ability to process data in real-time makes it an attractive option for time-sensitive
applications.
Apache Hive
Apache Hive is a data warehousing solution built on top of Hadoop
that provides an SQL-like interface for querying large datasets.
While not a direct replacement for MapReduce, Hive can simplify
data processing tasks for users familiar with SQL.
Difference Between MapReduce, Apache Spark, Apache Flink,
and Apache Hive
Primary Focus – MapReduce: Batch Processing; Apache Spark: In-Memory Processing; Apache Flink: Stream Processing; Apache Hive: Data Warehousing.
Data Processing Model – MapReduce: Map and Reduce; Apache Spark: Resilient Distributed Datasets (RDDs); Apache Flink: Data Streaming; Apache Hive: SQL-like Querying.
Data Locality
MapReduce tries to place the data and the compute as close together as possible. First, it
tries to run the computation on the same node where the data resides. If that cannot be
done (for example, because that node is down or is busy with another computation), it
tries to run the computation on the node nearest to the data node(s) that hold the data to
be processed. This feature of MapReduce is called “Data Locality”.
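To make locality visible, the sketch below (a minimal, hypothetical example, assuming a configured Hadoop installation and an existing input path passed as the first argument) lists the hosts that store each input split; the scheduler prefers to run a map task on, or near, one of those hosts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-locations");
        FileInputFormat.addInputPath(job, new Path(args[0])); // input path is a placeholder argument
        // Ask the InputFormat for the splits it would create for this job.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            // getLocations() names the nodes holding the split's data;
            // MapReduce tries to schedule the map task on (or near) one of them.
            System.out.println(split + " -> " + String.join(", ", split.getLocations()));
        }
    }
}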
The following diagram shows the logical flow of a MapReduce programming
model.
Input:
This is the input data / file to be processed.
Split:
Hadoop splits the incoming data into smaller pieces called “splits”.
Map:
In this step, MapReduce processes each split according to the logic defined in the map() function.
Each mapper works on one split at a time. Each mapper is treated as a task, and multiple
tasks are executed across different TaskTrackers and coordinated by the JobTracker.
Combine:
This is an optional step and is used to improve performance by reducing the
amount of data transferred across the network. The combiner applies the same logic
as the reduce step and aggregates the output of the map() function before it
is passed to the subsequent steps.
Reduce:
This step aggregates the outputs of the mappers using the reduce()
function. The output of the reducer is sent to the next and final step. Each reducer is
treated as a task, and multiple tasks are executed across different TaskTrackers
and coordinated by the JobTracker.
Output:
Finally, the output of the reduce step is written to a file in HDFS.
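The driver sketch below shows, under the assumption of a simple word count job, how these steps map onto the Hadoop Java API: the InputFormat covers Input and Split, the Mapper is the Map step, the optional Combiner the Combine step, the Reducer the Reduce step, and the output path on HDFS the Output step. WordCountMapper and WordCountReducer are placeholder class names; their logic is sketched later in the Word Count example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);          // Input / Split
        job.setMapperClass(WordCountMapper.class);               // Map
        job.setCombinerClass(WordCountReducer.class);            // Combine (optional)
        job.setReducerClass(WordCountReducer.class);             // Reduce

        job.setOutputKeyClass(Text.class);                       // (word, count) output types
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file(s)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output written to HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}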
Word Count Example
For the purpose of understanding MapReduce, let us consider a simple
example. Let us assume that we have a file which contains the following four
lines of text.
In this file, we need to count the number of occurrences of each word. For
instance, DW appears twice, BI appears once, SSRS appears twice, and so
on. Let us see how this counting operation is performed when this file is
input to MapReduce.
Below is a simplified representation of the data flow for the Word Count example.
•Input: In this step, the sample file is input to MapReduce.
•Split: In this step, Hadoop splits / divides our sample input file into four parts, with each
part made up of one line from the input file. Note that, for the purpose of this
example, we are treating each line as one split. This is not necessarily the case in a
real scenario, where a split typically covers a larger chunk of data.
•Map: In this step, each split is fed to a mapper which is the map() function
containing the logic on how to process the input data, which in our case is the line
of text present in the split. For our scenario, the map() function would contain the
logic to count the occurrence of each word and each occurrence is captured /
arranged as a (key, value) pair, which in our case is like (SQL, 1), (DW, 1), (SQL, 1),
and so on.
•Combine: This is an optional step and is often used to improve the performance by
reducing the amount of data transferred across the network. This is essentially the
same as the reducer (reduce() function) and acts on output from each mapper. In
our example, the key-value pairs from the first mapper “(SQL, 1), (DW, 1), (SQL, 1)” are
combined and the output of the corresponding combiner becomes “(SQL, 2), (DW,
1)”.
•Shuffle and Sort: In this step, the output of all the mappers is collected,
shuffled, sorted, and arranged to be sent to the reducer.
•Reduce: In this step, the collective data from various mappers, after being
shuffled and sorted, is combined / aggregated and the word counts are
produced as (key, value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.
•Output: In this step, the output of the reducer is written to a file on HDFS.
The following image is the output of our word count example.
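A minimal sketch of the map() and reduce() logic for this example is shown below; the class names match the placeholder names used in the driver sketch above. The mapper emits (word, 1) for every word in its split, and the reducer, which can also serve as the combiner, sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word in the line, e.g. (SQL, 1), (DW, 1), ...
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for each word, e.g. (SQL, 5), (DW, 2), ...
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}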
Types of InputFormat in MapReduce
There are different types of MapReduce InputFormat in Hadoop, each used for a
different purpose. Let us discuss the Hadoop InputFormat types below:
1. FileInputFormat
It is the base class for all file-based InputFormats. FileInputFormat also specifies the
input directory that holds the data files. When we start a MapReduce job
execution, FileInputFormat provides the path containing the files to read.
This InputFormat reads all the files and then divides them into one or more
InputSplits.
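As a small sketch (the input paths below are placeholders), a job points a FileInputFormat-based input at its files like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-path-setup");
        // Every file under this directory is read and divided into InputSplits.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // Additional paths (directories or single files) can be added the same way.
        FileInputFormat.addInputPath(job, new Path("/data/archive"));
    }
}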
2. TextInputFormat
It is the default InputFormat. This InputFormat treats each line of each input file as a
separate record and performs no parsing. TextInputFormat is useful for unformatted
data or line-based records like log files. Hence,
•Key – the byte offset of the beginning of the line within the file (not within the
split), so it is unique when combined with the file name.
•Value – the contents of the line, excluding line terminators.
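A minimal mapper sketch for TextInputFormat; the input key/value types follow the description above, and the output types are arbitrary placeholders for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable byteOffset, Text line, Context context)
            throws IOException, InterruptedException {
        // byteOffset – position of the line's first byte within the file
        // line       – the line's contents, without the terminator
        context.write(line, byteOffset); // simply pass each line through with its offset
    }
}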
3. KeyValueTextInputFormat
It is similar to TextInputFormat. This InputFormat also treats each line of input
as a separate record. The difference is that TextInputFormat treats the entire
line as the value, whereas KeyValueTextInputFormat breaks the line into a
key and a value at a tab character (‘\t’). Hence,
•Key – everything up to the tab character.
•Value – the remaining part of the line after the tab character.
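A configuration sketch is shown below; the separator property name is the one used by current Hadoop releases (an assumption worth checking against the version in use), and the separator itself defaults to a tab.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Everything before the first separator becomes the key, the rest the value.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf, "key-value-input");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // The matching mapper signature is Mapper<Text, Text, ...>.
    }
}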
4. SequenceFileInputFormat
It is an InputFormat that reads sequence files. Sequence files are binary files
that store sequences of binary key-value pairs. They are block-compressed
and provide direct serialization and deserialization of arbitrary data types.
Hence,
Key & Value – both are user-defined.
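A brief sketch of wiring it into a job; the key/value types mentioned in the comment are placeholders that must match whatever the sequence file was written with.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceInputSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(SequenceFileInputFormat.class);
        // The mapper's input types must match the types stored in the file,
        // e.g. Mapper<Text, IntWritable, ...> for a (Text, IntWritable) sequence file.
    }
}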
5. SequenceFileAsTextInputFormat
It is a variant of SequenceFileInputFormat. This format converts the
sequence file keys and values to Text objects by calling toString() on
the keys and values. Hence,
SequenceFileAsTextInputFormat makes sequence files suitable input for
streaming.
6. SequenceFileAsBinaryInputFormat
By using SequenceFileAsBinaryInputFormat we can extract the sequence file’s
keys and values as opaque binary objects.
7. NLineInputFormat
It is another form of TextInputFormat where the keys are the byte offsets of
the lines and the values are the contents of the lines. With TextInputFormat and
KeyValueTextInputFormat, each mapper receives a variable number of lines of
input; the number depends on the size of the split and on the length of the lines.
If we want our mappers to receive a fixed number of lines of input, then we use
NLineInputFormat.
N is the number of lines of input that each mapper receives.
By default (N=1), each mapper receives exactly one line of input.
Suppose N=2, then each split contains two lines. One mapper receives the
first two key-value pairs and another mapper receives the next two
key-value pairs.
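A configuration sketch with N=2 (the value is arbitrary for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineInputSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline-input");
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2); // each mapper receives exactly 2 lines
    }
}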
8. DBInputFormat
This InputFormat reads data from a relational database using JDBC.
It is best suited to loading small datasets, perhaps for joining with large datasets
from HDFS using MultipleInputs. Hence,
•Key – LongWritable
•Value – DBWritable
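The sketch below reads rows over JDBC; the driver class, connection URL, credentials, table name, and column names are all placeholders, and EmployeeRecord is a hypothetical DBWritable describing one row.

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

class EmployeeRecord implements DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");        // column names are placeholders
        name = rs.getString("name");
    }

    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, name);
    }
}

public class DbInputSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",          // JDBC driver class (placeholder)
                "jdbc:mysql://localhost/reports", // connection URL (placeholder)
                "user", "password");              // credentials (placeholders)
        Job job = Job.getInstance(conf, "db-input");
        job.setInputFormatClass(DBInputFormat.class);
        // The mapper receives LongWritable keys and EmployeeRecord (DBWritable) values.
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "employees", null, null, "id", "name");
    }
}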
What is Interactive Analytics?
Businesses are collecting more data than ever, but if you don’t know how to effectively
interpret it, data is just facts and statistics. The value in data doesn’t come from
collecting it, but in how you make it actionable to drive business strategy. In order to
make better-informed decisions, your business needs to be able to effectively analyze
data, and that data needs to be comprehensible for as many decision-makers as
possible.
Interactive analytics is a way to make real-time data more intelligible for non-technical
users through the use of tools that visualize and crunch the data, enabling users to
quickly and easily run complex queries and interpret them to gain the valuable insights
that factor into critical business decisions.