MapReduce Unit3
Introduction
MapReduce is a popular programming model for processing large
datasets in parallel across a distributed cluster of computers.
Developed by Google, it has become an essential component of the
Hadoop ecosystem, enabling efficient data processing and analysis.
A common use case is text analysis: tasks such as sentiment analysis, topic modeling,
or keyword extraction can be performed efficiently with MapReduce, enabling
organizations to derive insights from unstructured textual data.
MapReduce Alternatives and Complementary Technologies
While MapReduce has proven effective for many data processing tasks, other
technologies have emerged to address specific needs or offer improved performance
in certain scenarios:
Apache Spark
Apache Spark is a fast, in-memory data processing engine that provides an
alternative to MapReduce for certain use cases. Spark's Resilient Distributed Datasets
(RDDs) enable more efficient iterative processing, making it particularly suitable for
machine learning and graph processing tasks.
Apache Flink
Apache Flink is a stream-processing framework that offers low-latency, high-
throughput data processing. While MapReduce focuses on batch processing, Flink's
ability to process data in real-time makes it an attractive option for time-sensitive
applications.
Apache Hive
Apache Hive is a data warehousing solution built on top of Hadoop
that provides an SQL-like interface for querying large datasets.
While not a direct replacement for MapReduce, Hive can simplify
data processing tasks for users familiar with SQL.
Difference Between MapReduce, Apache Spark, Apache Flink,
and Apache Hive
Primary Focus – MapReduce: Batch Processing; Apache Spark: In-Memory Processing; Apache Flink: Stream Processing; Apache Hive: Data Warehousing.
Data Processing Model – MapReduce: Map and Reduce; Apache Spark: Resilient Distributed Datasets (RDDs); Apache Flink: Data Streaming; Apache Hive: SQL-like Querying.
Data Locality
MapReduce tries to place the data and the compute as close together as possible. First, it
tries to run the computation on the same node where the data resides. If that cannot be
done (for example, because that node is down or is busy with another computation), it
tries to run the computation on the node nearest to the data node(s) that hold the data to
be processed. This feature of MapReduce is called “Data Locality”.
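To make locality visible, the sketch below (a minimal, hypothetical example, assuming a configured Hadoop installation and an existing input path passed as the first argument) lists the hosts that store each input split; the scheduler prefers to run a map task on, or near, one of those hosts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-locations");
        FileInputFormat.addInputPath(job, new Path(args[0])); // input path is a placeholder argument
        // Ask the InputFormat for the splits it would create for this job.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            // getLocations() names the nodes holding the split's data;
            // MapReduce tries to schedule the map task on (or near) one of them.
            System.out.println(split + " -> " + String.join(", ", split.getLocations()));
        }
    }
}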
The following diagram shows the logical flow of a MapReduce programming
model.
Input:
This is the input data / file to be processed.
Split:
Hadoop splits the incoming data into smaller pieces called “splits”.
Map:
In this step, MapReduce processes each split according to the logic defined in the map() function.
Each mapper works on one split at a time. Each mapper is treated as a task, and multiple
tasks are executed across different TaskTrackers and coordinated by the JobTracker.
Combine:
This is an optional step and is used to improve performance by reducing the
amount of data transferred across the network. The combiner applies the same logic
as the reduce step and aggregates the output of the map() function before it
is passed to the subsequent steps.
Reduce:
This step aggregates the outputs of the mappers using the reduce()
function. The output of the reducer is sent to the next and final step. Each reducer is
treated as a task, and multiple tasks are executed across different TaskTrackers
and coordinated by the JobTracker.
Output:
Finally, the output of the reduce step is written to a file in HDFS.
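The driver sketch below shows, under the assumption of a simple word count job, how these steps map onto the Hadoop Java API: the InputFormat covers Input and Split, the Mapper is the Map step, the optional Combiner the Combine step, the Reducer the Reduce step, and the output path on HDFS the Output step. WordCountMapper and WordCountReducer are placeholder class names; their logic is sketched later in the Word Count example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);          // Input / Split
        job.setMapperClass(WordCountMapper.class);               // Map
        job.setCombinerClass(WordCountReducer.class);            // Combine (optional)
        job.setReducerClass(WordCountReducer.class);             // Reduce

        job.setOutputKeyClass(Text.class);                       // (word, count) output types
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file(s)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output written to HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}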
Word Count Example
For the purpose of understanding MapReduce, let us consider a simple
example. Let us assume that we have a file which contains the following four
lines of text.
In this file, we need to count the number of occurrences of each word. For
instance, DW appears twice, BI appears once, SSRS appears twice, and so
on. Let us see how this counting operation is performed when this file is
input to MapReduce.
Below is a simplified representation of the data flow for the Word Count example.
•Input: In this step, the sample file is input to MapReduce.
•Split: In this step, Hadoop splits / divides our sample input file into four parts, with each
part made up of one line from the input file. Note that, for the purpose of this
example, we are treating each line as one split. This is not necessarily the case in a
real scenario, where a split typically covers a larger chunk of data.
•Map: In this step, each split is fed to a mapper which is the map() function
containing the logic on how to process the input data, which in our case is the line
of text present in the split. For our scenario, the map() function would contain the
logic to count the occurrence of each word and each occurrence is captured /
arranged as a (key, value) pair, which in our case is like (SQL, 1), (DW, 1), (SQL, 1),
and so on.
•Combine: This is an optional step and is often used to improve the performance by
reducing the amount of data transferred across the network. This is essentially the
same as the reducer (reduce() function) and acts on output from each mapper. In
our example, the key-value pairs from the first mapper “(SQL, 1), (DW, 1), (SQL, 1)” are
combined and the output of the corresponding combiner becomes “(SQL, 2), (DW,
1)”.
•Shuffle and Sort: In this step, the output of all the mappers is collected,
shuffled, sorted, and arranged to be sent to the reducer.
•Reduce: In this step, the collective data from various mappers, after being
shuffled and sorted, is combined / aggregated and the word counts are
produced as (key, value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.
•Output: In this step, the output of the reducer is written to a file on HDFS.
The following image is the output of our word count example.
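A minimal sketch of the map() and reduce() logic for this example is shown below; the class names match the placeholder names used in the driver sketch above. The mapper emits (word, 1) for every word in its split, and the reducer, which can also serve as the combiner, sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word in the line, e.g. (SQL, 1), (DW, 1), ...
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for each word, e.g. (SQL, 5), (DW, 2), ...
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}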
Types of InputFormat in MapReduce
There are different types of MapReduce InputFormat in Hadoop, each used for a
different purpose. Let us discuss the Hadoop InputFormat types below:
1. FileInputFormat
It is the base class for all file-based InputFormats. FileInputFormat also specifies the
input directory that holds the data files. When we start a MapReduce job
execution, FileInputFormat provides the path containing the files to read.
This InputFormat reads all the files and then divides them into one or more
InputSplits.
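As a small sketch (the input paths below are placeholders), a job points a FileInputFormat-based input at its files like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-path-setup");
        // Every file under this directory is read and divided into InputSplits.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // Additional paths (directories or single files) can be added the same way.
        FileInputFormat.addInputPath(job, new Path("/data/archive"));
    }
}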
2. TextInputFormat
It is the default InputFormat. This InputFormat treats each line of each input file as a
separate record and performs no parsing. TextInputFormat is useful for unformatted
data or line-based records like log files. Hence,
•Key – the byte offset of the beginning of the line within the file (not within the
split), so it is unique when combined with the file name.
•Value – the contents of the line, excluding line terminators.
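A minimal mapper sketch for TextInputFormat; the input key/value types follow the description above, and the output types are arbitrary placeholders for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable byteOffset, Text line, Context context)
            throws IOException, InterruptedException {
        // byteOffset – position of the line's first byte within the file
        // line       – the line's contents, without the terminator
        context.write(line, byteOffset); // simply pass each line through with its offset
    }
}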
3. KeyValueTextInputFormat
It is similar to TextInputFormat. This InputFormat also treats each line of input
as a separate record. The difference is that TextInputFormat treats the entire
line as the value, whereas KeyValueTextInputFormat breaks the line into a
key and a value at a tab character (‘\t’). Hence,
•Key – everything up to the tab character.
•Value – the remaining part of the line after the tab character.
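A configuration sketch is shown below; the separator property name is the one used by current Hadoop releases (an assumption worth checking against the version in use), and the separator itself defaults to a tab.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Everything before the first separator becomes the key, the rest the value.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf, "key-value-input");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // The matching mapper signature is Mapper<Text, Text, ...>.
    }
}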
4. SequenceFileInputFormat
It is an InputFormat that reads sequence files. Sequence files are binary files
that store sequences of binary key-value pairs. They are block-compressed
and provide direct serialization and deserialization of arbitrary data types.
Hence,
Key & Value – both are user-defined.
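A brief sketch of wiring it into a job; the key/value types mentioned in the comment are placeholders that must match whatever the sequence file was written with.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceInputSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(SequenceFileInputFormat.class);
        // The mapper's input types must match the types stored in the file,
        // e.g. Mapper<Text, IntWritable, ...> for a (Text, IntWritable) sequence file.
    }
}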
5. SequenceFileAsTextInputFormat
It is a variant of SequenceFileInputFormat. This format converts the
sequence file keys and values to Text objects by calling toString() on
the keys and values. Hence,
SequenceFileAsTextInputFormat makes sequence files suitable input for
streaming.
6. SequenceFileAsBinaryInputFormat
By using SequenceFileAsBinaryInputFormat we can extract the sequence file’s
keys and values as opaque binary objects.
7. NLineInputFormat
It is another form of TextInputFormat where the keys are the byte offsets of
the lines and the values are the contents of the lines. With TextInputFormat and
KeyValueTextInputFormat, each mapper receives a variable number of lines of
input; the number depends on the size of the split and on the length of the lines.
If we want our mappers to receive a fixed number of lines of input, then we use
NLineInputFormat.
N is the number of lines of input that each mapper receives.
By default (N=1), each mapper receives exactly one line of input.
Suppose N=2, then each split contains two lines. One mapper receives the
first two key-value pairs and another mapper receives the next two
key-value pairs.
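A configuration sketch with N=2 (the value is arbitrary for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineInputSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline-input");
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2); // each mapper receives exactly 2 lines
    }
}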
8. DBInputFormat
This InputFormat reads data from a relational database using JDBC.
It is best suited to loading small datasets, perhaps for joining with large datasets
from HDFS using MultipleInputs. Hence,
•Key – LongWritable
•Value – DBWritable
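The sketch below reads rows over JDBC; the driver class, connection URL, credentials, table name, and column names are all placeholders, and EmployeeRecord is a hypothetical DBWritable describing one row.

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

class EmployeeRecord implements DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");        // column names are placeholders
        name = rs.getString("name");
    }

    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, name);
    }
}

public class DbInputSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",          // JDBC driver class (placeholder)
                "jdbc:mysql://localhost/reports", // connection URL (placeholder)
                "user", "password");              // credentials (placeholders)
        Job job = Job.getInstance(conf, "db-input");
        job.setInputFormatClass(DBInputFormat.class);
        // The mapper receives LongWritable keys and EmployeeRecord (DBWritable) values.
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "employees", null, null, "id", "name");
    }
}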
What is Interactive Analytics?
Businesses are collecting more data than ever, but if you don’t know how to effectively
interpret it, data is just facts and statistics. The value in data doesn’t come from
collecting it, but in how you make it actionable to drive business strategy. In order to
make better-informed decisions, your business needs to be able to effectively analyze
data, and that data needs to be comprehensible for as many decision-makers as
possible.
Interactive analytics is a way to make real-time data more intelligible for non-technical
users through the use of tools that visualize and crunch the data, enabling users to
quickly and easily run complex queries and interpret them to gain the valuable insights
that factor into critical business decisions.