
BATCH PROCESSING WITH MAP REDUCE

Prasad M Deshpande
Patterns in processing

[Figure: processing taxonomy – Processing divides into Synchronous and Asynchronous; Asynchronous further divides into Streaming and Batch.]
Synchronous vs Asynchronous

■ Synchronous
– Request is processed and the response is sent back immediately
– Client blocks for a response
■ Asynchronous (see the sketch below)
– Request is sent as an event/message
– Client does not block
– Event is put in a queue/file and processed later
– Response is generated as another event
– Consumer of the response event can be a different service
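As a concrete illustration (not from the slides), a minimal Java sketch of the asynchronous pattern, with a BlockingQueue standing in for the event queue and all names being hypothetical:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();

        // Consumer: drains the queue later, independently of the client.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = events.take();      // waits for the next event
                    System.out.println("processed: " + event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // exit on interrupt
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        events.offer("request-1");   // client enqueues and does not block on a response
        Thread.sleep(100);           // give the consumer a moment (demo only)
    }
}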
Data at rest vs data in motion

■ At rest:
– Dataset is fixed (file)
– Bounded
– Can go back and forth on the data
■ In motion:
– Continuously incoming data (queue)
– Unbounded
– Too large to store and then process
– Need to process in one pass
Batch processing

■ Problem statement:
– Process the entire data
– Give an answer for X at the end
■ Characteristics:
– Access to the entire data
– Split decided at launch time
– Capable of complex analysis (e.g., model training)
– Optimized for throughput (data processed per sec)
■ Example frameworks: Map Reduce, Apache Spark
Stream processing

■ Problem statement:
– Process an incoming stream of data
– Give an answer for X at this moment
■ Characteristics:
– Results for X are based on the current data
– Computes a function on one record or a small window
– Optimized for latency (avg. time taken per record)
■ Example frameworks: Apache Storm, Apache Flink, Amazon Kinesis, Kafka, Pulsar
Batch vs Streaming

■ Batch: find stats about a group in a closed room. Streaming: find stats about a group in a marathon.
■ Batch: analyze sales data for last month to make strategic decisions. Streaming: monitor the health of a data center.
When to use Batch vs Streaming

■ Batch processing is designed for 'data at rest'; 'data in motion' becomes stale if processed in batch mode.
■ Real-time processing is designed for 'data in motion', but can be used for 'data at rest' as well (in many cases).

Big data flow

[Figure: big data flow – Ingest → Store → Analyze, with streaming and batch as the two processing paths.]
Design goals of batch processing systems

■ Fast processing
– Data ought to be in primary storage, or even better, RAM
■ Scalable
– Should be able to handle growing data volumes
■ Reliable
– Should be able to handle failures gracefully
■ Ease of programming
– Right level of abstractions to help build applications
■ Low cost

⇒ Need a whole ecosystem


Batch processing flows

■ Flow of work through a directed acyclic graph (DAG)
■ Different operators for coordinating the flow
■ Let's look at some common patterns
Copier

■ Duplicate input to multiple outputs
■ Useful when different independent processing steps need to be done on the same input
Filter

■ Select a subset of the input items
■ Usually based on a predicate on the input attribute values
Splitter

■ Split the input set into two or more different output sets
■ Partitioning vs copying
■ Usually based on some predicate – different processing to be done for each partition
Sharding

■ Split based on some sharding function
■ Same processing for all partitions
■ Reasons for sharding:
– To distribute load among multiple processors
– Resilience to failures
Merge

■ Combine multiple input sets into a single output set
■ A simple union
Join

■ Barrier synchronization
■ Ensures that the previous step is complete before starting the next step
■ Reduces parallelism
Reduce

■ Group and merge multiple input items into a single output item
■ Usually, some form of aggregation
■ Need not wait for all input to be ready
A simple problem

■ Find transactions with sale >= 10
■ Which patterns will you use?
■ How will you parallelize?

Product  Sale
P1       10
P2       15
P1       5
P2       40
P5       15
P1       55
P2       10
P5       30
P3       25
P3       15

Patterns to pick from: Copy, Filter, Split, Shard, Merge, Join, Reduce (one possible answer is sketched below)
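A minimal sketch of one possible answer (illustrative, not from the slides), using in-memory Java streams as a stand-in for distributed workers – shard the input, filter each shard with the sale >= 10 predicate, and merge the survivors. Txn and FilterSketch are hypothetical names:

import java.util.List;

record Txn(String product, int sale) {}

public class FilterSketch {
    public static void main(String[] args) {
        List<Txn> txns = List.of(new Txn("P1", 10), new Txn("P2", 15),
                new Txn("P1", 5), new Txn("P2", 40), new Txn("P5", 15));

        List<Txn> kept = txns.parallelStream()   // shard: implicit split of the input
                .filter(t -> t.sale() >= 10)     // filter: predicate on the sale value
                .toList();                       // merge: union of per-shard results

        kept.forEach(t -> System.out.println(t.product() + " " + t.sale()));
    }
}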
A simple problem – extended

■ Find total sales by category for transactions with sale >= 10
■ Which patterns will you use?
■ How to parallelize?
■ Expected output, e.g.: PC1, 105

Product  Sale
P1       10
P2       15
P1       5
P2       40
P5       15
P1       55
P2       10
P5       30
P3       25
P3       15

Category  Products
PC1       P1, P3
PC2       P2, P4, P5

Patterns to pick from: Copy, Filter, Split, Shard, Merge, Join, Reduce
Challenges in parallelization

■ How to break a large problem into smaller tasks?


■ How to assign tasks to workers distributed across machines?
■ How to ensure that workers get the data they need?
■ How to coordinate synchronization across workers?
■ How to share partial results from one worker to another?
■ How to handle software errors and hardware faults?

Programmer should not be burdened with all these details => need an abstraction
Map-reduce abstraction

Two processing layers/stages:

■ map: (k1, v1) → [(k2, v2)]
■ reduce: (k2, [v2]) → [(k3, v3)]
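As a minimal sketch, the two signatures can be rendered as Java interfaces; the names below are hypothetical (Hadoop's real Mapper/Reducer classes appear on the next slide):

import java.util.List;

// Illustrative rendering of the abstraction, not Hadoop's actual API.
record Pair<K, V>(K key, V value) {}

interface MapFn<K1, V1, K2, V2> {
    List<Pair<K2, V2>> map(K1 key, V1 value);           // (k1, v1) → [(k2, v2)]
}

interface ReduceFn<K2, V2, K3, V3> {
    List<Pair<K3, V3>> reduce(K2 key, List<V2> values); // (k2, [v2]) → [(k3, v3)]
}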
Revisiting the problem

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ProductMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is a "product,sale" line.
        String[] parts = value.toString().split(",");
        String product = parts[0];
        int sale = Integer.parseInt(parts[1]);
        if (sale >= 10) {  // keep only transactions with sale >= 10
            // getCategory: product-to-category lookup (defined elsewhere)
            String category = getCategory(product);
            context.write(new Text(category), new IntWritable(sale));
        }
    }
}

public class ProductReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total += val.get();  // sum the sales for this category
        }
        context.write(key, new IntWritable(total));
    }
}
Processing stages / Scaling out / Multiple reduce tasks

[Figures: the stages of a MapReduce job, scaled out across machines, and with multiple reduce tasks.]
Our example

Category  Products
PC1       P1, P3
PC2       P2, P4, P5

■ Map tasks →
– Mapper task 1 input: P1 [key], 10 [sale value]; P2, 15; P1, 5
– Output (after the sale >= 10 filter): PC1, 10; PC2, 15
– Mapper task 2 input: P2, 40; P5, 15; P1, 55; P2, 10
– Output: PC2, 40; PC2, 15; PC1, 55; PC2, 10
– Mapper task 3 input: P5, 30; P3, 25; P3, 15
– Output: PC2, 30; PC1, 25; PC1, 15
■ Partitions [reducers] → by product category
Shuffle, sort and partition

Data from mappers:
■ PC1, 10; PC2, 15
■ PC2, 40; PC2, 15; PC1, 55; PC2, 10
■ PC2, 30; PC1, 25; PC1, 15

After shuffle and sort:
■ Partition [reducer] 1: PC1, 10; PC1, 55; PC1, 25; PC1, 15 → PC1, 105
■ Partition [reducer] 2: PC2, 15; PC2, 40; PC2, 15; PC2, 10; PC2, 30 → ???
Can it be optimized further?

Data from mappers:
■ PC1, 10; PC2, 15
■ PC2, 40; PC2, 15; PC1, 55; PC2, 10
■ PC2, 30; PC1, 25; PC1, 15
Combiner

■ Runs on the output of the mapper
■ No guarantee on how many times it will be called by the framework
■ Calling the combiner function zero, one, or many times should produce the same output from the reducer
■ Contract for combiner – same as reducer:
– (k2, [v2]) → [(k3, v3)]
■ Reduces the amount of data shuffled between the mappers and reducers (see the driver sketch below)
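A sketch of how a combiner is wired into a job driver, reusing ProductMapper and ProductReducer from the earlier slide (the driver class name is illustrative; the Hadoop API calls are real). The reducer can double as the combiner here because summing is associative and commutative, and its input and output types match:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByCategory {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sales by category");
        job.setJarByClass(SalesByCategory.class);
        job.setMapperClass(ProductMapper.class);
        job.setCombinerClass(ProductReducer.class);  // combiner = reducer here
        job.setReducerClass(ProductReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}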
Combiner example

Data from mappers → after combining:
■ PC1, 10; PC2, 15 → PC1, 10; PC2, 15
■ PC2, 40; PC2, 15; PC1, 55; PC2, 10 → PC2, 65; PC1, 55
■ PC2, 30; PC1, 25; PC1, 15 → PC2, 30; PC1, 40


Framework design

■ So where should execution of the mapper happen?
■ And how many map tasks?
"Where to execute?": Data locality

■ Move computation close to the data rather than data to the computation.
■ A computation requested by an application is much more efficient if it is executed near the data it operates on, especially when the data is very large.
■ Minimizes network congestion and increases the throughput of the system.
■ Hadoop will try to execute the mapper on the nodes where the block resides.
– If those nodes [think of replicas] are not available, Hadoop will try to pick a node that is closest to the node that hosts the data block.
– It could pick another node in the same rack, for example.
Data locality

[Figure: data-local (a), rack-local (b), and off-rack (c) map tasks.]
How many mapper tasks?

The number of mappers depends entirely on:

1) File size, and

2) Block [split] size
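For example, assuming HDFS's default 128 MB block size: a 1 GB file stored as 128 MB blocks yields ceil(1024 MB / 128 MB) = 8 input splits, and hence 8 map tasks.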


Internals

■ The mapper writes its output to the local disk of the machine it runs on.
– This is temporary data, also called intermediate output.
■ As a mapper finishes, its output travels from the mapper node to the reducer node; this movement of output from mapper node to reducer node is called the shuffle.
■ The output from a mapper is partitioned into many partitions.
– Each of these partitions goes to a reducer, based on some condition (the default is sketched below).
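The default condition is a hash of the key; a sketch mirroring Hadoop's HashPartitioner (the real class lives in org.apache.hadoop.mapreduce.lib.partition):

import org.apache.hadoop.mapreduce.Partitioner;

// Assigns a mapper output key to one of the reduce partitions by hashing.
public class HashPartitionSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}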
Map Internals

■ InputSplits are created by the InputFormat. Example formats: FileInputFormat, DBInputFormat.
■ The RecordReader's responsibility is to keep reading/converting data into key-value pairs until the end of the split; these pairs are sent to the mapper.
■ The number of map tasks will be equal to the number of InputSplits.
■ A mapper on any node should be able to access the split → need a distributed file system (HDFS).
■ Intermediate output is written to local disks.
■ The same holds on the output side, with OutputFormats and RecordWriters (a fragment follows).
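These pieces are pluggable through the Job API; a fragment (assuming the driver sketch shown earlier, where job is the configured Job) that selects Hadoop's line-oriented text defaults explicitly:

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Inside the driver's main(), before job submission.
// TextInputFormat's RecordReader emits (byte offset, line) pairs.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);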
MR Algorithm design

Pseudo-code for a basic word count algorithm
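The slide's pseudo-code figure does not survive text extraction; as a stand-in, a minimal Hadoop-style sketch of the classic word count (class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);   // emit (term, 1) for every token
        }
    }
}

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();  // total count for the term
        context.write(key, new IntWritable(sum));
    }
}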


Improvement – local within-document aggregation

Local across-document aggregation

No longer pure functional programming – state maintained across function calls!
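A sketch of the across-document variant, often called in-mapper combining (the class name is illustrative; setup/cleanup are real Mapper hooks): counts accumulate in a HashMap across map() calls and are emitted once, in cleanup(), when the task's input is exhausted.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void map(LongWritable key, Text value, Context context) {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);  // state kept across calls
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the accumulated counts once, at the end of the task's input.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}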


Do we still need combiners?

■ Limitations of in-mapper combining:
– State needs to be maintained
– Scalability – size of the state can grow without bounds
■ Keep bounded state:
– Write intermediate results
– Use combiners
Summary

■ MR – a powerful abstraction for parallel computation
■ The framework handles the complexity of distribution, data transfer, coordination, and failure recovery
Reading list

■ Designing Distributed Systems, Brendan Burns
– Chapters 11 and 12, except the hands-on sections
■ Distributed and Cloud Computing, Kai Hwang, Geoffrey C. Fox, Jack J. Dongarra
– Section 6.2.2, except 6.2.2.7
■ Optional reading
– Data-Intensive Text Processing with MapReduce, Sections 2.1 to 2.4
