Map Reduce
Processing
Processing can be classified along two axes: synchronous vs. asynchronous, and streaming vs. batch.
Synchronous vs Asynchronous
■ Synchronous
– Request is processed and the response is sent back immediately
– Client blocks waiting for the response
■ Asynchronous
– Request is sent as an event/message
– Client does not block
– Event is put in a queue/file and processed later
– Response is generated as another event
– Consumer of the response event can be a different service
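The asynchronous pattern above can be sketched with Python's standard queue and threading modules. The names (request_queue, worker, "order-42") are illustrative, not from any specific framework:

```python
# A minimal sketch of the asynchronous pattern: the client enqueues an
# event and continues; a consumer processes it later and emits the
# response as another event on a second queue.
import queue
import threading

request_queue = queue.Queue()
response_queue = queue.Queue()

def worker():
    # Consumer: drains the request queue, processing events later.
    while True:
        req = request_queue.get()
        if req is None:              # sentinel: stop the worker
            break
        # The response is itself emitted as another event.
        response_queue.put(f"processed:{req}")

t = threading.Thread(target=worker)
t.start()

# Client side: enqueue the request and continue immediately (no blocking).
request_queue.put("order-42")
request_queue.put(None)
t.join()

# A different service could be the one consuming this response event.
result = response_queue.get()
print(result)
```

Note that the client never waits on the worker; in the synchronous pattern, the `put` would instead be a blocking call that returns the response directly.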
Data at rest vs. data in motion
■ At rest:
– Dataset is fixed (e.g., a file)
– Bounded
– Can go back and forth over the data
■ In motion:
– Continuously incoming data (e.g., a queue)
– Unbounded
– Too large to store first and then process
– Needs to be processed in one pass
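The contrast can be sketched in a few lines of Python. The list stands in for a bounded file, the generator for an unbounded queue; the streaming version makes a single pass with constant state, never materializing the whole input:

```python
# Data at rest: bounded, random/repeated access is fine.
data_at_rest = [3, 7, 1, 9]
total = sum(data_at_rest)            # one full pass
maximum = max(data_at_rest)          # a second full pass over the same data

def stream():
    # Stand-in for a queue of continuously incoming events.
    # A real stream would be unbounded; we can only see each element once.
    for x in [3, 7, 1, 9]:
        yield x

# Data in motion: one pass, constant state, aggregates updated per element.
running_total, running_max = 0, float("-inf")
for x in stream():
    running_total += x
    running_max = max(running_max, x)

print(total, running_total)
```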
Batch processing
■ Batch processing is designed for 'data at rest'; 'data in motion' becomes stale if processed in batch mode.
■ Real-time processing is designed for 'data in motion', but can in many cases be used for 'data at rest' as well.
Merge / Join / Reduce
■ Barrier synchronization
– Ensures that the previous step is complete before starting the next step
– Reduces parallelism
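Barrier synchronization can be sketched with Python's `threading.Barrier`: no worker starts the second ("reduce") step until every worker has finished the first ("map") step, so the reduce step always sees complete map output.

```python
# Sketch of barrier synchronization: each worker maps its own value,
# then waits at the barrier; only when all N workers arrive does any
# of them proceed to the reduce step.
import threading

N = 4
barrier = threading.Barrier(N)
mapped, reduced = [], []
lock = threading.Lock()

def worker(i):
    with lock:
        mapped.append(i * i)             # step 1: map
    barrier.wait()                       # block until all N workers arrive
    with lock:
        reduced.append(sum(mapped))      # step 2 sees the complete map output

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(reduced)   # every entry is 0 + 1 + 4 + 9 = 14
```

The cost noted above is visible here: all workers idle at the barrier until the slowest one finishes, which is exactly the loss of parallelism.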
A simple problem
■ Find transactions with sale >= 10
■ Which patterns will you use?
■ How will you parallelize?

Product  Sale
P1       10
P2       15
P1       5
P2       40
P5       15
P1       55
P2       10
P5       30
P3       25
P3       15
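One possible answer, sketched in Python: the filter is embarrassingly parallel, so each partition of the transaction table can be filtered by an independent map task and the outputs concatenated. No reduce step is needed.

```python
# Map-only parallelization sketch for "find transactions with sale >= 10".
# Data is the Product/Sale table from the slide.
transactions = [("P1", 10), ("P2", 15), ("P1", 5), ("P2", 40), ("P5", 15),
                ("P1", 55), ("P2", 10), ("P5", 30), ("P3", 25), ("P3", 15)]

# Split the input into partitions, one per (conceptual) mapper.
partitions = [transactions[i:i + 5] for i in range(0, len(transactions), 5)]

def map_filter(part):
    # Each mapper filters its own partition independently.
    return [t for t in part if t[1] >= 10]

# Concatenate the per-partition outputs; no coordination needed.
result = [t for part in partitions for t in map_filter(part)]
print(result)
```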
A harder problem (same Product/Sale table as above)
■ Find total sales by category for transactions with sale >= 10
■ Which patterns will you use?
■ How to parallelize?
The programmer should not be burdened with all these details => we need an abstraction.
Map-reduce
Abstraction
■ Map tasks →
– Mapper task 1 input: P1 [key], 10 [sale value]; P2, 15; P1, 5
– Output: PC1, 10; PC2, 15; PC1, 5
■ After the shuffle, the reducer for key PC2 receives: 15; 40; 15; 30
Can it be optimized further?
Data locality
■ Hadoop will try to execute the mapper on a node where the data block resides.
– If those nodes (think of the replicas) are not available, Hadoop will try to pick a node that is closest to the node hosting the data block.
– It could pick another node in the same rack, for example.
■ The mapper writes its output to the local disk of the machine it is running on.
– This is temporary data, also called the intermediate output.
■ As a mapper finishes, its output travels from the mapper node to the reducer node. This movement of output from mapper nodes to reducer nodes is called the shuffle.
The RecordReader’s responsibility is to keep reading the input and converting it into key-value pairs, which are sent to the mapper, until the end of the input is reached.
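A simplified Python analogue of a RecordReader (not the real Hadoop API) reads an input split and keeps yielding key-value pairs until the end, with the byte offset as the key and the line as the value:

```python
# Simplified RecordReader analogue: convert an input split into
# (byte offset, line) key-value pairs, one per record, until the end.
import io

def record_reader(stream):
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")   # key = byte offset, value = line
        offset += len(line)

# A StringIO stands in for one input split of a larger file.
split = io.StringIO("P1,10\nP2,15\nP1,5\n")
records = list(record_reader(split))
print(records)
```

Each pair produced here is what a framework would hand to one invocation of the mapper.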
■ Optional reading
– Data-Intensive Text Processing with MapReduce
■ Sections 2.1 to 2.4