BDA-4 MapReduce v.2
22-01-2021 prakash.parmar@vit.edu.in 2
MapReduce, Why?
• MapReduce is a programming paradigm.
• Traditional programming models work only when the data is kept on a single machine. When the data is
spread across multiple machines in a distributed manner, a new programming model is needed to solve
the problem.
• MapReduce is a computing paradigm for processing data that resides on many machines.
MapReduce: Working Mechanism
• Hadoop works on the principle of data locality, i.e., the data is processed where it is stored.
• In MapReduce the code is moved to the data rather than the data to the code, which makes processing faster.
• A data block is 128 MB, while the code is only around 20 KB, so shipping the code is far cheaper.
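A quick back-of-the-envelope comparison of the two sizes quoted above shows why moving the code is the cheaper choice. This is only an illustrative calculation; it assumes binary units (1 MB = 1024 KB), and the variable names are made up for the example.

```python
# Sizes quoted on the slide; binary units (1 MB = 1024 KB) are an assumption.
BLOCK_SIZE_KB = 128 * 1024   # one data block: 128 MB
CODE_SIZE_KB = 20            # the job code: about 20 KB

# How many times larger a single data block is than the code shipped to it.
ratio = BLOCK_SIZE_KB / CODE_SIZE_KB
print(f"One block is about {ratio:.0f} times larger than the code")
```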
MapReduce: Work Flow
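[Figure: the input is divided into Split 0, Split 1 and Split 2, processed by Mapper 0, Mapper 1 and Mapper 2; the mapper outputs are sent to Reducer 0 and Reducer 1, which produce Out 0 and Out 1.]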
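The work flow above can be sketched as a small in-memory simulation: one mapper call per split, a partition step that decides which reducer receives each key, and one output per reducer. This is a minimal sketch, not Hadoop's implementation; `partition`, `run_job` and the sample splits are hypothetical names and data chosen for illustration, and the CRC32-based partitioning is an assumption standing in for whatever hash a real framework uses.

```python
from collections import defaultdict
import zlib


def partition(key, num_reducers):
    # Stable hash so the same key always goes to the same reducer (assumed scheme).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    # One bucket of grouped values per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                       # one mapper per split
        for key, value in map_fn(split):
            buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer processes its own keys and writes its own output.
    return [
        {key: reduce_fn(key, values) for key, values in sorted(bucket.items())}
        for bucket in buckets
    ]


def map_fn(split):
    # Emit (token, 1) for every token in the split.
    return [(token, 1) for token in split.split()]


def reduce_fn(key, values):
    return sum(values)


splits = ["a b", "b c", "c a"]                 # Split 0, Split 1, Split 2
outputs = run_job(splits, map_fn, reduce_fn)   # Out 0, Out 1
for index, out in enumerate(outputs):
    print(f"Out {index}: {out}")
```

Every key ends up in exactly one output, so the outputs can simply be concatenated to obtain the full result.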
MapReduce: Word Count
[Figure: word count example showing the input, the mapping phase, the shuffling phase, the reducing phase and the output.]
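The three phases of the word count can be written out as three small functions. This is a minimal single-process sketch; the function names and the sample input lines are assumptions made for the example, not taken from the slide's figure.

```python
from collections import defaultdict


def map_phase(line):
    # Mapping: emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Shuffling: group all values that share the same key.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped):
    # Reducing: sum the list of counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}


lines = ["deer bear river", "car car river", "deer car bear"]  # hypothetical input
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = reduce_phase(grouped)
print(counts)
```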
MapReduce with Combiner: Work Flow
• A combiner, also known as a “mini-reducer”, summarizes the mapper output records that share the same
key before they are passed to the reducer.
• When a MapReduce job runs on a huge dataset, the mappers create large chunks of intermediate
data, which the framework passes on to the reducers for further processing. This leads to heavy
network congestion. The Hadoop framework offers the combiner function, which plays a key role
in reducing this congestion.
• The main job of the combiner is to process the output data of the mapper before it is passed
to the reducer. It runs after the mapper and before the reducer, and its use is optional.
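The effect of a combiner can be illustrated by counting how many records leave the mappers with and without local aggregation. This is a minimal sketch under the assumption of a word count job; `mapper`, `combiner`, `reducer` and the two sample splits are hypothetical and only stand in for the real Hadoop components.

```python
from collections import Counter


def mapper(split):
    # Emit (word, 1) for every word in this split.
    return [(word, 1) for word in split.split()]


def combiner(pairs):
    # Sum the counts per word locally, on the mapper's side.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reducer(pairs):
    # Final aggregation over everything the mappers sent.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


splits = ["car car car bear", "car bear bear river"]  # hypothetical input

without_combiner = [pair for split in splits for pair in mapper(split)]
with_combiner = [pair for split in splits for pair in combiner(mapper(split))]

print("records sent without combiner:", len(without_combiner))
print("records sent with combiner:", len(with_combiner))
print(reducer(with_combiner))
```

The reducer returns the same totals in both cases, which matters because the combiner is optional: a job has to be correct whether or not it runs.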
MapReduce with Combiner
Advantages
• Use of a combiner decreases the time taken for data transfer between mapper and reducer.
• A combiner increases the overall performance of the reducer.
• It reduces the amount of data that the reducer has to process.
Disadvantages
• When Hadoop stores the key-value pairs in the native filesystem and runs the combiner later,
this results in expensive disk I/O.
• MapReduce jobs cannot rely on the combiner, as there is no guarantee that it will be executed.
MapReduce Use Case: Election Vote Counting
Votes are stored at different booths, and the result centre has the details of all the booths.
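The analogy maps directly onto MapReduce: each booth counts its own votes locally (map), and the result centre adds up the booth tallies (reduce). The sketch below assumes made-up booth names, candidates and votes; `count_booth` and `result_centre` are hypothetical helper names.

```python
from collections import Counter


def count_booth(votes):
    # Map: each booth counts its own ballots where they are stored.
    return Counter(votes)


def result_centre(booth_tallies):
    # Reduce: the result centre adds up the tallies of all booths.
    total = Counter()
    for tally in booth_tallies:
        total.update(tally)
    return total


# Hypothetical ballots, one list per booth.
booths = {
    "booth_1": ["A", "B", "A"],
    "booth_2": ["B", "B", "A"],
    "booth_3": ["A", "A", "C"],
}

tallies = [count_booth(votes) for votes in booths.values()]
totals = result_centre(tallies)
winner, winner_votes = totals.most_common(1)[0]
print(dict(totals), "winner:", winner)
```

Only the small per-booth tallies travel to the result centre, not the individual ballots.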
Real-Time Uses of MapReduce
Algorithms Using MapReduce: Matrix-Vector Multiplication
Matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing.
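For the matrix-vector case, one common formulation has the mapper emit the partial product (i, m_ij * v_j) for each matrix element and the reducer sum the partial products per row i. The sketch below assumes the whole vector is available to every mapper and uses a hypothetical vector v = (1, 2); the function names are illustrative only.

```python
from collections import defaultdict


def map_entry(i, j, m_ij, v):
    # Map: each matrix element contributes one partial product to row i.
    return (i, m_ij * v[j])


def multiply(entries, v):
    # Shuffle: group the partial products by row index i.
    partial = defaultdict(list)
    for i, j, m_ij in entries:
        key, value = map_entry(i, j, m_ij, v)
        partial[key].append(value)
    # Reduce: sum the partial products of each row.
    return {i: sum(values) for i, values in sorted(partial.items())}


# Matrix stored as (row, column, value) entries, 1-based indices.
entries = [(1, 1, 10), (1, 2, 20), (2, 1, 30), (2, 2, 40)]
v = {1: 1, 2: 2}  # hypothetical vector, indexed by j

result = multiply(entries, v)
print(result)
```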
Matrix Multiplication using MapReduce
Note:
Most matrices are sparse, so a large number of cells have the value zero. When we represent a matrix as (row, column, value) entries,
we do not need to keep entries for the cells whose value is zero, which saves a large amount of disk space.
Map step for M: each element m_ij is emitted once for every column k of N, with key (i, k) and value (M, j, m_ij).
Here k runs over the columns of N, i is the row number of M and j is the column number of M.

k    i    j    (key, value)
1    1    1    ((1,1), (M, 1, 10))
1    1    2    ((1,1), (M, 2, 20))
1    2    1    ((2,1), (M, 1, 30))
1    2    2    ((2,1), (M, 2, 40))
2    1    1    ((1,2), (M, 1, 10))
2    1    2    ((1,2), (M, 2, 20))
2    2    1    ((2,2), (M, 1, 30))
2    2    2    ((2,2), (M, 2, 40))

Map step for N: each element n_jk is emitted once for every row i of M, with key (i, k) and value (N, j, n_jk).
Here i runs over the rows of M, j is the row number of N and k is the column number of N.

i    j    k    (key, value)
1    1    1    ((1,1), (N, 1, 5))
1    1    2    ((1,2), (N, 1, 6))
1    2    1    ((1,1), (N, 2, 7))
1    2    2    ((1,2), (N, 2, 8))
2    1    1    ((2,1), (N, 1, 5))
2    1    2    ((2,2), (N, 1, 6))
2    2    1    ((2,1), (N, 2, 7))
2    2    2    ((2,2), (N, 2, 8))
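The map step tabulated above can be sketched as two small functions, one per matrix. This is a minimal illustration using the 2 x 2 example values; `map_m`, `map_n` and the dictionary layout of the matrices are assumptions made for the example.

```python
def map_m(i, j, m_ij, n_cols):
    # Emit m_ij once for every column k of N: key (i, k), value ("M", j, m_ij).
    return [((i, k), ("M", j, m_ij)) for k in range(1, n_cols + 1)]


def map_n(j, k, n_jk, m_rows):
    # Emit n_jk once for every row i of M: key (i, k), value ("N", j, n_jk).
    return [((i, k), ("N", j, n_jk)) for i in range(1, m_rows + 1)]


# Matrices stored sparsely as {(row, column): value}, 1-based indices.
M = {(1, 1): 10, (1, 2): 20, (2, 1): 30, (2, 2): 40}
N = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}

pairs = []
for (i, j), value in M.items():
    pairs.extend(map_m(i, j, value, n_cols=2))
for (j, k), value in N.items():
    pairs.extend(map_n(j, k, value, m_rows=2))

for pair in pairs:
    print(pair)
```

Each of the 4 elements of M and of N is emitted twice, giving the 16 key-value pairs listed in the two tables.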
The shuffle returns a set of (key, value) pairs in which each key (i, k) has a list of the values emitted from M and N.
Reduce step:
For each key (i, k), sort the values in the M and N lists by j, multiply the entries with the same j, and add the products.
((1,1), 10*5+20*7) => ((1,1), 190)
((1,2), 10*6+20*8) => ((1,2), 220)
((2,1), 30*5+40*7) => ((2,1), 430)
((2,2), 30*6+40*8) => ((2,2), 500)
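Result:
$$M_{ij} = \begin{pmatrix} 10 & 20 \\ 30 & 40 \end{pmatrix}, \quad N_{jk} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}, \quad M \times N = \begin{pmatrix} 190 & 220 \\ 430 & 500 \end{pmatrix}$$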
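The shuffle and reduce steps for this example can be checked with a short script. It is a sketch under the assumption that the mapper output has the ((i, k), (tag, j, value)) shape used above; `shuffle` and `reduce_cell` are hypothetical names, and the mapper output is regenerated compactly so that the block runs on its own.

```python
from collections import defaultdict

M = [[10, 20], [30, 40]]
N = [[5, 6], [7, 8]]

# Mapper output in the ((i, k), (tag, j, value)) form, with 1-based indices.
pairs = [((i + 1, k + 1), ("M", j + 1, M[i][j]))
         for i in range(2) for j in range(2) for k in range(2)]
pairs += [((i + 1, k + 1), ("N", j + 1, N[j][k]))
          for i in range(2) for j in range(2) for k in range(2)]


def shuffle(pairs):
    # Group all values that share the same key (i, k).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_cell(values):
    # Split the list by tag, match entries on j, multiply and sum.
    m_values = {j: v for tag, j, v in values if tag == "M"}
    n_values = {j: v for tag, j, v in values if tag == "N"}
    return sum(m_values[j] * n_values[j] for j in sorted(m_values) if j in n_values)


product = {key: reduce_cell(values) for key, values in sorted(shuffle(pairs).items())}
print(product)
```

Skipping a j that is missing from either list is what lets the same reducer work on sparse matrices, where zero cells are simply not stored.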
Failure in MapReduce
• Master failure
  • Single point of failure; resume from the execution log
• Robust
  • Google’s experience: once lost 1600 of 1800 machines, but the job finished fine
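The idea of resuming from an execution log can be illustrated with a toy scheduler that records each finished task and skips logged tasks after a restart. This is only a sketch of the principle, not how Hadoop or Google's system implements recovery; `run_tasks`, the in-memory dictionary used as the log and the simulated crash are all assumptions made for the example (a real log would be kept on durable storage).

```python
def run_tasks(tasks, log, fail_after=None):
    # Run tasks in order, skipping any task already recorded in the log.
    # fail_after simulates a master crash after that many tasks in this run.
    done_now = 0
    for task_id, work in tasks:
        if task_id in log:
            continue                      # finished before the crash
        if fail_after is not None and done_now == fail_after:
            raise RuntimeError("master crashed")
        log[task_id] = work()             # record the result in the log
        done_now += 1
    return log


executed = []                             # order in which tasks actually ran


def make_task(task_id):
    def work():
        executed.append(task_id)
        return f"result of {task_id}"
    return work


tasks = [(f"task_{n}", make_task(f"task_{n}")) for n in range(4)]
log = {}                                  # stand-in for a persistent execution log

try:
    run_tasks(tasks, log, fail_after=2)   # first attempt crashes after 2 tasks
except RuntimeError:
    pass

run_tasks(tasks, log)                     # restart: resume from the log
print(executed)
```

After the restart only the two unfinished tasks are executed, so no completed work is repeated.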
Learn Fundamentals &
Enjoy Engineering