BDA-4 MapReduce v.2

The document provides an overview of MapReduce, a framework for processing large data sets in a distributed manner, highlighting its core functions (Map and Reduce) and workflow. It discusses the advantages of using MapReduce, including its efficiency in handling large data and its application in scenarios like election vote counting and matrix multiplication. Additionally, it addresses the role of combiners in reducing network congestion and the handling of failures within the MapReduce framework.

Uploaded by

Karthi Devendra

Big Data Analytics

Academic Year 2022-23


Index
1. MapReduce
2. MapReduce Workflow
3. MapReduce Word Count

Subject Teacher: Prof. Prakash Parmar, Assistant Professor, CMPN


22-01-2021 prakash.parmar@vit.edu.in 1
MapReduce: Data Processing using Programming
• MapReduce is a framework for writing applications that process large data sets in a
distributed manner using parallel algorithms.
• It is a core component of the Apache Hadoop ecosystem.
• MapReduce has two main functions:
1. Map
2. Reduce
• Both Map and Reduce work only on key-value pairs.
Ex. (Roll_No, Name)
(1, Sohan)
(2, Mohan)
• In both functions, the input is a key-value pair and the output is also a key-value pair:
(k, v) -> Map -> (k, v) -> Reduce -> (k, v)
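
The (k, v) -> Map -> (k, v) -> Reduce -> (k, v) flow can be sketched in a few lines of plain Python. This is only a single-process illustration, not Hadoop's API: `run_mapreduce`, `first_letter_mapper`, and `count_reducer` are hypothetical names, and the job itself (counting names by first letter) is an assumed example built on the (Roll_No, Name) pairs above.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every (k, v) record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Sort the keys so the output order is deterministic.
    return [reducer(key, groups[key]) for key in sorted(groups)]

# Input key-value pairs in the (Roll_No, Name) form used on the slide.
students = [(1, "Sohan"), (2, "Mohan")]

def first_letter_mapper(roll_no, name):
    # Hypothetical map step: emit (first letter of the name, 1).
    yield (name[0], 1)

def count_reducer(letter, ones):
    # Hypothetical reduce step: sum the 1s collected for each letter.
    return (letter, sum(ones))

result = run_mapreduce(students, first_letter_mapper, count_reducer)
print(result)  # [('M', 1), ('S', 1)]
```

Every stage consumes and produces key-value pairs, which is what lets the framework insert grouping between the two user-written functions.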

MapReduce, Why?
• MapReduce is a programming paradigm.
• Traditional programming models work only when the data is kept on a single machine. When the data
is spread across multiple machines in a distributed manner, we need a new programming model to
solve the problem.

• MapReduce is a computing paradigm for processing data that resides on many machines.

MapReduce: Working Mechanism
• Hadoop works on the principle of data locality, i.e., the data is processed where it is stored.
• In MapReduce the code is moved to the data, not the data to the code, which makes processing fast.
• A data block is 128 MB, whereas the code is only around 20 KB.
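
A quick back-of-the-envelope calculation shows why moving code is cheaper than moving data. The two sizes are taken from the slide; treating 1 MB as 1024 KB is an assumption of this sketch.

```python
BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # one 128 MB data block (from the slide)
CODE_SIZE_BYTES = 20 * 1024           # roughly 20 KB of job code (from the slide)

# How many times more bytes would cross the network if the block moved instead of the code.
ratio = BLOCK_SIZE_BYTES / CODE_SIZE_BYTES
print(f"One block is about {ratio:.0f}x larger than the code")
```

Shipping the code to the node that already holds the block therefore transfers thousands of times fewer bytes per block than the reverse.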

MapReduce: Work Flow

Input -> Split 0 -> Mapper 0
Input -> Split 1 -> Mapper 1
Input -> Split 2 -> Mapper 2

Mapper 0, 1, 2 -> Reducer 0 -> Out 0
Mapper 0, 1, 2 -> Reducer 1 -> Out 1

MapReduce: Word Count

Input: word_file.txt (300 MB)

Car Bike Car Motor
Car Bike Motor
Car Bike Car
Bike Car
Bike Car Bike

Expected output:

'Car': 7,
'Bike': 6,
'Motor': 2

MapReduce: Word Count

word_file.txt (300 MB), default block size 128 MB, so the file is divided into 3 blocks.

Block   Node     Record Reader output                                   Mapper
b1      Node-1   (00, "Car Bike Car Motor"), (23, "Car Bike Motor")     M1
b2      Node-2   (56, "Car Bike Car"), (52, "Bike Car")                 M2
b3      Node-3   (77, "Bike Car Bike")                                  M3

• Record Reader: takes each line as input, adds a dummy key, and returns key-value pairs as
output. This is done by the framework internally.
• Mapper: the mapper program/logic has to be written by the developer.
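
The block count and the Record Reader step can be sketched as follows. This is a minimal illustration rather than Hadoop's implementation: `num_blocks` and `record_reader` are hypothetical helpers, and the key used here is the byte offset of each line (an assumption modelled on how Hadoop's default text input keys its records), so the keys differ from the dummy values shown on the slide.

```python
import math

def num_blocks(file_size_mb, block_size_mb=128):
    """Number of blocks needed to store a file; the last block may be partly filled."""
    return math.ceil(file_size_mb / block_size_mb)

def record_reader(text):
    """Yield (byte_offset, line) pairs, one per line of the input text."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line.encode("utf-8"))

print(num_blocks(300))  # 3

sample = "Car Bike Car Motor\nCar Bike Motor\n"
records = list(record_reader(sample))
print(records)  # [(0, 'Car Bike Car Motor'), (19, 'Car Bike Motor')]
```

The mapper ignores these keys for word count, which is why any unique dummy key works equally well.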

MapReduce: Word Count
Mapping Phase

Record Reader output (done by framework internally) -> Mapper output (done by developer)
The mapper ignores the key and only focuses on the value part.

(00, "Car Bike Car Motor") -> ('Car', 1), ('Bike', 1), ('Car', 1), ('Motor', 1)
(23, "Car Bike Motor")     -> ('Car', 1), ('Bike', 1), ('Motor', 1)
(56, "Car Bike Car")       -> ('Car', 1), ('Bike', 1), ('Car', 1)
(52, "Bike Car")           -> ('Bike', 1), ('Car', 1)
(77, "Bike Car Bike")      -> ('Bike', 1), ('Car', 1), ('Bike', 1)

Shuffling Phase (done by framework internally)

Shuffling: all 15 mapper output pairs are collected.
Sorting: the pairs are ordered by key:
('Bike', 1) x 6
('Car', 1) x 7
('Motor', 1) x 2

Reducing Phase

Aggregating                      Reducer (done by developer)
{'Bike': [1,1,1,1,1,1]}          {'Bike': 6}
{'Car': [1,1,1,1,1,1,1]}         {'Car': 7}
{'Motor': [1,1]}                 {'Motor': 2}
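
The three phases above can be reproduced with a short single-process sketch. It only imitates what the framework does across nodes; the function names `mapper`, `shuffle_and_sort`, and `reducer` are illustrative, and the dummy keys are simply line indexes.

```python
from collections import defaultdict

lines = [
    "Car Bike Car Motor",
    "Car Bike Motor",
    "Car Bike Car",
    "Bike Car",
    "Bike Car Bike",
]

def mapper(_key, line):
    # Mapping phase: ignore the key, emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffling phase: group the values by key and order the groups by key.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return sorted(groups.items())

def reducer(word, ones):
    # Reducing phase: sum the list of 1s for each word.
    return (word, sum(ones))

mapped = [pair for key, line in enumerate(lines) for pair in mapper(key, line)]
grouped = shuffle_and_sort(mapped)
counts = dict(reducer(word, ones) for word, ones in grouped)
print(counts)  # {'Bike': 6, 'Car': 7, 'Motor': 2}
```

Only `mapper` and `reducer` correspond to code the developer writes; `shuffle_and_sort` stands in for work the framework performs.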

MapReduce with Combiner: Work Flow

• A Combiner, also known as a “Mini-Reducer”, summarizes the Mapper output records that share the
same key before they are passed to the Reducer.
• When a MapReduce job runs on a huge dataset, the Mappers create large chunks of intermediate
data, which the framework then passes to the Reducers for further processing. This leads to
heavy network congestion. The Hadoop framework offers a function known as the Combiner
that plays a key role in reducing this congestion.
• The main job of the Combiner, a “Mini-Reducer”, is to process the output data from the Mapper
before it is passed to the Reducer. It runs after the Mapper and before the Reducer. Its use is optional.
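
To see how much intermediate data a Combiner can save, the sketch below reuses the word-count input, assuming the lines are split across three mappers M1, M2, and M3. The helper names `map_words` and `combine` are hypothetical, and the code runs in one process purely for illustration.

```python
from collections import Counter

# Assumed assignment of the word-count lines to three mappers.
mapper_inputs = {
    "M1": ["Car Bike Car Motor", "Car Bike Motor"],
    "M2": ["Car Bike Car", "Bike Car"],
    "M3": ["Bike Car Bike"],
}

def map_words(lines):
    # Mapper: emit (word, 1) for every word.
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    # Combiner: sum the counts per word locally, on the mapper's own node.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

without_combiner = sum(len(map_words(lines)) for lines in mapper_inputs.values())
with_combiner = sum(len(combine(map_words(lines))) for lines in mapper_inputs.values())
print(without_combiner, with_combiner)  # 15 7

# The reducer still reaches the same totals from the combined pairs.
totals = Counter()
for lines in mapper_inputs.values():
    for word, count in combine(map_words(lines)):
        totals[word] += count
print(dict(totals))
```

Summation is associative and commutative, which is why the same logic can safely run both as a combiner and as the reducer. Because the combiner is optional, the job must still be correct when it never runs.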

MapReduce with Combiner
Advantages

• Using a combiner decreases the time taken for data transfer between mapper and reducer.
• A combiner increases the overall performance of the reducer.
• It reduces the amount of data that the reducer has to process.

Disadvantages

• When Hadoop stores the key-value pairs in the native filesystem and runs the combiner
later, this results in expensive disk I/O.
• MapReduce jobs can’t rely on the combiner being executed, as there is no guarantee that it
will run.

MapReduce Use Case: Election Vote Counting
Votes are stored at different booths, and the result centre has the details of all the booths.

Counting: Traditional Approach

• Votes are moved to the result centre for counting.
• Moving all the votes to the centre is costly.
• The result centre is overburdened.
• Counting takes time.

MapReduce Use Case: Election Vote Counting
Votes are stored at different booths, and the result centre has the details of all the booths.

Counting: MapReduce Approach

• Votes are counted at the individual booths.
• Booth-wise results are sent to the result centre.
• The final result is declared easily and quickly this way.
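
The booth analogy maps directly onto code: each booth produces a local tally, and the result centre only adds up the tallies. The ballots below are made-up illustrative data, and `count_at_booth` and `result_centre` are hypothetical names.

```python
from collections import Counter

# Hypothetical ballots per booth (candidates "A" and "B" are placeholders).
booths = {
    "booth_1": ["A", "B", "A"],
    "booth_2": ["B", "B", "A"],
    "booth_3": ["A", "A"],
}

def count_at_booth(ballots):
    # Local step: count the votes where they are stored.
    return Counter(ballots)

def result_centre(booth_tallies):
    # Central step: add up the small booth-wise tallies.
    final = Counter()
    for tally in booth_tallies:
        final.update(tally)
    return final

tallies = [count_at_booth(ballots) for ballots in booths.values()]
final_result = result_centre(tallies)
print(dict(final_result))  # {'A': 5, 'B': 3}
```

Only one small tally per booth travels to the centre, instead of every individual ballot.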

Algorithms using MapReduce: Matrix-Vector Multiplication

Matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing.

Matrix Multiplication using MapReduce

Matrix Data Model for MapReduce

$$M_{ij} = \begin{bmatrix} 10 & 20 \\ 30 & 40 \end{bmatrix}, \qquad N_{jk} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$$

As input data files, we store matrices M and N on HDFS in the following format:

(M, i, j, m_ij) and (N, j, k, n_jk)

(M, 1, 1, 10), (M, 1, 2, 20), (M, 2, 1, 30), ...
(N, 1, 1, 5), (N, 1, 2, 6), (N, 2, 1, 7), ...

Note:
Most matrices are sparse, so a large number of cells have the value zero. When we represent matrices in this form, we do not need to keep
entries for the cells whose value is zero, which saves a large amount of disk space.
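
A small sketch of this record format, assuming the matrices start out as nested Python lists. The helper `to_records` is hypothetical, and the matrix labelled "S" is made-up data included only to show zero entries being skipped.

```python
def to_records(name, matrix):
    """Convert a matrix into (name, row, column, value) records, skipping zeros."""
    return [
        (name, i, j, value)
        for i, row in enumerate(matrix, start=1)
        for j, value in enumerate(row, start=1)
        if value != 0
    ]

M = [[10, 20], [30, 40]]
N = [[5, 6], [7, 8]]

m_records = to_records("M", M)
n_records = to_records("N", N)
print(m_records)  # [('M', 1, 1, 10), ('M', 1, 2, 20), ('M', 2, 1, 30), ('M', 2, 2, 40)]

# Hypothetical sparse matrix: only the single non-zero cell is stored.
sparse_records = to_records("S", [[0, 3], [0, 0]])
print(sparse_records)  # [('S', 1, 2, 3)]
```

Row and column numbers start at 1 to match the indexing used on the slide.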

Matrix Multiplication using MapReduce

Mapper code for Matrix M

for each element m_ij of M do
    produce (key, value) pairs ((i, k), (M, j, m_ij))
    for k = 1, 2, ... up to the number of columns of matrix N

k = column of N   i = row of M   j = column of M   (key, value) = ((i, k), (M, j, m_ij))
k=1               i=1            j=1               ((1,1), (M, 1, 10))
k=1               i=1            j=2               ((1,1), (M, 2, 20))
k=1               i=2            j=1               ((2,1), (M, 1, 30))
k=1               i=2            j=2               ((2,1), (M, 2, 40))
k=2               i=1            j=1               ((1,2), (M, 1, 10))
k=2               i=1            j=2               ((1,2), (M, 2, 20))
k=2               i=2            j=1               ((2,2), (M, 1, 30))
k=2               i=2            j=2               ((2,2), (M, 2, 40))

Mapper code for Matrix N

for each element n_jk of N do
    produce (key, value) pairs ((i, k), (N, j, n_jk))
    for i = 1, 2, ... up to the number of rows of matrix M

i = row of M      j = row of N   k = column of N   (key, value) = ((i, k), (N, j, n_jk))
i=1               j=1            k=1               ((1,1), (N, 1, 5))
i=1               j=1            k=2               ((1,2), (N, 1, 6))
i=1               j=2            k=1               ((1,1), (N, 2, 7))
i=1               j=2            k=2               ((1,2), (N, 2, 8))
i=2               j=1            k=1               ((2,1), (N, 1, 5))
i=2               j=1            k=2               ((2,2), (N, 1, 6))
i=2               j=2            k=1               ((2,1), (N, 2, 7))
i=2               j=2            k=2               ((2,2), (N, 2, 8))
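
The two mappers can be written as Python generators. This is a sketch under the assumption that each input record is a tuple of the form (M, i, j, m_ij) or (N, j, k, n_jk) and that the matrix dimensions are passed in as parameters; `map_m` and `map_n` are illustrative names.

```python
def map_m(record, n_cols):
    """Emit ((i, k), ('M', j, m_ij)) for every column k of N."""
    _, i, j, m_ij = record
    for k in range(1, n_cols + 1):
        yield ((i, k), ("M", j, m_ij))

def map_n(record, m_rows):
    """Emit ((i, k), ('N', j, n_jk)) for every row i of M."""
    _, j, k, n_jk = record
    for i in range(1, m_rows + 1):
        yield ((i, k), ("N", j, n_jk))

m_pairs = list(map_m(("M", 1, 1, 10), n_cols=2))
print(m_pairs)  # [((1, 1), ('M', 1, 10)), ((1, 2), ('M', 1, 10))]

n_pairs = list(map_n(("N", 1, 2, 6), m_rows=2))
print(n_pairs)  # [((1, 2), ('N', 1, 6)), ((2, 2), ('N', 1, 6))]
```

Each element is replicated to every output cell (i, k) that needs it, so all operands for one cell of the product end up under the same key.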

Matrix Multiplication using MapReduce

Return a set of (key, value) pairs in which each key has a list of the values from M and N:

((1,1), [(M, 1, 10), (M, 2, 20)]),
((2,1), [(M, 1, 30), (M, 2, 40)]),
((1,2), [(M, 1, 10), (M, 2, 20)]),
((2,2), [(M, 1, 30), (M, 2, 40)])

((1,1), [(N, 1, 5), (N, 2, 7)]),
((1,2), [(N, 1, 6), (N, 2, 8)]),
((2,1), [(N, 1, 5), (N, 2, 7)]),
((2,2), [(N, 1, 6), (N, 2, 8)])

Matrix Multiplication using MapReduce
Reduce Code:
Sort the values in the M and N lists by j, multiply the M and N values that share the same j, and sum the products.
((1,1), 10*5+20*7) => ((1,1), 190)
((1,2), 10*6+20*8) => ((1,2), 220)
((2,1), 30*5+40*7) => ((2,1), 430)
((2,2), 30*6+40*8) => ((2,2), 500)

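Result:

$$M_{ij} = \begin{bmatrix} 10 & 20 \\ 30 & 40 \end{bmatrix}, \qquad N_{jk} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}, \qquad M \times N = \begin{bmatrix} 190 & 220 \\ 430 & 500 \end{bmatrix}$$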
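
The reduce step can be sketched as below, taking as input the grouped values for each output cell of the 2 x 2 example. The function name `reduce_cell` and the dictionary layout are assumptions made for illustration.

```python
def reduce_cell(key, values):
    """Pair M and N values that share the same j, multiply them, and sum the products."""
    m_by_j = {j: v for tag, j, v in values if tag == "M"}
    n_by_j = {j: v for tag, j, v in values if tag == "N"}
    # A j missing on either side corresponds to a zero entry and contributes nothing.
    total = sum(m_by_j[j] * n_by_j[j] for j in sorted(m_by_j) if j in n_by_j)
    return (key, total)

# Grouped mapper output for the example matrices, keyed by output cell (i, k).
grouped = {
    (1, 1): [("M", 1, 10), ("M", 2, 20), ("N", 1, 5), ("N", 2, 7)],
    (1, 2): [("M", 1, 10), ("M", 2, 20), ("N", 1, 6), ("N", 2, 8)],
    (2, 1): [("M", 1, 30), ("M", 2, 40), ("N", 1, 5), ("N", 2, 7)],
    (2, 2): [("M", 1, 30), ("M", 2, 40), ("N", 1, 6), ("N", 2, 8)],
}

product = dict(reduce_cell(key, values) for key, values in grouped.items())
print(product)  # {(1, 1): 190, (1, 2): 220, (2, 1): 430, (2, 2): 500}
```

Looking values up by j rather than relying on list order keeps the reducer correct for sparse inputs where some entries are absent.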

Failure in MapReduce

• Failures are the norm on commodity hardware

• Worker failure
    • Detect failure via periodic heartbeats
    • Re-execute in-progress map/reduce tasks

• Master failure
    • Single point of failure; resume from the execution log

• Robust
    • Google’s experience: once lost 1600 of 1800 machines, but the job still
      finished fine
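
Heartbeat-based detection of worker failure can be illustrated with a toy model. Everything here is hypothetical: the worker names, timestamps, timeout value, and the round-robin rescheduling policy are assumptions for the sketch and do not reflect how Hadoop or Google's implementation actually schedules tasks.

```python
def find_failed_workers(last_heartbeat, now, timeout):
    """Return the workers whose most recent heartbeat is older than the timeout."""
    return sorted(w for w, seen in last_heartbeat.items() if now - seen > timeout)

def reschedule(assignments, failed, healthy):
    """Reassign the tasks of failed workers to healthy workers in round-robin order."""
    updated = dict(assignments)
    orphaned = sorted(task for task, worker in assignments.items() if worker in failed)
    for index, task in enumerate(orphaned):
        updated[task] = healthy[index % len(healthy)]
    return updated

# Hypothetical master state: time of the last heartbeat per worker, in seconds.
last_heartbeat = {"worker-1": 100.0, "worker-2": 91.0, "worker-3": 99.5}
assignments = {"map-0": "worker-1", "map-1": "worker-2", "reduce-0": "worker-3"}

failed = find_failed_workers(last_heartbeat, now=101.0, timeout=5.0)
healthy = sorted(w for w in last_heartbeat if w not in failed)
updated = reschedule(assignments, failed, healthy)

print(failed)   # ['worker-2']
print(updated)  # {'map-0': 'worker-1', 'map-1': 'worker-1', 'reduce-0': 'worker-3'}
```

The task that was in progress on the silent worker is simply executed again elsewhere, which is safe because map and reduce tasks can be re-run from their inputs.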

Learn Fundamentals &
Enjoy Engineering

Prof. Prakash Parmar


Assistant Professor
Computer Engineering Department
Vidyalankar Classes CSE GATE Faculty
