
BIG DATA

(2021-2022, I-SEMESTER)

MapReduce

By
Dr. Tene Ramakrishnudu
Assistant Professor
Department of Computer Science & Engineering
National Institute of Technology (NIT), Warangal, TS, India
MapReduce

❖The authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as
▪ crawled documents,
▪ web request logs, etc.,
to compute various kinds of derived data, such as
▪ inverted indices,
▪ various representations of the graph structure of web documents,
▪ summaries of the number of pages crawled per host,
▪ the set of most frequent queries in a given day, etc.

MapReduce

❖The input data is large.

❖The computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time.

❖How to parallelize the computation?
❖How to distribute the data?
❖How to handle failures?
❖These issues conspire to obscure the original simple computation with large amounts of complex code.

❖To deal with these issues…
MapReduce

❖Google researchers (Jeffrey Dean and Sanjay Ghemawat) designed a new abstraction.

❖MapReduce was created
▪ MapReduce: Simplified Data Processing on Large Clusters [3]

❖Allows users to express their simple computations.

❖It hides the messy details of
▪ parallelization, fault-tolerance, data distribution and load balancing.
MapReduce

❖MapReduce is a programming model and an associated implementation for processing and generating large data sets.

❖MapReduce is a programming model for data processing.

❖A platform for reliable, scalable parallel computing.

❖Abstracts issues of the distributed and parallel environment from the programmer.
MapReduce

❖Users specify the computation in terms of a map and a reduce function.

❖The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.

❖It handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
MapReduce

❖An average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day,

❖processing a total of more than twenty petabytes of data per day.
MapReduce

❖Programming Model:

❖The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

❖The user of the MapReduce library expresses the computation as two functions: map and reduce.

❖Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.

❖The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the reduce function.
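
Conceptually, the types of the two functions, as given in [3], are:

  map     (k1, v1)        → list(k2, v2)
  reduce  (k2, list(v2))  → list(v2)

The input keys and values are drawn from a different domain than the intermediate keys and values, while the intermediate and output keys and values come from the same domain.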
MapReduce

❖The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key.

❖It merges these values together to form a possibly smaller set of values.

❖Typically just zero or one output value is produced per reduce invocation.

❖The intermediate values are supplied to the user’s reduce function via an iterator.

❖This allows us to handle lists of values that are too large to fit in memory.
MapReduce

[Figure: MapReduce execution overview. Source: [3][4]]
MapReduce
❖Map-Reduce programming: (When the user program calls the MapReduce function, the following sequence of actions occurs.)

❖1. The MapReduce library in the user program
▪ first splits the input files into M pieces,
▪ each split typically 16 to 64 megabytes (MB) per piece (controllable by the user via an optional parameter),
▪ and then starts up many copies of the program on a cluster of machines.

❖2. One of the copies of the program is special:
▪ the master.
▪ The rest are workers that are assigned work by the master.
❖There are M map tasks and R reduce tasks to assign.
❖The master picks idle workers and assigns each one a map task or a reduce task.
MapReduce

❖3. A worker who is assigned a map task reads the contents of the corresponding input split.
❖It parses key/value pairs out of the input data and passes each pair to the user-defined Map function.
❖The intermediate key/value pairs produced by the Map function are buffered in memory.

❖4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (see the sketch below).
❖The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
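
As a sketch, the default partitioning function described in [3] simply hashes the key modulo R; in Python-like form:

  def partition(key, R):
      # send each intermediate key to one of the R reduce regions
      return hash(key) % R

Users can supply their own partitioning function when a different grouping (e.g., all URLs from one host in the same output file) is desired.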
MapReduce

❖5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.
❖When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
❖The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used. A sketch of the grouping pass follows.
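
To illustrate (a minimal single-machine sketch, not the actual implementation): once the pairs are sorted, equal keys are adjacent, so one linear pass groups them, and the values can be handed to Reduce as an iterator rather than an in-memory list:

  from itertools import groupby

  sorted_pairs = [("a", 1), ("a", 1), ("b", 1)]    # already sorted by key
  for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
      values = (v for _, v in group)               # an iterator, not a list
      # pass (key, values) to the user's Reduce function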

MapReduce

❖6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user’s Reduce function.
❖The output of the Reduce function is appended to a final output file for this reduce partition.

❖7. When all map tasks and reduce tasks have been completed, the master wakes up the user program.
❖At this point, the MapReduce call in the user program returns back to the user code.
MapReduce

❖The pipeline applied to a big document:
▪ Map: read the input and produce a set of key-value pairs.
▪ Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition).
▪ Reduce: collect all values belonging to each key and output the result.
MapReduce: Example

❖Problem: Counting the number of occurrences of each word in a large collection of documents.

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
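
A minimal runnable sketch of this job in Python, simulating the framework’s shuffle on a single machine (the names map_fn, reduce_fn, and run are illustrative, not part of any real MapReduce API):

  from collections import defaultdict

  def map_fn(doc_name, text):
      # emit (word, 1) for every word in the document
      for word in text.split():
          yield (word, 1)

  def reduce_fn(word, counts):
      # sum all the counts emitted for this word
      return (word, sum(counts))

  def run(documents):
      intermediate = defaultdict(list)      # group by key ("shuffle")
      for name, text in documents.items():
          for word, count in map_fn(name, text):
              intermediate[word].append(count)
      return dict(reduce_fn(w, c) for w, c in intermediate.items())

  print(run({"d1": "big data big ideas", "d2": "big clusters"}))
  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}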
MapReduce: Master Data Structures

❖Master keeps several data structures.

❖For each map task and reduce task, it stores the state
(idle, in-progress, or completed) and the identity of the
worker machine.

MapReduce: Fault Tolerance

❖Worker Failure: The master pings every worker periodically.

❖If no response is received from a worker in a certain amount of time, the master marks the worker as failed.

❖Any map tasks completed by the worker are reset back to their initial idle state and therefore become eligible for scheduling on other workers.

❖Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. A sketch of this reset rule follows.
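
As a sketch (hypothetical names, building on the TaskInfo bookkeeping sketched earlier; map_tasks and reduce_tasks are dicts of TaskInfo records):

  def handle_worker_failure(worker, map_tasks, reduce_tasks):
      # Completed map output lives on the failed worker's local disk,
      # so completed *and* in-progress map tasks must be rescheduled.
      for task in map_tasks.values():
          if task.worker == worker:
              task.state, task.worker = "idle", None
      # Reduce output is in the global file system, so only
      # in-progress reduce tasks need to be reset.
      for task in reduce_tasks.values():
          if task.worker == worker and task.state == "in-progress":
              task.state, task.worker = "idle", None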
MapReduce: Fault Tolerance

❖Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible.

❖Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

❖When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution.

❖Any reduce task that has not already read the data from worker A will read the data from worker B.
MapReduce: Fault Tolerance

❖MapReduce is resilient to large-scale worker failures.

❖For example, during one MapReduce operation, network maintenance on a running cluster was causing groups of 80 machines at a time to become unreachable for several minutes.

❖The MapReduce master simply re-executed the work done by the unreachable worker machines, and continued to make forward progress, eventually completing the MapReduce operation.
MapReduce: Fault Tolerance

❖Master Failure:
❖The master writes periodic checkpoints of the master data structures described above.

❖If the master task dies, a new copy can be started from the last checkpointed state.

❖Since there is only a single master, its failure is unlikely;

❖the implementation simply aborts the MapReduce computation if the master fails.

❖Clients can check for this condition and retry the MapReduce operation if they desire.
Algorithms Using MapReduce

❖Matrix-Vector Multiplication by MapReduce:

❖An n×n matrix M, whose element in row i and column j is denoted m_ij.

❖A vector v of length n, whose jth element is v_j.

❖The matrix-vector product is the vector x of length n, whose ith element x_i is given by

$x_i = \sum_{j=1}^{n} m_{ij} v_j$

❖The matrix M and the vector v will each be stored in a file of the DFS.
Algorithms Using MapReduce

❖The row-column coordinates of each matrix element will be discoverable either from its position in the file, or because it is stored with explicit coordinates, as a triple (i, j, m_ij).

❖The position of element v_j in the vector v will be discoverable.

❖The Map Function:
❖The Map function is written to apply to one element of M.
Algorithms Using MapReduce

❖If v is not already read into main memory at the compute node executing a Map task, then v is first read, in its entirety, and subsequently will be available to all applications of the Map function performed at this Map task.

❖Each Map task will operate on a chunk of the matrix M.

❖From each matrix element m_ij it produces the key-value pair (i, m_ij v_j).

❖Thus, all terms of the sum that make up the component x_i of the matrix-vector product will get the same key, i.
Algorithms Using MapReduce

❖The Reduce Function:
❖The Reduce function simply sums all the values associated with a given key i.

❖The result will be the pair (i, x_i).
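
A minimal single-machine sketch of these two functions (illustrative names; assumes v fits in memory, as above):

  from collections import defaultdict

  def map_fn(i, j, m_ij, v):
      # one intermediate pair per matrix element: key i, value m_ij * v_j
      yield (i, m_ij * v[j])

  def reduce_fn(i, values):
      # sum all the partial products for row i
      return (i, sum(values))

  # driver simulating the framework on M = [[1, 2], [3, 4]], v = [5, 6]
  M_triples, v = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)], [5, 6]
  groups = defaultdict(list)
  for i, j, m_ij in M_triples:
      for key, val in map_fn(i, j, m_ij, v):
          groups[key].append(val)
  print(dict(reduce_fn(i, vals) for i, vals in groups.items()))
  # {0: 17, 1: 39}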

Algorithms Using MapReduce

❖The Vector v Cannot Fit in Main Memory:

❖The vector v is so large that it will not fit in its entirety in main memory.

❖This would cause a very large number of disk accesses as we move pieces of the vector into main memory to multiply components by elements of the matrix.

❖Instead, divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes, of the same height.

❖Our goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node.

❖For example, the matrix and the vector might each be divided into five stripes; a sketch of the stripe assignment follows.
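
As an illustrative sketch (hypothetical helper, assuming the number of stripes divides n evenly): matrix element (i, j, m_ij) goes to the vertical stripe that holds v_j, so each Map task needs only one stripe of v in memory:

  def stripe_of(j, n, num_stripes):
      # index of the vertical stripe (and the matching horizontal
      # stripe of v) that column j belongs to
      width = n // num_stripes    # assumes num_stripes divides n
      return j // width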

References

❖1. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, “Mining of Massive Datasets”.

❖2. Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce”.

❖3. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.

❖4. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Communications of the ACM, Vol. 51, No. 1, 2008.

❖5. Charalampos E. Tsourakakis, “Data Mining with MapReduce: Graph and Tensor Algorithms with Applications”.
Thank You

