1s07 Map Reduce Presentation 2019
Word Count
map(key=url, val=contents):
▫ For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
▫ Sum all “1”s in values list
▫ Emit result “(word, sum)”
Word Count Illustrated
map(key=url, val=contents):
▫ For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
▫ Sum all “1”s in values list
▫ Emit result “(word, sum)”
Input:          Map output:   Reduce output:
see bob throw   see 1         bob 1
see spot run    bob 1         run 1
                throw 1       see 2
                see 1         spot 1
                spot 1        throw 1
                run 1
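The word-count example above can be sketched in plain Python. The `run_mapreduce` driver here is a hypothetical single-process stand-in for the real library (it performs the shuffle by grouping intermediate values by key), not Google's implementation:

```python
from collections import defaultdict

def wc_map(key, contents):
    # For each word w in contents, emit (w, 1)
    for w in contents.split():
        yield (w, 1)

def wc_reduce(key, values):
    # Sum all counts in the values list, emit (word, sum)
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by key,
    # then run the reducer once per key, in sorted key order.
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    out = []
    for k in sorted(groups):
        out.extend(reduce_fn(k, groups[k]))
    return out

docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```

The output matches the reduce column in the illustration above.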
Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
▫ If line matches regexp, emit (line, “1”)
reduce(key=line, values=uniq_counts):
▫ Don’t do anything; just emit line
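A minimal Python sketch of distributed grep under these assumptions (the pattern, log lines, and inline shuffle loop are illustrative, not part of the original slides):

```python
import re

def grep_map(key, line, pattern=re.compile(r"error")):
    # If line matches regexp, emit (line, 1)
    if pattern.search(line):
        yield (line, 1)

def grep_reduce(key, values):
    # Identity reduce: just emit the matching line
    yield key

lines = ["ok: started", "error: disk full", "ok: done", "error: timeout"]
# Input is (url+offset, single line); here offset is the line index.
inputs = [(("log", off), ln) for off, ln in enumerate(lines)]

groups = {}
for k, v in inputs:
    for ik, iv in grep_map(k, v):
        groups.setdefault(ik, []).append(iv)
matches = []
for k in sorted(groups):
    matches.extend(grep_reduce(k, groups[k]))
print(matches)  # ['error: disk full', 'error: timeout']
```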
Reverse Web-Link Graph
Map
For each URL linking to target, …
Output <target, source> pairs
Reduce
Concatenate list of all source URLs
Outputs: <target, list (source)> pairs
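The link-graph reversal above can be sketched as follows; the tiny `links` graph and sorting of sources are illustrative choices, not from the slides:

```python
from collections import defaultdict

def reverse_map(source, targets):
    # For each URL `target` that `source` links to, emit <target, source>
    for target in targets:
        yield (target, source)

def reverse_reduce(target, sources):
    # Concatenate the list of all source URLs for this target
    yield (target, sorted(sources))

# Hypothetical forward link graph: page -> pages it links to
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

groups = defaultdict(list)
for src, targets in links.items():
    for t, s in reverse_map(src, targets):
        groups[t].append(s)
result = dict(p for t in sorted(groups) for p in reverse_reduce(t, groups[t]))
print(result)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```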
Inverted Index
Map
The map function parses each document, and emits a sequence of
〈 word, document ID 〉 pairs.
Reduce
The reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a 〈 word, list(document
ID) 〉 pair.
The set of all output pairs forms a simple inverted index.
It is easy to augment this computation to keep track of
word positions.
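A minimal sketch of the inverted-index computation described above (document IDs and the whitespace tokenizer are illustrative assumptions; real parsing would be richer):

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Parse the document, emit a <word, document ID> pair per occurrence
    for word in text.split():
        yield (word, doc_id)

def index_reduce(word, doc_ids):
    # Accept all pairs for a word; sort (and dedupe) the document IDs,
    # emit a <word, list(document ID)> pair
    yield (word, sorted(set(doc_ids)))

docs = {1: "see bob throw", 2: "see spot run"}

groups = defaultdict(list)
for doc_id, text in docs.items():
    for w, d in index_map(doc_id, text):
        groups[w].append(d)
index = dict(p for w in sorted(groups) for p in index_reduce(w, groups[w]))
print(index["see"])  # [1, 2]
```

Tracking word positions would only require emitting (word, (doc_id, position)) pairs instead.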
Model is Widely Applicable
MapReduce Programs In Google Source Tree
Example uses:
distributed grep        distributed sort        web link-graph reversal
term-vector per host    web access log stats    inverted index construction
document clustering     machine learning        statistical machine translation
...                     ...                     ...
Implementation Overview
Typical cluster:
Effect: thousands of machines read input at local disk speed
▫ Without this, rack switches limit read rate
Refinement
Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
Best solution is to debug & fix
▫ Not always possible, e.g., with third-party source libraries
On segmentation fault:
▫ Send UDP packet to master from signal handler
▫ Include sequence number of record being processed
If master sees two failures for same record:
▫ Next worker is told to skip the record
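The master's bookkeeping for this protocol can be sketched as below; the `SkipTracker` class, its method names, and the failure threshold are hypothetical illustrations of the rule stated above, not the actual implementation:

```python
from collections import defaultdict

class SkipTracker:
    """Hypothetical sketch of the master's bad-record bookkeeping."""

    def __init__(self, max_failures=2):
        self.failures = defaultdict(int)
        self.max_failures = max_failures

    def report_failure(self, record_seq):
        # Called when a worker's signal handler reports (via UDP)
        # the sequence number of the record it was processing.
        self.failures[record_seq] += 1

    def should_skip(self, record_seq):
        # After two failures on the same record, the next worker
        # assigned this task is told to skip that record.
        return self.failures[record_seq] >= self.max_failures

tracker = SkipTracker()
tracker.report_failure(42)
print(tracker.should_skip(42))  # False: first failure, record is retried
tracker.report_failure(42)
print(tracker.should_skip(42))  # True: second failure, skip next time
```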
Other Refinements
Sorting guarantees
within each reduce partition
Compression of intermediate data
Combiner
Useful for saving network bandwidth
Local execution for debugging/testing
User-defined counters
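The combiner's bandwidth saving is easy to see with word count: partial sums are computed on the map worker, so far fewer pairs cross the network to the reducers. A minimal sketch (the function names and sample sentence are illustrative):

```python
from collections import Counter

def wc_map(doc):
    # Raw map output: one (word, 1) pair per occurrence
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # Combiner: pre-sum counts locally on the map worker before
    # the data is sent over the network to the reducers.
    counts = Counter()
    for w, n in pairs:
        counts[w] += n
    return list(counts.items())

pairs = wc_map("to be or not to be")
print(len(pairs))           # 6 pairs before combining
print(len(combine(pairs)))  # 4 pairs after: to:2, be:2, or:1, not:1
```

This works because the word-count reduce function is associative and commutative; the combiner is typically the same function run locally.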
Performance
Tests run on cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyperthreading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
Fun to use: focus on the problem, let the library deal with the messy details