
MapReduce

Dr. Rajiv Misra


Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:

In this lecture, we will discuss the ‘MapReduce paradigm’, its internal working, and an overview of its implementation.

We will also see many examples and different applications of MapReduce in use, and look into how ‘scheduling and fault tolerance’ work inside MapReduce.

Introduction
MapReduce is a programming model and an associated implementation for processing and generating large data sets.

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Many real-world tasks are expressible in this model.
Contd…
Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines.
The run-time system takes care of the details of partitioning the
input data, scheduling the program's execution across a set of
machines, handling machine failures, and managing the required
inter-machine communication.
This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large
distributed system.
A typical MapReduce computation processes many terabytes of
data on thousands of machines. Hundreds of MapReduce
programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day.

Distributed File System
Chunk Servers
– File is split into contiguous chunks
– Typically each chunk is 16-64 MB
– Each chunk is replicated (usually 2x or 3x)
– Try to keep replicas in different racks

Master node
– Also known as the Name Node in HDFS
– Stores metadata
– Might be replicated

Client library for file access
– Talks to the master to find chunk servers
– Connects directly to chunk servers to access data
Motivation for MapReduce (Why)

Large-Scale Data Processing
– Want to use 1000s of CPUs
– But don’t want the hassle of managing things

The MapReduce architecture provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates
MapReduce Paradigm

What is MapReduce?
Terms are borrowed from functional languages (e.g., Lisp).

Sum of squares:
(map square '(1 2 3 4))
Output: (1 4 9 16)
[processes each record sequentially and independently]

(reduce + '(1 4 9 16))
(+ 16 (+ 9 (+ 4 1)))
Output: 30
[processes the set of all records in batches]
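
For readers who know Python better than Lisp, here is the same pair of calls as a minimal, runnable sketch using Python's built-in map and functools.reduce (the variable names are illustrative):

    from functools import reduce
    import operator

    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # (map square '(1 2 3 4)) -> [1, 4, 9, 16]
    total = reduce(operator.add, squares)               # (reduce + '(1 4 9 16)) -> 30
    print(squares, total)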
Let’s consider a sample application, Wordcount: you are given a huge dataset (e.g., a Wikipedia dump or all of Shakespeare’s works) and asked to list the count for each of the words in each of the documents therein.
Map
Process individual records to generate intermediate key/value pairs.

Input <filename, file text>:
    Welcome Everyone
    Hello Everyone

Map output (Key, Value):
    Welcome  1
    Everyone 1
    Hello    1
    Everyone 1
Map
Process individual records in parallel to generate intermediate key/value pairs.

Input <filename, file text>:
    Welcome Everyone   (MAP TASK 1)
    Hello Everyone     (MAP TASK 2)

MAP TASK 1 output:
    Welcome  1
    Everyone 1

MAP TASK 2 output:
    Hello    1
    Everyone 1
Map
Process a large number of individual records in parallel to generate intermediate key/value pairs.

Input <filename, file text>:
    Welcome Everyone
    Hello Everyone
    Why are you here
    I am also here
    They are also here
    Yes, it’s THEM!
    The same people we were thinking of
    …

MAP TASKS output:
    Welcome  1
    Everyone 1
    Hello    1
    Everyone 1
    Why      1
    Are      1
    You      1
    Here     1
    …
Reduce
Reduce processes and merges all intermediate values associated with each key.

Input (Key, Value):
    Welcome  1
    Everyone 1
    Hello    1
    Everyone 1

Reduce output:
    Everyone 2
    Hello    1
    Welcome  1
Reduce
• Each key is assigned to one Reduce task
• Reduce tasks process and merge all intermediate values in parallel by partitioning keys

Input:
    Welcome  1
    Everyone 1
    Hello    1
    Everyone 1

REDUCE TASK 1 output:
    Everyone 2
    Hello    1

REDUCE TASK 2 output:
    Welcome  1

• Popular: hash partitioning, i.e., a key is assigned to
  – reduce # = hash(key) % number of reduce tasks
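
A minimal Python sketch of that assignment (illustrative only; Python's built-in hash is salted per process, so real frameworks use a stable hash):

    def partition(key, num_reduce_tasks):
        # reduce # = hash(key) % number of reduce tasks
        return hash(key) % num_reduce_tasks

    # Every occurrence of the same key lands on the same reduce task
    print(partition("Everyone", 2) == partition("Everyone", 2))  # True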
Programming Model

The computation takes a set of input key/value pairs and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions:

(i) The Map
(ii) The Reduce
(i) Map Abstraction

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.

The MapReduce library groups together all intermediate values associated with the same intermediate key ‘I’ and passes them to the Reduce function.

(ii) Reduce Abstraction
The Reduce function, also written by the user, accepts an intermediate key ‘I’ and a set of values for that key.

It merges together these values to form a possibly smaller set of values.

Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.

This allows us to handle lists of values that are too large to fit in memory.

Map-Reduce Functions for Word Count

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
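
The same logic as a self-contained, runnable Python sketch; the in-process group-by-key stands in for the framework's shuffle (a sketch of the model, not of any particular implementation):

    from collections import defaultdict

    def map_fn(doc_name, text):
        # Emit (word, 1) for each word in the document
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all counts for this word
        return (word, sum(counts))

    def mapreduce(docs):
        groups = defaultdict(list)          # shuffle: group values by key
        for name, text in docs.items():
            for key, value in map_fn(name, text):
                groups[key].append(value)
        return [reduce_fn(k, vs) for k, vs in groups.items()]

    print(mapreduce({"doc1": "Welcome Everyone", "doc2": "Hello Everyone"}))
    # [('Welcome', 1), ('Everyone', 2), ('Hello', 1)]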

Map-Reduce Functions

Input: a set of key/value pairs

User supplies two functions:
    map(k, v) → list(k1, v1)
    reduce(k1, list(v1)) → v2

(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs

MapReduce Applications

Applications
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.

Reverse Web-Link Graph: The map function outputs (target; source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target; list(source)).
Contd…
Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.

The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document).

The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname; term vector) pair.

Contd…
Inverted Index: The map function parses each document and emits a sequence of (word; document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word; list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record and emits a (key; record) pair. The reduce function emits all pairs unchanged.
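
A minimal Python sketch of the inverted-index pair of functions (function names are illustrative, and deduplicating repeated words within a document is an added assumption):

    def map_fn(doc_id, text):
        # Emit (word, doc_id) for each word occurrence in the document
        for word in text.split():
            yield (word, doc_id)

    def reduce_fn(word, doc_ids):
        # Sort (and deduplicate) the document IDs for this word
        return (word, sorted(set(doc_ids)))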

Applications of MapReduce
(1) Distributed Grep:

Input: large set of files
Output: lines that match the pattern

Map – Emits a line if it matches the supplied pattern
Reduce – Copies the intermediate data to the output

Applications of MapReduce
(2) Reverse Web-Link Graph:

Input: Web graph: tuples (a, b) where (page a → page b)
Output: For each page, the list of pages that link to it

Map – processes the web log and, for each input <source, target>, outputs <target, source>
Reduce – emits <target, list(source)>

Applications of MapReduce
(3) Count of URL access frequency:

Input: Log of accessed URLs, e.g., from a proxy server
Output: For each URL, the % of total accesses for that URL

Map – Processes the web log and outputs <URL, 1>
Multiple Reducers – Emit <URL, URL_count>
(So far, like Wordcount. But we still need the %.)

Chain another MapReduce job after the one above:
Map – Processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
1 Reducer – Does two passes. In the first pass, it sums up all URL_counts to calculate overall_count. In the second pass, it calculates the %s and emits multiple <URL, URL_count/overall_count> pairs.
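
A sketch of that single reducer's two passes in Python (names are hypothetical; materializing the values assumes they fit in memory, which routing everything to one key already implies here):

    def reduce_fn(_, url_counts):
        # url_counts: all (URL, URL_count) pairs, routed to the single key 1
        pairs = list(url_counts)                  # materialize for two passes
        overall_count = sum(c for _, c in pairs)  # pass 1: total accesses
        for url, count in pairs:                  # pass 2: per-URL fraction
            yield (url, count / overall_count)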
Applications of MapReduce
(4) Sort
A Map task's output is sorted (e.g., quicksort); a Reduce task's input is sorted (e.g., mergesort).

Input: Series of (key, value) pairs
Output: Sorted <value>s

Map – <key, value> → <value, _> (identity)
Reduce – <key, value> → <key, value> (identity)

Partitioning function – partition keys across reducers based on ranges (can't use hashing!)
• Take the data distribution into account to balance reducer tasks (see the sketch below)
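
A minimal sketch of such a range partitioner (the cut points are illustrative; in practice they would be chosen by sampling the key distribution):

    import bisect

    CUTS = ["g", "n", "t"]  # 4 reducers: keys < "g", ["g","n"), ["n","t"), >= "t"

    def range_partition(key):
        # Keys go to reducers by range, so concatenating the reducer
        # outputs in partition order yields a globally sorted result
        return bisect.bisect(CUTS, key)

    print(range_partition("apple"), range_partition("zebra"))  # 0 3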

MapReduce Scheduling

Programming MapReduce
Externally: For the user
1. Write a Map program (short); write a Reduce program (short)
2. Specify the number of Maps and Reduces (parallelism level)
3. Submit the job; wait for the result
4. Need to know very little about parallel/distributed programming!

Internally: For the paradigm and scheduler
1. Parallelize Map
2. Transfer data from Map to Reduce (shuffle data)
3. Parallelize Reduce
4. Implement storage for Map input, Map output, Reduce input, and Reduce output
(Ensure that no Reduce starts before all Maps are finished; that is, ensure the barrier between the Map phase and the Reduce phase)
Inside MapReduce
For the cloud:
1. Parallelize Map: easy! Each map task is independent of the others.
2. Transfer data from Map to Reduce:
   • Called shuffle data
   • All Map output records with the same key are assigned to the same Reduce task
   • Use a partitioning function, e.g., hash(key) % number of reducers
3. Parallelize Reduce: easy! Each reduce task is independent of the others.
4. Implement storage for Map input, Map output, Reduce input, and Reduce output
   • Map input: from the distributed file system
   • Map output: to local disk (at the Map node); uses the local file system
   • Reduce input: from (multiple) remote disks; uses the local file systems
   • Reduce output: to the distributed file system

local file system = Linux FS, etc.
distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)

Internal Workings of MapReduce
[Figure: Blocks from the DFS feed Map tasks (A, B, C) running on servers. Each map's output is written locally, then remotely read by Reduce tasks (A, B, C) on other servers, which write output files (I, II, III) into the DFS. (Local write, remote read.) A Resource Manager assigns maps and reduces to servers.]


The YARN Scheduler
• Used underneath Hadoop 2.x+
• YARN = Yet Another Resource Negotiator
• Treats each server as a collection of containers
  – Container = fixed CPU + fixed memory
• Has 3 main components
  – Global Resource Manager (RM)
    • Scheduling
  – Per-server Node Manager (NM)
    • Daemon and server-specific functions
  – Per-application (job) Application Master (AM)
    • Container negotiation with RM and NMs
    • Detecting task failures of that job

YARN: How a job gets a container
[Figure: A Resource Manager runs the Capacity Scheduler; the example shows 2 servers (A, B), each with a Node Manager, and 2 jobs (1, 2), each with an Application Master. Steps: 1. Application Master 1 tells the RM it needs a container; 2. Node Manager B reports a completed container to the RM; 3. the RM grants AM 1 a container on Node B; 4. AM 1 asks Node Manager B: "Start task, please!"]

MapReduce Fault-Tolerance

Fault Tolerance
• Server failure
  – NM heartbeats to RM
    • If a server fails, the RM lets all affected AMs know, and the AMs take appropriate action
  – NM keeps track of each task running at its server
    • If a task fails while in progress, it is marked as idle and restarted
  – AM heartbeats to RM
    • On failure, the RM restarts the AM, which then syncs up with its running tasks
• RM failure
  – Use old checkpoints and bring up a secondary RM
• Heartbeats are also used to piggyback container requests
  – Avoids extra messages

Slow Servers
Slow tasks are called stragglers.

The slowest task slows the entire job down (why? because the job is complete only when its last task finishes).

Stragglers arise due to a bad disk, network bandwidth, CPU, or memory.

Keep track of the "progress" of each task (% done).

Perform backup (replicated) execution of straggler tasks: a task is considered done when its first replica completes. This is called speculative execution.

Locality
• Locality
  – The cloud has a hierarchical topology (e.g., racks)
  – GFS/HDFS stores 3 replicas of each chunk (e.g., 64 MB in size)
    • Possibly on different racks, e.g., 2 on one rack, 1 on a different rack
  – MapReduce attempts to schedule a map task on
    1. a machine that contains a replica of the corresponding input data, or failing that,
    2. the same rack as a machine containing the input, or failing that,
    3. anywhere

Implementation Overview

Implementation Overview
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment.

For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

Here we describe an implementation targeted at the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet.
Contd…
(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used, typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

Distributed Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

Figure 1 shows the overall flow of a MapReduce operation.

Distributed Execution Overview
[Figure 1: Overall flow of a MapReduce operation. (1) The user program forks the master and the workers. (2) The master assigns map and reduce tasks to workers. (3) Map workers read their input splits. (4) Map output is written to local disk. (5) Reduce workers do remote reads and sort the intermediate data. (6) Reduce workers write the output files. Pipeline: Input Files → Map phase → Intermediate Files on Disk → Reduce phase → Output Files.]

Sequence of Actions
When the user program calls the MapReduce function, the following sequence of actions occurs:
1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
Contd…
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

Contd…
6. The reduce worker iterates over the sorted intermediate data
and for each unique intermediate key encountered, it passes
the key and the corresponding set of intermediate values to
the user's Reduce function.
The output of the Reduce function is appended to a final
output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed,
the master wakes up the user program.
At this point, the MapReduce call in the user program returns
back to the user code.

Contd…
After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user).

Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.

Master Data Structures
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).

The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task.

Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.
Fault Tolerance
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

Map worker failure
– Map tasks completed or in-progress at the worker are reset to idle (completed map output lives on the failed machine's local disk, so it must be re-executed)
– Reduce workers are notified when a task is rescheduled on another worker

Reduce worker failure
– Only in-progress tasks are reset to idle (completed reduce output is already in the distributed file system)

Master failure
– The MapReduce task is aborted and the client is notified
Locality
Network bandwidth is a relatively scarce resource in the computing environment. We can conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up our cluster.

GFS divides each file into 64 MB blocks and stores several copies of each block (typically 3 copies) on different machines.

The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data).

When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
Task Granularity
The Map phase is subdivided into M pieces and the reduce
phase into R pieces.
Ideally, M and R should be much larger than the number of
worker machines.
Having each worker perform many different tasks improves
dynamic load balancing, and also speeds up recovery when a
worker fails: the many map tasks it has completed can be
spread out across all the other worker machines.
There are practical bounds on how large M and R can be, since
the master must make O(M + R) scheduling decisions and
keeps O(M * R) state in memory.
Furthermore, R is often constrained by users because the
output of each reduce task ends up in a separate output file.

Partition Function
Inputs to map tasks are created by contiguous splits of the input file.

For reduce, we need to ensure that records with the same intermediate key end up at the same worker.

The system uses a default partition function, e.g., hash(key) mod R.

It is sometimes useful to override this: e.g., hash(hostname(URL)) mod R ensures that URLs from the same host end up in the same output file (a minimal sketch follows).
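
A minimal Python sketch of that override (hashlib stands in for a stable hash that behaves identically on every machine, which a real partitioner requires; Python's built-in hash is salted per process):

    import hashlib
    from urllib.parse import urlparse

    R = 8  # number of reduce tasks

    def stable_hash(s):
        # Stable across processes/machines, unlike Python's built-in hash
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def host_partition(url):
        # hash(hostname(URL)) mod R: all URLs from one host go to one partition
        return stable_hash(urlparse(url).netloc) % R

    print(host_partition("http://example.com/a") == host_partition("http://example.com/b"))  # True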

Ordering Guarantees
It is guaranteed that within a given partition, the intermediate key/value pairs are processed in increasing key order.

This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random-access lookups by key, or when users of the output find it convenient to have the data sorted.

Combiner Function (1)
In some cases, there is significant repetition in the intermediate keys produced by each map task, and the user-specified Reduce function is commutative and associative.

A good example of this is the word counting example. Since word frequencies tend to follow a Zipf distribution, each map task will produce hundreds or thousands of records of the form <the, 1>.

All of these counts will be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.
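
For word count, the combiner can be the reduce logic itself, applied to each map task's local output; a minimal sketch (the function name is illustrative):

    def combine_fn(word, local_counts):
        # Runs on the map machine: pre-sums this task's counts so the network
        # carries one <the, 4821> instead of 4821 copies of <the, 1>
        yield (word, sum(local_counts))

This works precisely because addition is commutative and associative.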
Combiner Function (2)
The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions.

The only difference between a reduce function and a combiner function is how the MapReduce library handles the output of the function: the output of a reduce function is written to the final output file, while the output of a combiner function is written to an intermediate file that will be sent to a reduce task.

Partial combining significantly speeds up certain classes of MapReduce operations.

MapReduce Examples

Example 1: Word Count using MapReduce

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

Count Illustrated

map(key=url, val=contents):
    For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
    Sum all "1"s in the values list
    Emit result "(word, sum)"

Input documents:
    see bob run
    see spot throw

Map output:
    see 1
    bob 1
    run 1
    see 1
    spot 1
    throw 1

Reduce output:
    bob 1
    run 1
    see 2
    spot 1
    throw 1

Example 2: Counting words of different lengths
The map function takes a value and outputs key:value pairs.

For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value, then map(steve) would return 5:steve and map(savannah) would return 8:savannah.

This allows us to run the map function against values in parallel, which provides a huge advantage.

Example 2: Counting words of different lengths
Before we get to the reduce function, the MapReduce framework groups all of the values together by key. If the map functions output the following key:value pairs:

3 : the
3 : and
3 : you
4 : then
4 : what
4 : when
5 : steve
5 : where
8 : savannah
8 : research

they get grouped as:

3 : [the, and, you]
4 : [then, what, when]
5 : [steve, where]
8 : [savannah, research]
Example 2: Counting words of different lengths
Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values. In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:

3 : 3
4 : 3
5 : 2
8 : 2
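
The whole example as a runnable Python sketch (the in-memory grouping step stands in for the framework's shuffle):

    from collections import defaultdict

    def map_fn(word):
        # Key = word length, value = the word itself
        return (len(word), word)

    def reduce_fn(length, words):
        # Count how many words have this length
        return (length, len(words))

    words = ["the", "and", "you", "then", "what", "when",
             "steve", "where", "savannah", "research"]

    groups = defaultdict(list)
    for key, value in map(map_fn, words):
        groups[key].append(value)

    print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
    # [(3, 3), (4, 3), (5, 2), (8, 2)]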

Example 2: Counting words of different lengths

The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in the corpus, etc.

The most common example of MapReduce is counting the number of times words occur in a corpus.

Example 3: Finding Friends
Facebook has a list of friends (note that friends are a bi-directional thing on Facebook: if I'm your friend, you're mine). They also have lots of disk space and they serve hundreds of millions of requests every day. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature.

When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently, so it'd be wasteful to recalculate it every time you visited the profile (sure, you could use a decent caching strategy, but then we wouldn't be able to continue writing about MapReduce for this problem).

We're going to use MapReduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk; it's cheap.

Example 3: Finding Friends
Assume the friends are stored as Person->[List of Friends], our
friends list is then:

A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

Example 3: Finding Friends
For map(A -> B C D) :
(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E) : (Note that A comes before B in the key)


(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E

Example 3: Finding Friends
For map(C -> A B D E) :
(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E

For map(D -> A B C E) :
(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E

And finally for map(E -> B C D):
(B E) -> B C D
(C E) -> B C D
(D E) -> B C D

Example 3: Finding Friends
Before we send these key-value pairs to the reducers, we
group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)

Example 3: Finding Friends
Each line will be passed as an argument to a reducer. The reduce function will simply intersect the lists of values and output the same key with the result of the intersection.

For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D), which means that friends A and B have C and D as common friends.
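
The pipeline as a runnable Python sketch (the in-memory shuffle and the driver are illustrative assumptions):

    from collections import defaultdict

    friends = {"A": {"B", "C", "D"}, "B": {"A", "C", "D", "E"},
               "C": {"A", "B", "D", "E"}, "D": {"A", "B", "C", "E"},
               "E": {"B", "C", "D"}}

    def map_fn(person, their_friends):
        # Key = the sorted pair, value = this person's whole friend list
        for friend in their_friends:
            yield (tuple(sorted((person, friend))), their_friends)

    groups = defaultdict(list)
    for person, flist in friends.items():
        for key, value in map_fn(person, flist):
            groups[key].append(value)

    def reduce_fn(pair, lists):
        # Intersect the two friend lists to get the common friends
        return (pair, sorted(set.intersection(*lists)))

    print(reduce_fn(("A", "B"), groups[("A", "B")]))
    # (('A', 'B'), ['C', 'D'])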

Example 3: Finding Friends
The result after reduction is:
(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common: (A C E).

Reading
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters".
http://labs.google.com/papers/mapreduce.html

Conclusion
The MapReduce programming model has been successfully used at Google for many different purposes.

The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.

A large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and many other systems.
Conclusion

MapReduce uses parallelization + aggregation to schedule applications across clusters.

It needs to deal with failure.

There is plenty of ongoing research work in scheduling and fault-tolerance for MapReduce and Hadoop.
