Hadoop

The Apache™ Hadoop® project
History of Hadoop
(Timeline: 2003, Google publishes the GFS paper; 2004, Google publishes the MapReduce paper; 2006, Hadoop becomes an Apache project, spun out of Nutch.)
Secondary Name Node
(Figure: Name Node, Secondary Name Node and Data Nodes. The Secondary Name Node performs housekeeping on the FS image and keeps a backup of the Name Node metadata; replies sent to the Data Nodes carry control information embedded in them.)
HDFS: block-structured file system
File 1: replication factor 3, blocks [1, 2, 3]
File 2: replication factor 2, blocks [4, 5, 6]
File 3: replication factor 1, blocks [7, 8]
(Figure: the blocks are spread over the Data Nodes, each block stored on as many nodes as its file's replication factor.)
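For illustration only, a minimal Python sketch (not HDFS code) of the two mappings the table above implies: file to block list with a per-file replication factor, and block to DataNode locations. The node names and the block-to-node layout are invented, but each block appears on as many nodes as its file's replication factor.

# Illustrative sketch of NameNode-style metadata (not actual HDFS code).
files = {
    "File1": {"replication": 3, "blocks": [1, 2, 3]},
    "File2": {"replication": 2, "blocks": [4, 5, 6]},
    "File3": {"replication": 1, "blocks": [7, 8]},
}

# block id -> DataNodes currently holding a replica (hypothetical layout)
block_locations = {
    1: ["dn1", "dn2", "dn3"], 2: ["dn1", "dn3", "dn4"], 3: ["dn2", "dn3", "dn4"],
    4: ["dn1", "dn2"], 5: ["dn1", "dn4"], 6: ["dn2", "dn4"],
    7: ["dn3"], 8: ["dn2"],
}

def locate(file_name):
    """Return (block, replica locations) pairs for a file."""
    return [(b, block_locations[b]) for b in files[file_name]["blocks"]]

print(locate("File1"))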
Data Replication
● Each block of a file is replicated across a number of machines to prevent loss of data.
● Block size and the number of replicas are configurable per file.
● The Namenode receives a Heartbeat (every 3 seconds) and a BlockReport (with every 10th heartbeat) from each DataNode in the cluster.
● A BlockReport lists all the blocks on a DataNode.
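A toy Python sketch of the reporting schedule just described; the class and method names are invented, and only the 3-second heartbeat and every-10th-heartbeat BlockReport cadence come from the bullets above.

# Toy model of the DataNode reporting schedule (not HDFS code): a heartbeat is
# sent every 3 seconds, and every 10th heartbeat carries a full BlockReport.
class ToyDataNode:
    BLOCK_REPORT_EVERY = 10              # full BlockReport with every 10th heartbeat

    def __init__(self, node_id, blocks):
        self.node_id = node_id
        self.blocks = set(blocks)        # block ids currently stored on this node
        self.heartbeats_sent = 0

    def next_message(self):
        """Build the next 3-second heartbeat; every 10th one lists all blocks."""
        self.heartbeats_sent += 1
        msg = {"node": self.node_id, "type": "heartbeat"}
        if self.heartbeats_sent % self.BLOCK_REPORT_EVERY == 0:
            msg["block_report"] = sorted(self.blocks)
        return msg

dn = ToyDataNode("dn1", blocks=[1, 2, 5])
for _ in range(12):                      # 12 heartbeats = 36 seconds of simulated time
    print(dn.next_message())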
Replica Placement
The placement of replicas is critical to HDFS reliability and performance.
Optimizing replica placement distinguishes HDFS from most other distributed file systems.
Rack-aware replica placement:
Goal: improve reliability, availability and network bandwidth utilization.
A cluster spans many racks; communication between machines on different racks goes through switches.
Network bandwidth between machines on the same rack is greater than between machines on different racks.
The Namenode determines the rack id of each DataNode.
A simple policy is to place replicas on unique racks: simple, but non-optimal, because writes are expensive (every replica crosses a rack boundary).
With the default replication factor of 3, replicas are placed as follows (see the sketch below): one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
As a result, one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are distributed evenly across the remaining racks.
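A simplified Python sketch of the 3-replica placement rule described above (writer's node, another node in the same rack, a node in a different rack). This is illustrative pseudologic, not Hadoop's actual placement code, and the topology passed in is invented.

# Simplified sketch of the rack-aware placement described above (replication factor 3).
import random

def place_replicas(writer_node, topology):
    """topology: dict rack_id -> list of node names; writer_node must appear in it."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)

    first = writer_node                                           # local node
    second = random.choice([n for n in topology[local_rack] if n != first])
    remote_rack = random.choice([r for r in topology if r != local_rack])
    third = random.choice(topology[remote_rack])                  # different rack
    return [first, second, third]

topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas("dn1", topology))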
Replica Selection
For READ operations, HDFS tries to minimize bandwidth consumption and latency:
If there is a replica on the reader's own node, that replica is preferred.
An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.
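A small sketch of that preference order (reader's node, then the reader's data center, then any remote replica); the distance function and the node/data-center names are invented for illustration.

# Sketch of read-time replica selection: prefer the reader's node, then the
# reader's data center, then a remote replica. Illustrative only.
def pick_replica(reader_node, reader_dc, replicas):
    """replicas: list of (node, data_center) pairs holding the block."""
    def distance(replica):
        node, dc = replica
        if node == reader_node:
            return 0          # local replica: no network transfer
        if dc == reader_dc:
            return 1          # same data center: cheaper than cross-DC traffic
        return 2              # remote data center: last resort
    return min(replicas, key=distance)

print(pick_replica("dn7", "dc-east",
                   [("dn1", "dc-west"), ("dn4", "dc-east"), ("dn7", "dc-east")]))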
Write Mechanism: Setting Up the Pipeline
Ensure that all Data Nodes which are expected to hold a copy of the block are ready to receive it.
Pipelined Write
Acknowledgement
Multi-Block Write in Parallel
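The titles above outline the write path: a pipeline is set up among the target Data Nodes, data packets are forwarded down the pipeline, and acknowledgements travel back once every node has stored the packet. A toy Python sketch of that forwarding-and-ack idea (invented names, not the real HDFS wire protocol):

# Toy model of a pipelined write: each packet is forwarded down the chain of
# DataNodes, and an acknowledgement travels back once every node has it.
def pipeline_write(packet, pipeline, storage):
    """pipeline: ordered list of DataNode names; storage: dict node -> list of packets."""
    if not pipeline:
        return True                                    # end of pipeline: implicit ack
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(packet)        # head stores the packet...
    downstream_ack = pipeline_write(packet, rest, storage)   # ...and forwards it
    return downstream_ack                              # ack only if all downstream nodes succeeded

storage = {}
for pkt in ["pkt-0", "pkt-1", "pkt-2"]:
    assert pipeline_write(pkt, ["dn1", "dn2", "dn3"], storage)
print(storage)   # every DataNode in the pipeline holds all packets of the block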
HDFS Inside: Read
(Figure: (1) the Client asks the Name Node which Data Nodes hold the blocks of the file, (2) the Name Node returns the block locations, (3, 4) the Client reads the blocks directly from the Data Nodes.)
Disadvantages
MapReduce
Designed for big data: supported by a distributed file system (GFS / HDFS) built on local disks.
Designed for large numbers of servers: large clusters of commodity machines.
MapReduce works in a reliable, fault-tolerant manner, and also supports locality optimization and load balancing.
The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication.
MapReduce: A Real World Analogy
Coins Deposit (figure: a jar of mixed coins is deposited; the coins are first separated by denomination and each pile is then counted, mirroring the map and reduce steps).
Map
Perform a function on individual values in a data set to create a new list of
values
Example: square x = x * x
map square [1,2,3,4,5]
returns [1,4,9,16,25]
Reduce
Combine values in a data set to create a new value
Example: sum: accumulate each element into a running total (total += element)
reduce sum [1,2,3,4,5]
returns 15 (the sum of the elements)
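The same two examples in Python, using the built-in map and functools.reduce:

# The square/sum examples above, written with Python's map and reduce.
from functools import reduce

def square(x):
    return x * x

values = [1, 2, 3, 4, 5]
print(list(map(square, values)))                       # [1, 4, 9, 16, 25]
print(reduce(lambda total, x: total + x, values, 0))   # 15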
Partitions
▪ In MapReduce, intermediate output values are not all reduced together: they are partitioned by key, so that all values with the same key are sent to the same Reducer (see the sketch below).
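A toy sketch of that partitioning idea: each intermediate (key, value) pair is routed to reducer number hash(key) mod R, so all values for a key meet at one Reducer. Hadoop's default HashPartitioner does the equivalent in Java; the toy hash function below is only for a deterministic demo.

# Toy partitioner: route each (key, value) pair to hash(key) mod R.
NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    return sum(ord(c) for c in key) % num_reducers   # toy stable hash for the demo

intermediate = [("Beer", 1), ("River", 1), ("Beer", 1), ("Car", 1)]
buckets = {r: [] for r in range(NUM_REDUCERS)}
for key, value in intermediate:
    buckets[partition(key)].append((key, value))     # same key -> same reducer
print(buckets)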
Map Phase
▪ In a MapReduce job, independent data chunks are processed in parallel by Map tasks, called Mappers (M0, M1, M2, M3 in the figure).
▪ The outputs from the Mappers are denoted as intermediate outputs (IO0, IO1, IO2, IO3).
Reduce Phase
▪ The intermediate outputs are brought into a second set of tasks called Reducers (R0, R1); the process of bringing the IOs together at the Reducers is known as shuffling.
▪ The Reducers produce the final outputs (FO0, FO1).
(Figure: Hadoop MapReduce architecture. The Client submits a Job to the JobTracker; the JobTracker hands Map and Reduce tasks to the TaskTrackers; inputs and outputs are stored in HDFS, whose metadata is managed by the Name Node.)
Word Count data flow (figure):
Input lines (e.g. "Deer Beer River") are split among the Mappers.
Map: emit (word, 1) for every word: (Deer, 1), (Beer, 1), (River, 1), (Car, 1), ...
Shuffle/Sort: group the pairs by word, e.g. Beer: [1, 1], River: [1, 1].
Reduce: sum each group: (Beer, 2), (River, 2), (Car, 3), ...
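A self-contained Python simulation of that word-count flow (map, shuffle by key, reduce); the input lines are invented to roughly match the figure, and run_word_count stands in for the framework.

# In-memory simulation of the word-count data flow above. Not Hadoop code.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield word, 1                        # emit (word, 1) per occurrence

def reduce_fn(word, counts):
    return word, sum(counts)                 # sum the 1s for each word

def run_word_count(lines):
    groups = defaultdict(list)               # shuffle: bring each word's 1s together
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    return dict(reduce_fn(w, cs) for w, cs in groups.items())

print(run_word_count(["Deer Beer River", "Car Car River", "Deer Car Beer"]))
# {'Deer': 2, 'Beer': 2, 'River': 2, 'Car': 3}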
Max Temperature: Mapper and Reducer (the Combiner is the same as the Reducer)

Map(key, value) {
  // key: line number
  // value: tuples in a line
  for each tuple t in value:
    Emit(t->year, t->temperature);
}

Reduce(key, list of values) {
  // key: year
  // list of values: a list of monthly temperatures
  int max_temp = -100;
  for each v in values:
    max_temp = max(v, max_temp);
  Emit(key, max_temp);
}
MapReduce Example: Max Temperature
Input (split between two Mappers):
  Mapper 1: (200707, 100), (200706, 90), (200508, 90)
  Mapper 2: (200607, 100), (200708, 80), (200606, 80)
Map output:
  Mapper 1: (2007, 100), (2007, 90), (2005, 90)
  Mapper 2: (2006, 100), (2007, 80), (2006, 80)
Combine → Shuffle/Sort → Reduce
Output: (2007, 87.5), (2005, 90), (2006, 90)
Note: with max as the aggregate, the output for 2007 would be (2007, 100). The (2007, 87.5) shown corresponds to averaging instead: Mapper 1's combined value is 95 and Mapper 2's is 80, and averaging those again in the Reducer gives 87.5 rather than the true mean of 90. A Combiner can only safely reuse the Reducer for associative, commutative aggregates such as max or sum.
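A runnable Python sketch of the max-temperature mapper and reducer above, with the Combiner applied to each Mapper's local output before the shuffle; the input splits and the map_fn/combine_or_reduce names are illustrative.

# Runnable sketch of the max-temperature job: map extracts (year, temperature),
# a per-mapper combiner takes the local max, and the reducer takes the global max.
from collections import defaultdict

def map_fn(records):                                # records: list of (yyyymm, temperature)
    for yyyymm, temp in records:
        yield yyyymm // 100, temp                   # key = year

def combine_or_reduce(pairs):                       # max is associative, so the combiner
    best = defaultdict(lambda: float("-inf"))       # can safely reuse the reducer logic
    for year, temp in pairs:
        best[year] = max(best[year], temp)
    return list(best.items())

splits = [
    [(200707, 100), (200706, 90), (200508, 90)],    # mapper 1's input split
    [(200607, 100), (200708, 80), (200606, 80)],    # mapper 2's input split
]
combined = [pair for split in splits for pair in combine_or_reduce(map_fn(split))]
print(dict(combine_or_reduce(combined)))            # {2007: 100, 2005: 90, 2006: 100}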
Example 2: Combiner
Map output:
  (A, B) -> C, D
  (A, C) -> B
  (A, D) -> ..
  ...
Mapper and Reducer of Common Friends

Map(key, value) {
  // key: person_id
  // value: the list of friends of the person
  for each friend f_id in value:
    Emit(<person_id, f_id>, value);
}

Reduce(key, list of values) {
  // key: <friend pair>
  // list of values: the friend lists associated with the friend pair
  for the pair of lists v1, v2 in values:
    common_friends = v1 intersects v2;
  Emit(key, common_friends);
}
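A runnable Python sketch of the common-friends job. One detail added beyond the pseudocode is sorting each emitted pair so that (A, B) and (B, A) arrive at the same Reducer; the sample friends data is invented.

# Runnable sketch of the common-friends job above.
from collections import defaultdict

friends = {                              # invented sample data: person -> friends
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C", "E"],
    "E": ["D"],
}

def map_fn(person, friend_list):
    for f in friend_list:
        yield tuple(sorted((person, f))), set(friend_list)   # sorted pair as key

def reduce_fn(pair, friend_lists):
    common = set.intersection(*friend_lists)   # friends shared by both people
    return pair, sorted(common)

groups = defaultdict(list)                     # shuffle: group by friend pair
for person, friend_list in friends.items():
    for pair, fset in map_fn(person, friend_list):
        groups[pair].append(fset)

for pair, lists in sorted(groups.items()):
    print(reduce_fn(pair, lists))
# e.g. (('A', 'B'), ['C', 'D']), (('A', 'C'), ['B', 'D']), ...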
MapReduce Problems Discussion
Problem 3: Find Common Friends (mapper and reducer as above)
From Intuition to Algorithm
Multiple Iterations Needed: each MapReduce pass advances the breadth-first search frontier by one hop, so the job is iterated until no distance changes.
Shortest Path Problem
Shortest path problem: mapper
Shortest path problem: reducer
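A minimal Python sketch of the parallel-BFS mapper and reducer, using the <node, <distance, adjacency list>> records from the example that follows. The driver and the names map_fn, reduce_fn and bfs_iteration are invented; one pass relaxes every edge once, and the job is repeated until distances stop changing.

# One parallel-BFS / SSSP iteration in MapReduce style. Each record is
# node -> (distance, adjacency list). The mapper re-emits the node's structure
# and a candidate distance for every neighbour; the reducer keeps the minimum
# distance and re-attaches the adjacency list.
from collections import defaultdict

INF = float("inf")

def map_fn(node, record):
    dist, adj = record
    yield node, record                       # pass the graph structure along
    for neighbour, weight in adj:
        yield neighbour, dist + weight       # candidate distance via this node

def reduce_fn(node, values):
    best, adj = INF, []
    for v in values:
        if isinstance(v, tuple):             # the node's own (dist, adjacency) record
            best = min(best, v[0])
            adj = v[1]
        else:                                # a candidate distance
            best = min(best, v)
    return node, (best, adj)

def bfs_iteration(graph):
    groups = defaultdict(list)               # shuffle: group by destination node
    for node, record in graph.items():
        for k, v in map_fn(node, record):
            groups[k].append(v)
    return dict(reduce_fn(n, vs) for n, vs in groups.items())

# The example graph below: A is the source (distance 0), all others start at infinity.
graph = {
    "A": (0,   [("B", 10), ("D", 5)]),
    "B": (INF, [("C", 1), ("D", 2)]),
    "C": (INF, [("E", 4)]),
    "D": (INF, [("B", 3), ("C", 9), ("E", 2)]),
    "E": (INF, [("A", 7), ("C", 6)]),
}
for _ in range(4):                           # enough iterations for this graph
    graph = bfs_iteration(graph)
print({n: d for n, (d, _) in graph.items()})  # {'A': 0, 'B': 8, 'D': 5, 'C': 9, 'E': 7}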
Visualizing Parallel BFS
(Figure: an example graph with nodes n0 through n9, used to visualize the expanding BFS frontier.)
Example: SSSP – Parallel BFS in MapReduce
● Adjacency matrix / adjacency list for the example graph (source A at distance 0, all other nodes at ∞):
  A: (B, 10), (D, 5)
  B: (C, 1), (D, 2)
  C: (E, 4)
  D: (B, 3), (C, 9), (E, 2)
  E: (A, 7), (C, 6)
● Map output: <dest node ID, dist> pairs plus each node's own <node, <dist, adjacency list>> record:
  from A (0): <B, 10>, <D, 5>, <A, <0, <(B, 10), (D, 5)>>>
  from B (∞): <C, inf>, <D, inf>, <B, <inf, <(C, 1), (D, 2)>>>
  from C (∞): <E, inf>, <C, <inf, <(E, 4)>>>
  from D (∞): <B, inf>, <C, inf>, <E, inf>, <D, <inf, <(B, 3), (C, 9), (E, 2)>>>
  from E (∞): <A, inf>, <C, inf>, <E, <inf, <(A, 7), (C, 6)>>>
  Flushed to local disk!!
Example: SSSP – Parallel BFS in MapReduce (continued)
● Reduce input, grouped by node ID:
  A: <A, <0, <(B, 10), (D, 5)>>>, <A, inf>
  B: <B, <inf, <(C, 1), (D, 2)>>>, <B, 10>, <B, inf>
  D: <D, <inf, <(B, 3), (C, 9), (E, 2)>>>, <D, 5>, <D, inf>
  (similarly for C and E)
● Each reducer keeps the minimum distance seen for its node and re-attaches the adjacency list, so after the first iteration B = 10 and D = 5.
● Map output of the next iteration:
  from A (0): <B, 10>, <D, 5>, <A, <0, <(B, 10), (D, 5)>>>
  from B (10): <C, 11>, <D, 12>, <B, <10, <(C, 1), (D, 2)>>>
  from C (∞): <E, inf>, <C, <inf, <(E, 4)>>>
  from D (5): <B, 8>, <C, 14>, <E, 7>, <D, <5, <(B, 3), (C, 9), (E, 2)>>>
  from E (∞): <A, inf>, <C, inf>, <E, <inf, <(A, 7), (C, 6)>>>
  Flushed to local disk!!
Example: SSSP – Parallel BFS in MapReduce (second reduce)
● Reduce input for B: <B, <10, <(C, 1), (D, 2)>>>, <B, 10>, <B, 8> → B's distance improves to 8
● Reduce input for D: <D, <5, <(B, 3), (C, 9), (E, 2)>>>, <D, 5>, <D, 12> → D stays at 5
Comparison to Dijkstra
Dijkstra's algorithm keeps a global priority queue and settles one node at a time; parallel BFS in MapReduce has no such global state, so each iteration processes every node and may recompute distances, trading extra work for parallelism across the cluster.
(Figure: an example web graph with pages such as en.wikipedia.org and www.nytimes.com.)
PageRank: Formula
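For reference, the standard PageRank update, with damping factor d (commonly 0.85), N the total number of pages, In(p) the pages linking to p, and L(q) the number of outlinks of page q:

PR(p) = \frac{1 - d}{N} + d \sum_{q \in In(p)} \frac{PR(q)}{L(q)}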
■ Consequences of insights:
□ We can map each row of 'current' to a list of PageRank "fragments" to assign to linkees
□ These fragments can be reduced into a single PageRank value for a page by summing
□ Reduce step: add the fragments together into the next PageRank value
□ The graph representation can be even more compact: since each element is simply 0 or 1, only transmit the column numbers where it is 1
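A minimal Python sketch of the map/reduce decomposition those bullets describe: the mapper splits each page's current rank into equal fragments for its linkees, and the reducer sums the fragments arriving at each page and applies the damping term from the formula above. The damping factor, the toy graph and the iteration count are invented for the demo.

# One PageRank iteration in MapReduce style: map emits rank fragments to linkees,
# reduce sums the fragments for each page. Illustrative only.
from collections import defaultdict

D = 0.85                                          # damping factor

def map_fn(page, rank, outlinks):
    for target in outlinks:                       # one rank "fragment" per outgoing link
        yield target, rank / len(outlinks)

def reduce_fn(page, fragments, num_pages):
    return page, (1 - D) / num_pages + D * sum(fragments)

def iterate(ranks, links):
    groups = defaultdict(list)                    # shuffle: fragments grouped by target page
    for page, outlinks in links.items():
        for target, fragment in map_fn(page, ranks[page], outlinks):
            groups[target].append(fragment)
    return dict(reduce_fn(p, groups.get(p, []), len(links)) for p in links)

# Invented toy graph, loosely inspired by the example pages on the earlier slide.
links = {"en.wikipedia.org": ["www.nytimes.com"],
         "www.nytimes.com": ["en.wikipedia.org", "blog.example"],
         "blog.example": ["en.wikipedia.org"]}
ranks = {p: 1.0 / len(links) for p in links}
for _ in range(20):
    ranks = iterate(ranks, links)
print({p: round(r, 3) for p, r in ranks.items()})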
Choosing M and R
The map phase has M pieces and the reduce phase has R pieces.
M and R should be much larger than the number of worker machines: having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails.
The larger M and R are, the more decisions the master must make: it schedules O(M + R) tasks and keeps O(M * R) states.
(For scale, the original MapReduce paper describes jobs with M = 200,000 and R = 5,000 on about 2,000 worker machines.)