Hadoop

The Apache Hadoop project develops open-source software for distributed computing, enabling the processing of large data sets across clusters of computers. Originating from Google's MapReduce framework, Hadoop has evolved to include various components like Hive for data warehousing and YARN for resource management, supporting scalability and fault tolerance. The ecosystem consists of over 100 libraries, facilitating a range of applications from batch processing to real-time analytics.


1


2 The Apache™ Hadoop® project

 Develops open-source software for reliable, scalable, distributed computing
 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
 It scales up from single servers to thousands of machines, each offering local computation and storage
 Instead of focusing on hardware-based high availability, the Hadoop software detects and handles failures at the application layer, delivering high availability on commodity clusters that are prone to failures
Hadoop’s Developers

 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project (photo: Doug Cutting)
 The project was funded by Yahoo!
 2006: Yahoo! gave the project to the Apache Software Foundation
Google Origins (figure: timeline of the Google publications from 2003, 2004, and 2006 that preceded Hadoop)
5 History of Hadoop

 In 2004, Google developed the MapReduce programming framework on top of the Google File System (GFS) for indexing websites
 In 2005, Yahoo! released an open-source framework based on MapReduce, called Hadoop
 In the following years, more software libraries providing different capabilities, such as Hive and HBase, were added to Hadoop
 There are now over 100 open-source libraries for Hadoop, and the number is growing
6 History of Hadoop

 The focus is on supporting redundancy, distributed architectures, and parallel processing
 Completely written in Java
 In 2009, Yahoo! used Hadoop to sort 1 TB of data in only 62 seconds
 The software libraries are organized in a layered (sometimes also called stacked) package
7 History of Hadoop

 Applications moved from “running” on the Hadoop DFS to “running” in the Hadoop operating system, YARN
8 Hadoop Distributions (figure)
9 Hadoop Distributions
Distributions are commercially packaged and supported editions of open-source Apache Hadoop-related projects.
Distributions provide access to applications, query/reporting tools, machine learning, and data management infrastructure components.
10 Layer diagram for Apache Hadoop 2.x (figure)
11 Ecosystem
Hadoop Common – contains libraries and utilities needed by other Hadoop modules
12 Hadoop Main Components
13
14 Hive
 In the Hadoop ecosystem, Hive also plays a vital role as an analytics tool.
 It can be seen as a data warehouse package mounted on the top layer of the Hadoop ecosystem for managing and processing the huge amounts of data generated.
 It features a simple SQL-like interface, so users don’t need to bother with complex MapReduce programs.
 Hive can also access data stored in HBase. It can be summarized as:
✓ Data warehouse infrastructure
✓ Definer of a query language popularly known as HiveQL (similar to SQL)
✓ Provider of various tools for easy extraction, transformation, and loading of data
✓ Hive allows its users to embed customized mappers and reducers.
15 Pig
 Pig is one of the popular big data tools; it provides an abstraction over Hadoop’s MapReduce.
 It is a high-level platform that creates programs that run on Apache Hadoop.
 Pig analyses large datasets, expressed in a high-level language, and turns them into data flows.
 It is one of the key components of the Hadoop ecosystem, similar to Hive.
 It makes it easy to write small programs on a high-level data flow system and uses Pig Latin to express queries.
 It supports a multi-query approach, is easy to read and write, is as simple to understand as SQL, and is equipped with nested data types like maps and tuples that are not featured in Hadoop’s MapReduce.
16
 Storm:
17  This analytics tool is widely used for reliable, real-time analysis of unbounded streaming data, and applications for it can be developed in any programming language.
 Its processing speed is millions of tuples per second per node. It is a free, open-source, fault-tolerant, real-time computation system.
 Spark:
 An analytics tool for streaming, predictive, and graph computational analytics.
 An open-source engine.
 Its processing speed is measured at up to 100 times faster than MapReduce, because it allows immediate data reuse in memory.
 Many research organizations use Hadoop and Spark together, although Spark was designed as an alternative to Hadoop.
 In fact the two are complementary big data frameworks; together they make an extremely powerful machine, from batch processing over HDFS to streaming and in-memory distributed processing in real time.
18
19 HBase:
 It is also an open-source tool.
 It is a distributed, non-relational database, developed in the open-source environment after Google’s Bigtable.
 Its core is written in Java, it is developed under the Apache Software Foundation, and it is natively integrated with Hadoop.
 Mapping is performed on the large data, splitting it into a number of datasets, each of which is slated into n-tuples.
 The reduction of the output is performed as a separate task, after which all of the pieces are merged back into the original one.
20
21
22 HDFS: Assumptions
1. Hardware failures are common
2. Files are large
3. Two main types of reads: large streaming reads and small random reads
4. Sequential writes that append to files
5. Files are rarely modified
6. High sustained bandwidth (throughput) is preferred over low latency
23 HDFS: Basic Features
 Highly fault-tolerant
 Thousands of server machines
 Failure is the norm rather than the exception
 High throughput
 Internet-scale workloads
 Move compute to data (e.g., MapReduce)
 Suitable for applications with large data sets
 Streaming access to file system data
 Can be built out of commodity hardware
 Scales to thousands of nodes in a cluster
 Write-once-read-many: a file, once created, rarely needs to be changed
24 Physical Cluster Organization (figure by Brad Hedlund)

26 Hadoop Cluster at Yahoo! (figure)
Architecture Overview
28 Namenode
● Master/slave architecture
● HDFS cluster consists of a single Namenode, a master server
that manages the file system namespace and regulates
access to files by clients.
• Transaction log for file deletes/adds, etc. Does not use
transactions for whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary
after a DataNode failure
29 Datanodes
● There are a number of DataNodes, usually one per node in a cluster.
● The DataNodes manage storage attached to the nodes that they run on.
● DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.
30 File system Namespace
 Hierarchical file system with directories and files
 Create, remove, move, rename etc.
 Namenode maintains the file system
 Any meta information changes to the file system
recorded by the Namenode (fsImage and editLog)
 An application can specify the number of replicas of
the file needed: replication factor of the file. This
information is stored in the Namenode.
Secondary Name Node (figure): the Name Node periodically ships its FS image and edit log to the Secondary Name Node, which performs housekeeping (merging the edit log into the FS image) and keeps a backup of the NN metadata; Data Nodes reply to the Name Node with control information embedded.

32 Secondary Name Node
HDFS: a block-structured file system

 HDFS is designed to store very large files across machines in a large cluster.
 A file can be made of several blocks, which are stored across a cluster of machines with data storage capacity. Each file is a sequence of blocks.
 All blocks in the file except the last are of the same size.
 In a nutshell, HDFS is a block-structured file system: files are broken into blocks of 128 MB (per-file configurable; see the sketch below).
 Fewer block-access requests
 Minimized overhead: disk seek time is almost constant
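As a hedged illustration (not from the slides), a client can control these settings through Hadoop's Java FileSystem API; the path and property values below are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    conf.set("dfs.blocksize", "134217728");     // 128 MB blocks (per-file configurable)
    conf.set("dfs.replication", "3");           // replication factor for new files

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");  // hypothetical path

    // Write a file; as it grows it is split into 128 MB blocks.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello HDFS");
    }

    // The replication factor can also be changed later, per file.
    fs.setReplication(file, (short) 2);
    fs.close();
  }
}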
34
35 Replication for Fault Tolerance
HDFS Inside: Name Node (figure)
The Name Node keeps the FS image (a snapshot of the file system) and the edit log (a record of changes to the FS), plus the block map, e.g.:

Filename | Replication factor | Block IDs
File 1   | 3                  | [1, 2, 3]
File 2   | 2                  | [4, 5, 6]
File 3   | 1                  | [7, 8]

The blocks themselves are spread over the Data Nodes according to each file’s replication factor.
37 Data Replication
● Each block of a file is replicated across a number of machines, to prevent loss of data.
● Block size and number of replicas are configurable per file.
● The Namenode receives a Heartbeat (every 3 seconds) and a BlockReport (with every 10th heartbeat) from each DataNode in the cluster.
● A BlockReport lists all the blocks on a Datanode.
38 Replica Placement
 The placement of the replicas is critical to HDFS reliability and performance.
 Optimizing replica placement distinguishes HDFS from other distributed file systems.
 Rack-aware replica placement:
 Goal: improve reliability, availability, and network bandwidth utilization
 There are many racks; communication between racks goes through switches.
 Network bandwidth between machines on the same rack is greater than between machines on different racks.
 The Namenode determines the rack id of each DataNode.
 Replicas are typically placed on unique racks
 Simple but non-optimal
 Writes are expensive
 With a replication factor of 3, replicas are placed: one on a node in a local rack, one on a different node in the local rack, and one on a node in a different rack.
 One third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third is distributed evenly across the remaining racks.
39 Replica Selection
 Replica selection for READ operation: HDFS tries to minimize
the bandwidth consumption and latency.
 If there is a replica on the Reader node then that is preferred.
 HDFS cluster may span multiple data centers: replica in the
local data center is preferred over the remote one.
Write Mechanism: Setting Pipeline
40 Ensure that all Data Nodes which are expected to have a
copy of this block are ready to receive it
41 Pipelined Write
42 Acknowledgement
43 Multi-Block Write in Parallel
HDFS Inside: Read (figure: client, Name Node, and Data Nodes DN1 … DNn, steps 1–4)

1. The client connects to the NN to read data
2. The NN tells the client where to find the data blocks
3. The client reads blocks directly from the data nodes (without going through the NN; see the sketch below)
4. In case of node failures, the client connects to another node that serves the missing block
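A minimal client-side read, sketched with Hadoop's Java FileSystem API (the path is hypothetical); the block lookups against the Name Node and the direct Data Node reads happen inside the returned stream.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Steps 1-2: the client asks the NameNode for the block locations of this file.
    // Step 3: the returned stream then reads the blocks directly from DataNodes.
    Path file = new Path("/data/example.txt");   // hypothetical path
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream file contents to stdout
    }
    fs.close();
  }
}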
HDFS Inside: Read

• Q: Why does HDFS choose such a design for read? Why


not ask client to read blocks through NN?
• Reasons:
• Prevent NN from being the bottleneck of the cluster
• Allow HDFS to scale to large number of concurrent clients
• Spread the data traffic across the cluster
HDFS Inside: Read

• Q: Given multiple replicas of the same block, how does


NN decide which replica the client should read?
• HDFS Solution:
• Rack awareness based on network topology
47 Read Example
HDFS Disadvantages
(figure: a multiple-rack cluster with switches, the Name Node (NN), Secondary Name Node (SNN), and Data Nodes (DN) on Rack 1 … Rack N; the NN — “do not ask me, I am down” — is a single point of failure)


49 HDFS Summary
50 YARN (Yet Another Resource
Negotiator)
 Provides flexible resource management for Hadoop cluster
 The fundamental idea of YARN is to split up the
functionalities of resource management and job
scheduling/monitoring into separate daemons
 to have a global Resource Manager (RM) and per-
application Application Master (AM). An application is
either a single job or a DAG of jobs.
 Improves resource efficiency
 For large volume data processing, it is quite necessary to
manage the available resources properly so that every
application can leverage them.
 Through its various components, it can dynamically allocate
various resources and schedule the application processing
51 Hadoop 1.0 to Hadoop 2.0 (figure)
52 MapReduce on Hadoop 1.0 (figure)
53 Disadvantages (figure)
54
 YARN evolved to be known as a large-scale distributed operating system used for Big Data processing
 The YARN architecture basically separates the resource management layer from the processing layer
 It extends Hadoop to enable multiple frameworks such as MapReduce, Giraph, Spark, and Flink
 In Hadoop 2.x, YARN provides a standard framework that supports customized application development
 Giraph for graph analytics
 Storm for streaming applications
 Spark for in-memory applications
 Flink for streaming data-flow applications
 etc.
55 YARN Features

 Scalability: The scheduler in Resource manager of YARN


architecture allows Hadoop to extend and manage
thousands of nodes and clusters.
 Compatibility: YARN supports the existing map-reduce
applications without disruptions thus making it compatible
with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports Dynamic utilization of cluster
in Hadoop, which enables optimized Cluster Utilization.
 Multi-tenancy: It allows multiple engine access thus giving
organizations a benefit of multi-tenancy.
56 YARN Architecture (figure)

57 Main components of YARN Architecture
Client: submits jobs.
Container: a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked via a Container Launch Context (CLC), a record that contains information such as environment variables, security tokens, dependencies, etc.
58 Main components of YARN Architecture
Node Manager:
 It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node.
 Its primary job is to keep up with the Resource Manager.
 It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager.
 It is also responsible for creating the container process and starting it at the request of the Application Master.
Application Master:
 An application is a single job submitted to a framework.
 The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application.
 The Application Master requests a container from the Node Manager by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
59 Main components of YARN Architecture
Resource Manager: the master daemon of YARN, responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding Node Manager and allocates resources for the completion of the request accordingly. It has two major components:
Scheduler: performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
Application Manager: responsible for accepting the application and negotiating the first container for the application-specific Application Master. It also restarts the Application Master container if a task fails.
60 Application Workflow in YARN

1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager to monitor the application’s status (see the client-side sketch below)
8. Once the processing is complete, the Application Master un-registers with the Resource Manager
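As a small, hedged illustration of steps 1 and 7 from the client's point of view (not an Application Master implementation), the YarnClient API can be used to talk to the Resource Manager and poll cluster and application status.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnStatusExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // picks up yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Nodes managed by the Resource Manager (one Node Manager per node).
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId() + " capability=" + node.getCapability());
    }

    // Step 7 of the workflow: the client polls the RM for application status.
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState()
          + " progress=" + app.getProgress());
    }
    yarnClient.stop();
  }
}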
61 HDFS & YARN on Hadoop 2.0 (figure)
62 YARN: more functionalities
 YARN supports the notion of resource reservation via the Reservation System
 a component that allows users to specify a profile of resources over time and temporal constraints (e.g., deadlines), and to reserve resources to ensure the predictable execution of important jobs
 The Reservation System tracks resources over time, performs admission control for reservations, and dynamically instructs the underlying scheduler to ensure that the reservation is fulfilled

 In order to scale YARN beyond a few thousand nodes, YARN supports the notion of Federation via the YARN Federation feature
 Federation allows multiple YARN (sub-)clusters to be transparently wired together and made to appear as a single massive cluster
 This can be used to achieve larger scale, and/or to allow multiple independent clusters to be used together for very large jobs, or for tenants who have capacity across all of them.
63 MapReduce Paradigm
 Programming model developed at Google
 Initially, it was intended for their internal search/indexing application,
but now used extensively by more organizations (e.g., Yahoo,
Amazon.com, IBM, etc.)
 Sort/merge based distributed computing
 It is functional style programming (e.g., LISP) that is naturally
parallelizable across a large cluster of workstations or PCs.
 MapReduce programming model
 Data abstraction: KeyValue pairs
 Computation pattern: Map tasks and Reduce tasks
64 Features of the MapReduce Model
➢ It was developed for writing applications that process huge amounts of data in a parallel manner on large clusters of commodity machines.

 Designed for big data: supported by a local-disk-based distributed file system (GFS / HDFS)
 Designed for large numbers of servers: large clusters of commodity machines
 MapReduce works in a reliable, fault-tolerant manner, and also supports locality optimization and load balancing
 The underlying system takes care of the partitioning of the input data, scheduling the program’s execution across several machines, handling machine failures, and managing the required inter-machine communication.
MapReduce: A Real World Analogy — Coins Deposit (figure: “?”)

MapReduce: A Real World Analogy — Coins Deposit (figure: a coins counting machine)

MapReduce: A Real World Analogy — Coins Deposit
Distribute the coins to multiple machines:
Mapper: on each machine, categorize coins by their face values (key/value pairs, e.g. <10, 45>)
Reducer: count the coins of each face value from all mappers
Master-Slave framework
68
➢ MapReduce framework consists of a single master JobTracker and
several slave TaskTrackers
➢ The JobTracker runs on Master Node whereas TaskTrackers run on
Slave Nodes.
➢ Client applications submit MapReduce jobs to JobTracker.

➢ The JobTracker schedules MapReduce tasks out to available


TaskTracker nodes in the cluster, striving to keep the work as close to
the data as possible.
➢ TaskTrackers run the tasks and report the status of task to JobTracker.

➢ The master JobTracker is also responsible for monitoring the


component tasks of the job and re-executing the failed tasks.

▪ MapReduce adopts a pull scheduling strategy rather than


a push one
▪ I.e., JT does not push map and reduce tasks to TTs but rather TTs
pull them by making pertaining requests
MapReduce Master-Slave framework
69
HDFS & MapReduce Architecture
70
71 Programming Concept
▪ The programmer in MapReduce has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program.
▪ In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs.
▪ The map and reduce functions receive and emit (K, V) pairs:

Input Splits (K, V) pairs → Map Function → Intermediate Outputs (K', V') pairs → Reduce Function → Final Outputs (K'', V'') pairs
72 Programming Concept

 Map
 Perform a function on individual values in a data set to create a new list of values
 Example: square x = x * x
map square [1,2,3,4,5]
returns [1,4,9,16,25]
 Reduce
 Combine the values in a data set to create a new value
 Example: sum (for each elem in the list, total += elem)
reduce (+) [1,2,3,4,5]
returns 15 (the sum of the elements)
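The same map/reduce flavor can be seen in ordinary Java streams; this small, illustrative snippet mirrors the square and sum examples above.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceFlavor {
  public static void main(String[] args) {
    List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

    // "map square [1,2,3,4,5]" -> [1, 4, 9, 16, 25]
    List<Integer> squared = input.stream().map(x -> x * x).collect(Collectors.toList());
    System.out.println(squared);            // [1, 4, 9, 16, 25]

    // "reduce (+) [1,2,3,4,5]" -> 15
    int sum = input.stream().reduce(0, Integer::sum);
    System.out.println(sum);                // 15
  }
}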
73 Partitions
▪ In MapReduce, intermediate output values are not usually reduced together.
▪ All values with the same key are presented to a single Reducer together.
▪ More specifically, a different subset of the intermediate key space is assigned to each Reducer.
▪ These subsets are known as partitions.
(figure: different colors represent different keys, potentially from different Mappers; partitions are the input to Reducers)


74 MapReduce (figure: chunks C0–C3 → Mappers M0–M3 → intermediate outputs IO0–IO3 → shuffling → Reducers R0, R1 → final outputs FO0, FO1)
▪ When an input dataset is provided to a MapReduce job, independent data chunks are processed in parallel by the Map tasks, called Mappers.
▪ The outputs from the mappers are denoted intermediate outputs (IOs) and are brought into a second set of tasks called Reducers.
▪ The process of bringing the IOs together into a set of Reducers is known as the shuffling process.
▪ The Reducers produce the final outputs (FOs).
▪ Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase.
Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers.
75 MapReduce Architecture: Master-Slaves (figure: Job Client, Job Tracker, Task Trackers running Map and Reduce tasks, Name Node, and HDFS holding inputs and outputs)

Job Client: submits jobs (a Job = MapReduce functions + configuration)
Job Tracker: coordinates jobs (scheduling, phase coordination, etc.)
Task Trackers: execute jobs
MapReduce Architecture: Workflow (figure)
1. The client submits the job to the Job Tracker and copies the code to HDFS
2. The Job Tracker talks to the NN to find the data it needs
3. The Job Tracker creates an execution plan and submits work to the Task Trackers
4. The Task Trackers do the work and report progress/status to the Job Tracker
5. The Job Tracker manages the task phases
6. The Job Tracker finishes the job and updates the status
MapReduce: example
 Consider the problem of counting the number of
occurrences of each word in a large collection of
documents
 How would you do it in parallel ?
 Solution:
 Divide documents among workers
 Each worker parses document to find all words, outputs (word,
count) pairs (where count=1)
 Partition (word, count) pairs across workers based on word
 For each word at a worker, locally add up counts
 Add all counts for each word
MapReduce Example: Word Count (figure)
Input: “Deer Beer River”, “Car Car River”, “Deer Car Beer”
Split: each line goes to its own mapper
Map: emit (word, 1) for every word, e.g. (Deer, 1), (Beer, 1), (River, 1)
Shuffle/Sort: group the pairs by key, e.g. (Beer, [1, 1]), (Car, [1, 1, 1]), (Deer, [1, 1]), (River, [1, 1])
Reduce: sum the counts per key: (Beer, 2), (Car, 3), (Deer, 2), (River, 2)
Output: Beer 2, Car 3, Deer 2, River 2

A similar flavor to the coins deposit analogy? ☺
MapReduce Example: Word Count (same figure)

Q: What are the key and value pairs of Map and Reduce?
Map: key = word, value = 1
Reduce: key = word, value = aggregated count
Mapper and Reducer of Word Count
 Map(key, value){
// key: line number
// value: words in a line
for each word w in value:
Emit(w, "1");}
 Reduce(key, list of values){
// key: a word
// list of values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(key, result);}
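Written against Hadoop's Java org.apache.hadoop.mapreduce API, the same mapper and reducer look roughly like the sketch below (essentially the standard WordCount example; the input and output paths are taken from the command line). Note how the reducer class is also registered as the combiner, anticipating the combiner discussion later.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // key: byte offset of the line; value: the line itself
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // Emit(w, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                    // aggregate counts for this word
      }
      result.set(sum);
      context.write(key, result);          // Emit(word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner = same code as the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}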
MapReduce Example: Word Count (same figure)

Q: Do you see any place where we can improve the efficiency?
Local aggregation at the mapper can improve MapReduce efficiency.
MapReduce: Combiner
 Combiner: do a local aggregation/combine task at the mapper
(figure: e.g. a mapper’s output (Car, 1), (Car, 1), (Car, 1), (River, 1) becomes (Car, 3), (River, 1) after local combining)

 Q: What are the benefits of using a combiner?
 Reduces the memory/disk requirements of Map tasks
 Reduces network traffic

 Q: Can we remove the reduce function?
 No, the reducer still needs to process records with the same key coming from different mappers

 Q: How would you implement the combiner?
 It is the same as the Reducer!
84 Partitioner and Combiner

 Partitioning function: The users of MapReduce specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (e.g., hash(key) mod R). In some cases it may be useful to partition data by some other function of the key; the user of the MapReduce library can provide a special partitioning function (see the sketch below).
 Combiner function: The user can specify a Combiner function that does partial merging of the intermediate local-disk data before it is sent over the network. The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions.
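A custom partitioning function can be sketched in Java as below; the default HashPartitioner already implements hash(key) mod R, and routing by first letter here is only an illustrative alternative, not a recommendation.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each word to a reducer based on its first letter (illustrative only).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    if (word.isEmpty()) {
      return 0;
    }
    char first = Character.toLowerCase(word.charAt(0));
    // Ensure a non-negative index in [0, numPartitions)
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);
//                job.setNumReduceTasks(R);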
85
MapReduce with Combiner
86 Shuffling

 Shuffling is the process of moving the intermediate


data provided by the partitioner to the reducer node.
During this phase, there are sorting and merging
subphases:
 Merging - combines all key-value pairs which have same
keys and returns (Key, List[Value]).
 Sorting - takes output from Merging step and sort all key-
value pairs by using Keys. This step also returns (Key,
List[Value]) output but with sorted key-value pairs.
 Output of shuffle-sort phase is sent directly to reducers.
87 Shuffle/Sort
MapReduce WordCount 2

 New goal: output all words sorted by their frequencies (total counts) in a document.
 Question: How would you adapt the basic word count program to solve this?
 Solution:
 Sort the words by their counts in the reducer
 Problem: what happens if we have more than one reducer?
MapReduce WordCount 2

 New goal: output all words sorted by their frequencies (total counts) in a document.
 Question: How would you adapt the basic word count program to solve this?
 Solution (see the sketch below):
 Do two rounds of MapReduce
 In the 2nd round, take the output of WordCount as input but switch the key and value of each pair!
 Do not merge
 Use only one reducer in the second round of MapReduce
 Leverage the sorting capability of shuffle/sort to do the global sorting!
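A hedged sketch of the second round: assuming the first WordCount job wrote its (word, count) pairs to a SequenceFile, a trivial mapper swaps key and value so the shuffle sorts by count, and a single reducer then writes the globally sorted list (Hadoop also ships an InverseMapper class that performs the same swap).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Second-round mapper: (word, count) -> (count, word), so shuffle/sort orders by frequency.
public class SwapMapper extends Mapper<Text, IntWritable, IntWritable, Text> {
  @Override
  protected void map(Text word, IntWritable count, Context context)
      throws IOException, InterruptedException {
    context.write(count, word);   // key = count, value = word
  }
}

// Driver hints (assumptions): the first job used SequenceFileOutputFormat, so here
//   job.setInputFormatClass(SequenceFileInputFormat.class);
//   job.setNumReduceTasks(1);          // single reducer => one globally sorted output file
// IntWritable keys sort in ascending order by default; a descending comparator can be
// supplied via job.setSortComparatorClass(...) if the most frequent words should come first.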
Example 2
 Problem: Find the maximum monthly temperature for each year from
weather reports
 Input: A set of records with format as:
<Year/Month, Average Temperature of that month>
- (200707,100), (200706,90)
- (200508, 90), (200607,100)
- (200708, 80), (200606,80)
 Question: write down the Map and Reduce function to solve this
problem
 Assume we split the input by line
Mapper and Reducer of Max Temperature

 Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
 Reduce(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int max_temp = -100;
for each v in values:
max_temp = max(v, max_temp);
Emit(key, max_temp);}

Note: here the Combiner is the same as the Reducer.
MapReduce Example: Max Temperature

Input: (200707,100), (200706,90) | (200508, 90), (200607,100) | (200708, 80), (200606,80)

Map: (2007,100), (2007,90) | (2005,90), (2006,100) | (2007,80), (2006,80)

Combine: (2007,100) | (2005,90), (2006,100) | (2007,80), (2006,80)

Partition (assuming there are 2 reducers):
  reducer 1: (2005,90)
  reducer 2: (2007,80), (2007,100), (2006,100), (2006,80)

Shuffle/Sort: (2005,[90]) | (2006,[100, 80]) | (2007,[100, 80])

Reduce (apply the reduce function to each key): (2005,90), (2006,100), (2007,100)
Example 2

 Key-Value Pair of Map and Reduce:


 Map: (year, temperature)
 Reduce: (year, maximum temperature of the year)

Question: How to use the above Map


Reduce program (that contains the
combiner) with slight changes to find the
average monthly temperature of the year?
Mapper and Reducer of Average
Temperature
 Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
 Reduce(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int total_temp = 0;
for each v in values:
total_temp= total_temp+v;
Emit(key, total_temp/size_of(values));}
MapReduce Example: Average Temperature (naive combiner)

Input: (200707,100), (200706,90) | (200508, 90), (200607,100) | (200708, 80), (200606,80)
(the real average of 2007 is 90)

Map: (2007,100), (2007,90) | (2005,90), (2006,100) | (2007,80), (2006,80)

Combine: (2007,95) | (2005,90), (2006,100) | (2007,80), (2006,80)

Shuffle/Sort: (2005,[90]) | (2006,[100, 80]) | (2007,[95, 80])

Reduce: (2005,90), (2006,90), (2007,87.5)
Example 2: Combiner

 The problem is with the combiner!
 Here is a simple counterexample:
 (2007, 100), (2007,90) -> (2007, 95); (2007,80) -> (2007,80)
 The average of the above is (2007, 87.5)
 However, the real average is (2007, 90)

 However, we can do a small trick to get around this:
 At the mappers (combiner): (2007, 100), (2007,90) -> (2007, <190,2>); (2007,80) -> (2007, <80,1>)
 At the reducer: (2007, <270,3>) -> (2007, 90)
MapReduce Example: Average Temperature (with <sum, count> pairs)

Input: (200707,100), (200706,90) | (200508, 90), (200607,100) | (200708, 80), (200606,80)

Map: (2007,100), (2007,90) | (2005,90), (2006,100) | (2007,80), (2006,80)

Combine: (2007,<190,2>) | (2005,<90,1>), (2006,<100,1>) | (2007,<80,1>), (2006,<80,1>)

Shuffle/Sort: (2005,[<90,1>]) | (2006,[<100,1>, <80,1>]) | (2007,[<190,2>, <80,1>])

Reduce: (2005,90), (2006,90), (2007,90)


Mapper, Combiner, and Reducer of Average Temperature
 Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
 Combine(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int total_temp = 0;
for each v in values:
total_temp = total_temp + v;
Emit(key, <total_temp, size_of(values)>);}
 Reduce(key, list of values){
// key: year
// list of values: a list of <temperature sum, count> tuples
int total_temp = 0;
int total_count = 0;
for each v in values:
total_temp = total_temp + v->sum;
total_count = total_count + v->count;
Emit(key, total_temp/total_count);}
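A hedged Java sketch of the same idea, assuming each input line holds a single “YYYYMM,temperature” record: the mapper emits the temperature together with a count of 1, so the combiner and reducer both consume and produce “sum,count” pairs and the job stays correct whether or not the combiner actually runs.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (year, "temperature,1") — the count travels with the sum.
class AvgTempMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
    // value: e.g. "200707,100"  (assumed: one record per line)
    String[] parts = value.toString().trim().split(",");
    String year = parts[0].substring(0, 4);
    ctx.write(new Text(year), new Text(parts[1].trim() + ",1"));
  }
}

// Combiner: partial sums — (year, "sum,count"); register with job.setCombinerClass(...).
class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text year, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text v : values) {
      String[] p = v.toString().split(",");
      sum += Long.parseLong(p[0]);
      count += Long.parseLong(p[1]);
    }
    ctx.write(year, new Text(sum + "," + count));
  }
}

// Reducer: divides the total sum by the total count to get the true average.
class AvgTempReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  @Override
  protected void reduce(Text year, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text v : values) {
      String[] p = v.toString().split(",");
      sum += Long.parseLong(p[0]);
      count += Long.parseLong(p[1]);
    }
    ctx.write(year, new DoubleWritable((double) sum / count));
  }
}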
Map Reduce Problems
Discussion
 Problem 3: Find Common Friends
 Statement: Given a group of people on online social media (e.g.,
Facebook), each has a list of friends, use Map-Reduce to find
common friends of any two persons who are friends
 Question:
 What are the Mapper and Reducer Functions?
Map Reduce Problems Discussion
 Problem 3: Find Common Friends
 Simple example (figure: the friendship graph over A, B, C, D):
Input:
A -> B,C,D
B -> A,C,D
C -> A,B
D -> A,B
MapReduce Output:
(A,B) -> C,D
(A,C) -> B
(A,D) -> ..
….
Mapper and Reducer of Common Friends
 Map(key, value){
// key: person_id
// value: the list of friends of the person
for each friend f_id in value:
// order the pair so that (A,B) and (B,A) produce the same key
Emit(<person_id, f_id>, value);}
 Reduce(key, list of values){
// key: <friend pair>
// list of values: the friend lists associated with the friend pair
for v1, v2 in values:
common_friends = v1 intersects v2;
Emit(key, common_friends);}
Map Reduce Problems
Discussion
 Problem 3: Find Common Friends
 Mapper and Reducer:

Mapper(friend list of a person)


{ for each person in the friend list:
Emit (<friend pair>, <list of friends>) }
Reducer(output of map)
{ Emit (<friend pair>, Intersection of two
(i.e, the one in friend pair) friend lists)}
Map Reduce Problems Discussion
 Problem 3: Find Common Friends (suggest friends ☺)
 Mapper and Reducer, worked example:

Input:            Map output:           Reduce output:
A -> B,C,D        (A,B) -> B,C,D        (A,B) -> C,D
B -> A,C,D        (A,C) -> B,C,D        (A,C) -> B
C -> A,B          (A,D) -> B,C,D        (A,D) -> B
D -> A,B          (A,B) -> A,C,D        (B,C) -> A
                  (B,C) -> A,C,D        (B,D) -> A
                  (B,D) -> A,C,D
                  (A,C) -> A,B
                  (B,C) -> A,B
                  (A,D) -> A,B
                  (B,D) -> A,B
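A hedged Java sketch of the same mapper and reducer, assuming input lines of the form “A -> B,C,D”; the pair key is ordered so that both friends' lists meet at the same reducer, which then intersects them.

import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for input "A -> B,C,D" emit ((A,B), B,C,D), ((A,C), B,C,D), ((A,D), B,C,D).
class CommonFriendsMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
    String[] parts = value.toString().split("->");
    String person = parts[0].trim();
    String friendList = parts[1].trim();
    for (String friend : friendList.split(",")) {
      friend = friend.trim();
      // Order the pair so (A,B) and (B,A) reach the same reducer.
      String pair = person.compareTo(friend) < 0 ? person + "," + friend : friend + "," + person;
      ctx.write(new Text(pair), new Text(friendList));
    }
  }
}

// Reducer: each pair key receives the two friend lists; emit their intersection.
class CommonFriendsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text pair, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    Set<String> common = null;
    for (Text v : values) {
      Set<String> friends = new LinkedHashSet<>(Arrays.asList(v.toString().split("\\s*,\\s*")));
      if (common == null) {
        common = friends;            // first friend list
      } else {
        common.retainAll(friends);   // intersect with the other list
      }
    }
    ctx.write(pair, new Text(String.join(",", common)));
  }
}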
Finding the Shortest Path
• A common graph
search application is
finding the shortest
path from a start node
to one or more target
nodes
• Commonly done on a
single machine with
Dijkstra's Algorithm
• Can we use BFS to
find the shortest path
via MapReduce?

This is called the single-source shortest path problem. (a.k.a. SSSP)


Finding the Shortest Path
● Consider the simple case of equal edge weights
● The solution to the problem can be defined inductively
● Here’s the intuition:
– Define: b is reachable from a if b is on the adjacency list of a
– DISTANCETO(s) = 0
– For all nodes p directly reachable from s, DISTANCETO(p) = 1
– For all nodes n reachable from some other set of nodes M,
  DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
(figure: node s reaches node n via intermediate nodes m1, m2, m3 at distances d1, d2, d3)
From Intuition to Algorithm

■ A map task receives a node n as a key, and


(D, points-to) as its value
□ D is the distance to the node from the start
□points-to is a list of nodes reachable from n
□ ∀p ∈ points-to, emit (p, D+1)

■ Reduce task gathers possible distances to


a given p and selects the minimum one
From Intuition to Algorithm
● Data representation:
– Key: node n
– Value: d (distance from start), adjacency list (list of nodes reachable
from n)
– Initialization: for all nodes except for start node, d = ∞
● Mapper:
– ∀m ∈ adjacency list: emit (m, d + 1)
● Sort/Shuffle
– Groups distances by reachable nodes
● Reducer:
– Selects minimum distance path for each reachable node
– Additional bookkeeping needed to keep track of actual path

Multiple Iterations Needed

● Each MapReduce iteration advances the “known frontier” by


one hop
– Subsequent iterations include more and more reachable nodes as
frontier expands
– Multiple iterations are needed to explore entire graph
● Preserving graph structure:
– Problem: Where did the adjacency list go?
– Solution: mapper emits (n, adjacency list) as well

109 Shortest path problem (figure)
110 Shortest path problem: mapper (figure)
111 Shortest path problem: reducer (figure)
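The mapper and reducer code on slides 110–111 did not survive extraction; below is a hedged Java sketch of one parallel-BFS iteration (equal edge weights, as in the preceding slides), assuming KeyValueTextInputFormat and node records encoded as “nodeId <TAB> distance|adj1,adj2,…” where distance is an integer or “inf”. The reducer's output uses the same encoding, so iterations can be chained.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit D+1 for every neighbour, and re-emit the node structure itself.
class BfsMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text nodeId, Text value, Context ctx) throws IOException, InterruptedException {
    String[] parts = value.toString().split("\\|", 2);
    String dist = parts[0];
    String adjList = parts.length > 1 ? parts[1] : "";

    // Preserve the graph structure across the iteration.
    ctx.write(nodeId, new Text("NODE|" + dist + "|" + adjList));

    if (!dist.equals("inf") && !adjList.isEmpty()) {
      long d = Long.parseLong(dist);
      for (String neighbour : adjList.split(",")) {
        ctx.write(new Text(neighbour.trim()), new Text("DIST|" + (d + 1)));
      }
    }
  }
}

// Reducer: keep the minimum distance seen and re-attach the adjacency list.
class BfsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text nodeId, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    long best = Long.MAX_VALUE;
    String adjList = "";
    for (Text v : values) {
      String[] parts = v.toString().split("\\|");
      if (parts[0].equals("NODE")) {
        adjList = parts.length > 2 ? parts[2] : "";
        if (!parts[1].equals("inf")) best = Math.min(best, Long.parseLong(parts[1]));
      } else {
        best = Math.min(best, Long.parseLong(parts[1]));
      }
    }
    String dist = best == Long.MAX_VALUE ? "inf" : Long.toString(best);
    ctx.write(nodeId, new Text(dist + "|" + adjList));
  }
}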
Visualizing Parallel BFS (figure: an example graph with nodes n0–n9)
113 Example: SSSP – Parallel BFS in MapReduce
● Adjacency matrix over nodes A–E (figure: weighted graph with source A at distance 0 and all other nodes at ∞)
● Adjacency list:
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
Example: SSSP – Parallel BFS in MapReduce
● Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
● Map output: <dest node ID, dist>
from A: <B, 10> <D, 5>   from B: <C, inf> <D, inf>   from C: <E, inf>
from D: <B, inf> <C, inf> <E, inf>   from E: <A, inf> <C, inf>
(plus the node records <n, <dist, adj list>> re-emitted to preserve the graph structure)
Flushed to local disk!!
115 Example: SSSP – Parallel BFS in MapReduce
● Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>  <A, inf>
<B, <inf, <(C, 1), (D, 2)>>>  <B, 10>  <B, inf>
<C, <inf, <(E, 4)>>>  <C, inf>  <C, inf>  <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>  <D, 5>  <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>  <E, inf>  <E, inf>
117 Example: SSSP – Parallel BFS in MapReduce
● Reduce output: <node ID, <dist, adj list>> = Map input for the next iteration (flushed to DFS!!)
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
● Map output: <dest node ID, dist>
from A: <B, 10> <D, 5>   from B: <C, 11> <D, 12>   from C: <E, inf>
from D: <B, 8> <C, 14> <E, 7>   from E: <A, inf> <C, inf>
(plus the node records; flushed to local disk!!)
Example: SSSP – Parallel BFS in MapReduce
● Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>  <A, inf>
<B, <10, <(C, 1), (D, 2)>>>  <B, 10>  <B, 8>
<C, <inf, <(E, 4)>>>  <C, 11>  <C, 14>  <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>  <D, 5>  <D, 12>
<E, <inf, <(A, 7), (C, 6)>>>  <E, inf>  <E, 7>
Example: SSSP – Parallel BFS in MapReduce
● Reduce output: <node ID, <dist, adj list>> = Map input for the next iteration (flushed to DFS!!)
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
… the rest omitted …
Blow-up and Termination

■ This algorithm starts from one node


■ Subsequent iterations include many more
nodes of the graph as frontier advances
■ Does this ever terminate?
□ Yes! Eventually, routes between nodes will
stop being discovered and no better distances
will be found. When distance is the same, we
stop
□Mapper should emit (n, D) to ensure that
“current distance” is carried into the reducer
Stopping Criterion
● How many iterations are needed in parallel BFS (equal edge
weight case)?
● Convince yourself: when a node is first “discovered”, we’ve
found the shortest path
● Now answer the question...
– Six degrees of separation?
● Practicalities of implementation in MapReduce

Comparison to Dijkstra

■ Dijkstra's algorithm is more efficient


because at any step it only pursues edges
from the minimum-cost path inside the
frontier
■ MapReduce version explores all paths in
parallel; not as efficient overall, but the
architecture is more scalable
■ Equivalent to Dijkstra for weight=1 case
124 Iterations of the MapReduce job

 Only one hop of the graph is completed in each MapReduce iteration
 Several iterations of MapReduce are needed
 How many iterations?
 Six (the diameter of a large general graph)
 Or until a criterion is met (e.g., all node distances have been calculated)
 The MapReduce counter API keeps count of the various events
 Driver code executes the iterations of the MapReduce job
125 Summary of MapReduce Graph Algorithms (figure)
PageRank: Random Walks
Over The Web

■ If a user starts at a random web page and


surfs by clicking links and randomly
entering new URLs, what is the probability
that s/he will arrive at a given page?
■ The PageRank of a page captures this
notion
□ More “popular” or “worthwhile” pages get a
higher rank
PageRank: Visually (figure: links among www.cnn.com, en.wikipedia.org, and www.nytimes.com)
PageRank: Formula

Given page A, and pages T1 through Tn


linking to A, PageRank is defined as:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +


PR(Tn)/C(Tn))

C(P) is the cardinality (out-degree) of page P


d is the damping (“random URL”) factor
PageRank: Intuition

■ Calculation is iterative: PRi+1 is based on PRi


■ Each page distributes its PRi to all pages it
links to. Linkees add up their awarded rank
fragments to find their PRi+1
■ d is a tunable parameter (usually = 0.85)
encapsulating the “random jump factor”

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


PageRank: First Implementation

■ Create two tables 'current' and 'next' holding


the PageRank for each page. Seed 'current'
with initial PR values
■ Iterate over all pages in the graph,
distributing PR from 'current' into 'next' of
linkees
■ current := next; next := fresh_table();
■ Go back to iteration step or end if converged
Distribution of the Algorithm

■ Key insights allowing parallelization:


□ The 'next' table depends on 'current', but not on
any other rows of 'next'
□Individual rows of the adjacency matrix can be
processed in parallel
□Sparse matrix rows are relatively small
Distribution of the Algorithm
Map step: break page rank into even fragments to distribute to link targets

■ Consequences of insights:
□ We can map each row of 'current' to a list of
PageRank “fragments” to assign to linkees
□These fragments
Reduce can
step: add together beintoreduced
fragments next PageRankinto a single
PageRank value for a page by summing
□Graph representation can be even more
compact; since each element is simply 0 or 1,
only transmit column numbers where it's 1

Iterate for next step...


Phase 1: Parse HTML

■ Map task takes (URL, page content) pairs


and maps them to (URL, (PRinit, list-of-urls))
□PRinit is the “seed” PageRank for URL
□list-of-urls contains all pages pointed to by URL

■ Reduce task is just the identity function


Phase 2: PageRank Distribution

■ Map task takes (URL, (cur_rank, url_list))


□ For each u in url_list, emit (u, cur_rank/|url_list|)
□ Emit (URL, url_list) to carry the points-to list
along through iterations

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


Phase 2: PageRank Distribution

■ Reduce task gets (URL, url_list) and many


(URL, val) values
□ Sum vals and fix up with d
□ Emit (URL, (new_rank, url_list))

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
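A hedged Java sketch of one Phase 2 iteration, assuming KeyValueTextInputFormat and records encoded as “url <TAB> rank|out1,out2,…”: the mapper distributes rank fragments and carries the points-to list along, and the reducer sums the fragments and applies the damping factor.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit cur_rank/|url_list| to every out-link, plus the structure record.
class PageRankMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text url, Text value, Context ctx) throws IOException, InterruptedException {
    String[] parts = value.toString().split("\\|", 2);
    double rank = Double.parseDouble(parts[0]);
    String[] outLinks = (parts.length > 1 && !parts[1].isEmpty()) ? parts[1].split(",") : new String[0];

    // Carry the points-to list along through the iteration.
    ctx.write(url, new Text("LINKS|" + (parts.length > 1 ? parts[1] : "")));

    // Distribute this page's rank evenly over its out-links.
    for (String target : outLinks) {
      ctx.write(new Text(target.trim()), new Text("RANK|" + rank / outLinks.length));
    }
  }
}

// Reducer: sum the fragments, fix up with d, re-attach the out-link list.
class PageRankReducer extends Reducer<Text, Text, Text, Text> {
  private static final double D = 0.85;   // damping factor

  @Override
  protected void reduce(Text url, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    double sum = 0.0;
    String links = "";
    for (Text v : values) {
      String[] parts = v.toString().split("\\|", 2);
      if (parts[0].equals("LINKS")) {
        links = parts.length > 1 ? parts[1] : "";
      } else {
        sum += Double.parseDouble(parts[1]);
      }
    }
    double newRank = (1 - D) + D * sum;    // PR(A) = (1-d) + d * (sum of fragments)
    ctx.write(url, new Text(newRank + "|" + links));
  }
}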


Finishing up...

■ A subsequent component determines


whether convergence has been achieved
(Fixed number of iterations? Comparison of
key values?)
■ If so, write out the PageRank lists - done!
■ Otherwise, feed output of Phase 2 into
another Phase 2 iteration
PageRank Conclusions
■ MapReduce runs the “heavy lifting” in
iterated computation
■ Key element in parallelization is
independent PageRank computations in a
given step
■ Parallelization requires thinking about
minimum data partitions to transmit (e.g.,
compact representations of graph rows)
□ Even the implementation shown today doesn't
actually scale to the whole Internet; but it works
for intermediate-sized graphs
138 MapReduce Algorithm for k-means clustering (figure)
139 How to find the mean? (figure)
140 Mapper for k-means (figure)
141 Reducer for k-means (figure)
142 MapReduce algorithm for k-means clustering (figure)
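The figures for the k-means mapper and reducer did not survive extraction. As a hedged sketch of the standard MapReduce formulation (one iteration over 2-D points; the “kmeans.centroids” configuration key is hypothetical): the mapper assigns each point to its nearest current centroid, and the reducer averages each cluster's points to obtain the new centroid. The driver would rerun the job with the updated centroids until they converge.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: points are lines "x,y"; current centroids come from the job configuration,
// e.g. conf.set("kmeans.centroids", "1.0,1.0;5.0,5.0")  (hypothetical key).
class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
  private final List<double[]> centroids = new ArrayList<>();

  @Override
  protected void setup(Context ctx) {
    for (String c : ctx.getConfiguration().get("kmeans.centroids").split(";")) {
      String[] p = c.split(",");
      centroids.add(new double[] {Double.parseDouble(p[0]), Double.parseDouble(p[1])});
    }
  }

  @Override
  protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
    String[] p = value.toString().split(",");
    double x = Double.parseDouble(p[0]), y = Double.parseDouble(p[1]);
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double dx = x - centroids.get(i)[0], dy = y - centroids.get(i)[1];
      double dist = dx * dx + dy * dy;            // squared Euclidean distance
      if (dist < best) { best = dist; nearest = i; }
    }
    ctx.write(new IntWritable(nearest), value);   // (cluster id, point)
  }
}

// Reducer: the new centroid of a cluster is the mean of its points.
class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
      throws IOException, InterruptedException {
    double sumX = 0, sumY = 0;
    long n = 0;
    for (Text point : points) {
      String[] p = point.toString().split(",");
      sumX += Double.parseDouble(p[0]);
      sumY += Double.parseDouble(p[1]);
      n++;
    }
    ctx.write(cluster, new Text((sumX / n) + "," + (sumY / n)));
  }
}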
143 MapReduce for k-NN classification Algorithm
 Let TR be a training set and TS a test set, of arbitrary sizes, stored in HDFS
 The map phase starts by dividing the TR set into a given number of disjoint subsets; each map task (Map1, Map2, ..., Mapm) will work on an associated subset TRj
 We compute the distance of each xt in TS against the instances of TRj
 For each test example xt, the class labels of the k closest neighbours (minimum distance) in TRj and their distances will be saved
 As a result, we obtain a matrix CDj of <class, distance> pairs with dimension n × k

 Therefore, at row t of CDj, we have the distances and the classes of the k nearest neighbours of xt
144 MapReduce for k-NN classification Algorithm (figure)
145 Mapper for k-NN
146 Reducer for k-NN
147 Reducer for k-NN

 The reduce phase consists of determining which of the tentative k nearest neighbours from the maps are the closest ones for the complete TS
 The setup operation will allocate a class-distance matrix CDreducer of fixed size (size(TS) × k neighbours)

 This matrix will be initialized with random values for the classes and positive infinity for the distances
148 Reducer for k-NN
149 Reducer for k-NN
150 More examples

 Distributed grep (as in the Unix grep command)
 Count of URL access frequency
 Reverse Web-Link Graph: list of all source URLs associated with a given target URL
 Inverted index: produces <word, list(Document ID)> pairs
 Distributed sort
151 MapReduce-Fault tolerance

 Worker failure: The master pings every worker


periodically. If no response is received from a worker in
a certain amount of time, the master marks the worker
as failed. Any map tasks completed by the worker are
reset back to their initial idle state, and therefore
become eligible for scheduling on other workers.
Similarly, any map task or reduce task in progress on a
failed worker is also reset to idle and becomes eligible
for rescheduling.
 Master Failure: It is easy to make the master write periodic
checkpoints of the master data structures described above. If the
master task dies, a new copy can be started from the last
checkpointed state. However, in most cases, the user restarts the
job.
152 Mapping workers to
Processors
 The input data (on HDFS) is stored on the local disks of the
machines in the cluster. HDFS divides each file into 128 MB
blocks, and stores several copies of each block (typically 3 copies)
on different machines.

 The MapReduce master takes the location information of the input


files into account and attempts to schedule a map task on a
machine that contains a replica of the corresponding input data.
Failing that, it attempts to schedule a map task near a replica of
that task's input data. When running large MapReduce operations
on a significant fraction of the workers in a cluster, most input data
is read locally and consumes no network bandwidth.
153 Task Granularity

 The map phase has M pieces and the reduce phase has R pieces.
 M and R should be much larger than the number of worker machines.
 Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails.
 The larger M and R are, the more decisions the master must make:
 O(M+R) decisions and O(M*R) states

 R is often constrained by users because the output of each reduce task ends up in a separate output file.
 Typically, M is chosen so that each mapper handles 128 MB of input data, and R is a small multiple of the number of worker machines.
 Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines.
Map Reduce vs. Parallel
Databases
 Map Reduce widely used for parallel processing
 Google, Yahoo, and 100’s of other companies
 Example uses: compute PageRank, build keyword
indices, do data analysis of web click logs, ….
 Database people say: but parallel databases
have been doing this for decades
 Map Reduce people say:
 we operate at scales of 1000’s of machines
 We handle failures seamlessly
 We allow procedural code in map and reduce and
allow data of any type
