Chap 3-5.-Hadoop Ecosystem YARN MapReduce - 1
Why Hadoop?
“The Hadoop Ecosystem is great for Big Data”
1. Enable Scalability
Commodity hardware is cheap
2. Handle Fault Tolerance
5. Provide Value
Community-supported
Wide range of applications
The rest of this Module…
Hive Pig
Giraph
Spark
Storm
Flink
MapReduce
HBase
Cassandra
MongoDB
Zookeeper
YARN
HDFS
Cloud Computing: IaaS, PaaS, SaaS
Hadoop
Exercises
The Hadoop Ecosystem:
One possible layer diagram for Hadoop
Higher levels: Interactivity
Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB, Zookeeper, YARN, HDFS
Lower levels: Storage and scheduling
HDFS: distributed file system as foundation
Scalable storage
Fault tolerance
YARN: flexible scheduling and resource management
MapReduce: simplified programming model
Map = apply()
Reduce = summarize()
Google used MapReduce for indexing web sites
Higher-level programming models
Pig = dataflow scripting
Hive = SQL-like queries
Hive created at Facebook
Specialized models for graph processing
Giraph used by Facebook to analyze social graphs
Real-time and in-memory processing (Storm, Spark, Flink)
In-memory processing can be up to 100x faster for some tasks
NoSQL for non-files (HBase, Cassandra, MongoDB)
Key-values
Sparse tables
HBase used for a messaging platform
Zookeeper for management
Synchronization
Configuration
High-availability
Created by Yahoo to wrangle services named after animals
All these tools are open-source
Large community for support
Download separately or as part of a pre-built image
HDFS can store massively large data sets:
up to 200 petabytes,
4,500 servers,
1 billion files and blocks!
HDFS splits files across nodes for parallel access
What happens if a node fails?
Replication for fault tolerance
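The splitting and replication idea can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the function names `split_into_blocks`, `place_replicas`, and `readable_blocks`, the tiny 8-byte block size, the round-robin placement, and the replication factor of 3 are all assumptions made for illustration (real HDFS uses much larger blocks and rack-aware placement).

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items() if any(n not in failed for n in holders)]


original = b"My apple is red and my rose is blue"
blocks = split_into_blocks(original, block_size=8)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
# With one node down, every block can still be read from another replica.
survivors = readable_blocks(placement, failed={"node2"})
```

Because each block lives on three different nodes in this sketch, losing a single node never makes a block unreadable.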
Customized reading to handle variety of file types
Text: lines, words
GIS: vectors, rasters
Bio: FASTA, FASTQ
Two key components of HDFS
1. NameNode for metadata
2. DataNode for block storage
Data partitioning for scalability
Data locality
YARN:
Hadoop evolved over time!
Hadoop 1.0
Stack: Hive, Pig, and others on MapReduce, on HDFS
Only MapReduce jobs
Other applications not supported
Poor resource utilization
One dataset, many applications
Hadoop 1.0: MapReduce on HDFS
Hadoop 2.0: MapReduce, Spark, and others on HDFS
Central Resource Manager == ultimate decision maker
Each machine gets a Node Manager
Resource Manager + Node Manager = Data Computation Framework
Application Master = personal negotiator
Negotiates resources with the Resource Manager
(Diagram: Resource Manager, Node Manager, Container)
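The division of labor between these roles can be illustrated with a toy model. This is only a sketch and not the YARN API: the class names mirror the slide's terms, but the `allocate` and `negotiate` methods, the memory-only capacity model, and the first-fit policy are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent that tracks free capacity (memory only, for simplicity)."""
    name: str
    free_mb: int


@dataclass
class ResourceManager:
    """Central decision maker: grants or refuses container requests."""
    nodes: list = field(default_factory=list)

    def allocate(self, mb):
        # First-fit: grant a container on the first node with enough free memory.
        for node in self.nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                return (node.name, mb)  # a "container": where it runs and how big it is
        return None  # no capacity anywhere in the cluster


@dataclass
class ApplicationMaster:
    """Per-application negotiator that asks the Resource Manager for containers."""
    app: str
    containers: list = field(default_factory=list)

    def negotiate(self, rm, tasks, mb_per_task):
        for _ in range(tasks):
            granted = rm.allocate(mb_per_task)
            if granted is None:
                break  # cluster is full; remaining tasks must wait
            self.containers.append(granted)
        return len(self.containers)


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 1024)])
am = ApplicationMaster("wordcount")
granted = am.negotiate(rm, tasks=4, mb_per_task=1024)
```

In this run the application asks for four 1024 MB containers but the cluster only has room for three, so the fourth request is refused.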
2X ↑ jobs per day
2.5X ↑ number of tasks from all jobs
2X ↑ CPU utilization
* Source: “Apache Hadoop YARN: Yet Another Resource Negotiator.” In Proceedings of the 4th Annual Symposium on Cloud Computing, 5:1–5:16. SOCC ’13.
YARN: more applications
Apache Hama
and growing …
Data -> Value
Many choices in Hadoop 2.0
Parallel Programming = Requires Expertise
Semaphores, threads, monitors, message passing, shared memory, locks
MapReduce = Only Map and Reduce!
Based on Functional Programming
Map = apply operation to all elements
Reduce = summarize operation on elements
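The functional roots of the model show up directly in Python's built-in `map` and `functools.reduce`. The small example below is only an illustration of the two ideas; the input list and the squaring operation are arbitrary choices.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply the same operation to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: summarize the elements into a single value.
total = reduce(lambda acc, x: acc + x, squares, 0)
```

Since each element is mapped independently of the others, the map step is the part that can be spread across many machines.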
Example MapReduce Application: WordCount
Input: File 1, File 2, …, File N
Output: Result File
Step 0: File is stored in HDFS
Step 1: Map on each node
My apple is red and my rose is blue....
…
…
Map generates key-value pairs
My apple is red and my rose is blue....
my, my -> (my, 1), (my, 1)
apple -> (apple, 1)
is, is -> (is, 1), (is, 1)
red -> (red, 1)
and -> (and, 1)
rose -> (rose, 1)
blue -> (blue, 1)
Map generates key-value pairs
You are the apple of my eye....
You -> (You, 1)
are -> (are, 1)
the -> (the, 1)
apple -> (apple, 1)
of -> (of, 1)
my -> (my, 1)
eye -> (eye, 1)
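A map function for WordCount can be sketched as follows. The name `map_words` is hypothetical, and lowercasing every word is an assumption of this sketch (the slides fold "My" into "my" but keep "You" capitalized).

```python
def map_words(line):
    """Emit one (word, 1) pair per word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


pairs = map_words("My apple is red and my rose is blue")
```

The mapper does no counting itself: repeated words simply produce repeated pairs, which later steps bring together.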
Step 2: Sort and Shuffle
Pairs with same key moved to same node
(You, 1)
(apple, 1), (apple, 1)
(is, 1), (is, 1)
(rose, 1)
(red, 1)
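One common way to make sure pairs with the same key land on the same node is to hash the key and take the result modulo the number of nodes. The sketch below assumes that approach; the helper names `partition` and `shuffle`, the use of `zlib.crc32` as the hash, and the two-node setup are illustrative choices, not something the slides specify.

```python
import zlib
from collections import defaultdict


def partition(key, num_nodes):
    """Pick a node for a key; a stable hash sends equal keys to the same node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def shuffle(pairs, num_nodes):
    """Group values by key on the node that owns the key, with keys sorted per node."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for key, value in pairs:
        nodes[partition(key, num_nodes)][key].append(value)
    return [dict(sorted(node.items())) for node in nodes]


mapped = [("my", 1), ("apple", 1), ("is", 1), ("you", 1), ("apple", 1), ("is", 1)]
per_node = shuffle(mapped, num_nodes=2)
```

Which node a given key ends up on depends on the hash, but both `("apple", 1)` pairs are guaranteed to arrive at the same one.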
Step 3: Reduce
Add values for same keys
(You, 1) -> (You, 1)
(apple, 1), (apple, 1) -> (apple, 2)
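The reduce step for WordCount is a sum over the values collected for each key. A minimal sketch, with the hypothetical helper name `reduce_counts` and the grouped input written out by hand from the example above:

```python
def reduce_counts(grouped):
    """Sum the list of values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


grouped = {"You": [1], "apple": [1, 1], "is": [1, 1], "rose": [1], "red": [1]}
counts = reduce_counts(grouped)
```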
Represents a large
number of applications.
Sort and Shuffle
(You, http://you1.fake)
(apple, http://apple1.fake)
(apple, http://apple2.fake)
(is, http://apple2.fake)
(is, http://apple2.fake)
(rose, http://apple2.fake)
(red, http://apple2.fake)
Reduce
Results for “apple”
Key: apple
Value: (apple -> http://apple1.fake, http://apple2.fake)
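The same map, shuffle, reduce pattern builds a keyword-to-URL index when the values are URLs instead of counts. The sketch below reuses the (keyword, URL) pairs from the slide; the function name `build_index` and the choice to drop duplicate URLs and sort them are assumptions of this example.

```python
from collections import defaultdict

pairs = [
    ("You", "http://you1.fake"),
    ("apple", "http://apple1.fake"),
    ("apple", "http://apple2.fake"),
    ("is", "http://apple2.fake"),
    ("is", "http://apple2.fake"),
    ("rose", "http://apple2.fake"),
    ("red", "http://apple2.fake"),
]


def build_index(pairs):
    """Group URLs by keyword, keeping each URL once per keyword."""
    grouped = defaultdict(set)
    for keyword, url in pairs:
        grouped[keyword].add(url)
    return {keyword: sorted(urls) for keyword, urls in grouped.items()}


index = build_index(pairs)
```

Looking up `index["apple"]` then returns both pages that mention the word.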
Map -> Shuffle and Sort -> Reduce
Map: parallelization over the input
Shuffle and Sort: parallelization over intermediate data (data sorting)
Reduce: parallelization over data groups
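Where the parallelism sits in each stage can be shown with a small end-to-end sketch. It uses a thread pool on a single machine purely to show the structure; the function names, the two-worker pool, and running the shuffle sequentially are assumptions of this example, and threads here stand in for what would be separate nodes in a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_split(split):
    """Map task: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def reduce_group(item):
    """Reduce task: sum the values of one key group."""
    key, values = item
    return key, sum(values)


def word_count(splits, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parallelization over the input: one map task per split.
        mapped = list(pool.map(map_split, splits))
        # Shuffle and sort: collect values by key, then order the keys.
        # (Done sequentially here; a cluster spreads this work across nodes.)
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)
        # Parallelization over data groups: one reduce task per key.
        return dict(pool.map(reduce_group, sorted(groups.items())))


result = word_count([
    "My apple is red and my rose is blue",
    "You are the apple of my eye",
])
```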
MapReduce is bad for: