
Chap 3-5.-Hadoop Ecosystem YARN MapReduce - 1

The document provides an overview of the Hadoop ecosystem, including its main components and how they relate to each other. It discusses how Hadoop enables scalability, fault tolerance, and handling of various data types through distributed computing. It then describes some of the major frameworks such as MapReduce, HDFS, YARN, Hive, Pig, Giraph, Spark, Storm and Flink that operate at different layers, from storage and resource management to higher-level programming models and specialized tools. The ecosystem is open-source with a large community supporting a wide range of applications.

Uploaded by

Sadia Promi

Getting Started:

Why Hadoop?
“The Hadoop Ecosystem is great for Big Data”

The 4 W’s (and H):


What’s in the ecosystem?
Why is it beneficial?
Where is it used?
Who uses it?
How do these tools work?
Major Goals

1. Enable Scalability
Commodity hardware is cheap

2. Handle Fault Tolerance
Be ready: crashes happen

3. Optimize for a Variety of Data Types

4. Facilitate a Shared Environment

5. Provide Value
Community-supported
Wide range of applications
The rest of this Module…

The Hadoop Ecosystem
Main Hadoop Components
Cloud Computing (IaaS, PaaS, SaaS)
When to use Hadoop?
Exercises
The Hadoop Ecosystem:

So much free stuff!


Yahoo created Hadoop in 2005
More Big Data frameworks released
Now there are over 100!
Layer Diagram
[Figure: generic layer diagram with A at the bottom, B and C above it, and D on top]
One possible layer diagram for Hadoop
[Figure: layer diagram with HDFS at the bottom and YARN above it; the other components shown are MapReduce, HBase, Cassandra, MongoDB, Zookeeper, Hive, Pig, Giraph, Spark, Storm, and Flink]
Higher levels: Interactivity
Lower levels: Storage and scheduling
Distributed file system as foundation
Scalable storage
Fault tolerance

Flexible scheduling and resource management
YARN schedules jobs on >40,000 servers at Yahoo
Simplified programming model
Map → apply()
Reduce → summarize()
Google used MapReduce for indexing web sites
Higher-level programming models
Pig = dataflow scripting
Hive = SQL-like queries
Pig created at Yahoo, Hive created at Facebook
Specialized models for graph processing
Giraph used by Facebook to analyze social graphs
Real-time and in-memory processing
In-memory → 100x faster for some tasks
NoSQL for non-files
Key-values
Sparse tables
HBase used for Facebook’s Messaging Platform
Zookeeper for management
Synchronization
Configuration
High-availability
Created by Yahoo to wrangle services named after animals
All these tools are open-source

Large community for support

Download separately or as part of a pre-built image

Growing number of open-source tools


The Hadoop Distributed File System (HDFS):

A Storage System for Big Data
HDFS = foundation for Hadoop ecosystem
Scalability
Reliability
Store massively large
data sets up to 200 Petabytes,
4500 servers,
1 billion files and blocks!
HDFS splits files across
nodes for parallel access
What happens if node fails?
Replication for fault tolerance
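
The splitting and replication described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the tiny block size, the node names, and the round-robin placement are assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def holders_after_failure(placement: Dict[int, List[str]],
                          failed_node: str) -> Dict[int, List[str]]:
    """Return, per block, the nodes that still hold a copy once one node fails."""
    return {b: [n for n in holders if n != failed_node]
            for b, holders in placement.items()}


data = b"My apple is red and my rose is blue...."
blocks = split_into_blocks(data, block_size=8)          # illustrative block size
nodes = ["node1", "node2", "node3", "node4"]            # hypothetical node names
placement = place_replicas(len(blocks), nodes)
survivors = holders_after_failure(placement, failed_node="node2")
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block to read from.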



Customized reading to handle variety of file types

Text: Lines, Words
GIS: Vectors, Rasters
Bio: FASTA, FASTQ
Two key components of HDFS
1. NameNode for metadata
Usually one per cluster

2. DataNode for block storage
Usually one per machine
The NameNode
coordinates operations
Keeps track of file name,
location in directory, etc.
Mapping of contents
on DataNode.
DataNode stores file blocks
Listens to NameNode for
block creation, deletion,
replication
Fault tolerance
Data locality

Data partitioning → Scalability
Data replication → Fault tolerance, Data locality


YARN:

The Resource Manager for Hadoop
HDFS Cluster Utilization

Share Hadoop across applications

Hadoop evolved over time!

Hadoop 1.0: Hive, Pig, and others on top of MapReduce and HDFS
Hadoop 2.0: Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB, and Zookeeper, with YARN on top of HDFS
Hadoop 1.0

Only MapReduce jobs
Other applications not supported
Poor resource utilization
One dataset → many applications

Hadoop 1.0: MapReduce on HDFS
Hadoop 2.0: MapReduce, Spark, and others on YARN (Yet Another Resource Negotiator) on HDFS
Central Resource Manager == ultimate decision maker
Each machine gets a Node Manager
Data Computation Framework: Resource Manager, Node Manager
Application Master = personal negotiator

Negotiates resources from the Resource Manager
Gets the job done with the Node Manager

Container = a "machine" (an allocation of resources such as memory and CPU on a node)
Essential gears in YARN engine

Resource Manager
Application Master
Node Manager
Container
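
To make the roles of these four pieces concrete, here is a minimal toy simulation in Python. It does not use any real YARN API: the class names only mirror the terms on the slides, and the first-fit allocation rule and memory figures are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeManager:
    """One per machine; tracks how much memory is still free on that machine."""
    name: str
    free_mb: int


@dataclass
class Container:
    """A slice of one machine's resources granted to an application."""
    node: str
    memory_mb: int


@dataclass
class ResourceManager:
    """Central decision maker: decides which node, if any, hosts a container."""
    nodes: List[NodeManager]

    def allocate(self, memory_mb: int) -> Optional[Container]:
        for node in self.nodes:                 # first-fit policy (an assumption)
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return Container(node.name, memory_mb)
        return None                             # cluster is full


@dataclass
class ApplicationMaster:
    """Per-application negotiator: asks the Resource Manager for containers."""
    rm: ResourceManager
    containers: List[Container] = field(default_factory=list)

    def request(self, count: int, memory_mb: int) -> int:
        for _ in range(count):
            container = self.rm.allocate(memory_mb)
            if container is None:
                break
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
am = ApplicationMaster(rm)
granted = am.request(count=4, memory_mb=2048)   # asks for more than the cluster has
```

In this run the application asks for four 2048 MB containers but the cluster only has room for three, so the fourth request is refused rather than oversubscribing a node.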
2X ↑ Jobs per day
2.5X ↑ Number of tasks from all jobs
2X ↑ CPU utilization

* Source: “Apache Hadoop YARN: Yet Another Resource Negotiator.” In Proceedings of the 4th Annual Symposium on Cloud Computing, 5:1–5:16. SOCC ’13.
YARN → More Applications

Apache Hama

and growing …

Data → Value
Many choices in Hadoop 2.0

One dataset → Many applications

Higher Resource Utilization → Lower Cost


MapReduce:

Simple Programming for Big Results
MapReduce = Programming
Model for Hadoop Ecosystem

Parallel Programming = Requires Expertise

Threads, Semaphores, Monitors, Locks, Shared Memory, Message Passing

MapReduce = Only Map and Reduce!
Based on Functional Programming

Map = apply operation to all elements: f(x) = y

Reduce = summarize operation on elements
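
The two functional building blocks can be tried directly with Python's built-in `map` and `functools.reduce`. This is only a single-machine illustration of the idea; the sample numbers and the squaring function are made up for the example.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply the same operation f(x) = x * x to every element independently.
squared = list(map(lambda x: x * x, numbers))

# Reduce: summarize all elements into one value, here by adding them up.
total = reduce(lambda acc, y: acc + y, squared, 0)

print(squared)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each element is mapped independently of the others, the map step is the part that can be spread across many machines.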
Example MapReduce Application: WordCount

[Figure: File 1, File 2, ..., File N → WordCount → Result File]
Step 0: File is stored in HDFS
Step 1: Map on each node
My apple is red and my rose is blue....

You are the apple of my eye....



Map generates key-value pairs

My apple is red and my rose is blue....

my, my → (my, 1), (my, 1)
apple → (apple, 1)
is, is → (is, 1), (is, 1)
red → (red, 1)
and → (and, 1)
rose → (rose, 1)
blue → (blue, 1)

You are the apple of my eye....

You → (You, 1)
are → (are, 1)
the → (the, 1)
apple → (apple, 1)
of → (of, 1)
my → (my, 1)
eye → (eye, 1)
Step 2: Sort and Shuffle
Pairs with same key moved to same node

(You, 1)
(apple, 1)
(apple, 1)
(is, 1)
(is, 1)
(rose, 1)
(red, 1)
Step 3: Reduce
Add values for same keys

(You, 1) → (You, 1)
(apple, 1), (apple, 1) → (apple, 2)
(my, 1), (my, 1), (my, 1) → (my, 3)
(red, 1) → (red, 1)
(rose, 1) → (rose, 1)
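
The three WordCount steps above can be reproduced on a single machine with a short Python sketch. It is not Hadoop code: the function names are hypothetical, and lowercasing every word (so that "My" and "my" share one key) is an assumption of this example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Step 1 (Map): emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in re.findall(r"[a-z]+", line.lower())]


def shuffle_and_sort(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Step 2 (Shuffle and Sort): bring all values for the same key together."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Step 3 (Reduce): add up the values collected for one key."""
    return key, sum(values)


lines = [
    "My apple is red and my rose is blue....",
    "You are the apple of my eye....",
]
mapped = [pair for line in lines for pair in map_words(line)]
grouped = shuffle_and_sort(mapped)
counts = dict(reduce_counts(key, values) for key, values in grouped.items())

print(counts["my"], counts["apple"])  # 3 2
```

The result matches the slides: "my" appears three times across the two lines and "apple" twice.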
Map → Shuffle and Sort → Reduce

Represents a large number of applications.
Sort and Shuffle

(You, http://you1.fake)
(apple, http://apple1.fake)
(apple, http://apple2.fake)
(is, http://apple2.fake)
(is, http://apple2.fake)
(rose, http://apple2.fake)
(red, http://apple2.fake)

Reduce Results for “apple”

Key: apple
Value: http://apple1.fake, http://apple2.fake
(apple -> http://apple1.fake, http://apple2.fake)
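
The same Map, Shuffle and Sort, Reduce pattern builds this word-to-URL index. The sketch below is a single-machine illustration; the page contents are invented for the example (only the URLs come from the slides), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical page contents keyed by the placeholder URLs used on the slides.
pages = {
    "http://you1.fake": "You",
    "http://apple1.fake": "apple",
    "http://apple2.fake": "apple is rose red",
}


def map_page(url: str, text: str) -> List[Tuple[str, str]]:
    """Map: emit a (word, url) pair for every word on a page."""
    return [(word, url) for word in text.split()]


def shuffle_and_sort(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle and Sort: collect all URLs emitted for the same word."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, url in pairs:
        groups[word].append(url)
    return dict(sorted(groups.items()))


def reduce_urls(word: str, urls: List[str]) -> Tuple[str, List[str]]:
    """Reduce: keep one sorted, de-duplicated URL list per word."""
    return word, sorted(set(urls))


mapped = [pair for url, text in pages.items() for pair in map_page(url, text)]
index = dict(reduce_urls(w, urls) for w, urls in shuffle_and_sort(mapped).items())

print(index["apple"])  # ['http://apple1.fake', 'http://apple2.fake']
```

Only the map and reduce functions changed compared with WordCount; the grouping step in the middle is identical, which is why the pattern covers so many applications.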
Map → Shuffle and Sort → Reduce

Map: Parallelization over the input
Shuffle and Sort: Parallelization over intermediate data (data sorting)
Reduce: Parallelization over data groups
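
Parallelization over the input can be illustrated with a thread pool standing in for the cluster nodes. This is a sketch under loud assumptions: two hard-coded partitions, Python threads instead of machines, and a per-partition `Counter` that merges the map output locally before the final sum.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each string stands in for one input partition stored on a different node.
partitions = [
    "my apple is red and my rose is blue",
    "you are the apple of my eye",
]


def map_partition(text: str) -> Counter:
    """Count the words of one partition, independently of all other partitions."""
    return Counter(text.split())


# The partitions are processed concurrently; no worker needs another's data.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(map_partition, partitions))

# Merging the partial results plays the role of the reduce step.
total = sum(partial_counts, Counter())

print(total["my"], total["apple"])  # 3 2
```

The workers never communicate while mapping, which is what lets the map phase scale with the number of partitions.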
MapReduce is bad for:

Frequently changing data
Dependent tasks
Interactive analysis
MapReduce

Simplified parallel programming
Applications with independent data-parallel tasks
