Chap 3-5.-Hadoop Ecosystem YARN MapReduce - 1

The document provides an overview of the Hadoop ecosystem, including its main components and how they relate to each other. It discusses how Hadoop enables scalability, fault tolerance, and handling of various data types through distributed computing. It then describes some of the major frameworks such as MapReduce, HDFS, YARN, Hive, Pig, Giraph, Spark, Storm and Flink that operate at different layers, from storage and resource management to higher-level programming models and specialized tools. The ecosystem is open-source with a large community supporting a wide range of applications.

Uploaded by

Sadia Promi

Getting Started:

Why Hadoop?
“The Hadoop Ecosystem is great for Big Data”

The 4 W’s (and H):

What’s in the ecosystem?
Why is it beneficial?
Where is it used?
Who uses it?
How do these tools work?
Major Goals

1. Enable Scalability
Commodity hardware is cheap
2. Handle Fault Tolerance
Be ready: crashes happen
3. Optimized for a Variety of Data Types
4. Facilitate a Shared Environment
5. Provide Value
Community-supported
Wide range of applications
The rest of this Module…

The Hadoop Ecosystem
Main Hadoop Components
Cloud Computing (IaaS, PaaS, SaaS)
When to use Hadoop?
Exercises
The Hadoop Ecosystem:

So much free stuff!

Yahoo created Hadoop in 2005
More Big Data frameworks released
Now there are over 100!
Layer Diagram
[Figure: generic layer diagram with A at the bottom, B and C in the middle, and D on top]
One possible layer diagram for Hadoop
[Figure: Hadoop ecosystem layer diagram with Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB, Zookeeper, YARN, and HDFS]
Higher levels: Interactivity
Lower levels: Storage and scheduling
Distributed file system as foundation
Scalable storage
Fault tolerance

Flexible scheduling and resource management
YARN schedules jobs on >40,000 servers at Yahoo
Simplified programming model
Map → apply()
Reduce → summarize()
Google used MapReduce for indexing web sites
Higher-level programming models
Pig = dataflow scripting
Hive = SQL-like queries
Pig created at Yahoo, Hive created at Facebook
Specialized models for graph processing
Giraph used by Facebook to analyze social graphs
Real-time and in-memory processing
In-memory → 100x faster for some tasks
NoSQL for non-files
Key-values
Sparse tables
HBase used for Facebook’s Messaging Platform
Zookeeper for management
Synchronization
Configuration
High-availability
Created by Yahoo to wrangle services named after animals
All these tools are open-source
Large community for support
Download separately or part of pre-built image
Growing number of open-source tools

The Hadoop Distributed File System (HDFS):
A Storage System for Big Data
HDFS = foundation for Hadoop ecosystem
Scalability
Reliability
Store massively large
data sets up to 200 Petabytes,
4500 servers,
1 billion files and blocks!
HDFS splits files across
nodes for parallel access
What happens if a node fails?
Replication for fault tolerance
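The splitting and replication idea can be sketched in a few lines of Python. This is a toy model, not how HDFS is implemented: the function names, the 256-byte block size, the replication factor of 3, the node names, and the round-robin placement rule are all assumptions chosen for illustration (real HDFS uses much larger blocks and rack-aware placement).

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is an assumed rule for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


def survives_failure(placement, failed_node):
    """True if every block still has a replica on some node other than the failed one."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


file_bytes = b"x" * 1000
blocks = split_into_blocks(file_bytes, block_size=256)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks), placement[3], survives_failure(placement, "node2"))
```

Because each block lives on several nodes, losing any single node in this sketch still leaves every block readable, which is the point of replication.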
Customized reading to handle variety of file types
Text: Lines, Words
GIS: Vectors, Rasters
Bio: FASTA, FASTQ
Two key components of HDFS
1. NameNode for metadata
Usually one per cluster
2. DataNode for block storage
Usually one per machine
The NameNode coordinates operations
Keeps track of file name, location in directory, etc.
Mapping of contents on DataNode.
DataNode stores file blocks
Listens to NameNode for block creation, deletion, replication
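The division of labor between the two components can be illustrated with a small Python sketch. The `NameNode` and `DataNode` classes below are hypothetical stand-ins, not the Hadoop API: the NameNode holds only metadata (file name to block IDs, block ID to holders), while the DataNodes hold the bytes and act on the NameNode's create and delete instructions. The placement rule and the replication factor of 2 are assumptions.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, payload):
        self.blocks[block_id] = payload

    def delete(self, block_id):
        self.blocks.pop(block_id, None)


class NameNode:
    """Keeps metadata only: which blocks form a file and where they live."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = {d.name: d for d in datanodes}
        self.replication = replication
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> names of DataNodes holding it
        self._next_id = 0

    def create(self, path, chunks):
        names = sorted(self.datanodes)
        block_ids = []
        for chunk in chunks:
            block_id = self._next_id
            self._next_id += 1
            # Assumed placement rule: consecutive nodes, round-robin.
            targets = [names[(block_id + k) % len(names)]
                       for k in range(self.replication)]
            for target in targets:
                self.datanodes[target].store(block_id, chunk)
            self.locations[block_id] = targets
            block_ids.append(block_id)
        self.files[path] = block_ids

    def read(self, path):
        parts = []
        for block_id in self.files[path]:
            holder = self.locations[block_id][0]
            parts.append(self.datanodes[holder].blocks[block_id])
        return b"".join(parts)

    def delete(self, path):
        for block_id in self.files.pop(path):
            for holder in self.locations.pop(block_id):
                self.datanodes[holder].delete(block_id)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes, replication=2)
nn.create("/logs/a.txt", [b"hello ", b"world"])
content = nn.read("/logs/a.txt")
print(content, nn.locations)
```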
Data partitioning → Scalability
Data replication → Fault tolerance
Data locality
YARN: The Resource Manager for Hadoop
HDFS Cluster Utilization
Share Hadoop across applications
Hadoop evolved over time!
[Figure: Hadoop 1.0 stack (Hive, Pig, and others on MapReduce on HDFS) beside the Hadoop 2.0 stack (Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB, and Zookeeper on YARN on HDFS)]
Hadoop 1.0
Only MapReduce jobs
Other applications not supported
Poor resource utilization
One dataset → many applications
[Figure: Hadoop 1.0 (MapReduce on HDFS) versus Hadoop 2.0 (MapReduce, Spark, and others on YARN on HDFS)]
YARN (Yet Another Resource Negotiator)
Central Resource Manager == ultimate decision maker
Each machine gets a Node Manager
Resource Manager + Node Manager = Data Computation Framework
Application Master = personal negotiator
Negotiates resources from the Resource Manager
Gets the job done with the Node Manager
Container = a share of a machine's resources (CPU, memory); as a simplification, think of it as a machine
Essential gears in YARN engine
Resource Manager
Application Master
Node Manager
Container
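How these four pieces interact can be sketched as a toy simulation in Python. Nothing here is the YARN API: the class names mirror the roles on the slide, memory is the only resource modeled, "pick the node with the most free memory" is an assumed scheduling policy, and containers are never released, to keep the sketch short.

```python
class NodeManager:
    """One per machine; tracks how much memory is still free there."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Central decision maker: chooses which node hosts each container."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        # Assumed policy: the node with the most free memory wins.
        candidates = [nm for nm in self.node_managers if nm.free_mb >= memory_mb]
        if not candidates:
            return None
        best = max(candidates, key=lambda nm: nm.free_mb)
        best.free_mb -= memory_mb
        # A container here is just a slice of one node's memory.
        return (best.name, memory_mb)


class ApplicationMaster:
    """Per-application negotiator: asks for containers, then runs tasks in them."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = []

    def run(self, tasks, memory_mb):
        results = []
        for task in tasks:
            container = self.rm.allocate(memory_mb)
            if container is None:
                raise RuntimeError("cluster is out of memory")
            self.containers.append(container)
            results.append(task())  # stand-in for launching work on the node
        return results


rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
am = ApplicationMaster(rm)
results = am.run([lambda: 1 + 1, lambda: 2 * 3, lambda: 10 - 4], memory_mb=1024)
print(results, am.containers)
```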
2X ↑ Jobs per day
2.5X ↑ Number of tasks from all jobs
2X ↑ CPU utilization

* Source: “Apache Hadoop YARN: Yet Another Resource Negotiator.” In Proceedings of the 4th Annual Symposium on Cloud Computing, 5:1–5:16. SOCC ’13.
YARN → More Applications
Apache Hama
and growing …
Many choices in Hadoop 2.0
Data → Value
One dataset → Many applications
Higher Resource Utilization → Lower Cost

MapReduce: Simple Programming for Big Results
MapReduce = Programming Model for Hadoop Ecosystem
Parallel Programming = Requires Expertise
Semaphores, Threads, Monitors, Message Passing, Shared Memory, Locks
MapReduce = Only Map and Reduce!
Based on Functional Programming
Map = apply operation to all elements, f(x) = y
Reduce = summarize operation on elements
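The functional roots are easy to see with Python's built-in `map` and `functools.reduce`. The squaring function and the sample numbers are arbitrary choices for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply f(x) = x * x to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: summarize all elements into one value (here, their sum).
total = reduce(lambda acc, y: acc + y, squares, 0)

print(squares, total)
```

Since each `map` call touches only its own element, the calls could run on different machines, which is what MapReduce exploits.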
Example MapReduce Application: WordCount
File 1, File 2, …, File N → WordCount → Result File
Step 0: File is stored in HDFS
Step 1: Map on each node
My apple is red and my rose is blue....
You are the apple of my eye....
Map generates key-value pairs
My apple is red and my rose is blue....
my, my → (my, 1), (my, 1)
apple → (apple, 1)
is, is → (is, 1), (is, 1)
red → (red, 1)
and → (and, 1)
rose → (rose, 1)
blue → (blue, 1)

Map generates key-value pairs
You are the apple of my eye....
You → (You, 1)
are → (are, 1)
the → (the, 1)
apple → (apple, 1)
of → (of, 1)
my → (my, 1)
eye → (eye, 1)
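A minimal Python sketch of the map step for WordCount follows. The function name `map_words` is hypothetical, and two choices are assumptions: words are lowercased (so "My" and "my" count together, as in the first slide) and periods are treated as whitespace.

```python
def map_words(line):
    """Emit one (word, 1) pair per word, in the order the words appear."""
    cleaned = line.lower().replace(".", " ")
    return [(word, 1) for word in cleaned.split()]


pairs = map_words("My apple is red and my rose is blue....")
print(pairs)
```

The mapper does no counting itself; it only tags each word with a 1 and leaves the adding up to the reduce step.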
Step 2: Sort and Shuffle
Pairs with same key moved to same node
(You, 1)
(apple, 1)
(apple, 1)
(is, 1)
(is, 1)
(rose, 1)
(red, 1)
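The shuffle step can be sketched in Python as grouping values by key and choosing a destination node per key. Both helper names are hypothetical, and using a CRC32 checksum modulo the number of reducers is an assumed partitioning rule; the only property that matters is that the same key always maps to the same node.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key; a stable checksum keeps equal keys together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs):
    """Group all values that share a key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


mapped = [("you", 1), ("apple", 1), ("is", 1), ("apple", 1),
          ("rose", 1), ("is", 1), ("red", 1)]
grouped = shuffle_and_sort(mapped)
print(grouped)
print({key: partition(key, 3) for key in grouped})
```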
Step 3: Reduce
Add values for same keys
(You, 1) → (You, 1)
(apple, 1), (apple, 1) → (apple, 2)
(my, 1), (my, 1), (my, 1) → (my, 3)
(red, 1) → (red, 1)
(rose, 1) → (rose, 1)
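The reduce step is a per-key sum, sketched below in Python. The function name `reduce_counts` is hypothetical, and the grouped input is written out by hand to match the pairs shown on the slide.

```python
def reduce_counts(grouped):
    """Add up the 1s collected for each key."""
    return {word: sum(ones) for word, ones in grouped.items()}


grouped = {
    "You": [1],
    "apple": [1, 1],
    "my": [1, 1, 1],
    "red": [1],
    "rose": [1],
}
counts = reduce_counts(grouped)
print(counts)
```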
Map → Shuffle and Sort → Reduce
Represents a large number of applications.
Sort and Shuffle
(You, http://you1.fake)
(apple, http://apple1.fake)
(apple, http://apple2.fake)
(is, http://apple2.fake)
(is, http://apple2.fake)
(rose, http://apple2.fake)
(red, http://apple2.fake)
Reduce Results for “apple”
(apple -> http://apple1.fake, http://apple2.fake)
Key = apple, Value = http://apple1.fake, http://apple2.fake
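The same three steps build a simple search index when the value is a URL instead of a count. The Python sketch below is an illustration under assumptions: the function names are hypothetical, the page texts are invented, and only the `.fake` URLs come from the slides.

```python
from collections import defaultdict


def map_page(url, text):
    """Emit one (word, url) pair per distinct word on the page."""
    return [(word, url) for word in set(text.lower().split())]


def build_index(pages):
    """Map every page, group URLs by word, and reduce each group to a sorted list."""
    groups = defaultdict(set)
    for url, text in pages.items():
        for word, page_url in map_page(url, text):
            groups[word].add(page_url)
    return {word: sorted(urls) for word, urls in groups.items()}


pages = {
    "http://apple1.fake": "apple pie recipe",            # invented page text
    "http://apple2.fake": "my apple is red and my rose is blue",
    "http://you1.fake": "you are here",                   # invented page text
}
index = build_index(pages)
print(index["apple"])
```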
Map → Shuffle and Sort → Reduce
Map: Parallelization over the input
Shuffle and Sort: Parallelization over intermediate data (data sorting)
Reduce: Parallelization over data groups
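Parallelization over the input can be sketched in Python by counting each input split in its own worker and merging the partial results. This is only a structural illustration: the function names and sample splits are assumptions, and a thread pool on one machine stands in for the many nodes of a cluster (CPython threads do not speed up CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_split(lines):
    """Count words within one input split, independently of all other splits."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(splits, workers=2):
    """Process the splits concurrently, then merge the per-split counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_split, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


splits = [["my apple is red"], ["you are the apple of my eye"]]
totals = parallel_word_count(splits)
print(totals["apple"], totals["my"], totals["red"])
```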
MapReduce is bad for:
Frequently changing data
Dependent tasks
Interactive analysis
MapReduce
Simplified parallel programming
Applications with independent data-parallel tasks
