
7082 CEM

Lecture 4 - Hadoop Ecosystem

DR. MARWAN FUAD

2020- 2021
Outline

o Part 1 - Introduction to Hadoop Ecosystem

o Part 2 - Some Hadoop Ecosystem Projects (1)

• Hive
• Pig
• HBase
• Cassandra
• Mahout

o Part 3 - Some Hadoop Ecosystem Projects (2)

• ZooKeeper
• Ambari
• Sqoop
• Impala
• Oozie
• Storm
• Flume
Lecture 4 - Part 1
Introduction to Hadoop Ecosystem
Introduction (1)

o The Hadoop ecosystem is a family of projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

o Most of these projects are hosted by the Apache Software Foundation (https://www.apache.org/), which provides support for a community of open source software projects.

o As the Hadoop ecosystem grows, more projects keep appearing. However, not all of these projects are hosted at Apache.

o It is important to mention that the number of projects keeps growing, and it is very difficult to keep track of all of them.
Introduction (2)

o Given the high number of these projects, a new Apache incubator project called BigTop was created (originally a contribution from Cloudera to open source). It includes all of the major Hadoop ecosystem components and runs a number of integration tests to ensure they all work in conjunction with each other.
Introduction (3)

o Some Hadoop Ecosystem Projects
[Figure: diagram of Hadoop ecosystem projects]

Introduction (4)

o Some Hadoop Ecosystem Projects
[Figure: diagram of Hadoop ecosystem projects (from https://www.edureka.co)]
Lecture 4 - Part 2
Some Hadoop Ecosystem Projects (1)
Hive (1)

o Hive is a framework for data warehousing on top of Hadoop.

o Originally, it was built by a team at Facebook to manage and learn from the huge volumes of data that Facebook produced daily. It was later adopted by Apache.

o Hive was created to make it possible for analysts with strong SQL skills to run queries on the huge volumes of data stored in HDFS. This is why it is usually thought of as "SQL for Hadoop".
Hive (2)

o The motivation behind building Apache Hive is that although SQL was not designed for Big Data problems, it is great for analysis, it is widely used by many organizations, and it is the language of choice for business intelligence.

o Queries in Hive are written in a SQL-like language called HiveQL.

o Hive has an important component called the metastore.
Hive (3)

o The metastore is stored in a relational database. It maintains metadata about what tables exist, their columns, privileges, and more (schema information).

o It is important to know that Hive uses schema-on-read: it does not enforce a schema on the data when they are written (unlike traditional SQL databases, which use schema-on-write).

o One of the important concepts in Hive is the view. A view can be thought of as a "virtual table" that presents data to users in a way that differs from the way it is actually stored on disk.

o A view is not materialized to disk when it is created.
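As a rough sketch, here is how these ideas might look in HiveQL (the table, column, and view names are hypothetical, invented purely for illustration; running this assumes a working Hive deployment):

```sql
-- With schema-on-read, this schema is not enforced when data is written;
-- it is applied only when the data is queried.
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  view_ts STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- LOAD simply moves the file into Hive's warehouse directory,
-- without validating it against the schema above.
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- A view: a stored query (a "virtual table"), not materialized to disk.
CREATE VIEW popular_urls AS
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```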


Pig (1)

o One frequent complaint about MapReduce is that it is difficult to program, because the programmer has to think about what they want to do at the level of map and reduce functions and job chaining.

o Pig was created to simplify Hadoop programming.

o It was first started at Yahoo! as a project to work rapidly with MapReduce. Later it was adopted as an Apache project.
Pig (2)

o Pig is a high-level language with rich data analysis capabilities. Compared with MapReduce, Pig uses much richer data structures, and the transformations it applies are much more powerful.

o Pig has two major components:

1. A high-level data processing language called Pig Latin.

2. A compiler that compiles and runs Pig Latin scripts. The execution can be local or distributed.

o A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.

o The operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs.
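A short, hypothetical Pig Latin script illustrating such a data flow (the file paths and field names are invented for illustration):

```pig
-- Load the input, declaring a schema for its fields.
records = LOAD '/data/page_views.tsv'
          AS (user_id:chararray, url:chararray, bytes:int);

-- Each statement below is one transformation in the data flow.
grouped = GROUP records BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(records) AS n;
ranked  = ORDER counts BY n DESC;

STORE ranked INTO '/output/url_counts';
```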
Pig (3)

o Internally, Pig turns the transformations into a series of MapReduce jobs, but the programmer is unaware of this.

o This allows the programmer to focus on the data rather than on the nature of the execution.

o One of the main advantages of Pig Latin is that it is a compact language. It is also extensible.
HBase (1)

o HBase is a database that is designed particularly for large datasets.

o Technically, HBase is a column-oriented database.

o It is based on Google's Bigtable.

o HBase is designed to be fault tolerant.

o It is designed to be fully distributed and highly available.
HBase (2)

o HBase uses Hadoop HDFS as its file system, in the same way that most traditional relational databases use the operating system's file system.

o By using HDFS as its file system, HBase is able to create tables of truly massive size.

o While HDFS allows a file of any structure to be stored within Hadoop, HBase does enforce structure on the data.

o Although HBase has the same objects we see in a relational database (columns, rows, tables, keys, etc.), these objects in HBase vary significantly from their relational counterparts.
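As a sketch of how those objects differ, here is a session in the interactive HBase shell (table and column names are hypothetical; note that 'info' is a column family, and columns such as 'info:name' live inside it and need not be declared in advance):

```shell
# Commands run inside `hbase shell` (requires a running HBase instance).
create 'users', 'info'                      # table with one column family
put 'users', 'row1', 'info:name', 'Ada'     # write a single cell
put 'users', 'row1', 'info:city', 'London'  # new column added on the fly
get 'users', 'row1'                         # read back the whole row
```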
HBase (3)

o [Figure: HBase data model compared to a relational data model]
Cassandra (1)

o Cassandra is a distributed, scalable, and fault-tolerant NoSQL, column-oriented database*.

o It is used for real-time views.

o Compared with HBase, Cassandra is more monolithic in its implementation; it has fewer dependencies.

o It uses the Cassandra Query Language (CQL).

o * In fact, this is a simplified description of Cassandra.
Cassandra (2)

o Cassandra has a masterless distributed architecture. Therefore, it does not have a single point of failure.

o Cassandra provides high availability through built-in support for data replication.

o Cassandra's model has advanced features that are beyond the scope of these lectures.
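A minimal CQL sketch showing how replication is declared when a keyspace is created (the keyspace, table, and replication factor below are illustrative, not from the lecture):

```sql
-- Replication is configured per keyspace; here, 3 copies of every row.
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE demo.users (
  user_id uuid PRIMARY KEY,
  name    text,
  city    text
);

INSERT INTO demo.users (user_id, name, city)
VALUES (uuid(), 'Ada', 'London');
```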
Mahout (1)

o This Apache project provides executable Java libraries to apply analytical techniques in a scalable manner to Big Data.

o Apache Mahout is the tool set that directs Hadoop to yield meaningful analytic results. (A mahout is a person who controls an elephant, which is why some Mahout logos show a man riding an elephant.)
Mahout (2)

o Mahout provides Java code that implements the algorithms for several techniques in the following three categories:
• Classification:
● Logistic regression
● Naïve Bayes
● Random forests
● Hidden Markov models
• Clustering:
● Canopy clustering
● K-means clustering
● Fuzzy k-means
● Expectation maximization (EM)
• Recommenders/collaborative filtering:
● Nondistributed recommenders
● Distributed item-based collaborative filtering
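To give a flavor of one listed technique, here is a toy single-machine k-means sketch in Python. This is not Mahout code (Mahout's value is running such algorithms at scale on Hadoop); the data and starting centroids are invented purely for intuition about the algorithm itself.

```python
# Toy k-means (Lloyd's algorithm) on 1-D points, for intuition only.

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps; return final centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans(data, [0.0, 5.0]))  # centroids converge near the two clusters
```

Mahout implements the same idea (and the other techniques above) as distributed Java jobs so that it scales to datasets far beyond one machine's memory.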
Lecture 4 - Part 3
Some Hadoop Ecosystem Projects (2)
ZooKeeper (1)

o ZooKeeper is Hadoop's distributed coordination service.

o When a message is sent across the network between two nodes and the network fails, the sender does not know whether the receiver got the message or not.

o It may have gotten through before the network failed, or it may not have. Or perhaps the receiver's process died.

o This is partial failure: when we don't even know whether an operation failed.

ZooKeeper (2)

o ZooKeeper does not hide partial failures.

o ZooKeeper gives a set of tools to build distributed applications that can safely handle partial failures.

o ZooKeeper has the following characteristics:

• it is simple
• it is highly available
• it facilitates loosely coupled interactions
• it is a library
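As a sketch, ZooKeeper's data model is a tree of small nodes called znodes, which can be explored with the zkCli.sh shell that ships with ZooKeeper (the paths and values below are hypothetical):

```shell
# Commands run inside `zkCli.sh` (requires a running ZooKeeper ensemble).
create /app "top"              # create a znode holding a small value
create /app/config "v1"
get /app/config                # read the value back
create -e /app/lock ""         # ephemeral znode: removed automatically
                               # when the creating client's session ends
```

Ephemeral znodes are the building block for the coordination recipes (locks, leader election, membership) that let applications handle partial failure safely.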
Ambari (1)

o Ambari enables Hadoop management by supporting provisioning, managing, and monitoring Hadoop clusters.

o It provides a very intuitive web-based user interface that allows administrators to manage Hadoop clusters.

o Ambari has three components:

• Ambari agents
• Ambari server
• Ambari Web
Ambari (2)

o Ambari supports many Hadoop components such as:

o HDFS
o MapReduce
o Hive
o Pig
o HBase
o ZooKeeper
o Oozie
o Sqoop
Sqoop (1)

o As we have seen in a previous lecture, data in an organization is often stored in structured data stores such as relational database management systems (RDBMS).

o Apache Sqoop allows users to extract data from a structured data store into Hadoop for further processing.

o This processing can be done with MapReduce programs or other higher-level tools such as Hive.

o When the final results of an analytic pipeline are available, Sqoop can export these results back to the data store for consumption by other clients.
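A sketch of what a Sqoop import invocation looks like on the command line (the JDBC URL, database, table name, and username are hypothetical, and running this assumes a Hadoop cluster with Sqoop installed):

```shell
# -P prompts for the database password;
# --num-mappers controls how many parallel map tasks do the import.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```

The corresponding `sqoop export` command moves results from HDFS back into the relational store.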
Sqoop (2)

o [Figure: Sqoop's import process]
Impala (1)

o Impala is open source data analytics software.

o It provides a SQL interface for analyzing large datasets stored in HDFS and HBase.

o It supports HiveQL, the SQL-like language supported by Hive.

o It was designed to overcome some of the limitations of Hive.

o It can be used for both batch and real-time queries.
Impala (2)

o Impala does not use MapReduce. Instead, it uses a specialized distributed query engine to avoid high latency.

o It generally provides an order-of-magnitude faster response time than Hive.

o It supports many of the same features as Hive.
Oozie (1)

o Large production clusters may run many coordinated MapReduce jobs in a workflow.

o Oozie is a system for running workflows of dependent jobs.

o Oozie is composed of two main parts:

• A workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on).

• A coordinator engine that runs workflow jobs based on predefined schedules and data availability.
Oozie (2)

o Oozie has been designed to scale.

o It can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs.
Storm

o Storm is a distributed real-time computation system for processing large volumes of high-velocity data.

o Storm uses three powerful abstractions:

• A spout is a source of streams in a computation.

• A bolt processes any number of input streams and produces any number of new output streams.

• A topology is a DAG (directed acyclic graph) of spouts and bolts.
Flume

o Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data.

o It has a simple and flexible architecture based on streaming data flows.

o It is robust and fault tolerant.

o It uses a simple, extensible data model that allows for online analytic applications.
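As a sketch, a Flume agent is wired together in a properties file as source → channel → sink; the agent, source, channel, and sink names below are illustrative:

```properties
# One agent with one source, one channel, and one sink.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: listen for events on a TCP port.
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink.
agent1.channels.ch1.type = memory

# Sink: write the events into HDFS.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```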
