
7082 CEM

Lecture 4 - Hadoop Ecosystem

DR. MARWAN FUAD

2020- 2021
Outline

o Part 1 - Introduction to Hadoop Ecosystem

o Part 2 - Some Hadoop Ecosystem Projects (1)

• Hive
• Pig
• HBase
• Cassandra
• Mahout

o Part 3 - Some Hadoop Ecosystem Projects (2)

• ZooKeeper
• Ambari
• Sqoop
• Impala
• Oozie
• Storm
• Flume
Lecture 4 - Part 1
Introduction to Hadoop Ecosystem
Introduction (1)

o The Hadoop ecosystem is a family of projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

o Most of these projects are hosted by the Apache Software Foundation (https://www.apache.org/), which provides support for a community of open source software projects.

o As the Hadoop ecosystem grows, more projects keep appearing. However, not all of these projects are hosted at Apache.

o It is important to mention that the number of projects keeps growing, and it is very difficult to keep track of all of them.
Introduction (2)

o Given the high number of these projects, a new Apache incubator project called BigTop was created (originally a contribution from Cloudera to open source). It includes all of the major Hadoop ecosystem components and runs a number of integration tests to ensure they all work in conjunction with each other.
Introduction (3)

o Some Hadoop Ecosystem Projects
[Figure: diagram of Hadoop ecosystem projects]

Introduction (4)

o Some Hadoop Ecosystem Projects
[Figure: diagram of Hadoop ecosystem projects (from https://www.edureka.co)]
Lecture 4 - Part 2
Some Hadoop Ecosystem Projects (1)
Hive (1)

o Hive is a framework for data warehousing on top of Hadoop.

o Originally, it was built by a team at Facebook to manage and learn from the huge volumes of data that Facebook produced daily. It was later adopted by Apache.

o Hive was created to make it possible for analysts with strong SQL skills to run queries on the huge volumes of data stored in HDFS. This is why it is usually thought of as "SQL for Hadoop".
Hive (2)

o The motivation behind building Apache Hive is that although SQL was not designed for Big Data problems, it is great for analysis, it is widely used by many organizations, and it is the language of choice for business intelligence.

o Queries in Hive are written in a SQL-like language called HiveQL.

o Hive has an important component called the metastore.
Hive (3)

o The metastore is stored in a relational database. It maintains metadata about what tables exist, their columns, privileges, and more (schema information).

o It is important to know that Hive uses schema-on-read: it does not enforce a schema on the data when they are written (unlike traditional SQL databases, which use schema-on-write).

o One of the important concepts in Hive is the view. A view can be thought of as a "virtual table" that presents data to users in a way that differs from the way it is actually stored on disk.

o A view is not materialized to disk when it is created.
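As a rough sketch, here is how these ideas might look in HiveQL (the table, column, and view names are hypothetical, invented purely for illustration; running this assumes a working Hive deployment):

```sql
-- With schema-on-read, this schema is not enforced when data is written;
-- it is applied only when the data is queried.
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  view_ts STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- LOAD simply moves the file into Hive's warehouse directory,
-- without validating it against the schema above.
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- A view: a stored query (a "virtual table"), not materialized to disk.
CREATE VIEW popular_urls AS
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```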


Pig (1)

o One frequent complaint about MapReduce is that it is difficult to program, because the programmer has to think about what they want to do at the level of map and reduce functions and job chaining.

o Pig was created to simplify Hadoop programming.

o It was first started at Yahoo! as a project to work rapidly with MapReduce. Later it was adopted as an Apache project.
Pig (2)

o Pig is a high-level language with rich data analysis capabilities. Compared with MapReduce, Pig uses much richer data structures, and the transformations it applies are much more powerful.

o Pig has two major components:

1. A high-level data processing language called Pig Latin.

2. A compiler that compiles and runs Pig Latin scripts. The execution can be local or distributed.

o A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.

o The operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs.
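A short, hypothetical Pig Latin script illustrating such a data flow (the file paths and field names are invented for illustration):

```pig
-- Load the input, declaring a schema for its fields.
records = LOAD '/data/page_views.tsv'
          AS (user_id:chararray, url:chararray, bytes:int);

-- Each statement below is one transformation in the data flow.
grouped = GROUP records BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(records) AS n;
ranked  = ORDER counts BY n DESC;

STORE ranked INTO '/output/url_counts';
```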
Pig (3)

o Internally, Pig turns the transformations into a series of MapReduce jobs, but the programmer is unaware of this.

o This allows the programmer to focus on the data rather than on the nature of the execution.

o One of the main advantages of Pig Latin is that it is a compact language. It is also extensible.
HBase (1)

o HBase is a database that is designed particularly for large datasets.

o Technically, HBase is a column-oriented database.

o It is based on Google's Bigtable.

o HBase is designed to be fault tolerant.

o It is designed to be fully distributed and highly available.
HBase (2)

o HBase uses Hadoop HDFS as its file system, in the same way that most traditional relational databases use the operating system's file system.

o By using HDFS as its file system, HBase is able to create tables of truly massive size.

o While HDFS allows a file of any structure to be stored within Hadoop, HBase does enforce structure on the data.

o Although HBase has the same objects we see in a relational database (columns, rows, tables, keys, etc.), these objects in HBase vary significantly from their relational counterparts.
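As a sketch of how those objects differ, here is a session in the interactive HBase shell (table and column names are hypothetical; note that 'info' is a column family, and columns such as 'info:name' live inside it and need not be declared in advance):

```shell
# Commands run inside `hbase shell` (requires a running HBase instance).
create 'users', 'info'                      # table with one column family
put 'users', 'row1', 'info:name', 'Ada'     # write a single cell
put 'users', 'row1', 'info:city', 'London'  # new column added on the fly
get 'users', 'row1'                         # read back the whole row
```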
HBase (3)

o [Figure: HBase data model compared to a relational data model]
Cassandra (1)

o Cassandra is a distributed, scalable, and fault-tolerant NoSQL, column-oriented database*.

o It is used for real-time views.

o Compared with HBase, Cassandra is more monolithic in its implementation; it has fewer dependencies.

o It uses the Cassandra Query Language (CQL).

o * In fact, this is a simplified description of Cassandra.
Cassandra (2)

o Cassandra has a masterless distributed architecture. Therefore, it does not have a single point of failure.

o Cassandra provides high availability through built-in support for data replication.

o Cassandra's model has advanced features that are beyond the scope of these lectures.
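A minimal CQL sketch showing how replication is declared when a keyspace is created (the keyspace, table, and replication factor below are illustrative, not from the lecture):

```sql
-- Replication is configured per keyspace; here, 3 copies of every row.
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE demo.users (
  user_id uuid PRIMARY KEY,
  name    text,
  city    text
);

INSERT INTO demo.users (user_id, name, city)
VALUES (uuid(), 'Ada', 'London');
```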
Mahout (1)

o This Apache project provides executable Java libraries to apply analytical techniques in a scalable manner to Big Data.

o Apache Mahout is the tool set that directs Hadoop to yield meaningful analytic results. (A mahout is a person who controls an elephant, which is why some Mahout logos show a man riding an elephant.)
Mahout (2)

o Mahout provides Java code that implements the algorithms for several techniques in the following three categories:
• Classification:
● Logistic regression
● Naïve Bayes
● Random forests
● Hidden Markov models
• Clustering:
● Canopy clustering
● K-means clustering
● Fuzzy k-means
● Expectation maximization (EM)
• Recommenders/collaborative filtering:
● Nondistributed recommenders
● Distributed item-based collaborative filtering
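To give a flavor of one listed technique, here is a toy single-machine k-means sketch in Python. This is not Mahout code (Mahout's value is running such algorithms at scale on Hadoop); the data and starting centroids are invented purely for intuition about the algorithm itself.

```python
# Toy k-means (Lloyd's algorithm) on 1-D points, for intuition only.

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps; return final centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans(data, [0.0, 5.0]))  # centroids converge near the two clusters
```

Mahout implements the same idea (and the other techniques above) as distributed Java jobs so that it scales to datasets far beyond one machine's memory.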
Lecture 4 - Part 3
Some Hadoop Ecosystem Projects (2)
ZooKeeper (1)

o ZooKeeper is Hadoop's distributed coordination service.

o When a message is sent across the network between two nodes and the network fails, the sender does not know whether the receiver got the message or not.

o It may have gotten through before the network failed, or it may not have. Or perhaps the receiver's process died.

o This is partial failure: when we don't even know whether an operation failed.

ZooKeeper (2)

o ZooKeeper does not hide partial failures.

o ZooKeeper gives a set of tools to build distributed applications that can safely handle partial failures.

o ZooKeeper has the following characteristics:

• it is simple
• it is highly available
• it facilitates loosely coupled interactions
• it is a library
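As a sketch, ZooKeeper's data model is a tree of small nodes called znodes, which can be explored with the zkCli.sh shell that ships with ZooKeeper (the paths and values below are hypothetical):

```shell
# Commands run inside `zkCli.sh` (requires a running ZooKeeper ensemble).
create /app "top"              # create a znode holding a small value
create /app/config "v1"
get /app/config                # read the value back
create -e /app/lock ""         # ephemeral znode: removed automatically
                               # when the creating client's session ends
```

Ephemeral znodes are the building block for the coordination recipes (locks, leader election, membership) that let applications handle partial failure safely.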
Ambari (1)

o Ambari enables Hadoop management by supporting provisioning, managing, and monitoring Hadoop clusters.

o It provides a very intuitive web-based user interface that allows administrators to manage Hadoop clusters.

o Ambari has three components:

• Ambari agents
• Ambari server
• Ambari Web
Ambari (2)

o Ambari supports many Hadoop components such as:

o HDFS
o MapReduce
o Hive
o Pig
o HBase
o ZooKeeper
o Oozie
o Sqoop
Sqoop (1)

o As we have seen in a previous lecture, data in an organization is often stored in structured data stores such as relational database management systems (RDBMS).

o Apache Sqoop allows users to extract data from a structured data store into Hadoop for further processing.

o This processing can be done with MapReduce programs or other higher-level tools such as Hive.

o When the final results of an analytic pipeline are available, Sqoop can export these results back to the data store for consumption by other clients.
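A sketch of what a Sqoop import invocation looks like on the command line (the JDBC URL, database, table name, and username are hypothetical, and running this assumes a Hadoop cluster with Sqoop installed):

```shell
# -P prompts for the database password;
# --num-mappers controls how many parallel map tasks do the import.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```

The corresponding `sqoop export` command moves results from HDFS back into the relational store.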
Sqoop (2)

o [Figure: Sqoop's import process]
Impala (1)

o Impala is open source data analytics software.

o It provides a SQL interface for analyzing large datasets stored in HDFS and HBase.

o It supports HiveQL, the SQL-like language supported by Hive.

o It was designed to overcome some of the limitations of Hive.

o It can be used for both batch and real-time queries.
Impala (2)

o Impala does not use MapReduce. Instead, it uses a specialized distributed query engine to avoid high latency.

o It generally provides an order-of-magnitude faster response time than Hive.

o It supports many of the same features as Hive.
Oozie (1)

o Large production clusters may run many coordinated MapReduce jobs in a workflow.

o Oozie is a system for running workflows of dependent jobs.

o Oozie is composed of two main parts:

• A workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on).

• A coordinator engine that runs workflow jobs based on predefined schedules and data availability.
Oozie (2)

o Oozie has been designed to scale.

o It can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs.
Storm

o Storm is a distributed real-time computation system for processing large volumes of high-velocity data.

o Storm uses three powerful abstractions:

• A spout is a source of streams in a computation.

• A bolt processes any number of input streams and produces any number of new output streams.

• A topology is a DAG (directed acyclic graph) of spouts and bolts.
Flume

o Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data.

o It has a simple and flexible architecture based on streaming data flows.

o It is robust and fault tolerant.

o It uses a simple, extensible data model that allows for online analytic applications.
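As a sketch, a Flume agent is wired together in a properties file as source → channel → sink; the agent, source, channel, and sink names below are illustrative:

```properties
# One agent with one source, one channel, and one sink.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: listen for events on a TCP port.
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink.
agent1.channels.ch1.type = memory

# Sink: write the events into HDFS.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```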
