Lecture 4 - Hadoop Ecosystem
2020-2021
Lecture 4 - Part 1: Introduction to the Hadoop Ecosystem
Introduction (1)
o The Hadoop ecosystem is a family of projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
o Given the high number of these projects, a new Apache incubator project called Bigtop was created (originally a contribution from Cloudera to open source). It includes all of the major Hadoop ecosystem components and runs a number of integration tests to ensure they all work in conjunction with each other.
Introduction (3)
[Figure: Hadoop ecosystem overview, from https://www.edureka.co]
Lecture 4 - Part 2: Some Hadoop Ecosystem Projects (1)
Hive (1)
o Hive was created to make it possible for analysts with strong SQL skills to run queries on the huge volumes of data stored in HDFS. This is why it is usually thought of as “SQL for Hadoop”.
Hive (2)
o The motivation behind building Apache Hive is that, while SQL is not suitable for every Big Data problem, it is great for many analyses. SQL is also widely used by many organizations and is the language of choice for business intelligence.
o One of the important concepts in Hive is the view. A view can be thought of as a “virtual table” that presents data to users in a way that differs from the way the data is actually stored on disk.
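The “virtual table” idea can be illustrated with any SQL engine. The sketch below uses Python’s built-in sqlite3 module (not Hive) and invented table and column names; Hive’s own syntax for this is the analogous CREATE VIEW ... AS SELECT statement.

```python
import sqlite3

# In-memory SQLite database to illustrate the "virtual table" idea behind views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT, level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    ("2021-01-01", "INFO", "started"),
    ("2021-01-01", "ERROR", "disk full"),
    ("2021-01-02", "ERROR", "timeout"),
])

# The view presents only the error records; no extra data is stored on disk.
conn.execute("CREATE VIEW errors AS SELECT ts, msg FROM logs WHERE level = 'ERROR'")

rows = conn.execute("SELECT ts, msg FROM errors ORDER BY ts").fetchall()
print(rows)  # [('2021-01-01', 'disk full'), ('2021-01-02', 'timeout')]
```

Queries against the view are rewritten against the underlying table at query time, which is exactly what makes a view “virtual”.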
Pig (1)
o Pig is a high-level language with rich data analysis capabilities. Compared with MapReduce, Pig offers much richer data structures, and the transformations it applies are much more powerful.
o The operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs.
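As a rough illustration of what such a data flow expresses, the plain-Python sketch below mimics a LOAD → FILTER → GROUP → aggregate pipeline (maximum temperature per year). The records and field meanings are invented for the example; in Pig Latin each step would be one relational operator that the execution environment compiles down to MapReduce jobs.

```python
from collections import defaultdict

# Toy records standing in for a file Pig would LOAD: (year, temperature, quality).
records = [(1950, 0, 1), (1950, 22, 1), (1950, -11, 1), (1949, 111, 1), (1949, 78, 2)]

# FILTER: keep readings whose (invented) quality code is 1, i.e. valid.
valid = [r for r in records if r[2] == 1]

# GROUP BY year, then FOREACH ... GENERATE MAX(temperature).
by_year = defaultdict(list)
for year, temp, _ in valid:
    by_year[year].append(temp)

max_temp = {year: max(temps) for year, temps in by_year.items()}
print(max_temp)  # {1950: 22, 1949: 111}
```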
HBase (1)
o HBase uses Hadoop HDFS as a file system in the same way that most traditional relational databases use the operating system file system.
o By using Hadoop HDFS as its file system, HBase is able to create tables of
truly massive size.
o While HDFS allows a file of any structure to be stored within Hadoop, HBase
does enforce structure on the data.
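The structure HBase enforces can be pictured as follows: a table declares its column families up front, and every cell is addressed by (row key, column family, column qualifier). The sketch below is not the HBase client API, just a toy Python model of that layout with invented names.

```python
# Minimal sketch (not the HBase API) of HBase's data model: a table declares its
# column families at creation time, and every cell lives at (row, family, qualifier).
class SketchTable:
    def __init__(self, families):
        self.families = set(families)   # the structure HBase enforces up front
        self.rows = {}                  # row key -> {family -> {qualifier -> value}}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise ValueError(f"unknown column family: {family}")
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        return self.rows[row][family][qualifier]

t = SketchTable(families=["info"])
t.put("row1", "info", "name", "Ada")
print(t.get("row1", "info", "name"))  # Ada
```

Within a declared family, qualifiers are free-form, which is why HBase is often described as semi-structured: the families are fixed, the columns inside them are not.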
Cassandra (1)
o Cassandra’s model has advanced features that are beyond the scope of these lectures.
Mahout (1)
o Mahout provides Java code that implements the algorithms for several techniques in the following three categories:
• Classification:
● Logistic regression
● Naïve Bayes
● Random forests
● Hidden Markov models
• Clustering:
● Canopy clustering
● K-means clustering
● Fuzzy k-means
● Expectation maximization (EM)
• Recommenders/collaborative filtering:
● Nondistributed recommenders
● Distributed item-based collaborative filtering
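Mahout’s implementations are distributed Java code, but the core iteration of k-means (one of the clustering algorithms listed above) is easy to sketch. The plain-Python, one-dimensional version below only shows the assignment and update steps; the data points and starting centers are invented.

```python
# Plain-Python sketch of k-means clustering on 1-D points. Mahout's version is
# distributed over MapReduce; this just shows the two steps each iteration repeats.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:  # assignment step: each point joins its nearest center
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its assigned points
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

print(kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[1.0, 10.0]))  # [2.0, 11.0]
```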
Lecture 4 - Part 3: Some Hadoop Ecosystem Projects (2)
ZooKeeper (1)
o When a message is sent between two nodes and the network fails, the sender does not know whether the message was delivered: it may have gotten through before the network failed, or it may not have. Or perhaps the receiver’s process died.
o ZooKeeper has the following characteristics:
• simple
• highly available
• facilitates loosely coupled interactions
• is a library
Ambari (1)
o Ambari provides provisioning, management, and monitoring of Hadoop clusters, covering components such as:
o HDFS
o MapReduce
o Hive
o Pig
o HBase
o ZooKeeper
o Oozie
o Sqoop
Sqoop (1)
o Apache Sqoop allows users to extract data from a structured data store into
Hadoop for further processing.
o This processing can be done with MapReduce programs or other higher-level tools
such as Hive.
o When the final results of an analytic pipeline are available, Sqoop can export
these results back to the data store for consumption by other clients.
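Conceptually, a Sqoop import reads rows from a relational table and writes them into Hadoop as delimited text records (real Sqoop generates MapReduce jobs and connects over JDBC). The sketch below imitates that idea with Python’s built-in sqlite3 and an in-memory buffer standing in for an HDFS file; the table name and data are invented.

```python
import sqlite3
import io

# Sketch of the idea behind a Sqoop import: read rows from a relational table
# and write them out as delimited text records for Hadoop tools to process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER, name TEXT, price REAL)")
conn.executemany("INSERT INTO widgets VALUES (?, ?, ?)",
                 [(1, "sprocket", 0.25), (2, "gizmo", 4.0)])

out = io.StringIO()  # stands in for a file in HDFS
for row in conn.execute("SELECT id, name, price FROM widgets ORDER BY id"):
    out.write(",".join(str(col) for col in row) + "\n")

print(out.getvalue())
# 1,sprocket,0.25
# 2,gizmo,4.0
```

An export runs the same idea in reverse: delimited records from HDFS are parsed and inserted back into the relational store.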
Oozie (1)
o Apache Oozie has two main parts:
• A workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on).
• A coordinator engine that runs workflow jobs based on predefined schedules and
data availability.
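The workflow-engine idea amounts to running actions in dependency order. The toy Python below is not Oozie (real workflows are XML definitions of Hadoop jobs); the action names and dependencies are invented to show that an action only runs after its prerequisites complete.

```python
# Toy sketch of a workflow engine: each action (a MapReduce, Pig, or Hive job in
# a real workflow) runs only after the actions it depends on have completed.
def run_workflow(actions, deps):
    """actions: name -> callable; deps: name -> list of prerequisite names."""
    done, order = set(), []
    def run(name):
        if name in done:
            return
        for d in deps.get(name, []):  # run prerequisites first
            run(d)
        actions[name]()
        done.add(name)
        order.append(name)
    for name in actions:
        run(name)
    return order

log = []
actions = {"report": lambda: log.append("report"),
           "ingest": lambda: log.append("ingest"),
           "clean":  lambda: log.append("clean")}
order = run_workflow(actions, deps={"clean": ["ingest"], "report": ["clean"]})
print(order)  # ['ingest', 'clean', 'report']
```

The coordinator engine adds the second dimension from the bullet above: deciding *when* to submit such a workflow, based on a schedule or on input data becoming available.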
Storm (1)
o In Storm, a bolt processes any number of input streams and produces any number of new output streams.
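The bolt idea can be sketched as functions that consume one stream of tuples and emit another; chaining them gives the classic streaming word-count topology. The Python generators below are only an analogy, not the Storm API, and the input sentences are invented.

```python
# Sketch of the "bolt" idea from stream processing: each bolt consumes tuples
# from an input stream and emits tuples on a new output stream.
def split_bolt(sentences):
    """Bolt 1: split each incoming sentence into individual word tuples."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Bolt 2: emit a running (word, count) tuple for every word seen."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

stream = ["the cat", "the dog"]
result = list(count_bolt(split_bolt(stream)))
print(result)  # [('the', 1), ('cat', 1), ('the', 2), ('dog', 1)]
```

In Storm the streams are unbounded and the bolts run in parallel across a cluster; the chaining of outputs to inputs is the same.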
Flume (1)
o Flume uses a simple extensible data model that allows for online analytic application.
References
• Beginning Apache Spark 2, Hien Luu (2018)
• Big Data - Principles and Best Practices of Scalable Real-Time Data Systems, Nathan Marz (2015)
• Data Science and Big Data Analytics - Discovering, Analyzing, Visualizing and Presenting Data, EMC Education
Services (2015)
• Next Generation Databases - NoSQL, NewSQL, and Big Data, Guy Harrison (2015)
• Seven Databases in Seven Weeks - A Guide to Modern Databases and the NoSQL Movement, 2nd Edition, Luc Perkins
(2018)