
Hadoop Ecosystem

LECTURE 13

Zaeem Anwaar
Assistant Director IT
Hadoop Ecosystem

There are many components that make up the Hadoop ecosystem.


Mandatory components: the Hadoop Distributed File System, or HDFS (stores data in a
distributed manner), Hadoop YARN (resource management and job scheduling), and
MapReduce (data access/processing); a minimal MapReduce sketch follows below.
Analogy for the mandatory components: on a laptop, MapReduce is like the CPU,
YARN is like the operating system, and HDFS is like the file system (e.g., the
New Technology File System, NTFS).
For some extra functionality (monitoring, management, security,
scalability), we use/need other components as well.
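
To make the MapReduce role concrete, here is a minimal sketch of the classic
word-count job in Java, assuming Hadoop's standard org.apache.hadoop.mapreduce
API; the class name and input/output paths are illustrative, not part of the
lecture.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            // YARN schedules this job; HDFS holds args[0] (input) and args[1] (output).
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Even this small job takes roughly 50 lines of Java; the Pig and Spark slides
later in this lecture show how the same task shrinks.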
Non-Mandatory (but Essential) Components
The following components are not mandatory, but they are essential for functionality
such as data collection, monitoring, processing (SQL queries and machine learning),
security, and scalability:
FLUME
SQOOP
Cloudera
HIVE
PIG
Mahout
Oozie
Zookeeper
HBASE/Drill
Spark
FLUME and Sqoop
These two components are used for data collection/ingestion.
For example, data on a laptop comes from:
Downloads
USB drives
External drives
Etc.
In Hadoop, data comes in through Flume and Sqoop.
Sqoop is for structured data (e.g., data generated by SQL/Oracle databases), and Flume
is for unstructured/semi-structured data collection.
Flume is the one used most in this era, since today's data is largely real-time and
unstructured.
Data collection comprises .CSV conversions from SQL, API/HTTP requests (e.g., with
Postman), scraping, etc.
The major difference between Sqoop and Flume is that Sqoop is used for loading data
from relational databases into HDFS, while Flume is used to capture a stream of
real-time, moving data.
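
A hedged sketch of both ingestion paths, assuming Sqoop's import command and a
minimal Flume agent configuration; the JDBC URL, table, paths, and agent names
are placeholders:

    # Sqoop: bulk-load one relational table into HDFS (-P prompts for the password)
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4

    # Flume: an agent (agent1.properties) that tails a log file into HDFS
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    agent1.sources.src1.type     = exec
    agent1.sources.src1.command  = tail -F /var/log/app/events.log
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory

    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.hdfs.path     = /data/raw/events
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel       = ch1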
HBASE
A NoSQL (column-oriented), non-relational distributed database.
NoSQL databases (aka "not only SQL") are non-tabular databases and store data
differently than relational tables.
It is modelled after Google's Bigtable, a distributed storage system designed to
cope with large data sets.
Large columns of data are stored using HBase rather than in the SQL/relational way.
Horizontal data scaling is done using this component.
Example: you have billions of customer emails and you need to find out the number
of customers who have used the word "complaint" in their emails. The request needs
to be processed quickly (i.e., in real time). So here we are handling a large data
set while retrieving a small amount of data. HBase was designed to solve these
kinds of problems.
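
As a hedged illustration of that example (the table, column family, and filter
are assumptions, not lecture material), the HBase shell could express the lookup
roughly like this:

    # create a table with one column family 'm' for message data
    create 'emails', 'm'

    # each email body becomes a cell in the 'm' family
    put 'emails', 'row1', 'm:body', 'I wish to file a complaint about ...'

    # scan for cells whose value contains the word "complaint"
    scan 'emails', { FILTER => "ValueFilter(=, 'substring:complaint')" }

At billions of rows this would run as a filtered, distributed scan across
regions; the sketch only shows the column-oriented access pattern.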
Cloudera

Cloudera's distribution includes all the leading Hadoop ecosystem components, so
that you can:

Search data,
Discover data,
Explore data.
All three of the above are done to the highest enterprise standards for
stability and reliability.
HIVE

The Apache Hive data warehouse software facilitates reading, writing, and managing
large datasets residing in distributed storage, using SQL queries.
Hive can only be used if the data is structured (as it is based on SQL commands).
It is an open-source data warehouse system for querying and analyzing large datasets
stored in Hadoop files.
Hive performs three main functions:
Data summarization (finding a compact description of a dataset)
Query
Analysis
The Hive Hadoop component is used mainly by data analysts.
Hive (Continued)

Basically, HIVE is a data warehousing component that performs reading, writing, and
managing of large data sets in a distributed environment using a SQL-like interface.
HIVE + SQL = HQL
The query language of Hive is called Hive Query Language (HQL), which is very
similar to SQL.
It has two basic components: the Hive command line and the JDBC/ODBC driver.
The Hive command-line interface is used to execute HQL commands.
Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used
to establish a connection to the data store.
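
As a hedged HQL sketch (the table, columns, and path are illustrative
assumptions), this is the kind of summarization-and-query work Hive is used for:

    -- define a table over delimited files already sitting in HDFS
    CREATE TABLE orders (id BIGINT, customer STRING, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA INPATH '/data/raw/orders' INTO TABLE orders;

    -- summarize: the ten customers with the highest total spend
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 10;

Hive compiles such queries down to jobs on the cluster, which is why it suits
analysts who know SQL but not Java.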
PIG
A high-level platform for creating programs that run on Apache Hadoop.
Scripting: the language of this platform is called Pig Latin.
Pig Latin is used to develop data analysis code.
Hand-written Mapper/Reducer class code is lengthy; the Pig scripting
libraries/APIs/functions cut the line count by roughly a factor of ten
(about 10 lines of Pig Latin ≈ 200 lines of Map-Reduce Java code).
For special-purpose processing, users can create their own functions.
The Pig Hadoop component is generally used by researchers and programmers.
Pig works with both structured and semi-structured data.
PIG Working:
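
As a hedged sketch of Pig at work, here is the word-count task again, this time
in Pig Latin; the input path is a placeholder:

    lines   = LOAD '/data/raw/notes.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    DUMP counts;

Five lines here replace the roughly 50-line Java job sketched earlier, which is
exactly the line-count reduction the previous slide describes.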
Mahout
Machine-learning algorithms.
Machine learning algorithms allow us to build self-learning machines that evolve
by themselves, without being explicitly programmed.
Supervised learning vs. unsupervised learning.
Mahout provides an environment for creating machine learning applications that
are scalable.
It performs collaborative filtering, clustering, and classification.
It has a predefined set of libraries that already contain built-in algorithms
for different use cases.
Mahout (continued)

Collaborative filtering: Mahout collects users' behaviors, patterns, and
characteristics, and based on these it predicts and makes recommendations to
users. The typical use case is an e-commerce website (a hedged code sketch
follows after this list).
Clustering: it organizes similar groups of data together; for example, a group
of articles can contain blogs, news, research papers, etc.
Classification: it means classifying and categorizing data into various
sub-departments; for example, articles can be categorized into blogs, news,
essays, research papers, and other categories.
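
A hedged sketch of collaborative filtering, assuming Mahout's classic Taste
recommender API (org.apache.mahout.cf.taste); the ratings file name and its
userID,itemID,preference layout are illustrative:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class RecommenderSketch {
        public static void main(String[] args) throws Exception {
            // ratings.csv lines look like: userID,itemID,preference (placeholder file)
            DataModel model = new FileDataModel(new File("ratings.csv"));
            // similarity between users, based on their rating patterns
            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // the 10 users most similar to the one we recommend for
            NearestNUserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);
            // top 3 item recommendations for user 1
            List<RecommendedItem> items = recommender.recommend(1L, 3);
            for (RecommendedItem item : items) System.out.println(item);
        }
    }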
Oozie
Defines the workflow.
Consider Apache Oozie a clock-and-alarm service inside the Hadoop ecosystem.
For Hadoop jobs, Oozie acts as a scheduler.
It schedules Hadoop jobs and binds them together as one logical unit of work.
The Oozie framework is fully integrated with Hadoop YARN as an architecture
center and supports Hadoop jobs.
There are two kinds of Oozie jobs:
Oozie workflow: a sequential set of actions to be executed (a workflow sketch
follows after the next slide).
Oozie coordinator: Oozie jobs that are triggered when the data they depend on
is made available.
Oozie

Oozie is scalable:
it can manage the timely execution of thousands of workflows
in a Hadoop cluster.
Oozie is also very flexible:
one can easily start, stop, suspend, and rerun jobs.
It is even possible to skip a specific failed node, or to rerun it,
in Oozie.
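
A hedged sketch of a workflow definition (workflow.xml); the workflow name, the
shell script, and the schema versions are assumptions for illustration:

    <!-- one logical unit of work: start -> ingest -> end (or fail) -->
    <workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
        <start to="ingest"/>
        <action name="ingest">
            <shell xmlns="uri:oozie:shell-action:0.3">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <exec>ingest.sh</exec>
                <file>scripts/ingest.sh</file>
            </shell>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed at [${wf:lastErrorNode()}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

A coordinator definition would wrap a workflow like this and trigger it on a
schedule or when an input dataset appears.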
Zookeeper

It is a centralized service:
a Hadoop ecosystem component for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
Zookeeper manages and coordinates a large cluster of machines.
Apache Zookeeper coordinates the various services in a distributed environment.
It saves a lot of time by performing synchronization, configuration maintenance,
grouping, and naming.
Zookeeper maintains a record of all transactions.
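
A hedged Java sketch of the configuration-maintenance use case, using the
standard ZooKeeper client API; the ensemble hosts, znode path, and stored value
are placeholders:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigSketch {
        public static void main(String[] args) throws Exception {
            // connect to a (hypothetical) 3-node ensemble; a real client would
            // wait for the connection event delivered to the watcher
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });

            // publish a small piece of shared configuration under a znode
            zk.create("/app/config", "batch.size=500".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // any machine in the cluster can now read the same value
            byte[] data = zk.getData("/app/config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }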
Spark

Apache Spark is a framework for real-time data analytics in a distributed
computing environment.
Spark is written in Scala and was originally developed at the University of
California, Berkeley.
It executes in-memory computations to increase the speed of data processing
over Map-Reduce.
It can be up to 100x faster than Hadoop for large-scale data processing, by
exploiting in-memory computations and other optimizations.
Therefore, it requires higher processing power (and more memory) than Map-Reduce.
Spark
Spark comes packed with high-level libraries, including support for SQL, Python,
Scala, Java, etc. These standard libraries make for seamless integration into
complex workflows. On top of this, it also allows various sets of services to
integrate with it, like MLlib, GraphX, SQL + DataFrames, streaming services,
etc., to increase its capabilities; a hedged word-count sketch follows below.
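
For contrast with the MapReduce version at the start of this lecture, here is a
hedged word-count sketch using Spark's Java API; the input path and local master
setting are placeholders:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // the dataset is held in memory between these transformations
            JavaRDD<String> lines = sc.textFile("hdfs:///data/raw/notes.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.take(10).forEach(System.out::println);
            sc.stop();
        }
    }

The in-memory pipeline above is why the same task needs far less code, and far
fewer disk round-trips, than the Map-Reduce version.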
"Apache Spark: A Killer or Savior of Apache Hadoop?"
The answer to this:
Apache Spark best fits real-time processing, whereas Hadoop was designed to
store unstructured data and execute batch processing over it.
When we combine Apache Spark's abilities (high processing speed, advanced
analytics, and support for multiple integrations) with Hadoop's low-cost
operation on commodity hardware, we get the best results.
