
Hadoop Ecosystem

LECTURE 13

Zaeem Anwaar
Assistant Director IT
Hadoop Ecosystem

There are many components that make up the Hadoop ecosystem.


Mandatory components: the Hadoop Distributed File System, or HDFS (stores data in a
distributed manner), Hadoop YARN (resource management and job scheduling), and
MapReduce (data access/processing); a minimal MapReduce sketch follows below.
Analogy for the mandatory components: on a laptop, MapReduce is like the CPU,
YARN is like the operating system, and HDFS is like the file system (e.g., the
New Technology File System, NTFS).
For some extra functionality (monitoring, management, security,
scalability), we use/need other components as well.
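
To make the MapReduce role concrete, here is a minimal sketch of the classic
word-count job in Java, assuming Hadoop's standard org.apache.hadoop.mapreduce
API; the class name and input/output paths are illustrative, not part of the
lecture.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            // YARN schedules this job; HDFS holds args[0] (input) and args[1] (output).
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Even this small job takes roughly 50 lines of Java; the Pig and Spark slides
later in this lecture show how the same task shrinks.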
Non-Mandatory (but Essential) Components
The following components are not mandatory, but they are essential for functionality
such as data collection, monitoring, processing (SQL queries and machine learning),
security, and scalability:
FLUME
SQOOP
Cloudera
HIVE
PIG
Mahout
Oozie
Zookeeper
HBASE/Drill
Spark
FLUME and Sqoop
These two components are used for data collection/ingestion.
For example, data on a laptop comes from:
Downloads
USB drives
External drives
Etc.
In Hadoop, data comes in through Flume and Sqoop.
Sqoop is for structured data (e.g., data generated by SQL/Oracle databases), and Flume
is for unstructured/semi-structured data collection.
Flume is the one used most in this era, since today's data is largely real-time and
unstructured.
Data collection comprises .CSV conversions from SQL, API/HTTP requests (e.g., with
Postman), scraping, etc.
The major difference between Sqoop and Flume is that Sqoop is used for loading data
from relational databases into HDFS, while Flume is used to capture a stream of
real-time, moving data.
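
A hedged sketch of both ingestion paths, assuming Sqoop's import command and a
minimal Flume agent configuration; the JDBC URL, table, paths, and agent names
are placeholders:

    # Sqoop: bulk-load one relational table into HDFS (-P prompts for the password)
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4

    # Flume: an agent (agent1.properties) that tails a log file into HDFS
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    agent1.sources.src1.type     = exec
    agent1.sources.src1.command  = tail -F /var/log/app/events.log
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory

    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.hdfs.path     = /data/raw/events
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel       = ch1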
HBASE
A NoSQL (column-oriented), non-relational distributed database.
NoSQL databases (aka "not only SQL") are non-tabular databases and store data
differently than relational tables.
It is modelled after Google's Bigtable, a distributed storage system designed to
cope with large data sets.
Large columns of data are stored using HBase rather than in the SQL/relational way.
Horizontal data scaling is done using this component.
Example: you have billions of customer emails and you need to find out the number
of customers who have used the word "complaint" in their emails. The request needs
to be processed quickly (i.e., in real time). So here we are handling a large data
set while retrieving a small amount of data. HBase was designed to solve these
kinds of problems.
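
As a hedged illustration of that example (the table, column family, and filter
are assumptions, not lecture material), the HBase shell could express the lookup
roughly like this:

    # create a table with one column family 'm' for message data
    create 'emails', 'm'

    # each email body becomes a cell in the 'm' family
    put 'emails', 'row1', 'm:body', 'I wish to file a complaint about ...'

    # scan for cells whose value contains the word "complaint"
    scan 'emails', { FILTER => "ValueFilter(=, 'substring:complaint')" }

At billions of rows this would run as a filtered, distributed scan across
regions; the sketch only shows the column-oriented access pattern.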
Cloudera

Cloudera's distribution includes all the leading Hadoop ecosystem components, so
that you can:

Search data,
Discover data,
Explore data.
All three of the above are done to the highest enterprise standards for
stability and reliability.
HIVE

The Apache Hive data warehouse software facilitates reading, writing, and managing
large datasets residing in distributed storage, using SQL queries.
Hive can only be used if the data is structured (as it is based on SQL commands).
It is an open-source data warehouse system for querying and analyzing large datasets
stored in Hadoop files.
Hive performs three main functions:
Data summarization (finding a compact description of a dataset)
Query
Analysis
The Hive Hadoop component is used mainly by data analysts.
Hive (Continued)

Basically, HIVE is a data warehousing component that performs reading, writing, and
managing of large data sets in a distributed environment using a SQL-like interface.
HIVE + SQL = HQL
The query language of Hive is called Hive Query Language (HQL), which is very
similar to SQL.
It has two basic components: the Hive command line and the JDBC/ODBC driver.
The Hive command-line interface is used to execute HQL commands.
Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used
to establish a connection to the data store.
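
As a hedged HQL sketch (the table, columns, and path are illustrative
assumptions), this is the kind of summarization-and-query work Hive is used for:

    -- define a table over delimited files already sitting in HDFS
    CREATE TABLE orders (id BIGINT, customer STRING, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA INPATH '/data/raw/orders' INTO TABLE orders;

    -- summarize: the ten customers with the highest total spend
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 10;

Hive compiles such queries down to jobs on the cluster, which is why it suits
analysts who know SQL but not Java.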
PIG
A high-level platform for creating programs that run on Apache Hadoop.
Scripting: the language of this platform is called Pig Latin.
Pig Latin is used to develop data analysis code.
Hand-written Mapper/Reducer class code is lengthy; the Pig scripting
libraries/APIs/functions cut the line count by roughly a factor of ten
(about 10 lines of Pig Latin ≈ 200 lines of Map-Reduce Java code).
For special-purpose processing, users can create their own functions.
The Pig Hadoop component is generally used by researchers and programmers.
Pig works with both structured and semi-structured data.
PIG Working:
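
As a hedged sketch of Pig at work, here is the word-count task again, this time
in Pig Latin; the input path is a placeholder:

    lines   = LOAD '/data/raw/notes.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    DUMP counts;

Five lines here replace the roughly 50-line Java job sketched earlier, which is
exactly the line-count reduction the previous slide describes.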
Mahout
Machine-learning algorithms.
Machine learning algorithms allow us to build self-learning machines that evolve
by themselves, without being explicitly programmed.
Supervised learning vs. unsupervised learning.
Mahout provides an environment for creating machine learning applications that
are scalable.
It performs collaborative filtering, clustering, and classification.
It has a predefined set of libraries that already contain built-in algorithms
for different use cases.
Mahout (continued)

Collaborative filtering: Mahout collects users' behaviors, patterns, and
characteristics, and based on these it predicts and makes recommendations to
users. The typical use case is an e-commerce website (a hedged code sketch
follows after this list).
Clustering: it organizes similar groups of data together; for example, a group
of articles can contain blogs, news, research papers, etc.
Classification: it means classifying and categorizing data into various
sub-departments; for example, articles can be categorized into blogs, news,
essays, research papers, and other categories.
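
A hedged sketch of collaborative filtering, assuming Mahout's classic Taste
recommender API (org.apache.mahout.cf.taste); the ratings file name and its
userID,itemID,preference layout are illustrative:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class RecommenderSketch {
        public static void main(String[] args) throws Exception {
            // ratings.csv lines look like: userID,itemID,preference (placeholder file)
            DataModel model = new FileDataModel(new File("ratings.csv"));
            // similarity between users, based on their rating patterns
            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // the 10 users most similar to the one we recommend for
            NearestNUserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);
            // top 3 item recommendations for user 1
            List<RecommendedItem> items = recommender.recommend(1L, 3);
            for (RecommendedItem item : items) System.out.println(item);
        }
    }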
Oozie
Defines the workflow.
Consider Apache Oozie a clock-and-alarm service inside the Hadoop ecosystem.
For Hadoop jobs, Oozie acts as a scheduler.
It schedules Hadoop jobs and binds them together as one logical unit of work.
The Oozie framework is fully integrated with Hadoop YARN as an architecture
center and supports Hadoop jobs.
There are two kinds of Oozie jobs:
Oozie workflow: a sequential set of actions to be executed (a workflow sketch
follows after the next slide).
Oozie coordinator: Oozie jobs that are triggered when the data they depend on
is made available.
Oozie

Oozie is scalable:
it can manage the timely execution of thousands of workflows
in a Hadoop cluster.
Oozie is also very flexible:
one can easily start, stop, suspend, and rerun jobs.
It is even possible to skip a specific failed node, or to rerun it,
in Oozie.
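
A hedged sketch of a workflow definition (workflow.xml); the workflow name, the
shell script, and the schema versions are assumptions for illustration:

    <!-- one logical unit of work: start -> ingest -> end (or fail) -->
    <workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
        <start to="ingest"/>
        <action name="ingest">
            <shell xmlns="uri:oozie:shell-action:0.3">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <exec>ingest.sh</exec>
                <file>scripts/ingest.sh</file>
            </shell>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed at [${wf:lastErrorNode()}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

A coordinator definition would wrap a workflow like this and trigger it on a
schedule or when an input dataset appears.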
Zookeeper

It is a centralized service:
a Hadoop ecosystem component for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
Zookeeper manages and coordinates a large cluster of machines.
Apache Zookeeper coordinates the various services in a distributed environment.
It saves a lot of time by performing synchronization, configuration maintenance,
grouping, and naming.
Zookeeper maintains a record of all transactions.
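
A hedged Java sketch of the configuration-maintenance use case, using the
standard ZooKeeper client API; the ensemble hosts, znode path, and stored value
are placeholders:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigSketch {
        public static void main(String[] args) throws Exception {
            // connect to a (hypothetical) 3-node ensemble; a real client would
            // wait for the connection event delivered to the watcher
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });

            // publish a small piece of shared configuration under a znode
            zk.create("/app/config", "batch.size=500".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // any machine in the cluster can now read the same value
            byte[] data = zk.getData("/app/config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }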
Spark

Apache Spark is a framework for real-time data analytics in a distributed
computing environment.
Spark is written in Scala and was originally developed at the University of
California, Berkeley.
It executes in-memory computations to increase the speed of data processing
over Map-Reduce.
It can be up to 100x faster than Hadoop for large-scale data processing, by
exploiting in-memory computations and other optimizations.
Therefore, it requires higher processing power (and more memory) than Map-Reduce.
Spark
Spark comes packed with high-level libraries, including support for SQL, Python,
Scala, Java, etc. These standard libraries make for seamless integration into
complex workflows. On top of this, it also allows various sets of services to
integrate with it, like MLlib, GraphX, SQL + DataFrames, streaming services,
etc., to increase its capabilities; a hedged word-count sketch follows below.
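
For contrast with the MapReduce version at the start of this lecture, here is a
hedged word-count sketch using Spark's Java API; the input path and local master
setting are placeholders:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // the dataset is held in memory between these transformations
            JavaRDD<String> lines = sc.textFile("hdfs:///data/raw/notes.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.take(10).forEach(System.out::println);
            sc.stop();
        }
    }

The in-memory pipeline above is why the same task needs far less code, and far
fewer disk round-trips, than the Map-Reduce version.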
"Apache Spark: A Killer or Savior of Apache Hadoop?"
The answer to this:
Apache Spark best fits real-time processing, whereas Hadoop was designed to
store unstructured data and execute batch processing over it.
When we combine Apache Spark's abilities (high processing speed, advanced
analytics, and support for multiple integrations) with Hadoop's low-cost
operation on commodity hardware, we get the best results.
