Pig Hive HBase
• Pig uses both the Hadoop Distributed File System (HDFS) to read and write files,
and MapReduce to execute jobs
EXECUTION TYPES
• Local Mode:
• Pig runs in a single JVM and accesses the local filesystem. This mode is
suitable only for small datasets and when trying out Pig.
• MapReduce Mode:
• Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. It is what you use when you want to run Pig on large
datasets.
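Assuming a local Pig installation, the execution mode is selected with the `-x` flag when starting Pig (the script name here is illustrative):

```
pig -x local myscript.pig       # local mode: single JVM, local filesystem
pig -x mapreduce myscript.pig   # MapReduce mode: runs on a Hadoop cluster (the default)
```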
RUNNING PIG PROGRAMS
There are three ways of executing Pig programs:
• Script -> Pig can run a script file that contains Pig commands
• Grunt -> an interactive shell for entering and running Pig commands
• Embedded -> users can run Pig programs from Java using the PigServer class
PIG LATIN DATA FLOW
• A LOAD statement to read data from the file system
• A series of transformation statements to process the data
• A DUMP or STORE statement to view the results or store them
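As a sketch, a complete Pig Latin data flow might look like the following (the file path, field names, and sentinel values are illustrative):

```
-- LOAD: read tab-separated records (hypothetical path and schema)
records  = LOAD 'input/temps.txt' AS (year:chararray, temp:int, quality:int);
-- transformations: keep valid readings, then find the maximum per year
filtered = FILTER records BY temp != 9999 AND quality == 0;
grouped  = GROUP filtered BY year;
max_temp = FOREACH grouped GENERATE group, MAX(filtered.temp);
-- DUMP: print the result to the console (STORE would write it back to HDFS)
DUMP max_temp;
```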
HIVE
• Developed by Facebook
• The Hive shell is the primary way that we will interact with Hive
HIVE DATA MODEL
• Tables: each table's data is stored in a directory in HDFS
• Primitives: numeric, boolean, string and timestamps
• Complex: Arrays, maps and structs
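A hypothetical HiveQL table definition showing both primitive and complex column types (all names are invented for illustration):

```
CREATE TABLE employees (
  name       STRING,                               -- primitive: string
  salary     DOUBLE,                               -- primitive: numeric
  active     BOOLEAN,                              -- primitive: boolean
  hired      TIMESTAMP,                            -- primitive: timestamp
  skills     ARRAY<STRING>,                        -- complex: array
  deductions MAP<STRING, DOUBLE>,                  -- complex: map
  address    STRUCT<street:STRING, city:STRING>    -- complex: struct
);
```

Hive stores this table's data files under its warehouse directory in HDFS.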
• Driver: handles sessions, and provides execute and fetch APIs modeled on JDBC/ODBC interfaces
• Metastore: Stores all the structure information of the various tables and partitions in the
warehouse
• Execution Engine: Manages dependencies between these different stages of the plan and
executes these stages on the appropriate system components
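The division of labor above can be observed with Hive's `EXPLAIN` statement, which shows the stage plan the driver compiles a query into before the execution engine runs the stages (table and column names are hypothetical):

```
EXPLAIN
SELECT dept, AVG(salary)
FROM employees
GROUP BY dept;
-- the output describes the plan as a set of dependent stages,
-- e.g. a map-reduce stage followed by a fetch stage
```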
HIVE VS PIG
PIG:
• Procedural data-flow language
• Operates on the client side of a cluster
• An easy way to process large-scale data
HIVE:
• Declarative, SQL-like language (HiveQL)
• Operates on the server side of a cluster
• Not designed for online transaction processing (OLTP)
HBASE VS HDFS
HBase:
• A database built on top of HDFS
• Provides fast lookups for larger tables
• Provides low-latency access to single rows from billions of records (random access)
• Internally uses hash tables, and stores the data in indexed HDFS files for faster lookups
HDFS:
• Suitable for storing large files
• Does not support fast individual record lookups
• Provides high-latency batch processing
• Provides only sequential access to data
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row key. The table schema
defines only column families; within a family, data is stored as key-value pairs. A table can have
multiple column families, and each column family can have any number of columns. Column values
within a family are stored contiguously on disk, and each cell value carries a timestamp. In short,
in HBase a table is a collection of rows, a row is a collection of column families, a column family
is a collection of columns, and a column is a collection of key-value pairs.
When deciding whether to use HBase, make sure you have enough data: if you have hundreds of
millions or billions of rows, then HBase is a good candidate.
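This model can be tried interactively in the HBase shell; a minimal sketch, assuming a running HBase instance, with the table and column names invented for illustration:

```
create 'users', 'info'                    # table with one column family
put 'users', 'row1', 'info:name', 'Ada'   # cell = row key + family:qualifier + value
put 'users', 'row1', 'info:name', 'Grace' # a new version; each cell value is timestamped
get 'users', 'row1'                       # random access to a single row by key
scan 'users'                              # rows come back sorted by row key
```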
What is Zookeeper?
• ZooKeeper is an open source Apache™ project that provides a centralized infrastructure
and services that enable synchronization across a cluster. ZooKeeper maintains common
objects needed in large cluster environments. Examples of these objects include
configuration information, hierarchical naming space, and so on. Applications can
leverage these services to coordinate distributed processing across large clusters.
• Cluster management − joining/leaving of a node in a cluster, and node status in real time
• Locking and synchronization service − locking the data while modifying it; this mechanism
helps with automatic failure recovery and is used by other distributed applications such as
Apache HBase
• Highly reliable data registry − Availability of data even when one or a few nodes are down
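These services are exposed through a small, filesystem-like namespace of znodes. A sketch using the `zkCli.sh` command-line client that ships with ZooKeeper (the znode path and data are illustrative; the `-w` watch flag is available in recent ZooKeeper releases):

```
create /app "config-v1"        # create a znode holding shared configuration data
get -w /app                    # read the data and set a watch on the znode
set /app "config-v2"           # watchers are notified that the data changed
ls /                           # list the children of the root znode
```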
How Zookeeper Works
• Zookeeper gives you a set of tools to build distributed applications that can safely handle partial failures
• Zookeeper is simple: at its core it is a stripped-down filesystem that exposes a few simple operations
• Zookeeper is expressive: its primitives are building blocks for a large class of coordination data structures and protocols
• Zookeeper is a library: it provides an open, shared repository of implementations and recipes of common coordination patterns