S Pig Hive HBase Zookeeper 07

Uploaded by

Johan Pp

Introduction to PIG, HIVE, HBASE & ZOOKEEPER
What is PIG?
Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.
Apache Pig provides a simpler, procedural language abstraction over MapReduce, exposing
a more SQL-like interface for Hadoop applications.
Pig Latin is the high-level language provided by Pig. Programmers can also develop
their own functions for reading, writing and processing data.
PIG Characteristics
► Operator set - Many operations such as join, filter and sort can be performed
through built-in operators.
► Programming ease - Pig Latin closely resembles SQL, so a Pig script is easy to
write if you are good at SQL.
► User-defined functions - Developers can create UDFs in other programming
languages such as Java and invoke them in Pig scripts.
► Extensibility - Developers can develop their own functions to read,
process and write data.
► Optimization opportunities - Pig optimizes the execution of tasks
automatically, so programmers only need to focus on the semantics of the
language.
Pig Latin - Data Flow:

► A LOAD statement to read data from the file system
► A series of "transformation" statements to process the data
► A DUMP statement to view the results or a STORE statement to save them
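
The load, transform, then dump-or-store shape can be sketched in plain Python. This is only an analogy, not Pig Latin and not how Pig executes on a cluster; the sample records, the salary threshold and all function names are assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input standing in for a file on the file system:
# name, department, salary.
RAW = "alice,eng,120\nbob,eng,100\ncarol,ops,90\n"


def load(text):
    """LOAD step: read delimited records into typed tuples."""
    reader = csv.reader(io.StringIO(text))
    return [(name, dept, int(salary)) for name, dept, salary in reader]


def filter_rows(rows, min_salary):
    """Transformation (filter): keep rows at or above a salary threshold."""
    return [row for row in rows if row[2] >= min_salary]


def group_and_sum(rows):
    """Transformation (group, then sum): total salary per department."""
    totals = defaultdict(int)
    for _name, dept, salary in rows:
        totals[dept] += salary
    return dict(totals)


def store(result):
    """STORE step: serialize the result (to a string here, not a file)."""
    return "\n".join(f"{dept}\t{total}" for dept, total in sorted(result.items()))


records = load(RAW)
well_paid = filter_rows(records, 100)
totals = group_and_sum(well_paid)
output = store(totals)
print(output)  # DUMP step: show the result on the console
```

Each step consumes the output of the previous one, which is the property that makes such a flow easy to parallelize.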
PIG Pros and Cons:
Pros:
• It has many advanced features built in, such as joins, secondary sort, many
optimizations, predicate push-down, etc.
• Provides a decent abstraction for MapReduce jobs, allowing for a faster result
than creating your own MR jobs.
• Can handle large and unstructured datasets.
Cons:
• There is no good IDE or Vim plugin that provides more functionality than
syntax completion for writing Pig scripts.
• Not mature. Even though it has been around for quite some time, it is still in
development.
• Pig does not support random writes to update small portions of data; all writes are
bulk, streaming writes, just like MapReduce.
What is Hive?

• Apache Hive is a data warehouse software project built on top of Apache Hadoop
• Used for data query and analysis
• Initially developed by Facebook
• Provides an SQL-like query language called HiveQL (HQL)
• The Hive shell is the primary way that we will interact with Hive
Hive Features
► It stores the schema in a database and the processed data in HDFS

► It is designed for online analytical processing (OLAP)

► It provides an SQL-like query language called HiveQL (HQL)

► It is familiar, fast, scalable, and extensible

► Automatic and configurable sharing of tables

► Automatic failover support
► Strictly consistent reads and writes
Hive architecture or data flow
► UI: Users submit queries and other operations to the system

► Driver: Receives the queries, implements session handles, and provides
execute and fetch APIs modeled on JDBC/ODBC interfaces

► Metastore: Stores all the structure information of the various tables and
partitions in the warehouse

► Compiler: Converts the HiveQL into a plan for execution

► Execution Engine: Executes the plan created by the compiler. It manages
dependencies between the different stages of the plan and executes these
stages on the appropriate system components
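
The hand-off from driver to compiler to execution engine can be sketched as below. This is a toy model under stated assumptions: the stage names and the three functions are hypothetical, the "compiler" returns a fixed plan instead of parsing HiveQL, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def compile_query(query):
    """Compiler (toy): turn a query string into a plan of named stages.

    The plan maps each stage to the set of stages it depends on.
    The stage names are hypothetical, not Hive's real operators.
    """
    if not query.strip().lower().startswith("select"):
        raise ValueError("this sketch only handles SELECT queries")
    return {
        "scan_table": set(),
        "filter_rows": {"scan_table"},
        "aggregate": {"filter_rows"},
        "write_result": {"aggregate"},
    }


def execute_plan(plan):
    """Execution engine (toy): run stages in an order that respects dependencies."""
    executed = []
    for stage in TopologicalSorter(plan).static_order():
        executed.append(stage)  # a real engine would launch a job here
    return executed


def driver(query):
    """Driver (toy): accept a query, have it compiled, then executed."""
    plan = compile_query(query)
    return execute_plan(plan)


order = driver("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
print(order)
```

The point of the sketch is the ordering guarantee: no stage runs before the stages it depends on.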
Hive Pros and Cons
Pros:
► Easy way to process large-scale data
► Converting between a variety of formats within Hive is simple
► Supports SQL-based queries
► Multiple users can simultaneously query the data using HiveQL
► Allows writing custom MapReduce framework processes to perform more
detailed data analysis
Cons:
► Not designed for online transaction processing (OLTP)
► Hive supports overwriting or appending data, but early versions do not
support updates and deletes
► Subquery support is limited (early versions allow subqueries only in the
FROM clause)
► No easy way to append data
Hive vs relational database
► Hive offers some functionality that is not available in relational databases

► Relational databases are "schema on write": data is checked against the schema
when it is written. Hive is "schema on read": the schema is applied only when
the data is read

► No support for UPDATE or DELETE in early versions of Hive

► No support for inserting single rows in early versions of Hive

► Supports partitioning and bucketing
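
To make "schema on read" concrete, here is a minimal Python sketch. It is an illustration under assumptions, not Hive code: the raw lines, the `SCHEMA` list and the `read_with_schema` helper are all invented for this example. Nothing is validated when the lines are stored; the schema is applied only while reading, and fields that do not fit become `None`.

```python
# Raw lines as they might sit in a file; nothing is validated at write time.
raw_lines = ["1,alice,30", "2,bob,not_a_number", "3,carol"]

# Hypothetical schema: column name and a converter applied at read time.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(line, schema):
    """Apply the schema while reading; bad or missing fields become None."""
    fields = line.split(",")
    row = {}
    for index, (column, convert) in enumerate(schema):
        try:
            row[column] = convert(fields[index])
        except (IndexError, ValueError):
            row[column] = None
    return row


rows = [read_with_schema(line, SCHEMA) for line in raw_lines]
for row in rows:
    print(row)
```

A schema-on-write system would instead reject the second and third lines at insert time.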

Hive vs Pig
Pig | Hive
Procedural data flow language | Declarative SQL-like language
Expects good development environments and debuggers | Expects better integration with technologies
Operates on the client side of a cluster | Operates on the server side of a cluster
Used for data analysis | Used for creating reports
Used to build complex data pipelines and machine learning, by users such as researchers and programmers | Used to analyze the data that is available, by users such as business analysts
Does not have a Thrift server | Has a Thrift server

What is HBase?
• HBase is a column-oriented, non-relational database management system
that runs on top of the Hadoop Distributed File System (HDFS).
• It provides Bigtable-like capabilities: random, real-time read/write access to big data.
• It is an open source project and is horizontally scalable.
• HBase isn't a relational data store.
• HBase applications are written in Java.
• HBase also supports writing applications through Apache Avro, REST and
Thrift.
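
The column-oriented, non-relational model can be pictured as a sorted map from row key to "family:qualifier" columns. The class below is a toy in-memory sketch of that idea, not the HBase client API; `ToyTable` and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy column-oriented table: row key -> "family:qualifier" -> value."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; the column family must have been declared."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key):
        """Read one row; rows may hold different columns, absent cells are not stored."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in sorted key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])


table = ToyTable(["info", "stats"])
table.put("user1", "info:name", "alice")
table.put("user1", "stats:logins", 7)
table.put("user2", "info:name", "bob")
scanned = list(table.scan("user1", "user2"))
print(scanned)
```

Unlike a relational table, only the column families are fixed up front; each row can carry its own set of columns.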
HBase Features
• Linear and modular scalability.
• Integrates with Hadoop both as a source and a destination.
• Automatic and configurable sharding of tables.
• Automatic failover support.
• Provides data replication across clusters.
• Easy-to-use Java API for client access.
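
Sharding of a table by row key can be sketched as cutting the sorted key space into contiguous ranges. The class below is a simplified illustration, not HBase's implementation; `ToyRegionMap`, its split keys and the choice that a boundary key belongs to the upper range are assumptions of this sketch.

```python
import bisect


class ToyRegionMap:
    """Toy sharding: a table's sorted key space is cut into contiguous regions."""

    def __init__(self, split_keys):
        # split_keys are region boundaries; N boundaries give N + 1 regions.
        self.split_keys = sorted(split_keys)

    def region_for(self, row_key):
        """Return the index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def split(self, new_boundary):
        """Add a boundary, as an automatic split of an oversized region might."""
        if new_boundary not in self.split_keys:
            bisect.insort(self.split_keys, new_boundary)


region_map = ToyRegionMap(["g", "n", "t"])
before_split = region_map.region_for("kiwi")
region_map.split("k")
after_split = region_map.region_for("kiwi")
print(before_split, after_split)
```

Because lookups only need the sorted boundaries, adding a split changes where a few keys live without touching the rest of the table.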
HBase architecture or data flow
HMaster: A lightweight process that assigns regions to region servers in the
Hadoop cluster for load balancing.
Region Server: The worker nodes, which handle read, write, update, and delete
requests from clients. The Region Server process runs on every node in the
Hadoop cluster, on the HDFS DataNodes.
ZooKeeper: The ZooKeeper service keeps track of all the region servers in the
HBase cluster.
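
The division of labour described here (a master that spreads regions over the region servers that are currently alive) can be sketched as follows. This is a deliberately naive policy for illustration, not HBase's actual balancer; `assign_regions`, the server names and the idea of passing the live-server list in as a plain Python list are assumptions.

```python
def assign_regions(regions, live_servers):
    """Give each region to the currently least-loaded server (toy balancing)."""
    if not live_servers:
        raise ValueError("no live region servers")
    assignment = {server: [] for server in live_servers}
    for region in regions:
        # Ties are broken by server name so the result is deterministic.
        target = min(assignment, key=lambda s: (len(assignment[s]), s))
        assignment[target].append(region)
    return assignment


all_regions = [f"region-{i}" for i in range(5)]
before = assign_regions(all_regions, ["rs1", "rs2", "rs3"])
# Suppose the coordination service reports that rs2 is gone; reassign.
after = assign_regions(all_regions, ["rs1", "rs3"])
print(before)
print(after)
```

In the second call every region still has an owner, which is the essence of failover in this picture.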
HBase Pros and Cons:
• Handles large datasets.
• Fast processing.
• Failover support and load sharing.
• Easy-to-use Java API for client access.
• Consistency, schema-less design and scalability.
Zookeeper: Introduction

● ZooKeeper is an open source Apache™ project that provides a centralized
infrastructure and services that enable synchronization.

● ZooKeeper provides an infrastructure for cross-node synchronization by
maintaining status-type information in memory on ZooKeeper servers.
Features of ZooKeeper:
• Fast processing: ZooKeeper is especially fast in "read-dominant" workloads (i.e.
workloads in which reads are much more common than writes).
• Reliable system: The system is very reliable, as it keeps working even if a node
fails.
• Scalable: The read performance of ZooKeeper can be improved by adding nodes.
• Simple architecture: The architecture of ZooKeeper is quite simple, as there is a
shared hierarchical namespace which helps coordinate the processes.

• One leader ZooKeeper server synchronizes a set of follower ZooKeeper servers
to be accessed by clients.
• Clients access ZooKeeper servers to retrieve and update synchronization
information of the entire cluster.
• Clients only connect to one server at a time.
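
The shared hierarchical namespace can be pictured as a small tree of slash-separated paths, each holding a little data. The class below is an in-memory toy for illustration only; it is not the ZooKeeper client API, and `ToyNamespace` and the example paths are hypothetical.

```python
class ToyNamespace:
    """Toy hierarchical namespace: slash-separated paths that hold small data."""

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        """Create a node; its parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent does not exist: {parent}")
        if path in self.nodes:
            raise KeyError(f"node already exists: {path}")
        self.nodes[path] = data

    def get(self, path):
        """Return the data stored at a node."""
        return self.nodes[path]

    def children(self, path):
        """Return the names of the direct children of a node, sorted."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self.nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


ns = ToyNamespace()
ns.create("/app")
ns.create("/app/workers")
ns.create("/app/workers/w1", b"host-a")
ns.create("/app/workers/w2", b"host-b")
print(ns.children("/app/workers"))
```

Processes that agree on a path such as `/app/workers` can coordinate simply by creating and listing nodes under it.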

Components of ZooKeeper:
• Client: A node in our distributed application cluster that accesses information from
the server. It interacts with the server to confirm that the connection is established.
• Server: A node in our ZooKeeper ensemble that provides all the services to clients. It
sends an acknowledgement to the client to indicate that the server is alive.
• Ensemble: A group of ZooKeeper servers. The minimum number of nodes
required to form an ensemble is 3.
• Leader: The server node which performs automatic recovery if any of the connected
nodes fails. Leaders are elected on service startup.
• Follower: A server node which follows the leader's instructions.
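
One way to see why 3 is given as the minimum ensemble size is the majority arithmetic below. It assumes the ensemble stays available only while a strict majority of its servers can reach each other; the function names are made up for this sketch.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under that assumption, ensembles of 1 or 2 servers tolerate no failure, 3 is the smallest size that survives one, and going from 3 to 4 adds no extra tolerance, which is why odd sizes are the usual choice.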
Thank You!!!

►QUESTIONS??
