S Pig Hive HBase Zookeeper 07

hbase

Uploaded by

Johan Pp
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
What is PIG?
Apache Pig is a platform for analyzing large data sets. It consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.
Apache Pig provides a simpler procedural language abstraction over MapReduce,
exposing a more SQL-like interface for Hadoop applications.
Pig Latin - The high-level language provided by Pig, in which programmers can also
develop their own functions for reading, writing and processing data.
PIG Characteristics
► Operator set - Many operations such as join, filter and sort can be performed
through the built-in operators.
► Programming ease - Pig Latin resembles SQL, so a Pig script is easy to
write if you are good at SQL.
► User defined functions - Developers can create UDFs in other programming
languages such as Java and invoke them in Pig scripts.
► Extensibility - Developers can develop their own functions to read,
process and write data.
► Optimization opportunities - Pig tasks optimize their execution
automatically, so programmers only need to focus on the semantics of the
language.
Pig Latin - Data Flow:
► A LOAD statement to read data from the file system
► A series of “transformation” statements to process the data
► A DUMP statement to view the results or a STORE statement to save them
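As a rough illustration of the LOAD, transform, DUMP flow above, here is a minimal sketch in plain Python rather than Pig Latin. The function names (`load`, `filter_rows`, `group_count`, `dump`) and the sample records are hypothetical; they only mimic the shape of a Pig script and are not part of Pig.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a file in HDFS: "user,page" records.
RAW_LINES = [
    "alice,home",
    "bob,search",
    "alice,search",
    "carol,home",
    "alice,home",
]


def load(lines):
    """LOAD step: parse each raw line into a (user, page) tuple."""
    return [tuple(line.split(",")) for line in lines]


def filter_rows(rows, page):
    """Transformation: keep only the rows for one page (similar to FILTER)."""
    return [row for row in rows if row[1] == page]


def group_count(rows):
    """Transformation: count rows per user (similar to GROUP followed by COUNT)."""
    return dict(Counter(user for user, _ in rows))


def dump(result):
    """DUMP step: format the result as sorted lines for display."""
    return [f"({user},{count})" for user, count in sorted(result.items())]


visits = load(RAW_LINES)
home_visits = filter_rows(visits, "home")
per_user = group_count(home_visits)
for line in dump(per_user):
    print(line)
```

Each step takes the previous step's output as its input, which is the data-flow style that Pig Latin scripts follow.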
PIG Pros and Cons:
Pros:
• It has many advanced features built in, such as joins, secondary sort, many
optimizations, predicate push-down, etc.
• Provides a decent abstraction for MapReduce jobs, allowing for a faster result
than creating your own MR jobs.
• Can handle large and unstructured datasets.
Cons:
• There is no good IDE or Vim plugin that provides more functionality than
syntax completion for writing Pig scripts.
• Not mature. Even though it has been around for quite some time, it is still in
development.
• Pig does not support random writes to update small portions of data; all writes are
bulk, streaming writes, just like MapReduce.
What is Hive?

• Apache Hive is a data warehouse software project built on top of Apache Hadoop
• Used for data query and analysis
• Originally developed by Facebook
• Provides an SQL-like language for querying called HiveQL or HQL
• The Hive shell is the primary way that we will interact with Hive
Hive Features
► It stores the schema in a database and the processed data in HDFS
► It is designed for OLAP
► It provides an SQL-like language for querying called HiveQL or HQL
► It is familiar, fast, scalable, and extensible
► Automatic and configurable sharding of tables
► Automatic failover support
► Strictly consistent reads and writes
Hive architecture or data flow
► UI: Users submit queries and other operations to the system
► Driver: Receives the queries, implements session handles, and provides execute
and fetch APIs modeled on JDBC/ODBC interfaces
► Metastore: Stores all the structure information of the various tables and
partitions in the warehouse
► Compiler: Converts the HiveQL into a plan for execution
► Execution Engine: Manages the dependencies between the different stages of the
plan and executes these stages on the appropriate system components
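To make the hand-off between these components concrete, here is a toy sketch in Python. It is not Hive code: the `METASTORE` and `WAREHOUSE` dictionaries, the two-stage plan, and the function names are all assumptions made up for this example.

```python
# Hypothetical metastore contents: table name -> column names.
METASTORE = {"sales": ["country", "amount"]}

# Hypothetical table data standing in for files in the warehouse.
WAREHOUSE = {
    "sales": [
        {"country": "SE", "amount": 10},
        {"country": "NO", "amount": 7},
    ]
}


def compile_query(table, columns, metastore):
    """Compiler: validate names against the metastore and emit an ordered plan."""
    if table not in metastore:
        raise ValueError(f"unknown table: {table}")
    unknown = [c for c in columns if c not in metastore[table]]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    return [("scan", table), ("project", columns)]


def execute(plan, warehouse):
    """Execution engine: run the stages in order, each using the previous output."""
    rows = []
    for stage, arg in plan:
        if stage == "scan":
            rows = list(warehouse[arg])
        elif stage == "project":
            rows = [{c: row[c] for c in arg} for row in rows]
    return rows


def driver(table, columns):
    """Driver: accept a request, have it compiled, execute the plan, return rows."""
    plan = compile_query(table, columns, METASTORE)
    return execute(plan, WAREHOUSE)


print(driver("sales", ["country"]))
```

The point of the sketch is the ordering: the compiler consults the metastore before anything runs, and the execution engine only sees a plan made of stages.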
Hive Pros and Cons
Pros
► Easy way to process large scale data
► Converting between a variety of formats within Hive is simple
► Supports SQL based queries
► Multiple users can simultaneously query the data using HiveQL
► Allows writing custom MapReduce processes to perform more detailed data
analysis
Cons
► Not designed for online transaction processing (OLTP)
► Hive supports overwriting or appending data, but not updates and deletes
► Subqueries are not supported
► No easy way to append data
Hive vs relational database
Hive offers some functionality that relational databases do not provide:
► Relational databases enforce the schema when data is written ("schema on write");
Hive applies the schema only when data is read ("schema on read")
► No support for Update or Delete in Hive
► No support for inserting single rows
► Supports Partitioning and Bucketing
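The last point, partitioning and bucketing, can be sketched in a few lines of Python. This is only a model of the idea: rows are grouped by the value of a partition column, then spread over a fixed number of buckets by hashing another column. The bucket count, the `ORDERS` rows, and the use of CRC32 are assumptions for illustration; Hive uses its own hash function and storage layout.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, chosen only for this example


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Pick a bucket by hashing the bucketing column and taking the remainder.

    CRC32 is a stand-in so results are stable across runs; it is not Hive's hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def place(rows, partition_col, bucket_col):
    """Group rows by (partition label, bucket index)."""
    layout = defaultdict(list)
    for row in rows:
        key = (f"{partition_col}={row[partition_col]}", bucket_for(row[bucket_col]))
        layout[key].append(row)
    return dict(layout)


def read_partition(layout, partition_label):
    """Partition pruning: return only the rows stored under one partition."""
    return [
        row
        for (label, _), bucket_rows in layout.items()
        if label == partition_label
        for row in bucket_rows
    ]


# Hypothetical sample rows.
ORDERS = [
    {"country": "SE", "user_id": 1, "amount": 10},
    {"country": "SE", "user_id": 2, "amount": 5},
    {"country": "NO", "user_id": 1, "amount": 7},
]

layout = place(ORDERS, "country", "user_id")
print(sorted(layout))
print(read_partition(layout, "country=SE"))
```

A query that filters on the partition column only has to touch one partition, and rows with the same bucketing value always land in the same bucket.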


Hive vs Pig
Pig | Hive
Procedural data flow language | Declarative SQL language
Expects good development environments and debuggers | Expects better integration with other technologies
Operates on the client side of a cluster | Operates on the server side of a cluster
Used for data analysis | Used for creating reports
Used by researchers and programmers to build complex data pipelines and machine learning workflows | Used by business analysts to analyze the data that is available
Does not have a Thrift server | Has a Thrift server


What is HBase?
• HBase is a column-oriented, non-relational database management system
that runs on top of the Hadoop Distributed File System (HDFS).
• It provides Bigtable-like capabilities: read/write access to Big Data.
• It is an open source project and is horizontally scalable.
• HBase isn’t a relational data store.
• HBase applications are written in Java.
• HBase also supports writing applications through Apache Avro, REST and
Thrift.
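To give a feel for what "column-oriented" and "non-relational" mean here, the following is a small Python model of an HBase-style table. It is not the HBase client API. `ToyTable` and its methods are hypothetical names, and the model relies on background not stated on the slide: cells are addressed as `family:qualifier`, column families are fixed per table, and rows are kept in row-key order.

```python
from collections import defaultdict


class ToyTable:
    """Rows keyed by row key; each row maps 'family:qualifier' to a value."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; qualifiers are free-form, families are not."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key, family=None):
        """Read one row, optionally restricted to a single column family."""
        cells = self.rows.get(row_key, {})
        if family is None:
            return dict(cells)
        return {c: v for c, v in cells.items() if c.split(":", 1)[0] == family}

    def scan(self, start, stop):
        """Yield (row_key, cells) for start <= row_key < stop in key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])


table = ToyTable(["info", "stats"])
table.put("user1", "info:name", "Ann")
table.put("user1", "stats:visits", 3)
table.put("user2", "info:name", "Bob")
print(table.get("user1", family="info"))
print(list(table.scan("user1", "user2")))
```

Different rows can carry different qualifiers, which is why there is no fixed relational schema to declare beyond the column families.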
HBase Features
• Linear and modular scalability.
• Integrates with Hadoop both as a source and a destination.
• Automatic and configurable sharding of tables.
• Automatic failover support.
• Provides data replication across clusters.
• Easy-to-use Java API for client access.
HBase architecture or data flow
HMaster: HMaster is a lightweight process that assigns regions to region servers
in the Hadoop cluster for load balancing.
Region Server: These are the worker nodes which handle read, write, update, and
delete requests from clients. A Region Server process runs on every node in the
Hadoop cluster, on the HDFS DataNode.
ZooKeeper: The ZooKeeper service keeps track of all the region servers in the
HBase cluster.
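The division of labor above can be pictured with a short Python sketch. It assumes, as background not stated on the slide, that each region holds a contiguous range of row keys. The split points, the server names, and the round-robin assignment are made up for this example; the real HMaster balancer is more involved.

```python
from bisect import bisect_right

# Hypothetical split points: each region covers [start key, next start key).
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2"]  # hypothetical region server names


def assign_regions(start_keys, servers):
    """Toy HMaster: spread regions across region servers round-robin."""
    return {start: servers[i % len(servers)] for i, start in enumerate(start_keys)}


def locate_region(row_key, start_keys):
    """Find the start key of the region whose range contains row_key."""
    return start_keys[bisect_right(start_keys, row_key) - 1]


def route(row_key, start_keys, assignment):
    """Return the region server that should handle a request for row_key."""
    return assignment[locate_region(row_key, start_keys)]


assignment = assign_regions(REGION_START_KEYS, REGION_SERVERS)
for key in ["apple", "melon", "zebra"]:
    print(key, "->", route(key, REGION_START_KEYS, assignment))
```

In this model the master only decides who owns which region; the read or write itself goes to the region server that owns the key's range.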
HBase Pros and Cons:
• Handles large datasets.
• Fast processing.
• Failover support and load sharing.
• Easy-to-use Java API for client access.
• Consistent, schema-less and scalable.
Zookeeper: Introduction

● ZooKeeper is an open source Apache™ project that provides a centralized
infrastructure and services that enable synchronization.
● ZooKeeper provides an infrastructure for cross-node synchronization by
maintaining status type information in memory on ZooKeeper servers.
Features of ZooKeeper:
● Fast Processing: ZooKeeper is especially fast in "read-dominant" workloads (i.e.
workloads in which reads are much more common than writes).
● Reliable System: The system is very reliable, as it keeps working even if a node
fails.
● Scalable: The performance of ZooKeeper can be improved by adding nodes.
● Simple Architecture: The architecture of ZooKeeper is quite simple, as there is a
shared hierarchical namespace which helps in coordinating the processes.

● One leader ZooKeeper server synchronizes a set of follower ZooKeeper servers
to be accessed by clients.
● Clients access ZooKeeper servers to retrieve and update synchronization
information of the entire cluster.
● Clients only connect to one server at a time.
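The shared hierarchical namespace mentioned under Simple Architecture can be pictured as a tree of slash-separated paths, much like a file system. The sketch below is a toy in-memory model in Python, not the ZooKeeper client API; `ToyNamespace`, its method names, and the example paths are hypothetical.

```python
class ToyNamespace:
    """In-memory tree of path -> data, loosely modeled on a hierarchical namespace."""

    def __init__(self):
        self._nodes = {"/": b""}  # the root always exists

    @staticmethod
    def _parent(path):
        """Return the parent path, e.g. '/app/workers' -> '/app', '/app' -> '/'."""
        parent = path.rsplit("/", 1)[0]
        return parent or "/"

    def create(self, path, data=b""):
        """Add a node; its parent must already exist."""
        if path in self._nodes:
            raise KeyError(f"node already exists: {path}")
        if self._parent(path) not in self._nodes:
            raise KeyError(f"parent does not exist: {self._parent(path)}")
        self._nodes[path] = data

    def get(self, path):
        """Return the data stored at a node."""
        return self._nodes[path]

    def children(self, path):
        """Return the names of the direct children of a node, sorted."""
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self._nodes
            if p != "/" and self._parent(p) == path
        )


ns = ToyNamespace()
ns.create("/app")
ns.create("/app/workers")
ns.create("/app/workers/w1", b"alive")
ns.create("/app/workers/w2", b"alive")
print(ns.children("/app/workers"))
print(ns.get("/app/workers/w1"))
```

Processes that agree on a path convention can coordinate simply by creating, reading and listing nodes under a common parent.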


Components of ZooKeeper:
● Client: A node in our distributed application cluster that accesses information from
the server. It interacts with the server to confirm that the connection is established.
● Server: A node in our ZooKeeper ensemble that provides all the services to clients. It
sends acknowledgements to the client to signal that the server is alive.
● Ensemble: A group of ZooKeeper servers. The minimum number of nodes required
to form an ensemble is 3.
● Leader: The server node which performs automatic recovery if any of the connected
nodes fails. Leaders are elected on service startup.
● Follower: A server node which follows the leader's instructions.
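One way to see why 3 is a sensible minimum for an ensemble is the majority rule: a ZooKeeper ensemble keeps working only while more than half of its servers are up. The arithmetic can be sketched in Python as follows; the function names are hypothetical and the sketch models only the counting, not leader election itself.

```python
def majority(ensemble_size):
    """Smallest number of servers that is more than half of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still available."""
    return ensemble_size - majority(ensemble_size)


def has_quorum(alive, ensemble_size):
    """True if the servers that are still alive form a majority."""
    return alive >= majority(ensemble_size)


for size in (1, 2, 3, 4, 5):
    print(size, majority(size), tolerated_failures(size))
```

With 1 or 2 servers no failure can be tolerated, while 3 servers survive one failure. Going from 3 to 4 adds no extra tolerance, which is why odd ensemble sizes are the usual choice.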
Thank You!!!

►QUESTIONS??
