Pig Hive HBase
• Pig uses both the Hadoop Distributed File System (HDFS) to read and write files,
and MapReduce to execute jobs
EXECUTION TYPES
• Local Mode:
• Pig runs in a single JVM and accesses the local filesystem. This mode is
suitable only for small datasets and when trying out Pig.
• MapReduce Mode:
• Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. It is what you use when you want to run Pig on large
datasets.
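Assuming a local Pig installation, the execution mode is selected with the `-x` flag when starting Pig (the script name here is illustrative):

```
pig -x local myscript.pig       # local mode: single JVM, local filesystem
pig -x mapreduce myscript.pig   # MapReduce mode: runs on a Hadoop cluster (the default)
```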
RUNNING PIG PROGRAMS
There are three ways of executing Pig programs:
• Script -> Pig can run a script file that contains Pig commands
• Grunt -> an interactive shell for entering and running Pig commands
• Embedded -> users can run Pig programs from Java using the PigServer class
PIG LATIN DATA FLOW
• A LOAD statement to read data from the file system
• A series of transformation statements to process the data
• A DUMP or STORE statement to view the results or store them
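As a sketch, a complete Pig Latin data flow might look like the following (the file path, field names, and sentinel values are illustrative):

```
-- LOAD: read tab-separated records (hypothetical path and schema)
records  = LOAD 'input/temps.txt' AS (year:chararray, temp:int, quality:int);
-- transformations: keep valid readings, then find the maximum per year
filtered = FILTER records BY temp != 9999 AND quality == 0;
grouped  = GROUP filtered BY year;
max_temp = FOREACH grouped GENERATE group, MAX(filtered.temp);
-- DUMP: print the result to the console (STORE would write it back to HDFS)
DUMP max_temp;
```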
HIVE
• Developed by Facebook
• The Hive shell is the primary way that we will interact with Hive
HIVE DATA MODEL
• Tables: each table's data is stored in a directory in HDFS
• Primitives: numeric, boolean, string and timestamps
• Complex: Arrays, maps and structs
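A hypothetical HiveQL table definition showing both primitive and complex column types (all names are invented for illustration):

```
CREATE TABLE employees (
  name       STRING,                               -- primitive: string
  salary     DOUBLE,                               -- primitive: numeric
  active     BOOLEAN,                              -- primitive: boolean
  hired      TIMESTAMP,                            -- primitive: timestamp
  skills     ARRAY<STRING>,                        -- complex: array
  deductions MAP<STRING, DOUBLE>,                  -- complex: map
  address    STRUCT<street:STRING, city:STRING>    -- complex: struct
);
```

Hive stores this table's data files under its warehouse directory in HDFS.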
• Driver: handles sessions, and provides execute and fetch APIs modeled on JDBC/ODBC interfaces
• Metastore: Stores all the structure information of the various tables and partitions in the
warehouse
• Execution Engine: Manages dependencies between these different stages of the plan and
executes these stages on the appropriate system components
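The division of labor above can be observed with Hive's `EXPLAIN` statement, which shows the stage plan the driver compiles a query into before the execution engine runs the stages (table and column names are hypothetical):

```
EXPLAIN
SELECT dept, AVG(salary)
FROM employees
GROUP BY dept;
-- the output describes the plan as a set of dependent stages,
-- e.g. a map-reduce stage followed by a fetch stage
```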
HIVE VS PIG
PIG:
• Procedural data-flow language
• Operates on the client side of a cluster
• An easy way to process large-scale data
HIVE:
• Declarative, SQL-like language (HiveQL)
• Operates on the server side of a cluster
• Not designed for online transaction processing (OLTP)
HBASE VS HDFS
HBase:
• A database built on top of HDFS
• Provides fast lookups for larger tables
• Provides low-latency access to single rows from billions of records (random access)
• Internally uses hash tables, and stores the data in indexed HDFS files for faster lookups
HDFS:
• Suitable for storing large files
• Does not support fast individual record lookups
• Provides high-latency batch processing
• Provides only sequential access to data
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row key. The table schema
defines only column families; within a family, data is stored as key-value pairs. A table can have
multiple column families, and each column family can have any number of columns. Column values
within a family are stored contiguously on disk, and each cell value carries a timestamp. In short,
in HBase a table is a collection of rows, a row is a collection of column families, a column family
is a collection of columns, and a column is a collection of key-value pairs.
When deciding whether to use HBase, make sure you have enough data: if you have hundreds of
millions or billions of rows, then HBase is a good candidate.
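This model can be tried interactively in the HBase shell; a minimal sketch, assuming a running HBase instance, with the table and column names invented for illustration:

```
create 'users', 'info'                    # table with one column family
put 'users', 'row1', 'info:name', 'Ada'   # cell = row key + family:qualifier + value
put 'users', 'row1', 'info:name', 'Grace' # a new version; each cell value is timestamped
get 'users', 'row1'                       # random access to a single row by key
scan 'users'                              # rows come back sorted by row key
```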
What is Zookeeper?
• ZooKeeper is an open source Apache™ project that provides a centralized infrastructure
and services that enable synchronization across a cluster. ZooKeeper maintains common
objects needed in large cluster environments. Examples of these objects include
configuration information, hierarchical naming space, and so on. Applications can
leverage these services to coordinate distributed processing across large clusters.
• Cluster management − joining/leaving of a node in a cluster, and node status in real time
• Locking and synchronization service − locking the data while modifying it; this mechanism
helps with automatic failure recovery and is used by other distributed applications such as
Apache HBase
• Highly reliable data registry − Availability of data even when one or a few nodes are down
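These services are exposed through a small, filesystem-like namespace of znodes. A sketch using the `zkCli.sh` command-line client that ships with ZooKeeper (the znode path and data are illustrative; the `-w` watch flag is available in recent ZooKeeper releases):

```
create /app "config-v1"        # create a znode holding shared configuration data
get -w /app                    # read the data and set a watch on the znode
set /app "config-v2"           # watchers are notified that the data changed
ls /                           # list the children of the root znode
```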
How Zookeeper Works
• Zookeeper gives you a set of tools to build distributed applications that can safely handle partial failures
• Zookeeper is simple: at its core it is a stripped-down filesystem that exposes a few simple operations
• Zookeeper is expressive: its primitives are building blocks for a large class of coordination data structures and protocols
• Zookeeper is a library: it provides an open, shared repository of implementations and recipes of common coordination patterns