BDA Unit-4 Part-2: HBase, Hive, Pig

HBase in Big Data Processing

HBase is an open-source, non-relational, distributed, column-oriented database
developed as part of the Apache Software Foundation.
Characteristics of HBase
HBase stores data in a compressed format and thus occupies less memory space. It
saves the data in cells in descending order of timestamp, so a read will always
retrieve the most recent values first.
HBase is suitable in conditions where data changes gradually and rapidly.
HBase’s architecture is designed for scalable, distributed, and fault-tolerant
storage, modeled after Google’s Bigtable. It runs on top of HDFS (Hadoop
Distributed File System), allowing it to handle massive amounts of structured data
with random, real-time read/write access.
HBase Architecture and its Important Components
HBase architecture consists mainly of five components:
 HMaster
 HRegionServer
 HRegions
 ZooKeeper
 HDFS
Below is a detailed architecture of HBase with its components:

HBase Architecture Diagram


HMaster
 HMaster in HBase is the implementation of a Master server in the HBase
architecture. It acts as a monitoring agent for all Region Server instances
present in the cluster and serves as the interface for all metadata changes,
but it does not store the actual data.
In a distributed cluster environment, the Master runs on the NameNode and runs several
background threads.
The following are important roles performed by HMaster in HBase.
 Plays a vital role in terms of performance and maintaining nodes in the
cluster.
 HMaster provides admin functions and distributes services to different
region servers.
 HMaster assigns regions to region servers.
 HMaster controls load balancing and failover to handle the load over the
nodes present in the cluster.
 When a client wants to change a schema or perform metadata operations,
HMaster takes responsibility for these operations.
Some of the methods exposed by HMaster Interface are primarily Metadata oriented
methods.
 Table (createTable, removeTable, enable, disable)
 ColumnFamily (addColumn, modifyColumn)
 Region (move, assign)
The client communicates in a bi-directional way with both HMaster and
ZooKeeper. For read and write operations, it contacts HRegion servers directly.
HMaster assigns regions to region servers and, in turn, checks the health
status of the region servers.
The architecture contains multiple region servers. Each region server has an HLog,
which stores all of its log files.
HBase Region Servers
When an HBase Region Server receives write and read requests from the client,
it assigns the request to the specific region where the actual column family resides.
The client can contact HRegion servers directly; no mandatory permission from
HMaster is needed for this communication. The client requires HMaster's help only
when metadata operations or schema changes are required.
HRegionServer is the Region Server implementation. It is responsible for serving
and managing regions or data that is present in a distributed cluster. The
region servers run on Data Nodes present in the Hadoop cluster.
HMaster can get in contact with multiple HRegion servers, which perform the
following functions:
 Hosting and managing regions
 Splitting regions automatically
 Handling read and write requests
 Communicating with the client directly
 Rebalances across Region Servers to distribute load.
 Region Servers store data in HFiles on HDFS and use MemStore for in-
memory data storage to handle recent writes.

HBase Regions
 A Region is a contiguous range of rows within a table, representing the basic
unit of scalability and distribution in HBase.
 Tables are divided into regions by row keys, and these regions are then
distributed across the cluster to different Region Servers.
 When a region grows beyond a certain size, it’s split into two new regions,
which can be moved to other Region Servers to balance the load.
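
As a hedged illustration (the table name here is hypothetical), a split and a rebalancing round can also be requested manually from the HBase shell:

split 'employees'    # ask HBase to split the regions of the table
balancer             # ask HMaster to rebalance regions across the Region Servers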
ZooKeeper
ZooKeeper is a centralized service that maintains configuration information and
provides distributed synchronization, i.e. coordination services between the nodes of
the distributed applications running across the cluster. If a client wants to
communicate with regions, it has to approach ZooKeeper first.
It is an open-source project, and it provides many important services.
Services provided by ZooKeeper
 Maintains Configuration information
 Provides distributed synchronization
 Client Communication establishment with region servers
 Provides ephemeral nodes, which represent the different region servers
 Master servers use these ephemeral nodes to discover available servers in
the cluster
 Tracks server failures and network partitions
The Master and the HBase slave nodes (region servers) register themselves with
ZooKeeper. The client needs access to the ZooKeeper quorum configuration to
connect with the master and region servers.
During a failure of nodes present in the HBase cluster, the ZooKeeper quorum triggers
error messages and the recovery of the failed nodes is started.
HDFS
HDFS is a Hadoop distributed File System, as the name implies it provides a
distributed environment for the storage and it is a file system designed in a way to
run on commodity hardware. It stores each file in multiple blocks and to maintain
fault tolerance, the blocks are replicated across a Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity
hardware. Adding nodes to the cluster and performing processing and storage on this
commodity hardware gives the client better results compared to a single existing
machine.
Here, the data stored in each block is replicated to 3 nodes, so if any node goes down
there is no loss of data and a proper backup recovery mechanism is available.
HDFS works with the HBase components and stores a large amount of data in a
distributed manner.

HBase Data Model


HBase Data Model is a set of components that consists of Tables, Rows, Column
families, Cells, Columns, and Versions. HBase tables contain column families and
rows with elements defined as primary keys. A column in an HBase table
represents an attribute of the stored object.
HBase Data Model consists of following elements,
 Set of tables
 Each table with column families and rows
 Each table must have an element defined as Primary Key.
 Row key acts as a Primary key in HBase.
 Any access to HBase tables uses this Primary Key
 Each column present in HBase denotes an attribute of the corresponding object
Storage Mechanism in HBase
HBase is a column-oriented database and data is stored in tables. The tables are
sorted by row key. As shown below, each row key maps to a collection of the
column families that are present in the table.
The column families present in the schema hold key-value pairs. Looking in detail,
each column family has multiple columns, and the column values are stored on disk.
Each cell of the table has its own metadata, such as a timestamp and other
information.

Storage Mechanism in HBase


In HBase, the following are the key terms representing the table schema:
 Table: Collection of rows present.
 Row: Collection of column families.
 Column Family: Collection of columns.
 Column: Collection of key-value pairs.
 Namespace: Logical grouping of tables.
 Cell: A {row, column, version} tuple exactly specifies a cell definition in
HBase.

The figure above shows the logical architecture of persistent partitions of Bigtable.


Column-oriented vs Row-oriented storages
Column and Row-oriented storages differ in their storage mechanism. As we all
know traditional relational models store data in terms of row-based format like in
terms of rows of data. Column-oriented storages store data tables in terms of
columns and column families.
The following points give some key differences between these two storage models:

Column-oriented database: used for processing and analytics, such as Online
Analytical Processing (OLAP) and its applications. The amount of data that can be
stored in this model is very huge, in the order of petabytes.

Row-oriented database: used for Online Transaction Processing (OLTP), such as in
the banking and finance domains. It is designed for a small number of rows and
columns.

HBase Data Storage Architecture


a. MemStore
 MemStore is an in-memory write cache for each region. Data is first written
to MemStore and then periodically flushed to HFiles on HDFS.
 It’s efficient for write operations, as data is stored in memory before being
written to disk.
b. HFiles
 HFiles are the actual storage files where HBase data is stored on HDFS,
providing persistence for data.
 These files are immutable; when MemStore is full, it flushes data to a new
HFile, ensuring the data is written in sorted order.
 Compaction processes periodically merge HFiles to improve read efficiency
and reduce the number of storage files.
c. Write-Ahead Log (WAL)
 Each Region Server has a Write-Ahead Log (WAL) for durability. Every write
operation is first recorded in the WAL before being added to MemStore.
 WAL helps in data recovery in case of Region Server failure, as it ensures no
data is lost.
 If a Region Server crashes, the WAL can be replayed to restore the most
recent data.
3. HBase Data Model
 Row Key: Uniquely identifies a row within a table. All data for a specific row is
stored together.
 Column Family: Groups related columns. Each table has at least one column
family, and each family contains columns with actual data.
 Column Qualifier: Represents individual columns within a family, allowing
flexibility in storing structured data.
 Timestamp: Every value stored in HBase is versioned with a timestamp,
enabling multiple versions of data within each cell.
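
As an illustrative sketch (the table, column, and row key are hypothetical, and the column family must have been created with VERSIONS greater than 1), several timestamped versions of a cell can be read back from the HBase shell:

get 'employees', '1001', {COLUMN => 'personal:name', VERSIONS => 3}   # return up to the 3 most recent versions of the cell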
4. HBase Workflow
Read Operations
 HBase first checks the MemStore for data; if found, it returns the data directly
from memory.
 If not in MemStore, it looks in the BlockCache (another in-memory cache for
frequently read data).
 If not found in the BlockCache, it searches the HFiles in HDFS.
Write Operations
 When a client writes data, it is first written to the WAL (Write-Ahead Log) and
then stored in MemStore.
 Once MemStore reaches a certain threshold, it flushes data to a new HFile on
HDFS.
 Regular Compactions are performed to merge HFiles, reducing the number
of files and improving read efficiency.
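
For illustration only (the table name here is hypothetical), a MemStore flush and a major compaction can also be triggered manually from the HBase shell:

flush 'employees'            # flush the table's MemStores to new HFiles on HDFS
major_compact 'employees'    # merge the table's HFiles into one file per store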

HBase Commands
HBase, or Hadoop Database, is a distributed and non-relational database
management system that runs on top of the Hadoop Distributed File System (HDFS).
HBase can handle large amounts of data while providing low latency via parallel
processing.

Introduction

The HBase Shell is a command-line interface for interacting with the HBase
database.

When the shell is launched, it establishes a connection with the hbase client
which is responsible for interacting with the HBase Master for various
operations.

The HBase shell translates user-entered commands into appropriate API calls,
allowing users to interact with HBase in a more user-friendly and interactive manner.

There are several types of commands in HBase.

They are,

 General Commands.
 Data Definition Language (DDL) Commands.
 Data Manipulation Language (DML) Commands.
 Security Commands.
 Other HBase Shell Commands.

The HBase Shell is built on top of the HBase Java API and provides a
simplified interface for executing HBase commands without the need for
extensive coding.

General Commands

HBase shell offers a range of general commands to get information regarding the
database. The general HBase commands are,

 Whoami
 Status
 Version
 Table Help
 Cluster Help
 Namespace Help

1. whoami

The whoami HBase command is used to retrieve the current user's information.
Syntax:

whoami

Output:

user: hari

Explanation:

This output indicates that the user executing the command is logged in as hari in the
HBase Shell.

2. Status

The status command provides information about the HBase cluster, including the
cluster's overall health and the number of servers and regions.

Syntax:

status

Output:

1 active master, 3 region servers, 6 regions

Explanation:

 active master:
Number of active master servers in the HBase cluster.
 region servers:
Number of region servers that are currently running.
 regions:
Total number of regions in the cluster.

3. Version

The version command retrieves the version information of HBase. Syntax:

version

Output:

HBase 2.4.0

Explanation:

This output shows that the installed HBase version is 2.4.0.

4. Table Help

The table_help command provides a list of available commands related to table
operations.
Syntax:

table_help

Output:

TABLE COMMANDS:
create 'table_name', {NAME => 'column_family_name'}, ...
Create a table with the specified table_name and column families.

alter 'table_name', {NAME => 'column_family_name'}, ...


Alter the schema of an existing table, adding or modifying column families.

describe 'table_name'
Display the detailed schema of the specified table.

...
5. Cluster Help

The cluster_help command provides information on commands related to cluster
management.
Syntax:

cluster_help

Output:

status
version
...
6. Namespace Help

The namespace_help command provides guidance on commands for managing
namespaces in HBase.

Syntax:

namespace_help

Output:

create_namespace 'namespace_name'
describe_namespace 'namespace_name'
...

Data Definition Language in HBase

The Data Definition Language (DDL) allows users to define and manipulate the
structure of the HBase tables.

Some of the DDL HBase commands are,

 Create
 Describe
 Alter
 Disable
 Enable
 Drop
 Exist

1. Create

The create command is used to create a new table in HBase. Syntax:

create 'table_name', 'column_family1', 'column_family2', ...

Explanation:

A table can have multiple column families. A column family is a group of columns
that share common characteristics.

Example:

create 'employees', 'personal', 'professional'

Output:

0 row(s) in 1.4860 seconds

This command creates a table named employees with two column
families: personal and professional.

In HBase, a namespace is a logical container that provides a way to group related
tables together, similar to how directories group files.

We can create a namespace in HBase using the command,

create_namespace 'your_namespace'

Tables are then created inside the namespace by prefixing the table name with it, for
example,

create 'your_namespace:your_table', 'column_family1'

All tables created this way belong to the your_namespace namespace.

We can also add various configurations to the HBase commands while creating the
column family in a table. Example:

create 'products', {NAME => 'details', VERSIONS => 5, TTL => '86400'}, {NAME =>
'inventory', COMPRESSION => 'GZ'}

Explanation:

 {}:
Used to specify details of a column family.
 NAME:
Specifies the name of the column family.
 VERSIONS:
Sets the maximum number of versions to keep for each cell.
 TTL:
Sets the Time-to-Live (TTL).
 COMPRESSION:
Specifies the compression algorithm to be used.
 GZ means Gzip compression.

The value 86400 represents the number of seconds (1 day) that cells will be kept
before being automatically expired and deleted.

2. Describe

The describe command provides information about the specified table, including
column families and region details.

Syntax:

describe 'table_name'

Example:

describe 'employees'

Output:

Table employees is ENABLED
employees
COLUMN FAMILIES DESCRIPTION
{NAME => 'personal', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY
=> 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =>
'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'professional', BLOOMFILTER => 'ROW', VERSIONS => '1',
IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE',
DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION =>
'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536',
REPLICATION_SCOPE => '0'}
2 row(s) in 0.0310 seconds

Explanation:

This command provides a detailed description of the employees table, including its
enabled status and the configuration of its column families.

3. Alter

The alter command allows users to modify the schema of an existing table.

Syntax:

alter 'table_name', {NAME => 'new_column_family'}

Example:

alter 'employees', {NAME => 'contact'}


Output:

0 row(s) in 1.2040 seconds

Explanation:

This command adds a new column family named contact to the employees table.

4. Disable

The disable command is used to disable a table. Once disabled, the table
becomes read-only, and no further modifications can be made to it.

Syntax:

disable 'table_name'

Example:

disable 'employees'

Output:

0 row(s) in 1.2410 seconds

Explanation:

This command disables the employees table, preventing any further modifications.

5. Enable

The enable command is used to enable a previously disabled table. Once enabled,
the table becomes writable again. Syntax:

enable 'table_name'

Example

enable 'employees'

Output:

0 row(s) in 0.8760 seconds

Explanation:

This command enables the employees table, allowing modifications to be made.

6. Drop

The drop command in HBase Shell is used to delete an existing table from the
HBase database. The table to be deleted should be disabled first using
the disable command.

Syntax:
drop 'table_name'

Example:

drop 'employees'

Output:

0 row(s) in 1.2300 seconds

Explanation:

This command deletes the employees table from the database.

7. Exist

The exists command in HBase is used to check whether a table exists in the
database.

Syntax:

exists 'table_name'

Example:

exists 'employees'

Output:

Table employees does not exist

Explanation:

Since we have deleted the table, the output shows that the table doesn't exist.

Data Manipulation Language in HBase

HBase Shell provides a Data Manipulation Language (DML) that enables users to
insert, update, and delete data within HBase tables. Some of the DML HBase
commands are,

 Put
 Get
 Scan
 Delete
 Truncate

1. Put

The put command is used to insert or update data in a table. Syntax:

put 'table_name', 'row_key', 'column_family:column_qualifier', 'value'

Explanation:
 table_name:
Name of the HBase table.
 row key:
Unique identifier for a specific row in the HBase table.
 Each row in an HBase table is indexed and accessed using its row key.
 column family:
Represents the group to which a specific column belongs.
 column qualifier:
Specifies the specific column within the column family.
 value:
Data that will be stored

Example:

put 'employees', '1001', 'personal:name', 'Sample user'

Output:

0 row(s) in 0.4560 seconds

Explanation:

This command inserts the value Sample user into the personal:name column of the
row with key 1001 in the employees table.

2. Get

The get command is used to retrieve data from a table based on the row
key. Syntax:

get 'table_name', 'row_key'

The command requires specifying the table name and row key. Example:

get 'employees', '1001'

Output:

COLUMN           CELL
personal:name    timestamp=1628272254000, value=Sample user
1 row(s) in 0.0390 seconds

Explanation:

This command retrieves the data associated with the row key 1001 in
the employees table, displaying the column name, timestamp, and cell value.

Command Options

We can also use different options with the GET command to get specific information
in the table. The options are,

 FILTER:
accepts a filter string that defines the filter conditions. You can use various
filter types like SingleColumnValueFilter, PrefixFilter, ColumnPrefixFilter, etc.
Example:

get 'employees', 'row1', {FILTER => "SingleColumnValueFilter('personal', 'age', >=, 'binary:25')"}

Explanation:

The above command retrieves the cell values from the row with the key row1 in
the employees table. The filter condition SingleColumnValueFilter is applied to fetch
only those cells where the age column in the personal column family is greater than
or equal to 25.

 COLUMN:
Specifies the name of the column or column family you want to retrieve.
 TIMESTAMP:
Takes a single timestamp value and fetches the cell version written at that
specific timestamp.
 TIMERANGE:
Takes a start timestamp and an end timestamp (in milliseconds) and retrieves
cell values within that time range. For instance, TIMERANGE =>
[1628272254000, 1628272296250].
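
A few illustrative sketches of these options, reusing the employees table and row key 1001 from the examples above (the timestamps are only sample values):

get 'employees', '1001', {COLUMN => 'personal:name'}                                 # only the personal:name column
get 'employees', '1001', {COLUMN => 'personal:name', TIMESTAMP => 1628272254000}     # the version written at that timestamp
get 'employees', '1001', {TIMERANGE => [1628272254000, 1628272296250]}               # versions within the time range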

3. Scan

The scan command is used to retrieve multiple rows or a range of rows from a
table. Syntax:

scan 'table_name'

Example:

scan 'employees'

Output:

ROW COLUMN+CELL
1001 column=personal:name, timestamp=1628272254000,
value=Sample user
1002 column=personal:name, timestamp=1628272296250,
value=New User
2 row(s) in 0.0560 seconds

Explanation:

This command retrieves all the rows and associated column families and cells from
the employees table.
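
The scan command also accepts optional parameters; a hedged sketch reusing the employees table (the values are illustrative only):

scan 'employees', {COLUMNS => ['personal:name'], STARTROW => '1001', LIMIT => 2}   # scan only personal:name, start at row 1001, return at most 2 rows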

4. Delete

The delete command is used to delete data from a table based on the row key,
column family, and column qualifier. Syntax:

delete 'table_name', 'row_key', 'column_family:column_qualifier'


Explanation:

If we want to delete a specific cell, the column family and column qualifier must also
be specified. Example:

delete 'employees', '1002', 'personal:name'

Output:

0 row(s) in 0.4980 seconds

Explanation: This command deletes the data in the personal:name column of the
row with key 1002 in the employees table.

5. Truncate

The truncate command in HBase Shell is used to delete all data from a specific
table while retaining the table structure. Syntax:

truncate 'table_name'

Example:

truncate 'employees'

Output:

Truncated table employees

Explanation:

Through the above command, all the data from the employees table are removed.

Security Commands

Security commands in HBase Shell are used to manage user permissions and
access control, and to ensure the security of the HBase database. Some of the security
HBase commands are,

 Create
 Grant
 Revoke
 User Permissions
 Disable Security

1. Create

The create command is used to create a new user with a specified username and
password. Syntax:

create 'user', 'password'

Example:
create 'bob', 'hbasecommand'

Output:

0 row(s) in 1.5680 seconds

Explanation:

This command creates a new user named bob with the password hbasecommand.

2. Grant

The grant command is used to grant permissions to a user on a specific table.
Syntax:

grant 'user', 'permissions', 'table'

Explanation:

The permissions can be any of read, write, execute, create, and admin. The
permission is applied for a particular table. Example:

grant 'bob', 'read', 'employees'

Output:

0 row(s) in 0.7150 seconds

Explanation:

This command grants the read permission to the user bob on the employees table.

The execute permission allows the user to execute coprocessor functions on the
specified table. Coprocessors are custom code modules that run on HBase Region
Servers and can perform operations on the server side. The admin permission
provides full administrative privileges over the specified table.

3. Revoke

The revoke command is used to revoke permissions from a user on a specific table.
Syntax:

revoke 'user', 'permissions', 'table'

Example:

revoke 'bob', 'read', 'employees'

Output:

0 row(s) in 0.7890 seconds

Explanation:
This command revokes the read permission from the user bob on
the employees table.

4. User Permissions

The user_permission command is used to list the permissions assigned to a specific
user.
Syntax:

user_permission 'user_name'

Example:

user_permission 'bob'

Output:

User   Table, Family, Qualifier, Permission
bob    employees:, [read]
1 row(s) in 0.0710 seconds

Explanation:

The output displays the permissions assigned to the user bob.

5. Disable Security

The disable_security command is used to disable security in HBase, which removes
all user and permission-related restrictions.

Syntax:

disable_security

Example:

disable_security

Output:

0 row(s) in 1.2230 seconds

Explanation:

The output indicates that the security in HBase has been disabled.

Starting HBase Shell

Starting the HBase Shell provides a command-line interface to interact with the
HBase database through HBase commands.

The following steps should be followed to start the HBase Shell,

1. Navigate to the directory where HBase is installed through the terminal.
2. The HBase shell can be started using the following command:

bin/hbase shell

Once the HBase Shell is launched, you will see a command prompt that indicates
you are in the HBase Shell environment. From here, you can start entering various
HBase commands to interact with the HBase database.

Apache Hive
 Hive is a data warehouse system which is used to analyze structured
data.
 It is built on the top of Hadoop.
 It was developed by Facebook.
 Hive provides the functionality of reading, writing, and managing large
datasets residing in distributed storage.
 It runs SQL like queries called HQL (Hive query language) which gets
internally converted to MapReduce jobs.
 Using Hive, we can skip the requirement of the traditional approach of
writing complex MapReduce programs.
 Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).

Features of Hive

These are the following features of Hive:


 Hive is fast and scalable.
 It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
 It is capable of analyzing large datasets stored in HDFS.
 It allows different storage types such as plain text, RCFile, and HBase.
 It uses indexing to accelerate queries.
 It can operate on compressed data stored in the Hadoop ecosystem.
 It supports user-defined functions (UDFs) through which users can plug in
their own functionality.

Limitations of Hive

 Hive is not capable of handling real-time data.
 It is not designed for online transaction processing.
 Hive queries have high latency.

Hive vs. Pig

The following table compares Hive with Pig:

1. Language: Hive uses a declarative language called HiveQL; Pig uses Pig Latin, a
procedural data-flow language.
2. Schema: Hive supports schema; in Pig, creating a schema is not required to store
data.
3. Data Processing: Hive is used for batch processing; Pig is a high-level data-flow
language.
4. Partitions: Hive supports partitions; Pig does not support partitions, although
there is an option for filtering.
5. Web interface: Hive has a web interface; Pig does not support a web interface.
6. User Specification: Data analysts are the primary users of Hive; programmers
and researchers use Pig.
7. Used for: Hive is used for reporting; Pig is used for programming.
8. Type of data: Hive works on structured data and does not work on other types of
data; Pig works on structured, semi-structured and unstructured data.
9. Operates on: Hive works on the server side of the cluster; Pig works on the client
side of the cluster.
10. Avro File Format: Hive does not support Avro; Pig supports Avro.
11. Loading Speed: Hive takes time to load but executes quickly; Pig loads data
quickly.
12. JDBC/ODBC: Supported in Hive, but limited; unsupported in Pig.

Differences between Hive and Pig

Hive is commonly used by data analysts; Pig is commonly used by programmers.

Hive follows SQL-like queries; Pig follows a data-flow language.

Hive can handle structured data; Pig can handle semi-structured data.

Hive works on the server side of the HDFS cluster; Pig works on the client side of the
HDFS cluster.

Hive is slower than Pig; Pig is comparatively faster than Hive.

Hive Architecture

Hive Client:

Hive drivers support applications written in any language like Python, Java,
C++, and Ruby, among others, using JDBC, ODBC, and Thrift drivers, to
perform queries on the Hive. Therefore, one may design a hive client in any
language of their choice.

The three types of Hive clients are:


1. Thrift Clients: The Hive server can handle requests from a thrift client
by using Apache Thrift.
2. JDBC client: A JDBC driver connects to Hive using the Thrift
framework. Hive Server communicates with the Java applications using
the JDBC driver.
3. ODBC client: The Hive ODBC driver is similar to the JDBC driver in
that it uses Thrift to connect to Hive. However, applications connect to it
through the ODBC protocol instead of JDBC.

Hive Services:

Hive provides numerous services, including the Hive server2, Beeline, etc.
The services offered by Hive are:

1. Beeline: HiveServer2 supports Beeline, a command shell to which the
user can submit commands and queries. It is a JDBC client that
utilises the SQLLine CLI (a pure-Java console utility for connecting to
relational databases and executing SQL queries). Beeline is based
on JDBC.
2. Hive Server 2: HiveServer2 is the successor to HiveServer1. It provides
clients with the ability to execute queries against the Hive. Multiple
clients may submit queries to Hive and obtain the desired results. Open
API clients such as JDBC and ODBC are supported by HiveServer2.
2. Hive Server 2: HiveServer2 is the successor to HiveServer1. It provides
clients with the ability to execute queries against the Hive. Multiple
clients may submit queries to Hive and obtain the desired results. Open
API clients such as JDBC and ODBC are supported by HiveServer2.

Note: HiveServer1, which is also known as a Thrift server, is used to
communicate with Hive across platforms. Different client applications can
submit requests to Hive and receive the results using this server.

HiveServer1 could not handle concurrent requests from more than one client, so it
was replaced by HiveServer2.

Hive Driver: The Hive driver receives the HiveQL statements submitted by
the user through the command shell and creates session handles for the
query.

Hive Compiler: The compiler performs semantic analysis and type checking
on the different query blocks and query expressions using the metadata stored
in the metastore, and generates an execution plan from the parse results.

The execution plan created by the compiler is a DAG (Directed Acyclic Graph),
where each step is a map/reduce job on HDFS, an operation on file metadata,
or a data manipulation step.
Optimizer: The optimizer performs transformation operations on the execution plan
and splits the tasks so that efficiency and scalability are improved.

Execution Engine: After the compilation and optimization steps, the execution
engine uses Hadoop to execute the prepared execution plan produced by the
compiler, in the order of its dependencies.

Metastore: The metastore stores metadata information about tables and
partitions, including column and column type information, which is used
during query planning and execution.

The metastore also stores information about the serializer and deserializer as
well as HDFS files where data is stored and provides data storage. It is
usually a relational database. Hive metadata can be queried and modified
through Metastore.

We can configure the metastore in either of two modes, as sketched below:

1. Remote: In remote mode, the metastore runs in its own JVM as a Thrift
service, so other processes (including non-Java applications) can access it
through Thrift.
2. Embedded: In embedded mode, the client directly accesses the metastore
via JDBC in the same JVM.
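
As a non-authoritative sketch, the mode is usually chosen through properties in hive-site.xml; the host name and JDBC URL below are placeholder values:

<!-- Remote mode: clients talk to a standalone metastore Thrift service -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>

<!-- Embedded mode: the client opens the backing database directly via JDBC -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>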

HCatalog: HCatalog is the table and storage management layer of Hadoop that
provides users of different data processing tools, such as Pig and MapReduce,
with simple access to read and write data on the grid.

It is built on top of the Hive metastore and exposes the tabular data of the Hive
metastore to the other data processing tools.

WebHCat: WebHCat is the REST API for HCatalog; it provides an HTTP interface
to perform Hive metadata operations and is a service that lets users run Hadoop
MapReduce (or YARN), Pig, and Hive jobs.

Hive Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client

Hive allows writing applications in various languages, including Java, Python,
and C++.

It supports different types of clients such as:-

o Thrift Server - It is a cross-language service provider platform that serves
requests from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between hive and
Java applications. The JDBC Driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol
to connect to Hive.

Hive Services
The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive
CLI. It provides a web-based GUI for executing Hive queries and
commands.
o Hive MetaStore - It is a central repository that stores all the structure
information of the various tables and partitions in the warehouse. It also
includes metadata of columns and their type information, the serializers
and deserializers used to read and write data, and the corresponding
HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the
request from different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI,
Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and
perform semantic analysis on the different query blocks and
expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form
of DAG of map-reduce tasks and HDFS tasks. In the end, the execution
engine executes the incoming tasks in the order of their dependencies.

HIVE Data Types


Hive data types are categorized in numeric types, string types, misc types,
and complex types. A list of Hive data types is given below.

Integer Types

Type       Size                     Range
TINYINT    1-byte signed integer    -128 to 127
SMALLINT   2-byte signed integer    -32,768 to 32,767
INT        4-byte signed integer    -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer    -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Type

Type      Size     Description
FLOAT     4-byte   Single-precision floating point number
DOUBLE    8-byte   Double-precision floating point number

Date/Time Types

TIMESTAMP

o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in
seconds with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)
DATES

The Date value is used to specify a particular year, month and day, in the form
YYYY-MM-DD. However, it does not provide the time of day. The range of the
Date type lies between 0000-01-01 and 9999-12-31.
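
A small illustrative HiveQL sketch of these types (the literal values are arbitrary):

hive> select current_date, current_timestamp;
hive> select cast('2021-04-18' as date), cast('2021-04-18 10:30:00' as timestamp);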

String Types

STRING

A string is a sequence of characters. Its values can be enclosed within single
quotes (') or double quotes (").

Varchar

The varchar is a variable-length type whose length lies between 1 and 65535,
which specifies the maximum number of characters allowed in the character
string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.

Complex Type

Type     Description                                                      Example
Struct   Similar to a C struct or an object; fields are accessed using    struct('James','Roy')
         the "dot" notation.
Map      Contains key-value tuples; the fields are accessed using         map('first','James','last','Roy')
         array notation.
Array    A collection of values of the same type, indexable using         array('James','Roy')
         zero-based integers.
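
A hedged example of declaring and querying these complex types (the table name, columns, and delimiters are illustrative only):

hive> create table emp_complex (
    >   name string,
    >   address struct<city:string, pin:string>,
    >   phones map<string,string>,
    >   skills array<string>)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '#'
    > map keys terminated by ':';

hive> -- dot notation for struct, array notation for map and array
hive> select name, address.city, phones['home'], skills[0] from emp_complex;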

Hive - Create Database


In Hive, the database is considered as a catalog or namespace of tables. So,
we can maintain multiple tables within a database where a unique name is
assigned to each table.

Hive also provides a default database with a name default.

Initially, we check the default database provided by Hive.

So, to check the list of existing databases, follow the below command: -
hive> show databases;

Here, we can see the existence of a default database provided by Hive.

Let's create a new database by using the following command: -

hive> create database demo;

If we want to suppress the error generated by Hive when creating a database
with a name that already exists, use the below command: -

hive> create database if not exists demo;


Hive also allows assigning properties with the database in the form of key-
value pair.
hive>create database demo
>WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-
06-03');

Now, drop the database by using the following command.


hive> drop database demo;

Hive - Create Table


In Hive, we can create a table by using conventions similar to SQL. It
provides a wide range of flexibility in where the data files for tables are stored.

It provides two types of table: -

o Internal table
o External table

Internal Table

The internal tables are also called managed tables as the lifecycle of their
data is controlled by the Hive.

By default, these tables are stored in a subdirectory under the directory
defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse).

The internal tables are not flexible enough to share with other tools like Pig.

If we try to drop the internal table, Hive deletes both table schema and
data.

Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited fields terminated by ',' ;

We can see the metadata of the created table by using the following
command: -
hive> describe demo.employee;

External Table

The external table allows us to create and access a table and a data
externally. The external keyword is used to specify the external table,
whereas the location keyword is used to determine the location of loaded
data.

As the table is external, the data is not present in the Hive directory.
Therefore, if we try to drop the table, the metadata of the table will be deleted,
but the data still exists.

Hive - Load Data


Once the internal table has been created, the next step is to load the data into
it.

So, in Hive, we can easily load data from any file to the database.

Let's load the data of the file into the database by using the following
command: -

hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Here, emp_details is the file name that contains the data.

Now, we can use the following command to retrieve the data from the
database.
hive>select * from demo.employee;

Hive - Drop Table


Hive facilitates us to drop a table by using the SQL drop table command.
Let's follow the below steps to drop the table from the database.

Now, drop the table by using the following command: -

hive> drop table new_employee;

Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the
values of a particular column like date, course, city or country.

The advantage of partitioning is that since the data is stored in slices, the
query response time becomes faster.

As we know, Hadoop is used to handle huge amounts of data, so it is always
required to use the best approach to deal with it.

Let's assume we have data on 10 million students studying in an institute.
Now, we have to fetch the students of a particular course. If we use the
traditional approach, we have to go through the entire data set, which leads to
performance degradation. In such a case, we can adopt the better approach,
i.e. partitioning in Hive, and divide the data among different datasets based
on particular columns.

The partitioning in Hive can be executed in two ways -


 Static partitioning
 Dynamic partitioning

Static Partitioning
In static or manual partitioning, it is required to pass the values of partitioned
columns manually while loading the data into the table. Hence, the data file
doesn't contain the partitioned columns.

Example of Static Partitioning

First, select the database in which we want to create a table.


hive> use test;

Create the table and provide the partitioned columns by using the following
command:
hive> create table student (id int, name string, age int, institute string)
>partitioned by (course string)
>row format delimited
>fields terminated by ',';

Load the data into the table and pass the values of partition columns
with it by using the following command: -
hive> load data local inpath '/home/codegyani/hive/student_details1' into
table student partition(course= "java");

Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the
table. So, it is not required to pass the values of partitioned columns manually.

Bucketing in Hive

The bucketing in Hive is a data organizing technique.

It is similar to partitioning in Hive, with the added functionality that it divides
large datasets into more manageable parts known as buckets.

So, we can use bucketing in Hive when the implementation of partitioning
becomes difficult. However, we can also divide partitions further into buckets.

Working of Bucketing in Hive


 The concept of bucketing is based on the hashing technique.
 Here, the modulo of the (hashed) column value and the number of required
buckets is calculated (say, F(x) % 3). See the sketch below.
 Based on the resulting value, the data is stored in the corresponding bucket.
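
A minimal HiveQL sketch of bucketing, assuming a hypothetical source table stud_data with the same columns (both table names are illustrative):

hive> set hive.enforce.bucketing = true;
hive> create table student_bucket (id int, name string, age int, institute string)
    > clustered by (id) into 3 buckets
    > row format delimited
    > fields terminated by ',';
hive> -- rows are hashed on id and written into 3 bucket files
hive> insert into table student_bucket select id, name, age, institute from stud_data;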

Hive commands to practice:


a. Creating a database, table
b. Dropping a database, table
c. Describe command, alter, insert, select
d. Group by with having, Order by
Hive Commands with Syntax and
Examples
Create Database
 This command will create a database.
hive> create database <database-name>;
For Ex - create database demo;

 This command will show all databases that are present.


hive> show databases;

 This command will only create a database if it is not


present.
hive> create database if not exists <database-name>;
For ex - create database if not exists demo;
 Assigning properties with the database in the form of
key-value pair.
hive> create database demo
> WITH DBPROPERTIES ('creator'='Varchasa Aggarwal',
'date'='18-04-2021');

 Let’s retrieve the information associated with the


database.
hive> describe database extended demo;

Drop Database
Delete a defined database.
 This command will delete a database.
hive> drop database demo;

 To check that database is deleted or not.


hive> show databases;

 Drop database if and only if it exists.


hive> drop database if exists demo;

 In Hive, it is not allowed to drop the database that


contains the tables directly. In such a case, we can drop
the database either by dropping tables first or use
Cascade keyword with the command.
 Let’s see the cascade command used to drop the
database:-
hive> drop database if exists demo cascade;

This command automatically drops the tables present in the database first.
Hive — Create Table
In hive, we have two types of table —
 Internal table
 External table
Internal Table
 The internal tables are also called managed tables as
the lifecycle of their data is controlled by the Hive.
 By default, these tables are stored in a subdirectory
under the directory defined by
hive.metastore.warehouse.dir (i.e.
/user/hive/warehouse).
hive> create table demo.employee(Id int, Name string, Salary
float)
> row format delimited
> fields terminated by ',';

Here, the command also includes the information that the


data is separated by ‘,’.
 Let’s see the metadata of the created table.
hive> describe demo.employee

 Let’s create a table if it not exists.


hive> create table if not exists demo.employee(Id int, Name
string, Salary float)
> row format delimited
> fields terminated by ','

 While creating a table, we can add the comments to the


columns and can also define the table properties.
hive> create table demo.new_employee(Id int comment 'Employee
Id', Name string comment 'Employee Name', Salary float comment
'Employee Salary') comment 'Table Description' TBLProperties
('creator'='Varchasa Aggarwal','created at'='18-04-2021');

 Let’s see the metadata of the created table.


hive> describe new_employee;

 Hive allows creating a new table by using the schema of


an existing table.
Schema is the skeleton structure that represents the
logical view of the entire database. It defines how the data
is organized and how the relations among them are
associated.
hive> create table if not exists demo.copy_employee like
demo.employee;

Here, we can say that the new table is a copy of an existing


table.
Hive — Load Data
Once the internal table has been created, the next step is
to load the data into it.
 Let’s load the data of the file into the database by using
the following command: -
hive> load data local inpath '/home/<username>/hive/emp_details' into table demo.employee;
hive> select * from demo.employee;

Hive — Drop Table


Let’s delete a specific table from the database.
hive> show databases;
hive> use demo;
hive> show tables;
hive> drop table new_employee;
hive> show tables;

Hive — Alter table


In Hive, we can perform modifications in the existing table
like changing the table name, column name, comments,
and table properties.
 Rename a table
hive> Alter table <old_table_name> rename to <new_table_name>

Let’s check table name changed or not.


hive> show tables;

 Adding a column —
Alter table table_name add columns(columnName datatype);
 Change column —
hive> Alter table <table_name> change <old_column_name>
<new_column_name> <datatype>;

 Delete or replace column —


alter table employee_data replace columns( id string, first_name
string, age int);

In hive with DML statements, we can add data to the Hive table in 2
different ways.
 Using INSERT Command
 Load Data Statement

1. Using INSERT Command


Syntax:
INSERT INTO TABLE <table_name> VALUES (<add values as per
column entity>);

Example:
To insert data into the table let’s create a table with the name student (By
default hive uses its default database to store hive tables).
Command:
CREATE TABLE IF NOT EXISTS student(

Student_Name STRING,

Student_Rollno INT,

Student_Marks FLOAT)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ',';


We have successfully created the student table in the
Hive default database with the attribute Student_Name, Student_Rollno,
and Student_Marks respectively.
Now, let’s insert data into this table with an INSERT query.

INSERT Query:

INSERT INTO TABLE student VALUES ('Dikshant', 1, '95'), ('Akshat', 2, '96'), ('Dhruv', 3, '90');

We can check the data of the student table with the help of the below
command.
SELECT * FROM student;
2. Load Data Statement
Hive provides us the functionality to load pre-created table entities either
from our local file system or from HDFS. The LOAD DATA statement is
used to load data into the hive table.
Syntax:
LOAD DATA [LOCAL] INPATH '<The table data location>'
[OVERWRITE] INTO TABLE <table_name>;

Note:
 The LOCAL Switch specifies that the data we are loading is available
in our Local File System. If the LOCAL switch is not used, the hive will
consider the location as an HDFS path location.
 The OVERWRITE switch allows us to overwrite the table data.
Let’s make a CSV(Comma Separated Values) file with the
name data.csv since we have provided ‘,’ as a field terminator while
creating a table in the hive. We are creating this file in our local file system
at ‘/home/dikshant/Documents’ for demonstration purposes.
Command:
cd /home/dikshant/Documents   // change to the directory
touch data.csv                // create the data.csv file
nano data.csv                 // nano is a Linux command-line editor used to edit the file
cat data.csv                  // cat is used to see the content of the file

Load the data into the student hive table with the help of the below
command.
LOAD DATA LOCAL INPATH '/home/dikshant/Documents/data.csv'
INTO TABLE student;

Let’s see the student table content to observe the effect with the help of
the below command.
SELECT * FROM student;

We can observe that we have successfully added the data to


the student table.

Hive — Partitioning
The partitioning in hive can be done in two ways —
 Static partitioning
 Dynamic partitioning
Static Partitioning
In static or manual partitioning, it is required to pass the
values of partitioned columns manually while loading the
data into the table. Hence, the data file doesn’t contain the
partitioned columns.
hive> use test;
hive> create table student (id int, name string, age int,
institute string)
> partitioned by (course string)
> row format delimited
> fields terminated by ',';

 Let’s retrieve the information.


hive> describe student;

 Load the data into the table and pass the values of
partition columns with it by using the following
command: -
hive> load data local inpath '/home/<username>/hive/student_details1' into table student partition(course = "python");
hive> load data local inpath '/home/<username>/hive/student_details1' into table student partition(course = "Hadoop");

 Now retrieve the data.


hive> select * from student;
hive> select * from student where course = 'Hadoop';

Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns
exist within the table. So, it is not required to pass the
values of partitioned columns manually.
hive> use show;

 Enable the dynamic partitioning.


hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

 Create the dummy table.


hive> create table stud_demo(id int, name string, age int,
institute string, course string)
row format delimited
fields terminated by ',';

 Now load the data.


hive> load data local inpath
'/home/<username>/hive/student_details' into table stud_demo;

 Create a partition table.


hive> create table student_part (id int, name string, age int,
institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

 Insert the data of dummy table in the partition table.


hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;

 Now you can view the table data with the help
of select command.
HiveQL — Operators
The HiveQL operators facilitate performing various
arithmetic and relational operations.
hive> use hql;
hive> create table employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;

 Now load the data.


hive> load data local inpath '/home/<username>/hive/emp_data'
into table employee;

 Fetch the data.


select * from employee;

Arithmetic Operators in Hive


 Adding 50 to salary column.
hive> select id, name, salary + 50 from employee;

 Subtracting 50 from the salary column.

hive> select id, name, salary - 50 from employee;

 Find out 10% of each employee's salary.

hive> select id, name, salary * 0.1 from employee;

Relational Operators in Hive


 Fetch the details of the employee having
salary>=25000.
hive> select * from employee where salary>=25000;

 Fetch the details of the employee having salary<25000.


hive> select * from employee where salary < 25000;

Functions in Hive
hive> use hql;
hive> create table employee_data (Id int, Name string , Salary
float)
row format delimited
fields terminated by ',' ;

 Now load the data.


hive> load data local inpath '/home/<username>/hive/employee_data' into table employee_data;

 Fetch the data.


select * from employee_data;

Mathematical Functions in Hive


 Let’s see an example to fetch the square root of each
employee’s salary.
hive> select Id, Name, sqrt(Salary) from employee_data ;

Aggregate Functions

 Let’s see an example to fetch the maximum/minimum


salary of an employee.
hive> select max(Salary) from employee_data;
hive> select min(Salary) from employee_data;

Other functions in Hive

 Let’s see an example to fetch the name of each


employee in uppercase.
hive> select Id, upper(Name) from employee_data;

 Let’s see an example to fetch the name of each


employee in lowercase.
hive> select Id, lower(Name) from employee_data;

GROUP BY Clause
The HQL Group By clause is used to group the data from
the multiple records based on one or more column. It is
generally used in conjunction with the aggregate functions
(like SUM, COUNT, MIN, MAX and AVG) to perform an
aggregation over each group.
hive> use hql;
hive> create table employee_data (Id int, Name string, Department string, Salary float)
row format delimited
fields terminated by ',' ;

 Now load the data.


hive> load data local inpath '/home/<username>/hive/employee_data' into table employee_data;

 Fetch the data.


select department, sum(salary) from employee_data group by
department;

HAVING CLAUSE
The HQL HAVING clause is used with GROUP BY clause.
Its purpose is to apply constraints on the group of data
produced by GROUP BY clause. Thus, it always returns the
data where the condition is TRUE.
 Let’s fetch the sum of employee’s salary based on
department having sum >= 35000 by using the
following command:
hive> select department, sum(salary) from employee_data group by
department having sum(salary) >= 35000;
HiveQL — ORDER BY Clause
In HiveQL, ORDER BY clause performs a complete
ordering of the query result set. Hence, the complete data
is passed through a single reducer. This may take much
time in the execution of large datasets. However, we can
use LIMIT to minimize the sorting time.
hive> use hql;
hive> create table employee_data (Id int, Name string , Salary
float)
row format delimited
fields terminated by ',' ;

 Now load the data.


hive> load data local inpath '/home/<username>/hive/employee_data' into table employee_data;

 Fetch the data.


select * from employee_data order by salary desc;

HiveQL — SORT BY Clause


The HiveQL SORT BY clause is an alternative of ORDER BY
clause. It orders the data within each reducer. Hence, it
performs the local ordering, where each reducer’s output
is sorted separately. It may also give a partially ordered
result.
 Let’s fetch the data in the descending order by using
the following command:
select * from employee_data sort by salary desc;

Apache Pig
Apache Pig is a high-level programming language especially designed for analyzing
large data sets.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce tasks.
Apache Pig has a component known as Pig Engine that accepts the Pig Latin
scripts as input and converts those scripts into MapReduce jobs.

To eliminate the complexities associated with MapReduce, an abstraction called Pig
was built on top of Hadoop.
 Apache Pig allows developers to write data analysis programs using Pig
Latin. This is a highly flexible language and supports users in developing
custom functions for writing, reading and processing data.
 It enables the resources to focus more on data analysis by minimizing the time
taken for writing Map-Reduce programs.
 In order to analyze the large volumes of data , programmers write scripts using
Pig Latin language. These scripts are then transformed internally into Map and
Reduce tasks.
Why do we need Apache Pig?
Developers who are not good at Java struggle a lot while working with Hadoop,
especially when executing tasks related to the MapReduce framework.
Apache Pig is the best solution for all such programmers.
 Pig Latin simplifies the work of programmers by eliminating the need to write
complex codes in java for performing MapReduce tasks.
 The multi-query approach of Apache Pig reduces the length of code drastically
and minimizes development time.
 Pig Latin is similar to SQL, so if you are familiar with SQL it becomes very
easy for you to learn.

Apache Pig – History


 In 2006, Apache Pig was developed as a research project at Yahoo,
especially to create and execute MapReduce jobs on every dataset.
 In 2007, Apache Pig was open sourced via Apache incubator.
 In 2008, the first release of Apache Pig came out.
 In 2010, Apache Pig graduated as an Apache top-level project.
Pig Philosophy
Apache Pig Architecture

Components of Pig Architecture: As shown in the above diagram, Apache Pig
consists of various components.
Parser: All Pig scripts initially go through the parser. It carries out
various checks, including syntax checks of the script, type checking, and other
miscellaneous checks. The parser produces as output a DAG (directed acyclic
graph) that represents the Pig Latin statements: the logical operators of the
script appear as the nodes of the DAG and the data flows between them appear as
its edges.
Optimizer: The DAG is then passed to the logical optimizer, which carries out
logical optimizations such as projection and pushdown.
Compiler : The compiler component transforms the optimized logical plan into a
sequence of MapReduce jobs.
Execution Engine: This component submits the MapReduce jobs to Hadoop in sorted
order. Finally, these MapReduce jobs are executed on Apache Hadoop to produce the
desired results.

Pig Components

1. Pig Latin − a language used to express data flows.

2. Pig Engine − an engine on top of the Hadoop 2 execution environment that takes
scripts written in Pig Latin as input and converts them into MapReduce jobs.

Apache Pig Run Modes


In Hadoop Pig can be executed in two different modes which are:
Local Mode
o Here Pig uses the local file system and runs in a single JVM.
The local mode is ideal for analyzing small data sets.
o Here, files are installed and run using localhost.
o The local mode works on the local file system; the input and output data
are stored in the local file system.
The command for the local-mode grunt shell:
$ pig -x local

MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o All the queries written using Pig Latin are converted into MapReduce jobs and
these jobs are run on a Hadoop cluster.
o It can be executed against a pseudo-distributed (single-node) or fully
distributed Hadoop installation.
o Here, the input and output data are present on HDFS.

The command for Map reduce mode:


$ pig

When to Use Each Mode


 Use Local Mode for small datasets, unit tests, debugging, or quick checks
where distributed processing is unnecessary.
 Use MapReduce Mode for production, large datasets, or whenever you want
to leverage Hadoop’s distributed computing capabilities.

Ways to execute Pig Program


These are the following ways of executing a Pig program on local and MapReduce
mode: -
o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To
invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we
can enter Pig Latin statements and commands interactively at the command
line.
o Batch Mode - In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions, called
UDFs (User Defined Functions), in programming languages such as Java and
Python, and use them from Pig.

Pig Latin Data Model


The data model of Pig Latin is fully nested and it allows complex non-atomic
datatypes such as map and tuple. Given below is the diagrammatical representation
of Pig Latin’s data model.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It
is stored as a string and can be used as a string or a number. int, long, float, double,
chararray, and bytearray are the atomic values of Pig. A piece of data or a simple
atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can
be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{ }’. It is similar to a table in RDBMS, but unlike a
table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented by
‘[ ]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
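As a minimal sketch of how these nested types appear in a schema (the file name
students_complex.txt, its '|' delimiter, and its layout are assumptions made only
for illustration), a relation whose tuples contain a bag and a map can be declared
as follows:
grunt> complex_data = LOAD 'students_complex.txt' USING PigStorage('|')
as (name:chararray, age:int, contacts: bag {t: (phone:chararray)}, props: map[]);
grunt> describe complex_data;
-- describe prints the nested schema, showing the bag of phone tuples and the map field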

Apache Pig comes with the following features :


Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
UDF’s − Pig provides the facility to create User-defined Functions in other
programming languages such as Java and invoke or embed them in Pig Scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Advantages of Apache Pig
o Less code - Pig requires fewer lines of code to perform an operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - The Pig provides a useful concept of nested data types
like tuple, bag, and map.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used −
 To process huge data sources such as web logs.
 To perform data processing for search platforms.
 To process time sensitive data loads.

Differences between Apache MapReduce and PIG


o MapReduce is a low-level data processing tool, whereas Pig is a high-level
data flow tool.
o In MapReduce, complex programs have to be developed using Java or Python;
in Pig, writing such complex programs is not required.
o Performing data operations in MapReduce is difficult; Pig provides built-in
operators to perform data operations like union, sorting and ordering.
o MapReduce does not allow nested data types; Pig provides nested data types
like tuple, bag, and map.

Apache Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.
o Pig Latin is a procedural language, whereas SQL is a declarative language.
o In Apache Pig, schema is optional; we can store data without designing a
schema (fields are then addressed positionally as $0, $1, etc.). In SQL,
schema is mandatory.
o The data model in Apache Pig is nested relational, whereas the data model
used in SQL is flat relational.
o Apache Pig provides limited opportunity for query optimization, whereas
there is more opportunity for query optimization in SQL.

In addition to above differences, Apache Pig Latin −


 Allows splits in the pipeline.
 Allows developers to store data anywhere in the pipeline.
 Declares execution plans.
 Provides operators to perform ETL (Extract, Transform, and Load) functions.

Apache Pig Vs Hive


 Both Apache Pig and Hive are used to create MapReduce jobs. And in some
cases, Hive operates on HDFS in a similar way Apache Pig does. In the
following table, we have listed a few significant points that set Apache Pig
apart from Hive.
o Apache Pig uses a language called Pig Latin, originally created at Yahoo;
Hive uses a language called HiveQL, originally created at Facebook.
o Pig Latin is a data flow language, whereas HiveQL is a query processing
language.
o Pig Latin is a procedural language and fits the pipeline paradigm, whereas
HiveQL is a declarative language.
o Apache Pig can handle structured, unstructured, and semi-structured data;
Hive is mostly for structured data.

Invoking the Grunt Shell


You can invoke the Grunt shell in a desired mode (local/MapReduce) using
the −x option as shown below.
Local mode:
Command − $ ./pig -x local

MapReduce mode:
Command − $ ./pig -x mapreduce   (or simply $ pig)

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>

You can exit the Grunt shell using ‘ctrl + d’ or quit command.
After invoking the Grunt shell, you can execute a Pig script by directly entering the
Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode


You can write an entire Pig Latin script in a file and execute it by passing
the script file to the pig command. Let us suppose we have a Pig script in a
file named Sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
dump student;

Now, you can execute the script in the above file as shown below.
Local mode:      $ pig -x local Sample_script.pig
MapReduce mode:  $ pig -x mapreduce Sample_script.pig

Pig Latin – Data Model


The data model of Pig is fully nested. A Relation is the outermost structure of
the Pig Latin data model. And it is a bag where −
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
 These statements work with relations. They
include expressions and schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided by Pig Latin,
through statements.
 Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
 As soon as you enter a Load statement in the Grunt shell, its semantic
checking is carried out. To see the contents of the relation, you need to
use the Dump operator. Only after performing the dump operation is the
MapReduce job for loading the data actually executed.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Pig Latin – Data types


The table given below describes the Pig Latin data types.
1. int − Represents a signed 32-bit integer. Example: 8
2. long − Represents a signed 64-bit integer. Example: 5L
3. float − Represents a signed 32-bit floating point. Example: 5.5F
4. double − Represents a 64-bit floating point. Example: 10.5
5. chararray − Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray − Represents a byte array (blob).
7. boolean − Represents a Boolean value. Example: true / false
8. datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger − Represents a Java BigInteger. Example: 60708090709
10. bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Types

11. tuple − An ordered set of fields. Example: (raja, 30)
12. bag − A collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13. map − A set of key-value pairs. Example: ['name'#'Raju', 'age'#30]

Pig Latin – Arithmetic Operators


The following table describes the arithmetic operators of Pig Latin. Suppose a = 10
and b = 20.
+ (Addition) − Adds values on either side of the operator. Example: a + b gives 30
− (Subtraction) − Subtracts the right-hand operand from the left-hand operand. Example: a − b gives −10
* (Multiplication) − Multiplies values on either side of the operator. Example: a * b gives 200
/ (Division) − Divides the left-hand operand by the right-hand operand. Example: b / a gives 2
% (Modulus) − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a gives 0
?: (Bincond) − Evaluates a Boolean condition. It has three operands:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30;  if a = 1 the value of b is 20, and if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END (Case) − The case operator is equivalent to a nested bincond operator.
Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
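As a hedged illustration of the arithmetic and bincond operators (assuming the
student_details relation loaded later in the GROUP example, which has id and age
fields):
grunt> computed = FOREACH student_details GENERATE id, age,
age + 1 AS next_age, (age % 2 == 0 ? 'even' : 'odd') AS parity;
-- next_age uses the + operator; parity uses the ?: (bincond) operator
grunt> dump computed;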

Pig Latin – Comparison Operators


The following table describes the comparison operators of Pig Latin.
== (Equal) − Checks if the values of two operands are equal; if yes, the condition becomes true. Example: (a == b) is not true.
!= (Not Equal) − Checks if the values of two operands are not equal; if they are not equal, the condition becomes true. Example: (a != b) is true.
> (Greater than) − Checks if the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true. Example: (a > b) is not true.
< (Less than) − Checks if the value of the left operand is less than the value of the right operand; if yes, the condition becomes true. Example: (a < b) is true.
>= (Greater than or equal to) − Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true. Example: (a >= b) is not true.
<= (Less than or equal to) − Checks if the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true. Example: (a <= b) is true.
matches (Pattern matching) − Checks whether the string on the left-hand side matches the regular-expression constant on the right-hand side. Example: f1 matches '.*tutorial.*'
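A small sketch of these operators in use (again assuming the student_details
relation from the GROUP example, which has age and city fields):
grunt> filtered = FILTER student_details BY (age >= 22) AND (city matches 'Chen.*');
-- keeps only tuples whose age is at least 22 and whose city starts with 'Chen'
grunt> dump filtered;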

Pig Latin – Type Construction Operators


The following table describes the type construction operators of Pig Latin.
() (Tuple constructor) − Used to construct a tuple. Example: (Raju, 30)
{} (Bag constructor) − Used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] (Map constructor) − Used to construct a map. Example: [name#Raja, age#30]

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.
Operator Description

Loading and Storing

LOAD To load data from the file system (local/HDFS) into a relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.


CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in sorted order based on one or more fields
(ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.

EXPLAIN To view the logical, physical, or MapReduce execution plans used to
compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.
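Putting several of these operators together, the following is a minimal end-to-end
sketch (the input and output paths are assumptions; the schema matches the
student_details example used later):
grunt> student_details = LOAD '/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grunt> adults = FILTER student_details BY age >= 22;      -- filtering
grunt> by_city = GROUP adults BY city;                    -- grouping
grunt> counts = FOREACH by_city GENERATE group AS city, COUNT(adults) AS num_students;
grunt> ordered = ORDER counts BY num_students DESC;       -- sorting
grunt> STORE ordered INTO '/pig_output/city_counts' USING PigStorage(',');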

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes
large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we have to initially load the data into
Apache Pig.
Student ID First Name Last Name Phone City

001 Rajiv Reddy 9848022337 Hyderabad

002 siddarth Battacharya 9848022338 Kolkata

003 Rajesh Khanna 9848022339 Delhi

004 Preethi Agarwal 9848022330 Pune

005 Trupthi Mohanthy 9848022336 Bhuwaneshwar

006 Archana Mishra 9848022335 Chennai

The Load Operator


You can load data into Apache Pig from the file system (HDFS/ Local)
using LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator. On the left-hand
side, we need to mention the name of the relation where we want to store the data,
and on the right-hand side, we have to define how we store the data.
Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' [USING function] [as schema];
Where,
 relation_name − We have to mention the relation in which we want to store
the data.
 Input file path − We have to mention the HDFS directory where the file is
stored. (In MapReduce mode)
 function − We have to choose a function from the set of load functions
provided by Apache Pig (BinStorage, JsonLoader, PigStorage,
TextLoader).
 Schema − We have to define the schema of the data. We can define the
required schema as follows −(column1 : data type, column2 : data type,
column3 : data type);
Note − We can load the data without specifying a schema. In that case, the
columns are addressed positionally as $0, $1, $2, and so on.
Example
As an example, let us load the data in student_data.txt into Pig in a relation
named student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as
shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the
ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version
0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error
messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup
file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main]
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000
grunt>

Execute the Load Statement


Now load the data from the file student_data.txt into Pig by executing the following
Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'student_data.txt';
(no schema and no delimiter specified; the default delimiter is the tab character)
or
grunt> student = LOAD 'student_data.txt' as (id, firstname, lastname, phone, city);
(schema specified, delimiter not specified)
or
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
(both schema and delimiter specified)

With the first two forms, the input file should have its columns separated by tab
characters, and we do not need to specify the complete schema (data types) of the
relation.

Following is the description of the above statement.


Relation name − We have stored the data in the relation student.

Input file path − We are reading data from the file student_data.txt, which is in
the /pig_data/ directory of HDFS.

Storage function − We have used the PigStorage() function. It loads and stores
data as structured text files. It takes as a parameter the delimiter by which each
entity of a tuple is separated; by default, the delimiter is '\t'.

Schema − We have stored the data using the following schema:
column:   id   firstname   lastname    phone       city
datatype: int  chararray   chararray   chararray   chararray

Note − The load statement will simply load the data into the specified relation in Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
You can store the loaded data in the file system using the store operator.
Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING function];

Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown
below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage
(',');

Output
After executing the store statement, you will get the following output. A directory is
created with the specified name and the data will be stored in it.
The load statement will simply load the data into the specified relation in Apache
Pig. To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Pig Latin provides four different types of diagnostic operators −
 Dump operator
 Describe operator
 Explain operator
 Illustrate operator
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results
on the screen.
It is generally used for debugging Purpose.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name

Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
The syntax of the describe operator is as follows −
grunt> Describe Relation_name

Explain Operator
The explain operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;

Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of
statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Group Operator
The GROUP operator is used to group the data in one or more relations. It collects
the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);

Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;

We can verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;

Output
Then you will get output displaying the contents of the relation
named group_data as shown below.
Here you can observe that the resulting schema has two columns −
 One is age, by which we have grouped the relation.
 The other is a bag, which contains the group of tuples, student records with
the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the table after grouping the data using
the describe command as shown below.
grunt> Describe group_data;

group_data: {group: int,student_details: {(id: int,firstname: chararray,
lastname: chararray,age: int,phone: chararray,city: chararray)}}

Try: grunt> illustrate group_data;
Try: grunt> explain group_data;

Grouping by Multiple Columns


Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump
operator.
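Once data is grouped, a FOREACH statement is typically used to compute aggregates
over each group's bag. A minimal sketch using the group_multiple relation above
(COUNT is a built-in Pig function; the field names follow the student_details
schema):
grunt> count_per_group = FOREACH group_multiple GENERATE FLATTEN(group) AS (age, city),
COUNT(student_details) AS num_students;
-- FLATTEN turns the (age, city) group-key tuple back into two ordinary fields
grunt> dump count_per_group;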
Joins in Pig
The JOIN operator is used to combine records from two or more relations. While
performing a join operation, we declare one (or a group of) field(s) from each
relation as keys. When these keys match, the two particular tuples are matched,
else the records are dropped.
Joins can be of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
Assume that we have two files namely customers.txt and orders.txt in
the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING
PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various Join operations on these two relations.


Self - join
Self-join is used to join a table with itself as if the table were two relations,
temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple
times, under different aliases (names). Therefore let us load the contents of the
file customers.txt as two tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
Syntax
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;

Output
It will produce the following output, displaying the contents of the
relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join
returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B)
based upon the join-predicate. The query compares each row of A with each row of
B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is
satisfied, the column values for each matched pair of rows of A and B are combined
into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two relations customers and orders as
shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;

Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;

Output
You will get the following output, showing the contents of the relation
named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Outer Join
Unlike inner join, outer join returns all the rows from at least one of the
relations.
An outer join operation is carried out in three ways −
 Left outer join
 Right outer join
 Full outer join
Left Outer Join
The left outer Join operation returns all rows from the left table, even if there are no
matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using
the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER,
Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as
shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join


The right outer join operation returns all rows from the right table, even if there are
no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation using
the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Full Outer Join


The full outer join operation returns rows when there is a match in one of the
relations.
Syntax
Given below is the syntax of performing full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Using Multiple Keys


We can perform JOIN operation using multiple keys.
Syntax
Here is how you can perform a JOIN operation on two relations using multiple keys.
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name
BY (key1, key2);
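As a concrete sketch (the relations employee and employee_contact and their id and
jobid fields are hypothetical, introduced only for illustration):
grunt> emp_data = JOIN employee BY (id, jobid), employee_contact BY (id, jobid);
-- tuples are matched only when both id and jobid agree in the two relations
grunt> dump emp_data;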
Cross Operator
The CROSS operator computes the cross-product of two or more relations. A small
example is shown after the syntax below.
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
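A small sketch using the customers and orders relations loaded earlier:
grunt> cross_data = CROSS customers, orders;
-- every customer tuple is paired with every order tuple (7 x 4 = 28 result tuples here)
grunt> dump cross_data;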

Foreach Operator
The FOREACH operator is used to generate specified data transformations based
on the column data.
Syntax
Given below is the syntax of the FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;


grunt> foreach_data = FOREACH student_details GENERATE age>25;

Order By Operator
The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

grunt> order_by_data = ORDER student_details BY age DESC;


Limit Operator
The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Example
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);

grunt> limit_data = LIMIT student_details 4;

Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;

Apache Pig - Load & Store Functions

The load and store functions in Apache Pig determine how data goes into and comes
out of Pig. These functions are used with the LOAD and STORE operators. Given
below is the list of load and store functions available in Pig.
1. PigStorage() − To load and store structured files.
2. TextLoader() − To load unstructured data into Pig.
3. BinStorage() − To load and store data into Pig using a machine-readable format.
4. Handling Compression − In Pig Latin, we can load and store compressed data.
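For instance, TextLoader() can be used for unstructured input, where each line
becomes a single-field tuple (the file path below is an assumption):
grunt> raw_logs = LOAD '/pig_data/server_log.txt' USING TextLoader() AS (line:chararray);
-- no delimiter parsing is applied; the whole line is one chararray field
grunt> dump raw_logs;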
Apache Pig - User Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDF’s). Using these UDF’s, we can define our own
functions and use them. The UDF support is provided in six programming languages,
namely, Java, Jython, Python, JavaScript, Ruby and Groovy.
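A minimal sketch of how a Java UDF is wired into a script with the REGISTER and
DEFINE statements (the jar name myudfs.jar and the class com.example.udf.ToUpper
are hypothetical placeholders, not a real library):
grunt> REGISTER myudfs.jar;                      -- hypothetical jar containing the UDF class
grunt> DEFINE TOUPPER com.example.udf.ToUpper(); -- hypothetical fully-qualified class name
grunt> upper_names = FOREACH student_details GENERATE id, TOUPPER(firstname);
grunt> dump upper_names;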

Comments in Pig Script


While writing a script in a file, we can include comments in it as shown below.
Multi-line comments
We will begin the multi-line comments with '/*', end them with '*/'.
/* These are the multi-line comments
In the pig script */
Single –line comments
We will begin the single-line comments with '--'.
--we can write single line comments like this.

Executing Pig Script in Batch mode


While executing Apache Pig statements in batch mode, follow the steps given below.
Step 1
Write all the required Pig Latin statements in a single file. We can write all the Pig
Latin statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell (Linux)
as shown below.
Local mode MapReduce mode

$ pig -x local Sample_script.pig $ pig -x mapreduce Sample_script.pig

You can execute it from the Grunt shell as well using the exec/run command as
shown below.
grunt> exec /sample_script.pig

Executing a Pig Script from HDFS


We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig
script with the name Sample_script.pig in the HDFS directory named /pig_data/.
We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig

When NOT to use PIG


Pig should not be used
1. When your data is completely unstructured, such as videos, text and audio.
2. When there is a time constraint, because Pig is slower than hand-tuned
MapReduce jobs.

To practice in Pig:
run Pig in interactive mode where the files are in the local file system
run Pig in script mode where the files are in the local file system
run Pig in interactive mode where the files are in HDFS
run Pig in script mode where the files are in HDFS
Use HUE to write Pig scripts to perform ordering and joins
Wordcount program in a Pig script (a sketch is given at the end of this section)
Operations to perform:
1. load
2. dump
3. store
4. group
5. foreach
6. order
7. joins
8. diagnostic operators

In MapReduce mode:
cd Desktop
$ hadoop fs -copyFromLocal customers.txt /tmp
$ hadoop fs -copyFromLocal orders.txt /tmp
$ pig
grunt> c = load '/tmp/customers.txt' using PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
grunt> dump c;
grunt> od = load '/tmp/orders.txt' using PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);
grunt> dump od;
grunt> g = group c by age;
grunt> dump g;
or
grunt> store g into '/tmp/gg';
grunt> fs -cat /tmp/gg/part-*;
In Local mode:
$ pig -x local
grunt> c = load 'customers.txt' using PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
grunt> dump c;
grunt> od = load 'orders.txt' using PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);
grunt> dump od;
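A minimal word-count sketch in Pig Latin for the practice item above (the input
path /tmp/words.txt is an assumption; TOKENIZE and COUNT are built-in Pig
functions):
grunt> lines = LOAD '/tmp/words.txt' USING TextLoader() AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grpd = GROUP words BY word;
grunt> counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS freq;
grunt> dump counts;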
