BDA Unit-4 Part-2: HBase, Hive, Pig
HBase Regions
A Region is a continuous range of rows within a table, representing the basic
unit of scalability and distribution in HBase.
Tables are divided into regions by row keys, and these regions are then
distributed across the cluster to different Region Servers.
When a region grows beyond a certain size, it’s split into two new regions,
which can be moved to other Region Servers to balance the load.
ZooKeeper
HBase Zookeeper is a centralized monitoring server which maintains configuration
information and provides distributed synchronization. Distributed synchronization means
coordinating the distributed applications running across the cluster, with ZooKeeper
taking the responsibility of providing coordination services between the nodes. If a client
wants to communicate with regions, the client has to approach ZooKeeper first.
It is an open-source project that provides many important services.
Services provided by ZooKeeper
Maintains Configuration information
Provides distributed synchronization
Client Communication establishment with region servers
Provides ephemeral nodes, which represent the different region servers
Master servers use these ephemeral nodes to discover the available servers
in the cluster
Tracks server failures and network partitions
Master and HBase slave nodes (region servers) register themselves with
ZooKeeper. The client needs access to the ZooKeeper (ZK) quorum configuration to
connect with the master and region servers.
When a node in the HBase cluster fails, the ZK quorum triggers error
messages and starts to repair the failed nodes.
HDFS
HDFS is the Hadoop Distributed File System; as the name implies, it provides a
distributed environment for storage, and it is a file system designed to
run on commodity hardware. It stores each file in multiple blocks and, to maintain
fault tolerance, the blocks are replicated across the Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity
hardware. By adding nodes to the cluster and performing processing and storage on
cheap commodity hardware, it gives the client better results as
compared to existing traditional systems.
Here, the data stored in each block is replicated to 3 nodes, so in case any
node goes down there will be no loss of data, because a proper backup and recovery
mechanism is in place.
HDFS stays in contact with the HBase components and stores a large amount of data
in a distributed manner.
Column-oriented vs row-oriented storage:
Column-oriented database: When the situation demands processing and analytics, such
as Online Analytical Processing (OLAP) and its applications, we use this approach. The
amount of data that can be stored in this model is very huge, in terms of petabytes.
Row-oriented database: Online Transactional Processing (OLTP), such as in the banking
and finance domains, uses this approach. It is designed for a small number of rows and
columns.
HBase Commands
HBase or Hadoop Database, is a distributed and non-relational database
management system that runs on top of the Hadoop Distributed File System (HDFS).
HBase can handle large amounts of data while providing low latency via parallel
processing.
Introduction
The HBase Shell is a command-line interface for interacting with the HBase
database.
When the shell is launched, it establishes a connection with the hbase client
which is responsible for interacting with the HBase Master for various
operations.
The HBase shell translates user-entered commands into appropriate API calls,
allowing users to interact with HBase in a more user-friendly and interactive manner.
The HBase Shell commands are grouped into the following categories:
General Commands.
Data Definition Language (DDL) Commands.
Data Manipulation Language (DML) Commands.
Security Commands.
Other HBase Shell Commands.
The HBase Shell is built on top of the HBase Java API and provides a
simplified interface for executing HBase commands without the need for
extensive coding.
General Commands
HBase shell offers a range of general commands to get information regarding the
database. The general HBase commands are,
Whoami
Status
Version
Table Help
Cluster Help
Namespace Help
1. whoami
The whoami HBase command is used to retrieve the current user's information.
Syntax:
whoami
Output:
user: hari
Explanation:
This output indicates that the user executing the command is logged in as hari in the
HBase Shell.
2. Status
The status command provides information about the HBase cluster, including the
cluster's overall health and the number of servers and regions.
Syntax:
status
Output:
Explanation:
active master:
Number of active master servers in the HBase cluster.
region servers:
Number of region servers that are currently running.
regions:
Total number of regions in the cluster.
3. Version
The version command displays the version of HBase in use.
Syntax:
version
Output:
HBase 2.4.0
Explanation:
The output shows that the cluster is running HBase version 2.4.0.
4. Table Help
table_help
Output:
TABLE COMMANDS:
create 'table_name', {NAME => 'column_family_name'}, ...
Create a table with the specified table_name and column families.
describe 'table_name'
Display the detailed schema of the specified table.
...
5. Cluster Help
cluster_help
Output:
status
version
...
6. Namespace Help
Syntax:
namespace_help
Output:
create_namespace 'namespace_name'
describe_namespace 'namespace_name'
...
The Data Definition Language (DDL) allows users to define and manipulate the
structure of the HBase tables.
Create
Describe
Alter
Disable
Enable
Drop
Exist
1. Create
The create command is used to create a new table with the specified column families.
Syntax:
create 'table_name', 'column_family_name'
A table can also be created inside a namespace 'your_namespace' by prefixing the table
name with it; all tables created with this prefix will be in the your_namespace namespace.
We can also add various configurations to the HBase commands while creating the
column family in a table. Example:
create 'products', {NAME => 'details', VERSIONS => 5, TTL => '86400'}, {NAME =>
'inventory', COMPRESSION => 'GZ'}
Explanation:
{}:
Used to specify details of a column family.
NAME:
Specifies the name of the column family.
VERSIONS:
Sets the maximum number of versions to keep for each cell.
TTL:
Sets the Time-to-Live (TTL).
COMPRESSION:
Specifies the compression algorithm to be used.
GZ means Gzip compression.
The value 86400 represents the number of seconds (1 day) that cells will be kept
before being automatically expired and deleted.
2. Describe
The describe command provides information about the specified table, including
column families and region details.
Syntax:
describe 'table_name'
Example:
describe 'employees'
Output:
Explanation:
This command provides a detailed description of the employees table, including its
enabled status and the configuration of its column families.
3. Alter
The alter command allows users to modify the schema of an existing table.
Syntax:
alter 'table_name', NAME => 'column_family_name'
Example:
alter 'employees', NAME => 'contact'
Explanation:
This command adds a new column family named contact to the employees table.
4. Disable
The disable command is used to disable a table. Once disabled, the table is taken
offline, and no read or write operations can be performed on it until it is re-enabled.
Syntax:
disable 'table_name'
Example:
disable 'employees'
Output:
Explanation:
This command disables the employees table, preventing any further modifications.
5. Enable
The enable command is used to enable a previously disabled table. Once enabled,
the table becomes writable again. Syntax:
enable 'table_name'
Example
enable 'employees'
Output:
6. Drop
The drop command in HBase Shell is used to delete an existing table from the
HBase database. The table to be deleted should be disabled first using
the disable command.
Syntax:
drop 'table_name'
Example:
drop 'employees'
Output:
Explanation:
7. Exist
The exist command in HBase is used to check the existence of a table or a column
family within a table.
Syntax:
exists 'table_name'
Example:
exists 'employees'
Output:
Explanation:
Since we have deleted the table, the output shows that the table doesn't exist.
HBase Shell provides a Data Manipulation Language (DML) that enables users to
insert, update, and delete data within HBase tables. Some of the DML HBase
commands are,
Put
Get
Scan
Delete
Truncate
1. Put
The put command is used to insert data into a specific cell of a table.
Syntax:
put 'table_name', 'row_key', 'column_family:column_qualifier', 'value'
Explanation:
table_name:
Name of the HBase table.
row key:
Unique identifier for a specific row in the HBase table.
Each row in an HBase table is indexed and accessed using its row key.
column family:
Represents the group to which a specific column belongs.
column qualifier:
Specifies the specific column within the column family.
value:
Data that will be stored
Example:
put 'employees', '1001', 'personal:name', 'Sample user'
Output:
Explanation:
This command inserts the value Sample user into the personal:name column of the
row with key 1001 in the employees table.
2. Get
The get command is used to retrieve data from a table based on the row key.
Syntax:
get 'table_name', 'row_key'
The command requires specifying the table name and row key.
Example:
get 'employees', '1001'
Output:
COLUMN CELL
personal:name timestamp=1628272254000, value=John Doe
1 row(s) in 0.0390 seconds
Explanation:
This command retrieves the data associated with the row key 1001 in
the employees table, displaying the column name, timestamp, and cell value.
Command Options
We can also use different options with the GET command to get specific information
in the table. The options are,
FILTER:
accepts a filter string that defines the filter conditions. You can use various
filter types like SingleColumnValueFilter, PrefixFilter, ColumnPrefixFilter, etc.
Example:
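(A representative command, reconstructed from the explanation that follows; the exact
filter string is an assumption.)
get 'employees', 'row1', {FILTER => "SingleColumnValueFilter('personal', 'age', >=,
'binary:25')"}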
Explanation:
The above command retrieves the cell values from the row with the key row1 in
the employees table. The filter condition SingleColumnValueFilter is applied to fetch
only those cells where the age column in the personal column family is greater than
or equal to 25.
COLUMN:
Can specify the name of the column or column family you want to retrieve.
TIMESTAMP:
Takes a single timestamp value and fetches the cell values written at that specific
timestamp.
TIMERANGE:
Takes a start timestamp and an end timestamp (in epoch milliseconds) and
retrieves cell values whose timestamps fall within that range. For instance,
TIMERANGE => [ts1, ts2].
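A minimal sketch of the COLUMN option, assuming the employees table and row key
1001 used earlier:
get 'employees', '1001', {COLUMN => 'personal:name'}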
3. Scan
The scan command is used to retrieve multiple rows or a range of rows from a
table. Syntax:
scan 'table_name'
Example:
scan 'employees'
Output:
ROW COLUMN+CELL
1001 column=personal:name, timestamp=1628272254000,
value=Sample user
1002 column=personal:name, timestamp=1628272296250,
value=New User
2 row(s) in 0.0560 seconds
Explanation:
This command retrieves all the rows and associated column families and cells from
the employees table.
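The scan command also accepts options similar to get; a minimal sketch, assuming the
same employees table (the option values are illustrative):
scan 'employees', {COLUMNS => ['personal:name'], LIMIT => 2}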
4. Delete
The delete command is used to delete data from a table based on the row key,
column family, and column qualifier.
Syntax:
delete 'table_name', 'row_key', 'column_family:column_qualifier'
If we want to delete a specific cell, the column family and column qualifier must also
be specified.
Example:
delete 'employees', '1002', 'personal:name'
Output:
Explanation: This command deletes the data in the personal:name column of the
row with key 1002 in the employees table.
5. Truncate
The truncate command in HBase Shell is used to delete all data from a specific
table while retaining the table structure. Syntax:
truncate 'table_name'
Example:
truncate 'employees'
Output:
Explanation:
Through the above command, all the data from the employees table are removed.
Security Commands
Security commands in HBase Shell are used to manage user permissions and
access control, and to ensure the security of the HBase database. Some of the security
HBase commands are,
Create
Grant
Revoke
User Permissions
Disable Security
1. Create
The create command is used to create a new user with a specified username and
password. Syntax:
Example:
create 'bob', 'hbasecommand'
Output:
Explanation:
This command creates a new user named bob with the password hbasecommand.
2. Grant
The grant command is used to grant specific permissions to a user.
Syntax:
grant 'user_name', 'permissions', 'table_name'
Explanation:
The permissions can be any of read (R), write (W), execute (X), create (C), and
admin (A). The permission is applied for a particular table.
Example:
grant 'bob', 'R', 'employees'
Output:
Explanation:
This command grants the read permission to the user bob on the employees table.
The execute permission allows the user to execute coprocessor functions on the
specified table. Coprocessors are custom code modules that run on HBase Region
Servers and can perform operations on the server side. The admin permission
provides full administrative privileges over the specified table.
3. Revoke
The revoke command is used to revoke access rights from a user.
Syntax:
revoke 'user_name', 'table_name'
Example:
revoke 'bob', 'employees'
Output:
Explanation:
This command revokes the read permission from the user bob on
the employees table.
4. User Permissions
user_permission 'user_name'
Example:
user_permission 'bob'
Output:
Explanation:
5. Disable Security
Syntax:
disable_security
Example:
disable_security
Output:
Explanation:
The output indicates that the security in HBase has been disabled.
Starting the HBase Shell provides a command-line interface to interact with the
HBase database through HBase commands.
bin/hbase shell
Once the HBase Shell is launched, you will see a command prompt that indicates
you are in the HBase Shell environment. From here, you can start entering various
HBase commands to interact with the HBase database.
Apache Hive
Hive is a data warehouse system which is used to analyze structured
data.
It is built on the top of Hadoop.
It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large
datasets residing in distributed storage.
It runs SQL-like queries called HQL (Hive Query Language), which are
internally converted into MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of
writing complex MapReduce programs.
Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
Features of Hive
Limitations of Hive
The following compares Hive with Pig:
JDBC/ODBC: supported in Hive, but limited; unsupported in Pig.
Hive Architecture
Hive Client:
Hive drivers support applications written in any language like Python, Java,
C++, and Ruby, among others, using JDBC, ODBC, and Thrift drivers, to
perform queries on the Hive. Therefore, one may design a hive client in any
language of their choice.
Hive Services:
Hive provides numerous services, including the Hive server2, Beeline, etc.
The services offered by Hive are:
Hive Server 2: HiveServer2 handles concurrent requests from more than one client;
it replaced HiveServer1, which could not.
Hive Driver: The Hive driver receives the HiveQL statements submitted by
the user through the command shell and creates session handles for the
query.
Hive Compiler: The Hive compiler parses the query and performs semantic analysis
and type checking on the different query blocks and query expressions, using the
metadata stored in the metastore. The execution plan
generated by the Hive compiler is based on the parse results.
The execution plan generated by the compiler is a DAG (Directed Acyclic Graph), in
which each step is either a map/reduce job, an operation on HDFS file metadata, or a
data manipulation step.
Optimizer: The optimizer performs transformation operations on the execution plan
and splits the tasks so that efficiency and scalability are improved.
Metastore: The metastore stores metadata, including information about the serializers
and deserializers as well as the HDFS files where the data is stored, and provides data
storage. It is usually a relational database. Hive metadata can be queried and modified
through the Metastore.
HCatalog: The data processing tools can access the tabular data of the Hive metastore
through HCatalog. It is built on top of the Hive metastore and exposes the tabular data to
other data processing tools.
Hive Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive Services
The following are the services provided by Hive:-
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive
CLI. It provides a web-based GUI for executing Hive queries and
commands.
o Hive MetaStore - It is a central repository that stores all the structure
information of various tables and partitions in the warehouse. It also
includes metadata of columns and their type information, the serializers
and deserializers which are used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the
request from different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI,
Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and
perform semantic analysis on the different query blocks and
expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form
of DAG of map-reduce tasks and HDFS tasks. In the end, the execution
engine executes the incoming tasks in the order of their dependencies.
Integer Types
INT - 4-byte signed integer. Range: -2,147,483,648 to 2,147,483,647.
BIGINT - 8-byte signed integer. Range: -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807.
Decimal Type
DOUBLE - 8-byte double precision floating point number.
Date/Time Types
TIMESTAMP
DATE
The Date value is used to specify a particular year, month and day, in the form
YYYY-MM-DD. However, it does not provide the time of the day. The range of the
Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
VARCHAR
The varchar is a variable-length type whose length lies between 1 and 65535,
which specifies the maximum number of characters allowed in the
character string.
CHAR
Complex Types
Struct - It is similar to a C struct or an object where fields are accessed using the
"dot" notation. Example: struct('James','Roy')
Array - It is a collection of values of a similar type that are indexable using
zero-based integers. Example: array('James','Roy')
So, to check the list of existing databases, follow the below command: -
hive> show databases;
o Internal table
o External table
Internal Table
The internal tables are also called managed tables as the lifecycle of their
data is controlled by the Hive.
The internal tables are not flexible enough to share with other tools like Pig.
If we try to drop the internal table, Hive deletes both table schema and
data.
hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited fields terminated by ',' ;
We can see the metadata of the created table by using the following
command: -
hive> describe demo.employee;
External Table
The external table allows us to create a table and access data stored
externally. The external keyword is used to specify the external table,
whereas the location keyword is used to determine the location of the loaded
data.
As the table is external, the data is not present in the Hive directory.
Therefore, if we try to drop the table, the metadata of the table will be deleted,
but the data still exists.
So, in Hive, we can easily load data from any file to the database.
Let's load the data of the file into the database by using the following
command: -
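(A representative command; the file name emp_details and its local path are
assumptions.)
hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;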
Now, we can use the following command to retrieve the data from the
database.
hive>select * from demo.employee;
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the
values of a particular column like date, course, city or country.
The advantage of partitioning is that since the data is stored in slices, the
query response time becomes faster.
Static Partitioning
In static or manual partitioning, it is required to pass the values of partitioned
columns manually while loading the data into the table. Hence, the data file
doesn't contain the partitioned columns.
Create the table and provide the partitioned columns by using the following
command:
hive> create table student (id int, name string, age int, institute string)
>partitioned by (course string)
>row format delimited
>fields terminated by ',';
Load the data into the table and pass the values of partition columns
with it by using the following command: -
hive> load data local inpath '/home/codegyani/hive/student_details1' into
table student partition(course= "java");
Dynamic Partitioning
Bucketing in Hive
Drop Database
This command deletes a defined database.
hive> drop database demo;
Adding a column —
Alter table table_name add columns(columnName datatype);
Change column —
hive> Alter table table_name change <old_column_name>
<new_column_name> datatype;
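A minimal sketch, assuming a table employee with a column name (the table and the
new column name are illustrative):
hive> alter table employee change name first_name string;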
In Hive, using DML statements, we can add data to a Hive table in two
different ways.
Using INSERT Command
Load Data Statement
Example:
To insert data into the table let’s create a table with the name student (By
default hive uses its default database to store hive tables).
Command:
CREATE TABLE IF NOT EXISTS student(
Student_Name STRING,
Student_Rollno INT,
Student_Marks FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
INSERT Query:
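(A representative insert statement for the student table above; the values are
illustrative.)
INSERT INTO TABLE student VALUES ('Dikshant', 1, 95.0);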
We can check the data of the student table with the help of the below
command.
SELECT * FROM student;
2. Load Data Statement
Hive provides us the functionality to load pre-created table entities either
from our local file system or from HDFS. The LOAD DATA statement is
used to load data into the hive table.
Syntax:
LOAD DATA [LOCAL] INPATH '<The table data location>'
[OVERWRITE] INTO TABLE <table_name>;
Note:
The LOCAL Switch specifies that the data we are loading is available
in our Local File System. If the LOCAL switch is not used, the hive will
consider the location as an HDFS path location.
The OVERWRITE switch allows us to overwrite the table data.
Let’s make a CSV(Comma Separated Values) file with the
name data.csv since we have provided ‘,’ as a field terminator while
creating a table in the hive. We are creating this file in our local file system
at ‘/home/dikshant/Documents’ for demonstration purposes.
Command:
cd /home/dikshant/Documents // To change the directory
Now, load the data into the student hive table with the help of the below
command.
LOAD DATA LOCAL INPATH '/home/dikshant/Documents/data.csv'
INTO TABLE student;
Let’s see the student table content to observe the effect with the help of
the below command.
SELECT * FROM student;
Hive — Partitioning
The partitioning in hive can be done in two ways —
Static partitioning
Dynamic partitioning
Static Partitioning
In static or manual partitioning, it is required to pass the
values of partitioned columns manually while loading the
data into the table. Hence, the data file doesn’t contain the
partitioned columns.
hive> use test;
hive> create table student (id int, name string, age int,
institute string)
> partitioned by (course string)
> row format delimited
> fields terminated by ',';
Load the data into the table and pass the values of
partition columns with it by using the following
command: -
hive> load data local inpath '/home/<username>/hive/student_details1' into table student
partition(course= "python");
hive> load data local inpath '/home/<username>/hive/student_details1' into table student
partition(course= "Hadoop");
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns
exist within the table. So, it is not required to pass the
values of partitioned columns manually.
hive> use show;
Now you can view the table data with the help
of select command.
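A minimal sketch of dynamic partitioning, assuming a staging table stud_demo that
already contains the course column and a partitioned target table student_part (both
table names are assumptions):
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into student_part partition(course)
select id, name, age, institute, course from stud_demo;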
HiveQL — Operators
The HiveQL operators facilitate performing various
arithmetic and relational operations.
hive> use hql;
hive> create table employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
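A minimal sketch using arithmetic and relational operators, assuming the employee
table above has been populated:
hive> select Id, Name, Salary + 50 from employee where Salary >= 25000;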
Functions in Hive
hive> use hql;
hive> create table employee_data (Id int, Name string , Salary
float)
row format delimited
fields terminated by ',' ;
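A minimal sketch of built-in functions, assuming the employee_data table above has
been populated:
hive> select Id, upper(Name), round(Salary) from employee_data;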
Aggregate Functions
GROUP BY Clause
The HQL GROUP BY clause is used to group the data from
multiple records based on one or more columns. It is
generally used in conjunction with the aggregate functions
(like SUM, COUNT, MIN, MAX and AVG) to perform an
aggregation over each group.
hive> use hql;
hive> create table employee_data (Id int, Name string , Salary
float)
row format delimited
fields terminated by ',' ;
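A minimal sketch of an aggregation with GROUP BY, assuming a table emp with a
department column (the same table is used in the HAVING example below):
hive> select department, sum(salary), count(*) from emp group by department;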
HAVING CLAUSE
The HQL HAVING clause is used with GROUP BY clause.
Its purpose is to apply constraints on the group of data
produced by GROUP BY clause. Thus, it always returns the
data where the condition is TRUE.
Let’s fetch the sum of employee’s salary based on
department having sum >= 35000 by using the
following command:
hive> select department, sum(salary) from emp group by
department having sum(salary)>=35000;
HiveQL — ORDER BY Clause
In HiveQL, the ORDER BY clause performs a complete
ordering of the query result set. Hence, the complete data
is passed through a single reducer. This may take a long
time when executing over large datasets. However, we can
use LIMIT to minimize the sorting time.
hive> use hql;
hive> create table employee_data (Id int, Name string , Salary
float)
row format delimited
fields terminated by ',' ;
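A minimal sketch of ORDER BY, assuming the employee_data table above has been
populated:
hive> select Id, Name, Salary from employee_data order by Salary desc limit 10;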
Apache Pig
Apache Pig is a high-level programming language especially designed for analyzing
large data sets.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce tasks.
Apache Pig has a component known as Pig Engine that accepts the Pig Latin
scripts as input and converts those scripts into MapReduce jobs.
Pig Components
MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o All the queries written using Pig Latin are converted into MapReduce jobs and
these jobs are run on a Hadoop cluster.
o It can be executed against semi-distributed or fully distributed Hadoop
installation.
o Here, the input and output data are present on HDFS.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It
is stored as string and can be used as string and number. int, long, float, double,
chararray, and bytearray are the atomic values of Pig. A piece of data or a simple
atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can
be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{ }’. It is similar to a table in RDBMS, but unlike a
table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented by
‘[ ]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
Apache Pig provides limited opportunity for query optimization, whereas there is more
opportunity for query optimization in SQL.
Apache Pig uses a language called Pig Latin, which was originally created at Yahoo.
Hive uses a language called HiveQL, which was originally created at Facebook.
Command (Local mode) − $ ./pig -x local
Command (MapReduce mode) − $ ./pig -x mapreduce (or simply pig)
Output −
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’ or quit command.
After invoking the Grunt shell, you can execute a Pig script by directly entering the
Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
You can also write the Pig Latin statements in a script file (for example,
sample_script.pig) and execute it in the desired mode as shown below.
Local mode − $ pig -x local sample_script.pig
MapReduce mode − $ pig -x mapreduce sample_script.pig
Datetime − Represents a date-time.
Example: 1970-01-01T00:00:00.000+00:00
Complex Types
− (Subtraction) − Subtracts the right-hand operand from the left-hand operand.
Example: a − b will give −10
?: (Bincond) − Evaluates the Boolean operators. It has three operands as shown below:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30;
if a = 1, the value of b is 20; if a != 1, the value of b is 30.
CASE WHEN THEN ELSE END (Case) − The case operator is equivalent to the nested
bincond operator.
Example:
CASE f2 % 2
WHEN 0 THEN 'even'
WHEN 1 THEN 'odd'
END
Filtering
Sorting
Diagnostic Operators
In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes
large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we have to initially load the data into
Apache Pig.
Student ID First Name Last Name Phone City
We can take an input file in which the columns are separated by tab space, matching
the above schema; in that case there is no need to specify the complete schema (data
types) of the relation.
Input file: We are reading data from the file student_data.txt, which is in the
/pig_data/ directory of HDFS.
Note − The load statement will simply load the data into the specified relation in Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
You can store the loaded data in the file system using the store operator.
Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown
below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage
(',');
Output
After executing the store statement, you will get the following output. A directory is
created with the specified name and the data will be stored in it.
The load statement will simply load the data into the specified relation in Apache
Pig. To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Pig Latin provides four different types of diagnostic operators −
Dump operator
Describe operator
Explanation operator
Illustration operator
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results
on the screen.
It is generally used for debugging Purpose.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name
Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
The syntax of the describe operator is as follows −
grunt> Describe Relation_name
Explain Operator
The explain operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of
statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Group Operator
The GROUP operator is used to group the data in one or more relations. It collects
the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
We can verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation
named group_data as shown below.
Here you can observe that the resulting schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples, student records with
the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the table after grouping the data using
the describe command as shown below.
grunt> Describe group_data;
Try:grunt>explain group_data;
And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING
PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
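The relation customers3 verified below is not defined above; a minimal sketch of how
such a relation is typically produced is a self join, which in Pig requires loading the
data under two aliases (the alias names customers1 and customers2 are assumptions):
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;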
Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
Output
It will produce the following output, displaying the contents of the
relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join
returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B)
based upon the join-predicate. The query compares each row of A with each row of
B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is
satisfied, the column values for each matched pair of rows of A and B are combined
into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two relations customers and orders as
shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
Output
You will get the following output that will display the contents of the relation
named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Note −
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the
relations.
An outer join operation is carried out in three ways −
Left outer join
Right outer join
Full outer join
Left Outer Join
The left outer Join operation returns all rows from the left table, even if there are no
matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using
the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER,
Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as
shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Foreach Operator
The FOREACH operator is used to generate specified data transformations based
on the column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
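A minimal sketch, assuming the student_details relation loaded earlier:
grunt> foreach_data = FOREACH student_details GENERATE id, firstname, city;
grunt> Dump foreach_data;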
Order By Operator
The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
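A minimal sketch, assuming the student_details relation loaded earlier; the second
statement produces the limit_data relation verified below by keeping only the first
four sorted tuples:
grunt> order_by_data = ORDER student_details BY age DESC;
grunt> limit_data = LIMIT order_by_data 4;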
Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;
The Load and Store functions in Apache Pig are used to determine how the data
goes and comes out of Pig. These functions are used with the load and store
operators. Given below is the list of load and store functions available in Pig.
S.N. Function & Description
1. PigStorage() − To load and store structured files.
2. TextLoader() − To load unstructured data into Pig.
3. BinStorage() − To load and store data into Pig using a machine-readable format.
4. Handling Compression − In Pig Latin, we can load and store compressed data.
Apache Pig - User Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDFs). Using these UDFs, we can define our own
functions and use them. The UDF support is provided in six programming languages,
namely Java, Jython, Python, JavaScript, Ruby and Groovy.
You can execute it from the Grunt shell as well using the exec/run command as
shown below.
grunt> exec /sample_script.pig
To Practice in PIG:
run pig in interactive mode where the files are in local file systems
run pig in script mode where files are in local file system
run pig in interactive mode where the files are in hdfs file system
run pig in script mode where files are in hdfs file system
Use HUE to write the Pig scripts to perform order, joins
Wordcount program in a pig script
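A minimal sketch of such a wordcount script (the input path /tmp/input.txt is an
assumption):
lines = LOAD '/tmp/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP wordcount;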
Operations to perform:
1. load
2. dump
3. store
4. group
5. foreach
6. order
7. joins
8. diagnostic operators
In Mapreduce mode:
cd Desktop
$>hadoop fs -copyFromLocal customers.txt /tmp
$>hadoop fs -copyFromLocal orders.txt /tmp
$>pig
grunt>c = load '/tmp/customers.txt' using PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
grunt>dump c;
grunt>od= load '/tmp/orders.txt' using PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);
grunt>dump od;
grunt>g=group c by age;
grunt>dump g;
or
grunt>store g into '/tmp/gg';
grunt>fs -cat /tmp/gg/part-*;
In Local mode:
$ pig -x local
grunt>c = load 'customers.txt' using PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
grunt>dump c;
grunt>od= load 'orders.txt' using PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);
grunt>dump od;