Big Data Notes Pig
Introduction to Pig:
Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce and offers a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into a series of map and reduce tasks, but these tasks are not visible to the programmers, which is what provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
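As a quick illustration, a typical Pig Latin script looks like the following (the file names and paths here are made up for the example); the Pig Engine turns these three statements into MapReduce jobs behind the scenes.
grunt> logs = LOAD 'hdfs://localhost:9000/pig_data/logs.txt' USING PigStorage(',') AS (user:chararray, bytes:int);
grunt> big = FILTER logs BY bytes > 1000;
grunt> STORE big INTO 'hdfs://localhost:9000/pig_output/big_logs' USING PigStorage(',');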
Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces the development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be expressed in only about 10 lines of Pig Latin. Programmers who already know SQL need less effort to learn Pig Latin.
Features of Pig:
1) Ease of programming
It uses a multi-query approach, which reduces the length of the code.
Pig Latin is an SQL-like language.
It provides many built-in operators.
It provides nested data types (tuples, bags, maps).
2) Optimization opportunities
The way tasks are encoded allows the system to optimize their execution automatically.
3) Extensibility
Users can create their own functions (UDFs) for special-purpose processing.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators such as sort, filter and join (see the example after this list).
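A minimal sketch of these operators at the Grunt shell, assuming two hypothetical input files employees.txt and departments.txt:
grunt> emp = LOAD 'employees.txt' USING PigStorage(',') AS (id:int, name:chararray, deptid:int, salary:int);
grunt> dept = LOAD 'departments.txt' USING PigStorage(',') AS (deptid:int, deptname:chararray);
grunt> high = FILTER emp BY salary > 40000;            -- filter
grunt> sorted = ORDER high BY salary DESC;             -- sort
grunt> joined = JOIN sorted BY deptid, dept BY deptid; -- join
grunt> DUMP joined;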
Application of pig:
Pig Architecture:
The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The components of the Apache Pig architecture are described below.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
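These stages can be inspected from the Grunt shell with Pig's EXPLAIN operator, which prints the logical, physical and MapReduce plans for a relation. A small sketch (the file name is made up for the example):
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> EXPLAIN student;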
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered collection of tuples, possibly with duplicates; each tuple can have a different number of fields. A bag is represented by ‘{}’.
Example − {(Raja, 30), (Mohammad, 45)}
Map
A map (or data map) is a set of key-value pairs. The key needs to be
of type chararray and should be unique. The value might be of any
type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is
no guarantee that tuples are processed in any particular order).
Complex data types are built up from the simple (atomic) types. The complex data types in Pig are summarized below −
Type    Definition                          Code Example
Tuple   An ordered set of fields            (Raja, 30)
Bag     An unordered collection of tuples   {(Raja, 30), (Mohammad, 45)}
Map     A set of key-value pairs            [name#Raja, age#30]
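A small sketch of how these types appear in a schema, assuming a hypothetical input file details.txt:
grunt> details = LOAD 'details.txt' AS (name:chararray,
           scores:bag{t:tuple(subject:chararray, marks:int)},
           info:map[]);
grunt> DESCRIBE details;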
Defining schema:
In the previous section, we saw how to load data into Apache Pig. The loaded data can be written back to the file system using the STORE operator. This section explains how to store data in Apache Pig using the STORE operator.
Syntax:
Given below is the syntax of the Store statement.
STORE Relation_name INTO 'required_directory_path' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following
content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator
as shown below.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS
directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/'
USING PigStorage(',');
Output
After executing the store statement, you will get the following output.
A directory is created with the specified name and the data will be
stored in it.
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt
Features
2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05
13:05:05 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime
AvgMapTime MedianMapTime
job_14459_06 1 0 n/a n/a n/a n/a
MaxReduceTime MinReduceTime AvgReduceTime
MedianReducetime Alias Feature
0 0 0 0 student MAP_ONLY
Output folder
hdfs://localhost:9000/pig_Output/
Input(s): Successfully read 0 records from:
"hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/pig_Output"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1443519499159_0006
2015-10-05 13:06:06,192 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Verification
You can verify the stored data as shown below.
Step 1
First of all, list out the files in the directory named pig_Output using the ls command as shown below.
$ hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing
the store statement.
Step 2
Using the cat command, list the contents of the file named part-m-00000 as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
In MapReduce mode, Pig reads (loads) data from HDFS and stores
the results back in HDFS. Therefore, let us start HDFS and create the
following sample data in HDFS.
Student ID   First Name   Last Name     Phone        City
001          Rajiv        Reddy         9848022337   Hyderabad
002          siddarth     Battacharya   9848022338   Kolkata
003          Rajesh       Khanna        9848022339   Delhi
004          Preethi      Agarwal       9848022330   Pune
005          Trupthi      Mohanthy      9848022336   Bhuwaneshwar
006          Archana      Mishra        9848022335   Chennai
The above dataset contains personal details like id, first name, last name, phone number and city of six students.
Browse through the sbin directory of Hadoop and start yarn and
Hadoop dfs (distributed file system) as shown below.
$ cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
Step 3: Create a Directory in HDFS
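A quick sketch of this step (the local path /home/Hadoop/student_data.txt is an assumption for the example; the HDFS paths are the ones used throughout these notes):
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data/
$ hdfs dfs -put /home/Hadoop/student_data.txt hdfs://localhost:9000/pig_data/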
Now load the data from the file student_data.txt into Pig by
executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray,
lastname:chararray, phone:chararray,
city:chararray );
Relation name − We have stored the data in the schema student.
Input file path − We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Pig operator:
Features of Hive:
Architecture of Hive:
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as Thrift clients, the JDBC driver, and the ODBC driver.
Hive Services
Hive shell:
HiveQL:
JDBC Program
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
JDBC Program
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
con.close();
}
}
$ javac HiveDropDb.java
$ java HiveDropDb
Output:
Drop userdb database successful.
Syntax
Example
Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   string
If you add the option IF NOT EXISTS, Hive ignores the statement
in case the table already exists.
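The statement that produces the output below is presumably the same CREATE TABLE used in the JDBC program further down, i.e. something along these lines:
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
      salary String, designation String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;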
OK
Time taken: 5.905 seconds
hive>
JDBC Program
Given below is the JDBC program to create the employee table. The connection URL, user name and password are assumed defaults for a local HiveServer2 and may differ in your setup.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register the Hive JDBC driver and open a connection (assumed local HiveServer2)
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/userdb", "", "");
      Statement stmt = con.createStatement();

      // execute statement
      stmt.execute("CREATE TABLE IF NOT EXISTS "
         + " employee ( eid int, name String, "
         + " salary String, designation String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\\t'"
         + " LINES TERMINATED BY '\\n'"
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}
$ javac HiveCreateTable.java
$ java HiveCreateTable
Output
Syntax
Example
We will insert the following data into the table. It is a text file
named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin
The following query loads the given text into the table.
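Presumably the query is a LOAD DATA statement along these lines (the LOCAL keyword is used because sample.txt lives on the local file system):
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' INTO TABLE employee;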
OK
Time taken: 15.905 seconds
hive>
JDBC Program
Given below is the JDBC program to load given data into the
table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
$ javac HiveLoadData.java
$ java HiveLoadData
Output:
Data types:
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data types; the default is INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT. The postfix used for a literal of each type is listed below (see the example after the table).
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L
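For instance, the postfixes can be applied directly to literals (this sketch assumes Hive 0.13 or later, which allows SELECT without a FROM clause):
hive> SELECT 10Y, 10S, 10, 10L;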
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
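A short sketch of how these appear in a table definition (the table name str_demo is made up for the example):
hive> CREATE TABLE str_demo (code CHAR(10), description VARCHAR(100));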
Timestamp
Decimals
DECIMAL(precision, scale)
decimal(10,0)
Union Types
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately −10^−308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive hold key-value pairs, similar to Java maps.
Syntax: MAP<primitive_type, data_type>
Structs
A struct groups a fixed set of named fields, each with its own type.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
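A minimal sketch putting the complex types together (the table name complex_demo and its columns are made up for the example):
hive> CREATE TABLE complex_demo (
         name     STRING,
         subjects ARRAY<STRING>,
         marks    MAP<STRING, INT>,
         address  STRUCT<city:STRING, pin:INT>
      );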
Operations in Hive:
Hive provides built-in mathematical functions such as the following.
Return Type   Signature                 Description
BIGINT        floor(double a)           It returns the maximum BIGINT value that is equal to or less than the double.
double        rand(), rand(int seed)    It returns a random number that changes from row to row.
Example
round() function − rounds the value to the nearest whole number (e.g. 3.0).
floor() function − returns the largest whole number not greater than the value.
ceil() function − returns the smallest whole number not less than the value (e.g. 3.0).
A runnable sketch of these three functions is shown below.
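The literal 2.6 is chosen only for illustration, and the queries assume Hive 0.13 or later (SELECT without a FROM clause):
hive> SELECT round(2.6);
3.0
hive> SELECT floor(2.6);
2
hive> SELECT ceil(2.6);
3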
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage of these functions is the same as the SQL aggregate functions.
Return Type   Signature    Description
BIGINT        count(*)     Returns the total number of retrieved rows.
DOUBLE        sum(col)     Returns the sum of the elements in the group.
DOUBLE        avg(col)     Returns the average of the elements in the group.
An example using the employee table is shown after this table.
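A small sketch against the employee table created earlier (the output depends on the rows actually loaded):
hive> SELECT designation, count(*) FROM employee GROUP BY designation;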
Hive vs RDBMS:
RDBMS                                                   Hive
It is used to maintain a database.                      It is used to maintain a data warehouse.
It uses SQL (Structured Query Language).                It uses HQL (Hive Query Language).
Schema is fixed when data is written (schema on write). Schema is applied when data is read (schema on read).
It is suited to transactional (OLTP) workloads.         It is suited to analytical (OLAP) workloads.
Example of Hive: