BDA Module 4 - Part 1 (Pig) 2023
Module IV-Part 1
Introduction to Pig
By
Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Chapter 10
Introduction to Pig
• Pig has two components: a scripting language called Pig Latin and the Pig Latin compiler.
• It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
• It provides a language called “Pig Latin” to express data flows. Pig Latin is a high-level language used for writing data processing and analysis programs.
• Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
• By default, Pig reads input files from HDFS. Pig stores the intermediate data (data produced by MapReduce jobs) and the output in HDFS.
• However, Pig can also read input from and write output to other sources.
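As a rough illustration of such a data flow, the sketch below loads a file from HDFS, sorts it, and stores the result back to HDFS. The input path and schema are the ones used in the Join example later in these notes; the output path '/pigdemo/sorted_students' is a hypothetical name chosen only for this example.

```pig
-- Read a tab-separated file from HDFS (PigStorage, the default loader, splits on tabs).
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Sort the students by GPA, highest first.
B = ORDER A BY gpa DESC;

-- Write the result back to HDFS as comma-separated text.
-- '/pigdemo/sorted_students' is an assumed output directory and must not already exist.
STORE B INTO '/pigdemo/sorted_students' USING PigStorage(',');
```

Each statement names a relation (A, B) that the next statement consumes, which is what makes the script read as a flow of data rather than a sequence of procedure calls.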
Apache Pig Components
Parser: Pig scripts are first handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges.
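One way to look at the plans Pig builds from a script is the EXPLAIN command. The sketch below is a minimal example, assuming the student file and schema used in the Join example later in these notes.

```pig
-- Assumed input: the student file used elsewhere in these notes.
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);
B = GROUP A BY gpa;

-- Print the logical, physical, and MapReduce plans that Pig generates for B.
-- No job is run; only the plans are displayed.
EXPLAIN B;
```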
• Ease of Coding: Pig Latin lets you write complex programs in code that is simple and easy to understand and maintain.
• Optimization: Pig Latin encodes tasks in such a way that they can be easily optimised for execution.
• Extensibility: Pig Latin allows you to create your own custom functions, called user-defined functions (UDFs).
Pig Latin Overview
Installing Pig
Complex Data Types
Name    Description
Tuple   An ordered set of fields. Example: (2,3)
Bag     A collection of tuples. Example: {(2,3),(7,5)}
Map     A set of key/value pairs. Example: [open#apache]
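The three complex types can also be declared in a LOAD schema. The sketch below assumes a hypothetical file '/pigdemo/complex.tsv' whose lines hold a tuple, a bag, and a map separated by tabs, for example `(2,3)<TAB>{(2,3),(7,5)}<TAB>[open#apache]`. The field names pair, pairs, and props are illustrative only.

```pig
-- Hypothetical input file; each line holds a tuple, a bag, and a map, tab-separated.
A = LOAD '/pigdemo/complex.tsv'
    AS (pair:tuple(x:int, y:int),
        pairs:bag{t:tuple(x:int, y:int)},
        props:map[]);

-- Show the declared schema of A.
DESCRIBE A;

-- pair.x projects a field out of the tuple, COUNT counts the tuples in the bag,
-- and props#'open' looks up the value stored under the key 'open' in the map.
B = FOREACH A GENERATE pair.x, COUNT(pairs), props#'open';
DUMP B;
```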
Running Pig
1. Interactive Mode.
2. Batch Mode.
Relational Operators
Filter
Find the tuples of those students whose GPA is greater than 4.0.
DUMP B;
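Only the final DUMP survives in these notes, so the following is a minimal sketch of the full script. It assumes relation A is loaded from the student file with the schema shown in the Join example.

```pig
-- Assumed input and schema, taken from the Join example in these notes.
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Keep only the tuples whose gpa field is greater than 4.0.
B = FILTER A BY gpa > 4.0;

-- Print the filtered relation to the console.
DUMP B;
```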
Group
Group tuples of students based on their GPA.
B = GROUP A BY gpa;
DUMP B;
Distinct
B = DISTINCT A;
DUMP B;
Join
To join two relations, namely “student” and “department”, based on the values contained in the “rollno” column.
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
DUMP C;
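The statements that load the department relation and perform the join are missing above. A minimal sketch of the complete join follows; the department file path and its schema (rollno, deptno, deptname) are assumptions made for illustration and are not taken from the original slides.

```pig
-- Student relation, as given in the notes.
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Department relation: hypothetical path and schema.
B = LOAD '/pigdemo/department.tsv' AS (rollno:int, deptno:int, deptname:chararray);

-- Inner join: keep the pairs of tuples whose rollno values match.
C = JOIN A BY rollno, B BY rollno;

-- Each output tuple holds all fields of A followed by all fields of B.
DUMP C;
```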
DUMP B;
Split
DUMP X;
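The SPLIT statement itself is not preserved above. As a sketch under assumed conditions, the script below partitions the student relation into X (GPA of 4.0 or more) and Y (GPA below 4.0); the thresholds are illustrative, not taken from the slides.

```pig
-- Assumed input and schema, taken from the Join example in these notes.
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Route each tuple of A into X or Y according to the condition it satisfies.
-- The conditions here are illustrative assumptions.
SPLIT A INTO X IF gpa >= 4.0, Y IF gpa < 4.0;

DUMP X;
DUMP Y;
```

A tuple goes to every relation whose condition it satisfies, so with overlapping conditions it can appear in more than one output, and a tuple that satisfies none of them is dropped.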
Avg
B = GROUP A BY studname;
DUMP C;
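The LOAD and FOREACH statements are missing from the fragment above. A minimal sketch of the whole script follows, assuming a hypothetical file '/pigdemo/marks.tsv' with the fields studname and marks; only the name studname appears in the original.

```pig
-- Hypothetical input: one line per (student name, marks) pair.
A = LOAD '/pigdemo/marks.tsv' AS (studname:chararray, marks:int);

-- Collect all tuples that share the same student name into one bag.
B = GROUP A BY studname;

-- For each group, emit the name and the average of the marks in its bag.
C = FOREACH B GENERATE group AS studname, AVG(A.marks) AS avgmarks;

DUMP C;
```

Replacing AVG with another built-in aggregate such as MAX or COUNT follows the same pattern.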
Map
MAP represents a key/value pair.
DUMP B;
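A short sketch of reading and dereferencing a map field follows. It assumes a hypothetical file '/pigdemo/studentcity.tsv' whose lines look like `John<TAB>[city#Bangalore]`; the path, field names, and the key 'city' are all illustrative.

```pig
-- Hypothetical input: a name followed by a map written as [key#value].
A = LOAD '/pigdemo/studentcity.tsv' AS (studname:chararray, details:map[]);

-- The # operator looks up the value stored under a key in the map.
B = FOREACH A GENERATE studname, details#'city';

DUMP B;
```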
register '/root/pigdemos/piggybank-0.12.0.jar';
DUMP upper;
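The statements between REGISTER and DUMP are not preserved above. The sketch below is one plausible completion that uses the Piggy Bank string function UPPER on the student names; the input relation and the field it is applied to are assumptions.

```pig
-- Make the Piggy Bank functions available to this script (path as given in the notes).
register '/root/pigdemos/piggybank-0.12.0.jar';

-- Assumed input and schema, taken from the Join example in these notes.
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Call the Piggy Bank function by its fully qualified class name
-- to convert each student name to upper case.
upper = FOREACH A GENERATE org.apache.pig.piggybank.evaluation.string.UPPER(name);

DUMP upper;
```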
Working with Functions in Pig
There are 5 categories of built-in functions in Pig:
1. Eval or Evaluation functions
2. Math functions
3. String functions
4. Bag and Tuple functions
5. Load and Store functions
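The sketch below uses one built-in function from each category on the student relation. The input schema is the one from the Join example; the output path '/pigdemo/functions_out' is a hypothetical name.

```pig
-- Load function: PigStorage reads delimited text (here tab-separated).
A = LOAD '/pigdemo/student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- String function UPPER, math function ROUND, and tuple function TOTUPLE.
B = FOREACH A GENERATE rollno,
                       UPPER(name)           AS uname,
                       ROUND(gpa)            AS gpa_rounded,
                       TOTUPLE(rollno, name) AS id_name;
DUMP B;

-- Eval functions COUNT and AVG, applied to the whole relation as one group.
G = GROUP A ALL;
S = FOREACH G GENERATE COUNT(A) AS students, AVG(A.gpa) AS avg_gpa;
DUMP S;

-- Store function: PigStorage writes delimited text (here comma-separated).
-- The output directory is an assumed name and must not already exist.
STORE B INTO '/pigdemo/functions_out' USING PigStorage(',');
```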
Error handling in Pig
Pig runs all the jobs in a script. To find out which jobs succeeded or failed, use the following:
1. Pig logs record all successful and failed STORE commands.
2. Pig returns a different code on completion for each of these scenarios:
• Return code 0: All jobs succeeded
• Return code 1: Retriable errors
• Return code 2: All jobs failed
• Return code 3: Some jobs failed
Pig vs. Hive