BDA Module 4 - Part 1 (Pig) 2023

ANALYSING DATA WITH PIG: Introducing Pig: The Pig Architecture, Benefits of Pig, Properties of Pig, running Pig, Getting Started with Pig Latin, Working with Operators in Pig, Debugging Pig, Working with Functions in Pig, Error Handling in Pig.

Big Data Analytics

Module IV-Part 1

Introduction to Pig

By

Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Chapter 10: Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan


Copyright 2015, WILEY INDIA PVT. LTD.
Learning Objectives and Learning Outcomes

Learning Objectives:
1. To study the key features and anatomy of Pig.
2. To study the execution modes of Pig.
3. To study the various relational operators in Pig.

Learning Outcomes:
a) To have an easy comprehension of when to use and when NOT to use Pig.
b) To be able to differentiate between Pig and Hive.
Contents

 What is Pig?
 Key Features of Pig
 Benefits of Pig
 Installation of Pig
 Properties of Pig
 Pig Latin Overview
 Pig Latin Statements
 Pig Latin: Identifiers
 Pig Latin: Comments
 Data Types in Pig
 Simple Data Types
 Complex Data Types
 Running Pig
 Execution Modes of Pig
 Relational Operators
 Eval Function
 Piggy Bank
 When to use Pig?
 When NOT to use Pig?
 Pig versus Hive
What is Pig?

• Apache Pig is a platform for analysing large data sets (gigabytes, terabytes, and beyond), developed at Yahoo in 2006.
• It is an alternative to MapReduce programming.
• Yahoo uses Pig to execute about 40 percent of its Hadoop jobs.
• The Pig platform is specially designed for handling many kinds of data.
• Pig enables users to focus more on what to do than on how to do it.
• The main reasons for using Pig are its ease of use, high performance, and massive scalability.
• Pig usage falls into 3 categories: ETL (Extract, Transform, and Load), research, and interactive data processing.
Features of Pig

• It contains 2 components: a scripting language called Pig Latin and the Pig Latin compiler.
• It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
• It provides a language called "Pig Latin" to express data flows. It is a high-level language used for writing programs for data processing and analysis.
• Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
• It allows users to develop their own functions (User Defined Functions) for reading, processing, and writing data.
The Anatomy of Pig

The main components of Pig are as follows:

• Data flow language (Pig Latin).
• Interactive shell where you can type Pig Latin statements (Grunt).
• Pig interpreter and execution engine.
Pig on Hadoop

• Pig runs on Hadoop.

• Pig uses both the Hadoop Distributed File System (HDFS) and MapReduce programming.

• By default, Pig reads input files from HDFS. Pig stores the
intermediate data (data produced by MapReduce jobs) and the
output in HDFS.

• However, Pig can also read input from and place output to other
sources.
Apache Pig Components

Parser: Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
Benefits of Pig

• Ease of Coding: Pig Latin lets you write complex programs in code that is simple and easy to understand and maintain.
• Optimization: Pig Latin encodes tasks in such a way that they can be easily optimized for execution.
• Extensibility: Pig Latin is designed in such a way that it allows you to create your own custom functions, called user defined functions (UDFs).
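As a hedged sketch of the extensibility point above (the jar name, package, and UDF class here are hypothetical, used only for illustration):

```pig
-- Sketch only: registering and invoking a hypothetical user defined function
register 'myudfs.jar';                           -- hypothetical jar containing the UDF
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate myudfs.Reverse(name);     -- hypothetical UDF applied per tuple
dump B;
```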
Pig Latin Overview

Installing Pig

• Pig can be installed on a UNIX or Windows system.
• Before installing, you need to make sure that you have the following:
 Hadoop (version 0.20.2 onwards)
 Java (version 1.6 onwards)
Running Pig

Pig scripts can be run in the following two modes:

Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

MapReduce Mode - To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
Schema for writing the script

• Before running Pig programs, it is necessary to know about the Pig shell.
• Without the shell, no one can access Pig's built-in features. The Pig shell is known as "Grunt".
• Grunt is an interactive command shell used for scripting in Pig.
• For every scripting language, there is a schema definition that tells everything about the script.
Pig Latin Statements

Pig Latin statements are generally ordered as follows:

1. A LOAD statement reads data from the file system.
2. A series of statements performs transformations.
3. DUMP displays the result on the screen; STORE saves it to the file system.

A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';

Some more Pig Latin statements are as follows:

1. The GROUP operator is used for aggregating input records.
2. The ALL statement is used for aggregating all the tuples into a single group.
3. The FOREACH operator is used for iterations.
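As a hedged sketch (the 'student' file and its fields are the same hypothetical ones used above), GROUP ... ALL and FOREACH can be combined to count every record:

```pig
-- Sketch only: count all tuples using GROUP ... ALL and FOREACH
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = group A all;                    -- collapse all tuples into a single group
C = foreach B generate COUNT(A);    -- COUNT is a built-in eval function
dump C;
```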
Pig Latin Identifiers

Valid identifiers: Y, A1, A1_2014, Sample
Invalid identifiers: 5, Sales$, Sales%, _Sales
Pig Latin Comments

In Pig Latin, two types of comments are supported:

1. Single-line comments, which begin with "--".
2. Multi-line comments, which begin with "/*" and end with "*/".
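Both comment styles can be seen in this short sketch (the file path and field names are assumptions for illustration):

```pig
-- Single-line comment: load the hypothetical student data
/* Multi-line comment:
   the schema below is assumed purely for this example */
A = load 'student' as (rollno:int, name:chararray, gpa:float);
dump A;
```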
Operators in Pig Latin

Arithmetic: +, -, *, /, %
Comparison: ==, !=, <, >, <=, >=
Null: IS NULL, IS NOT NULL
Boolean: AND, OR, NOT
Data Types in Pig Latin

Simple Data Types

Name      Description
int       Whole numbers
long      Large whole numbers
float     Decimals
double    Very precise decimals
chararray Text strings
bytearray Raw bytes
datetime  Date and time values
boolean   true or false

Complex Data Types

Name  Description
Tuple An ordered set of fields. Example: (2,3)
Bag   A collection of tuples. Example: {(2,3),(7,5)}
Map   A set of key/value pairs. Example: [open#apache]
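As a hedged sketch (reusing the hypothetical student schema from earlier examples), the built-in TOTUPLE, TOBAG, and TOMAP functions construct these complex types from simple fields:

```pig
-- Sketch only: building complex types from simple fields
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate
        TOTUPLE(rollno, name) as t,   -- tuple, e.g. (1,John)
        TOBAG(name)           as bg,  -- bag of one-field tuples, e.g. {(John)}
        TOMAP('name', name)   as m;   -- map, e.g. [name#John]
dump B;
```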
Running Pig

Pig can run in two ways:

1. Interactive Mode.
2. Batch Mode.
Relational Operators

Filter

Find the tuples of those students whose GPA is greater than 4.0.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
DUMP B;
FOREACH

Display the name of all students in uppercase.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

B = foreach A generate UPPER (name);

DUMP B;
Group

Group the tuples of students based on their GPA.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = GROUP A BY gpa;
DUMP B;
Distinct

To remove duplicate tuples of students.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

B = DISTINCT A;

DUMP B;
Join

To join two relations, namely "student" and "department", based on the values contained in the "rollno" column.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int, deptno:int, deptname:chararray);
C = JOIN A BY rollno, B BY rollno;
DUMP C;
Split

To partition a relation based on the GPAs acquired by the students:

 If GPA = 4.0, place the tuple into relation X.
 If GPA is < 4.0, place the tuple into relation Y.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa == 4.0, Y IF gpa < 4.0;
DUMP X;
DUMP Y;
Avg

To calculate the average marks for each student.

A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, AVG(A.marks);
DUMP C;
Max

To calculate the maximum marks for each student.

A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, MAX(A.marks);
DUMP C;
Map

MAP represents a set of key/value pairs.

To depict the complex data type "map", given input data such as:

John  [city#Bangalore]
Jack  [city#Pune]
James [city#Chennai]

A = load '/root/pigdemos/studentcity.tsv' USING PigStorage() as (studname:chararray, m:map[chararray]);
B = foreach A generate m#'city' as CityName:chararray;
DUMP B;
Piggy Bank

To use the Piggy Bank UPPER function:

register '/root/pigdemos/piggybank-0.12.0.jar';
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
upper = foreach A generate org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP upper;
Working with Functions in Pig

There are 5 categories of built-in functions in Pig:

1. Eval (evaluation) functions
2. Math functions
3. String functions
4. Bag and Tuple functions
5. Load and Store functions
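As a hedged sketch (again assuming the hypothetical student schema used throughout), a single FOREACH can mix built-in functions from different categories:

```pig
-- Sketch only: built-in functions from the string and math categories
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate
        UPPER(name),   -- string function: uppercase the name
        ROUND(gpa),    -- math function: round the GPA
        ABS(rollno);   -- math function: absolute value
dump B;
```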
Error Handling in Pig

Pig runs all the jobs, but to know which jobs have succeeded or failed, it uses the following:

1. Pig logs encapsulate all successful and failed store commands.
2. Pig returns different codes upon completion:
• Return code 0 - all jobs succeeded.
• Return code 1 - retriable errors.
• Return code 2 - all jobs failed.
• Return code 3 - some jobs failed.
Pig vs. Hive

Features             Pig                            Hive
Used By              Programmers and Researchers    Analysts
Used For             Programming                    Reporting
Language             Procedural data flow language  SQL-like
Suitable For         Semi-structured data           Structured data
Schema / Types       Explicit                       Implicit
UDF Support          YES                            YES
Join / Order / Sort  YES                            YES
DFS Direct Access    YES (Implicit)                 YES (Explicit)
Web Interface        YES                            NO
Partitions           YES                            NO
Shell                YES                            YES
Further Readings

http://pig.apache.org/docs/r0.12.0/index.html
http://www.edureka.co/blog/introduction-to-pig/