BDA Module 4 - Part 1 (Pig) 2023

ANALYSING DATA WITH PIG: Introducing Pig: The Pig Architecture, Benefits of Pig, Properties of Pig, running Pig, Getting Started with Pig Latin, Working with Operators in Pig, Debugging Pig, Working with Functions in Pig, Error Handling in Pig.

Big Data Analytics

Module IV-Part 1

Introduction to Pig

By

Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Chapter 10: Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan


Copyright 2015, WILEY INDIA PVT. LTD.
Learning Objectives and Learning Outcomes

Learning Objectives:
1. To study the key features and anatomy of Pig.
2. To study the execution modes of Pig.
3. To study the various relational operators in Pig.

Learning Outcomes:
a) To have an easy comprehension of when to use and when NOT to use Pig.
b) To be able to differentiate between Pig and Hive.
Contents

 What is Pig?
 Key Features of Pig
 Benefits of Pig
 Installation of Pig
 Properties of Pig
 Pig Latin Overview
 Pig Latin Statements
 Pig Latin: Identifiers
 Pig Latin: Comments
 Data Types in Pig
 Simple Data Types
 Complex Data Types
 Running Pig
 Execution Modes of Pig
 Relational Operators
 Eval Function
 Piggy Bank
 When to use Pig?
 When NOT to use Pig?
 Pig versus Hive
What is Pig?

• Apache Pig is a platform for analysing large data sets (gigabytes, terabytes, and beyond), developed at Yahoo in 2006.
• It is an alternative to MapReduce programming.
• Yahoo uses Pig to execute about 40 percent of its Hadoop jobs.
• The Pig platform is specially designed for handling many kinds of data.
• Pig enables users to focus more on what to do than on how to do it.
• The main reasons for using Pig are its ease of use, high performance, and massive scalability.
• Pig usage falls into 3 categories: ETL (Extract, Transform, and Load), research, and interactive data processing.
Features of Pig

• It contains 2 components: a scripting language called Pig Latin and the Pig Latin compiler.
• It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
• It provides a language called "Pig Latin" to express data flows. It is a high-level language used for writing programs for data processing and analysis.
• Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
• It allows users to develop their own functions (User Defined Functions) for reading, processing, and writing data.
The Anatomy of Pig

The main components of Pig are as follows:

• Data flow language (Pig Latin).
• Interactive shell where you can type Pig Latin statements (Grunt).
• Pig interpreter and execution engine.
Pig on Hadoop

• Pig runs on Hadoop.

• Pig uses both the Hadoop Distributed File System (HDFS) and MapReduce programming.

• By default, Pig reads input files from HDFS. Pig stores the
intermediate data (data produced by MapReduce jobs) and the
output in HDFS.

• However, Pig can also read input from and place output to other
sources.
Apache Pig Components

Parser: Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
Benefits of Pig

• Ease of Coding: Pig Latin lets you write complex programs in code that is simple and easy to understand and maintain.
• Optimization: Pig Latin encodes tasks in such a way that they can be easily optimized for execution.
• Extensibility: Pig Latin is designed in such a way that it allows you to create your own custom functions, called user defined functions (UDFs).
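As a hedged sketch of the extensibility point above (the jar name, package, and UDF class here are hypothetical, used only for illustration):

```pig
-- Sketch only: registering and invoking a hypothetical user defined function
register 'myudfs.jar';                           -- hypothetical jar containing the UDF
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate myudfs.Reverse(name);     -- hypothetical UDF applied per tuple
dump B;
```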
Pig Latin Overview

Installing Pig

• Pig can be installed on a UNIX or Windows system.
• Before installing, you need to make sure that you have the following:
 Hadoop (version 0.20.2 onwards)
 Java (version 1.6 onwards)
Running Pig

Pig scripts can be run in the following two modes:

Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

MapReduce Mode - To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
Schema for writing the script

• Before running Pig programs, it is necessary to know about the Pig shell.
• Without the shell, no one can access Pig's built-in features. The Pig shell is known as "Grunt".
• Grunt is an interactive command shell used for scripting in Pig.
• For every scripting language, there is a schema definition that tells everything about the script.
Pig Latin Statements

Pig Latin statements are generally ordered as follows:

1. A LOAD statement reads data from the file system.
2. A series of statements performs transformations.
3. DUMP displays the result on the screen; STORE saves it to the file system.

A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';

Some more Pig Latin statements are as follows:

1. The GROUP operator is used for aggregating input records.
2. The ALL statement is used for aggregating all the tuples into a single group.
3. The FOREACH operator is used for iterations.
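As a hedged sketch (the 'student' file and its fields are the same hypothetical ones used above), GROUP ... ALL and FOREACH can be combined to count every record:

```pig
-- Sketch only: count all tuples using GROUP ... ALL and FOREACH
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = group A all;                    -- collapse all tuples into a single group
C = foreach B generate COUNT(A);    -- COUNT is a built-in eval function
dump C;
```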
Pig Latin Identifiers

Valid identifiers: Y, A1, A1_2014, Sample
Invalid identifiers: 5, Sales$, Sales%, _Sales
Pig Latin Comments

In Pig Latin, two types of comments are supported:

1. Single-line comments, which begin with "--".
2. Multi-line comments, which begin with "/*" and end with "*/".
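Both comment styles can be seen in this short sketch (the file path and field names are assumptions for illustration):

```pig
-- Single-line comment: load the hypothetical student data
/* Multi-line comment:
   the schema below is assumed purely for this example */
A = load 'student' as (rollno:int, name:chararray, gpa:float);
dump A;
```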
Operators in Pig Latin

Arithmetic: +, -, *, /, %
Comparison: ==, !=, <, >, <=, >=
Null: IS NULL, IS NOT NULL
Boolean: AND, OR, NOT
Data Types in Pig Latin

Simple Data Types

Name      Description
int       Whole numbers
long      Large whole numbers
float     Decimals
double    Very precise decimals
chararray Text strings
bytearray Raw bytes
datetime  Date and time values
boolean   true or false

Complex Data Types

Name  Description
Tuple An ordered set of fields. Example: (2,3)
Bag   A collection of tuples. Example: {(2,3),(7,5)}
Map   A set of key/value pairs. Example: [open#apache]
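As a hedged sketch (reusing the hypothetical student schema from earlier examples), the built-in TOTUPLE, TOBAG, and TOMAP functions construct these complex types from simple fields:

```pig
-- Sketch only: building complex types from simple fields
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate
        TOTUPLE(rollno, name) as t,   -- tuple, e.g. (1,John)
        TOBAG(name)           as bg,  -- bag of one-field tuples, e.g. {(John)}
        TOMAP('name', name)   as m;   -- map, e.g. [name#John]
dump B;
```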
Running Pig

Pig can run in two ways:

1. Interactive Mode.
2. Batch Mode.
Relational Operators

Filter

Find the tuples of those students whose GPA is greater than 4.0.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
DUMP B;
FOREACH

Display the name of all students in uppercase.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

B = foreach A generate UPPER (name);

DUMP B;
Group

Group the tuples of students based on their GPA.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = GROUP A BY gpa;
DUMP B;
Distinct

To remove duplicate tuples of students.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

B = DISTINCT A;

DUMP B;
Join

To join two relations, namely "student" and "department", based on the values contained in the "rollno" column.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int, deptno:int, deptname:chararray);
C = JOIN A BY rollno, B BY rollno;
DUMP C;
Split

To partition a relation based on the GPAs acquired by the students:

 If GPA = 4.0, place the tuple into relation X.
 If GPA is < 4.0, place the tuple into relation Y.

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa == 4.0, Y IF gpa < 4.0;
DUMP X;
DUMP Y;
Avg

To calculate the average marks for each student.

A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, AVG(A.marks);
DUMP C;
Max

To calculate the maximum marks for each student.

A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, MAX(A.marks);
DUMP C;
Map

MAP represents a set of key/value pairs.

To depict the complex data type "map", given input data such as:

John  [city#Bangalore]
Jack  [city#Pune]
James [city#Chennai]

A = load '/root/pigdemos/studentcity.tsv' USING PigStorage() as (studname:chararray, m:map[chararray]);
B = foreach A generate m#'city' as CityName:chararray;
DUMP B;
Piggy Bank

To use the Piggy Bank UPPER function:

register '/root/pigdemos/piggybank-0.12.0.jar';
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
upper = foreach A generate org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP upper;
Working with Functions in Pig

There are 5 categories of built-in functions in Pig:

1. Eval (evaluation) functions
2. Math functions
3. String functions
4. Bag and Tuple functions
5. Load and Store functions
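As a hedged sketch (again assuming the hypothetical student schema used throughout), a single FOREACH can mix built-in functions from different categories:

```pig
-- Sketch only: built-in functions from the string and math categories
A = load 'student' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate
        UPPER(name),   -- string function: uppercase the name
        ROUND(gpa),    -- math function: round the GPA
        ABS(rollno);   -- math function: absolute value
dump B;
```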
Error Handling in Pig

Pig runs all the jobs, but to know which jobs have succeeded or failed, it uses the following:

1. Pig logs encapsulate all successful and failed store commands.
2. Pig returns different codes upon completion:
• Return code 0 - all jobs succeeded.
• Return code 1 - retriable errors.
• Return code 2 - all jobs failed.
• Return code 3 - some jobs failed.
Pig vs. Hive

Features             Pig                            Hive
Used By              Programmers and Researchers    Analysts
Used For             Programming                    Reporting
Language             Procedural data flow language  SQL-like
Suitable For         Semi-structured data           Structured data
Schema / Types       Explicit                       Implicit
UDF Support          YES                            YES
Join / Order / Sort  YES                            YES
DFS Direct Access    YES (Implicit)                 YES (Explicit)
Web Interface        YES                            NO
Partitions           YES                            NO
Shell                YES                            YES
Further Readings

http://pig.apache.org/docs/r0.12.0/index.html
http://www.edureka.co/blog/introduction-to-pig/