Apache PIG
By Sravanthi
We are going to discuss the following topics:
Introduction to Pig
Pig Latin
Pig modes
Hive Vs Pig
Installation of Pig
Datatypes in Pig
Modes of execution in Pig
Operators in Pig
Pig programming
Eval Functions
Load and store functions
Pig Execution models
Pig UDF
Introduction to PIG
Apache Pig is a tool used to analyze large amounts of data by representing them as data flows.
Using the Pig Latin scripting language, operations like ETL, ad hoc data analysis and iterative
processing can be easily achieved.
Pig is an abstraction over MapReduce. Internally, all Pig scripts are converted into
Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce
applications easier.
Pig was first built at Yahoo! and later became a top-level Apache project.
Pig provides a high-level language that can be used by analysts, data scientists, statisticians etc.
PIG Modes
Local Mode - To run Pig in local mode, you need access to a single machine.
To run the scripts in local mode, no Hadoop or HDFS installation is required.
All files are installed and run from your local host and file system.
$ pig -x local
MapReduce Mode - To run Pig in MapReduce mode, you need access to a Hadoop cluster
and an HDFS installation; Pig submits the generated Map and Reduce jobs to that cluster.
It can execute against a pseudo-distributed or fully distributed Hadoop installation.
$ pig
or
$ pig -x mapreduce
You can run the Grunt shell, Pig scripts, or embedded programs using either mode.
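For example, a script file can be submitted in either mode (the script name here is a placeholder):
$ pig -x local myscript.pig
$ pig -x mapreduce myscript.pig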
PIG vs Hive
Pig uses Pig Latin, a procedural dataflow language, while Hive uses HiveQL, a declarative SQL-like language.
Pig is typically used by programmers for ETL and data pipelines, while Hive is typically used by analysts for reporting and ad hoc queries.
Both run on top of Hadoop and compile their scripts and queries into MapReduce jobs.
PIG Installation
Download the Pig release tarball and extract it (the files are extracted into a newly created directory, e.g. pig-0.15.0):
$ tar -xzf pig-0.15.0.tar.gz
export PIG_HOME=/usr/local/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin
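Assuming the extracted directory was moved to /usr/local/hadoop/pig as above, the installation can be checked with:
$ pig -version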
PIG DATA Types
A tuple is an ordered set of fields, for example: (1,john,10000,101)
A bag is a collection of tuples. As in an RDBMS, a bag is like a table and a tuple is like a row in a table, but a bag does not
require that all tuples contain the same number of fields.
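For example, the following bag is valid even though its tuples have different numbers of fields (illustrative values):
{(1,john,10000,101),(2,mary,20000)}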
PIG DATA Types
Scalar types: int, long, float, double, chararray (string), bytearray, boolean, datetime.
Complex types: tuple (an ordered set of fields), bag (a collection of tuples), map (a set of key/value pairs).
LOAD Data
The first step in a dataflow language is to specify the input, which is done using the
LOAD keyword.
Syntax: alias = LOAD 'dataset_path' [USING function] [AS schema];
Dataset path: the name of the HDFS directory (or file) where our dataset resides, given in single quotes.
USING function: by default Pig uses PigStorage, which parses each line into fields using a delimiter;
by default it uses the tab delimiter.
AS: we can assign names to fields and also declare datatypes for the fields.
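For example, a minimal sketch; the file name and field names are illustrative, and later slides refer to this weather alias:
weather = LOAD 'weather_data' USING PigStorage(',') AS (city:chararray, temp:int);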
LOAD Data
Pig will parse, validate and analyze statements but not execute them until output is requested.
After we have completed processing, the result should be written somewhere;
Pig provides the STORE statement for this purpose.
For debugging purposes, if you want to display only a couple of records on screen,
use the LIMIT command. The example below displays 10 records from the previous weather data.
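A sketch using the weather alias loaded earlier (the output path is illustrative):
-- write the results to HDFS
STORE weather INTO 'weather_out' USING PigStorage(',');
-- display only the first 10 records on screen
top10 = LIMIT weather 10;
DUMP top10;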
FILTER
FILTER is similar to the WHERE clause in SQL; a filter contains a predicate.
The statement below filters the alias weather and stores the result in a new alias.
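A sketch of such a filter, assuming the weather schema from the LOAD example:
-- keep only records with a temperature above 30
hot_days = FILTER weather BY temp > 30;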
FOREACH
FOREACH takes a set of expressions and applies them to every record in the data pipeline.
In Pig, any time you want to add, remove, or change the data you have in an alias,
you'll use the FOREACH ... GENERATE syntax.
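For example, a sketch that derives a Fahrenheit temperature (field names assumed from the LOAD example):
temp_f = FOREACH weather GENERATE city, temp * 9.0 / 5.0 + 32 AS temp_fahrenheit;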
GROUP
The GROUP statement collects together records with the same key.
For example, if you want to find the sum of all the temp values per group:
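A sketch building on the assumed weather alias; GROUP creates a bag named after the grouped alias, which SUM then iterates over:
by_city = GROUP weather BY city;
temp_sums = FOREACH by_city GENERATE group AS city, SUM(weather.temp) AS total_temp;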
ORDER
The ORDER statement sorts your data, producing a total ordering of your output data.
You indicate a key or set of keys by which you wish to order your data.
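For example (descending by temperature, using the assumed weather alias):
sorted = ORDER weather BY temp DESC;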
DISTINCT
The DISTINCT statement removes duplicate records; note that it works on entire records, not on individual fields.
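A sketch that finds the unique cities in the assumed weather data:
cities = FOREACH weather GENERATE city;
unique_cities = DISTINCT cities;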
Piggybank functions are distributed as part of the Pig distribution, but they are not built in.
Piggybank is a place for Pig users to share the Java UDFs they have written for use with Pig.
You must register the Piggybank JAR to use these functions, which you can find in your distribution at
contrib/piggybank/java/piggybank.jar.
PIGGYBANK Functions
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar;
TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text)
    MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*';
STORE TweetsInaug INTO 'meta/inaug/tweets_inaug';
PIG UDF
Create a class in Eclipse (or any IDE) that extends EvalFunc and overrides the exec() method, then package it into a JAR file.
Register the JAR in your script: REGISTER jarname;
REGISTER ucfirst.jar;
A = LOAD 'sample.txt' AS (logid:chararray);
B = FOREACH A GENERATE myudfs.Ucfirst(logid);
DUMP B;
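A minimal sketch of what the myudfs.Ucfirst class might look like, assuming it uppercases the first character of its chararray input (the class name comes from the example above; the implementation details are illustrative, not the original code):

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Ucfirst extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty input tuples
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            if (str == null || str.isEmpty())
                return str;
            // Uppercase the first character, leave the rest unchanged
            return str.substring(0, 1).toUpperCase() + str.substring(1);
        } catch (Exception e) {
            // Return null for malformed records instead of failing the job
            return null;
        }
    }
}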
QUESTIONS??