
Apache PIG

By Sravanthi
We are going to discuss the following topics:

Introduction to Pig
Pig Latin
Pig modes
Hive vs Pig
Installation of Pig
Datatypes in Pig
Modes of execution in Pig
Operators in Pig
Pig programming
Eval Functions
Load and store functions
Pig Execution models
Pig UDF
Introduction to PIG

Apache Pig is a tool used to analyze large amounts of data by representing them as data flows.

Using the Pig Latin scripting language, operations like ETL, ad hoc data analysis, and iterative
processing can be achieved easily.

Pig is an abstraction over MapReduce: internally, all Pig scripts are converted into
Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce
applications easier.

Pig was first built at Yahoo! and later became a top-level Apache project.

Pig provides a high-level language that can be used by analysts, data scientists, statisticians, etc.
PIG LATIN

Pig Latin is a command-based language.

Designed specifically for data transformation and flow expression.

The Pig compiler converts Pig Latin into MapReduce jobs.


PIG Modes

Pig has two run modes or exec types:

Local Mode - To run Pig in local mode, you need access to a single machine.
To run the scripts in local mode, no Hadoop or HDFS installation is required.
All files are installed and run from your local host and file system.

$ pig -x local

MapReduce Mode - To run Pig in MapReduce mode, you need access to a Hadoop cluster
and an HDFS installation. Pig submits the generated MapReduce jobs to that cluster.

It can execute against a pseudo-distributed or fully distributed Hadoop installation.

$ pig
or
$ pig -x mapreduce

You can run the Grunt shell, Pig scripts, or embedded programs using either mode.
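
Either mode can also execute a script file passed as an argument (a sketch; myscript.pig is a hypothetical script name):

$ pig -x local myscript.pig
$ pig myscript.pig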
PIG vs Hive
PIG Installation

To install Pig, do the following:

Download the Pig installable (pig-0.15.0.tar.gz) from the link below.


http://www.apache.org/dyn/closer.cgi/pig

Extract the tarball (the files are stored in a newly created directory, pig-0.15.0).
$ tar -xzf pig-0.15.0.tar.gz

Move the extracted directory to the folder where you want Pig to reside, e.g. /usr/local/hadoop/pig.

Update .bashrc to add the following:

export PIG_HOME=/usr/local/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin
PIG Installation

Once Pig is installed, it can be started in local mode ($ pig -x local) or in MapReduce mode ($ pig).


PIG Terminology

Tuple: Ordered set of fields, represented with ( and )


Pig has collection data types, and the tuple is one of them: a tuple contains multiple fields.
A tuple is just like a row in a table; it is a comma-separated list of fields.

(1,john,10000,101)

Bag: Unordered collection of tuples. Represented with { }


{(1,john,10000,101) ,(2,Ram ,50000,102)}

As in an RDBMS, a bag is like a table and a tuple is like a row in the table, but a bag does not
require that all tuples contain the same number of fields.
PIG Data Types
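
Pig's simple data types include int, long, float, double, chararray, boolean, and bytearray; its complex types are tuple, bag, and map. A minimal schema sketch (the file path and field names are illustrative):

emp = LOAD '/MyHDFS/emp.csv' USING PigStorage(',')
    AS (id:int, name:chararray, salary:double, active:boolean);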
LOAD Data

The first step in a dataflow language is to specify the input, which is done using the
LOAD keyword.

Syntax:

alias = LOAD '<dataset path>' [USING function] [AS schema];

Dataset path: the path of the file or directory in HDFS where our dataset resides, given in single quotes

USING function: by default Pig uses PigStorage, which parses each line into fields using a
delimiter (tab by default)

AS: we can assign names to the fields and also declare data types for them.
LOAD Data

weather = LOAD '/MyHDFS/WeatherDataSet.csv' USING PigStorage(',')
    AS (location:chararray, giventime:chararray, tempdetails:chararray, tempvalue:int);
DUMP

DUMP displays the results on the screen.


No action is taken until DUMP or STORE commands are executed

It is only after the DUMP statement that a MapReduce job is initiated.


Once we see our data in the output, we can confirm that it has been loaded successfully.

Until then, Pig will parse, validate, and analyze statements, but not execute them.

Syntax: DUMP weather;


STORE

After processing is complete, the result should be written out somewhere;
Pig provides the STORE statement for this purpose.

STORE weather INTO '/MyHDFS/MyWeatherData';

The result file path is in HDFS.
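
STORE can also take a storage function to control the output format; for example, to write comma-separated output (a sketch; the output path is illustrative):

STORE weather INTO '/MyHDFS/MyWeatherDataCsv' USING PigStorage(',');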


LIMIT

For debugging purposes, if you want to display only a couple of records on screen,
use the LIMIT command. The statement below keeps 10 records from the weather data.

fewrecords = LIMIT weather 10;


FILTER

FILTER is similar to the WHERE clause in SQL; it takes a predicate.
A FILTER statement filters an alias and stores the matching records in a new alias.

newalias = FILTER <alias> BY <condition>;
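
A minimal sketch against the weather alias loaded earlier (the threshold of 30 is illustrative):

hotreadings = FILTER weather BY tempvalue > 30;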


Diagnostic details

DESCRIBE: displays the structure (schema) of an alias

EXPLAIN: displays the execution plan

ILLUSTRATE: shows how the Pig engine transforms the data, giving a step-by-step view of the
execution of a sequence of statements
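
For example, describing the weather alias loaded earlier prints its schema (a sketch; the shape matches the DESCRIBE output shown in the GROUP ALL section below):

grunt> DESCRIBE weather;
weather: {location: chararray,giventime: chararray,tempdetails: chararray,tempvalue: int}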
FOREACH

FOREACH takes a set of expressions and applies them to every record in the data pipeline.

FOREACH gives a simple way to apply transformations based on columns.

In Pig, any time you want to add, remove, or change the data you have in an alias,
you'll use the FOREACH ... GENERATE syntax.

For example, to remove unwanted columns, GENERATE only the ones you need, as in the sketch below.
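
A minimal sketch that keeps only two columns of the weather alias (the alias name locationtemp is illustrative):

locationtemp = FOREACH weather GENERATE location, tempvalue;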


GROUP BY

The group statement collects together records with the same key.

In the weather data set, we can group by location, as shown below.
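
A sketch against the weather alias (the alias name byLocation is illustrative):

byLocation = GROUP weather BY location;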


FOREACH - GROUP BY

Find the maximum temperature in each location, as in the sketch below.
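
A minimal sketch combining GROUP BY and FOREACH; after grouping, the built-in MAX is applied to the bag of weather tuples in each group (the alias names are illustrative):

byLocation = GROUP weather BY location;
maxTemp = FOREACH byLocation GENERATE group AS location, MAX(weather.tempvalue) AS maxtempvalue;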


GROUP ALL

GROUP ALL groups all the tuples into one group.


This is very useful when we need to perform aggregation operations on the entire data set.

For example, if you want to find the sum of all the temperature values:

grunt> weather = LOAD '/MyHDFS/WeatherDataset.csv' USING PigStorage(',')
    AS (location:chararray, giventime:chararray, tempdetails:chararray, tempvalue:int);

grunt> AllTempValues= GROUP weather ALL;

grunt> describe AllTempValues;


AllTempValues: {group: chararray,weather: {(location: chararray,giventime: chararray,tempdetails:
chararray,tempvalue: int)}}

grunt> sumofAllTempValues = FOREACH AllTempValues GENERATE group, SUM(weather.tempvalue);


ORDER BY

The ORDER statement sorts your data, producing a total ordering of the output.
You indicate a key or set of keys by which you wish to order your data.
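
A sketch sorting the weather alias by temperature, highest first (the alias name is illustrative):

sortedWeather = ORDER weather BY tempvalue DESC;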
DISTINCT

The DISTINCT statement is very simple: it removes duplicate records.
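
A sketch that extracts the unique locations from the weather alias, reusing a FOREACH projection (the alias names are illustrative):

locations = FOREACH weather GENERATE location;
uniqueLocations = DISTINCT locations;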


JOIN

JOIN combines data sets so the results can be processed together.


Syntax: JOIN alias1 BY fieldname, alias2 BY fieldname

Left outer join:

A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 LEFT OUTER, B BY $0;

Full outer join:

A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 FULL, B BY $0;
JOIN

moviedetails = LOAD '/MyHDFS/movielens_test/u.data' USING PigStorage('\t')
    AS (userid:int, movieid:int, rating:int, ratingDate:chararray);

movieinfo = LOAD '/MyHDFS/moviedata/u.item' USING PigStorage('|')
    AS (movieid:int, moviename:chararray);

joinresult = JOIN moviedetails BY movieid, movieinfo BY movieid;


Built-in Functions
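
Commonly used built-in eval functions include AVG, COUNT, MAX, MIN, SUM, SIZE, CONCAT, and TOKENIZE; unlike Piggybank functions, they need no REGISTER. A sketch computing the average temperature over the whole weather alias (the alias names are illustrative):

allRecords = GROUP weather ALL;
avgTemp = FOREACH allRecords GENERATE AVG(weather.tempvalue);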
PIGGYBANK Functions

Piggybank is Pig's repository of user-contributed functions.

Piggybank functions are distributed as part of the Pig distribution, but they are not built in.

Piggybank is a place for Pig users to share the Java UDFs they have written for use with Pig.

You must register the Piggybank JAR to use its functions; you can find it in your distribution at
contrib/piggybank/java/piggybank.jar.
PIGGYBANK Functions

To use a function, you need to determine which package it belongs to.


The top-level packages correspond to the function types and currently are:

org.apache.pig.piggybank.comparison - for custom comparators used by the ORDER operator
org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations
org.apache.pig.piggybank.filtering - for functions used in the FILTER operator
org.apache.pig.piggybank.grouping - for grouping functions
org.apache.pig.piggybank.storage - for load/store functions
(The exact package of the function can be seen in the javadocs or by navigating the source tree.)

For example, to use the UPPER function:

REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar;
TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text)
    MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*';
STORE TweetsInaug INTO 'meta/inaug/tweets_inaug';
PIG UDF

Pig allows the use of user-defined functions (UDFs) written in other languages,
including Java, Ruby, Python, and JavaScript.

Create a class in Eclipse that extends EvalFunc and overrides exec(), then package it as a JAR.

Register the JAR:

REGISTER jarname;

Then write the Pig script:

REGISTER ucfirst.jar;
A = LOAD 'sample.txt' AS (logid:chararray);
B = FOREACH A GENERATE myudfs.Ucfirst(logid);
DUMP B;

Here myudfs is the package name and Ucfirst is the class name.


Code (the package declaration matches the myudfs reference in the script above):

package myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Ucfirst extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Nothing to do for an empty tuple
        if (input == null || input.size() == 0)
            return null;
        try {
            // Return the first character of the field, uppercased
            String str = (String) input.get(0);
            return String.valueOf(str.toUpperCase().charAt(0));
        } catch (Exception e) {
            // Rethrow so Pig reports the failure instead of silently swallowing it
            throw new IOException("Ucfirst failed to process input", e);
        }
    }
}
QUESTIONS??
