Apache PIG
By Sravanthi
We are going to discuss the following topics:
Introduction to Pig
Pig Latin
Pig modes
Hive Vs Pig
Installation of Pig
Datatypes in Pig
Modes of execution in Pig
Operators in Pig
Pig programming
Eval Functions
Load and store functions
Pig Execution models
Pig UDF
Introduction to PIG
Apache Pig is a tool used to analyze large amounts of data by representing them as data flows.
Using the Pig Latin scripting language, operations like ETL, ad hoc data analysis and iterative
processing can be easily achieved.
Pig is an abstraction over MapReduce. Internally, all Pig scripts are converted into
Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce
applications easier.
Pig was first built at Yahoo! and later became a top-level Apache project.
Pig provides a high-level language that can be used by analysts, data scientists, statisticians etc.
PIG Modes
Local Mode - To run Pig in local mode, you need access to a single machine.
To run the scripts in local mode, no Hadoop or HDFS installation is required.
All files are installed and run from your local host and file system.
$ pig -x local
MapReduce Mode - To run Pig in MapReduce mode, you need access to a Hadoop cluster
and an HDFS installation; Pig submits the generated Map and Reduce jobs to that cluster.
It can execute against a pseudo-distributed or fully distributed Hadoop installation.
$ pig
or
$ pig -x mapreduce
You can run the Grunt shell, Pig scripts, or embedded programs using either mode.
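For example, a script file can be submitted in either mode (the script name here is a placeholder):
$ pig -x local myscript.pig
$ pig -x mapreduce myscript.pig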
PIG vs Hive
Pig uses Pig Latin, a procedural dataflow language, while Hive uses HiveQL, a declarative SQL-like language.
Pig is typically used by programmers for ETL and data pipelines, while Hive is typically used by analysts for reporting and ad hoc queries.
Both run on top of Hadoop and compile their scripts and queries into MapReduce jobs.
PIG Installation
Download the Pig release tarball and extract it (the files are extracted into a newly created directory, e.g. pig-0.15.0):
$ tar -xzf pig-0.15.0.tar.gz
export PIG_HOME=/usr/local/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin
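Assuming the extracted directory was moved to /usr/local/hadoop/pig as above, the installation can be checked with:
$ pig -version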
PIG DATA Types
A tuple is an ordered set of fields, for example: (1,john,10000,101)
A bag is a collection of tuples. As in an RDBMS, a bag is like a table and a tuple is like a row in a table, but a bag does not
require that all tuples contain the same number of fields.
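For example, the following bag is valid even though its tuples have different numbers of fields (illustrative values):
{(1,john,10000,101),(2,mary,20000)}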
PIG DATA Types
Scalar types: int, long, float, double, chararray (string), bytearray, boolean, datetime.
Complex types: tuple (an ordered set of fields), bag (a collection of tuples), map (a set of key/value pairs).
LOAD Data
The first step in a dataflow language is to specify the input, which is done using the
LOAD keyword.
Syntax: alias = LOAD 'dataset_path' [USING function] [AS schema];
Dataset path: the name of the HDFS directory (or file) where our dataset resides, given in single quotes.
USING function: by default Pig uses PigStorage, which parses each line into fields using a delimiter;
by default it uses the tab delimiter.
AS: we can assign names to fields and also declare datatypes for the fields.
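For example, a minimal sketch; the file name and field names are illustrative, and later slides refer to this weather alias:
weather = LOAD 'weather_data' USING PigStorage(',') AS (city:chararray, temp:int);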
LOAD Data
Pig will parse, validate and analyze statements but not execute them until output is requested.
After we have completed processing, the result should be written somewhere;
Pig provides the STORE statement for this purpose.
For debugging purposes, if you want to display only a couple of records on screen,
use the LIMIT command. The example below displays 10 records from the previous weather data.
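A sketch using the weather alias loaded earlier (the output path is illustrative):
-- write the results to HDFS
STORE weather INTO 'weather_out' USING PigStorage(',');
-- display only the first 10 records on screen
top10 = LIMIT weather 10;
DUMP top10;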
FILTER
FILTER is similar to the WHERE clause in SQL; a filter contains a predicate.
The statement below filters the alias weather and stores the result in a new alias.
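A sketch of such a filter, assuming the weather schema from the LOAD example:
-- keep only records with a temperature above 30
hot_days = FILTER weather BY temp > 30;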
FOREACH
FOREACH takes a set of expressions and applies them to every record in the data pipeline.
In Pig, any time you want to add, remove, or change the data you have in an alias,
you'll use the FOREACH ... GENERATE syntax.
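For example, a sketch that derives a Fahrenheit temperature (field names assumed from the LOAD example):
temp_f = FOREACH weather GENERATE city, temp * 9.0 / 5.0 + 32 AS temp_fahrenheit;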
GROUP
The GROUP statement collects together records with the same key.
For example, if you want to find the sum of all the temp values per group:
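A sketch building on the assumed weather alias; GROUP creates a bag named after the grouped alias, which SUM then iterates over:
by_city = GROUP weather BY city;
temp_sums = FOREACH by_city GENERATE group AS city, SUM(weather.temp) AS total_temp;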
ORDER
The ORDER statement sorts your data, producing a total ordering of your output data.
You indicate a key or set of keys by which you wish to order your data.
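For example (descending by temperature, using the assumed weather alias):
sorted = ORDER weather BY temp DESC;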
DISTINCT
The DISTINCT statement removes duplicate records; note that it works on entire records, not on individual fields.
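A sketch that finds the unique cities in the assumed weather data:
cities = FOREACH weather GENERATE city;
unique_cities = DISTINCT cities;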
Piggybank functions are distributed as part of the Pig distribution, but they are not built in.
Piggybank is a place for Pig users to share the Java UDFs they have written for use with Pig.
You must register the Piggybank JAR to use these functions, which you can find in your distribution at
contrib/piggybank/java/piggybank.jar.
PIGGYBANK Functions
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar;
TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text)
    MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*';
STORE TweetsInaug INTO 'meta/inaug/tweets_inaug';
PIG UDF
Create a class in Eclipse (or any IDE) that extends EvalFunc and overrides the exec() method, then package it into a JAR file.
Register the JAR in your script: REGISTER jarname;
REGISTER ucfirst.jar;
A = LOAD 'sample.txt' AS (logid:chararray);
B = FOREACH A GENERATE myudfs.Ucfirst(logid);
DUMP B;
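A minimal sketch of what the myudfs.Ucfirst class might look like, assuming it uppercases the first character of its chararray input (the class name comes from the example above; the implementation details are illustrative, not the original code):

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Ucfirst extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty input tuples
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            if (str == null || str.isEmpty())
                return str;
            // Uppercase the first character, leave the rest unchanged
            return str.substring(0, 1).toUpperCase() + str.substring(1);
        } catch (Exception e) {
            // Return null for malformed records instead of failing the job
            return null;
        }
    }
}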
QUESTIONS??