06 Pig 01 Intro
Unit 5 (Pig)
Agenda
• Pig Overview
• Execution Modes
• Installation
• Pig Latin Basics
• Developing Pig Script
– Most Occurred Start Letter
• Resources
Pig
“is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating
these programs.”
• Top Level Apache Project
– http://pig.apache.org
• Pig is an abstraction on top of Hadoop
– Provides high level programming language designed for
data processing
– Converted into MapReduce and executed on Hadoop
Clusters
• Pig is widely accepted and used
– Yahoo!, Twitter, Netflix, etc...
Pig and MapReduce
• MapReduce requires programmers
– Must think in terms of map and reduce functions
– More than likely will require Java programmers
• Pig provides high-level language that can be
used by
– Analysts
– Data Scientists
– Statisticians
– Etc...
• Originally implemented at Yahoo! to allow
analysts to access data
Pig’s Features
• Join Datasets
• Sort Datasets
• Filter
• Data Types
• Group By
• User Defined Functions
• Etc..
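To make these features concrete, here is a small Pig Latin sketch; the file names and fields are made up for illustration and are not part of the original slides:

-- load two hypothetical tab-delimited data sets
users  = LOAD 'data/users.txt'  AS (userId:chararray, age:int);
clicks = LOAD 'data/clicks.txt' AS (userId:chararray, url:chararray);
-- filter: keep only adult users
adults = FILTER users BY age >= 18;
-- join the two data sets on userId
joined = JOIN adults BY userId, clicks BY userId;
-- group by url and count clicks per url
byUrl  = GROUP joined BY clicks::url;
counts = FOREACH byUrl GENERATE group AS url, COUNT(joined) AS clickCount;
-- sort by click count, descending
sorted = ORDER counts BY clickCount DESC;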
Pig’s Use Cases
• Extract Transform Load (ETL)
– Ex: Processing large amounts of log data
• clean bad entries, join with other data-sets
• Research of “raw” information
– Ex. User Audit Logs
– Schema may be unknown or inconsistent
– Data Scientists and Analysts may like Pig’s data
transformation paradigm
Pig Components
• Pig Latin
– Command based language
– Designed specifically for data transformation and flow
expression
• Execution Environment
– The environment in which Pig Latin commands are
executed
– Currently there is support for Local and Hadoop
modes
• Pig compiler converts Pig Latin to
MapReduce
– Compiler strives to optimize execution
– You automatically get optimization improvements with Pig
updates
Execution Modes
• Local
– Executes in a single JVM
– Works exclusively with local file system
– Great for development, experimentation and prototyping
• Hadoop Mode
– Also known as MapReduce mode
– Pig renders Pig Latin into MapReduce jobs and executes
them on the cluster
– Can execute against a pseudo-distributed or fully-distributed
Hadoop installation
• We will run on a pseudo-distributed cluster
Hadoop Mode
PigLatin.pig:

-- 1: Load text into a bag, where a row is a line of text
lines = LOAD '/training/playArea/hamlet.txt' AS (line:chararray);

-- 2: Tokenize the provided text
tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;

[Figure: the PigLatin.pig script is submitted to Pig, which parses it and compiles it into a set of MapReduce jobs; the Hadoop execution environment then executes the jobs on the Hadoop cluster while Pig monitors and reports on their progress.]
Installation Prerequisites
• Java 6
– With $JAVA_HOME environment variable properly set
• Cygwin on Windows
Installation
• Add pig script to path
– export PIG_HOME=$CDH_HOME/pig-0.9.2-cdh4.0.0
– export PATH=$PATH:$PIG_HOME/bin
• $ pig -help
• That’s all we need to run in local mode
– Think of Pig as a ‘Pig Latin’ compiler, development tool
and executor
– Not tightly coupled with Hadoop clusters
Pig Installation for Hadoop Mode
• Make sure Pig is compiled against your version of Hadoop
– Not a problem when using a distribution such as Cloudera
Distribution for Hadoop (CDH)
• Pig will utilize $HADOOP_HOME and
$HADOOP_CONF_DIR variables to locate
Hadoop configuration
– We already set these properties during MapReduce
installation
– Pig will use these properties to locate Namenode and
Resource Manager
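For example, assuming a CDH tarball layout like the one used for PIG_HOME above (the exact paths are an assumption and depend on your installation):

export HADOOP_HOME=$CDH_HOME/hadoop-2.0.0-cdh4.0.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop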
Running Modes
• Can manually override the default mode via
‘-x’ or ‘-exectype’ options
– $ pig -x local
– $ pig -x mapreduce
$ pig
2012-07-14 13:38:58,139 [main] INFO org.apache.pig.Main - Logging error
messages to: /home/hadoop/Training/play_area/pig/pig_1342287538128.log
2012-07-14 13:38:58,458 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:8020
$ pig -x local
2012-07-14 13:39:31,029 [main] INFO org.apache.pig.Main - Logging error
messages to: /home/hadoop/Training/play_area/pig/pig_1342287571019.log
2012-07-14 13:39:31,232 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: file:///
Running Pig
• Script
– Execute commands in a file
– $ pig scriptFile.pig
• Grunt
– Interactive Shell for executing Pig Commands
– Started when script file is NOT provided
– Can execute scripts from Grunt via run or exec
commands
• Embedded
– Execute Pig commands using PigServer class
• Just like JDBC to execute SQL
– Can have programmatic access to Grunt via the PigRunner class
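For example, from within Grunt (scriptFile.pig is a placeholder name):

grunt> exec scriptFile.pig
grunt> run scriptFile.pig

exec runs the script in a separate context, so aliases it defines are not kept; run executes it as if the commands were typed interactively, so its aliases remain available in the Grunt session.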
Simple Pig Latin Example
$ pig
grunt> cat /training/playArea/pig/a.txt
a 1
d 4
c 9
k 6
grunt> records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int);
grunt> dump records;
...
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
...
(a,1)
(d,4)
(c,9)
(k,6)
grunt>

Running pig with no arguments starts Grunt in the default MapReduce mode; Grunt also supports file system commands such as cat. The LOAD command reads the contents of the text file into a bag named records, and dump prints that bag to the screen.
LOAD Command
LOAD 'data' [USING function] [AS schema];
records =
    LOAD '/training/playArea/pig/excite-small.log'
    USING PigStorage()
    AS (userId:chararray, timestamp:long, query:chararray);

The AS clause defines the schema of the loaded records.
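PigStorage parses each line using a field delimiter, which defaults to tab; a different delimiter can be passed explicitly. For example, if the log were comma-separated (an assumption for illustration):

records =
    LOAD '/training/playArea/pig/excite-small.log'
    USING PigStorage(',')
    AS (userId:chararray, timestamp:long, query:chararray);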
ILLUSTRATE Command
grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> ILLUSTRATE charGroup;
---------------------------------------------------------------------
| chars     | c:chararray                                           |
---------------------------------------------------------------------
|           | c                                                     |
|           | c                                                     |
---------------------------------------------------------------------
---------------------------------------------------------------------
| charGroup | group:chararray | chars:bag{:tuple(c:chararray)}      |
---------------------------------------------------------------------
|           | c               | {(c), (c)}                          |
---------------------------------------------------------------------
(outer bag: the charGroup relation; inner bag: the {(c), (c)} column)
Inner vs. Outer Bag
grunt> chars = LOAD '/training/playArea/pig/b.txt' AS
(c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
Each tuple of the outer bag charGroup contains the group key and an inner bag
holding the matching tuples, e.g. {(a),(a),(a)}.
TOKENIZE Function
• Splits a string into tokens and outputs as a
bag of tokens
– Separators are: space, double quote ("), comma (,),
parentheses (()), star (*)
grunt> linesOfText = LOAD 'data/c.txt' AS (line:chararray);
grunt> dump linesOfText;
(this is a line of text)
(yet another line of text)
(third line of words)
grunt> tokenBag = FOREACH linesOfText GENERATE TOKENIZE(line);

TOKENIZE splits each line by the separators above and returns a bag of tokens.
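Because TOKENIZE is not wrapped in FLATTEN here, each output tuple contains a single bag of tokens; dumping tokenBag would therefore print something like:

grunt> dump tokenBag;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})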
Aliases and Keywords
• Aliases are case sensitive
• Keywords are case insensitive
• General conventions
– Upper case is a system keyword
– Lowercase is something that you provide
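For instance (a made-up snippet), the LOAD and DUMP keywords may be written in either case, but the alias records must always be spelled exactly the same way:

records = load 'data/a.txt' AS (letter:chararray, count:int); -- 'load' is accepted
dump records;   -- works
dump Records;   -- fails: 'Records' is not the same alias as 'records'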
Problem: Locate Most Occurred Start Letter
• Calculate the number of times each letter occurs as the first
letter of a word in a large body of text, e.g. Hamlet:
"(For so this side of our known world esteem'd him) Did slay this
Fortinbras; who, by a seal'd compact, Well ratified by law and ..."
• Example result: T 49595
1: Load Text Into a Bag
grunt> lines = LOAD '/training/data/hamlet.txt'
AS (line:chararray);
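The slides for steps 2 and 3 (tokenizing the lines and extracting each token's first letter) are not reproduced in this extract; step 2 matches the snippet shown earlier under Hadoop Mode, and step 3 is a sketch that assumes SUBSTRING is used:

grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;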
4: Group by Letter
grunt> letterGroup = GROUP letters BY letter;
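The counting and ordering steps are also omitted here; given that describe below reports orderedCountPerLetter: {group: chararray,long}, they presumably look something like:

grunt> countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters);
grunt> orderedCountPerLetter = ORDER countPerLetter BY $1 DESC;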
INSPECT the orderedCountPerLetter Bag
grunt> describe orderedCountPerLetter;
orderedCountPerLetter: {group: chararray,long}
grunt> toDisplay = LIMIT orderedCountPerLetter 5;
grunt> dump toDisplay;
(t,3711)
(a,2379)
(s,1938)
(m,1787)
(h,1725)
All that is left to do now is grab the first element.
7: Grab the First Element
grunt> result = LIMIT orderedCountPerLetter 1;
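The final output below is read from a file on HDFS, so the result was presumably written out with a STORE command rather than dump; a sketch of that step, using the output path from the cat command below:

grunt> STORE result INTO '/training/playArea/pig/mostSeenLetterOutput';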
INSPECT the Result
$ hdfs dfs -cat /training/playArea/pig/mostSeenLetterOutput/part-r-00000
t 3711
Pig Tools
• Community has developed several tools to
support Pig
– https://cwiki.apache.org/confluence/display/PIG/PigTools
• We have PigPen Eclipse Plugin installed:
– Download the latest jar release at
https://issues.apache.org/jira/browse/PIG-366
• As of this writing: org.apache.pig.pigpen_0.7.5.jar
– Place the jar in eclipse/plugins/
– Restart Eclipse
Summary
• We learned about
– Pig Overview
– Execution Modes
– Installation
– Pig Latin Basics
– Resources
• We developed a Pig script to locate the “Most
Occurred Start Letter”