Mortar Pig Cheat Sheet
We love Apache Pig for data processing: it's easy to learn, it works with all kinds of data, and it plays well with Python, Java, and other popular languages. And, of course, Pig runs on Hadoop, so it's built for high-scale data science.

Whether you're just getting started with Pig or you've already written a variety of Pig scripts, this compact reference gathers in one place many of the tools you'll need to make the most of your data using Pig 0.12.
Contents
  Data Types
  Diagnostic Operators
  Relational Operators
  Syntax Tips
  Mathematical Functions
  Eval Functions
  String Functions
  DateTime Functions
  Load/Store Functions
  SQL to Pig

Additional Resources
  Official Pig website: pig.apache.org
Data Types

Pig is written in Java, so Pig data types correspond to underlying Java data types. When using a UDF (see the UDFs section below), the Java types need to be translated into the target language.
Simple types

Type       | Description                              | Example                       | Python type
int        | Signed 32-bit integer                    | 10                            | int
long       | Signed 64-bit integer                    | 10L                           | long
float      | 32-bit floating point                    | 10.5f                         | float
double     | 64-bit floating point                    | 10.5                          | float
chararray  | Character array (string), UTF-8          | hello world                   | str or unicode
bytearray  | Byte array (blob)                        |                               | bytearray
boolean    | Java Boolean                             | true/false                    | bool
datetime   | org.joda.time.DateTime                   | 1970-01-01T00:00:00.000+00:00 | datetime
biginteger | Java BigInteger                          | 100000000000                  | long
bigdecimal | Java BigDecimal                          | 42.4242424242424              | float

Complex types

Type       | Description                              | Example                       | Python type
tuple      | An ordered set of fields.                | (17, random)                  | tuple
bag        | A collection of tuples.                  | {(17,random), (42,life)}      | list
map        | A set of key/value pairs.                | [foo#bar,baz#quux]            | dict
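As a quick sketch (the file name and field names below are hypothetical), the AS clause of a LOAD statement is where these types usually appear:

-- Hypothetical file and fields, shown only to illustrate type declarations
songs = LOAD 'songs.tsv' AS (
    title:chararray,
    plays:int,
    duration_ms:long,
    rating:double,
    released:datetime,
    tags:bag{t:(tag:chararray)},
    attributes:map[]
);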
Diagnostic Operators

DESCRIBE alias   | Prints the schema of a relation.
DUMP alias       | Runs the script and prints the results of a relation to the console.
EXPLAIN alias    | Prints the logical, physical, and MapReduce execution plans for a relation.
ILLUSTRATE alias | Shows a step-by-step example of how data is transformed to produce a relation.
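For instance (hypothetical file and relation names), the diagnostic operators are typically used like this while developing a script:

songs    = LOAD 'songs.tsv' AS (title:chararray, plays:int);
by_title = GROUP songs BY title;

DESCRIBE by_title;    -- prints the schema of by_title
ILLUSTRATE by_title;  -- shows sample rows flowing through each step
DUMP by_title;        -- runs the script and prints the result tuples
EXPLAIN by_title;     -- prints the logical, physical, and MapReduce plans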
Relational Operators

These operators are the heart of Pig set operations. The fundamental ways to manipulate data appear here, including GROUP, FILTER, and JOIN.

Name                                               | Description
DISTINCT alias                                     | Removes duplicate tuples from a relation.
...FLATTEN(tuple)                                  | Un-nests a tuple, promoting its fields into the enclosing tuple.
...FLATTEN(bag)                                    | Un-nests a bag, generating one output tuple per tuple in the bag.
JOIN alias BY field, alias BY field USING 'replicated' | Efficient join when one or more relations are small enough to fit in main memory.
LIMIT alias n                                      | Returns at most n tuples from a relation.
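A small sketch (hypothetical files and fields) combining several of these operators:

songs   = LOAD 'songs.tsv'   AS (song_id:long, title:chararray, plays:int);
artists = LOAD 'artists.tsv' AS (song_id:long, artist:chararray);

popular   = FILTER songs BY plays > 1000;                 -- keep frequently played songs
joined    = JOIN popular BY song_id, artists BY song_id;  -- inner join on song_id
by_artist = GROUP joined BY artists::artist;              -- group the join result by artist
counted   = FOREACH by_artist GENERATE group AS artist, COUNT(joined) AS num_songs;
ordered   = ORDER counted BY num_songs DESC;
top_5     = LIMIT ordered 5;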
Syntax Tips

The fields in a Pig relation have an explicit order (unlike SQL columns). As a result, there are syntax shortcuts that rely on that field order.

Assume:
my_data = LOAD 'file.tsv' AS (field1:int, field2:chararray, field3:float, field4:int);

Fields can then be referenced by position as well as by name: $0 refers to field1, $1 to field2, and so on, so projecting $0 and $2 yields (field1, field3).
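A short sketch of these shortcuts, continuing the my_data relation above (the .. range form requires Pig 0.9 or later):

-- Positional references: $0 is field1, $2 is field3
by_position = FOREACH my_data GENERATE $0, $2;          -- (field1, field3)
-- Range projection by name or by position
first_three = FOREACH my_data GENERATE ..field3;        -- (field1, field2, field3)
tail        = FOREACH my_data GENERATE $1..;            -- (field2, field3, field4)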
User-Defined Functions (UDFs)

Pig UDFs can be written in Python, Jython, Java, and Ruby: register the function in the Pig script, then call it like a built-in.

Python

Pig:
REGISTER 'udfs.py' USING streaming_python AS udfs;

Python:
@outputSchema('user_products:bag{t:(product_id:long)}')
def deserialize_user_products(product_ids):
    return [ (product_id, ) for product_id in product_ids.split(',') ]

Jython

Pig:
REGISTER 'udfs.py' USING jython AS udfs;

Jython:
@outputSchema('user_products:bag{t:(product_id:long)}')
def deserialize_user_products(product_ids):
    return [ (product_id, ) for product_id in product_ids.split(',') ]

Java

Pig:
REGISTER udf-project.jar;
DEFINE My_Function my.function.path.MyFunction();

Ruby

Pig:
register 'test.rb' using jruby as myfuncs;

Ruby:
require 'pigudf'
class Myudfs < PigUdf
  outputSchema 'num:int'
  def square num
    return nil if num.nil?
    num**2
  end
end
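Once registered as above, the Python/Jython UDF can be called like any built-in function. A sketch with a hypothetical input file and fields:

-- One row per user; product_ids is a comma-separated chararray such as "12,99,1044"
purchases = LOAD 'user_purchases.tsv' AS (user_id:long, product_ids:chararray);
expanded  = FOREACH purchases
            GENERATE user_id, udfs.deserialize_user_products(product_ids) AS user_products;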
Mathematical Functions

These functions are very similar to what you can find in java.lang.Math: basic mathematical functions and the trigonometric repertoire.

Name (Signature)      | Return Type | Description
ACOS(double a)        | double      | Returns the arc cosine of a.
ASIN(double a)        | double      | Returns the arc sine of a.
ATAN(double a)        | double      | Returns the arc tangent of a.
CBRT(double a)        | double      | Returns the cube root of a.
CEIL(double a)        | double      | Returns a rounded up to the nearest integer (ceiling).
COS(double a)         | double      | Returns the trigonometric cosine of a.
COSH(double a)        | double      | Returns the hyperbolic cosine of a.
EXP(double a)         | double      | Returns Euler's number e raised to the power of a.
FLOOR(double a)       | double      | Returns a rounded down to the nearest integer (floor).
LOG(double a)         | double      | Returns the natural logarithm (base e) of a.
LOG10(double a)       | double      | Returns the base-10 logarithm of a.
RANDOM()              | double      | Returns a pseudo-random double between 0.0 and 1.0.
ROUND(float/double a) | int, long   | Returns a rounded to the nearest integer (int for float input, long for double input).
SIN(double a)         | double      | Returns the sine of a.
SINH(double a)        | double      | Returns the hyperbolic sine of a.
SQRT(double a)        | double      | Returns the positive square root of a.
TAN(double a)         | double      | Returns the tangent of a.
TANH(double a)        | double      | Returns the hyperbolic tangent of a.
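These are called inside expressions, for example in a FOREACH (hypothetical file and fields):

readings = LOAD 'readings.tsv' AS (sensor:chararray, value:double);
derived  = FOREACH readings GENERATE
    sensor,
    SQRT(value)  AS root,
    LOG10(value) AS magnitude,
    CEIL(value)  AS rounded_up,
    ROUND(value) AS nearest_int;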
Eval Functions

This group contains aggregate functions such as COUNT and SUM, along with useful utility methods such as IsEmpty.

Name (Signature) | Return Type | Description
AVG(col) | double | Computes the average of the numeric values in a single-column bag.
CONCAT(String expression1, String expression2), CONCAT(byte[] expression1, byte[] expression2) | String, byte[] | Concatenates two expressions of identical type.
COUNT(DataBag bag) | long | Counts the number of elements in a bag, excluding tuples whose first field is null.
COUNT_STAR(DataBag bag) | long | Counts the number of elements in a bag, including tuples with null values.
DIFF(DataBag bag1, DataBag bag2) | DataBag | Compares two bags in a tuple and returns the tuples that appear in one but not the other.
IsEmpty(DataBag bag), IsEmpty(Map map) | boolean | Checks whether a bag or map is empty.
MAX(col) | same type as col | Computes the maximum value in a single-column bag.
MIN(col) | same type as col | Computes the minimum value in a single-column bag.
PluckTuple(String prefix) | Tuple | Projects only the fields of a tuple whose names begin with the given prefix (useful after a JOIN).
SIZE(expression) | long | Computes the number of elements based on the data type (fields in a tuple, tuples in a bag, characters in a chararray).
SUBTRACT(DataBag bag1, DataBag bag2) | DataBag | Returns a bag containing the tuples of the first bag that are not in the second.
SUM(col) | double, long | Computes the sum of the numeric values in a single-column bag.
TOKENIZE(String expression [, 'field_delimiter']) | DataBag | Splits a string into tokens and returns them as a bag of single-field tuples.
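The aggregate functions operate on bags, so they are typically applied after a GROUP. A sketch with hypothetical data:

orders  = LOAD 'orders.tsv' AS (customer:chararray, amount:double);
grouped = GROUP orders BY customer;
summary = FOREACH grouped GENERATE
    group              AS customer,
    COUNT(orders)      AS num_orders,
    SUM(orders.amount) AS total_spent,
    AVG(orders.amount) AS avg_spent,
    MAX(orders.amount) AS biggest_order;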
String Functions

Providing convenient ways to handle text fields, these functions mirror commonly used String functions in other languages.

Name (Signature) | Return Type | Description
ENDSWITH(String string, String testAgainst) | boolean | Tests inputs to determine if the first argument ends with the string in the second.
EqualsIgnoreCase(String string1, String string2) | boolean | Compares two strings for equality, ignoring case.
INDEXOF(String string, String 'character', int startIndex) | int | Returns the index of the first occurrence of a character, searching forward from a start index.
LAST_INDEX_OF(String string, String 'character') | int | Returns the index of the last occurrence of a character, searching backward.
LCFIRST(String expression) | String | Converts the first character of a string to lower case.
LOWER(String expression) | String | Converts all characters in a string to lower case.
LTRIM(String expression) | String | Returns a copy of the string with leading whitespace omitted.
REGEX_EXTRACT(String string, String regex, int index) | String | Performs regular-expression matching and extracts the matched group at the given index.
REGEX_EXTRACT_ALL(String string, String regex) | Tuple | Performs regular-expression matching and extracts all matched groups as a tuple.
REPLACE(String string, String 'regExp', String 'newChar') | String | Replaces existing characters in a string with new characters.
RTRIM(String expression) | String | Returns a copy of the string with trailing whitespace omitted.
STARTSWITH(String string, String testAgainst) | boolean | Tests inputs to determine if the first argument starts with the string in the second.
STRSPLIT(String string, String regex, int limit) | Tuple | Splits a string around matches of a given regular expression.
SUBSTRING(String string, int startIndex, int stopIndex) | String | Returns a substring of a given string.
TRIM(String expression) | String | Returns a copy of the string with leading and trailing whitespace omitted.
UCFIRST(String expression) | String | Converts the first character of a string to upper case.
UPPER(String expression) | String | Converts all characters in a string to upper case.
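A sketch of a few of these in use (hypothetical file and fields):

users   = LOAD 'users.tsv' AS (name:chararray, email:chararray);
cleaned = FOREACH users GENERATE
    TRIM(name)                         AS name,
    LOWER(email)                       AS email,
    STARTSWITH(LOWER(email), 'admin@') AS is_admin,
    SUBSTRING(email, 0, 5)             AS prefix;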
DateTime Functions

Pig uses the Joda-Time DateTime class for date and time handling.

Name (Signature) | Return Type | Description
AddDuration(DateTime datetime, String duration) | DateTime | Returns the result of adding an ISO 8601 duration to a DateTime.
CurrentTime() | DateTime | Returns the DateTime of the current time.
DaysBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of days between two DateTime objects.
GetDay(DateTime datetime) | int | Returns the day of the month.
GetHour(DateTime datetime) | int | Returns the hour of the day.
GetMilliSecond(DateTime datetime) | int | Returns the millisecond of the second.
GetMinute(DateTime datetime) | int | Returns the minute of the hour.
GetMonth(DateTime datetime) | int | Returns the month of the year.
GetSecond(DateTime datetime) | int | Returns the second of the minute.
GetWeek(DateTime datetime) | int | Returns the week of the week-year.
GetWeekYear(DateTime datetime) | int | Returns the week-year.
GetYear(DateTime datetime) | int | Returns the year.
HoursBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of hours between two DateTime objects.
MilliSecondsBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of milliseconds between two DateTime objects.
MinutesBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of minutes between two DateTime objects.
MonthsBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of months between two DateTime objects.
SecondsBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of seconds between two DateTime objects.
SubtractDuration(DateTime datetime, String duration) | DateTime | Returns the result of subtracting an ISO 8601 duration from a DateTime.
ToDate(long milliseconds) | DateTime | Converts milliseconds since the Unix epoch to a DateTime.
ToDate(String isostring) | DateTime | Converts an ISO 8601 string to a DateTime.
ToDate(String userstring, String format) | DateTime | Converts a string to a DateTime using a custom format string.
ToDate(String userstring, String format, String timezone) | DateTime | Converts a string to a DateTime using a custom format string and time zone.
ToMilliSeconds(DateTime datetime) | long | Returns the number of milliseconds since the Unix epoch for a DateTime.
ToString(DateTime datetime) | String | Converts a DateTime to an ISO 8601 string.
ToString(DateTime datetime, String format) | String | Converts a DateTime to a string using a custom format string.
ToUnixTime(DateTime datetime) | long | Returns the Unix time (seconds since the epoch) for a DateTime.
WeeksBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of weeks between two DateTime objects.
YearsBetween(DateTime datetime1, DateTime datetime2) | long | Returns the number of years between two DateTime objects.
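A sketch using a few of these functions (hypothetical input; timestamps stored as ISO 8601 strings):

events   = LOAD 'events.tsv' AS (event_id:long, created_at:chararray);
enriched = FOREACH events GENERATE
    event_id,
    ToDate(created_at)                             AS created,
    GetYear(ToDate(created_at))                    AS year,
    DaysBetween(CurrentTime(), ToDate(created_at)) AS age_in_days;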
Tuple, Bag, and Map Functions

Name (Signature) | Return Type | Description
TOTUPLE(expression [, expression ...]) | Tuple | Converts one or more expressions into a tuple.
TOBAG(expression [, expression ...]) | DataBag | Converts one or more expressions into a bag of single-field tuples.
TOMAP(key-expression, value-expression [, key-expression, value-expression ...]) | Map | Converts pairs of key/value expressions into a map.
TOP(int n, int column, DataBag relation) | DataBag | Returns the top-n tuples of a bag, ordered by the specified column.
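For example (hypothetical file and fields), these constructors are handy inside a FOREACH:

products = LOAD 'products.tsv' AS (name:chararray, price:double, category:chararray);
shaped   = FOREACH products GENERATE
    TOTUPLE(name, price)        AS name_and_price,
    TOBAG(name, category)       AS keywords,
    TOMAP('category', category) AS attributes;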
Load/Store Functions

Crucial to data manipulation in Pig, load and store functions bring data into Pig and push it out again in a multitude of formats.

Avro (Load, Store)
  org.apache.pig.piggybank.storage.avro.AvroStorage()
  org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check', '$SCHEMA_FILE', '$PATH')

Apache common log (Load)
  org.apache.pig.piggybank.storage.apachelog.CommonLogLoader()

CSV (Load, Store)
  org.apache.pig.piggybank.storage.CSVExcelStorage()
  org.apache.pig.piggybank.storage.CSVExcelStorage('$DELIMITER', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')

DynamoDB (Store)
  com.mortardata.pig.storage.DynamoDBStorage('$DYNAMODB_TABLE', '$DYNAMODB_AWS_ACCESS_KEY_ID', '$DYNAMODB_AWS_SECRET_ACCESS_KEY')

Greenplum (Store)
  see PostgreSQL

Fixed-Width (Load, Store)

Hive (Load)
  org.apache.pig.piggybank.storage.HiveColumnarLoader('$SCHEMA')

JSON (Load, Store)
  org.apache.pig.piggybank.storage.JsonLoader()
  org.apache.pig.piggybank.storage.JsonLoader('$SCHEMA')

MongoDB (Load, Store)
  com.mongodb.hadoop.pig.MongoLoader()
  com.mongodb.hadoop.pig.MongoLoader('$SCHEMA')

Papertrail logs (Load)
  com.mortardata.pig.PapertrailLoader()

Character-delimited (CSV, TSV) (Load, Store)
  PigStorage()
  PigStorage('$DELIMITER')

PostgreSQL (Store)
  org.apache.pig.piggybank.storage.DBStorage('org.postgresql.Driver',
      'jdbc:postgresql://$POSTGRESQL_HOST/$POSTGRESQL_DATABASE',
      '$POSTGRESQL_USER',
      '$POSTGRESQL_PASS',
      'INSERT INTO my_table(my_col_1,my_col_2,my_col_3) VALUES (?,?,?)')

Text/Unformatted (Load)
  TextLoader()

XML (Load)
  org.apache.pig.piggybank.storage.StreamingXMLLoader('$DOCUMENT')
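A load/store round trip might look like this sketch (hypothetical paths; assumes the piggybank jar is already registered, and parameters such as $DELIMITER are supplied with -p or %declare):

raw = LOAD 's3://my-bucket/input/*.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
      AS (id:long, name:chararray, amount:double);

STORE raw INTO 's3://my-bucket/output' USING PigStorage('\t');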
SQL to Pig

SELECT
  SQL: SELECT column_name, column_name FROM table_name;
       SELECT *
  Pig: FOREACH alias GENERATE column_name, column_name;
       FOREACH alias GENERATE *;

DISTINCT
  Pig: DISTINCT(FOREACH alias GENERATE column_name, column_name);

WHERE
  SQL: SELECT column_name, column_name FROM table_name WHERE column_name operator value;
  Pig: FILTER alias BY column_name operator value;

AND/OR
  Pig: FILTER alias BY condition1 AND (condition2 OR condition3);

ORDER BY
  Pig: ORDER alias BY column_name ASC|DESC;

TOP/LIMIT
  Pig: LIMIT alias n;

GROUP BY
  SQL: SELECT function(column_name) FROM table GROUP BY column_name;
  Pig: grouped = GROUP alias BY column_name;
       result  = FOREACH grouped GENERATE group, function(alias.column_name);

LIKE
  Pig: FILTER alias BY column_name MATCHES 'regular_expression';

IN
  Pig: FILTER alias BY column_name IN (value1, value2, value3);

JOIN
  SQL: SELECT column_name(s) FROM table1 JOIN table2 ON table1.column_name=table2.column_name;
  Pig: JOIN table1 BY column_name, table2 BY column_name;
LEFT/RIGHT/FULL OUTER JOIN
  SQL: SELECT column_name(s) FROM table1 LEFT|RIGHT|FULL OUTER JOIN table2 ON table1.column_name=table2.column_name;
  Pig: JOIN table1 BY column_name LEFT|RIGHT|FULL OUTER, table2 BY column_name;

UNION ALL
  Pig: UNION alias1, alias2;

AVG
  Pig: AVG(alias.column_name)  (applied to a bag, typically after GROUP)

COUNT
  Pig: COUNT(alias)

COUNT DISTINCT
  Pig: FOREACH alias {
           unique_column = DISTINCT column_name;
           GENERATE COUNT(unique_column);
       };

MAX
  Pig: MAX(alias.column_name)

MIN
  Pig: MIN(alias.column_name)

SUM
  Pig: SUM(alias.column_name)

HAVING
  Pig: FILTER alias BY aggregate_function(column_name) operator value;

UCASE/UPPER
  Pig: UPPER(column_name)

LCASE/LOWER
  Pig: LOWER(column_name)

SUBSTRING
  SQL: SELECT SUBSTRING(column_name, start, length) AS some_name FROM table_name;
  Pig: SUBSTRING(column_name, startIndex, stopIndex)  (Pig uses 0-based start and stop indexes rather than a length)

LEN
  Pig: SIZE(column_name)

ROUND
  Pig: ROUND(column_name)
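Putting a few rows of this table together, the SQL query SELECT customer, SUM(amount) FROM orders GROUP BY customer HAVING SUM(amount) > 100; might translate to Pig roughly as follows (hypothetical file and fields; note that HAVING becomes a FILTER applied after the aggregate is computed):

orders  = LOAD 'orders.tsv' AS (customer:chararray, amount:double);
grouped = GROUP orders BY customer;
totals  = FOREACH grouped GENERATE group AS customer, SUM(orders.amount) AS total;
big     = FILTER totals BY total > 100;
DUMP big;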