06 Pig 01 Intro 1

The document provides an overview of Apache Pig, a platform for analyzing large datasets using a high-level language called Pig Latin, which simplifies data processing on Hadoop. It covers key features, execution modes, installation prerequisites, and basic commands for developing Pig scripts. Additionally, it highlights the components of Pig, including its execution environment and diagnostic tools for analyzing data transformations.


Big Data Analytics 8CAI4-01

Unit 5 (Pig)
Agenda
• Pig Overview
• Execution Modes
• Installation
• Pig Latin Basics
• Developing Pig Script
– Most Occurred Start Letter
• Resources

Pig
“is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.”
• Top Level Apache Project
– http://pig.apache.org
• Pig is an abstraction on top of Hadoop
– Provides a high-level programming language designed for data processing
– Converted into MapReduce and executed on Hadoop clusters
• Pig is widely accepted and used
– Yahoo!, Twitter, Netflix, etc...
Pig and MapReduce
• MapReduce requires programmers
– Must think in terms of map and reduce functions
– More than likely will require Java programmers
• Pig provides a high-level language that can be used by
– Analysts
– Data Scientists
– Statisticians
– Etc...
• Originally implemented at Yahoo! to allow analysts to access data

Pig’s Features
• Join Datasets
• Sort Datasets
• Filter
• Data Types
• Group By
• User Defined Functions
• Etc..

Pig’s Use Cases
• Extract Transform Load (ETL)
– Ex: Processing large amounts of log data
• clean bad entries, join with other data-sets
• Research of “raw” information
– Ex. User Audit Logs
– Schema may be unknown or inconsistent
– Data Scientists and Analysts may like Pig’s data transformation paradigm

Pig Components
• Pig Latin
– Command based language
– Designed specifically for data transformation and flow expression
• Execution Environment
– The environment in which Pig Latin commands are executed
– Currently there is support for Local and Hadoop modes
• Pig compiler converts Pig Latin to MapReduce
– Compiler strives to optimize execution
– You automatically get optimization improvements with Pig updates
Execution Modes
• Local
– Executes in a single JVM
– Works exclusively with the local file system
– Great for development, experimentation and prototyping
• Hadoop Mode
– Also known as MapReduce mode
– Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
– Can execute against a pseudo-distributed or fully-distributed Hadoop installation
• We will run on a pseudo-distributed cluster

Hadoop Mode

-- 1: Load text into a bag, where a row is a line of text
lines = LOAD '/training/playArea/hamlet.txt' AS (line:chararray);
-- 2: Tokenize the provided text
tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;

[Diagram: the PigLatin.pig script is parsed by Pig and compiled into a set of MapReduce jobs; the execution environment submits them to the Hadoop cluster and then monitors and reports on execution.]
Installation Prerequisites
• Java 6
– With $JAVA_HOME environment variable properly set
• Cygwin on Windows


Installation
• Add pig script to path
– export PIG_HOME=$CDH_HOME/pig-0.9.2-cdh4.0.0
– export PATH=$PATH:$PIG_HOME/bin
• $ pig -help
• That’s all we need to run in local mode
– Think of Pig as a ‘Pig Latin’ compiler, development tool
and executor
– Not tightly coupled with Hadoop clusters

Pig Installation for Hadoop Mode
• Make sure Pig compiles with Hadoop
– Not a problem when using a distribution such as Cloudera Distribution for Hadoop (CDH)
• Pig will utilize the $HADOOP_HOME and $HADOOP_CONF_DIR variables to locate the Hadoop configuration
– We already set these properties during MapReduce installation
– Pig will use these properties to locate the Namenode and Resource Manager

Running Modes
• Can manually override the default mode via the ‘-x’ or ‘-exectype’ option
– $ pig -x local
– $ pig -x mapreduce

$ pig
2012-07-14 13:38:58,139 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Training/play_area/pig/pig_1342287538128.log
2012-07-14 13:38:58,458 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020

$ pig -x local
2012-07-14 13:39:31,029 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Training/play_area/pig/pig_1342287571019.log
2012-07-14 13:39:31,232 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
Running Pig
• Script
– Execute commands in a file
– $pig scriptFile.pig
• Grunt
– Interactive Shell for executing Pig Commands
– Started when script file is NOT provided
– Can execute scripts from Grunt via run or exec
commands
• Embedded
– Execute Pig commands using the PigServer class
• Just like JDBC to execute SQL
– Can have programmatic access to Grunt via the PigRunner class

Pig Latin Concepts

• Building blocks
– Field – piece of data
– Tuple – ordered set of fields, represented with “(“ and “)”
• (10.4, 5, word, 4, field1)
– Bag – collection of tuples, represented with “{“ and “}”
• { (10.4, 5, word, 4, field1), (this, 1, blah) }
• Similar to Relational Database
– Bag is a table in the database
– Tuple is a row in a table
– Bags do not require that all tuples contain the same number of fields
• Unlike a relational table
Simple Pig Latin Example
$ pig
grunt> cat /training/playArea/pig/a.txt
a 1
d 4
c 9
k 6
grunt> records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int);
grunt> dump records;
...
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
...
(a,1)
(d,4)
(c,9)
(k,6)
grunt>

Notes: $ pig starts Grunt with the default MapReduce mode; Grunt supports file system commands such as cat; LOAD reads the contents of the text file into a bag named records; dump displays the records bag, printing its results to the screen.

DUMP and STORE Statements

• No action is taken until DUMP or STORE commands are encountered
– Pig will parse, validate and analyze statements but not execute them
• DUMP – displays the results to the screen
• STORE – saves results (typically to a file)

records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int);
...
...
DUMP final_bag;

Nothing is executed until the DUMP statement; Pig will optimize the entire chunk of script above it, and the fun begins at DUMP.
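The deferred-execution model above is similar to lazy evaluation with generators. A minimal Python sketch of the idea follows; this is an analogy only, not how Pig is implemented: building the pipeline runs nothing, and consuming it (the analogue of DUMP) triggers the work.

```python
# Sketch of deferred execution: building the pipeline runs nothing;
# only consuming it (the analogue of DUMP/STORE) triggers work.
executed = []

def load():
    for row in [("a", 1), ("d", 4)]:
        executed.append(row)   # record that work actually happened
        yield row

pipeline = (letter for letter, count in load())  # nothing runs yet
assert executed == []                            # still lazy
result = list(pipeline)                          # "DUMP": now it executes
assert result == ["a", "d"]
assert executed == [("a", 1), ("d", 4)]
```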
Large Data
• Hadoop data is usually quite large and it
doesn’t make sense to print it to the screen
• The common pattern is to persist results to
Hadoop (HDFS, HBase)
– This is done with STORE command
• For information and debugging purposes
you can print a small sub-set to the screen
grunt> records = LOAD '/training/playArea/pig/excite-small.log'
AS (userId:chararray, timestamp:long, query:chararray);
grunt> toPrint = LIMIT records 5;
grunt> DUMP toPrint;

Only 5 records will be displayed.

LOAD Command
LOAD 'data' [USING function] [AS schema];

• data – name of the directory or file
– Must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage, which parses each line into fields using a delimiter
• Default delimiter is tab (‘\t’)
• The delimiter can be customized using regular expressions
• AS – assign a schema to the incoming data
– Assigns names to fields
– Declares types of fields
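To make the default behavior concrete, here is a hedged Python sketch of what tab-delimited parsing plus an AS schema amounts to; the function name and the schema representation (a tuple of type constructors) are illustrative, not Pig's API.

```python
# Illustrative sketch of the default load path: split each line on '\t'
# and apply the schema's types, yielding one tuple per row.
def load_with_schema(lines, schema):
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield tuple(cast(f) for cast, f in zip(schema, fields))

raw = ["a\t1", "d\t4"]
records = list(load_with_schema(raw, (str, int)))
assert records == [("a", 1), ("d", 4)]
```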
LOAD Command Example

records =
    LOAD '/training/playArea/pig/excite-small.log'
    USING PigStorage()
    AS (userId:chararray, timestamp:long, query:chararray);

Here '/training/playArea/pig/excite-small.log' is the data, PigStorage() is the user-selected load function (there are a lot of choices, or you can implement your own), and the AS clause is the schema.

Schema Data Types

Simple
  int        Signed 32-bit integer                10
  long       Signed 64-bit integer                10L or 10l
  float      32-bit floating point                10.5F or 10.5f
  double     64-bit floating point                10.5 or 10.5e2 or 10.5E2
Arrays
  chararray  Character array (string) in UTF-8    hello world
  bytearray  Byte array (blob)
Complex Data Types
  tuple      An ordered set of fields             (19,2)
  bag        A collection of tuples               {(19,2), (18,1)}
  map        A collection of key-value pairs      [open#apache]

Source: Apache Pig Documentation 0.9.2; “Pig Latin Basics”. 2012
Pig Latin – Diagnostic Tools
• Display the structure of the Bag
– grunt> DESCRIBE <bag_name>;
• Display Execution Plan
– Produces Various reports
• Logical Plan
• MapReduce Plan
– grunt> EXPLAIN <bag_name>;
• Illustrate how Pig engine transforms the
data
– grunt> ILLUSTRATE <bag_name>;


Pig Latin - Grouping

grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
grunt> describe chars;
chars: {c: chararray}
grunt> dump chars;
(a)
(k)
...
...
(k)
(c)
(k)
grunt> charGroup = GROUP chars by c;
grunt> describe charGroup;
charGroup: {group: chararray,chars: {(c: chararray)}}
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})

GROUP creates a new bag with an element named ‘group’ and an element named ‘chars’. The chars bag is grouped by “c”; therefore the ‘group’ element will contain unique values. The ‘chars’ element is a bag itself and contains all tuples from the ‘chars’ bag that match the value from ‘c’.
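The GROUP semantics above can be sketched in Python: each output row pairs a unique key with the bag of all input tuples carrying that key. This is an illustrative analogue of the data reshaping, not Pig's implementation.

```python
from collections import defaultdict

# GROUP chars BY c: collect every input tuple under its key,
# producing (group, bag-of-matching-tuples) rows.
chars = [("a",), ("k",), ("c",), ("k",), ("a",)]
groups = defaultdict(list)
for tup in chars:
    groups[tup[0]].append(tup)

char_group = sorted(groups.items())
assert char_group == [
    ("a", [("a",), ("a",)]),
    ("c", [("c",)]),
    ("k", [("k",), ("k",)]),
]
```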
ILLUSTRATE Command
grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> ILLUSTRATE charGroup;
--------------------------
| chars | c:chararray    |
--------------------------
|       | c              |
--------------------------
----------------------------------------------------------------
| charGroup | group:chararray | chars:bag{:tuple(c:chararray)} |
----------------------------------------------------------------
|           | c               | {(c), (c)}                     |
----------------------------------------------------------------

Inner vs. Outer Bag

grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> ILLUSTRATE charGroup;
--------------------------
| chars | c:chararray    |
--------------------------
|       | c              |
|       | c              |
--------------------------
----------------------------------------------------------------
| charGroup | group:chararray | chars:bag{:tuple(c:chararray)} |
----------------------------------------------------------------
|           | c               | {(c), (c)}                     |
----------------------------------------------------------------

Here {(c), (c)} is the inner bag; the charGroup relation itself is the outer bag.
Inner vs. Outer Bag
grunt> chars = LOAD '/training/playArea/pig/b.txt' AS
(c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})

Inner Bag

Outer Bag

28

Pig Latin - FOREACH

• FOREACH <bag> GENERATE <data>
– Iterate over each element in the bag and produce a result
– Ex: grunt> result = FOREACH bag GENERATE f1;

grunt> records = LOAD 'data/a.txt' AS (c:chararray, i:int);
grunt> dump records;
(a,1)
(d,4)
(c,9)
(k,6)
grunt> counts = foreach records generate i;
grunt> dump counts;
(1)
(4)
(9)
(6)

For each row, emit the ‘i’ field.
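In Python terms, FOREACH ... GENERATE is a per-row projection over the bag. A minimal sketch of that idea (an analogy, not Pig's implementation):

```python
# FOREACH records GENERATE i amounts to projecting one field per row.
records = [("a", 1), ("d", 4), ("c", 9), ("k", 6)]
counts = [(i,) for _, i in records]
assert counts == [(1,), (4,), (9,), (6,)]
```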
FOREACH with Functions
FOREACH B GENERATE group, FUNCTION(A);
– Pig comes with many functions including COUNT, FLATTEN, CONCAT, etc...
– Can implement a custom function

grunt> chars = LOAD 'data/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
grunt> describe charGroup;
charGroup: {group: chararray,chars: {(c: chararray)}}
grunt> counts = FOREACH charGroup GENERATE group, COUNT(chars);
grunt> dump counts;
(a,3)
(c,2)
(i,3)
(k,4)
(l,2)

For each row in the ‘charGroup’ bag, emit the group field and count the number of items in the ‘chars’ bag.

TOKENIZE Function
• Splits a string into tokens and outputs as a bag of tokens
– Separators are: space, double quote("), comma(,), parenthesis(()), star(*)
grunt> linesOfText = LOAD 'data/c.txt' AS (line:chararray);
grunt> dump linesOfText;
(this is a line of text)
(yet another line of text)
(third line of words)
grunt> tokenBag = FOREACH linesOfText GENERATE TOKENIZE(line);
grunt> dump tokenBag;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> describe tokenBag;
tokenBag: {bag_of_tokenTuples: {tuple_of_tokens: (token: chararray)}}

TOKENIZE splits each row’s line by its separators and returns a bag of tokens; each row of tokenBag is a bag of words.
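A hedged Python sketch of this tokenization, using exactly the separators listed above; it mirrors the documented separator list, not Pig's actual implementation.

```python
import re

# Split on the separators the slide lists: space, double quote,
# comma, parentheses, star; drop empty tokens.
def tokenize(line):
    return [t for t in re.split(r'[ ",()*]+', line) if t]

assert tokenize("this is a line of text") == ["this", "is", "a", "line", "of", "text"]
```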
FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it’s an operator
– Re-arranges output
grunt> dump tokenBag;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> flatBag = FOREACH tokenBag GENERATE flatten($0);
grunt> dump flatBag;
(this)
(is)
(a)
...
...
(text)
(third)
(line)
(of)
(words)

The nested structure (a bag of bags of tuples) is flattened, resulting in a bag of simple tokens. Elements in a bag can be referenced by index ($0).
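The un-nesting FLATTEN performs can be sketched with Python's itertools.chain; this is an analogy for the data reshaping only, not Pig's operator.

```python
from itertools import chain

# FLATTEN un-nests a bag of bags into a single bag of tuples;
# chain.from_iterable does the equivalent un-nesting.
token_bag = [[("this",), ("is",)], [("yet",), ("another",)]]
flat_bag = list(chain.from_iterable(token_bag))
assert flat_bag == [("this",), ("is",), ("yet",), ("another",)]
```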

Conventions and Case Sensitivity

• Case Sensitive
– Alias names
– Pig Latin Functions
• Case Insensitive
– Pig Latin Keywords

counts = FOREACH charGroup GENERATE group, COUNT(c);

In this statement the aliases (counts, charGroup, group, c) are case sensitive, the keywords (FOREACH, GENERATE) are case insensitive, and the function (COUNT) is case sensitive.

• General conventions
– Upper case is a system keyword
– Lowercase is something that you provide
Problem: Locate Most Occurred Start Letter

• Calculate the number of occurrences of each start letter in the provided body of text
• Traverse each letter comparing occurrence counts
• Produce the start letter that has the most occurrences

Sample input:
(For so this side of our known world esteem'd him) Did slay this Fortinbras; who, by a seal'd compact, Well ratified by law and heraldry, Did forfeit, with his life, all those his lands Which he stood seiz'd of, to the conqueror; Against the which a moiety competent Was gaged by our king; which had return'd To the inheritance of Fortinbras,

Sample output:
A 89530
B 3920
..
..
Z 876

T 495959

‘Most Occurred Start Letter’ - Pig Way
1. Load text into a bag (named ‘lines’)
2. Tokenize the text in the ‘lines’ bag
3. Retain first letter of each token
4. Group by letter
5. Count the number of occurrences in each
group
6. Descending order the group by the count
7. Grab the first element => Most occurring
letter
8. Persist result on a file system

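The eight steps above can be sketched end-to-end in Python as a cross-check of the data flow; the sample text below is illustrative, not hamlet.txt, and the tokenizer mirrors the separator list documented earlier.

```python
from collections import Counter
import re

# Steps 1-8 of the pipeline, on a tiny illustrative corpus.
lines = ["This etext is a line", "third line of words"]            # 1: load
tokens = [t for line in lines
          for t in re.split(r'[ ",()*]+', line) if t]              # 2: tokenize + flatten
letters = [t[:1] for t in tokens]                                  # 3: first letter
count_per_letter = Counter(letters)                                # 4-5: group + count
result = count_per_letter.most_common(1)[0]                        # 6-7: order desc, take first
assert result == ("l", 2)                                          # 8 would STORE this
```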
1: Load Text Into a Bag
grunt> lines = LOAD '/training/data/hamlet.txt' AS (line:chararray);

Load the text file into a bag, sticking each entire line into element ‘line’ of type ‘chararray’.

Inspect the lines bag:
grunt> describe lines;
lines: {line: chararray}
grunt> toDisplay = LIMIT lines 5;
grunt> dump toDisplay;
(This Etext file is presented by Project Gutenberg, in)
(This etext is a typo-corrected version of Shakespeare's Hamlet,)
(cooperation with World Library, Inc., from their Library of the)
(*This Etext has certain copyright implications you should read!*
(Future and Shakespeare CDROMS. Project Gutenberg often releases

Each row is a line of text.

2: Tokenize the Text in the ‘Lines’ Bag
grunt> tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;

For each line of text: (1) tokenize that line, (2) flatten the structure to produce 1 word per row.

Inspect the tokens bag:
grunt> describe tokens
tokens: {token: chararray}
grunt> toDisplay = LIMIT tokens 5;
grunt> dump toDisplay;
(a)
(of)
(is)
(This)
(etext)

Each row is now a token.
3: Retain First Letter of Each Token
grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;

For each token grab the first letter; utilize the SUBSTRING function.

Inspect the letters bag:
grunt> describe letters;
letters: {letter: chararray}
grunt> toDisplay = LIMIT letters 5;
grunt> dump toDisplay;
(a)
(i)
(T)
(e)
(t)

What we have now is 1 character per row.

4: Group by Letter
grunt> letterGroup = GROUP letters BY letter;

Create a bag for each unique character; the “grouped” bag will contain the same character for each occurrence of that character.

Inspect the letterGroup bag:
grunt> describe letterGroup;
letterGroup: {group: chararray,letters: {(letter: chararray)}}
grunt> toDisplay = LIMIT letterGroup 5;
grunt> dump toDisplay;
(0,{(0),(0),(0)})
(2,{(2),(2),(2),(2),(2)})
(3,{(3),(3),(3)})
(a,{(a),(a)})
(b,{(b)})

Next we’ll need to convert character occurrences into counts; note this display was modified, as there were too many characters to fit on the screen.
5: Count the Number of Occurrences in Each Group
grunt> countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters);

For each row, count the occurrences of the letter.

Inspect the countPerLetter bag:
grunt> describe countPerLetter;
countPerLetter: {group: chararray,long}
grunt> toDisplay = LIMIT countPerLetter 5;
grunt> dump toDisplay;
(A,728)
(B,325)
(C,291)
(D,194)
(E,264)

Each row now has the character and the number of times it was found to start a word. All we have to do is find the maximum.

6: Descending Order the Group by the Count
grunt> orderedCountPerLetter = ORDER countPerLetter BY $1 DESC;

Simply order the bag by the second field ($1), the number of occurrences for that element.

Inspect the orderedCountPerLetter bag:
grunt> describe orderedCountPerLetter;
orderedCountPerLetter: {group: chararray,long}
grunt> toDisplay = LIMIT orderedCountPerLetter 5;
grunt> dump toDisplay;
(t,3711)
(a,2379)
(s,1938)
(m,1787)
(h,1725)

All we have to do now is just grab the first element.
7: Grab the First Element
grunt> result = LIMIT orderedCountPerLetter 1;

The rows were already ordered in descending order, so simply limiting to one element gives us the result.

Inspect the result bag:
grunt> describe result;
result: {group: chararray,long}
grunt> dump result;
(t,3711)

There it is.

8: Persist Result on a File System
grunt> STORE result INTO '/training/playArea/pig/mostSeenLetterOutput';

The result is saved under the provided directory.

Inspect the result:
$ hdfs dfs -cat /training/playArea/pig/mostSeenLetterOutput/part-r-00000
t 3711

Notice that the result was stored in part-r-00000, the regular artifact of a MapReduce reducer; Pig compiles Pig Latin into MapReduce code and executes it.
MostSeenStartLetter.pig Script
-- 1: Load text into a bag, where a row is a line of text
lines = LOAD '/training/data/hamlet.txt' AS (line:chararray);
-- 2: Tokenize the provided text
tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;
-- 3: Retain first letter of each token
letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
-- 4: Group by letter
letterGroup = GROUP letters BY letter;
-- 5: Count the number of occurrences in each group
countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters);
-- 6: Descending order the group by the count
orderedCountPerLetter = ORDER countPerLetter BY $1 DESC;
-- 7: Grab the first element => Most occurring letter
result = LIMIT orderedCountPerLetter 1;
-- 8: Persist result on a file system
STORE result INTO '/training/playArea/pig/mostSeenLetterOutput';

• Execute the script:
– $ pig MostSeenStartLetter.pig

Pig Tools
• Community has developed several tools to
support Pig
– https://cwiki.apache.org/confluence/display/PIG/PigTools
• We have PigPen Eclipse Plugin installed:
– Download the latest jar release at
https://issues.apache.org/jira/browse/PIG-366
• As of writing: org.apache.pig.pigpen_0.7.5.jar
– Place the jar in eclipse/plugins/
– Restart eclipse

Summary
• We learned about
– Pig Overview
– Execution Modes
– Installation
– Pig Latin Basics
– Resources
• We developed Pig Script to locate “Most
Occurred Start Letter”

