Pig Hive

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop, using a language called Pig Latin to simplify data processing tasks. It supports various data types and operations such as grouping, filtering, and joins, while allowing users to define their own functions. Pig operates in two modes: Local Mode for local file systems and MapReduce Mode for Hadoop clusters, facilitating easy data summarization and analysis without requiring extensive programming knowledge.

Uploaded by

himanshugmarekar

Apache Pig

Based on slides from Adam Shook

What Is Pig?
• Developed at Yahoo!; now a top-level Apache project
• Immediately makes data on a cluster available to
non-Java programmers via Pig Latin – a dataflow
language
• Interprets Pig Latin and generates MapReduce jobs
that run on the cluster
• Enables easy data summarization, ad-hoc reporting
and querying, and analysis of large volumes of data
• Pig interpreter runs on a client machine – no
administrative overhead required
Pig Terms
• All data in Pig is one of four types:
– An Atom is a simple data value - stored as a string but can
be used as either a string or a number
– A Tuple is a data record consisting of a sequence of
"fields"
• Each field is a piece of data of any type (atom, tuple or bag)
– A Bag is a collection of tuples, possibly with duplicates (an outer bag is also referred to as a ‘Relation’)
• Roughly the concept of a table
– A Map is a map from keys that are string literals to values
that can be any data type
• The concept of a hash map
Pig Capabilities
• Support for
– Grouping
– Joins
– Filtering
– Aggregation
• Extensibility
– Support for User Defined Functions (UDF’s)
• Leverages the same massive parallelism as
native MapReduce
Pig Basics
• Pig is a client application
– No cluster software is required
• Translates Pig Latin scripts into MapReduce jobs
– Parses Pig Latin scripts
– Performs optimization
– Creates execution plan
• Submits MapReduce jobs to the cluster
Execution Modes
• Pig has two execution modes
– Local Mode - all files are installed and run using your local host
and file system
– MapReduce Mode - all files are installed and run on a Hadoop
cluster and HDFS installation
• Interactive
– Start the Grunt shell by invoking Pig on the command line
$ pig
grunt>
• Batch
– Run Pig in batch mode using Pig Scripts and the "pig" command
$ pig -f id.pig -p <param>=<value> ...
Pig Latin
• Pig Latin scripts are generally organized as follows
– A LOAD statement reads data
– A series of “transformation” statements process the data
– A STORE statement writes the output to the filesystem
• A DUMP statement displays output on the screen
• Logical vs. physical plans:
– All statements are stored and validated as a logical plan
– Once a STORE or DUMP statement is found, the logical
plan is compiled into a physical plan and executed
Example Pig Script
-- Load the content of a file into a pig bag named 'input_lines'
input_lines = LOAD 'CHANGES.txt' AS (line:chararray);

-- Extract words from each line and put them into a pig bag named 'words'
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Filter out any words that are just white space
filtered_words = FILTER words BY word MATCHES '\\w+';

-- Create a group for each word
word_groups = GROUP filtered_words BY word;

-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;

-- Store the results (executes the pig script)
STORE ordered_word_count INTO 'output';
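To make the dataflow easier to follow, the same pipeline can be modelled in plain Python. This is only a sketch of the semantics, not how Pig executes the script: the function name `word_count` and the sample lines are made up for illustration, and splitting on whitespace is an assumption standing in for TOKENIZE.

```python
import re
from collections import Counter

def word_count(lines):
    """Model of the script: tokenize, filter, group, count, order."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per record
    # (assumption: plain whitespace splitting stands in for TOKENIZE)
    words = [w for line in lines for w in line.split()]
    # FILTER ... BY word MATCHES '\\w+': the pattern must match the whole word
    filtered = [w for w in words if re.fullmatch(r"\w+", w)]
    # GROUP ... BY word, then COUNT the records in each group
    counts = Counter(filtered)
    # ORDER ... BY count DESC (ties broken by word only to keep output stable)
    return sorted(((n, w) for w, n in counts.items()), key=lambda t: (-t[0], t[1]))

result = word_count([
    "pig latin is a dataflow language",
    "pig runs on hadoop",
    "-- pig",
])
print(result[0])
```

Each intermediate list plays the role of one alias in the script, which is why the statements can be read top to bottom as a chain of transformations.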
Basic “grunt” Shell Commands
• Help is available
$ pig -h
• Pig supports HDFS commands
grunt> pwd
– put, get, cp, ls, mkdir, rm, mv, etc.
About Pig Scripts
• Pig Latin statements grouped together in a file
• Can be run from the command line or the
shell
• Support parameter passing
• Comments are supported
– Inline comments '--'
– Block comments /* */
Simple Data Types
Type        Description
int         4-byte integer
long        8-byte integer
float       4-byte (single precision) floating point
double      8-byte (double precision) floating point
bytearray   Array of bytes; blob
chararray   String (“hello world”)
boolean     True/False (case insensitive)
datetime    A date and time
biginteger  Java BigInteger
bigdecimal  Java BigDecimal
Complex Data Types

Type   Description
Tuple  Ordered set of fields (a “row / record”)
Bag    Collection of tuples (a “resultset / table”)
Map    A set of key-value pairs; keys must be of type chararray
Pig Data Formats
• BinStorage
– Loads and stores data in machine-readable (binary) format
• PigStorage
– Loads and stores data as structured, field delimited text files
• TextLoader
– Loads unstructured data in UTF-8 format
• PigDump
– Stores data in UTF-8 format
• YourOwnFormat!
– via UDFs
Loading Data Into Pig
• Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
• Each LOAD statement defines a new bag
– Each tuple in the bag can have multiple fields (atoms)
– Each field can be referenced by name or position ($n)
• A bag is immutable
• A bag can be aliased and referenced later
Input And Output
• STORE
– Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
– PigStorage can be used to specify a field delimiter
• DUMP
– Write output to screen
grunt> DUMP processed;
Relational Operators
• FOREACH
– Applies expressions to every record in a bag
• FILTER
– Filters by expression
• GROUP
– Collect records with the same key
• ORDER BY
– Sorting
• DISTINCT
– Removes duplicates
FOREACH . . .GENERATE
• Use the FOREACH …GENERATE operator to work
with rows of data, call functions, etc.
• Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
FILTER. . .BY
• Use the FILTER operator to restrict tuples or rows
of data
• Basic syntax:
alias2 = FILTER alias1 BY expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
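The filter expression in the example can be checked by hand with a small Python model. This is a sketch of the predicate only; the helper name `keep` is hypothetical, and col1, col2, col3 are taken to be the three positional fields.

```python
alias1 = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def keep(row):
    """Mirror of: (col1 == 8) OR (NOT (col2 + col3 > col1))."""
    col1, col2, col3 = row
    return col1 == 8 or not (col2 + col3 > col1)

# FILTER keeps only the tuples for which the expression is true
alias2 = [row for row in alias1 if keep(row)]
print(alias2)
```

For instance, (7,2,5) is kept because 2+5 is not greater than 7, while (1,2,3) is dropped because 2+3 is greater than 1.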
GROUP. . .ALL
• Use the GROUP operator to group data; GROUP…ALL puts every tuple into a single group
– Use GROUP when only one relation is involved
– Use COGROUP when multiple relations are involved
• Basic syntax:
alias2 = GROUP alias1 ALL;
alias2 = GROUP alias1 BY expression;
• Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F) (Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
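The nested shape of the grouped output (a key followed by a bag of the original tuples) can be modelled in Python as follows. This is a sketch under stated assumptions: `group_by` and `group_all` are hypothetical helper names, Python lists stand in for bags, and the groups are sorted by key only to make the output stable.

```python
from collections import defaultdict

alias1 = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

def group_by(bag, key_index):
    """Model of GROUP ... BY: one (key, bag-of-tuples) record per distinct key."""
    groups = defaultdict(list)
    for row in bag:
        groups[row[key_index]].append(row)
    return sorted(groups.items())

def group_all(bag):
    """Model of GROUP ... ALL: a single group holding every tuple."""
    return [("all", list(bag))]

alias2 = group_by(alias1, 1)
for key, rows in alias2:
    print(key, rows)
```

Note that grouping does not aggregate anything by itself; a later FOREACH over the groups is what computes counts or averages.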
ORDER. . .BY
• Use the ORDER…BY operator to sort a relation
based on one or more fields
• Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
DISTINCT. . .
• Use the DISTINCT operator to remove
duplicate tuples in a relation.
• Basic syntax:
alias2 = DISTINCT alias1;

• Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
Relational Operators
• FLATTEN
– Used to un-nest tuples as well as bags
• INNER JOIN
– Used to perform an inner join of two or more relations based on
common field values
• OUTER JOIN
– Used to perform left, right or full outer joins
• SPLIT
– Used to partition the contents of a relation into two or more
relations
• SAMPLE
– Used to select a random data sample; the stated sample size is a fraction of the rows, not an exact count
INNER JOIN. . .
• Use the JOIN operator to perform an inner equi-join
of two or more relations based on common
field values
• Without the LEFT, RIGHT or FULL keyword, the JOIN operator performs an inner join
• Inner joins ignore null keys
– Filter null keys before the join
• JOIN and COGROUP operators perform similar
functions
– JOIN creates a flat set of output records
– COGROUP creates a nested set of output records
INNER JOIN Example
Join Alias1 by Col1 to Alias2 by Col1
DUMP Alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
DUMP Alias2;
(2,4) (8,9) (1,3) (2,7) (2,9) (4,6) (4,9)
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3) (4,2,1,4,6) (4,3,3,4,6) (4,2,1,4,9) (4,3,3,4,9) (8,3,4,8,9) (8,4,3,8,9)
OUTER JOIN. . .
• Use the OUTER JOIN operator to perform left, right, or full
outer joins
– Pig Latin syntax closely adheres to the SQL standard
• The keyword OUTER is optional
– keywords LEFT, RIGHT and FULL will imply left outer, right outer and
full outer joins respectively
• Outer joins will only work provided the relations which need
to produce nulls (in the case of non-matching keys) have
schemas
• Outer joins will only work for two-way joins
– To perform a multi-way outer join perform multiple two-way outer
join statements
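One way to see why an outer join needs a schema on the side that produces nulls is to model it in Python. This is a sketch under stated assumptions: `left_outer_join` is a hypothetical helper and `right_width` stands in for the number of fields declared in the right-hand schema.

```python
def left_outer_join(left, right, right_width, left_key=0, right_key=0):
    """Keep every left tuple; pad with None when no right tuple matches."""
    out = []
    for l in left:
        key = l[left_key]
        matches = [r for r in right if key is not None and r[right_key] == key]
        if matches:
            out.extend(l + r for r in matches)
        else:
            # The number of nulls to emit comes from the right-hand schema
            out.append(l + (None,) * right_width)
    return out

left = [(1, "a"), (7, "b")]
right = [(1, 3), (1, 4)]
joined = left_outer_join(left, right, right_width=2)
print(joined)
```

Without a known field count for the right-hand relation there would be no way to decide how many nulls to append to the unmatched tuple.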
User-Defined Functions
• Natively written in Java, packaged as a jar file
– Other languages include Jython, JavaScript, Ruby,
Groovy, and Python
• Register the jar with the REGISTER statement
• Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE
• DEFINE can be used to work with UDFs and also
streaming commands
– Useful when dealing with complex input/output
formats
/* read and write comma-delimited data */
DEFINE Y `stream.pl` INPUT(stdin USING PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;

/* Define UDFs to a more readable format */
DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;
Apache Hive

Based on Slides by Adam Shook

What Is Hive?
• Developed by Facebook and a top-level Apache
project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-
Java programmers via SQL like queries
• Built on HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs
that run on the cluster
• Enables easy data summarization, ad-hoc reporting
and querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch
processing of large datasets
• Not an OLTP or real-time system
• Latency and throughput are both high
compared to a traditional RDBMS
– Even when dealing with relatively small data
( <100 MB )
Data Hierarchy
• Hive is organised hierarchically into:
– Databases: namespaces that separate tables and other
objects
– Tables: homogeneous units of data with the same
schema
• Analogous to tables in an RDBMS
– Partitions: determine how the data is stored
• Allow efficient access to subsets of the data
– Buckets/clusters
• For sub-sampling within a partition
• Join optimization
HiveQL
• HiveQL / HQL provides the basic SQL-like operations:
– Select columns using SELECT
– Filter rows using WHERE
– JOIN between tables
– Evaluate aggregates using GROUP BY
– Store query results into another table
– Download results to a local directory (i.e., export from
HDFS)
– Manage tables and queries with CREATE, DROP, and
ALTER
Primitive Data Types
Type                            Comments
TINYINT, SMALLINT, INT, BIGINT  1, 2, 4 and 8-byte integers
BOOLEAN                         TRUE/FALSE
FLOAT, DOUBLE                   Single and double precision real numbers
STRING                          Character string
TIMESTAMP                       Unix-epoch offset or datetime string
DECIMAL                         Arbitrary-precision decimal
BINARY                          Opaque; ignore these bytes
Complex Data Types
Type    Comments
STRUCT  A collection of elements
        If S is of type STRUCT {a INT, b INT}:
        S.a returns element a
MAP     Key-value tuple
        If M is a map from 'group' to GID:
        M['group'] returns value of GID
ARRAY   Indexed list
        If A is an array of elements ['a','b','c']:
        A[0] returns 'a'
HiveQL Limitations
• HQL only supports equi-joins, outer joins, left semi-
joins
• Because it is only a shell for Map-Reduce, complex
queries can be hard to optimise
• Missing large parts of full SQL specification (in older Hive versions; some were added in later releases):
– HAVING clause in SELECT
– Correlated sub-queries
– Sub-queries outside FROM clauses
– Updatable or materialized views
– Stored procedures
Hive Metastore
• Stores Hive metadata
• Default metastore database uses Apache Derby
• Various configurations:
– Embedded (in-process metastore, in-process database)
• Mainly for unit tests
– Local (in-process metastore, out-of-process database)
• Each Hive client connects to the metastore directly
– Remote (out-of-process metastore, out-of-process
database)
• Each Hive client connects to a metastore server, which connects
to the metadata database itself
Hive Warehouse
• Hive tables are stored in the Hive
“warehouse”
– Default HDFS location: /user/hive/warehouse
• Tables are stored as sub-directories in the
warehouse directory
• Partitions are subdirectories of tables
• External tables are supported in Hive
• The actual data is stored in flat files
Hive Schemas
• Hive is schema-on-read
– Schema is only enforced when the data is read (at
query time)
– Allows greater flexibility: same data can be read
using multiple schemas
• Contrast with an RDBMS, which is schema-on-
write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
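Schema-on-read can be illustrated with a short Python model in which the stored lines never change and the schema is applied only when they are read. This is a sketch under stated assumptions: `read_with_schema`, the sample lines and both schemas are hypothetical, and returning None for a value that does not fit its declared type is a modelling choice for this sketch.

```python
from datetime import date

# The stored data: plain tab-delimited text, untouched by any schema
raw_lines = ["1\t2008-06-08\tUS", "2\tnot-a-date\tUK"]

def read_with_schema(lines, schema):
    """Apply (name, converter) pairs to each tab-delimited line at read time."""
    rows = []
    for line in lines:
        row = {}
        for (name, convert), value in zip(schema, line.split("\t")):
            try:
                row[name] = convert(value)
            except ValueError:
                row[name] = None  # value does not fit the schema: read as null
        rows.append(row)
    return rows

# Two different schemas over the same bytes
as_strings = read_with_schema(raw_lines, [("id", str), ("day", str), ("country", str)])
as_typed = read_with_schema(raw_lines, [("id", int), ("day", date.fromisoformat), ("country", str)])
print(as_typed[1])
```

Loading is cheap because nothing is validated up front; the cost of parsing and of discovering bad values is paid by every query instead.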
Create Table Syntax
CREATE TABLE table_name
(col1 data_type,
col2 data_type,
col3 data_type,
col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Simple Table
CREATE TABLE page_view
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
More Complex Table
CREATE TABLE employees
(name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables
• CREATE TABLE
– LOAD: file moved into Hive’s data warehouse
directory
– DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
– LOAD: no files moved
– DROP: only metadata deleted
– Use this when sharing with other Hadoop
applications, or when you want to use multiple
schemas on the same data
Partitioning
• Can make some queries faster
• Divide data based on partition column
• Use PARTITIONED BY clause when creating table
• Use PARTITION clause when loading data
• SHOW PARTITIONS will show a table’s
partitions
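The reason partitioning can speed up queries is that a filter on the partition column lets whole partitions be skipped without reading them. A minimal Python model of that idea follows; it is a sketch, with a hypothetical `query` helper and a made-up layout keyed by (dt, country), not a description of Hive's planner.

```python
# Hypothetical layout: one entry per partition directory, keyed by (dt, country)
partitions = {
    ("2008-06-08", "US"): [("u1", "/home"), ("u2", "/cart")],
    ("2008-06-08", "UK"): [("u3", "/home")],
    ("2008-06-09", "US"): [("u4", "/home")],
}

def query(partitions, dt=None, country=None):
    """Return (rows, partitions_read), skipping partitions the filter rules out."""
    rows, read = [], 0
    for (p_dt, p_country), data in partitions.items():
        if dt is not None and p_dt != dt:
            continue
        if country is not None and p_country != country:
            continue
        read += 1
        rows.extend(data)
    return rows, read

rows, read = query(partitions, dt="2008-06-08", country="US")
all_rows, all_read = query(partitions)
print(read, all_read)
```

A query that does not filter on the partition columns still has to read every partition, which is why only some queries get faster.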
Bucketing
• Can speed up queries that involve sampling
the data
– Sampling works without bucketing, but Hive has to
scan the entire dataset
• Use CLUSTERED BY when creating table
– For sorted buckets, add SORTED BY
• To query a sample of your data, use
TABLESAMPLE
Browsing Tables And Partitions
Command                                          Comments
SHOW TABLES;                                     Show all the tables in the database
SHOW TABLES 'page.*';                            Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                       Show the partitions of the page_view table
DESCRIBE page_view;                              List columns of the table
DESCRIBE EXTENDED page_view;                     More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');  List information about a partition
Loading Data
• Use LOAD DATA to load data from a file or
directory
– Will read from HDFS unless LOCAL keyword is specified
– Will append data unless OVERWRITE specified
– PARTITION required if destination table is partitioned

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US');
Inserting Data
• Use INSERT to load data from a Hive query
– Will append data unless OVERWRITE specified
– PARTITION required if destination table is
partitioned

FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
Loading And Inserting Data: Summary

Use this                  For this purpose
LOAD                      Load data from a file or directory
INSERT                    Load data from a query
                          • One partition at a time
                          • Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)    Insert data while creating a table
Add/modify external file  Load new data into external table
Sample Select Clauses
• Select from a single table
SELECT *
FROM sales
WHERE amount > 10 AND
region = "US";
• Select from a partitioned table
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31';
Relational Operators
• ALL and DISTINCT
– Specify whether duplicate rows should be returned
– ALL is the default (all matching rows are returned)
– DISTINCT removes duplicate rows from the result set
• WHERE
– Filters by expression
– Older Hive versions do not support IN, EXISTS or sub-queries in the WHERE
clause
• LIMIT
– Indicates the number of rows to be returned
Relational Operators
• GROUP BY
– Group data by column values
– Select statement can only include columns included
in the GROUP BY clause, plus aggregate functions
• ORDER BY / SORT BY
– ORDER BY performs total ordering
• Slow on large outputs because a single reducer does the final sort
– SORT BY performs partial ordering
• Sorts output from each reducer
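The difference between total and partial ordering can be modelled in a few lines of Python. This is a sketch under stated assumptions: `order_by` and `sort_by` are hypothetical helpers, and rows are handed to reducers round-robin purely for illustration (how rows actually reach each reducer is not modelled here).

```python
def order_by(rows):
    """Total ordering: a single sorted output."""
    return sorted(rows)

def sort_by(rows, num_reducers):
    """Partial ordering: each reducer sorts only the rows it received."""
    # Assumption: round-robin distribution of rows to reducers
    shares = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(share) for share in shares]

rows = [5, 1, 4, 2, 3, 6]
total = order_by(rows)
per_reducer = sort_by(rows, 2)
combined = [value for part in per_reducer for value in part]
print(total, per_reducer)
```

Each reducer's output is sorted, but reading the outputs one after another does not give a globally sorted result, which is the trade-off SORT BY makes for parallelism.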
Advanced Hive Operations
• JOIN
– If every table uses the same column in its join clauses, then
only one MapReduce job will run
• This results in 1 MapReduce job:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key;
• This results in 2 MapReduce jobs:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key2 = c.key;

– If multiple tables are joined, put the biggest table last and
the reducer will stream the last table, buffer the others
– Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b on a.key = b.key;
Advanced Hive Operations
• JOIN
– Do not specify join conditions in the WHERE clause
• Hive does not know how to optimise such queries
• Will compute a full Cartesian product before filtering it
• Join Example

SELECT
a.ymd, a.price_close, b.price_close
FROM stocks a
JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
b.symbol = 'IBM' AND
a.ymd > '2010-01-01';
Hive Stinger
• MPP-style execution of Hive queries
• Available since Hive 0.13
• No MapReduce
• We will talk about this more when we get to
SQL on Hadoop
