Tutorialspoint HBase Pig

Pig syntax, queries and functions

Creating a Table using HBase Shell

You can create a table using the create command. Here, you must
specify the table name and the column family name.
The syntax to create a table in HBase shell is shown below.

create '<table name>','<column family>'

Example

Given below is a sample schema of a table named emp. It has
two column families: "personal data" and "professional data".

Row key | personal data | professional data

You can create this table in HBase shell as shown below.

hbase(main):002:0> create 'emp', 'personal data', 'professional data'

Inserting Data using HBase Shell


To create (insert) data in an HBase table, the put command is used.
Using the put command, you can insert rows into a table. Its syntax is
as follows:

put '<table name>','row1','<colfamily:colname>','<value>'

Inserting the First Row

Let us insert the first row values into the emp table as shown
below.

hbase(main):007:0> put 'emp','1','professional data:salary','50000'


0 row(s) in 0.0240 seconds

Updating Data using HBase Shell


You can update an existing cell value using the put command. To
do so, just follow the same syntax and mention your new value
as shown below.

put '<table name>','<row>','<column family>:<column name>','<new value>'

The newly given value replaces the existing value, updating the
row.

Example
Suppose there is a table in HBase called emp with the following
data.

hbase(main):003:0> scan 'emp'


ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418051555, value = raju
row1 column = personal:city, timestamp = 1418275907, value = Hyderabad
row1 column = professional:designation, timestamp = 14180555,value = manager
row1 column = professional:salary, timestamp = 1418035791555,value = 50000
1 row(s) in 0.0100 seconds

The following command will update the city value of the employee
named ‘Raju’ to Delhi.

hbase(main):002:0> put 'emp','row1','personal:city','Delhi'


0 row(s) in 0.0400 seconds
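Conceptually, an HBase table maps a (row key, column family:qualifier) pair to a set of timestamped versions, and a put simply adds a newer version, whose value wins on read. The following is a minimal Python sketch of that cell model (a toy illustration only, not an HBase client; all names are invented for the example):

```python
# Toy model of HBase cell semantics: each cell keeps timestamped
# versions, and a read returns the value with the newest timestamp.
class ToyHBaseTable:
    def __init__(self):
        # (row key, "family:qualifier") -> {timestamp: value}
        self.cells = {}

    def put(self, row, column, value, timestamp):
        # put never overwrites in place; it adds a new version
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        return versions[max(versions)]  # newest timestamp wins

table = ToyHBaseTable()
table.put('row1', 'personal:city', 'Hyderabad', 1418275907)
table.put('row1', 'personal:city', 'Delhi', 1418275999)  # the "update"
print(table.get('row1', 'personal:city'))  # newest version is returned
```

This is why updating in HBase uses the same put command as inserting: both just write a new version of the cell.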

Reading Data using HBase Shell


The get command and the get() method of HTable class are used to
read data from a table in HBase. Using get command, you can get
a single row of data at a time. Its syntax is as follows:

get '<table name>','row1'

Example

The following example shows how to use the get command. Let
us read the first row of the emp table.

hbase(main):012:0> get 'emp', '1'

Deleting a Specific Cell in a Table


Using the delete command, you can delete a specific cell in a table.
The syntax of delete command is as follows:

delete '<table name>', '<row>', '<column name>', '<time stamp>'

Example
Here is an example of deleting a specific cell. Here we are deleting
the city cell of row 1.

hbase(main):006:0> delete 'emp', '1', 'personal data:city', 1417521848375
0 row(s) in 0.0060 seconds

Deleting All Cells in a Table


Using the deleteall command, you can delete all the cells in a
row. Given below is the syntax of the deleteall command.

deleteall '<table name>', '<row>'

Example

Here is an example of “deleteall” command, where we are


deleting all the cells of row1 of emp table.

hbase(main):007:0> deleteall 'emp','1'


0 row(s) in 0.0240 seconds

Verify the table using the scan command. A snapshot of the table
after deleting the row is given below.

hbase(main):022:0> scan 'emp'

count
You can count the number of rows of a table using
the count command. Its syntax is as follows:

count '<table name>'

After deleting the first row, emp table will have two rows. Verify it
as shown below.

hbase(main):023:0> count 'emp'


2 row(s) in 0.090 seconds
=> 2
Dropping a Table using HBase Shell
Using the drop command, you can delete a table. Before dropping
a table, you have to disable it.

hbase(main):018:0> disable 'emp'


0 row(s) in 1.4580 seconds

hbase(main):019:0> drop 'emp'


0 row(s) in 0.3060 seconds

Apache Pig - Group Operator

The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.

Syntax
Given below is the syntax of the GROUP operator.

grunt> Group_data = GROUP Relation_name BY key;

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);

Now, let us group the records/tuples in the relation by age as
shown below.

grunt> group_data = GROUP student_details by age;

Verification
Verify the relation group_data using the DUMP operator as shown
below.

grunt> Dump group_data;

Output
Then you will get output displaying the contents of the relation
named group_data as shown below. Here you can observe that the
resulting schema has two columns −

• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples, the
student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

You can see the schema of the table after grouping the data using
the describe command as shown below.

grunt> Describe group_data;

group_data: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}
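The result of GROUP — one output tuple per key, paired with a bag of all input tuples sharing that key — can be sketched in plain Python. This is an illustration of the semantics only, not Pig itself, and the sample data is a shortened version of student_details:

```python
# Mimic Pig's GROUP student_details BY age: each output entry is
# (key, bag of tuples whose age field equals that key).
students = [
    (1, 'Rajiv', 21), (2, 'siddarth', 22),
    (3, 'Rajesh', 22), (4, 'Preethi', 21),
]

grouped = {}
for tup in students:
    age = tup[2]                      # the grouping key
    grouped.setdefault(age, []).append(tup)

for age in sorted(grouped):
    print((age, grouped[age]))       # (age, bag of matching tuples)
```

Each printed pair corresponds to one line of the Dump output above: the key first, then the bag of grouped tuples.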
Apache Pig - Join Operator

The JOIN operator is used to combine records from two or more
relations. While performing a join operation, we declare one field (or
a group of fields) from each relation as the key. When these keys
match, the two particular tuples are matched; otherwise the records
are dropped. Joins can be of the following types −

• Self-join
• Inner-join
• Outer-join − left join, right join, and full join

This chapter explains with examples how to use the join operator
in Pig Latin. Assume that we have two files
namely customers.txt and orders.txt in the /pig_data/ directory of
HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the
relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various join operations on these two relations.

Self - join
Self-join is used to join a table with itself as if the table were two
relations, temporarily renaming at least one relation.

Generally, in Apache Pig, to perform self-join, we will load the
same data multiple times, under different aliases (names).
Therefore let us load the contents of the file customers.txt as two
tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

Syntax

Given below is the syntax of performing self-join operation using
the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;

Example

Let us perform self-join operation on the relation customers, by
joining the two relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Verify the relation customers3 using the DUMP operator as shown
below.

grunt> Dump customers3;

Output

It will produce the following output, displaying the contents of the
relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin.
An inner join returns rows when there is a match in both tables.

It creates a new relation by combining column values of two
relations (say A and B) based upon the join-predicate. The query
compares each row of A with each row of B to find all pairs of
rows which satisfy the join-predicate. When the join-predicate is
satisfied, the column values for each matched pair of rows of A
and B are combined into a result row.
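That row-by-row comparison can be sketched directly in Python. This is an illustration of the inner-join semantics only (Pig actually executes joins as distributed MapReduce jobs), using trimmed-down versions of the customers and orders data:

```python
# Nested-loop inner join: compare every row of A (customers) with
# every row of B (orders) and emit the combined row whenever the
# join-predicate is satisfied.
customers = [(1, 'Ramesh'), (2, 'Khilan'), (3, 'kaushik')]
orders = [(102, 3, 3000), (100, 3, 1500), (101, 2, 1560)]

result = []
for cust in customers:
    for order in orders:
        if cust[0] == order[1]:      # join-predicate: id == customer_id
            result.append(cust + order)

for row in result:
    print(row)
```

Customer 1 has no matching order, so no row is emitted for it; customer 3 matches two orders and appears twice, just like kaushik in the coustomer_orders output below.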

Syntax

Here is the syntax of performing inner join operation using
the JOIN operator.

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example
Let us perform inner join operation on the two
relations customers and orders as shown below.

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Verify the relation coustomer_orders using the DUMP operator as
shown below.

grunt> Dump coustomer_orders;

Output

You will get the following output, displaying the contents of the
relation named coustomer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Outer Join

Unlike inner join, outer join returns all the rows from at
least one of the relations. An outer join operation is carried out in
three ways −

• Left outer join
• Right outer join
• Full outer join

Left Outer Join

The left outer join operation returns all rows from the left relation,
even if there are no matches in the right relation.

Syntax

Given below is the syntax of performing left outer join operation
using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform left outer join operation on the two relations
customers and orders as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Verify the relation outer_left using the DUMP operator as shown
below.

grunt> Dump outer_left;

Output

It will produce the following output, displaying the contents of the
relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
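The empty trailing fields in the unmatched rows above are the padding a left outer join adds for the missing right side. A Python sketch of that behaviour (illustrative semantics only, not Pig, with a trimmed-down data set):

```python
# Left outer join: keep every left row; pad with None when no right
# row matches the join key.
customers = [(1, 'Ramesh'), (2, 'Khilan'), (3, 'kaushik')]
orders = [(101, 2, 1560)]            # only customer 2 has an order

result = []
for cust in customers:
    matches = [o for o in orders if o[1] == cust[0]]
    if matches:
        for order in matches:
            result.append(cust + order)
    else:
        result.append(cust + (None, None, None))  # pad unmatched left row

for row in result:
    print(row)
```

A right outer join is the mirror image (keep every right row, pad the left side), and a full outer join pads whichever side is unmatched.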

Right Outer Join

The right outer join operation returns all rows from the right
relation, even if there are no matches in the left relation.

Syntax

Given below is the syntax of performing right outer join operation
using the JOIN operator.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Example

Let us perform right outer join operation on the two
relations customers and orders as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Verify the relation outer_right using the DUMP operator as shown
below.

grunt> Dump outer_right;

Output

It will produce the following output, displaying the contents of the
relation outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in
one of the relations.

Syntax

Given below is the syntax of performing full outer join using
the JOIN operator.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Example

Let us perform full outer join operation on the two
relations customers and orders as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Verify the relation outer_full using the DUMP operator as shown
below.

grunt> Dump outer_full;

Output

It will produce the following output, displaying the contents of the
relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Apache Pig - Split Operator

The SPLIT operator is used to split a relation into two or more
relations.

Syntax
Given below is the syntax of the SPLIT operator.

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation
name student_details as shown below.

student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);

Let us now split the relation into two, one listing the students of
age less than 23, and the other listing the students having age
between 23 and 25.

SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<=25);

Verification

Verify the relations student_details1 and student_details2 using
the DUMP operator as shown below.

grunt> Dump student_details1;

grunt> Dump student_details2;

Output

It will produce the following output, displaying the contents of the
relations student_details1 and student_details2 respectively.

grunt> Dump student_details1;


(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
grunt> Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
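SPLIT behaves like several filters evaluated over the same input, routing each tuple to every relation whose condition it satisfies. A Python sketch of that routing (illustrative only, with a trimmed-down student list):

```python
# Route each tuple by its age field, mirroring
# SPLIT ... INTO student_details1 IF age<23,
#               student_details2 IF (age>22 and age<=25).
students = [(1, 'Rajiv', 21), (5, 'Trupthi', 23), (7, 'Komal', 24)]

under_23 = [s for s in students if s[2] < 23]
from_23_to_25 = [s for s in students if 22 < s[2] <= 25]

print(under_23)
print(from_23_to_25)
```

Note that the conditions are independent: a tuple matching neither condition would be dropped, and one matching both would land in both output relations.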

Apache Pig - Filter Operator

The FILTER operator is used to select the required tuples from a
relation based on a condition.

Syntax
Given below is the syntax of the FILTER operator.

grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation
name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now use the Filter operator to get the details of the
students who belong to the city Chennai.

grunt> filter_data = FILTER student_details BY city == 'Chennai';

Verification

Verify the relation filter_data using the DUMP operator as shown
below.

grunt> Dump filter_data;

Output

It will produce the following output, displaying the contents of the
relation filter_data as follows.

(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)

Apache Pig - Distinct Operator

The DISTINCT operator is used to remove redundant (duplicate)
tuples from a relation.

Syntax
Given below is the syntax of the DISTINCT operator.

grunt> Relation_name2 = DISTINCT Relation_name1;

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai

And we have loaded this file into Pig with the relation
name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now remove the redundant (duplicate) tuples from the
relation named student_details using the DISTINCT operator, and
store it as another relation named distinct_data as shown below.

grunt> distinct_data = DISTINCT student_details;

Verification

Verify the relation distinct_data using the DUMP operator as shown
below.

grunt> Dump distinct_data;

Output

It will produce the following output, displaying the contents of the
relation distinct_data as follows.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

Apache Pig - Foreach Operator

The FOREACH operator is used to generate specified data
transformations based on the column data.
Syntax
Given below is the syntax of FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation
name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from
the relation student_details and store it into another relation
named foreach_data using the foreach operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

Verification

Verify the relation foreach_data using the DUMP operator as shown
below.

grunt> Dump foreach_data;

Output

It will produce the following output, displaying the contents of the
relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)

Apache Pig - Order By Operator

The ORDER BY operator is used to display the contents of a
relation in a sorted order based on one or more fields.

Syntax
Given below is the syntax of the ORDER BY operator.

grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now sort the relation in a descending order based on the
age of the student and store it into another relation
named order_by_data using the ORDER BY operator as shown
below.

grunt> order_by_data = ORDER student_details BY age DESC;

Verification

Verify the relation order_by_data using the DUMP operator as
shown below.

grunt> Dump order_by_data;

Output

It will produce the following output, displaying the contents of the
relation order_by_data.

(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)

Apache Pig - Limit Operator

The LIMIT operator is used to get a limited number of tuples
from a relation.

Syntax
Given below is the syntax of the LIMIT operator.

grunt> Result = LIMIT Relation_name number_of_tuples;

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation
name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us get only the first four tuples of the
relation student_details and store them into another relation
named limit_data using the LIMIT operator as shown below.

grunt> limit_data = LIMIT student_details 4;

Verification

Verify the relation limit_data using the DUMP operator as shown
below.

grunt> Dump limit_data;


Output

It will produce the following output, displaying the contents of the
relation limit_data as follows.

(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)

Eval Functions
Given below is the list of eval functions provided by Apache Pig.

1. AVG() − To compute the average of the numerical values within a bag.
2. BagToString() − To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).
3. CONCAT() − To concatenate two or more expressions of the same type.
4. COUNT() − To get the number of elements in a bag, while counting the number of tuples in a bag.
5. COUNT_STAR() − Similar to the COUNT() function. It is used to get the number of elements in a bag.
6. DIFF() − To compare two bags (fields) in a tuple.
7. IsEmpty() − To check if a bag or map is empty.
8. MAX() − To calculate the highest value for a column (numeric values or chararrays) in a single-column bag.
9. MIN() − To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.
10. PluckTuple() − Using the Pig Latin PluckTuple() function, we can define a string prefix and filter the columns in a relation that begin with the given prefix.
11. SIZE() − To compute the number of elements based on any Pig data type.
12. SUBTRACT() − To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.
13. SUM() − To get the total of the numeric values of a column in a single-column bag.
14. TOKENIZE() − To split a string (which contains a group of words) in a single tuple and return a bag which contains the output of the split operation.
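To make a few of these concrete, here are rough Python analogues of some eval functions applied to a single-column bag. These only illustrate the semantics; Pig's actual functions operate on bags inside relations, and the salary values here are invented sample data:

```python
# Rough Python analogues of a few Pig eval functions over a bag
# (modelled as a list) of salary values.
salaries = [50000, 30000, 40000]

avg = sum(salaries) / len(salaries)             # AVG()
total = sum(salaries)                           # SUM()
highest, lowest = max(salaries), min(salaries)  # MAX(), MIN()
count = len(salaries)                           # COUNT()
tokens = 'word1 word2 word3'.split()            # TOKENIZE() on a chararray

print(avg, total, highest, lowest, count, tokens)
```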
