Big Data Notes Pig
Introduction to Pig:
Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce and offers a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into a series of map and reduce tasks, but these tasks are not visible to the programmers, which is what provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
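As a quick illustration, a typical Pig Latin script looks like the following (the file names and paths here are made up for the example); the Pig Engine turns these three statements into MapReduce jobs behind the scenes.
grunt> logs = LOAD 'hdfs://localhost:9000/pig_data/logs.txt' USING PigStorage(',') AS (user:chararray, bytes:int);
grunt> big = FILTER logs BY bytes > 1000;
grunt> STORE big INTO 'hdfs://localhost:9000/pig_output/big_logs' USING PigStorage(',');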
Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces the development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be expressed in only about 10 lines of Pig Latin. Programmers who already know SQL need less effort to learn Pig Latin.
Features of Pig:
1) Ease of programming
It uses a multi-query approach, which reduces the length of the code.
Pig Latin is an SQL-like language.
It provides many built-in operators.
It provides nested data types (tuples, bags, maps).
2) Optimization opportunities
The way tasks are encoded allows the system to optimize their execution automatically.
3) Extensibility
Users can create their own functions (UDFs) for special-purpose processing.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators such as sort, filter and join (see the example after this list).
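A minimal sketch of these operators at the Grunt shell, assuming two hypothetical input files employees.txt and departments.txt:
grunt> emp = LOAD 'employees.txt' USING PigStorage(',') AS (id:int, name:chararray, deptid:int, salary:int);
grunt> dept = LOAD 'departments.txt' USING PigStorage(',') AS (deptid:int, deptname:chararray);
grunt> high = FILTER emp BY salary > 40000;            -- filter
grunt> sorted = ORDER high BY salary DESC;             -- sort
grunt> joined = JOIN sorted BY deptid, dept BY deptid; -- join
grunt> DUMP joined;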
Application of pig:
Pig Architecture:
The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The components of the Apache Pig architecture are described below.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
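These stages can be inspected from the Grunt shell with Pig's EXPLAIN operator, which prints the logical, physical and MapReduce plans for a relation. A small sketch (the file name is made up for the example):
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> EXPLAIN student;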
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered collection of tuples, possibly with duplicates; each tuple can have a different number of fields. A bag is represented by ‘{}’.
Example − {(Raja, 30), (Mohammad, 45)}
Map
A map (or data map) is a set of key-value pairs. The key needs to be
of type chararray and should be unique. The value might be of any
type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is
no guarantee that tuples are processed in any particular order).
Complex data types are built up from the simple (atomic) types. The complex data types in Pig are summarized below −
Type    Definition                          Code Example
Tuple   An ordered set of fields            (Raja, 30)
Bag     An unordered collection of tuples   {(Raja, 30), (Mohammad, 45)}
Map     A set of key-value pairs            [name#Raja, age#30]
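A small sketch of how these types appear in a schema, assuming a hypothetical input file details.txt:
grunt> details = LOAD 'details.txt' AS (name:chararray,
           scores:bag{t:tuple(subject:chararray, marks:int)},
           info:map[]);
grunt> DESCRIBE details;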
Defining schema:
In the previous section, we saw how to load data into Apache Pig. The loaded data can be written back to the file system using the STORE operator. This section explains how to store data in Apache Pig using the STORE operator.
Syntax:
Given below is the syntax of the Store statement.
STORE Relation_name INTO 'required_directory_path' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following
content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator
as shown below.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS
directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/'
USING PigStorage(',');
Output
After executing the store statement, you will get the following output.
A directory is created with the specified name and the data will be
stored in it.
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt
Features
2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05
13:05:05 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime
AvgMapTime MedianMapTime
job_14459_06 1 0 n/a n/a n/a n/a
MaxReduceTime MinReduceTime AvgReduceTime
MedianReducetime Alias Feature
0 0 0 0 student MAP_ONLY
Output folder
hdfs://localhost:9000/pig_Output/
Input(s): Successfully read 0 records from:
"hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/pig_Output"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1443519499159_0006
2015-10-05 13:06:06,192 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Verification
You can verify the stored data as shown below.
Step 1
First of all, list out the files in the directory named pig_Output using the ls command as shown below.
$ hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing
the store statement.
Step 2
Using the cat command, list the contents of the file named part-m-00000 as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
In MapReduce mode, Pig reads (loads) data from HDFS and stores
the results back in HDFS. Therefore, let us start HDFS and create the
following sample data in HDFS.
Student ID   First Name   Last Name     Phone        City
001          Rajiv        Reddy         9848022337   Hyderabad
002          siddarth     Battacharya   9848022338   Kolkata
003          Rajesh       Khanna        9848022339   Delhi
004          Preethi      Agarwal       9848022330   Pune
005          Trupthi      Mohanthy      9848022336   Bhuwaneshwar
006          Archana      Mishra        9848022335   Chennai
The above dataset contains personal details like id, first name, last name, phone number and city of six students.
Browse through the sbin directory of Hadoop and start yarn and
Hadoop dfs (distributed file system) as shown below.
$ cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
Step 3: Create a Directory in HDFS
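A quick sketch of this step (the local path /home/Hadoop/student_data.txt is an assumption for the example; the HDFS paths are the ones used throughout these notes):
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data/
$ hdfs dfs -put /home/Hadoop/student_data.txt hdfs://localhost:9000/pig_data/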
Now load the data from the file student_data.txt into Pig by
executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray,
lastname:chararray, phone:chararray,
city:chararray );
Relation name − We have stored the data in the schema student.
Input file path − We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Pig operator:
Features of Hive:
Architecture of Hive:
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as Thrift clients, the JDBC driver, and the ODBC driver.
Hive Services
Hive shell:
HiveQL:
JDBC Program
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
JDBC Program
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
con.close();
}
}
$ javac HiveDropDb.java
$ java HiveDropDb
Output:
Drop userdb database successful.
Syntax
Example
Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   string
If you add the option IF NOT EXISTS, Hive ignores the statement
in case the table already exists.
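The statement that produces the output below is presumably the same CREATE TABLE used in the JDBC program further down, i.e. something along these lines:
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
      salary String, designation String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;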
OK
Time taken: 5.905 seconds
hive>
JDBC Program
Given below is the JDBC program to create the employee table. The connection URL, user name and password are assumed defaults for a local HiveServer2 and may differ in your setup.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register the Hive JDBC driver and open a connection (assumed local HiveServer2)
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/userdb", "", "");
      Statement stmt = con.createStatement();

      // execute statement
      stmt.execute("CREATE TABLE IF NOT EXISTS "
         + " employee ( eid int, name String, "
         + " salary String, designation String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\\t'"
         + " LINES TERMINATED BY '\\n'"
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}
$ javac HiveCreateTable.java
$ java HiveCreateTable
Output
Syntax
Example
We will insert the following data into the table. It is a text file
named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin
The following query loads the given text into the table.
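Presumably the query is a LOAD DATA statement along these lines (the LOCAL keyword is used because sample.txt lives on the local file system):
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' INTO TABLE employee;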
OK
Time taken: 15.905 seconds
hive>
JDBC Program
Given below is the JDBC program to load given data into the
table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
$ javac HiveLoadData.java
$ java HiveLoadData
Output:
Data types:
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data types; the default is INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT. The postfix used for a literal of each type is listed below (see the example after the table).
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L
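For instance, the postfixes can be applied directly to literals (this sketch assumes Hive 0.13 or later, which allows SELECT without a FROM clause):
hive> SELECT 10Y, 10S, 10, 10L;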
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
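A short sketch of how these appear in a table definition (the table name str_demo is made up for the example):
hive> CREATE TABLE str_demo (code CHAR(10), description VARCHAR(100));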
Timestamp
Decimals
DECIMAL(precision, scale)
decimal(10,0)
Union Types
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately −10^−308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive hold key-value pairs, similar to Java maps.
Syntax: MAP<primitive_type, data_type>
Structs
A struct groups a fixed set of named fields, each with its own type.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
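A minimal sketch putting the complex types together (the table name complex_demo and its columns are made up for the example):
hive> CREATE TABLE complex_demo (
         name     STRING,
         subjects ARRAY<STRING>,
         marks    MAP<STRING, INT>,
         address  STRUCT<city:STRING, pin:INT>
      );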
Operations in Hive:
Hive provides built-in mathematical functions such as the following.
Return Type   Signature                 Description
BIGINT        floor(double a)           It returns the maximum BIGINT value that is equal to or less than the double.
double        rand(), rand(int seed)    It returns a random number that changes from row to row.
Example
round() function − rounds the value to the nearest whole number (e.g. 3.0).
floor() function − returns the largest whole number not greater than the value.
ceil() function − returns the smallest whole number not less than the value (e.g. 3.0).
A runnable sketch of these three functions is shown below.
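The literal 2.6 is chosen only for illustration, and the queries assume Hive 0.13 or later (SELECT without a FROM clause):
hive> SELECT round(2.6);
3.0
hive> SELECT floor(2.6);
2
hive> SELECT ceil(2.6);
3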
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage of these functions is the same as the SQL aggregate functions.
Return Type   Signature    Description
BIGINT        count(*)     Returns the total number of retrieved rows.
DOUBLE        sum(col)     Returns the sum of the elements in the group.
DOUBLE        avg(col)     Returns the average of the elements in the group.
An example using the employee table is shown after this table.
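A small sketch against the employee table created earlier (the output depends on the rows actually loaded):
hive> SELECT designation, count(*) FROM employee GROUP BY designation;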
Hive vs RDBMS:
RDBMS                                                   Hive
It is used to maintain a database.                      It is used to maintain a data warehouse.
It uses SQL (Structured Query Language).                It uses HQL (Hive Query Language).
Schema is fixed when data is written (schema on write). Schema is applied when data is read (schema on read).
It is suited to transactional (OLTP) workloads.         It is suited to analytical (OLAP) workloads.
Example of Hive: