
Pig Programming:

Introduction to pig:
Pig represents Big Data as data flows. Pig is a high-level platform or
tool which is used to process large datasets. It provides a high level
of abstraction for processing over MapReduce, and a high-level
scripting language, known as Pig Latin, which is used to develop
data analysis code. To process data stored in HDFS, programmers
write scripts using the Pig Latin language. Internally, the Pig Engine
(a component of Apache Pig) converts all these scripts into map and
reduce tasks, but this is not visible to programmers, which is how the
high level of abstraction is provided. Pig Latin and the Pig Engine are
the two main components of the Apache Pig tool. The results of Pig
are always stored in HDFS.
Need of Pig: One limitation of MapReduce is that the development
cycle is very long. Writing the mapper and reducer, compiling and
packaging the code, submitting the job, and retrieving the output is
time-consuming. Apache Pig reduces development time through its
multi-query approach. Pig is also beneficial for programmers who do
not come from a Java background: roughly 200 lines of Java code can
often be expressed in about 10 lines of Pig Latin. Programmers with
SQL knowledge need less effort to learn Pig Latin. In short:
It uses a multi-query approach, which reduces the length of the code.
Pig Latin is a SQL-like language.
It provides many built-in operators.
It provides nested data types (tuples, bags, maps). A short example is given below.
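For instance, the following short Pig Latin sketch filters and aggregates log records; the file name and schema are hypothetical, and the equivalent hand-written MapReduce job would need considerably more Java code.

logs = LOAD 'hdfs://localhost:9000/pig_data/logs.txt' USING PigStorage('\t')
       AS (user:chararray, url:chararray, time:long);
-- keep only records whose URL mentions 'error'
errors = FILTER logs BY url MATCHES '.*error.*';
-- count error records per user
grouped = GROUP errors BY user;
counts = FOREACH grouped GENERATE group AS user, COUNT(errors) AS total;
DUMP counts;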

Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's
researchers. At that time, the main idea behind Pig was to execute
MapReduce jobs on extremely large datasets. In 2007, it moved to the
Apache Software Foundation (ASF), which made it an open-source project.
The first version (0.1) of Pig came out in 2008. The latest version of
Apache Pig is 0.17, released in 2017.
Features of Pig:
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for
non-programmers. Pig makes this process easy: in Pig, the queries are
converted to MapReduce jobs internally.

2) Optimization opportunities
The way tasks are encoded permits the system to optimize their
execution automatically, allowing the user to focus on semantics
rather than efficiency.

3) Extensibility
Users can write user-defined functions (UDFs) containing their own
logic to execute over the data set.

4) Flexible
It can easily handle structured as well as unstructured data.

5) In-built operators
It contains various types of operators such as sort, filter, and join.

Applications of Pig:

 Pig scripting is used for exploring large datasets.
 It provides support for ad-hoc queries across large datasets.
 It is used in prototyping algorithms for processing large datasets.
 It is used where time-sensitive data loads must be processed.
 It is used for collecting large amounts of data in the form of search logs and web crawls.
 It is used where analytical insights are needed through sampling.

Pig Architecture:
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high-level data processing language which provides a rich set of data types
and operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write a Pig script
in the Pig Latin language and execute it using any of the execution mechanisms
(Grunt shell, UDFs, embedded). After execution, these scripts go through a
series of transformations applied by the Pig framework to produce the desired
output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs,
which makes the programmer's job easy. The architecture of Apache Pig
consists of the components described below.

Apache Pig Components

There are various components in the Apache Pig framework. Let us
take a look at the major components.

Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce
jobs.

Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted
order, where they are executed to produce the desired results.

Pig Latin Data Model


The data model of Pig Latin is fully nested and allows complex
non-atomic data types such as map and tuple.

Atom
Any single value in Pig Latin, irrespective of its data type, is known
as an Atom. It is stored as a string and can be used as a string or a
number. int, long, float, double, chararray, and bytearray are the atomic values
of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’

Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields
can be of any type. A tuple is similar to a row in a table of RDBMS.

Example − (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of
tuples (non-unique) is known as a bag. Each tuple can have any
number of fields (flexible schema). A bag is represented by ‘{}’. It is
similar to a table in RDBMS, but unlike a table in RDBMS, it is not
necessary that every tuple contains the same number of fields or that
the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})

Map

A map (or data map) is a set of key-value pairs. The key needs to be
of type chararray and should be unique. The value might be of any
type. It is represented by ‘[]’
Example − [name#Raja, age#30]

Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is
no guarantee that tuples are processed in any particular order).

Pig data types:


The following table summarizes Pig's primitive and complex data types:

Type        Description               Example
int         Signed 32-bit integer     2
long        Signed 64-bit integer     15L or 15l
float       32-bit floating point     2.5f or 2.5F
double      64-bit floating point     1.5 or 1.5e2 or 1.5E2
chararray   Character array (string)  hello javatpoint
bytearray   BLOB (byte array)
tuple       Ordered set of fields     (12,43)
bag         Collection of tuples      {(12,43),(54,28)}
map         Set of key-value pairs    [open#apache]

 int − Signed 32-bit integer, similar to Integer in Java.
 long − Signed 64-bit integer, similar to Long in Java.
 float − Signed 32-bit floating point, similar to Float in Java.
 double − 64-bit floating point, similar to Double in Java.
 chararray − An array of characters in Unicode UTF-8 format;
comparable to Java's String.
 bytearray − Represents a blob of bytes. When the type of a field
is not specified, it defaults to bytearray.
 boolean − A value that is either true or false.

Complex Data type

Complex data types are built on top of the primitive types. The
following are the complex data types −

Data Type   Definition                            Code                    Example
Tuple       An ordered set of fields, written     (field [, field ...])   (1,2)
            with parentheses.
Bag         A collection of tuples, written       {tuple [, tuple ...]}   {(1,2), (3,4)}
            with curly braces.
Map         A set of key-value pairs, written     [key#value]             ['keyname'#'valuename']
            with square brackets.

 Key − Used to look up an element; the key must be unique and
of type chararray.
 Value − Any type of data can be stored as a value, and each key
has a particular value associated with it. A map is written using
square brackets, with a hash (#) between a key and its value and
commas separating the key-value pairs.
 Null Values − A null value means the value is missing or unknown,
and it can apply to any data type. Pig handles null values in a way
similar to SQL. Pig produces nulls when data is missing or an error
occurs during processing. Null can also be used as a placeholder
value of your choice.

Defining a schema:
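A schema attaches names and types to the fields of a relation, typically in the AS clause of a LOAD statement. A minimal illustrative sketch (the file name and fields here are hypothetical), checked with the DESCRIBE operator:

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
          AS (id:int, name:chararray, city:chararray);
-- print the schema attached to the relation
DESCRIBE student;
-- student: {id: int,name: chararray,city: chararray}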

storing data through pig:

In the previous chapter, we learnt how to load data into Apache Pig.
You can store the loaded data in the file system using
the store operator. This chapter explains how to store data in Apache
Pig using the Store operator.
Syntax:
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING
function];
Example
Assume we have a file student_data.txt in HDFS with the following
content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator
as shown below.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS
directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ '
USING PigStorage (',');

Output
After executing the store statement, you will get the following output.
A directory is created with the specified name and the data will be
stored in it.
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0      Hadoop  2015-10-05 13:03:03  2015-10-05 13:05:05  UNKNOWN
Success!
Job Stats (time in seconds):
JobId         Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime
job_14459_06  1     0        n/a         n/a         n/a         n/a
MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias    Feature
0              0              0              0                 student  MAP_ONLY
Output folder
hdfs://localhost:9000/pig_Output/
Input(s): Successfully read 0 records from:
"hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/pig_Output"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1443519499159_0006
2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Verification
You can verify the stored data as shown below.
Step 1
First of all, list out the files in the directory named pig_output using
the ls command as shown below.
hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup       0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup     224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing
the store statement.
Step 2
Using cat command, list the contents of the file named part-m-
00000 as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai

Reading data through pig:

In general, Apache Pig works on top of Hadoop. It is an analytical
tool that analyzes large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we first have to load the data
into Apache Pig. This chapter explains how to load data into Apache
Pig from HDFS.
Preparing HDFS

In MapReduce mode, Pig reads (loads) data from HDFS and stores
the results back in HDFS. Therefore, let us start HDFS and create the
following sample data in HDFS.

Student ID   First Name   Last Name     Phone        City
001          Rajiv        Reddy         9848022337   Hyderabad
002          siddarth     Battacharya   9848022338   Kolkata
003          Rajesh       Khanna        9848022339   Delhi
004          Preethi      Agarwal       9848022330   Pune
005          Trupthi      Mohanthy      9848022336   Bhuwaneshwar
006          Archana      Mishra        9848022335   Chennai

The above dataset contains personal details like id, first name, last
name, phone number and city, of six students.

Step 1: Verifying Hadoop

First of all, verify the installation using the hadoop version command, as
shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the PATH
variable, then you will get the following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using
/home/Hadoop/hadoop/share/hadoop/common/hadoop
common-2.6.0.jar
Step 2: Starting HDFS

Browse through the sbin directory of Hadoop and start yarn and
Hadoop dfs (distributed file system) as shown below.
$ cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-namenode-
localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-datanode-
localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-
localhost.localdomain.out
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoopresourcemanager-
localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-
localhost.localdomain.out
Step 3: Create a Directory in HDFS

In Hadoop DFS, you can create directories using the mkdir
command. Create a new directory in HDFS with the name Pig_Data
in the required path as shown below.
$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data
Step 4: Placing the data in HDFS
The input file of Pig contains each tuple/record in individual lines.
And the entities of the record are separated by a delimiter (In our
example we used “,”).
In the local file system, create an input file student_data.txt containing
data as shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Now, move the file from the local file system to HDFS
using put command as shown below. (You can
use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt
hdfs://localhost:9000/pig_data/
Verifying the file
You can use the cat command to verify whether the file has been
moved into the HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat
hdfs://localhost:9000/pig_data/student_data.txt

Output

You can see the content of the file as shown below.


15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load
native-hadoop
library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
The Load Operator
You can load data into Apache Pig from the file system (HDFS or
local) using the LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator.
On the left-hand side, we need to mention the name of the
relation where we want to store the data, and on the right-hand side,
we have to define how we store the data. Given below is the syntax of
the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Where,
relation_name − We have to mention the relation in which we want
to store the data.
Input file path − We have to mention the HDFS directory where the
file is stored. (In MapReduce mode)
function − We have to choose a function from the set of load
functions provided by Apache Pig (BinStorage, JsonLoader,
PigStorage, TextLoader).
Schema − We have to define the schema of the data. We can define
the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − If we load the data without specifying a schema, the columns
are addressed positionally as $0, $1, $2, and so on.
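For instance, here is a brief illustrative sketch (reusing the student_data.txt file from this chapter) of loading data without a schema and projecting columns by position:

student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
-- $1 is the first name and $4 is the city, counting fields from $0
names = FOREACH student GENERATE $1, $4;
DUMP names;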
Example
As an example, let us load the data in student_data.txt in Pig under
the schema named Student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in
MapReduce mode as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.

15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main -
Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main -
Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils
- Default bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main]
INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000
grunt>

Execute the Load Statement

Now load the data from the file student_data.txt into Pig by
executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray,
lastname:chararray, phone:chararray,
city:chararray );

Following is the description of the above statement.

Relation name      We have stored the data in the relation student.

Input file path    We are reading data from the file student_data.txt,
                   which is in the /pig_data/ directory of HDFS.

Storage function   We have used the PigStorage() function. It loads and
                   stores data as structured text files. It takes as a
                   parameter the delimiter that separates the entities of
                   a tuple; by default, the delimiter is ‘\t’.

Schema             We have stored the data using the following schema:

                   column     id    firstname   lastname    phone       city
                   datatype   int   chararray   chararray   chararray   chararray

Pig operators:
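Pig Latin provides relational operators such as FILTER, FOREACH...GENERATE, GROUP, ORDER BY, and DUMP/STORE. A minimal illustrative sketch, assuming the student relation loaded earlier in this chapter:

-- students from Pune only
students_pune = FILTER student BY city == 'Pune';
-- number of students per city
by_city = GROUP student BY city;
city_counts = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
ordered = ORDER city_counts BY total DESC;
DUMP ordered;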

Performing inner and outer joins in Pig:
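An illustrative sketch of the JOIN operator, assuming two hypothetical comma-delimited files customers.txt and orders.txt in HDFS that share a customer id column:

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
            AS (cid:int, name:chararray, city:chararray);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
         AS (oid:int, cid:int, amount:int);
-- inner join: only customers that have orders
inner_join = JOIN customers BY cid, orders BY cid;
-- left outer join: all customers, with nulls where no matching order exists
left_join = JOIN customers BY cid LEFT OUTER, orders BY cid;
-- RIGHT OUTER and FULL OUTER work the same way
DUMP inner_join;
DUMP left_join;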


Introduction to Hive:

Hive is a data warehouse system which is used to analyze structured
data. It is built on top of Hadoop and was developed by Facebook.

Hive provides the functionality of reading, writing, and managing
large datasets residing in distributed storage. It runs SQL-like queries,
called HQL (Hive Query Language), which are internally converted
into MapReduce jobs.

Using Hive, we can skip the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language
(DDL), Data Manipulation Language (DML), and User Defined
Functions (UDFs).

Features of Hive:

o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly
transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and
HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop
ecosystem.
o It supports user-defined functions (UDFs) through which users
can plug in their own functionality.
Application of Hive:

Architecture of Hive:

Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++.
It supports different types of clients such as:-

o Thrift Server - It is a cross-language service provider platform
that serves requests from all programming languages that
support Thrift.
o JDBC Driver - It is used to establish a connection between Hive
and Java applications. The JDBC driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows applications that support the ODBC
protocol to connect to Hive.

Hive Services

The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell
where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative
to the Hive CLI. It provides a web-based GUI for executing
Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the
structural information of the various tables and partitions in the
warehouse. It also includes metadata about columns and their
types, the serializers and deserializers used to read and write
data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as the Apache Thrift Server. It
accepts requests from different clients and forwards them to the
Hive Driver.
o Hive Driver - It receives queries from different sources such as
the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers
the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the
query and perform semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - The optimizer generates the execution
plan in the form of a DAG of MapReduce tasks and HDFS tasks.
The execution engine then executes these tasks in the order of
their dependencies.
Components of Hive:

Hive shell:
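A brief illustrative sketch of working in the Hive shell; the database name is hypothetical:

$ hive
hive> SHOW DATABASES;
hive> USE userdb;
hive> SHOW TABLES;
hive> quit;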

HiveQL:
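HiveQL (HQL) is Hive's SQL-like query language; each query is compiled into one or more MapReduce jobs. An illustrative sketch, assuming the employee table created later in these notes:

hive> SELECT designation, COUNT(*) AS total
    > FROM employee
    > GROUP BY designation;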

Hive Databases and Tables:


Create Database is a statement used to create a database in
Hive. A database in Hive is a namespace or a collection of tables.
The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause which suppresses the
error if a database with the same name already exists. We can use
SCHEMA in place of DATABASE in this command. The following
query is executed to create a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;

The following query is used to verify a databases list:

hive> SHOW DATABASES;


default
userdb

JDBC Program

The JDBC program to create a database is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args)
         throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      // execute the DDL statement
      stmt.execute("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}

Save the program in a file named HiveCreateDb.java. The


following commands are used to compile and execute this
program.

$ javac HiveCreateDb.java
$ java HiveCreateDb

Output:

Database userdb created successfully.

Drop Database Statement

Drop Database is a statement that drops all the tables and
deletes the database. Its syntax is as follows:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];
The following queries are used to drop a database. Let us assume
that the database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;

The following query drops the database using CASCADE. It means
dropping the respective tables before dropping the database.
hive> DROP DATABASE IF EXISTS userdb CASCADE;

The following query drops the database using SCHEMA.


hive> DROP SCHEMA userdb;

This clause was added in Hive 0.6.

JDBC Program

The JDBC program to drop a database is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropDb {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args)
         throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      // execute the DDL statement
      stmt.execute("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");
      con.close();
   }
}

Save the program in a file named HiveDropDb.java. Given below


are the commands to compile and execute this program.

$ javac HiveDropDb.java
$ java HiveDropDb

Output:
Drop userdb database successful.

Create Table is a statement used to create a table in Hive. The


syntax and example are as follows:

Syntax

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
   [(col_name data_type [COMMENT col_comment], ...)]
   [COMMENT table_comment]
   [ROW FORMAT row_format]
   [STORED AS file_format]

Example

Let us assume you need to create a table named employee using
the CREATE TABLE statement. The following table lists the fields
and their data types in the employee table:

Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   String

The following clauses specify a table comment, row-format fields
such as the field terminator and line terminator, and the storage file
type.

COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The following query creates a table named employee using the
above data.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
salary String, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

If you add the option IF NOT EXISTS, Hive ignores the statement
in case the table already exists.

On successful creation of table, you get to see the following


response:

OK
Time taken: 5.905 seconds
hive>

JDBC Program

The JDBC program to create a table is given example.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args)
         throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement; '\\t' and '\\n' keep the literal
      // \t and \n delimiters in the HiveQL text
      stmt.execute("CREATE TABLE IF NOT EXISTS "
         + " employee (eid int, name String, "
         + " salary String, designation String) "
         + " COMMENT 'Employee details' "
         + " ROW FORMAT DELIMITED "
         + " FIELDS TERMINATED BY '\\t' "
         + " LINES TERMINATED BY '\\n' "
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}

Save the program in a file named HiveCreateTable.java. The
following commands are used to compile and execute this
program.

$ javac HiveCreateTable.java
$ java HiveCreateTable

Output

Table employee created.

Load Data Statement

Generally, after creating a table in SQL, we can insert data using
the INSERT statement. But in Hive, we can insert data using the
LOAD DATA statement.

While inserting data into Hive, it is better to use LOAD DATA to
store bulk records. There are two ways to load data: one is from the
local file system and the other is from the Hadoop file system.

Syntax

The syntax for load data is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
 LOCAL is an identifier to specify the local path. It is optional.
 OVERWRITE is optional; it overwrites the data in the table.
 PARTITION is optional.

Example

We will insert the following data into the table. It is a text file
named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin

The following query loads the given text into the table.

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'


OVERWRITE INTO TABLE employee;

On successful execution, you get to see the following response:

OK
Time taken: 15.905 seconds
hive>

JDBC Program

Given below is the JDBC program to load given data into the
table.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveLoadData {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args)
         throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement (note the space before OVERWRITE)
      stmt.execute("LOAD DATA LOCAL INPATH '/home/user/sample.txt' "
         + "OVERWRITE INTO TABLE employee");
      System.out.println("Load Data into employee successful");
      con.close();
   }
}

Save the program in a file named HiveLoadData.java. Use the


following commands to compile and execute this program.

$ javac HiveLoadData.java
$ java HiveLoadData

Output:

Load Data into employee successful

Data types:
Column Types

Column types are used as the column data types of Hive. They are as
follows:

Integral Types

Integer type data can be specified using integral data types, INT.
When the data range exceeds the range of INT, you need to use
BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:

Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L

String Types

String type data can be specified using single quotes (' ') or
double quotes (" "). Hive provides two string data types, VARCHAR and
CHAR, and follows C-style escape characters.

The following table depicts the string data types:

Data Type   Length
VARCHAR     1 to 65535
CHAR        255

Timestamp

It supports the traditional UNIX timestamp with optional nanosecond
precision. It supports the java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and the format “yyyy-mm-dd hh:mm:ss.ffffffffff”.

Dates

DATE values are described in year/month/day format in the form
YYYY-MM-DD.

Decimals

The DECIMAL type in Hive is the same as the Big Decimal format of
Java. It is used for representing immutable arbitrary-precision
numbers. The syntax and an example are as follows:

DECIMAL(precision, scale)
decimal(10,0)

Union Types

Union is a collection of heterogeneous data types. You can create an
instance using create_union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals

The following literals are used in Hive:

Floating Point Types

Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.

Decimal Type

Decimal type data is nothing but a floating point value with a higher
range than the DOUBLE data type. The range of the decimal type is
approximately -10^-308 to 10^308.
Null Value

Missing values are represented by the special value NULL.

Complex Types

The Hive complex data types are as follows:

Arrays

Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>

Maps

Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>

Structs

Structs in Hive are records of named fields, similar to structs in C;
each field can carry an optional comment.

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
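An illustrative CREATE TABLE sketch combining these complex types; the table and column names here are hypothetical:

hive> CREATE TABLE employee_contacts (
    >   name STRING,
    >   phones ARRAY<STRING>,
    >   emails MAP<STRING, STRING>,
    >   address STRUCT<street:STRING, city:STRING, pin:INT>
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '|'
    > MAP KEYS TERMINATED BY '#';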

Operations in Hive:
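Typical operations are expressed as HiveQL statements. An illustrative sketch against the employee table defined earlier (treating salary as comparable text is a simplification):

hive> SELECT * FROM employee WHERE designation = 'Technical manager';
hive> SELECT name, salary FROM employee ORDER BY name LIMIT 3;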

Performing Inner and Outer join in Hive:
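An illustrative sketch of inner and outer joins in HiveQL, assuming hypothetical customers and orders tables that share a customer id:

hive> SELECT c.id, c.name, o.amount
    > FROM customers c JOIN orders o ON (c.id = o.customer_id);

hive> SELECT c.id, c.name, o.amount
    > FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);

hive> SELECT c.id, c.name, o.amount
    > FROM customers c FULL OUTER JOIN orders o ON (c.id = o.customer_id);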

Built in functions in Hive:

Hive supports the following built-in functions (return type, signature, description):

BIGINT  round(double a) − Returns the rounded BIGINT value of the double.
BIGINT  floor(double a) − Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT  ceil(double a) − Returns the minimum BIGINT value that is equal to or greater than the double.
double  rand(), rand(int seed) − Returns a random number that changes from row to row.
string  concat(string A, string B, ...) − Returns the string resulting from concatenating B after A.
string  substr(string A, int start) − Returns the substring of A starting from the start position till the end of A.
string  substr(string A, int start, int length) − Returns the substring of A starting from the start position with the given length.
string  upper(string A) − Returns the string resulting from converting all characters of A to upper case.
string  ucase(string A) − Same as above.
string  lower(string A) − Returns the string resulting from converting all characters of A to lower case.
string  lcase(string A) − Same as above.
string  trim(string A) − Returns the string resulting from trimming spaces from both ends of A.
string  ltrim(string A) − Returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
string  rtrim(string A) − Returns the string resulting from trimming spaces from the end (right-hand side) of A.
string  regexp_replace(string A, string B, string C) − Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
int     size(Map<K.V>) − Returns the number of elements in the map type.
int     size(Array<T>) − Returns the number of elements in the array type.
<type>  cast(<expr> as <type>) − Converts the result of the expression expr to <type>; e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
string  from_unixtime(int unixtime) − Converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string  to_date(string timestamp) − Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int     year(string date) − Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int     month(string date) − Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int     day(string date) − Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string  get_json_object(string json_string, string path) − Extracts a JSON object from a JSON string based on the specified JSON path, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.

Example

The following queries demonstrate some built-in functions:

round() function

hive> SELECT round(2.6) from temp;

On successful execution of the query, you get to see the following
response:
3.0

floor() function

hive> SELECT floor(2.6) from temp;

On successful execution of the query, you get to see the following
response:
2.0

ceil() function

hive> SELECT ceil(2.6) from temp;

On successful execution of the query, you get to see the following
response:

3.0
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage
of these functions is the same as that of the SQL aggregate functions.

Return Type   Signature                     Description
BIGINT        count(*), count(expr)         count(*) returns the total number of retrieved rows.
DOUBLE        sum(col), sum(DISTINCT col)   Returns the sum of the elements in the group, or the sum of the distinct values of the column in the group.
DOUBLE        avg(col), avg(DISTINCT col)   Returns the average of the elements in the group, or the average of the distinct values of the column in the group.
DOUBLE        min(col)                      Returns the minimum value of the column in the group.
DOUBLE        max(col)                      Returns the maximum value of the column in the group.
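An illustrative aggregate query over the employee table; treating salary as numeric here is an assumption, since the earlier CREATE TABLE stored it as a string:

hive> SELECT designation, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    > FROM employee
    > GROUP BY designation;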


Database operators in Hive:
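Hive provides relational operators (=, !=, <, <=, >, >=, IS NULL, LIKE), arithmetic operators (+, -, *, /, %), and logical operators (AND, OR, NOT). An illustrative sketch on the employee table (again assuming salary can be compared numerically):

hive> SELECT name, salary + 5000 AS revised_salary
    > FROM employee
    > WHERE designation LIKE '%Admin%' AND salary >= 30000;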

Hive vs RDBMS:

RDBMS                                         Hive
It is used to maintain a database.            It is used to maintain a data warehouse.
It uses SQL (Structured Query Language).      It uses HQL (Hive Query Language).
Schema is fixed.                              Schema varies (schema on read).
Only normalized data is stored.               Both normalized and de-normalized data are stored.
Tables in an RDBMS are sparse.                Tables in Hive are dense.
It doesn't support partitioning.              It supports automatic partitioning.
No partition method is used.                  Sharding is used for partitioning.

Example of Hive:
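A short end-to-end illustrative sketch that ties the earlier pieces together; it reuses the userdb database, the employee table, and the sample.txt file from this chapter and should be read as a sketch rather than a verified session:

hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> USE userdb;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, designation String)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
hive> SELECT designation, COUNT(*) FROM employee GROUP BY designation;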
