
UNIT-VI

Applying Structure to Hadoop Data with Hive
Outcomes

At the end of this unit, the student will be able to:

 Say hello to Hive
 See how Hive is put together
 Get started with Apache Hive
 Examine the Hive clients
 Work with Hive data types
 Create and manage databases and tables
 See how the Hive data manipulation language works
 Query and analyze data


Introduction to Hive
 Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy.
 The term ‘Big Data’ refers to collections of large datasets
characterized by huge volume, high velocity, and a wide variety of
data that grows day by day.
 Big Data is difficult to process using traditional data
management systems.
 Therefore, the Apache Software Foundation introduced a
framework called Hadoop to solve Big Data management
and processing challenges.
HADOOP ECOSYSTEM
Introduction to Hive
 Hive was created by Jeff Hammerbacher and his team while
working at Facebook.
 Facebook was receiving huge amounts of data every day and
needed a mechanism that could store, mine, and help analyze
that data.
 This need to mine and analyze huge amounts of data gave
birth to Hive.
 Hive enabled Facebook to handle tens of terabytes of data
per day with ease.
 Hive mainly performs three functions:
 Data summarization,
 Querying, and
 Analysis.

 Hive uses a language called HiveQL (HQL), which is similar to SQL.
 HiveQL acts as a translator, converting SQL-like queries into
MapReduce jobs that are executed on Hadoop, as illustrated in the
sketch below.
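A minimal sketch of such a query (the page_views table and its columns are hypothetical, not part of this unit's examples):

-- Hypothetical table; Hive compiles this aggregation into a MapReduce job.
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country;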
Main components of Hive are:
 Metastore – It stores the metadata for Hive. This metadata holds
information about each table, such as its location and schema. Because the
metastore keeps track of where the data lives, it is typically replicated
and backed up so that it can be recovered in case of loss.

 Driver – The driver receives HiveQL statements and acts as a controller.
It creates sessions and monitors the progress and life cycle of each
execution. Whenever a HiveQL statement is executed, the driver stores the
metadata generated by that action.

 Compiler – The compiler converts a HiveQL query into a plan of MapReduce
jobs. It works out the steps and functions needed to produce the output the
query asks for, in the form MapReduce requires.
Hive is not
 A relational database

 A design for OnLine Transaction Processing (OLTP)

 A language for real-time queries and row-level updates

Features of Hive
 It stores schemas in a database and processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-like query language called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Seeing How the Hive is Put Together
Components of HIVE Architecture
• The architecture of Hive describes the flow in which a query is submitted

to Hive and finally processed using the MapReduce framework. The major

components of Hive are:


• Interfaces

• Hive Clients

• Hive Services

• Processing Framework and Resource Management

• Distributed Storage
Hive Clients:
Hive supports different types of client applications for submitting

queries. These clients fall into three categories:

• Thrift Clients – Because the Apache Hive server is based on Thrift, it

can serve requests from all languages that support Thrift.

• JDBC Clients – Apache Hive allows Java applications to connect to it

using the JDBC driver, which is defined in the class

org.apache.hadoop.hive.jdbc.HiveDriver.

• ODBC Clients – The ODBC driver allows applications that support the

ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver

uses Thrift to communicate with the Hive server.


Hive Services
Apache Hive provides various services:
• CLI (Command Line Interface) – This is the default shell that
Hive provides, in which you can execute Hive queries and
commands directly.
• Web Interface – Hive also provides a web-based GUI for executing
Hive queries and commands.
• Hive Server – It is built on Apache Thrift and is therefore also
called the Thrift server. It allows different clients to submit
requests to Hive and retrieve the final result.
• Hive Driver – The driver receives the queries submitted by Hive
clients through the Thrift, JDBC, ODBC, CLI, or Web UI
interfaces.
– Compiler – The driver then passes the query to the compiler,
where parsing, type checking, and semantic analysis take place
with the help of the schema held in the metastore.
– Optimizer – It generates an optimized logical plan in the form
of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
– Executor – Once compilation and optimization are complete, the
execution engine runs these tasks in the order of their
dependencies using Hadoop.
• Processing Framework and Resource Management – Internally, Hive
uses the Hadoop MapReduce framework as its de facto engine to
execute queries.
• Distributed Storage – As Hive is installed on top of Hadoop, it
uses the underlying HDFS for distributed storage.
Working of Hive
Hive Working Flow
1. The Hive interface, such as the command line or Web UI, sends the
query to the driver (using any database driver such as JDBC, ODBC,
etc.) to execute.
2. The driver takes the help of the query compiler, which parses the
query to check the syntax and work out the query plan and the
requirements of the query.
3. The compiler sends a metadata request to the metastore (any
database).
4. The metastore sends the metadata to the compiler as a response.
5. The compiler checks the requirements and resends the plan to the
driver. Up to here, the parsing and compiling of the query is
complete.
6. The driver sends the execution plan to the execution engine.
7. During execution, the execution engine can perform metadata
operations with the metastore.
8. The execution engine receives the results from the data nodes.
9. The execution engine sends those result values to the driver.
10. The driver sends the results to the Hive interfaces.
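As a small illustration of this pipeline, prefixing a query with EXPLAIN asks the compiler to print the plan of stages it generates (the table name below is hypothetical):

-- Prints the DAG of stages (map/reduce tasks) the compiler produces for this query.
EXPLAIN
SELECT country, COUNT(*)
FROM page_views
GROUP BY country;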
Examining the Hive Clients
There are quite a number of client
options for Hive, as listed below.
• Hive command-line interface (CLI)
• Hive Web Interface (HWI) Server
• Open source SQuirreL client using
the JDBC driver
The Hive CLI client
Hive Web Interface
1. Configure the $HIVE_HOME/conf/hive-site.xml file as below
to ensure that Hive can find and load the HWI’s Java server
pages.
<property>
  <name>hive.hwi.war.file</name>
  <value>${HIVE_HOME}/lib/hive_hwi.war</value>
  <description>
    This is the WAR file with the JSP content for the Hive Web Interface
  </description>
</property>
2. The HWI Server requires Apache Ant libraries to run,
so download Ant from the Apache site at
http://ant.apache.org/bindownload.cgi .

3. Install Ant using the following commands:

– mkdir ant
– cp apache-ant-1.9.2-bin.tar.gz ant; cd ant
– gunzip apache-ant-1.9.2-bin.tar.gz
– tar xvf apache-ant-1.9.2-bin.tar

4. Set the $ANT_LIB environment variable and start the


HWI Server by using the following commands:
– $ export ANT_LIB=/home/user/ant/apache-ant-1.9.2/lib

– $ bin/hive --service hwi


SQuirreL as Hive client with
the JDBC Driver
Hive - Data Types
All the data types in Hive are classified into four types, given as
follows:
• Column Types
– Integral Types
– String Types
– Timestamp
– Dates
– Decimals
– Union Types
• Literals
i. Floating Point Types
ii. Decimal Types
• Null Values
• Complex Types
– Arrays
– Maps
– Structs
Column types are used as the column data types of Hive tables. They
are as follows:

Integral Types

Integer data can be specified using the integral data types, the default
being INT. When the data range exceeds the range of INT, you need to use
BIGINT, and if the data range is smaller than that of INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.

The integral types, from smallest to largest, are:
– TINYINT (1-byte signed integer, literal postfix Y)
– SMALLINT (2-byte signed integer, literal postfix S)
– INT (4-byte signed integer)
– BIGINT (8-byte signed integer, literal postfix L)


String Types
• String data can be specified using single quotes (' ') or double
quotes (" "). Hive provides two bounded string types, VARCHAR and
CHAR, and follows C-style escape characters.
• VARCHAR columns take a length between 1 and 65535, while CHAR
columns have a fixed length of at most 255 characters.
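A minimal sketch of how these column types appear in a table definition (the employees table and its columns are hypothetical):

-- Hypothetical table mixing integral and string column types.
CREATE TABLE employees (
  emp_id BIGINT,
  age    TINYINT,
  name   VARCHAR(50),
  dept   CHAR(10)
);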
Timestamp

TIMESTAMP supports the traditional UNIX timestamp with optional nanosecond
precision. It uses the java.sql.Timestamp format
"yyyy-mm-dd hh:mm:ss.fffffffff".

Dates

DATE values describe a year/month/day in the form YYYY-MM-DD.
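A minimal sketch of a table using these types (the table and column names are hypothetical):

-- Hypothetical table with TIMESTAMP and DATE columns.
CREATE TABLE events (
  event_time TIMESTAMP,
  event_date DATE
);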

Decimals

The DECIMAL type in Hive is the same as Java's BigDecimal format. It is
used for representing immutable arbitrary-precision decimal numbers.
The syntax and an example are as follows:

DECIMAL(precision, scale)

decimal(10,0)
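A minimal sketch of a DECIMAL column (the table and column names are hypothetical):

-- Hypothetical table; DECIMAL(10,2) holds up to 10 digits, 2 of them after the decimal point.
CREATE TABLE payments (
  amount DECIMAL(10,2)
);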
Union Types
Union is a collection of heterogeneous data types. You can create an
instance using the create_union function. A sketch of the syntax and
an example are shown below.
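A minimal sketch, assuming a hypothetical table named union_demo; in create_union(tag, value1, value2, ...) the tag picks which member of the union is set:

-- Hypothetical column holding either an INT, a DOUBLE, or a STRING.
CREATE TABLE union_demo (
  u UNIONTYPE<INT, DOUBLE, STRING>
);

-- Tag 0 selects the first member (the INT); one union value is returned per row of union_demo.
SELECT create_union(0, 1, 2.0, 'three') FROM union_demo;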
Literals

The following literals are used in Hive:


• Floating Point Types
Floating-point literals are numbers with decimal points. Generally, this
type of data uses the DOUBLE data type.
• Decimal Type

Decimal literals are floating-point values with a higher range than the
DOUBLE data type. The range of the decimal type is approximately
10^-308 to 10^308.
Null Value

Missing values are represented by the special value NULL.


Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
A struct in Hive groups related named fields together, like a record;
each field can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
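A minimal sketch combining the three complex types in one table (the contacts table and its columns are hypothetical):

-- Hypothetical table using ARRAY, MAP, and STRUCT columns.
CREATE TABLE contacts (
  name    STRING,
  phones  ARRAY<STRING>,
  props   MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);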
Hive Installation
Step 1: Verifying JAVA Installation
• Java must be installed on your system before installing
Hive. Let us verify java installation using the following
command:
$ java -version
Step 2: Verifying Hadoop Installation
• Hadoop must be installed on your system before installing
Hive. Let us verify the Hadoop installation using the
following command:
$ hadoop version
If Hadoop is already installed on your system, then
you will get the following response:
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af
Step 3: Downloading Hive

Download the Hive archive to your system from the link below:

http://apache.petsads.us/hive/hive-0.14.0/

• The following command is used to verify the download:

– $ cd Downloads

– $ ls

• On successful download, you get to see the following


response:

apache-hive-0.14.0-bin.tar.gz

Step 4: Installing Hive

The following steps are required for installing Hive on your system.
Let us assume the Hive archive has been downloaded to the Downloads
directory.
Extracting and verifying the Hive archive
• The following commands extract the Hive archive and verify the result:
• $ tar zxvf apache-hive-0.14.0-bin.tar.gz
  $ ls
• On successful extraction, you get to see the following response:
• apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory


$ su - *****
passwd: *****
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
Setting up environment for Hive

• You can set up the Hive environment by appending the

following lines to ~/.bashrc file:

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$HIVE_HOME/bin

export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.

export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

The following command is used to execute the ~/.bashrc file:

$ source ~/.bashrc
Step 5: Configuring Hive

• To configure Hive with Hadoop, you need to edit the hive-env.sh file,

which is placed in the $HIVE_HOME/conf directory. The following commands

change to the Hive config folder and copy the template file:

  $ cd $HIVE_HOME/conf
  $ cp hive-env.sh.template hive-env.sh

• Edit the hive-env.sh file by appending the following line:

  export HADOOP_HOME=/usr/local/hadoop

• Hive installation is now complete. Next, you require an external

database server to configure the Metastore; the following step uses

Apache Derby.


Step 6: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache


Derby:
• Downloading Apache Derby
• The following commands are used to download Apache Derby. The download
takes some time.

  $ cd ~
  $ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

• The following command is used to verify the download:

  $ ls

• On successful download, you get to see the following response:

  db-derby-10.4.2.0-bin.tar.gz
Setting up environment for Derby
• You can set up the Derby environment by appending the following lines
to the ~/.bashrc file:

  export DERBY_HOME=/usr/local/derby
  export PATH=$PATH:$DERBY_HOME/bin
  export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

• The following command is used to execute the ~/.bashrc file:

  $ source ~/.bashrc

Create a directory to store the Metastore
• Create a directory named data in the $DERBY_HOME directory to store
Metastore data:

  $ mkdir $DERBY_HOME/data

• Derby installation and environment setup is now complete.
Step 7: Verifying Hive Installation
• Before running Hive, you need to create the /tmp folder and a separate
Hive warehouse folder in HDFS. Here, we use the /user/hive/warehouse
folder. You need to set write permission (chmod g+w) on these newly
created folders in HDFS before verifying Hive, using the following
commands:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
• The following commands are used to verify Hive installation:
$ cd $HIVE_HOME
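Once the Hive shell starts (typically by running bin/hive from this directory), a simple sanity check is to list the databases; at minimum the built-in default database should appear:

-- Run inside the Hive shell.
SHOW DATABASES;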
