
Hive Shell & Services

Hive Introduction
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and
developed it further as open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
o A relational database
o A design for OnLine Transaction Processing (OLTP)
o A language for real-time queries and row-level updates
Features of Hive
o It stores the schema in a database and the processed data in HDFS.
o It is designed for OLAP.
o It provides an SQL-like query language called HiveQL or HQL (illustrated below).
o It is familiar, fast, scalable, and extensible.
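
As a quick illustration of HiveQL's SQL-like flavor, a short session might look like the following sketch (the table and column names here are hypothetical, not from this document):

hive> CREATE TABLE logs (ts STRING, level STRING, msg STRING);
hive> SELECT level, count(*) FROM logs GROUP BY level;   -- compiled by Hive into a MapReduce job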

Architecture of Hive
Hive's architecture is made up of the following units, each described below:

User Interface: Hive is data warehouse infrastructure software that enables interaction between
the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive
command line, and Hive HDInsight (on Windows Server).

Meta Store: Hive chooses a respective database server to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and is used for querying the schema
information in the Metastore. It is one of the replacements for the traditional approach of
writing MapReduce programs: instead of writing a MapReduce program in Java, we can write a
query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates the same results as
MapReduce, using the MapReduce flavor.

HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique
used to store the data in the file system.

Hive Services
Hive provides the following services:

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands (a usage sketch follows this list).
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI.
It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - A central repository that stores all the structural
information of the various tables and partitions in the warehouse. It also
includes metadata for each column and its type, the serializers and
deserializers used to read and write data, and the corresponding
HDFS files where the data is stored.
o Hive Server - Also referred to as the Apache Thrift Server, it accepts requests
from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources such as the Web UI, CLI,
Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
o Hive Compiler - The compiler parses the query, performs semantic
analysis on the different query blocks and expressions, and
converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of
a DAG of MapReduce tasks and HDFS tasks. In the end, the execution engine
executes the incoming tasks in the order of their dependencies.
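
As an illustrative sketch of how clients reach these services, the Hive CLI talks to Hive directly, while Beeline connects to the Hive Server (HiveServer2) over JDBC. The host, port, and database below are assumed defaults, not values from this document:

$ hive                                               # start the Hive CLI shell
hive> SHOW DATABASES;
$ beeline -u jdbc:hive2://localhost:10000/default    # connect to HiveServer2 via JDBC
0: jdbc:hive2://localhost:10000/default> SHOW TABLES;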

Difference between RDBMS and Hive:

RDBMS: It is used to maintain a database.
Hive: It is used to maintain a data warehouse.

RDBMS: It uses SQL (Structured Query Language).
Hive: It uses HQL (Hive Query Language).

RDBMS: The schema is fixed.
Hive: The schema varies.

RDBMS: Normalized data is stored.
Hive: Both normalized and de-normalized data are stored.

RDBMS: Tables are sparse.
Hive: Tables are dense.

RDBMS: It does not support partitioning.
Hive: It supports automatic partitioning (see the sketch after this table).

RDBMS: No partitioning method is used.
Hive: Sharding is used to partition the data.
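
To make the partitioning row concrete, here is a minimal HiveQL sketch of automatic (dynamic) partitioning; the table and column names are hypothetical:

hive> CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (sale_date STRING);
hive> SET hive.exec.dynamic.partition.mode=nonstrict;   -- allow fully dynamic partitions
hive> INSERT INTO sales PARTITION (sale_date) SELECT id, amount, sale_date FROM staging_sales;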


Pig vs. Hive

Hive: Mainly used by data analysts.
Pig: Generally used by researchers and programmers.

Hive: Used against completely structured data.
Pig: Used against semi-structured data.

Hive: Has a declarative, SQL-like language termed HiveQL.
Pig: Has a procedural, data-flow language termed Pig Latin.

Hive: Basically used for the generation of reports.
Pig: Basically used for programming.

Hive: Operates on the server side of an HDFS cluster.
Pig: Operates on the client side of an HDFS cluster.

Hive: Very helpful in the areas of ETL.
Pig: A wonderful ETL tool for Big Data, thanks to its powerful transformation and processing capabilities.

Hive: Can start an optional Thrift-based server, which can be used to send queries from anywhere directly to the Hive server for execution.
Pig: Provides no such feature.

Hive: Leverages SQL DDL, with table definitions declared upfront and the schema details stored in a local database.
Pig: Has no provision for a dedicated metadata database, so schemas and data types are defined in the scripts themselves.

Hive: Has no provision to support Avro.
Pig: Provides support for Avro.

Hive: Needs no installation, as it is completely shell-based for interaction.
Pig: On the other hand, can be installed very easily.

Hive: Provides partitions on the data to process subsets based on dates or in chronological order.
Pig: Does not directly provide anything like partitions, but the same effect can be achieved using filters.

Hive: Has no provision for illustration.
Pig: Renders sample data for each of its scenarios through the ILLUSTRATE function.

Hive: Provides access to raw data.
Pig: Raw data access is not possible with Pig Latin scripts as fast as with HiveQL.

Hive: A user can join, order, and even sort data dynamically (in an aggregated manner, though).
Pig: Allows OUTER JOINs to be performed using the COGROUP feature.
Introduction to Hive metastore

Hive metastore (HMS) is a service that stores metadata related to Apache Hive and other
services, in a backend RDBMS, such as MySQL or PostgreSQL. Impala, Spark, Hive, and
other services share the metastore. The connections to and from HMS include HiveServer,
Ranger, and the NameNode that represents HDFS.

Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to
HiveServer. The HiveServer instance reads/writes data to HMS. By default, redundant HMS
instances operate in active/active mode. The physical metadata resides in a backend RDBMS
dedicated to HMS. You must configure all HMS instances to use the same backend database. A
separate RDBMS supports the security service, for example Ranger. All connections are routed
to a single RDBMS service at any given time. HMS talks to the NameNode over Thrift and
functions as a client to HDFS.

HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer. One or
more HMS instances on the backend can talk to other services, such as Ranger.
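
As a hedged illustration of pointing every HMS instance at the same backend database, a hive-site.xml fragment might look like this (the MySQL host, database name, and user are hypothetical placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db.example.com:3306/hive_metastore</value>  <!-- hypothetical host/db -->
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>  <!-- hypothetical user -->
</property>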
Querying Data

HiveQL - JOIN

The HiveQL JOIN clause is used to combine the data of two or more tables based on a related
column between them. The various types of HiveQL joins are:

o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join

Here, we are going to execute the join clauses on the records of two tables, employee and
employee_department, which are created below.

Inner Join in HiveQL

The HiveQL inner join is used to return the rows of multiple tables where the join condition is
satisfied. In other words, the join criteria find the matching records in every table being joined.

Example of Inner Join in Hive

In this example, we take two tables, employee and employee_department. The primary key
(empid) of the employee table is referenced by the foreign key (depid) of the
employee_department table. Let's perform the inner join operation by using the following steps:

hive> use hiveql;

Now, create a table by using the following command:

hive> create table employee(empid int, empname string, state string);

Now, create another table by using the following command:

hive> create table employee_department(depid int, department_name string);
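
To make the join results observable, the tables can be populated with a few rows; the values below are purely illustrative (Hive supports INSERT ... VALUES from version 0.14 onward):

hive> insert into employee values (1, 'Alice', 'Pune'), (2, 'Bob', 'Mumbai');
hive> insert into employee_department values (1, 'Sales'), (3, 'HR');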

Now, perform the inner join operation by using the following command:

hive> select e1.empname, e2.department_name from employee e1 join employee_department e2 on e1.empid = e2.depid;
Left Outer Join in HiveQL

The HiveQL left outer join returns all the records from the left (first) table and only those records
from the right (second) table where the join criteria find a match.

Example of Left Outer Join in Hive

In this example, we perform the left outer join operation.

o Let us execute the left outer join operation by using the following command:

hive> select e1.empname, e2.department_name from employee e1 left outer join employee_department e2 on e1.empid = e2.depid;

Right Outer Join in HiveQL

The HiveQL right outer join returns all the records from the right (second) table and only those
records from the left (first) table where the join criteria find a match.

Example of Right Outer Join in Hive

In this example, we perform the right outer join operation.

o Let us execute the right outer join operation by using the following command:

hive> select e1.empname, e2.department_name from employee e1 right outer join employee_department e2 on e1.empid = e2.depid;

Full Outer Join

The HiveQL full outer join returns all the records from both tables. It assigns NULL for the
missing records in either table.
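
For completeness, a full outer join on the same two tables can be sketched by mirroring the earlier queries:

hive> select e1.empname, e2.department_name from employee e1 full outer join employee_department e2 on e1.empid = e2.depid;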
UDFs (User Defined Functions)
In Hive, users can define their own functions to meet certain client
requirements. These are known as UDFs in Hive. User Defined
Functions are written in Java for specific modules.

Some UDFs are specifically designed for the reusability of code in
application frameworks. The developer writes these functions in
Java and integrates the UDFs with Hive.

During query execution, the developer can directly use the code,
and the UDFs will return outputs according to the user-defined tasks. This
provides high performance in terms of coding and execution.

For example, there is no predefined function in Hive for string
stemming. For this, we can write a stem UDF in Java. Wherever
we require stemming functionality, we can directly call this stem UDF in
Hive.

Here, stemming means deriving a word from its root word: a
stemming algorithm reduces the words "wishing", "wished", and
"wishes" to the root word "wish". To perform this type of
functionality, we can write a UDF in Java and integrate it with Hive.
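
Once such a UDF has been compiled into a jar, it can be registered and called from HiveQL roughly as follows; the jar path, class name, and function name are hypothetical, not from this document:

hive> ADD JAR /tmp/stem-udf.jar;                                      -- hypothetical jar location
hive> CREATE TEMPORARY FUNCTION stem AS 'com.example.hive.udf.Stem';  -- hypothetical UDF class
hive> SELECT stem(word) FROM words;                                   -- the UDF runs once per row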

Depending on the use case, UDFs can be written to accept
and produce different numbers of input and output values.

The general type of UDF accepts a single input value and produces
a single output value. If such a UDF is used in a query, it is
called once for each row in the result data set.

Alternatively, a UDF can accept a group of values as input and return a
single output value.
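
Both calling patterns can be seen with Hive's built-in functions, which follow the same contract (the table and column names are illustrative):

hive> SELECT upper(empname) FROM employee;   -- ordinary UDF: called once per row
hive> SELECT count(empid) FROM employee;     -- aggregate function (UDAF): one output for a group of rows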
