
Hive

• Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, ad-hoc querying, management and analysis of large data sets using a SQL-like language called HiveQL.
• Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is a query engine, not a database.
• It does not have its own storage for data (it uses HDFS to store data).
• Why? It is used to efficiently store and process large datasets (petabytes of data).
Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
• It internally translates queries into MapReduce jobs.
• It provides rich collection data types: Struct, Map and Array.
Hive Characteristics
• Has the capability to translate queries into MapReduce jobs. This makes Hive scalable.
• Supports web interfaces as well.
• Provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.
• Reads data from HDFS, processes it with the help of MapReduce, and the output can be stored back to HDFS.
Limitations of Hive

 Not a full database. The main disadvantage is that Hive does not provide record-level update, alter and delete operations.
 Not developed for unstructured data.
 Not designed for real-time queries.
Comparison with RDBMS

Characteristic         Hive                    RDBMS
Record-level queries   No update and delete    Insert, update and delete
Transaction support    No                      Yes
Latency                Minutes or more         Fractions of a second
Data size              Petabytes               Terabytes
Data per query         Petabytes               Gigabytes
Query language         HiveQL                  SQL
JDBC/ODBC support      Limited                 Full
Hive Data Types and File Formats

Data Type    Description
TINYINT      1-byte signed integer. Postfix letter is Y.
SMALLINT     2-byte signed integer. Postfix letter is S.
INT          4-byte signed integer.
BIGINT       8-byte signed integer. Postfix letter is L.
FLOAT        4-byte single-precision floating-point number.
DOUBLE       8-byte double-precision floating-point number.
BOOLEAN      True or False.
TIMESTAMP    UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff".
DATE         YYYY-MM-DD format.
VARCHAR      1 to 65535 characters. Use single quotes ('') or double quotes ("").
CHAR         Fixed length, up to 255 bytes.
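As a quick illustration of the numeric postfix letters, a minimal sketch (the literal values are arbitrary; recent Hive versions allow SELECT without a FROM clause):

    SELECT 100Y,                        -- TINYINT literal
           100S,                        -- SMALLINT literal
           100L,                        -- BIGINT literal
           CAST('2024-01-15' AS DATE);  -- DATE from a string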
Data types
• Hive provides three collection data types: ARRAY (an ordered list of elements of the same type), MAP (a set of key-value pairs) and STRUCT (a record of named fields).
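A minimal sketch of a table using all three collection types (the table and column names are hypothetical):

    CREATE TABLE employee_details (
      name    STRING,
      skills  ARRAY<STRING>,                 -- e.g. ['java', 'hive']
      phones  MAP<STRING, STRING>,           -- e.g. {'home': '...', 'work': '...'}
      address STRUCT<city:STRING, pin:INT>   -- named fields, accessed as address.city
    );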

File Format      Description
Text file        The default file format; a line represents a record. Delimiting characters separate the fields. Text file examples are CSV, TSV, JSON and XML.
Sequence file    Flat file which stores binary key-value pairs and supports compression. (A SequenceFile might contain a massive number of log files for a server, where the key would be a timestamp and the value would be the entire log file.)
RCFile           Record Columnar File.
ORCFILE          ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats.
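The file format is chosen at table-creation time with STORED AS; a minimal sketch (table names are hypothetical):

    CREATE TABLE logs_text (msg STRING) STORED AS TEXTFILE;     -- the default
    CREATE TABLE logs_seq  (msg STRING) STORED AS SEQUENCEFILE; -- binary key-value pairs
    CREATE TABLE logs_orc  (msg STRING) STORED AS ORC;          -- optimized columnar storage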
RCFile format
ORCFILE format
HIVE ARCHITECTURE
•Components of Hive architecture are:
 Hive Server (Thrift) - An optional service that allows a remote client to submit requests to Hive and retrieve results. Clients can be written in a variety of programming languages. The Thrift Server exposes a very simple client API to execute HiveQL statements.
 Hive CLI (Command Line Interface) - Popular interface to interact with Hive. When run in local mode, Hive uses local storage rather than HDFS on the Hadoop cluster.
 Web Interface - Hive can be accessed using a web browser as well. This requires an HWI Server running on a designated node. The URL http://hadoop:<port no.>/hwi can be used to access Hive through the web.
 Metastore - The system catalog. All other components of Hive interact with the Metastore. It stores the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
 Hive Driver - Manages the life cycle of a HiveQL statement during compilation, optimization and execution.
Hive Integration and Workflow Steps

1. Execute Query
2. Get Plan
3. Get Metadata
4. Send Metadata
5. Send Plan
6. Execute Plan
7. Execute Job
8. Metadata Operations
9. Fetch Result
10. Send Results
Working
• UI: sends read/write requests in three ways: (i) Hive CLI, (ii) Web interface, (iii) Thrift server (JDBC/ODBC).
• Driver: receives the query from the user interface and handles the API requests coming from the JDBC/ODBC interfaces.
• Compiler: converts HiveQL queries into MapReduce programs and also performs semantic analysis of the program.
• Metastore: stores the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
• Execution engine: connected with the Hadoop engine. Interacts with the Resource Manager and NameNode to fetch the result from HDFS, and finally sends the final result to the user.
The workflow steps are as follows:
• Execute Query: The Hive interface (CLI or Web Interface) sends the query to the Driver to execute it.
• Get Plan: The Driver sends the query to the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
• Get Metadata: The compiler sends a metadata request to the Metastore (of any database, such as MySQL).
• Send Metadata: The Metastore sends the metadata as a response to the compiler.
• Send Plan: The compiler checks the requirement and resends the plan to the Driver. The parsing and compiling of the query is complete at this point.
The workflow steps are as follows (contd.):
• Execute Plan: The Driver sends the execution plan to the execution engine.
• Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is on the Name node, and it assigns this job to the TaskTracker, which is on the Data node. There, the query executes the job.
• Metadata Operations: Meanwhile, the execution engine can execute metadata operations with the Metastore.
• Fetch Result: The execution engine receives the results from the Data nodes.
• Send Results: The execution engine sends the results to the Driver.
• Send Results: The Driver sends the results to the Hive interfaces.
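The plan the compiler produces for a query can be inspected with EXPLAIN; a minimal sketch (the table and column names are hypothetical):

    EXPLAIN
    SELECT course, COUNT(*) AS students
    FROM student
    GROUP BY course;   -- prints the stage graph of the generated MapReduce job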
HIVE Data Model

Name        Description
Database    Namespace for tables.
Tables      Similar to tables in an RDBMS. Support filter, projection, join and union operations. The table data is stored in a directory in HDFS.
Partitions  A table can have one or more partition keys that determine how the data is stored.
Buckets     Data in each partition is further divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
HIVEQL
• Hive Query Language (abbreviated HiveQL) is for querying the large datasets which reside in the HDFS environment.
• HiveQL script commands enable data definition, data manipulation and query processing.
• HiveQL supports a large base of users who are acquainted with SQL to extract information from data warehouses.

HiveQL Process Engine   HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine        The bridge between the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
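For example, a job that would otherwise be a hand-written MapReduce program in Java can be expressed as a single query; a minimal sketch (the words table and its column are hypothetical):

    SELECT word, COUNT(*) AS freq
    FROM words
    GROUP BY word
    ORDER BY freq DESC;   -- Hive compiles this into MapReduce stages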
HiveQL
• HiveQL script commands enable:
  1. data definition,
  2. data manipulation, and
  3. query processing (data definitions and manipulations create tables and files).
HiveQL Data Definition Language (DDL)
• HiveQL database commands for data definition for databases and tables are: CREATE DATABASE, SHOW DATABASES (lists all DBs), CREATE SCHEMA, CREATE TABLE.
HiveQL Data Manipulation Language (DML)
• USE database_name;
• DROP DATABASE;
• DROP SCHEMA;
• ALTER TABLE;
• DROP TABLE;
• and LOAD DATA (inserting the data).
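A minimal sketch of switching to and dropping a database (the database name is hypothetical; CASCADE first drops any tables the database still contains):

    USE college;
    DROP DATABASE college;            -- fails if the database is not empty
    DROP DATABASE college CASCADE;    -- drops the tables first, then the database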
COMMANDS
• CREATE DATABASE database_name;

• Example: CREATE DATABASE toy;

• SHOW DATABASES;

• DROP DATABASE DBName;

• DROP SCHEMA DBName; (equivalent to DROP DATABASE)

Commands

 In Hive there are 2 types of tables:

 Internal (managed) and

 External tables.

 An internal table is the default. When it is dropped, your data is not safe: Hive deletes both the table's metadata and the data itself.

 In an external table the data is safe: if someone drops the table, only its metadata is removed from Hive, while the data files remain in HDFS.
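A minimal sketch of the difference (the table names and HDFS location are hypothetical):

    CREATE TABLE managed_demo (id INT);              -- managed (internal) table
    CREATE EXTERNAL TABLE external_demo (id INT)
      LOCATION '/data/external_demo';                -- external table

    DROP TABLE managed_demo;     -- the data under the warehouse directory is deleted too
    DROP TABLE external_demo;    -- only metadata is removed; /data/external_demo remains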
Hive - Create Table
• Command:
• create table emp (Id int, Name string, Salary float)
  row format delimited
  fields terminated by ',';

 Describe emp; (displays all columns of the table)

 Describe formatted emp; (to check the type of a table: managed/external)
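A minimal usage sketch once emp exists (the file path and values are hypothetical):

    LOAD DATA LOCAL INPATH '/tmp/emp.csv' INTO TABLE emp;
    SELECT Name, Salary FROM emp WHERE Salary > 50000;   -- runs as a MapReduce job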
Create table

• External table creation.

• Command:
• create external table emplist (Id int, Name string, Salary float)
  row format delimited
  fields terminated by ','
  location '/HiveDirectory';
OR
• create external table emplist (Id int, Name string, Salary float)
  row format delimited
  fields terminated by ','
  stored as textfile;

 Describe emplist; (displays all columns of the table)

 Describe formatted emplist; (to check the type of a table: managed/external)
Alter table commands
• Alter table old_table_name rename to new_table_name;
• Alter table table_name add columns(column_name datatype);
• Alter table table_name change old_column_name new_column_name datatype;

• Alter table emp RENAME TO Employee;

• Alter table Employee add columns(Lname string);

• Describe Employee;

• Alter table Employee change name fname string;
Hive Partitioning vs Bucketing
• Hive Partitioning is a way to organize large tables into smaller logical tables based on the values of columns: one logical table (partition) for each distinct value. In Hive, tables are created as directories on HDFS. A table can have one or more partitions, each corresponding to a sub-directory inside the table directory.

• Hive Bucketing (clustering) is a technique to split the data into more manageable files, by specifying the number of buckets to create. The value of the bucketing column is hashed into a user-defined number of buckets.

• Bucketing can be created on just one column. You can also create bucketing on a partitioned table to further split the data, which further improves the query performance of the partitioned table.

• Each bucket is stored as a file within the table's directory or the partition directories.
Commands
• CREATE TABLE student (Id int, Name string, age int)
  PARTITIONED BY (course string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';
(The partition column course is not repeated in the regular column list; Hive adds it to the table schema automatically.)
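A minimal sketch of loading data into one partition and listing the partitions (the file path and partition value are hypothetical):

    LOAD DATA LOCAL INPATH '/tmp/btech_students.csv'
    INTO TABLE student PARTITION (course = 'btech');

    SHOW PARTITIONS student;   -- lists course=btech, etc.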
Commands
• Create database college;
• Use college;

• Create table stu_bucket(Id int, Name string, age int, course string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';

• Load data local inpath 'path' into table stu_bucket;

• Set hive.enforce.bucketing = true;
Commands
• CREATE TABLE student (Id int, Name string, age int, course string)
  CLUSTERED BY (course) INTO 3 BUCKETS
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';
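Bucketed tables are typically populated with INSERT ... SELECT from a plain staging table, so that Hive can hash the rows into buckets; a minimal sketch, assuming the bucketed table just created and the stu_bucket staging table from the previous slide:

    INSERT OVERWRITE TABLE student
    SELECT Id, Name, age, course FROM stu_bucket;

    -- individual buckets can then be sampled directly
    SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 3 ON course);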


Differences Between Hive Partitioning vs Bucketing

PARTITIONING                                           BUCKETING
A directory is created on HDFS for each partition.     A file is created on HDFS for each bucket.
You can have one or more partition columns.            You can have only one bucketing column.
You can't manage the number of partitions created.     You can manage the number of buckets by specifying the count.
NA                                                     Bucketing can be created on a partitioned table.
Uses PARTITIONED BY.                                   Uses CLUSTERED BY.
