
Hive

• Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, ad-hoc querying, management and analysis of large data sets using a SQL-like language called HiveQL.
• Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is a query engine, not a database.
• It does not have its own storage for data (it uses HDFS to store data).
• Why? It is used to efficiently store and process large datasets (petabytes of data).
Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
• It internally translates queries into MapReduce jobs.
• It provides rich collection data types: Struct, Map and Array.
Hive Characteristics
• Has the capability to translate queries into MapReduce jobs. This makes Hive scalable.
• Supports web interfaces as well.
• Provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.
• Reads data from HDFS, processes it with the help of MapReduce, and the output can be stored back to HDFS.
Limitations of Hive

 Not a full database. The main disadvantage is that Hive does not provide record-level update, alter and delete operations.
 Not developed for unstructured data.
 Not designed for real-time queries.
Comparison with RDBMS

Characteristic         Hive                    RDBMS
Record-level queries   No update and delete    Insert, update and delete
Transaction support    No                      Yes
Latency                Minutes or more         Fractions of a second
Data size              Petabytes               Terabytes
Data per query         Petabytes               Gigabytes
Query language         HiveQL                  SQL
JDBC/ODBC support      Limited                 Full
Hive Data Types and File Formats

Data Type    Description
TINYINT      1-byte signed integer. Postfix letter is Y.
SMALLINT     2-byte signed integer. Postfix letter is S.
INT          4-byte signed integer.
BIGINT       8-byte signed integer. Postfix letter is L.
FLOAT        4-byte single-precision floating-point number.
DOUBLE       8-byte double-precision floating-point number.
BOOLEAN      True or False.
TIMESTAMP    UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff".
DATE         YYYY-MM-DD format.
VARCHAR      1 to 65535 characters. Use single quotes ('') or double quotes ("").
CHAR         Fixed length, up to 255 bytes.
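As a quick illustration of the numeric postfix letters, a minimal sketch (the literal values are arbitrary; recent Hive versions allow SELECT without a FROM clause):

    SELECT 100Y,                        -- TINYINT literal
           100S,                        -- SMALLINT literal
           100L,                        -- BIGINT literal
           CAST('2024-01-15' AS DATE);  -- DATE from a string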
Data types
• Hive provides three collection data types: ARRAY (an ordered list of elements of the same type), MAP (a set of key-value pairs) and STRUCT (a record of named fields).
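A minimal sketch of a table using all three collection types (the table and column names are hypothetical):

    CREATE TABLE employee_details (
      name    STRING,
      skills  ARRAY<STRING>,                 -- e.g. ['java', 'hive']
      phones  MAP<STRING, STRING>,           -- e.g. {'home': '...', 'work': '...'}
      address STRUCT<city:STRING, pin:INT>   -- named fields, accessed as address.city
    );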

File Format      Description
Text file        The default file format; a line represents a record. Delimiting characters separate the fields. Text file examples are CSV, TSV, JSON and XML.
Sequence file    Flat file which stores binary key-value pairs and supports compression. (A SequenceFile might contain a massive number of log files for a server, where the key would be a timestamp and the value would be the entire log file.)
RCFile           Record Columnar File.
ORCFILE          ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats.
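The file format is chosen at table-creation time with STORED AS; a minimal sketch (table names are hypothetical):

    CREATE TABLE logs_text (msg STRING) STORED AS TEXTFILE;     -- the default
    CREATE TABLE logs_seq  (msg STRING) STORED AS SEQUENCEFILE; -- binary key-value pairs
    CREATE TABLE logs_orc  (msg STRING) STORED AS ORC;          -- optimized columnar storage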
RCFile format
ORCFILE format
HIVE ARCHITECTURE
•Components of Hive architecture are:
 Hive Server (Thrift) - An optional service that allows a remote client to submit requests to Hive and retrieve results. Clients can be written in a variety of programming languages. The Thrift Server exposes a very simple client API to execute HiveQL statements.
 Hive CLI (Command Line Interface) - Popular interface to interact with Hive. When run in local mode, Hive uses local storage rather than HDFS on the Hadoop cluster.
 Web Interface - Hive can be accessed using a web browser as well. This requires an HWI Server running on a designated node. The URL http://hadoop:<port no.>/hwi can be used to access Hive through the web.
 Metastore - The system catalog. All other components of Hive interact with the Metastore. It stores the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
 Hive Driver - Manages the life cycle of a HiveQL statement during compilation, optimization and execution.
Hive Integration and Workflow Steps

1. Execute Query
2. Get Plan
3. Get Metadata
4. Send Metadata
5. Send Plan
6. Execute Plan
7. Execute Job
8. Metadata Operations
9. Fetch Result
10. Send Results
Working
• UI: sends read/write requests in three ways: (i) Hive CLI, (ii) Web interface, (iii) Thrift server (JDBC/ODBC).
• Driver: receives the query from the user interface and handles the API requests coming from the JDBC/ODBC interfaces.
• Compiler: converts HiveQL queries into MapReduce programs and also performs semantic analysis of the program.
• Metastore: stores the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
• Execution engine: connected with the Hadoop engine. Interacts with the Resource Manager and NameNode to fetch the result from HDFS, and finally sends the final result to the user.
The workflow steps are as follows:
• Execute Query: The Hive interface (CLI or Web Interface) sends the query to the Driver to execute it.
• Get Plan: The Driver sends the query to the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
• Get Metadata: The compiler sends a metadata request to the Metastore (of any database, such as MySQL).
• Send Metadata: The Metastore sends the metadata as a response to the compiler.
• Send Plan: The compiler checks the requirement and resends the plan to the Driver. The parsing and compiling of the query is complete at this point.
The workflow steps are as follows (contd.):
• Execute Plan: The Driver sends the execution plan to the execution engine.
• Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is on the Name node, and it assigns this job to the TaskTracker, which is on the Data node. There, the query executes the job.
• Metadata Operations: Meanwhile, the execution engine can execute metadata operations with the Metastore.
• Fetch Result: The execution engine receives the results from the Data nodes.
• Send Results: The execution engine sends the results to the Driver.
• Send Results: The Driver sends the results to the Hive interfaces.
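The plan the compiler produces for a query can be inspected with EXPLAIN; a minimal sketch (the table and column names are hypothetical):

    EXPLAIN
    SELECT course, COUNT(*) AS students
    FROM student
    GROUP BY course;   -- prints the stage graph of the generated MapReduce job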
HIVE Data Model

Name        Description
Database    Namespace for tables.
Tables      Similar to tables in an RDBMS. Support filter, projection, join and union operations. The table data is stored in a directory in HDFS.
Partitions  A table can have one or more partition keys that determine how the data is stored.
Buckets     Data in each partition is further divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
HIVEQL
• Hive Query Language (abbreviated HiveQL) is for querying the large datasets which reside in the HDFS environment.
• HiveQL script commands enable data definition, data manipulation and query processing.
• HiveQL supports a large base of users who are acquainted with SQL to extract information from data warehouses.

HiveQL Process Engine   HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine        The bridge between the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
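For example, a job that would otherwise be a hand-written MapReduce program in Java can be expressed as a single query; a minimal sketch (the words table and its column are hypothetical):

    SELECT word, COUNT(*) AS freq
    FROM words
    GROUP BY word
    ORDER BY freq DESC;   -- Hive compiles this into MapReduce stages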
HiveQL
• HiveQL script commands enable:
  1. data definition,
  2. data manipulation, and
  3. query processing (data definitions and manipulations create tables and files).
HiveQL Data Definition Language (DDL)
• HiveQL database commands for data definition for databases and tables are: CREATE DATABASE, SHOW DATABASES (lists all DBs), CREATE SCHEMA, CREATE TABLE.
HiveQL Data Manipulation Language (DML)
• USE database_name;
• DROP DATABASE;
• DROP SCHEMA;
• ALTER TABLE;
• DROP TABLE;
• and LOAD DATA (inserting the data).
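A minimal sketch of switching to and dropping a database (the database name is hypothetical; CASCADE first drops any tables the database still contains):

    USE college;
    DROP DATABASE college;            -- fails if the database is not empty
    DROP DATABASE college CASCADE;    -- drops the tables first, then the database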
COMMANDS
• CREATE DATABASE database_name;

• Example: CREATE DATABASE toy;

• SHOW DATABASES;

• DROP DATABASE DBName;

• DROP SCHEMA DBName; (equivalent to DROP DATABASE)

Commands

 In Hive there are 2 types of tables:

 Internal (managed) and

 External tables.

 An internal table is the default. When it is dropped, your data is not safe: Hive deletes both the table's metadata and the data itself.

 In an external table the data is safe: if someone drops the table, only its metadata is removed from Hive, while the data files remain in HDFS.
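A minimal sketch of the difference (the table names and HDFS location are hypothetical):

    CREATE TABLE managed_demo (id INT);              -- managed (internal) table
    CREATE EXTERNAL TABLE external_demo (id INT)
      LOCATION '/data/external_demo';                -- external table

    DROP TABLE managed_demo;     -- the data under the warehouse directory is deleted too
    DROP TABLE external_demo;    -- only metadata is removed; /data/external_demo remains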
Hive - Create Table
• Command:
• create table emp (Id int, Name string, Salary float)
  row format delimited
  fields terminated by ',';

 Describe emp; (displays all columns of the table)

 Describe formatted emp; (to check the type of a table: managed/external)
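A minimal usage sketch once emp exists (the file path and values are hypothetical):

    LOAD DATA LOCAL INPATH '/tmp/emp.csv' INTO TABLE emp;
    SELECT Name, Salary FROM emp WHERE Salary > 50000;   -- runs as a MapReduce job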
Create table

• External table creation.

• Command:
• create external table emplist (Id int, Name string, Salary float)
  row format delimited
  fields terminated by ','
  location '/HiveDirectory';
OR
• create external table emplist (Id int, Name string, Salary float)
  row format delimited
  fields terminated by ','
  stored as textfile;

 Describe emplist; (displays all columns of the table)

 Describe formatted emplist; (to check the type of a table: managed/external)
Alter table commands
• Alter table old_table_name rename to new_table_name;
• Alter table table_name add columns(column_name datatype);
• Alter table table_name change old_column_name new_column_name datatype;

• Alter table emp RENAME TO Employee;

• Alter table Employee add columns(Lname string);

• Describe Employee;

• Alter table Employee change name fname string;
Hive Partitioning vs Bucketing
• Hive Partitioning is a way to organize large tables into smaller logical tables based on the values of columns: one logical table (partition) for each distinct value. In Hive, tables are created as directories on HDFS. A table can have one or more partitions, each corresponding to a sub-directory inside the table directory.

• Hive Bucketing (clustering) is a technique to split the data into more manageable files, by specifying the number of buckets to create. The value of the bucketing column is hashed into a user-defined number of buckets.

• Bucketing can be created on just one column. You can also create bucketing on a partitioned table to further split the data, which further improves the query performance of the partitioned table.

• Each bucket is stored as a file within the table's directory or the partition directories.
Commands
• CREATE TABLE student (Id int, Name string, age int)
  PARTITIONED BY (course string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';
(The partition column course is not repeated in the regular column list; Hive adds it to the table schema automatically.)
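A minimal sketch of loading data into one partition and listing the partitions (the file path and partition value are hypothetical):

    LOAD DATA LOCAL INPATH '/tmp/btech_students.csv'
    INTO TABLE student PARTITION (course = 'btech');

    SHOW PARTITIONS student;   -- lists course=btech, etc.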
Commands
• Create database college;
• Use college;

• Create table stu_bucket(Id int, Name string, age int, course string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';

• Load data local inpath 'path' into table stu_bucket;

• Set hive.enforce.bucketing = true;
Commands
• CREATE TABLE student (Id int, Name string, age int, course string)
  CLUSTERED BY (course) INTO 3 BUCKETS
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';
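Bucketed tables are typically populated with INSERT ... SELECT from a plain staging table, so that Hive can hash the rows into buckets; a minimal sketch, assuming the bucketed table just created and the stu_bucket staging table from the previous slide:

    INSERT OVERWRITE TABLE student
    SELECT Id, Name, age, course FROM stu_bucket;

    -- individual buckets can then be sampled directly
    SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 3 ON course);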


Differences Between Hive Partitioning vs Bucketing

PARTITIONING                                           BUCKETING
A directory is created on HDFS for each partition.     A file is created on HDFS for each bucket.
You can have one or more partition columns.            You can have only one bucketing column.
You can't manage the number of partitions created.     You can manage the number of buckets by specifying the count.
NA                                                     Bucketing can be created on a partitioned table.
Uses PARTITIONED BY.                                   Uses CLUSTERED BY.
