Hive
• Hive reads data from HDFS, processes it using MapReduce, and can store the
output back to HDFS.
Limitations of Hive

Feature                 Hive                 RDBMS
Delete (row-level)      No                   Yes
Transaction support     No                   Yes
Latency                 Minutes or more      Fractions of a second
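The missing delete/transaction support is visible at the SQL level: a DELETE fails on a plain Hive table and works only on the ACID-enabled tables introduced in Hive 0.14+. A minimal sketch, with illustrative table names:

    -- On a plain (non-transactional) table, row-level DELETE is rejected:
    --   DELETE FROM plain_logs WHERE id = 42;   -- error: table is not ACID-compliant

    -- ACID tables must be bucketed ORC tables marked transactional:
    CREATE TABLE acid_logs (id INT, msg STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    DELETE FROM acid_logs WHERE id = 42;   -- allowed on the transactional table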
File Format    Description
SequenceFile   Flat file that stores binary key-value pairs and supports
               compression. (A SequenceFile might hold a massive number of log
               files for a server, where each key is a timestamp and the value
               is the entire log file.)
RCFile         Record Columnar File: stores row groups column-wise, which
               improves compression and column-oriented reads.
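The storage format is chosen per table with a STORED AS clause; a brief sketch, with illustrative table names:

    CREATE TABLE server_logs (ts BIGINT, log STRING)
    STORED AS SEQUENCEFILE;

    CREATE TABLE sales_rc (item STRING, amount DOUBLE)
    STORED AS RCFILE;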
1. Execute Query
2. Get Plan
3. Get Metadata
4. Send Metadata
5. Send Plan
6. Execute Plan
7. Execute Job
8. Metadata Operations
9. Fetch Result
10. Send Results
Working
• UI : Can send read/write request in 3 ways. i) Hive CLI ii)Web interface iii) Thrift
server(JDBC/ODBC)
• Driver – Receives query from User interface. Fetch required API request for
JDBC/ODBC interface.
• Compiler : Convert hive to Mapreduce programs and also semantic analysis of
the program.
• Metastore: It stores the schema or metadata of tables, databases, columns in a
table, their data types and HDFS mapping.
• Execution engine : Connected with Hadoop engine. Interact with Resource
manager and namenode to fetch the result from HDFS and finally send final result
to User.
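As a concrete entry point for the Thrift (JDBC/ODBC) path, Hive ships with the Beeline client; a minimal sketch, assuming HiveServer2 is listening on its default port 10000 on localhost:

    beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"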
The workflow steps are as follows:
• Execute Query: The Hive interface (CLI or Web Interface) sends a query to
the Driver to execute it.
• Get Plan: The Driver sends the query to the query compiler, which parses the
query to check the syntax and to work out the query plan and the requirements
of the query.
• Get Metadata: The compiler sends a metadata request to the Metastore (backed
by any relational database, such as MySQL).
• Send Metadata: The Metastore sends the metadata back to the compiler as a
response.
• Send Plan: The compiler checks the requirements and resends the plan to the
Driver. Parsing and compiling of the query are complete at this point.
• Execute Plan: The Driver sends the execution plan to the execution engine.
• Execute Job: Internally, the execution is a MapReduce job. The execution
engine sends the job to the JobTracker (which runs on the NameNode); the
JobTracker assigns it to TaskTrackers (which run on the DataNodes), where the
job executes.
• Metadata Operations: Meanwhile, the execution engine can perform metadata
operations with the Metastore.
• Fetch Result: The execution engine receives the results from the DataNodes.
• Send Results: The execution engine sends the results to the Driver.
• Send Results: The Driver sends the results to the Hive interfaces.
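The plan built in steps 2-5 can be inspected directly with EXPLAIN before anything is executed; a small sketch, with an illustrative table name:

    EXPLAIN
    SELECT course, COUNT(*)
    FROM student
    GROUP BY course;
    -- prints the stage graph (map and reduce stages) that the execution engine will run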
HIVE Data Model

Name        Description
Database    Namespace for tables.
Tables      Similar to tables in an RDBMS; support filter, projection, join
            and union operations. The table data is stored in a directory in
            HDFS.
Partitions  A table can have one or more partition keys that determine how the
            data is stored.
Buckets     Data in each partition is further divided into buckets based on
            the hash of a column in the table; each bucket is stored as a file
            in the partition directory.
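All four levels of the model map directly onto HiveQL; a brief sketch, with illustrative names:

    CREATE DATABASE retail;                        -- Database: a namespace
    USE retail;
    CREATE TABLE orders (id INT, amount DOUBLE)    -- Table: a directory in HDFS
    PARTITIONED BY (order_date STRING)             -- Partition: a subdirectory per key value
    CLUSTERED BY (id) INTO 4 BUCKETS;              -- Buckets: files inside each partition directory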
HIVEQL
• Hive Query Language (HiveQL) is used for querying the large datasets that
reside in the HDFS environment.
• HiveQL script commands enable data definition, data manipulation and query
processing.
• HiveQL supports the large base of users already acquainted with SQL, letting
them extract information from data warehouses.
• HiveQL data-definition commands for databases and tables are:
CREATE DATABASE, SHOW DATABASES (lists all databases), CREATE SCHEMA,
CREATE TABLE.
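A short session with these definition commands (names are illustrative):

    CREATE DATABASE college;
    SHOW DATABASES;           -- lists all databases, including the new 'college'
    CREATE SCHEMA university; -- CREATE SCHEMA is a synonym for CREATE DATABASE
    CREATE TABLE college.employee (id INT, name STRING, salary DOUBLE);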
HiveQL Data Manipulation Language (DML) and related management commands
• USE database_name;
• DROP DATABASE;
• DROP SCHEMA;
• ALTER TABLE;
• DROP TABLE;
• LOAD DATA (inserts the data).
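These commands chain together naturally; a sketch, assuming an illustrative local file /tmp/students.csv:

    USE college;
    ALTER TABLE employee RENAME TO staff;
    LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE staff;
    DROP TABLE staff;
    DROP DATABASE college;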
COMMANDS
• CREATE DATABASE database_name;
• SHOW DATABASES;
• Hive has two kinds of tables: internal (managed) and external. An internal
table is the default; Hive owns its data, so dropping the table also deletes
the underlying data (the data is not protected). For an external table, Hive
stores only the metadata, and dropping the table leaves the HDFS data intact.
• DESCRIBE Employee;
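The internal/external difference shows up in the CREATE statement; a brief sketch, with an illustrative HDFS path:

    -- Internal (managed) table: DROP TABLE deletes both metadata and data
    CREATE TABLE emp_internal (id INT, name STRING);

    -- External table: DROP TABLE deletes only the metadata; the HDFS files remain
    CREATE EXTERNAL TABLE emp_external (id INT, name STRING)
    LOCATION '/user/data/employees';

    DESCRIBE emp_external;   -- shows the columns and their types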
• Hive bucketing (clustering) is a technique to split the data into more
manageable files by specifying the number of buckets to create. The value of
the bucketing column is hashed into that user-defined number of buckets.
• Bucketing is defined on just one column; you can also bucket a partitioned
table to split the data further, which further improves the query performance
of the partitioned table.
• Each bucket is stored as a file within the table's directory or within the
partition directories.
Commands
• CREATE TABLE student (Id int, Name string, age int)
PARTITIONED BY (course string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
(Note: the partition column course appears only in PARTITIONED BY; Hive
rejects the definition if it is repeated in the regular column list.)
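Data is then loaded into a specific partition; a sketch, assuming an illustrative local CSV file:

    LOAD DATA LOCAL INPATH '/tmp/hadoop_students.csv'
    INTO TABLE student
    PARTITION (course = 'Hadoop');

    SHOW PARTITIONS student;   -- lists course=Hadoop and any other partitions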
Commands
• CREATE DATABASE college;
• USE college;
• SET hive.enforce.bucketing = true;
Commands
• CREATE TABLE student (Id int, Name string, age int, course string)
CLUSTERED BY (course) INTO 3 BUCKETS;
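With hive.enforce.bucketing enabled, an INSERT ... SELECT spreads the rows across the three bucket files by the hash of course, and TABLESAMPLE can then read a single bucket; a sketch, assuming an illustrative staging table:

    INSERT OVERWRITE TABLE student
    SELECT Id, Name, age, course FROM student_staging;

    -- read only the first of the three buckets
    SELECT * FROM student TABLESAMPLE (BUCKET 1 OUT OF 3 ON course);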
PARTITIONING                                          BUCKETING
A directory is created on HDFS per partition.         A file is created on HDFS per bucket.
One or more partition columns are allowed.            Only one bucketing column is allowed.
The number of partitions cannot be fixed in advance.  The number of buckets is set by specifying the count.
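Combining the two, a table partitioned by course and split into three buckets would lay out on HDFS roughly as follows (assuming the default warehouse path; directory and file names are illustrative):

    /user/hive/warehouse/college.db/student/course=Hadoop/000000_0
    /user/hive/warehouse/college.db/student/course=Hadoop/000001_0
    /user/hive/warehouse/college.db/student/course=Hadoop/000002_0
    /user/hive/warehouse/college.db/student/course=Java/000000_0
    ...

One directory per partition value, and one file per bucket inside it.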