Unit 3 Hive Overview and Architecture

Apache Hive is a data warehouse system built on Hadoop that simplifies querying and managing large datasets through a SQL-like language called HiveQL. Its architecture includes components such as the User Interface, Hive Driver, Compiler, Optimizer, Execution Engine, Metastore, and integration with Hadoop's ecosystem. HiveQL supports various operations for data definition, manipulation, and querying, along with features like partitioning and bucketing for efficient data management.


Hive Overview and Architecture (Detailed Explanation)

Apache Hive is a data warehouse system built on top of Hadoop that provides a high-level
interface to query, analyze, and manage large datasets stored in Hadoop's HDFS (Hadoop
Distributed File System). It abstracts away the complexities of writing low-level MapReduce
code by providing a SQL-like query language, HiveQL, making it easier for analysts and
developers to interact with big data.

Hive was initially developed by Facebook and later contributed to the Apache Software
Foundation. Its primary use case is to perform data summarization, querying, and analysis on
large-scale datasets, particularly those stored in HDFS.

Hive Architecture in Detail:

1. User Interface (UI):


o The user interface is the entry point through which users interact with
Hive:
 Hive Command Line Interface (CLI): The original terminal-based
interface for running queries, managing databases, and accessing
data; in recent Hive versions it has been superseded by Beeline.
 Beeline: A JDBC-based command-line client that connects to
HiveServer2 and is the recommended shell for interactive use.
 JDBC/ODBC Interfaces: Applications can interact with Hive using
JDBC or ODBC for seamless integration with various applications (e.g.,
BI tools like Tableau, Excel, or custom applications).
2. Hive Driver:
o The Hive Driver is the core component responsible for managing the lifecycle of
a query execution. When a user submits a query via the UI, the Hive Driver
coordinates the entire process of query execution.
o It performs the following tasks:
 Parsing: Translates the SQL-like HiveQL query into an internal format
that can be processed.
 Compilation: Transforms the query into a series of logical steps (abstract
syntax tree).
 Optimization: The query undergoes an optimization phase to ensure that
it can be executed efficiently (e.g., minimizing I/O).
 Execution: The Driver sends the optimized query to the Execution Engine
for processing.
3. Compiler:
o The Compiler is responsible for parsing the HiveQL query and converting it into
a series of lower-level operations, usually expressed as MapReduce tasks or tasks
for other execution engines (like Apache Tez or Apache Spark).
o The compiler performs the following operations:
 Lexical analysis: Converts the query into tokens that can be understood
by the system.
 Parsing: The query is parsed into an Abstract Syntax Tree (AST), which
represents the structure of the query.
 Semantic analysis: Checks for query validity and ensures that the
referenced tables, columns, and partitions exist in the metastore.
 Logical plan generation: Converts the parsed query into an optimized
logical plan (a series of operations) that can be executed on Hadoop.
4. Optimizer:
o After the compiler generates a logical plan, the Optimizer takes over to improve
the performance of the plan. The optimizer focuses on reducing the query's
complexity, minimizing resource consumption, and improving overall
performance.
o Some common optimizations performed by the Hive optimizer include:
 Predicate Pushdown: Moves filtering operations closer to data (i.e.,
filters are applied at the data source level).
 Join Reordering: Reorders joins to minimize the data processed and
reduce computational overhead.
 Column pruning: Removes unnecessary columns from the query,
reducing the amount of data to process.
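
The effect of these optimizations can be inspected without running a query by using Hive's EXPLAIN statement, which prints the plan the optimizer produced (shown here against the employees table used in the examples later in this unit):

EXPLAIN
SELECT name FROM employees WHERE department = 'Engineering';

With predicate pushdown and column pruning applied, the plan's TableScan and Filter operators show that only the needed column and matching rows flow through the rest of the plan.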
5. Execution Engine:
o The Execution Engine takes the optimized plan and executes it across the
Hadoop cluster. This engine typically executes the plan as a series of MapReduce
tasks, although it can also execute with other frameworks like Apache Tez or
Apache Spark.
o The execution engine interacts with Hadoop's YARN (Yet Another Resource
Negotiator) to manage resources across the cluster.
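
The execution engine can be selected per session through a configuration property; a minimal sketch, assuming Tez is installed on the cluster (accepted values in recent releases are mr, tez, and spark, though spark support depends on the Hive build):

SET hive.execution.engine=tez;
SELECT COUNT(*) FROM employees;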
6. Metastore:
o The Metastore is a central repository in Hive that stores metadata about tables,
partitions, columns, data types, etc. It provides a logical view of the data, and the
underlying physical data is stored in HDFS or other compatible storage systems.
o Metadata: Includes information about table schemas, data formats, and
partitioning schemes.
o The metastore can be stored in relational databases like MySQL or PostgreSQL,
and it can be accessed using APIs or JDBC.
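
The metadata held in the metastore can be browsed from HiveQL itself; for example (using the employees table defined later in this unit):

SHOW DATABASES;
SHOW TABLES;
DESCRIBE FORMATTED employees;  -- schema, storage format, HDFS location

DESCRIBE FORMATTED is useful for confirming where Hive believes a table's data lives in HDFS and which SerDe and file format it uses.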
7. Storage (HDFS):
o The actual data storage is in HDFS (Hadoop Distributed File System), which is
designed for storing very large files across distributed clusters. Hive does not
store data by itself; it relies on Hadoop to manage the physical storage of the data
in HDFS.
o Hive supports various file formats for storage, including:
 Text files: Standard delimited files like CSV.
 ORC (Optimized Row Columnar): A columnar storage format
optimized for large-scale data processing.
 Parquet: Another columnar storage format designed for performance and
compatibility with other big data tools.
 Avro: A row-based storage format used for serializing structured data.
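
The storage format is chosen per table with the STORED AS clause; a short sketch creating the same illustrative table (names are hypothetical) in two of the formats above:

CREATE TABLE logs_orc (ts STRING, msg STRING) STORED AS ORC;
CREATE TABLE logs_parquet (ts STRING, msg STRING) STORED AS PARQUET;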
8. Hadoop Ecosystem Integration:
o HDFS (Hadoop Distributed File System): Hive stores and retrieves data from
HDFS, taking advantage of the distributed nature of HDFS for scalability and
fault tolerance.
o YARN (Yet Another Resource Negotiator): YARN is used for resource
management in a Hadoop cluster. The Hive execution engine interacts with
YARN to allocate resources (like CPU and memory) for running queries.
o Apache HBase: Hive can be integrated with HBase for real-time querying of data
stored in HBase.
o Apache Tez or Apache Spark: While Hive traditionally used MapReduce for
query execution, newer versions of Hive allow users to run queries using Apache
Tez or Apache Spark, which are more efficient than MapReduce for certain
types of workloads.

Hive Query Language (HiveQL) in Detail:

HiveQL is a SQL-like language used to query, insert, update, and manage large datasets in Hive.
It provides a rich set of commands for working with structured data. Below are the key elements
of HiveQL:

1. Data Definition Language (DDL):


o CREATE DATABASE: Creates a new database in Hive.

CREATE DATABASE company;

o CREATE TABLE: Defines a new table schema in Hive.

CREATE TABLE employees (id INT, name STRING, age INT, department STRING)
STORED AS ORC;

o DROP TABLE: Removes an existing table from Hive.

DROP TABLE employees;

o ALTER TABLE: Modifies an existing table's schema. For example, you can add
a new column.

ALTER TABLE employees ADD COLUMNS (salary DOUBLE);

2. Data Manipulation Language (DML):


o SELECT: Retrieves data from one or more tables, similar to SQL.

SELECT * FROM employees WHERE age > 30;

o INSERT INTO: Appends new rows to a table.

INSERT INTO employees VALUES (1, 'John Doe', 28, 'Engineering');

o LOAD: Loads data from external files into Hive tables.

LOAD DATA INPATH '/path/to/data' INTO TABLE employees;

o UPDATE: Hive supports row-level updates only on ACID transactional
tables (available since Hive 0.14), not on ordinary tables as in a
relational database.

UPDATE employees SET salary = 5000 WHERE id = 1;
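
For the statement above to work, the target table must be created as an ACID transactional table; a minimal sketch, assuming the cluster's metastore is configured for transactions (ORC storage and the transactional property are required):

CREATE TABLE employees (id INT, name STRING, age INT,
                        department STRING, salary DOUBLE)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');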

3. Querying Data:
o SELECT with WHERE, GROUP BY, ORDER BY, and JOINs:

SELECT department, AVG(age)
FROM employees
WHERE age > 30
GROUP BY department
ORDER BY AVG(age) DESC;

4. Partitioning:
o Hive supports partitioning, which helps manage large datasets by dividing the
data into smaller, manageable parts. A partition is created by splitting data into
subdirectories within HDFS based on a partition key (e.g., date, region).

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (year INT, month INT);
o Partitioning helps in faster query execution as only relevant partitions are scanned
during query execution.
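
Loading into a partitioned table names the partition explicitly (static partitioning), and the resulting subdirectories can be listed afterwards; the values here are illustrative:

INSERT INTO sales PARTITION (year = 2024, month = 1)
VALUES (101, 250.0);
SHOW PARTITIONS sales;  -- e.g. year=2024/month=1

A query filtered on year and month then scans only the matching partition directories rather than the whole table.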
5. Bucketing:
o Bucketing divides the data into "buckets" (files) based on a hash function. This is
different from partitioning, as it ensures the data is evenly distributed across a
fixed number of files.

CREATE TABLE employees (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
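
One practical benefit of bucketing is efficient sampling: TABLESAMPLE can read a single pre-hashed bucket instead of scanning the whole table. Against the four-bucket table above:

SELECT * FROM employees TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);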

6. Aggregation and Functions:


o Hive supports common aggregate functions like SUM(), AVG(), COUNT(), MAX(),
MIN(), and also allows user-defined functions (UDFs) for advanced calculations.

SELECT department, COUNT(*)
FROM employees
GROUP BY department;

7. Joins:
o Hive supports different types of joins (like INNER JOIN, LEFT JOIN, RIGHT JOIN,
and FULL OUTER JOIN) to combine data from multiple tables.

SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
