Unit 3 Hive Overview and Architecture

Apache Hive is a data warehouse system built on Hadoop that simplifies querying and managing large datasets through a SQL-like language called HiveQL. Its architecture includes components such as the User Interface, Hive Driver, Compiler, Optimizer, Execution Engine, Metastore, and integration with Hadoop's ecosystem. HiveQL supports various operations for data definition, manipulation, and querying, along with features like partitioning and bucketing for efficient data management.

Uploaded by

kannan.niran


Hive Overview and Architecture (Detailed Explanation)

Apache Hive is a data warehouse system built on top of Hadoop that provides a high-level
interface to query, analyze, and manage large datasets stored in Hadoop's HDFS (Hadoop
Distributed File System). It abstracts away the complexities of writing low-level MapReduce
code by providing a SQL-like query language, HiveQL, making it easier for analysts and
developers to interact with big data.

Hive was initially developed by Facebook and later contributed to the Apache Software
Foundation. Its primary use case is to perform data summarization, querying, and analysis on
large-scale datasets, particularly those stored in HDFS.

Hive Architecture in Detail:

1. User Interface (UI):

o The user interface is how users submit queries and commands to Hive. The main
options are:
▪ Hive Command Line Interface (CLI): The original terminal-based shell
for running queries and managing databases and tables. It is deprecated in
recent Hive releases in favor of Beeline.
▪ Beeline: A command-line JDBC client that connects to HiveServer2. It is
a shell, not a web interface.
▪ JDBC/ODBC Interfaces: Applications such as BI tools (e.g., Tableau or
Excel) and custom programs can connect to Hive through JDBC or ODBC
drivers.
2. Hive Driver:
o The Hive Driver manages the lifecycle of a query. When a user submits a query
via the UI, the Driver coordinates the remaining components until the results
are returned.
o It moves the query through the following stages:
▪ Parsing: The HiveQL text is turned into an abstract syntax tree (AST).
▪ Compilation: The Compiler checks the AST against the Metastore and
produces an execution plan.
▪ Optimization: The plan is rewritten so that it runs more efficiently
(e.g., by reducing I/O).
▪ Execution: The Driver hands the optimized plan to the Execution Engine
for processing.
3. Compiler:
o The Compiler parses the HiveQL query and converts it into a plan of
lower-level operations, which is eventually run as MapReduce jobs or as tasks
for another execution engine (such as Apache Tez or Apache Spark).
o The compiler performs the following operations:
▪ Lexical analysis: Splits the query text into tokens.
▪ Parsing: Builds an Abstract Syntax Tree (AST) that represents the
structure of the query.
▪ Semantic analysis: Checks that the query is valid and that the referenced
tables, columns, and partitions exist in the Metastore.
▪ Logical plan generation: Converts the analyzed query into a logical plan
(a tree of operators), which is then passed to the Optimizer.
4. Optimizer:
o After the compiler generates a logical plan, the Optimizer takes over to improve
the performance of the plan. The optimizer focuses on reducing the query's
complexity, minimizing resource consumption, and improving overall
performance.
o Some common optimizations performed by the Hive optimizer include:
▪ Predicate Pushdown: Moves filtering operations closer to the data (i.e.,
filters are applied at the data source level).
▪ Join Reordering: Reorders joins to minimize the data processed and
reduce computational overhead.
▪ Column Pruning: Removes unnecessary columns from the query,
reducing the amount of data to process.
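
One way to see what the Compiler and Optimizer produced is to prefix a query with EXPLAIN, which prints the plan without running it. The sketch below assumes the employees table from the DDL examples in this document. The exact output differs between Hive versions, but the table scan is typically followed by a filter operator carrying the WHERE predicate and a select operator listing only the referenced columns.

```sql
-- Print the execution plan instead of running the query.
EXPLAIN
SELECT name, department   -- only these columns (plus age for the filter) are needed
FROM employees
WHERE age > 30;           -- candidate for predicate pushdown
```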
5. Execution Engine:
o The Execution Engine takes the optimized plan and executes it across the
Hadoop cluster. This engine typically executes the plan as a series of MapReduce
tasks, although it can also execute with other frameworks like Apache Tez or
Apache Spark.
o The execution engine interacts with Hadoop's YARN (Yet Another Resource
Negotiator) to manage resources across the cluster.
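
The engine is selected through the hive.execution.engine property. The lines below are a minimal sketch for a single session; which values are usable is an assumption about the cluster, since it depends on the Hive version and on which engines are installed.

```sql
-- Choose the execution engine for the current session.
-- Documented values are mr (MapReduce), tez, and spark.
SET hive.execution.engine=tez;

-- Print the value currently in effect.
SET hive.execution.engine;
```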
6. Metastore:
o The Metastore is a central repository in Hive that stores metadata about tables,
partitions, columns, data types, etc. It provides a logical view of the data, and the
underlying physical data is stored in HDFS or other compatible storage systems.
o Metadata: Includes information about table schemas, data formats, and
partitioning schemes.
o The metastore can be stored in relational databases like MySQL or PostgreSQL,
and it can be accessed using APIs or JDBC.
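
The statements below are answered from the Metastore rather than by scanning data files. They are a sketch that assumes the company database and the employees and sales tables from the HiveQL examples in this document already exist.

```sql
-- List databases and the tables inside one of them.
SHOW DATABASES;
SHOW TABLES IN company;

-- Show columns, storage format, location, and table properties.
DESCRIBE FORMATTED employees;

-- List the partitions registered for a partitioned table.
SHOW PARTITIONS sales;
```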
7. Storage (HDFS):
o The actual data storage is in HDFS (Hadoop Distributed File System), which is
designed for storing very large files across distributed clusters. Hive does not
store data by itself; it relies on Hadoop to manage the physical storage of the data
in HDFS.
o Hive supports various file formats for storage, including:
▪ Text files: Standard delimited files like CSV.
▪ ORC (Optimized Row Columnar): A columnar storage format
optimized for large-scale data processing.
▪ Parquet: Another columnar storage format designed for performance and
compatibility with other big data tools.
▪ Avro: A row-based storage format used for serializing structured data.
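
The file format is chosen per table with the STORED AS clause. The following sketch uses hypothetical table names (events_text, events_orc, and so on) that do not appear elsewhere in this document; the last statement shows a common pattern of converting delimited text into a columnar format.

```sql
-- Comma-delimited text, e.g. for raw CSV files.
CREATE TABLE events_text (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- The same schema in the other formats listed above.
CREATE TABLE events_orc     (id INT, payload STRING) STORED AS ORC;
CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;
CREATE TABLE events_avro    (id INT, payload STRING) STORED AS AVRO;

-- Rewrite the text data as ORC.
INSERT INTO TABLE events_orc
SELECT id, payload FROM events_text;
```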
8. Hadoop Ecosystem Integration:
o HDFS (Hadoop Distributed File System): Hive stores and retrieves data from
HDFS, taking advantage of the distributed nature of HDFS for scalability and
fault tolerance.
o YARN (Yet Another Resource Negotiator): YARN is used for resource
management in a Hadoop cluster. The Hive execution engine interacts with
YARN to allocate resources (like CPU and memory) for running queries.
o Apache HBase: Hive can be integrated with HBase so that data stored in HBase
tables can be queried with HiveQL.
o Apache Tez or Apache Spark: While Hive traditionally used MapReduce for
query execution, newer versions of Hive allow users to run queries using Apache
Tez or Apache Spark, which are more efficient than MapReduce for certain
types of workloads.

Hive Query Language (HiveQL) in Detail:

HiveQL is a SQL-like language used to query, insert, update, and manage large datasets in Hive.
It provides a rich set of commands for working with structured data. Below are the key elements
of HiveQL:

1. Data Definition Language (DDL):

o CREATE DATABASE: Creates a new database in Hive.

sql
CREATE DATABASE company;

o CREATE TABLE: Defines a new table schema in Hive.

sql
CREATE TABLE employees (id INT, name STRING, age INT, department
STRING)
STORED AS ORC;

o DROP TABLE: Removes an existing table from Hive.

sql
DROP TABLE employees;

o ALTER TABLE: Modifies an existing table's schema. For example, you can add
a new column.

sql
ALTER TABLE employees ADD COLUMNS (salary DOUBLE);

2. Data Manipulation Language (DML):

o SELECT: Retrieves data from one or more tables, similar to SQL.

sql
SELECT * FROM employees WHERE age > 30;

o INSERT INTO: Inserts data into a table. This is typically used for inserting new
records.

sql
INSERT INTO employees VALUES (1, 'John Doe', 28, 'Engineering');

o LOAD: Loads data from external files into Hive tables.

sql
LOAD DATA INPATH '/path/to/data' INTO TABLE employees;
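
LOAD DATA has two optional keywords that change its behavior. Without LOCAL, the path refers to HDFS and the files are moved into the table's directory; with LOCAL, the path refers to the client's filesystem and the files are copied. The sketch below assumes a hypothetical text-format table employees_staging and a hypothetical file /tmp/employees.csv, because LOAD does not convert files and they must already match the table's storage format.

```sql
-- Copy a file from the local filesystem and append it to the table.
LOAD DATA LOCAL INPATH '/tmp/employees.csv'
INTO TABLE employees_staging;

-- Same source, but replace whatever the table currently contains.
LOAD DATA LOCAL INPATH '/tmp/employees.csv'
OVERWRITE INTO TABLE employees_staging;
```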

o UPDATE: Modifies existing rows. Hive supports UPDATE (and DELETE) only from
version 0.14 onward, and only on transactional (ACID) tables, which must be
stored as ORC. On an ordinary table the statement fails.

sql
UPDATE employees SET salary = 5000 WHERE id = 1;
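
For UPDATE to succeed, the target must be a transactional table. The sketch below uses a hypothetical table employees_txn. It assumes the cluster has ACID support switched on (hive.support.concurrency set to true and hive.txn.manager set to the DbTxnManager class); the CLUSTERED BY clause is required by Hive versions before 3.0 and is optional afterwards.

```sql
-- A table that accepts UPDATE and DELETE.
CREATE TABLE employees_txn (id INT, name STRING, salary DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level changes are now allowed.
UPDATE employees_txn SET salary = 5000 WHERE id = 1;
DELETE FROM employees_txn WHERE id = 2;
```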

3. Querying Data:
o SELECT with WHERE, GROUP BY, and ORDER BY (joins are covered in item 7):

sql
SELECT department, AVG(age) AS avg_age
FROM employees
WHERE age > 30
GROUP BY department
ORDER BY avg_age DESC;

4. Partitioning:
o Hive supports partitioning, which helps manage large datasets by dividing the
data into smaller, manageable parts. A partition is created by splitting data into
subdirectories within HDFS based on a partition key (e.g., date, region).

sql
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (year INT, month INT);
o Partitioning helps in faster query execution as only relevant partitions are scanned
during query execution.
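
A short sketch of how the sales table above is filled and queried. Each (year, month) combination becomes a subdirectory such as year=2024/month=1 under the table's directory, and a filter on the partition columns lets Hive skip every other subdirectory. The inserted values are made up for illustration.

```sql
-- Static partition insert: the partition values go in the PARTITION clause,
-- the remaining columns (id, amount) in VALUES.
INSERT INTO TABLE sales PARTITION (year = 2024, month = 1)
VALUES (1, 99.50), (2, 12.00);

-- List the partitions that now exist.
SHOW PARTITIONS sales;

-- Only the year=2024/month=1 partition is scanned.
SELECT SUM(amount) AS total_amount
FROM sales
WHERE year = 2024 AND month = 1;
```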
5. Bucketing:
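o Bucketing divides the data into a fixed number of "buckets" (files) by applying
a hash function to the bucketing column. Unlike partitioning, which creates one
directory per distinct key value, bucketing caps the number of files, and rows
are spread roughly evenly when the column has many distinct values.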

sql
CREATE TABLE employees (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
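
One use of buckets is sampling: because each row is assigned to a bucket by hashing id, a query can read a single bucket instead of the whole table. The sketch below assumes the bucketed employees table above has been populated through INSERT statements (on Hive 1.x this also assumes hive.enforce.bucketing was set to true during the insert).

```sql
-- Read roughly one quarter of the table: bucket 1 of the 4 buckets on id.
SELECT *
FROM employees TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```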

6. Aggregation and Functions:

o Hive supports common aggregate functions like SUM(), AVG(), COUNT(), MAX(),
MIN(), and also allows user-defined functions (UDFs) for advanced calculations.

sql
SELECT department, COUNT(*)
FROM employees
GROUP BY department;

7. Joins:
o Hive supports different types of joins (like INNER JOIN, LEFT JOIN, RIGHT JOIN,
and FULL OUTER JOIN) to combine data from multiple tables.

sql
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
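
The plain JOIN above is an inner join, so employees without a matching department are dropped. As a sketch under the same assumed schema (an employees table with a department_id column and a departments table), a left outer join keeps them and returns NULL for the department name.

```sql
-- Keep every employee, even when no department row matches.
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d
  ON e.department_id = d.department_id;
```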
