Unit 3 Hive Overview and Architecture
Apache Hive is a data warehouse system built on top of Hadoop that provides a high-level
interface to query, analyze, and manage large datasets stored in Hadoop's HDFS (Hadoop
Distributed File System). It abstracts away the complexities of writing low-level MapReduce
code by providing a SQL-like query language, HiveQL, making it easier for analysts and
developers to interact with big data.
Hive was initially developed by Facebook and later contributed to the Apache Software
Foundation. Its primary use case is to perform data summarization, querying, and analysis on
large-scale datasets, particularly those stored in HDFS.
HiveQL is a SQL-like language used to query, insert, update, and manage large datasets in Hive.
It provides a rich set of commands for working with structured data. Below are the key elements
of HiveQL:
1. Data Definition Language (DDL):
o CREATE DATABASE: Creates a new database.
CREATE DATABASE company;
o CREATE TABLE: Creates a new table with the specified columns and storage format.
CREATE TABLE employees (id INT, name STRING, age INT, department STRING)
STORED AS ORC;
o DROP TABLE: Removes an existing table.
DROP TABLE employees;
o ALTER TABLE: Modifies an existing table's schema. For example, you can add
a new column.
ALTER TABLE employees ADD COLUMNS (salary DOUBLE);
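2. Data Manipulation Language (DML):
o SELECT: Retrieves rows from a table, optionally filtered by a condition.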
SELECT * FROM employees WHERE age > 30;
o INSERT INTO: Inserts data into a table. This is typically used for inserting new
records.
INSERT INTO employees VALUES (1, 'John Doe', 28, 'Engineering');
o LOAD DATA: Moves data files from an HDFS path into a table's storage location.
LOAD DATA INPATH '/path/to/data' INTO TABLE employees;
o UPDATE: Hive does not support row-level updates on ordinary tables the way
relational databases do. Since Hive 0.14, UPDATE is available only on
transactional (ACID) tables.
UPDATE employees SET salary = 5000 WHERE id = 1;
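As a rough sketch of what that limited UPDATE support looks like in practice, the table has to be declared transactional before a statement like the one above is accepted. The table name employees_acid is hypothetical, and the example assumes Hive transactions are enabled on the cluster.

```sql
-- Hypothetical transactional version of the employees table.
-- Assumes the cluster is configured for Hive ACID transactions.
CREATE TABLE employees_acid (
  id INT, name STRING, age INT, department STRING, salary DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS   -- bucketing is required for ACID tables before Hive 3
STORED AS ORC                      -- ACID tables use the ORC format
TBLPROPERTIES ('transactional' = 'true');

-- Row-level update; the bucketing column (id) itself cannot be updated.
UPDATE employees_acid SET salary = 5000 WHERE id = 1;
```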
3. Querying Data:
o SELECT with WHERE, GROUP BY, and ORDER BY:
SELECT department, AVG(age) AS avg_age
FROM employees
WHERE age > 30
GROUP BY department
ORDER BY avg_age DESC;
4. Partitioning:
o Hive supports partitioning, which helps manage large datasets by dividing the
data into smaller, manageable parts. A partition is created by splitting data into
subdirectories within HDFS based on a partition key (e.g., date, region).
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (year INT, month INT);
o Partitioning speeds up queries that filter on the partition key, because only the
relevant partitions are scanned.
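To make the effect of partitioning concrete, the following sketch writes into one partition of the sales table defined above and then queries it with a filter on the partition columns. The inserted values are illustrative only.

```sql
-- Write one row into a specific partition (static partitioning).
INSERT INTO TABLE sales PARTITION (year = 2024, month = 1)
VALUES (1, 250.0);

-- Because the filter uses the partition columns, Hive only needs to read
-- the year=2024/month=1 subdirectory rather than the whole table.
SELECT SUM(amount)
FROM sales
WHERE year = 2024 AND month = 1;

-- List the partitions that currently exist for the table.
SHOW PARTITIONS sales;
```

A query that filters only on a non-partition column, such as amount, would still have to scan every partition.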
5. Bucketing:
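o Bucketing divides the data into a fixed number of "buckets" (files) based on a hash
of the bucketing column. This is different from partitioning, which creates one
subdirectory per key value: bucketing spreads rows across a fixed number of files,
and the spread is roughly even when the column's values are well distributed.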
CREATE TABLE employees (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
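One place where bucketing pays off is sampling. As a minimal sketch, assuming the bucketed employees table above has been populated through INSERT statements, a single bucket can be read on its own:

```sql
-- On Hive versions before 2.0, inserts only respect the bucket layout
-- when this setting is enabled; later versions always enforce it.
SET hive.enforce.bucketing = true;

-- Read only the first of the four buckets, i.e. roughly a quarter
-- of the rows if ids are spread evenly.
SELECT *
FROM employees TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```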
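6. Aggregations:
o Hive supports aggregate functions such as COUNT, SUM, AVG, MIN, and MAX, usually
combined with GROUP BY.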
SELECT department, COUNT(*)
FROM employees
GROUP BY department;
7. Joins:
o Hive supports different types of joins (like INNER JOIN, LEFT JOIN, RIGHT JOIN,
and FULL OUTER JOIN) to combine data from multiple tables.
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
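The join above is an inner join, so employees without a matching department are dropped. For comparison, here is a sketch of the same query as a LEFT JOIN, using the same assumed departments table and department_id columns:

```sql
-- LEFT JOIN keeps every row from employees; when no department matches,
-- d.department_name is returned as NULL instead of the row being dropped.
SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
```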