Hive and Pig
What is Hive
Apache Hive is a data warehouse and ETL (Extract, Transform, Load) tool built on
top of Hadoop. It provides an SQL-like interface to access and process structured
data stored in the Hadoop Distributed File System (HDFS), simplifying the querying
and analysis of big data.
Hive enables reading, writing, and managing large datasets distributed across
storage using SQL-based queries. It is best suited for data warehousing tasks such
as data encapsulation, ad-hoc queries, and large-scale data analysis, rather than
Online Transactional Processing (OLTP).
Hive is designed for scalability, extensibility, performance, and fault tolerance. It
offers a loosely coupled architecture with pluggable input formats, making it a robust
solution for processing large datasets in distributed environments.
Hive was initially developed by Facebook; later, the Apache Software Foundation
took it up and developed it further as an open-source project under the name Apache Hive.
It is used by many companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Features of Hive
● OLAP-Focused:
Designed for Online Analytical Processing (OLAP), Hive is ideal for data
analysis and reporting on large datasets.
● Schema-on-read:
The table schema is applied when data is read rather than enforced when data is
loaded, which makes loading fast and flexible.
Architecture of Hive
1. UI: The user interface through which users submit queries and operations; Hive
supports a command-line interface, JDBC/ODBC clients, and a web UI.
2. METASTORE: Stores the metadata of Hive objects (tables, partitions, schemas) in an
RDBMS; it helps the execution engine and compiler understand the data structure.
3. HQL PROCESS ENGINE: Uses HiveQL; instead of writing MapReduce programs
in Java, we write HiveQL queries, which are translated into MapReduce jobs and processed.
4. EXECUTION ENGINE: Bridges the gap between the HiveQL Process Engine
and MapReduce; it executes the query plans and generates the results.
5. HDFS/HBASE: The underlying storage used to store and process the large datasets.
6. DRIVER: Manages the query lifecycle.
7. COMPILER: Converts HiveQL into an execution plan.
https://www.tutorialspoint.com/hive/hive_introduction.htm
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Hive Query Language
Hive Query Language (HQL or HiveQL) is a SQL-like language designed for querying
and analyzing large datasets stored in Apache Hive, which operates on top of the
Hadoop ecosystem.
● Integration with Hadoop Ecosystem: HiveQL can interact with data in Hadoop
Distributed File System (HDFS) and other storage systems like Amazon S3 or
Apache HBase
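A brief illustrative sketch (the table name, columns, and location below are assumed, not from the notes): an external Hive table can point directly at files that already sit in HDFS, and with the appropriate filesystem connector the LOCATION could be an S3 path instead.
CREATE EXTERNAL TABLE web_logs (
id STRING,
request STRING,
log_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs/';  -- an s3a:// URI would also work when S3 access is configured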
HIVE vs SQL

| | HIVE | SQL |
| --- | --- | --- |
| Size of dataset | Good for large datasets | Not suitable for very large datasets |
| Schema | Schema-on-read | Schema-on-write |
| Application | Data warehousing and analytics (OLAP) | Transactional processing (OLTP) |
SIMILARITIES: Both HiveQL and SQL use a declarative, SQL-style syntax with familiar
constructs such as SELECT, WHERE, GROUP BY, and JOIN.
HQL CODES:
-- 1. Create a table
CREATE TABLE employees (
id INT,
name STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 2. Load data into the table (assuming the file 'employees.csv' exists in HDFS)
LOAD DATA INPATH '/user/hive/warehouse/employees.csv' INTO TABLE employees;
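A simple follow-up query on this table (an illustrative aggregation, not part of the original notes):
-- 3. Average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;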
DATA TYPES:
Hive data types are broadly divided into two categories: primitive and complex.
## 🔹 *1. Primitive Data Types*
These represent *single values* and are directly mapped to columns.
---
### 📌 *Type-2: Date/Time & String Data Types*

| Data Type | Description | Format/Example |
| --- | --- | --- |
| *TIMESTAMP* | Date and time with nanosecond precision | CREATE TABLE demo(time TIMESTAMP); |
| *STRING* | Variable-length string | CREATE TABLE demo(name STRING); |
---
## 🔹 *2. Complex Data Types*
These hold multiple values or nested structures (ARRAY, MAP, STRUCT, UNIONTYPE).
### 📌 *ARRAY*
* A collection of *same type* elements.
* *Syntax*: ARRAY<DATA_TYPE>
✅ *Example*:
CREATE TABLE student_subjects (
name STRING,
subjects ARRAY<STRING>
);
📌 To insert data:
-- Hive's INSERT ... VALUES does not support complex types, so use INSERT ... SELECT with the array() constructor:
INSERT INTO TABLE student_subjects SELECT 'John', array('Math', 'Physics', 'CS');
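Array elements can then be read by zero-based index, for example:
SELECT name, subjects[0] AS first_subject FROM student_subjects;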
---
### 📌 *MAP*
* Collection of *key-value* pairs.
* Keys must be primitive types.
* *Syntax*: MAP<KEY_TYPE, VALUE_TYPE>
✅ *Example*:
CREATE TABLE product_info (
product_id INT,
attributes MAP<STRING, STRING>
);
📌 Sample Value:
{ "color": "red", "size": "M", "brand": "Nike" }
---
### 📌 *STRUCT*
* A *nested structure* with fields of possibly different types.
* *Syntax*: STRUCT<field1: TYPE, field2: TYPE>
✅ *Example*:
CREATE TABLE employee_details (
id INT,
personal_info STRUCT<name:STRING, age:INT, gender:STRING>
);
📌 Access a field:
SELECT personal_info.name FROM employee_details;
---
### 📌 *UNIONTYPE*
* Holds *only one* of the specified types at a time.
* *Syntax*: UNIONTYPE<INT, STRING, FLOAT>
✅ *Example*:
CREATE TABLE mixed_values (
data UNIONTYPE<INT, STRING>
);
📌 Not commonly used directly in simple cases, but supported for serialization
formats like Avro, ORC, etc.
---
Sample rows for a table whose columns use the complex types above (with custom field delimiters):
7058,cse^1,1|2|3,A|50000,3,true
7059,cse^2,1|2,B|40000,good,true
### Suppose the schema is:
---
Pig Latin
Apache Pig is a high-level tool/platform used to analyze, process, and manipulate
large sets of data by representing them as data flows.
Pig is generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Apache Pig. Apache Pig is an abstraction over
MapReduce.
To analyze data using Apache Pig, programmers need to write scripts using
Pig Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Programmers who are not comfortable with Java often struggle to work with
Hadoop, especially when writing MapReduce tasks. Apache Pig is
a boon for such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without
having to type complex codes in Java.
Pig Latin is an SQL-like language, and it is easy to learn Apache Pig if you are
already familiar with SQL.
Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data types like
tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig
script if you are good at SQL.
Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
● Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes
and the data flows are represented as edges.
● Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown.
● Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
● Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are
executed to produce the desired results.
Data Types in Pig
8. Datetime – Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
Complex Types
Pig's complex data types are Tuple, Bag, and Map.
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a
similar way as SQL does.
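For example, rows with a null field can be dropped using the null operators (the relation and field names here are assumed):
clean_data = FILTER raw_data BY name IS NOT NULL;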
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10
and b = 20.

| Operator | Description | Example |
| --- | --- | --- |
| + | Addition – Adds values on either side of the operator | a + b gives 30 |
| − | Subtraction – Subtracts the right-hand operand from the left-hand operand | a − b gives −10 |
| / | Division – Divides the left-hand operand by the right-hand operand | b / a gives 2 |
| ?: | Bincond – Evaluates a Boolean condition; it has three operands: variable x = (expression) ? value1 if true : value2 if false | b = (a == 1) ? 20 : 30; since a != 1, the value is 30 (it would be 20 if a were 1) |
| CASE WHEN … THEN … ELSE … END | Case – The case operator is equivalent to the nested bincond operator | CASE f2 % 2 WHEN 0 THEN … WHEN 1 THEN … ELSE … END |
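As an illustration of bincond inside a script (the file, relation, and field names are assumed):
students = LOAD 'grades.txt' USING PigStorage(',') AS (name:chararray, marks:int);
-- bincond: label each student as pass or fail
result = FOREACH students GENERATE name, (marks >= 40 ? 'pass' : 'fail') AS status;
DUMP result;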
The following table describes the comparison operators of Pig Latin.

| Operator | Description | Example |
| --- | --- | --- |
| == | Equal – Checks if the values of two operands are equal; if yes, the condition becomes true | (a == b) is not true |
| != | Not Equal – Checks if the values of two operands are equal; if the values are not equal, the condition becomes true | (a != b) is true |
| > | Greater than – Checks if the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true | (a > b) is not true |
| < | Less than – Checks if the value of the left operand is less than the value of the right operand; if yes, the condition becomes true | (a < b) is true |
| >= | Greater than or equal to – Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true | (a >= b) is not true |
| <= | Less than or equal to – Checks if the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true | (a <= b) is true |
| matches | Pattern matching – Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side | f1 matches '.*tutorial.*' |
The following table describes the Type construction operators of Pig Latin.

| Operator | Description | Example |
| --- | --- | --- |
| {} | Bag constructor operator – Used to construct a bag | {(Raju, 30), (Mohammad, 45)} |
Other groups of Pig Latin relational operators (illustrated in the sketch below):
● Filtering
● Sorting
● Diagnostic Operators
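A small Pig Latin sketch tying these groups together (the file, relation, and field names are assumed):
emp = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, city:chararray, salary:int);
-- Filtering: keep only one city
chennai_emp = FILTER emp BY city == 'Chennai';
-- Sorting: order by salary, highest first
sorted_emp = ORDER chennai_emp BY salary DESC;
-- Diagnostic operators: inspect the schema and the data
DESCRIBE sorted_emp;
DUMP sorted_emp;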
Apache Pig is primarily used for Extract, Transform, Load (ETL) operations and
ad-hoc data analysis in Hadoop. It simplifies data processing on large datasets by
allowing users to define data transformations using a high-level language, Pig Latin,
without needing to write complex MapReduce code. Pig is also used for various
other tasks, including processing web logs, analyzing social media data, and
powering search platforms.
Here's a more detailed look at its use cases:
1. ETL Operations: Pig is a powerful tool for extracting, transforming, and loading
data into Hadoop. It can handle various data formats and perform operations like
filtering, joining, and sorting data.
2. Data Analysis and Prototyping: Pig is often used by data scientists and
researchers for ad-hoc data analysis and quick prototyping. Its ease of use allows
them to explore large datasets and develop algorithms without complex coding.
3. Web Log Processing: Pig can be used to process large web logs, extract useful
information, and analyze server usage patterns. This can help identify performance
bottlenecks, detect security threats, and improve the overall user experience.
4. Social Media Analysis: Pig is well-suited for processing and analyzing social
media data, such as tweets. It can be used to perform sentiment analysis, identify
trends, and understand user behavior.
5. Search Platform Data Processing: Pig can be used to process data for search
platforms, cleaning and transforming data from various sources to improve search
quality and relevancy.
6. Healthcare Data Processing: Pig can be used to process and analyze healthcare
data, such as patient records and medical images. It can help researchers and
clinicians develop new insights and treatments.
7. Handling Unstructured Data: Pig's ability to handle structured, semi-structured,
and unstructured data makes it a valuable tool for processing a wide range of data
types, including text files, log files, and social media feeds.
The ETL (Extract, Transform, Load) process plays an important role in data
warehousing by ensuring seamless integration and preparation of data for analysis.
ETL Process
The ETL process, which stands for Extract, Transform, and Load, is a critical
methodology used to prepare data for storage, analysis, and reporting in a data
warehouse. It involves three distinct stages that help to streamline raw data from
multiple sources into a clean, structured, and usable form. Here’s a detailed
breakdown of each phase:
ETL
1. Extraction
The Extract phase is the first step in the ETL process, where raw data is collected
from various data sources. These sources can be diverse, ranging from structured
sources like databases (SQL, NoSQL), to semi-structured data like JSON, XML,
or unstructured data such as emails or flat files. The main goal of extraction is to
gather data without altering its format, enabling it to be further processed in the next
stage.
Types of data sources can include:
● Structured: SQL databases, ERPs, CRMs
● Semi-structured: JSON, XML
● Unstructured: Emails, web pages, flat files
2. Transformation
The Transform phase is where the magic happens. Data extracted in the previous
phase is often raw and inconsistent. During transformation, the data is cleaned,
aggregated, and formatted according to business rules. This is a crucial step
because it ensures that the data meets the quality standards required for accurate
analysis.
Common transformations include:
● Data Filtering: Removing irrelevant or incorrect data.
● Data Sorting: Organizing data into a required order for easier analysis.
● Data Aggregating: Summarizing data to provide meaningful insights (e.g.,
averaging sales data).
The transformation stage can also involve more complex operations such as
currency conversions, text normalization, or applying domain-specific rules to ensure
the data aligns with organizational needs.
3. Loading
Once data has been cleaned and transformed, it is ready for the final step: Loading.
This phase involves transferring the transformed data into a data warehouse, data
lake, or another target system for storage. Depending on the use case, there are two
types of loading methods:
● Full Load: All data is loaded into the target system, often used during the initial
population of the warehouse.
● Incremental Load: Only new or updated data is loaded, making this method
more efficient for ongoing data updates.
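In Hive terms, the two loading styles correspond to INSERT OVERWRITE and INSERT INTO; a minimal sketch, with table and column names assumed:
-- Full load: replace the contents of the target table
INSERT OVERWRITE TABLE sales_dw
SELECT * FROM staging_sales;
-- Incremental load: append only records newer than the last load
INSERT INTO TABLE sales_dw
SELECT * FROM staging_sales
WHERE sale_date > '2024-01-01';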
ETL Pipelining
In short, the ETL process involves extracting raw data from various sources,
transforming it into a clean format, and loading it into a target system for analysis.
This is crucial for organizations to consolidate data, improve quality, and enable
actionable insights for decision-making, reporting, and machine learning. ETL forms
the foundation of effective data management and advanced analytics.
Importance of ETL
● Data Integration: ETL combines data from various sources,
including structured and unstructured formats, ensuring seamless integration for
a unified view.
● Data Quality: By transforming raw data, ETL cleanses and standardizes it,
improving data accuracy and consistency for more reliable insights.
● Essential for Data Warehousing: ETL prepares data for storage in data
warehouses, making it accessible for analysis and reporting by aligning it with the
target system’s requirements.
● Enhanced Decision-Making: ETL helps businesses derive actionable insights,
enabling better forecasting, resource allocation, and strategic planning.
● Operational Efficiency: Automating the data pipeline through ETL speeds up
data processing, allowing organizations to make real-time decisions based on the
most current data.
You can run Apache Pig in two modes, namely, Local Mode and MapReduce (HDFS) Mode.
Local Mode
In this mode, all the files are installed and run from your local host and local
file system. There is no need for Hadoop or HDFS. This mode is generally used
for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that
exists in the HDFS.
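The mode is selected with the -x flag when starting Pig (MapReduce mode is the default):
$ pig -x local        # Local mode: uses the local file system
$ pig -x mapreduce    # MapReduce mode: uses HDFS and runs jobs on the cluster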
Apache Pig scripts can be executed in three ways, namely, interactive mode,
batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you can enter the Pig Latin statements and
get the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our
own functions (User Defined Functions) in programming languages such as
Java, and using them in our script.
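For example, interactive and batch execution look like this (the script name is an assumed example):
Interactive mode (Grunt shell):
grunt> students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP students;
Batch mode (script saved as sample_script.pig):
$ pig sample_script.pig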
Introduction to Hive:
Hive Architecture,
Hive Physical Architecture,
Hive Data types,
Hive Query Language,
Introduction to Pig,
Anatomy of Pig,
Pig on Hadoop,
Use Case for Pig,
ETL Processing,
Data types in Pig,
Running Pig,
Execution model of Pig,
Operators,
functions,
Data types of Pig.
Pig on Hadoop
● Pig runs on Hadoop Distributed File System (HDFS).
● Pig scripts are compiled into a sequence of MapReduce jobs which execute
in Hadoop's distributed environment.
● When Pig scripts are run in MapReduce mode, each statement or set of
operations in Pig Latin is converted into a MapReduce job behind the
scenes.
● Supports:
○ Pig Latin, a data flow language that is easy to understand and designed for
data transformations (e.g., joins, filters, aggregations).
○ A Parser that converts the Pig Latin script into a logical plan (an abstract
representation of the data flow).
○ A Local Mode that runs Pig on a single machine using local files, without
Hadoop.
Pig on Hadoop is a highly efficient tool for large-scale data processing, offering
simplicity, scalability, and flexibility. It allows users to focus on data
transformations without the complexities of low-level MapReduce programming. Its
architecture and execution model are designed to process big data efficiently within
the Hadoop ecosystem.
1. Log Processing
Pig is commonly used for processing large web server logs. It helps to analyze logs
by filtering relevant information, identifying patterns, aggregating statistics, and
preparing data for further analysis or reporting. Since Pig abstracts much of the
complexity of MapReduce, it is well-suited for analyzing logs that need to be
processed in parallel on large data sets.
🔹 4. Data Enrichment
Pig enables the enrichment of data by joining datasets from multiple sources. For
example, you can join customer data with product information or enrich web logs
with user demographic data. This is useful in scenarios like customer
segmentation, product recommendation, and personalization.
🔹 5. Real-Time Data Processing
Although Pig is designed for batch processing, it can be integrated with streaming
data tools (like Apache Kafka) to process and analyze real-time data in near
real-time. This is beneficial for applications that require real-time insights, such as
fraud detection or clickstream analysis.
4. Operators in Pig
Pig supports a rich set of relational operators, including LOAD, STORE, FILTER,
FOREACH…GENERATE, GROUP, COGROUP, JOIN, ORDER BY, DISTINCT, LIMIT, UNION, and SPLIT.
🔧 5. Functions in Pig
🏗️ Built-in Functions:
● Math: ABS(), ROUND(), RANDOM()
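For instance, these functions can be applied inside FOREACH…GENERATE (the file, relation, and field names are assumed):
A = LOAD 'numbers.txt' USING PigStorage(',') AS (delta:double, price:double);
B = FOREACH A GENERATE ABS(delta) AS abs_delta, ROUND(price) AS rounded_price, RANDOM() AS r;
DUMP B;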
DATA TYPES
Pig's primitive (scalar) types include:
● int, long – Integers
● float, double – Floating-point numbers
● chararray – String
● bytearray – Byte array (blob)
● boolean – true/false
● datetime – Date-time
Complex Types:
● String
● Array/List
● Map/Dictionary
● Binary or Blob
● Timestamp/Date
● Null
🧾 Summary Table: generic type compared across Hive, Pig, NoSQL (generic), and MongoDB.
QUESTIONS
A1:
The key characteristics of Big Data are often referred to as the 3 Vs: Volume, Velocity, and Variety.
A2:
Big Data can be classified into three main types:
● Structured Data: Data with a fixed schema, organized in rows and columns (e.g., relational database tables).
● Semi-structured Data: Data that doesn't have a strict schema but contains
tags or markers, like XML or JSON.
● Unstructured Data: Data with no predefined structure, such as emails, text documents, images, and videos.
A3:
Traditional data systems are optimized for handling structured data in small to
medium volumes, typically processed by relational databases. Big Data, on the other
hand, is designed to process vast amounts of structured, semi-structured, and
unstructured data. Traditional systems struggle with the scale, complexity, and
speed of Big Data.
A4:
Key challenges with Big Data include:
● Data Processing: Ensuring that data can be processed at scale and speed.
A5:
Technologies for Big Data include:
A6:
Big Data infrastructure generally requires:
● Data Pipelines: For moving data from various sources to storage and
analysis tools.
● Cloud Infrastructure: Like AWS, Google Cloud, and Azure for scalability and
flexibility.
A7:
A Big Data system should exhibit the following properties:
A1:
The core components of Hadoop are:
● Hadoop Distributed File System (HDFS): Distributed storage system for Big
Data.
● MapReduce: Programming model for parallel, distributed processing of large datasets.
● YARN: Resource management and job scheduling.
● Hadoop Common: The supporting libraries and utilities required for other
Hadoop modules.
A2:
The Hadoop ecosystem consists of several components and projects that
complement the core Hadoop framework:
A3:
Some limitations of Hadoop include:
● Latency: Since it is optimized for large-scale batch jobs, Hadoop has high
latency in certain use cases.
A4:
HDFS is a distributed file system designed to store very large files across multiple
machines. It is highly scalable and fault-tolerant. Key features include:
● Block-based storage: Files are split into fixed-size blocks and stored across
the cluster.
A5:
MapReduce is a programming model for processing large datasets in a parallel and
distributed manner. It consists of two phases:
● Map: The Mapper processes input data and generates key-value pairs.
● Reduce: The Reducer processes the intermediate key-value pairs generated
by the Map phase and produces the final output.
A1:
Hive architecture consists of:
● UI/Clients: Submit HiveQL queries (CLI, web UI, JDBC/ODBC).
● Driver: Receives HiveQL queries and interacts with the Metastore and
Compiler.
● Metastore: Stores metadata about tables, partitions, and schemas.
● Compiler: Converts HiveQL into an execution plan.
● Execution Engine: Executes the plan as MapReduce jobs over HDFS/HBase.
A2:
Pig is a high-level platform for processing large datasets using Pig Latin, a simple,
SQL-like scripting language. It simplifies complex data processing tasks. Internally,
a Pig script passes through:
○ Logical Plan: The parser converts the script into a logical plan (a DAG of operators).
○ Physical Plan: Converts the logical plan into a physical execution plan.
○ MapReduce Jobs: The compiler turns the physical plan into MapReduce jobs that run on Hadoop.
A1:
NoSQL (Not Only SQL) refers to a class of databases designed to handle
unstructured, semi-structured, or rapidly changing data. NoSQL databases provide
horizontal scalability and are highly flexible, making them suitable for handling Big
Data workloads. They are crucial for applications that require fast read/write
operations, such as social media platforms or real-time analytics.
A2:
NoSQL databases are categorized into:
● Key-Value Stores
● Document Stores (e.g., MongoDB)
● Column-Family Stores
● Graph Databases
A3:
MongoDB is a popular document-oriented NoSQL database with the following
features:
● Document-oriented storage in flexible, JSON-like documents
● Flexible schema
● Horizontal scalability
● High performance and ease of use
A4:
MongoDB Query Language (MQL) is the language used to query MongoDB
databases. It allows you to perform CRUD (Create, Read, Update, Delete)
operations on MongoDB collections. The syntax is similar to JavaScript and supports
complex queries, including filtering, sorting, and aggregation.
Below are concise, exam-ready definitions for Big Data, Hadoop, Hive, Pig, NoSQL, and MongoDB.
1. Big Data – Definition
Big Data refers to extremely large and complex datasets that traditional data
processing tools cannot handle efficiently. It is characterized by the 3 Vs: Volume, Velocity, and Variety.
Big Data is used in fields like social media, healthcare, finance, and e-commerce to
extract insights and make data-driven decisions.
2. Hadoop – Definition
Apache Hadoop is an open-source framework for storing and processing very large
datasets in a distributed manner across clusters of commodity hardware, using HDFS
for storage and MapReduce for parallel processing.
3. Hive – Definition
Apache Hive is a data warehouse tool built on top of Hadoop that provides an
SQL-like language (HiveQL) for querying and analyzing large datasets stored in HDFS.
It is suited to analytical (OLAP) workloads rather than transactional (OLTP) ones.
4. Pig – Definition
Apache Pig is a high-level data processing platform for analyzing large datasets
using a scripting language called Pig Latin.
It simplifies the process of writing complex MapReduce jobs. Pig is used for ETL
(Extract, Transform, Load) operations and is especially useful for semi-structured
and unstructured data processing.
5. NoSQL – Definition
NoSQL (Not Only SQL) refers to non-relational databases designed to handle
unstructured, semi-structured, and rapidly changing data with horizontal scalability.
The main categories are:
● Key-Value Stores
● Document Stores
● Column-Family Stores
● Graph Databases
They are used in applications requiring high speed and flexibility, like real-time web
apps and big data analytics.
6. MongoDB – Definition
MongoDB is a document-oriented NoSQL database that stores data as flexible,
JSON-like documents. Its key strengths include:
● Scalability
● High performance
● Ease of use
MongoDB is ideal for storing unstructured data and is widely used in modern web
and mobile applications.
The following basic Pig Latin scripts show how to:
● Sort
● Group
● Join
● Project
● Filter
Each step is explained with simple example data so you can adapt it for your practical.
✅ 1. Loading Data
Suppose the student records below are stored in a comma-separated file (say, students.txt):
1,John,80
2,Alice,92
3,Mark,67
4,Emma,85
Script to Load:
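A minimal load statement for the data above (assuming it is saved as students.txt; the field names follow the sample rows):
students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray, marks:int);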
✅ 2. Sorting Data
Goal: Sort students by marks (highest to lowest)
sorted_students = ORDER students BY marks DESC;
DUMP sorted_students;
✅ 3. Grouping Data
Example: Group students based on marks (e.g., to count how many got
same marks)
grouped_by_marks = GROUP students BY marks;
DUMP grouped_by_marks;
✅ 4. Joining Data
Let’s say we have another file details.txt:
1,Computer Science
2,Mathematics
3,Physics
4,Biology
details = LOAD 'details.txt' USING PigStorage(',') AS (id:int, dept:chararray);
Join Both Tables on id:
joined_data = JOIN students BY id, details BY id;
DUMP joined_data;
✅ 5. Projecting Columns
Goal: Select only names and marks
name_marks = FOREACH students GENERATE name, marks;
DUMP name_marks;
✅ 6. Filtering Data
Goal: Get students who scored more than 80
top_students = FILTER students BY marks > 80;
DUMP top_students;
The following Hive commands cover the basic database and table operations needed for a college practical.
✅ 1. Hive Database Commands
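A minimal sketch of the database-level commands (the database name college is an assumed example):
-- Create and switch to a database
CREATE DATABASE IF NOT EXISTS college;
USE college;
-- Drop the database together with all of its tables
DROP DATABASE IF EXISTS college CASCADE;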
📘 Deletes the database and all tables inside it. Use CASCADE to remove all
contents.
✅ 2. Hive Table Commands
🔹 a) CREATE TABLE
CREATE TABLE students (
id INT,
name STRING,
marks FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
📘 Creates a students table with three columns. Data is stored in text format,
separated by commas.
🔹 b) ALTER TABLE
-- Add a new column (the column name 'dept' is an assumed example)
ALTER TABLE students ADD COLUMNS (dept STRING);
-- Rename a column (renames 'name' to 'full_name'; names are assumed examples)
ALTER TABLE students CHANGE name full_name STRING;
🔹 c) Using a custom function (UDF) in a query — the function name my_custom_udf below is only a placeholder:
SELECT my_custom_udf(name)
FROM student_data;
If you don't have custom functions, this part may not be required in your basic
practical.
🔹 d) CREATE INDEX
-- Compact index on the students table (index and column names are assumed; Hive indexes are removed in Hive 3.x)
CREATE INDEX idx_students_id ON TABLE students (id)
AS 'COMPACT'
WITH DEFERRED REBUILD;
✅ Summary Table
Command Action