Hive and Pig
What is Hive
Apache Hive is a data warehouse and ETL (Extract, Transform, Load) tool built on
top of Hadoop. It provides an SQL-like interface to access and process structured
data stored in the Hadoop Distributed File System (HDFS), simplifying the querying
and analysis of big data.
Hive enables reading, writing, and managing large datasets distributed across
storage using SQL-based queries. It is best suited for data warehousing tasks such
as data encapsulation, ad-hoc queries, and large-scale data analysis, rather than
Online Transactional Processing (OLTP).
Hive is designed for scalability, extensibility, performance, and fault tolerance. It
offers a loosely coupled architecture with pluggable input formats, making it a robust
solution for processing large datasets in distributed environments.
Hive was initially developed by Facebook; later, the Apache Software Foundation
took it up and developed it further as an open-source project under the name Apache Hive.
It is used by many companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Features of Hive
● OLAP-Focused:
Designed for Online Analytical Processing (OLAP), Hive is ideal for data
analysis and reporting on large datasets.
● Schema-on-read:
The table schema is applied when data is read rather than enforced when data is
loaded, which makes loading fast and flexible.
Architecture of Hive
1. UI: The user interface through which users submit queries and operations; Hive
supports a command-line interface, JDBC/ODBC clients, and a web UI.
2. METASTORE: Stores the metadata of Hive objects (tables, partitions, schemas) in an
RDBMS; it helps the execution engine and compiler understand the data structure.
3. HQL PROCESS ENGINE: Uses HiveQL; instead of writing MapReduce programs
in Java, we write HiveQL queries, which are translated into MapReduce jobs and processed.
4. EXECUTION ENGINE: Bridges the gap between the HiveQL Process Engine
and MapReduce; it executes the query plans and generates the results.
5. HDFS/HBASE: The underlying storage used to store and process the large datasets.
6. DRIVER: Manages the query lifecycle.
7. COMPILER: Converts HiveQL into an execution plan.
https://www.tutorialspoint.com/hive/hive_introduction.htm
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Hive Query Language
Hive Query Language (HQL or HiveQL) is a SQL-like language designed for querying
and analyzing large datasets stored in Apache Hive, which operates on top of the
Hadoop ecosystem.
● Integration with Hadoop Ecosystem: HiveQL can interact with data in Hadoop
Distributed File System (HDFS) and other storage systems like Amazon S3 or
Apache HBase
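A brief illustrative sketch (the table name, columns, and location below are assumed, not from the notes): an external Hive table can point directly at files that already sit in HDFS, and with the appropriate filesystem connector the LOCATION could be an S3 path instead.
CREATE EXTERNAL TABLE web_logs (
id STRING,
request STRING,
log_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs/';  -- an s3a:// URI would also work when S3 access is configured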
HIVE vs SQL

| | HIVE | SQL |
| --- | --- | --- |
| Size of dataset | Good for large datasets | Not suitable for very large datasets |
| Schema | Schema-on-read | Schema-on-write |
| Application | Data warehousing and analytics (OLAP) | Transactional processing (OLTP) |
SIMILARITIES: Both HiveQL and SQL use a declarative, SQL-style syntax with familiar
constructs such as SELECT, WHERE, GROUP BY, and JOIN.
HQL CODES:
-- 1. Create a table
CREATE TABLE employees (
id INT,
name STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 2. Load data into the table (assuming the file 'employees.csv' exists in HDFS)
LOAD DATA INPATH '/user/hive/warehouse/employees.csv' INTO TABLE employees;
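A simple follow-up query on this table (an illustrative aggregation, not part of the original notes):
-- 3. Average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;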
DATA TYPES:
Hive data types are broadly divided into two categories: primitive and complex.
## 🔹 *1. Primitive Data Types*
These represent *single values* and are directly mapped to columns.
---
### 📌 *Type-2: Date/Time & String Data Types*

| Data Type | Description | Format/Example |
| --- | --- | --- |
| *TIMESTAMP* | Date and time with nanosecond precision | CREATE TABLE demo(time TIMESTAMP); |
| *STRING* | Variable-length string | CREATE TABLE demo(name STRING); |
---
## 🔹 *2. Complex Data Types*
These hold multiple values or nested structures (ARRAY, MAP, STRUCT, UNIONTYPE).
### 📌 *ARRAY*
* A collection of *same type* elements.
* *Syntax*: ARRAY<DATA_TYPE>
✅ *Example*:
CREATE TABLE student_subjects (
name STRING,
subjects ARRAY<STRING>
);
📌 To insert data:
-- Hive's INSERT ... VALUES does not support complex types, so use INSERT ... SELECT with the array() constructor:
INSERT INTO TABLE student_subjects SELECT 'John', array('Math', 'Physics', 'CS');
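Array elements can then be read by zero-based index, for example:
SELECT name, subjects[0] AS first_subject FROM student_subjects;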
---
### 📌 *MAP*
* Collection of *key-value* pairs.
* Keys must be primitive types.
* *Syntax*: MAP<KEY_TYPE, VALUE_TYPE>
✅ *Example*:
CREATE TABLE product_info (
product_id INT,
attributes MAP<STRING, STRING>
);
📌 Sample Value:
{ "color": "red", "size": "M", "brand": "Nike" }
---
### 📌 *STRUCT*
* A *nested structure* with fields of possibly different types.
* *Syntax*: STRUCT<field1: TYPE, field2: TYPE>
✅ *Example*:
CREATE TABLE employee_details (
id INT,
personal_info STRUCT<name:STRING, age:INT, gender:STRING>
);
📌 Access a field:
SELECT personal_info.name FROM employee_details;
---
### 📌 *UNIONTYPE*
* Holds *only one* of the specified types at a time.
* *Syntax*: UNIONTYPE<INT, STRING, FLOAT>
✅ *Example*:
CREATE TABLE mixed_values (
data UNIONTYPE<INT, STRING>
);
📌 Not commonly used directly in simple cases, but supported for serialization
formats like Avro, ORC, etc.
---
Sample rows for a table whose columns use the complex types above (with custom field delimiters):
7058,cse^1,1|2|3,A|50000,3,true
7059,cse^2,1|2,B|40000,good,true
### Suppose the schema is:
---
Pig Latin
Apache Pig is a high-level tool/platform used to analyze, process, and manipulate
large sets of data by representing them as data flows.
Pig is generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Apache Pig. Apache Pig is an abstraction over
MapReduce.
To analyze data using Apache Pig, programmers need to write scripts using
Pig Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Programmers who are not comfortable with Java often struggle to work with
Hadoop, especially when writing MapReduce tasks. Apache Pig is
a boon for such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without
having to type complex codes in Java.
Pig Latin is an SQL-like language, and it is easy to learn Apache Pig if you are
already familiar with SQL.
Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data types like
tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig
script if you are good at SQL.
Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
● Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes
and the data flows are represented as edges.
● Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown.
● Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
● Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are
executed to produce the desired results.
Data Types in Pig
8. Datetime – Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
Complex Types
Pig's complex data types are Tuple, Bag, and Map.
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a
similar way as SQL does.
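For example, rows with a null field can be dropped using the null operators (the relation and field names here are assumed):
clean_data = FILTER raw_data BY name IS NOT NULL;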
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10
and b = 20.

| Operator | Description | Example |
| --- | --- | --- |
| + | Addition – Adds values on either side of the operator | a + b gives 30 |
| − | Subtraction – Subtracts the right-hand operand from the left-hand operand | a − b gives −10 |
| / | Division – Divides the left-hand operand by the right-hand operand | b / a gives 2 |
| ?: | Bincond – Evaluates a Boolean condition; it has three operands: variable x = (expression) ? value1 if true : value2 if false | b = (a == 1) ? 20 : 30; since a != 1, the value is 30 (it would be 20 if a were 1) |
| CASE WHEN … THEN … ELSE … END | Case – The case operator is equivalent to the nested bincond operator | CASE f2 % 2 WHEN 0 THEN … WHEN 1 THEN … ELSE … END |
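As an illustration of bincond inside a script (the file, relation, and field names are assumed):
students = LOAD 'grades.txt' USING PigStorage(',') AS (name:chararray, marks:int);
-- bincond: label each student as pass or fail
result = FOREACH students GENERATE name, (marks >= 40 ? 'pass' : 'fail') AS status;
DUMP result;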
The following table describes the comparison operators of Pig Latin.

| Operator | Description | Example |
| --- | --- | --- |
| == | Equal – Checks if the values of two operands are equal; if yes, the condition becomes true | (a == b) is not true |
| != | Not Equal – Checks if the values of two operands are equal; if the values are not equal, the condition becomes true | (a != b) is true |
| > | Greater than – Checks if the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true | (a > b) is not true |
| < | Less than – Checks if the value of the left operand is less than the value of the right operand; if yes, the condition becomes true | (a < b) is true |
| >= | Greater than or equal to – Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true | (a >= b) is not true |
| <= | Less than or equal to – Checks if the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true | (a <= b) is true |
| matches | Pattern matching – Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side | f1 matches '.*tutorial.*' |
The following table describes the Type construction operators of Pig Latin.

| Operator | Description | Example |
| --- | --- | --- |
| {} | Bag constructor operator – Used to construct a bag | {(Raju, 30), (Mohammad, 45)} |
Other groups of Pig Latin relational operators (illustrated in the sketch below):
● Filtering
● Sorting
● Diagnostic Operators
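A small Pig Latin sketch tying these groups together (the file, relation, and field names are assumed):
emp = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, city:chararray, salary:int);
-- Filtering: keep only one city
chennai_emp = FILTER emp BY city == 'Chennai';
-- Sorting: order by salary, highest first
sorted_emp = ORDER chennai_emp BY salary DESC;
-- Diagnostic operators: inspect the schema and the data
DESCRIBE sorted_emp;
DUMP sorted_emp;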
Apache Pig is primarily used for Extract, Transform, Load (ETL) operations and
ad-hoc data analysis in Hadoop. It simplifies data processing on large datasets by
allowing users to define data transformations using a high-level language, Pig Latin,
without needing to write complex MapReduce code. Pig is also used for various
other tasks, including processing web logs, analyzing social media data, and
powering search platforms.
Here's a more detailed look at its use cases:
1. ETL Operations: Pig is a powerful tool for extracting, transforming, and loading
data into Hadoop. It can handle various data formats and perform operations like
filtering, joining, and sorting data.
2. Data Analysis and Prototyping: Pig is often used by data scientists and
researchers for ad-hoc data analysis and quick prototyping. Its ease of use allows
them to explore large datasets and develop algorithms without complex coding.
3. Web Log Processing: Pig can be used to process large web logs, extract useful
information, and analyze server usage patterns. This can help identify performance
bottlenecks, detect security threats, and improve the overall user experience.
4. Social Media Analysis: Pig is well-suited for processing and analyzing social
media data, such as tweets. It can be used to perform sentiment analysis, identify
trends, and understand user behavior.
5. Search Platform Data Processing: Pig can be used to process data for search
platforms, cleaning and transforming data from various sources to improve search
quality and relevancy.
6. Healthcare Data Processing: Pig can be used to process and analyze healthcare
data, such as patient records and medical images. It can help researchers and
clinicians develop new insights and treatments.
7. Handling Unstructured Data: Pig's ability to handle structured, semi-structured,
and unstructured data makes it a valuable tool for processing a wide range of data
types, including text files, log files, and social media feeds.
The ETL (Extract, Transform, Load) process plays an important role in data
warehousing by ensuring seamless integration and preparation of data for analysis.
ETL Process
The ETL process, which stands for Extract, Transform, and Load, is a critical
methodology used to prepare data for storage, analysis, and reporting in a data
warehouse. It involves three distinct stages that help to streamline raw data from
multiple sources into a clean, structured, and usable form. Here’s a detailed
breakdown of each phase:
ETL
1. Extraction
The Extract phase is the first step in the ETL process, where raw data is collected
from various data sources. These sources can be diverse, ranging from structured
sources like databases (SQL, NoSQL), to semi-structured data like JSON, XML,
or unstructured data such as emails or flat files. The main goal of extraction is to
gather data without altering its format, enabling it to be further processed in the next
stage.
Types of data sources can include:
● Structured: SQL databases, ERPs, CRMs
● Semi-structured: JSON, XML
● Unstructured: Emails, web pages, flat files
2. Transformation
The Transform phase is where the magic happens. Data extracted in the previous
phase is often raw and inconsistent. During transformation, the data is cleaned,
aggregated, and formatted according to business rules. This is a crucial step
because it ensures that the data meets the quality standards required for accurate
analysis.
Common transformations include:
● Data Filtering: Removing irrelevant or incorrect data.
● Data Sorting: Organizing data into a required order for easier analysis.
● Data Aggregating: Summarizing data to provide meaningful insights (e.g.,
averaging sales data).
The transformation stage can also involve more complex operations such as
currency conversions, text normalization, or applying domain-specific rules to ensure
the data aligns with organizational needs.
3. Loading
Once data has been cleaned and transformed, it is ready for the final step: Loading.
This phase involves transferring the transformed data into a data warehouse, data
lake, or another target system for storage. Depending on the use case, there are two
types of loading methods:
● Full Load: All data is loaded into the target system, often used during the initial
population of the warehouse.
● Incremental Load: Only new or updated data is loaded, making this method
more efficient for ongoing data updates.
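In Hive terms, the two loading styles correspond to INSERT OVERWRITE and INSERT INTO; a minimal sketch, with table and column names assumed:
-- Full load: replace the contents of the target table
INSERT OVERWRITE TABLE sales_dw
SELECT * FROM staging_sales;
-- Incremental load: append only records newer than the last load
INSERT INTO TABLE sales_dw
SELECT * FROM staging_sales
WHERE sale_date > '2024-01-01';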
ETL Pipelining
In short, the ETL process involves extracting raw data from various sources,
transforming it into a clean format, and loading it into a target system for analysis.
This is crucial for organizations to consolidate data, improve quality, and enable
actionable insights for decision-making, reporting, and machine learning. ETL forms
the foundation of effective data management and advanced analytics.
Importance of ETL
● Data Integration: ETL combines data from various sources,
including structured and unstructured formats, ensuring seamless integration for
a unified view.
● Data Quality: By transforming raw data, ETL cleanses and standardizes it,
improving data accuracy and consistency for more reliable insights.
● Essential for Data Warehousing: ETL prepares data for storage in data
warehouses, making it accessible for analysis and reporting by aligning it with the
target system’s requirements.
● Enhanced Decision-Making: ETL helps businesses derive actionable insights,
enabling better forecasting, resource allocation, and strategic planning.
● Operational Efficiency: Automating the data pipeline through ETL speeds up
data processing, allowing organizations to make real-time decisions based on the
most current data.
You can run Apache Pig in two modes, namely, Local Mode and MapReduce (HDFS) Mode.
Local Mode
In this mode, all the files are installed and run from your local host and local
file system. There is no need for Hadoop or HDFS. This mode is generally used
for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that
exists in the HDFS.
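The mode is selected with the -x flag when starting Pig (MapReduce mode is the default):
$ pig -x local        # Local mode: uses the local file system
$ pig -x mapreduce    # MapReduce mode: uses HDFS and runs jobs on the cluster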
Apache Pig scripts can be executed in three ways, namely, interactive mode,
batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you can enter the Pig Latin statements and
get the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our
own functions (User Defined Functions) in programming languages such as
Java, and using them in our script.
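For example, interactive and batch execution look like this (the script name is an assumed example):
Interactive mode (Grunt shell):
grunt> students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP students;
Batch mode (script saved as sample_script.pig):
$ pig sample_script.pig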
Introduction to Hive:
Hive Architecture,
Hive Physical Architecture,
Hive Data types,
Hive Query Language,
Introduction to Pig,
Anatomy of Pig,
Pig on Hadoop,
Use Case for Pig,
ETL Processing,
Data types in Pig,
Running Pig,
Execution model of Pig,
Operators,
functions,
Data types of Pig.
Pig on Hadoop
● Pig runs on Hadoop Distributed File System (HDFS).
● Pig scripts are compiled into a sequence of MapReduce jobs which execute
in Hadoop's distributed environment.
● When Pig scripts are run in MapReduce mode, each statement or set of
operations in Pig Latin is converted into a MapReduce job behind the
scenes.
● Supports:
○ Pig Latin, a data flow language that is easy to understand and designed for
data transformations (e.g., joins, filters, aggregations).
○ A Parser that converts the Pig Latin script into a logical plan (an abstract
representation of the data flow).
○ A Local Mode that runs Pig on a single machine using local files, without
Hadoop.
Pig on Hadoop is a highly efficient tool for large-scale data processing, offering
simplicity, scalability, and flexibility. It allows users to focus on data
transformations without the complexities of low-level MapReduce programming. Its
architecture and execution model are designed to process big data efficiently within
the Hadoop ecosystem.
1. Log Processing
Pig is commonly used for processing large web server logs. It helps to analyze logs
by filtering relevant information, identifying patterns, aggregating statistics, and
preparing data for further analysis or reporting. Since Pig abstracts much of the
complexity of MapReduce, it is well-suited for analyzing logs that need to be
processed in parallel on large data sets.
🔹 4. Data Enrichment
Pig enables the enrichment of data by joining datasets from multiple sources. For
example, you can join customer data with product information or enrich web logs
with user demographic data. This is useful in scenarios like customer
segmentation, product recommendation, and personalization.
🔹 5. Real-Time Data Processing
Although Pig is designed for batch processing, it can be integrated with streaming
data tools (like Apache Kafka) to process and analyze real-time data in near
real-time. This is beneficial for applications that require real-time insights, such as
fraud detection or clickstream analysis.
4. Operators in Pig
Pig supports a rich set of relational operators, including LOAD, STORE, FILTER,
FOREACH…GENERATE, GROUP, COGROUP, JOIN, ORDER BY, DISTINCT, LIMIT, UNION, and SPLIT.
🔧 5. Functions in Pig
🏗️ Built-in Functions:
● Math: ABS(), ROUND(), RANDOM()
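For instance, these functions can be applied inside FOREACH…GENERATE (the file, relation, and field names are assumed):
A = LOAD 'numbers.txt' USING PigStorage(',') AS (delta:double, price:double);
B = FOREACH A GENERATE ABS(delta) AS abs_delta, ROUND(price) AS rounded_price, RANDOM() AS r;
DUMP B;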
DATA TYPES
Pig's primitive (scalar) types include:
● int, long – Integers
● float, double – Floating-point numbers
● chararray – String
● bytearray – Byte array (blob)
● boolean – true/false
● datetime – Date-time
Complex Types:
● String
● Array/List
● Map/Dictionary
● Binary or Blob
● Timestamp/Date
● Null
🧾 Summary Table: generic type compared across Hive, Pig, NoSQL (generic), and MongoDB.
QUESTIONS
A1:
The key characteristics of Big Data are often referred to as the 3 Vs: Volume, Velocity, and Variety.
A2:
Big Data can be classified into three main types:
● Structured Data: Data with a fixed schema, organized in rows and columns (e.g., relational database tables).
● Semi-structured Data: Data that doesn't have a strict schema but contains
tags or markers, like XML or JSON.
● Unstructured Data: Data with no predefined structure, such as emails, text documents, images, and videos.
A3:
Traditional data systems are optimized for handling structured data in small to
medium volumes, typically processed by relational databases. Big Data, on the other
hand, is designed to process vast amounts of structured, semi-structured, and
unstructured data. Traditional systems struggle with the scale, complexity, and
speed of Big Data.
A4:
Key challenges with Big Data include:
● Data Processing: Ensuring that data can be processed at scale and speed.
A5:
Technologies for Big Data include:
A6:
Big Data infrastructure generally requires:
● Data Pipelines: For moving data from various sources to storage and
analysis tools.
● Cloud Infrastructure: Like AWS, Google Cloud, and Azure for scalability and
flexibility.
A7:
A Big Data system should exhibit the following properties:
A1:
The core components of Hadoop are:
● Hadoop Distributed File System (HDFS): Distributed storage system for Big
Data.
● MapReduce: Programming model for parallel, distributed processing of large datasets.
● YARN: Resource management and job scheduling.
● Hadoop Common: The supporting libraries and utilities required for other
Hadoop modules.
A2:
The Hadoop ecosystem consists of several components and projects that
complement the core Hadoop framework:
A3:
Some limitations of Hadoop include:
● Latency: Since it is optimized for large-scale batch jobs, Hadoop has high
latency in certain use cases.
A4:
HDFS is a distributed file system designed to store very large files across multiple
machines. It is highly scalable and fault-tolerant. Key features include:
● Block-based storage: Files are split into fixed-size blocks and stored across
the cluster.
A5:
MapReduce is a programming model for processing large datasets in a parallel and
distributed manner. It consists of two phases:
● Map: The Mapper processes input data and generates key-value pairs.
● Reduce: The Reducer processes the intermediate key-value pairs generated
by the Map phase and produces the final output.
A1:
Hive architecture consists of:
● UI/Clients: Submit HiveQL queries (CLI, web UI, JDBC/ODBC).
● Driver: Receives HiveQL queries and interacts with the Metastore and
Compiler.
● Metastore: Stores metadata about tables, partitions, and schemas.
● Compiler: Converts HiveQL into an execution plan.
● Execution Engine: Executes the plan as MapReduce jobs over HDFS/HBase.
A2:
Pig is a high-level platform for processing large datasets using Pig Latin, a simple,
SQL-like scripting language. It simplifies complex data processing tasks. Internally,
a Pig script passes through:
○ Logical Plan: The parser converts the script into a logical plan (a DAG of operators).
○ Physical Plan: Converts the logical plan into a physical execution plan.
○ MapReduce Jobs: The compiler turns the physical plan into MapReduce jobs that run on Hadoop.
A1:
NoSQL (Not Only SQL) refers to a class of databases designed to handle
unstructured, semi-structured, or rapidly changing data. NoSQL databases provide
horizontal scalability and are highly flexible, making them suitable for handling Big
Data workloads. They are crucial for applications that require fast read/write
operations, such as social media platforms or real-time analytics.
A2:
NoSQL databases are categorized into:
● Key-Value Stores
● Document Stores (e.g., MongoDB)
● Column-Family Stores
● Graph Databases
A3:
MongoDB is a popular document-oriented NoSQL database with the following
features:
● Document-oriented storage in flexible, JSON-like documents
● Flexible schema
● Horizontal scalability
● High performance and ease of use
A4:
MongoDB Query Language (MQL) is the language used to query MongoDB
databases. It allows you to perform CRUD (Create, Read, Update, Delete)
operations on MongoDB collections. The syntax is similar to JavaScript and supports
complex queries, including filtering, sorting, and aggregation.
Below are concise, exam-ready definitions for Big Data, Hadoop, Hive, Pig, NoSQL, and MongoDB.
1. Big Data – Definition
Big Data refers to extremely large and complex datasets that traditional data
processing tools cannot handle efficiently. It is characterized by the 3 Vs: Volume, Velocity, and Variety.
Big Data is used in fields like social media, healthcare, finance, and e-commerce to
extract insights and make data-driven decisions.
2. Hadoop – Definition
Apache Hadoop is an open-source framework for storing and processing very large
datasets in a distributed manner across clusters of commodity hardware, using HDFS
for storage and MapReduce for parallel processing.
3. Hive – Definition
Apache Hive is a data warehouse tool built on top of Hadoop that provides an
SQL-like language (HiveQL) for querying and analyzing large datasets stored in HDFS.
It is suited to analytical (OLAP) workloads rather than transactional (OLTP) ones.
4. Pig – Definition
Apache Pig is a high-level data processing platform for analyzing large datasets
using a scripting language called Pig Latin.
It simplifies the process of writing complex MapReduce jobs. Pig is used for ETL
(Extract, Transform, Load) operations and is especially useful for semi-structured
and unstructured data processing.
5. NoSQL – Definition
NoSQL (Not Only SQL) refers to non-relational databases designed to handle
unstructured, semi-structured, and rapidly changing data with horizontal scalability.
The main categories are:
● Key-Value Stores
● Document Stores
● Column-Family Stores
● Graph Databases
They are used in applications requiring high speed and flexibility, like real-time web
apps and big data analytics.
6. MongoDB – Definition
MongoDB is a document-oriented NoSQL database that stores data as flexible,
JSON-like documents. Its key strengths include:
● Scalability
● High performance
● Ease of use
MongoDB is ideal for storing unstructured data and is widely used in modern web
and mobile applications.
The following basic Pig Latin scripts show how to:
● Sort
● Group
● Join
● Project
● Filter
Each step is explained with simple example data so you can adapt it for your practical.
✅ 1. Loading Data
Suppose the student records below are stored in a comma-separated file (say, students.txt):
1,John,80
2,Alice,92
3,Mark,67
4,Emma,85
Script to Load:
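A minimal load statement for the data above (assuming it is saved as students.txt; the field names follow the sample rows):
students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray, marks:int);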
✅ 2. Sorting Data
Goal: Sort students by marks (highest to lowest)
sorted_students = ORDER students BY marks DESC;
DUMP sorted_students;
✅ 3. Grouping Data
Example: Group students based on marks (e.g., to count how many got
same marks)
grouped_by_marks = GROUP students BY marks;
DUMP grouped_by_marks;
✅ 4. Joining Data
Let’s say we have another file details.txt:
1,Computer Science
2,Mathematics
3,Physics
4,Biology
details = LOAD 'details.txt' USING PigStorage(',') AS (id:int, dept:chararray);
Join Both Tables on id:
joined_data = JOIN students BY id, details BY id;
DUMP joined_data;
✅ 5. Projecting Columns
Goal: Select only names and marks
name_marks = FOREACH students GENERATE name, marks;
DUMP name_marks;
✅ 6. Filtering Data
Goal: Get students who scored more than 80
top_students = FILTER students BY marks > 80;
DUMP top_students;
The following Hive commands cover the basic database and table operations needed for a college practical.
✅ 1. Hive Database Commands
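A minimal sketch of the database-level commands (the database name college is an assumed example):
-- Create and switch to a database
CREATE DATABASE IF NOT EXISTS college;
USE college;
-- Drop the database together with all of its tables
DROP DATABASE IF EXISTS college CASCADE;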
📘 Deletes the database and all tables inside it. Use CASCADE to remove all
contents.
✅ 2. Hive Table Commands
🔹 a) CREATE TABLE
CREATE TABLE students (
id INT,
name STRING,
marks FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
📘 Creates a students table with three columns. Data is stored in text format,
separated by commas.
🔹 b) ALTER TABLE
-- Add a new column (the column name 'dept' is an assumed example)
ALTER TABLE students ADD COLUMNS (dept STRING);
-- Rename a column (renames 'name' to 'full_name'; names are assumed examples)
ALTER TABLE students CHANGE name full_name STRING;
🔹 c) Using a custom function (UDF) in a query — the function name my_custom_udf below is only a placeholder:
SELECT my_custom_udf(name)
FROM student_data;
If you don't have custom functions, this part may not be required in your basic
practical.
🔹 d) CREATE INDEX
-- Compact index on the students table (index and column names are assumed; Hive indexes are removed in Hive 3.x)
CREATE INDEX idx_students_id ON TABLE students (id)
AS 'COMPACT'
WITH DEFERRED REBUILD;
✅ Summary Table
Command Action