Big Data Notes
○ For permanent functions, you might need to specify the database name if it's not in
the current context (e.g., SELECT my_database.to_upper(column) FROM ...).
How does Hive manage schema definition for data residing in the Hadoop Distributed File
System (HDFS)?
Hive employs a "schema-on-read" approach for managing schema definition over data in HDFS.
The schema of a Hive table is recorded in the Hive metastore (typically before any data is loaded
or queried), but Hive does not enforce that schema when data is written to HDFS. Instead, it
interprets the data according to the schema you've defined when you query the table.
Key aspects of Hive's schema management:
● Metastore: The central component for schema management is the Hive metastore. It
stores metadata about tables, including their names, columns, data types, partitioning
information, storage properties (like file format and delimiters), and the HDFS location of
the data files.
● Table Definition: When you create a Hive table using the CREATE TABLE statement,
you explicitly define the schema (column names and their data types). You also specify
the storage format and the HDFS directory where the data files for this table are located.
CREATE TABLE my_table (
  id INT,
  name STRING,
  value DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/my_table';
● Schema on Read in Action: When you execute a Hive query (e.g., SELECT * FROM
my_table), Hive uses the schema information from the metastore to interpret the raw
bytes in the HDFS files at the specified location. It deserializes the data according to the
defined row format and data types.
● Flexibility and Potential Issues: The schema-on-read approach offers flexibility. You can
potentially have different files with varying structures in the same HDFS location, and as
long as the schema you define in Hive can interpret a subset of the data, you can query it.
However, this also introduces the risk of schema mismatch. If the actual data in the HDFS
files does not conform to the schema defined in Hive, you might get unexpected results (such as
NULL values for fields that cannot be parsed) or errors during query execution.
● Schema Evolution: Hive allows for schema evolution through ALTER TABLE statements
(e.g., adding columns, changing data types). When the schema is altered, Hive only
updates the metadata in the metastore. Existing data in HDFS remains as it is.
Subsequent queries will use the new schema, and it's the user's responsibility to ensure
compatibility between the new schema and the existing data.
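For example, a minimal sketch of evolving the my_table example above (the created_date column is hypothetical); the statement only updates the metastore metadata and does not rewrite the files in HDFS:
-- Add a new column; existing files in /user/hive/warehouse/my_table are not
-- rewritten, so old rows will typically read as NULL for the new column.
ALTER TABLE my_table ADD COLUMNS (created_date STRING);
The same applies to changing a column's data type with ALTER TABLE ... CHANGE: only the metadata changes, and it is up to you to ensure the existing data can still be interpreted as the new type.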
Compare and Contrast MongoDB and Hive:
● Primary Use Case
  ○ MongoDB: Operational database, real-time data, web applications, mobile backends
  ○ Hive: Data warehousing, batch processing, analytical queries over large datasets
● Data Model
  ○ MongoDB: Document-oriented (JSON-like BSON documents with dynamic schema)
  ○ Hive: Relational-like tables with a defined schema (schema-on-read)
● Schema
  ○ MongoDB: Schema-less or dynamic schema
  ○ Hive: Schema-on-read
● Scalability
  ○ MongoDB: Horizontal scaling (sharding) is a core feature
  ○ Hive: Designed for scalability on Hadoop (distributed processing)
● Query Language
  ○ MongoDB: MongoDB Query Language (JavaScript-based)
  ○ Hive: Hive Query Language (HQL), SQL-like
● Data Structure
  ○ MongoDB: Flexible, nested documents, arrays
  ○ Hive: Flat tables, supports complex types (arrays, maps, structs)
● Transactions
  ○ MongoDB: Supports ACID transactions (with some limitations in early versions and across shards)
  ○ Hive: Limited transactional capabilities (ACID compliance is not a primary goal)
● Performance
  ○ MongoDB: High performance for reads and writes on indexed fields, suitable for high-throughput operations
  ○ Hive: Optimized for batch processing and large-scale data analysis; query latency can be higher
● Data Updates
  ○ MongoDB: Supports in-place updates of documents
  ○ Hive: Primarily designed for read-heavy workloads; updates are less efficient and often involve rewriting data
● Real-time vs. Batch
  ○ MongoDB: Primarily real-time operations
  ○ Hive: Primarily batch-oriented processing
● Data Suitability
  ○ MongoDB: Unstructured or semi-structured data, evolving data models, hierarchical data
  ○ Hive: Large volumes of structured, semi-structured, or unstructured data that can be given a schema for analysis
Trade-offs Between Schema-on-Write and Schema-on-Read:
(Schema-on-write is typical of an RDBMS; schema-on-read is characteristic of Hive.)
● Data Validation
  ○ Schema-on-Write: Enforced at write time
  ○ Schema-on-Read: Enforced at read time
● Data Quality
  ○ Schema-on-Write: Generally higher due to upfront validation
  ○ Schema-on-Read: Can be lower if data doesn't conform to the defined schema
● Schema Changes
  ○ Schema-on-Write: Can be complex and time-consuming (schema migrations)
  ○ Schema-on-Read: More flexible; schema changes involve updating metadata
● Data Flexibility
  ○ Schema-on-Write: Less flexible; all data must conform to the schema
  ○ Schema-on-Read: More flexible; can handle diverse data formats to some extent
● Query Performance
  ○ Schema-on-Write: Can be optimized based on a well-defined schema
  ○ Schema-on-Read: Depends on how well the schema aligns with the actual data and query patterns
● Storage Overhead
  ○ Schema-on-Write: Can have overhead due to strict schema requirements (e.g., null values for missing fields)
  ○ Schema-on-Read: Lower overhead, as data is stored as is
● Development Speed
  ○ Schema-on-Write: Can be slower initially due to upfront schema design
  ○ Schema-on-Read: Can be faster initially, as you can start working with data without a rigid schema
● Data Transformation
  ○ Schema-on-Write: Often done before writing to the database
  ○ Schema-on-Read: Can be done during the read process (using Hive's transformation capabilities)
How Joins are Performed in a Traditional RDBMS:
In a traditional RDBMS, joins are used to combine rows from two or more tables based on a
related column between them. The database management system (DBMS) uses various
algorithms to perform joins, such as:
● Nested Loop Join: The outer table is iterated row by row, and for each row, the inner
table is scanned to find matching rows based on the join condition. This can be inefficient
for large tables.
● Hash Join: One of the tables (typically the smaller one) is used to build a hash table on
the join key. Then, the other table is scanned, and for each row, the join key is hashed
and looked up in the hash table to find matching rows. This is often more efficient than
nested loop join for larger tables.
● Sort-Merge Join: Both tables are sorted on the join key. Then, the sorted tables are
merged, and matching rows are identified. This can be efficient if the tables are already
sorted or if sorting is relatively inexpensive.
The specific join algorithm used by the RDBMS depends on factors like the size of the tables,
the existence of indexes on the join columns, and the specific database system. The JOIN
clause (e.g., INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN) in SQL specifies how
the rows from the joined tables should be combined based on the join condition (usually
specified in the ON clause).
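As a simple illustration, here is a sketch of an inner join between two hypothetical tables, orders and customers, matched on customer_id (the table and column names are assumptions; the DBMS chooses the physical algorithm, such as a hash join, based on its cost estimates):
-- Combine each order with its matching customer row.
-- INNER JOIN keeps only orders whose customer_id has a match in customers;
-- a LEFT JOIN would keep all orders and return NULLs for the customer
-- columns when there is no match.
SELECT o.order_id,
       o.order_date,
       c.name,
       c.email
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;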
Analogous Approaches or Considerations When Working with Related Data in MongoDB:
MongoDB, being a document database, does not have the same concept of joins as RDBMS at
the storage level. However, there are ways to handle related data:
1. Embedding: Related data can be embedded within a single document as sub-documents
or arrays. This is suitable for one-to-many or one-to-one relationships where the related
data is frequently accessed together with the main document and is not excessively large
or frequently updated independently.
{
  "_id": ObjectId("..."),
  "name": "Product A",
  "category": "Electronics",
  "details": {
    "model": "X100",
    "manufacturer": "ABC Corp"
  },
  "reviews": [
    {"user": "User1", "rating": 5},
    {"user": "User2", "rating": 4}
  ]
}
○ Pros: Faster reads because all related data is in one document, and less need for
complex queries (see the query sketch below).
○ Cons: Increased document size, potential for data redundancy if the same related data
is embedded in multiple top-level documents, and more complex updates if embedded
data needs to be updated independently.
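Because everything lives in a single document, one query returns the product together with its embedded details and reviews. A minimal sketch in the mongo shell, assuming a collection named products that holds documents like the one above:
// One query returns the product along with its embedded details and reviews.
var product = db.products.findOne({ "name": "Product A" });

// Embedded arrays can be filtered on directly, e.g. products with a 5-star review.
db.products.find({ "reviews.rating": 5 });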
2. Referencing (Linking): Related documents can be linked by storing the _id of the related
document in the main document. This is similar to foreign keys in an RDBMS. To retrieve the
related data, you need to perform a separate query (see the sketch after the example documents below).
// Order document
{
  "_id": ObjectId("order1"),
  "customer_id": ObjectId("customer1"),
  "items": [
    {"product_id": ObjectId("prod1"), "quantity": 2},
    {"product_id": ObjectId("prod2"), "quantity": 1}
  ]
}
// Customer document
{
  "_id": ObjectId("customer1"),
  "name": "John Doe",
  "email": "john.doe@example.com"
}
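The separate lookup mentioned above might look like the following sketch in the mongo shell, assuming collections named orders and customers that hold the documents above (the ObjectId values are placeholders copied from the example; real _id values are 24-character hex strings):
// First query: fetch the order.
var order = db.orders.findOne({ _id: ObjectId("order1") });

// Second query: follow the stored reference to load the related customer.
var customer = db.customers.findOne({ _id: order.customer_id });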