Big Data Notes
○ For permanent functions, you might need to specify the database name if it's not in
the current context (e.g., SELECT my_database.to_upper(column) FROM ...).
How does Hive manage schema definition for data residing in the Hadoop Distributed File
System (HDFS)?
Hive employs a "schema-on-read" approach for managing schema definition over data in HDFS.
The schema of a Hive table is recorded in the Hive metastore (typically before any data is loaded
or queried), but Hive does not enforce that schema when data is written to HDFS. Instead, it
interprets the data according to the schema you've defined when you query the table.
Key aspects of Hive's schema management:
● Metastore: The central component for schema management is the Hive metastore. It
stores metadata about tables, including their names, columns, data types, partitioning
information, storage properties (like file format and delimiters), and the HDFS location of
the data files.
● Table Definition: When you create a Hive table using the CREATE TABLE statement,
you explicitly define the schema (column names and their data types). You also specify
the storage format and the HDFS directory where the data files for this table are located.
CREATE TABLE my_table (
  id INT,
  name STRING,
  value DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/my_table';
● Schema on Read in Action: When you execute a Hive query (e.g., SELECT * FROM
my_table), Hive uses the schema information from the metastore to interpret the raw
bytes in the HDFS files at the specified location. It deserializes the data according to the
defined row format and data types.
● Flexibility and Potential Issues: The schema-on-read approach offers flexibility. You can
potentially have different files with varying structures in the same HDFS location, and as
long as the schema you define in Hive can interpret a subset of the data, you can query it.
However, this also introduces the risk of schema mismatch. If the actual data in the HDFS
files does not conform to the schema defined in Hive, you might get unexpected results (such as
NULL values for fields that cannot be parsed) or errors during query execution.
● Schema Evolution: Hive allows for schema evolution through ALTER TABLE statements
(e.g., adding columns, changing data types). When the schema is altered, Hive only
updates the metadata in the metastore. Existing data in HDFS remains as it is.
Subsequent queries will use the new schema, and it's the user's responsibility to ensure
compatibility between the new schema and the existing data.
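For example, a minimal sketch of evolving the my_table example above (the created_date column is hypothetical); the statement only updates the metastore metadata and does not rewrite the files in HDFS:
-- Add a new column; existing files in /user/hive/warehouse/my_table are not
-- rewritten, so old rows will typically read as NULL for the new column.
ALTER TABLE my_table ADD COLUMNS (created_date STRING);
The same applies to changing a column's data type with ALTER TABLE ... CHANGE: only the metadata changes, and it is up to you to ensure the existing data can still be interpreted as the new type.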
Compare and Contrast MongoDB and Hive:
● Primary Use Case
  ○ MongoDB: Operational database, real-time data, web applications, mobile backends
  ○ Hive: Data warehousing, batch processing, analytical queries over large datasets
● Data Model
  ○ MongoDB: Document-oriented (JSON-like BSON documents with dynamic schema)
  ○ Hive: Relational-like tables with a defined schema (schema-on-read)
● Schema
  ○ MongoDB: Schema-less or dynamic schema
  ○ Hive: Schema-on-read
● Scalability
  ○ MongoDB: Horizontal scaling (sharding) is a core feature
  ○ Hive: Designed for scalability on Hadoop (distributed processing)
● Query Language
  ○ MongoDB: MongoDB Query Language (JavaScript-based)
  ○ Hive: Hive Query Language (HQL), SQL-like
● Data Structure
  ○ MongoDB: Flexible, nested documents, arrays
  ○ Hive: Flat tables, supports complex types (arrays, maps, structs)
● Transactions
  ○ MongoDB: Supports ACID transactions (with some limitations in early versions and across shards)
  ○ Hive: Limited transactional capabilities (ACID compliance is not a primary goal)
● Performance
  ○ MongoDB: High performance for reads and writes on indexed fields, suitable for high-throughput operations
  ○ Hive: Optimized for batch processing and large-scale data analysis; query latency can be higher
● Data Updates
  ○ MongoDB: Supports in-place updates of documents
  ○ Hive: Primarily designed for read-heavy workloads; updates are less efficient and often involve rewriting data
● Real-time vs. Batch
  ○ MongoDB: Primarily real-time operations
  ○ Hive: Primarily batch-oriented processing
● Data Suitability
  ○ MongoDB: Unstructured or semi-structured data, evolving data models, hierarchical data
  ○ Hive: Large volumes of structured, semi-structured, or unstructured data that can be given a schema for analysis
Trade-offs Between Schema-on-Write and Schema-on-Read:
(Schema-on-write is typical of an RDBMS; schema-on-read is characteristic of Hive.)
● Data Validation
  ○ Schema-on-Write: Enforced at write time
  ○ Schema-on-Read: Enforced at read time
● Data Quality
  ○ Schema-on-Write: Generally higher due to upfront validation
  ○ Schema-on-Read: Can be lower if data doesn't conform to the defined schema
● Schema Changes
  ○ Schema-on-Write: Can be complex and time-consuming (schema migrations)
  ○ Schema-on-Read: More flexible; schema changes involve updating metadata
● Data Flexibility
  ○ Schema-on-Write: Less flexible; all data must conform to the schema
  ○ Schema-on-Read: More flexible; can handle diverse data formats to some extent
● Query Performance
  ○ Schema-on-Write: Can be optimized based on a well-defined schema
  ○ Schema-on-Read: Depends on how well the schema aligns with the actual data and query patterns
● Storage Overhead
  ○ Schema-on-Write: Can have overhead due to strict schema requirements (e.g., null values for missing fields)
  ○ Schema-on-Read: Lower overhead, as data is stored as is
● Development Speed
  ○ Schema-on-Write: Can be slower initially due to upfront schema design
  ○ Schema-on-Read: Can be faster initially, as you can start working with data without a rigid schema
● Data Transformation
  ○ Schema-on-Write: Often done before writing to the database
  ○ Schema-on-Read: Can be done during the read process (using Hive's transformation capabilities)
How Joins are Performed in a Traditional RDBMS:
In a traditional RDBMS, joins are used to combine rows from two or more tables based on a
related column between them. The database management system (DBMS) uses various
algorithms to perform joins, such as:
● Nested Loop Join: The outer table is iterated row by row, and for each row, the inner
table is scanned to find matching rows based on the join condition. This can be inefficient
for large tables.
● Hash Join: One of the tables (typically the smaller one) is used to build a hash table on
the join key. Then, the other table is scanned, and for each row, the join key is hashed
and looked up in the hash table to find matching rows. This is often more efficient than
nested loop join for larger tables.
● Sort-Merge Join: Both tables are sorted on the join key. Then, the sorted tables are
merged, and matching rows are identified. This can be efficient if the tables are already
sorted or if sorting is relatively inexpensive.
The specific join algorithm used by the RDBMS depends on factors like the size of the tables,
the existence of indexes on the join columns, and the specific database system. The JOIN
clause (e.g., INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN) in SQL specifies how
the rows from the joined tables should be combined based on the join condition (usually
specified in the ON clause).
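As a simple illustration, here is a sketch of an inner join between two hypothetical tables, orders and customers, matched on customer_id (the table and column names are assumptions; the DBMS chooses the physical algorithm, such as a hash join, based on its cost estimates):
-- Combine each order with its matching customer row.
-- INNER JOIN keeps only orders whose customer_id has a match in customers;
-- a LEFT JOIN would keep all orders and return NULLs for the customer
-- columns when there is no match.
SELECT o.order_id,
       o.order_date,
       c.name,
       c.email
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;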
Analogous Approaches or Considerations When Working with Related Data in MongoDB:
MongoDB, being a document database, does not have the same concept of joins as RDBMS at
the storage level. However, there are ways to handle related data:
1. Embedding: Related data can be embedded within a single document as sub-documents
or arrays. This is suitable for one-to-many or one-to-one relationships where the related
data is frequently accessed together with the main document and is not excessively large
or frequently updated independently.
{
  "_id": ObjectId("..."),
  "name": "Product A",
  "category": "Electronics",
  "details": {
    "model": "X100",
    "manufacturer": "ABC Corp"
  },
  "reviews": [
    {"user": "User1", "rating": 5},
    {"user": "User2", "rating": 4}
  ]
}
○ Pros: Faster reads because all related data is in one document, and less need for
complex queries (see the query sketch below).
○ Cons: Increased document size, potential for data redundancy if the same related data
is embedded in multiple top-level documents, and more complex updates if embedded
data needs to be updated independently.
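Because everything lives in a single document, one query returns the product together with its embedded details and reviews. A minimal sketch in the mongo shell, assuming a collection named products that holds documents like the one above:
// One query returns the product along with its embedded details and reviews.
var product = db.products.findOne({ "name": "Product A" });

// Embedded arrays can be filtered on directly, e.g. products with a 5-star review.
db.products.find({ "reviews.rating": 5 });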
2. Referencing (Linking): Related documents can be linked by storing the _id of the related
document in the main document. This is similar to foreign keys in an RDBMS. To retrieve the
related data, you need to perform a separate query (see the sketch after the example documents below).
// Order document
{
  "_id": ObjectId("order1"),
  "customer_id": ObjectId("customer1"),
  "items": [
    {"product_id": ObjectId("prod1"), "quantity": 2},
    {"product_id": ObjectId("prod2"), "quantity": 1}
  ]
}
// Customer document
{
  "_id": ObjectId("customer1"),
  "name": "John Doe",
  "email": "john.doe@example.com"
}
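The separate lookup mentioned above might look like the following sketch in the mongo shell, assuming collections named orders and customers that hold the documents above (the ObjectId values are placeholders copied from the example; real _id values are 24-character hex strings):
// First query: fetch the order.
var order = db.orders.findOne({ _id: ObjectId("order1") });

// Second query: follow the stored reference to load the related customer.
var customer = db.customers.findOne({ _id: order.customer_id });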