BDA Unit IV Notes
UNIT 4
Both Apache Pig and Apache Hive are high-level frameworks built on top of Apache Hadoop,
designed to simplify and accelerate the development of data processing applications. They
provide an abstraction layer over the complexities of Hadoop, allowing developers and data
analysts to write queries and transformations using familiar languages (Pig Latin for Pig and
SQL-like HiveQL for Hive) without needing to write complex MapReduce jobs manually.
Here are some common applications of Apache Pig and Apache Hive in big data environments:
1. Data Preprocessing and ETL:
• Pig and Hive are commonly used for preprocessing large volumes of data stored
in various formats (e.g., CSV, JSON, XML) into a format suitable for analysis or
loading into data warehouses.
2. Data Querying and Analysis:
• HiveQL provides SQL-like querying capabilities, making it easy for analysts and
data scientists to write complex queries to analyze and explore large datasets
stored in the Hadoop Distributed File System (HDFS) or Hadoop-compatible file
systems.
• Pig Latin offers a more procedural approach to data processing, allowing users to
define custom data flows and transformations using a scripting language. This
flexibility is useful for scenarios where SQL-like querying is insufficient.
3. Batch Processing:
• Both Pig and Hive are suitable for performing batch processing tasks on large
datasets. They leverage the parallel processing capabilities of Hadoop to distribute
computation across a cluster of machines, enabling efficient processing of large
volumes of data.
4. Data Warehousing:
• Pig can also be used for data warehousing tasks, although it is more commonly
used for ETL and data processing tasks.
5. Ad-hoc Analysis:
• Pig's scripting language allows users to write custom data processing scripts for
ad-hoc analysis tasks. While not as SQL-like as HiveQL, Pig Latin provides
greater flexibility and control over data processing workflows.
6. Workflow Orchestration:
• Pig and Hive can be integrated with workflow orchestration tools like Apache
Oozie or Apache Airflow to create complex data pipelines for ETL, data
processing, and analysis tasks. These tools enable users to define dependencies
between jobs and schedule their execution at regular intervals.
INTRODUCTION TO PIG:
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data
sets by representing them as data flows. Pig is generally used with Hadoop; we can perform
all the data manipulation operations in Hadoop using Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own functions
for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language.
All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component
known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
Features of Pig
• Rich set of operators − It provides many operators to perform operations like join, sort,
filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to
read, process, and write data.
• UDFs − Pig provides the facility to create User Defined Functions in other
programming languages such as Java and invoke or embed them in Pig scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.
Pig Latin
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level
data processing language which provides a rich set of data types and operators to perform
various operations on the data.
To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin
language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded).
After execution, these scripts go through a series of transformations applied by the Pig
framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes
the programmer’s job easy. The Apache Pig framework is made up of several components; let us
take a look at the major ones.
Parser
Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser is a DAG (directed acyclic
graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as nodes and the data flows are
represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, and these MapReduce
jobs are executed on Hadoop, producing the desired results.
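The plans produced by these stages can be inspected from the Grunt shell with the EXPLAIN operator. A minimal sketch, assuming a comma-delimited file named employee.txt exists (the file name and schema are illustrative):
grunt> emp = LOAD 'employee.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:double);
grunt> high_paid = FILTER emp BY salary > 30000.0;
grunt> EXPLAIN high_paid;    -- prints the logical, physical, and MapReduce plans for this alias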
The data model of Pig Latin is fully nested and allows complex non-atomic data types such as
map and tuple. The elements of Pig Latin’s data model are described below.
Atom: An atom is any single value, such as a string or a number — ‘Diego’, for example. Pig’s
atomic values are scalar types that appear in most programming languages — int, long, float,
double, chararray, and bytearray.
Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type —
‘Diego’, ‘Gomez’, or 6, for example. Think of a tuple as a row in a table.
Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible — each tuple
in the collection can contain an arbitrary number of fields, and each field can be of any type.
Map: A map is a collection of key value pairs. Any type can be stored in the value, and the key
needs to be unique. The key of a map must be a chararray and the value can be of any type.
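For illustration, these four types can be written using Pig Latin's constant notation (the values shown are made up):
'Diego'                          -- atom (a chararray value)
('Diego', 'Gomez', 6)            -- tuple with three fields
{('Diego', 6), ('Gomez', 7)}     -- bag containing two tuples
['name'#'Diego', 'age'#6]        -- map with chararray keys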
In a Hadoop context, accessing data means allowing developers to load, store, and stream data,
whereas transforming data means taking advantage of Pig’s ability to group, join, combine,
split, filter, and sort data.
We can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There
is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and using
them in our script.
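For example, assuming the Pig Latin statements have been saved in a file named myscript.pig (the name is illustrative), the same script can be run in either execution mode from the command line:
$ pig -x local myscript.pig          # Local mode: reads and writes the local file system
$ pig -x mapreduce myscript.pig      # MapReduce mode: reads and writes HDFS and launches MapReduce jobs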
HIVE Introduction
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy. The term 'Big
Data' is used for collections of large datasets that include huge volume, high velocity, and a
variety of data that is increasing day by day. Using traditional data management systems, it is
difficult to process Big Data. Therefore, the Apache Software Foundation introduced a
framework called Hadoop to solve Big Data management and processing challenges.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that
are used to help Hadoop modules.
• Sqoop: It is used to import and export data between HDFS and RDBMS.
• Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
• Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
There are various ways to execute MapReduce operations:
• The traditional approach using a Java MapReduce program for structured, semi-structured,
and unstructured data.
• The scripting approach for MapReduce to process structured and semi-structured data
using Pig.
• The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.
What is Hive
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It provides an HTTP interface that allows users to submit HiveQL queries and manage
Hive resources programmatically.
• Hive Beeline: Beeline is a lightweight JDBC client provided by Hive for connecting to
HiveServer2. It allows users to run HiveQL queries and commands from the command
line or scripts.
ARCHITECTURE OF HIVE
The major components (units) of the Hive architecture and the operation of each are described below:
UI – The user interface for users to submit queries and other operations to the system. As of
2011 the system had a command line interface, and a web-based GUI was being developed.
Driver – The component which receives the queries. This component implements the notion of
session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
Compiler – The component that parses the query, does semantic analysis on the different query
blocks and query expressions, and eventually generates an execution plan with the help of the
table and partition metadata looked up from the metastore.
Metastore – The component that stores all the structure information of the various tables and
partitions in the warehouse, including column and column type information, the serializers and
deserializers necessary to read and write data, and the corresponding HDFS files where the data
is stored.
Execution Engine – The component which executes the execution plan created by the compiler.
The plan is a DAG of stages. The execution engine manages the dependencies between these
different stages of the plan and executes these stages on the appropriate system components.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No. – Operation
1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver
(which provides JDBC/ODBC-style execute and fetch interfaces) to execute.
2 Get Plan
The driver takes the help of query compiler that parses the query to check the syntax
and query plan or the requirement of query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of execution job is a MapReduce job. The execution engine
sends the job to JobTracker, which is in Name node and it assigns this job to
TaskTracker, which is in Data node. Here, the query executes MapReduce job.
7.1 Metadata Ops
Meanwhile in execution, the execution engine can execute metadata operations with
Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
HIVE DATA MODEL
Tables – These are analogous to tables in relational databases. Tables can be filtered,
projected, joined, and unioned. Additionally, all the data of a table is stored in a directory in
HDFS. Hive also supports the notion of external tables, wherein a table can be created on
pre-existing files or directories in HDFS by providing the appropriate location to the table
creation DDL. The rows in a table are organized into typed columns, similar to relational
databases.
Partitions – Each table can have one or more partition keys which determine how the data is
stored. For example, a table T with a date partition column ds has files with data for a particular
date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system
to prune the data to be inspected based on query predicates; for example, a query that is interested
in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in
the <table location>/ds=2008-09-01/ directory in HDFS.
Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a
column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows
the system to efficiently evaluate queries that depend on a sample of data (these are queries that
use the SAMPLE clause on the table).
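A short HiveQL sketch of how partitions and buckets are declared and used, reusing the table T and the ds partition column from the description above (the remaining column names and the bucket count are illustrative):
hive> CREATE TABLE T (userid BIGINT, page_url STRING)
      PARTITIONED BY (ds STRING)
      CLUSTERED BY (userid) INTO 32 BUCKETS
      STORED AS ORC;
hive> SELECT * FROM T WHERE T.ds = '2008-09-01';                      -- partition pruning
hive> SELECT * FROM T TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) s;    -- sampling one bucket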
HIVE DATA TYPES
All the data types in Hive are classified into four categories, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
➢ Column Types
Column types are used as column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive has
two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision, in the
java.sql.Timestamp format 'YYYY-MM-DD HH:MM:SS.fffffffff'.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision values. The syntax is DECIMAL(precision, scale),
for example DECIMAL(10,0).
Union Types
Union is a collection of heterogeneous data types. You can create an instance using
create_union. The column is declared with a UNIONTYPE, for example
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>, and selecting such a column
produces output of the following form:
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
➢ Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data
is composed of the DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE
data type. The range of the decimal type is approximately -10^-308 to 10^308.
➢ Null Values
Missing values are represented by the special value NULL.
➢ Complex Types
Arrays
Arrays in Hive are used the same way they are used in Java. Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java maps. Syntax: MAP<primitive_type, data_type>
QUERYING DATA WITH HIVEQL
Listed below are some common examples of querying data in Hive using HiveQL.
1. Selecting Data:
• Use the SELECT statement to retrieve data from a table.
• Example:
SELECT * FROM employees;
2. Filtering Data:
• Use the WHERE clause to filter rows based on a condition.
• Example:
SELECT * FROM employees WHERE department = 'IT';
3. Aggregating Data:
• Use the GROUP BY clause with aggregate functions such as COUNT, SUM, and AVG.
• Example:
SELECT department, COUNT(*) FROM employees GROUP BY department;
4. Sorting Data:
• Use the ORDER BY clause to sort the result set based on one or more columns.
• Example:
SELECT * FROM employees ORDER BY salary DESC;
5. Joining Tables:
• Use join operations to combine data from multiple tables based on common
columns.
• Example:
SELECT e.name, d.department_name FROM employees e JOIN departments d ON
e.department_id = d.department_id;
6. Subqueries:
• Use subqueries to embed one query inside another query.
• Example:
SELECT * FROM employees WHERE department_id IN (SELECT department_id FROM
departments WHERE location = 'New York');
7. Conditional Logic:
• Use CASE statements to apply conditional logic within queries.
• Example:
SELECT name, CASE WHEN salary > 100000 THEN 'High' ELSE 'Low' END AS salary_level
FROM employees;
8. Window Functions:
• Use window functions for advanced analytics tasks such as ranking, aggregation,
and moving averages.
• Example:
SELECT name, salary, AVG(salary) OVER (PARTITION BY department_id) AS avg_salary
FROM employees;
9. Creating Views:
• Create views to store the results of queries as virtual tables for reuse.
• Example:
CREATE VIEW high_salary_employees AS SELECT * FROM employees WHERE salary >
100000;
10. Exporting Data:
• Use the INSERT OVERWRITE or INSERT INTO statements to export query
results to external storage or files.
• Example:
INSERT OVERWRITE DIRECTORY '/user/hive/output' SELECT * FROM employees
WHERE department = 'IT';
These are some common examples of querying data in Apache Hive using HiveQL.
Hive provides extensive support for SQL-like syntax, making it easy to perform
various data manipulation and analysis tasks on large datasets stored in Hadoop.
CREATE DATABASE STATEMENT
Hive is a database technology that can define databases and tables to analyze structured
data. The theme for structured data analysis is to store the data in a tabular manner, and
pass queries to analyze it. This chapter explains how to create a Hive database. Hive
contains a default database named default.
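A database can be created explicitly before tables are placed in it. A minimal sketch, using the same userdb name that appears in the DROP examples below:
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;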
DROP DATABASE STATEMENT
Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as
follows:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT | CASCADE];
The following queries are used to drop a database. Let us assume that the database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means dropping respective tables
before dropping the database.
hive> DROP DATABASE IF EXISTS userdb CASCADE;
The following query drops the database using SCHEMA:
hive> DROP SCHEMA userdb;
CREATE TABLE STATEMENT
Create Table is a statement used to create a table in Hive. The syntax and example are as follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table named employee using the CREATE TABLE statement.
The employee table has the following fields and data types: eid (int), name (String), salary
(String), and destination (String). The table definition can also include a comment and row-format
details such as the field terminator, the line terminator, and the stored file type.
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String,
destination String);
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already
exists. On successful creation of the table, you get to see the following response:
OK
Time taken: 5.905 seconds
hive>
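For completeness, the same statement can also carry the comment and row-format details mentioned above. A sketch, in which the delimiters and file format are illustrative choices:
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;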
LOAD DATA STATEMENT
Generally, after creating a table in SQL, we can insert data using the Insert statement. But in
Hive, we can insert data using the LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are
two ways to load data: one is from the local file system and the second is from the Hadoop file system.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)]
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO
TABLE employee;
On successful load, you get to see the following response:
OK
Time taken: 15.905 seconds
hive>
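Data can also be loaded directly from HDFS by omitting the LOCAL keyword. A sketch, assuming the file has already been copied to an HDFS path such as /user/hive/sample.txt (the path is illustrative):
hive> LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE employee;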
ALTER TABLE STATEMENT
It is used to alter a table in Hive.
Syntax
The statement takes any of the following syntaxes based on what attributes we wish to modify
ina table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Rename To… Statement
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
Change Statement
The following table contains the fields of the employee table and shows the fields to be changed.
Field Name     Convert from Data Type    Change Field Name    Convert to Data Type
eid            int                       eid                  int
name           String                    ename                String
salary         Float                     salary               Double
designation    String                    designation          String
The following queries rename the column name and column data type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
The following query deletes all the columns from the employee table and replaces them with the
empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, name String);
DROP TABLE STATEMENT
When you drop a table from the Hive Metastore, Hive removes the table's metadata. For a normal
(managed) table, the table data stored in the warehouse directory is removed as well; for an
external table, only the metadata is removed and the underlying files are left in place.
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
The following query is used to verify the list of tables:
hive> SHOW TABLES;
OK
emp
Time taken: 2.1 seconds
hive>
Operators in HIVE:
There are four types of operators in Hive:
• Relational Operators
• Arithmetic Operators
• Logical Operators
• Complex Operators
Relational Operators: These operators are used to compare two operands. The relational
operators available in Hive include =, !=, <, <=, >, >=, IS NULL, IS NOT NULL, LIKE, and RLIKE.
The following query is executed to retrieve the employee details using the above table:
hive> SELECT * FROM employee WHERE Id=1205;
On successful execution of query, you get to see the following response:
The following query is executed to retrieve the employee details whose salary is more than or
equal to Rs 40000:
hive> SELECT * FROM employee WHERE Salary>=40000;
Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of them
return number types. The following table describes the arithmetic operators available in Hive:
Operators Operand Description
A+B all number types Gives the result of adding A and B.
A-B all number types Gives the result of subtracting B from A.
A*B all number types Gives the result of multiplying A and B.
A/B all number types Gives the result of dividing A by B.
A%B all number types Gives the remainder resulting from dividing A by B.
A&B all number types Gives the result of bitwise AND of A and B.
A|B all number types Gives the result of bitwise OR of A and B.
A^B all number types Gives the result of bitwise XOR of A and B.
~A all number types Gives the result of bitwise NOT of A.
Example
The following query adds two numbers, 20 and 30.
hive> SELECT 20+30 ADD FROM temp;
On successful execution of the query, you get to see the following response:
+------+
| ADD  |
+------+
|  50  |
+------+
Logical operators
These operators are logical expressions. All of them return either TRUE or FALSE.
Operators Operands Description
A AND B boolean TRUE if both A and B are TRUE, otherwise FALSE.
A && B boolean Same as A AND B.
A OR B boolean TRUE if either A or B or both are TRUE, otherwise
FALSE.
A || B boolean Same as A OR B.
NOT A boolean TRUE if A is FALSE, otherwise FALSE.
!A boolean Same as NOT A.
Example
The following query is used to retrieve employee details whose Department is TP and Salary is
more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';
On successful execution of the query, you get to see the following response:
Complex Operators
These operators provide an expression to access the elements of Complex Types.
Operator   Operand                               Description
A[n]       A is an Array and n is an int         It returns the nth element in the array A. The first element has index 0.
M[key]     M is a Map<K, V> and key has type K   It returns the value corresponding to the key in the map.
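As an illustration, assume a hypothetical table emp_complex with the columns skills ARRAY<STRING> and contact MAP<STRING, STRING>. The complex operators can then be used as follows:
hive> SELECT skills[0], contact['email'] FROM emp_complex;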
HIVEQL - SELECT-WHERE
• The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement
with WHERE clause.
• SELECT statement is used to retrieve the data from a table. WHERE clause works
similar to a condition. It filters the data using the condition and gives you a finite result.
The built-in operators and functions generate an expression, which fulfills the condition.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
HIVEQL - SELECT-ORDER BY
This chapter explains how to use the ORDER BY clause in a SELECT statement. The ORDER
BY clause is used to retrieve the details based on one column and sort the result set by ascending
or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
• Let us take an example for SELECT...ORDER BY clause. Assume the employee table as
given below, with the fields named Id, Name, Salary, Designation, and Dept.
Generate a query to retrieve the employee details in order by using Department name.
The following query retrieves the employee details using the above scenario:
hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
HIVEQL - SELECT-GROUP BY
This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP
BY clause is used to group all the records in a result set using a particular collection column. It
is used to query a group of records.
The syntax of GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
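As an example, reusing the employee table from the earlier sections, the following query counts the employees in each department:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;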
HIVEQL - SELECT-JOINS
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database. It
is more or less similar to SQL JOIN.
Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
We will use two tables in this chapter, named CUSTOMERS and ORDERS.
JOIN
There are different types of joins given as follows:
• JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the same
as INNER JOIN in SQL. A JOIN condition is to be raised using the primary keys and foreign
keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the
records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN
ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
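The other join types listed above follow the same pattern. For example, a LEFT OUTER JOIN returns every row from the left table even when there is no matching row in the right table:
hive> SELECT c.ID, c.NAME, o.AMOUNT FROM CUSTOMERS c
      LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);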
FUNDAMENTALS OF HBASE:
Apache HBase is a distributed, scalable, and column-oriented NoSQL database
built on top of Apache Hadoop. It is designed to provide real-time random
read/write access to large volumes of data, enabling users to store and manage big
data in a fault-tolerant and highly available manner. Here are the fundamentals of
HBase:
1. Data Model:
• HBase organizes data into tables, rows, and columns similar to a traditional
relational database.
• Tables are composed of rows, and each row has a unique row key.
• Rows are further divided into column families, which contain one or more
columns.
• Columns are identified by a column family and a qualifier.
• HBase supports dynamic schema, allowing you to add columns to a table
without predefining them.
2. Schema Design:
• Unlike traditional databases, HBase does not enforce a schema on data.
Instead, it allows flexible schema design to accommodate evolving data
requirements.
• Column families should be defined based on access patterns and data locality
to optimize performance.
• Denormalization is common in HBase schema design: related data is duplicated and
stored together to avoid expensive joins and improve query performance.
3. Architecture:
• HBase follows a master-slave architecture. The HBase Master server
coordinates metadata operations and manages regionservers.
• Regionservers host the actual data and handle read and write requests from
clients.
• Each table is partitioned into regions, and each region is served by a single
regionserver.
• ZooKeeper is used for distributed coordination and leader election among
HBase components.
4. Storage:
• HBase stores data in Hadoop Distributed File System (HDFS), which provides
fault tolerance and high availability.
• Data is stored in sorted order based on row keys, enabling efficient range scans
and point queries.
• HBase employs a Write-Ahead Log (WAL) to ensure data durability in case of
node failures.
• Data is stored in indexed HFiles on disk, and an in-memory MemStore is used
for write buffering before data is flushed to disk.
5. Access Patterns:
• HBase supports random read and write access to individual rows based on row
keys.
• It is optimized for low-latency, real-time access to data, making it suitable for
use cases such as serving web applications, IoT data storage, and real-time
analytics.
• HBase also supports batch operations, scans, and filters to process large
volumes of data efficiently.
6. Consistency and Replication:
• HBase provides strong consistency guarantees within a region, ensuring that
read and write operations see consistent views of the data.
• It supports asynchronous replication to replicate data across multiple clusters
for disaster recovery, backup, and load balancing purposes.
7. Integration with Hadoop Ecosystem:
• HBase integrates seamlessly with other components of the Hadoop ecosystem,
such as Apache Spark, Apache Hive, Apache Pig, and Apache Flume.
• This integration allows users to leverage the power of distributed computing
frameworks for data processing and analytics on HBase data.
Overall, Apache HBase is a powerful NoSQL database that offers scalability, high
availability, and real-time access to large-scale data. Its flexible data model,
distributed architecture, and integration with Hadoop ecosystem make it well-suited
for a wide range of use cases in big data analytics and real-time applications.
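A brief sketch of these concepts using the HBase shell (the table name, column family, and values are illustrative):
hbase> create 'employee', 'personal'                              # table with one column family
hbase> put 'employee', 'row1', 'personal:name', 'Diego'           # write a cell
hbase> put 'employee', 'row1', 'personal:city', 'Madrid'
hbase> get 'employee', 'row1'                                     # random read by row key
hbase> scan 'employee', {STARTROW => 'row1', STOPROW => 'row2'}   # range scan over sorted row keys
hbase> disable 'employee'
hbase> drop 'employee'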
FUNDAMENTALS OF ZOOKEEPER:
Apache ZooKeeper is a centralized service for maintaining configuration
information, providing distributed synchronization, and offering group services. It
acts as a distributed coordination service for distributed applications. Here are the
fundamentals of ZooKeeper:
1. Data Model:
• ZooKeeper maintains a hierarchical namespace similar to a file system, called
znodes.
• Znodes are organized in a tree-like structure, where each node can contain data
and can have children nodes.
• Each znode can have associated data and metadata such as permissions, version
numbers, and timestamps.
• Znodes can be ephemeral or persistent. Ephemeral znodes exist only as long as
the session that created them is active.
2. Coordination:
• ZooKeeper provides coordination primitives such as locks, semaphores,
barriers, and queues to facilitate coordination among distributed processes.
• Processes can create, read, write, and delete znodes to implement coordination
patterns such as leader election, distributed locks, and configuration
management.
3. Consensus:
• ZooKeeper implements the ZAB (ZooKeeper Atomic Broadcast) protocol to
provide strong consistency and fault tolerance.
• ZAB ensures that updates to the ZooKeeper state are linearizable and ordered,
even in the presence of failures.
• ZooKeeper uses a leader-follower model, where one server acts as the leader
and coordinates updates, while other servers act as followers and replicate state
changes.
4. Quorums:
• ZooKeeper uses a replicated state machine model, where multiple ZooKeeper
servers form an ensemble to replicate and synchronize data.
• To achieve fault tolerance, ZooKeeper requires a majority (quorum) of servers
to agree on updates before committing them.
• The quorum is a strict majority of the ensemble, i.e., floor(n/2) + 1 servers, where n is
the total number of servers in the ensemble. For example, an ensemble of 5 servers has
a quorum of 3 and can tolerate the failure of 2 servers.
5. Session Management:
• Clients connect to ZooKeeper servers to create and manage sessions for
interacting with the ZooKeeper service.
• Sessions are associated with a specific client and are used to maintain client-server
communication state.
• ZooKeeper servers monitor client sessions and detect session timeouts to
remove inactive clients and release associated resources.
6. Watches:
• Clients can set watches on znodes to receive notifications when changes occur
to the watched znodes.
• Watches are one-time triggers that fire when specific events, such as node
creation, deletion, or data modification, occur on the watched znode.
• Watches allow clients to react to changes in the ZooKeeper state and
synchronize distributed processes.
7. Use Cases:
• ZooKeeper is used in distributed systems to manage configuration, coordinate
distributed processes, and maintain distributed locks.
• It is widely used in distributed databases, messaging systems, and big data
frameworks to provide coordination and consistency guarantees.
Overall, ZooKeeper plays a crucial role in building reliable and scalable distributed
systems by providing coordination, consistency, and synchronization services. Its
simple and efficient design makes it suitable for a wide range of distributed
applications and use cases.
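A brief sketch of these ideas using the ZooKeeper command-line client zkCli.sh (the paths and data are illustrative; the -w flag requires ZooKeeper 3.5 or later):
create /app_config "version=1"        (persistent znode holding configuration data)
get /app_config                       (read the data)
set /app_config "version=2"           (update the data)
create /workers ""                    (parent znode for group membership)
create -e /workers/worker1 ""         (ephemeral znode, removed when the creating session ends)
ls /workers                           (list children)
get -w /app_config                    (read the data and set a one-time watch)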