Unit V
Apache Hive is an open-source data warehouse system built on top of a Hadoop cluster.
It is used for querying and analyzing large datasets stored in Hadoop files, and it processes
structured and semi-structured data in Hadoop.
Initially, users had to write complex Map-Reduce jobs, but with the help of Hive,
you merely need to submit SQL-like queries.
Hive is mainly targeted towards users who are comfortable with SQL.
Hive uses a language called HiveQL (HQL), which is similar to SQL.
Hive automatically translates these SQL-like queries into Map-Reduce jobs.
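As a simplified sketch, the kind of MapReduce job Hive generates for an aggregation such as `SELECT word, COUNT(*) ... GROUP BY word` can be simulated in plain Python. The function names and data here are illustrative, not Hive's actual internals:

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in the input.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle + reduce phase: group pairs by key and sum the counts.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["hive hadoop hive", "hadoop hdfs"]
result = reduce_phase(map_phase(lines))
print(result)  # per-word counts, as the GROUP BY query would return
```

The point is that the user writes only the query; Hive plans and runs the map and reduce stages automatically.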
Hive Architecture
Metastore – It stores metadata for each of the tables, such as their schema and location. Hive
metadata helps the driver track the progress of various data sets distributed over the
cluster. The metastore stores this data in a traditional RDBMS format.
Driver – It acts like a controller that receives the HiveQL statements. The driver starts
the execution of a statement by creating sessions, and it monitors the life cycle and progress of
the execution. The driver stores the necessary metadata generated during the execution of a
HiveQL statement. It also acts as the collection point for the data or query results obtained after the
Reduce operation.
Compiler –
It performs the compilation of the HiveQL query, converting the query into an execution
plan. The plan contains the tasks and the steps that MapReduce needs to perform to produce
the output.
The compiler first converts the query to an Abstract Syntax Tree (AST), checks it
for compatibility and compile-time errors, and then converts the AST into a Directed Acyclic
Graph (DAG).
Executor – Once compilation and optimization are complete, the executor executes the tasks.
The executor also takes care of pipelining the tasks.
CLI, UI, and Thrift Server – CLI (command-line interface) provides a user interface for
an external user to interact with Hive. Thrift server in Hive allows external clients to interact
with Hive over a network, similar to the JDBC or ODBC protocols.
Hive Shell:
Hive Shell is similar to the MySQL shell. It is the command-line interface for Hive, and in it
users can run HQL queries. Like SQL, HiveQL is case-insensitive (except for string
comparisons).
We can run the Hive Shell in two modes which are: Non-Interactive mode and
Interactive mode.
Hive Data-Model:
Table
Partition
Bucket
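In this data model, a partition splits a table by the value of a column (for example, by year), while a bucket further divides rows by hashing a column into a fixed number of files. A minimal sketch of how a row might be routed, with illustrative column names and a simple modulo hash (Hive's actual bucketing hash differs):

```python
NUM_BUCKETS = 4

def route(row):
    """Return (partition, bucket) for a row, mimicking Hive's
    partition-by-value and bucket-by-hash layout."""
    partition = row["year"]                          # partition column
    bucket = hash(row["student_id"]) % NUM_BUCKETS   # bucketing column
    return partition, bucket

row = {"student_id": 101, "year": 2023}
print(route(row))
```

Queries that filter on the partition column can then skip every other partition entirely, which is why partitioning speeds up scans of large tables.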
2. What is Sharding?
Sharding is an important concept that helps a system distribute its data across
multiple resources.
The word “Shard” means “a small part of a whole“. Hence Sharding means dividing a larger
part into smaller parts.
In DBMS, sharding is a type of database partitioning in which a large database is divided, or
partitioned, into smaller parts known as shards. These shards are not only smaller but also
faster, and hence more easily manageable.
Need for Sharding:
Consider a very large database that has not been sharded. For example, take the
database of a college in which all student records (present and past) for the whole college
are maintained in a single database. It would contain a very large number of records, say
100,000.
Now, whenever we need to find a student in this database, up to around 100,000 records
may have to be scanned each time, which is very costly.
Now consider the same college student records divided into smaller data shards based on year.
Each data shard then holds only around 1,000-5,000 student records. Not only does the database
become much more manageable, but the cost of each lookup is also reduced by a large
factor. This is what sharding achieves.
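The year-based split described above can be sketched in Python. The record layout and lookup logic are illustrative assumptions, not a real sharding middleware:

```python
from collections import defaultdict

def build_shards(records):
    """Shard student records by admission year: one shard per year."""
    shards = defaultdict(list)
    for record in records:
        shards[record["year"]].append(record)
    return shards

def find_student(shards, year, name):
    # Only the relevant shard is scanned, not the whole database.
    for record in shards.get(year, []):
        if record["name"] == name:
            return record
    return None

records = [
    {"name": "Asha", "year": 2021},
    {"name": "Ravi", "year": 2022},
    {"name": "Meena", "year": 2022},
]
shards = build_shards(records)
print(find_student(shards, 2022, "Ravi"))
```

A lookup now touches only the records for one year, which is the source of the cost reduction described above.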
Features of Sharding:
Sharding makes the Database smaller
Sharding makes the Database faster
Sharding makes the Database much more easily manageable
Sharding can be a complex operation sometimes
Sharding reduces the transaction cost of the Database
3. What is HBase?
HBase is an open-source, column-oriented, distributed database system in a Hadoop environment.
It is modeled after Google's Bigtable, is written primarily in Java, and is used for real-time
Big Data applications.
HBase is a column-oriented database in which data is stored in tables, and the tables are sorted
by RowId. Each row has a RowId, under which the values of the several column
families present in the table are grouped.
The column families that are present in the schema hold key-value pairs. Looking in
detail, each column family can have multiple columns, and the column values are
stored on disk. Each cell of the table has its own metadata, such as a timestamp and
other information.
In HBase, the following key terms describe the table schema:
HBase: The amount of data that can be stored in this model is very huge, on the order of petabytes.
RDBMS: It is designed for a small number of rows and columns.
Set of tables
Each table has column families and rows
Each table must have an element defined as the primary key
The row key acts as the primary key in HBase
Any access to an HBase table uses this primary key
Each column in HBase denotes an attribute of the corresponding object
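The schema above (row key, column family, column qualifier, timestamped cell) can be modeled with a toy in-memory structure. The table, row keys, and family names here are illustrative, not HBase's actual storage format:

```python
import time

# Toy model of an HBase table: row key -> column family
# -> column qualifier -> (value, timestamp).
table = {}

def put(row_key, family, qualifier, value):
    family_map = table.setdefault(row_key, {}).setdefault(family, {})
    family_map[qualifier] = (value, time.time())  # each cell carries a timestamp

def get(row_key, family, qualifier):
    value, _timestamp = table[row_key][family][qualifier]
    return value

put("student:101", "info", "name", "Asha")
put("student:101", "marks", "unit5", 92)
print(get("student:101", "info", "name"))
```

Every read and write goes through the row key first, which is why the row key serves as the table's primary key.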
HMaster:
HMaster assigns regions to the Region Servers and coordinates the cluster. When a Region
Server receives write and read requests from a client, it routes the request to the
specific region where the actual column family resides.
However, a client can contact HRegion servers directly; HMaster's permission is not
required for the client to communicate with HRegion servers.
HBase Regions:
HRegions are the basic building elements of an HBase cluster: tables are distributed across
them, and each region is composed of column families. A region contains multiple stores, one
for each column family, and each store consists of two main components: a MemStore and HFiles.
ZooKeeper:
ZooKeeper acts as a coordination service for HBase. It maintains the cluster state, keeps
track of which servers are alive and available, and notifies HMaster when a Region Server fails.
HDFS:
HDFS is the Hadoop Distributed File System. As the name implies, it provides a distributed
environment for storage, and it is a file system designed to run on commodity
hardware. It stores each file in multiple blocks, and to maintain fault tolerance the blocks are
replicated across the Hadoop cluster.
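The block-splitting and replication idea can be sketched as follows. The block size, replication factor, and round-robin placement here are simplified assumptions (real HDFS defaults are a 128 MB block size, a replication factor of 3, and a rack-aware placement policy):

```python
BLOCK_SIZE = 4          # bytes per block (real HDFS default is 128 MB)
REPLICATION = 3         # copies kept of each block
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_and_replicate(data):
    """Split a file into fixed-size blocks and assign each block
    to REPLICATION datanodes using a simple round-robin placement."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, block in enumerate(blocks):
        nodes = [DATANODES[(idx + r) % len(DATANODES)] for r in range(REPLICATION)]
        placement[idx] = (block, nodes)
    return placement

placement = split_and_replicate(b"hello hdfs!")
for idx, (block, nodes) in placement.items():
    print(idx, block, nodes)
```

Because each block lives on several datanodes, the loss of any single node still leaves at least two copies of every block, which is the fault tolerance the text describes.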