
Unit V

Apache Hive is an open source data warehouse system built on top of Hadoop that allows users to query and analyze large datasets stored in Hadoop files using SQL-like queries. It processes structured and semi-structured data in Hadoop. Hive components include a metastore to store metadata, a driver to control query execution, a compiler to convert queries to execution plans, an optimizer to optimize plans, and an executor to run tasks. HBase is a column-oriented distributed database that provides low-latency operations for random reads and writes. It stores large amounts of data in tables across a cluster and provides automatic sharding and failover.


1. What is Apache Hive?

 Apache Hive is an open-source data warehouse system built on top of a Hadoop cluster.
It is used for querying and analyzing large datasets stored in Hadoop files, and it processes
structured and semi-structured data in Hadoop.
 Previously, users had to write complex MapReduce jobs; with Hive, they merely need to
submit SQL-like queries.
 Hive is mainly targeted at users who are comfortable with SQL.
 Hive uses a language called HiveQL (HQL), which is similar to SQL.
 HiveQL automatically translates SQL-like queries into MapReduce jobs.
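The translation Hive performs can be pictured with a small example. The following Python sketch shows how a simple SQL-like GROUP BY / COUNT query corresponds to a map phase and a reduce phase; the input data and function names are invented for illustration, and real Hive generates far more elaborate MapReduce plans.

```python
from collections import defaultdict

# Hypothetical input, standing in for a Hive table of single-word rows.
rows = ["hive", "hbase", "hive", "hadoop", "hive"]

def map_phase(rows):
    # Map: emit a (key, 1) pair per row, as a MapReduce mapper would.
    return [(word, 1) for word in rows]

def reduce_phase(pairs):
    # Reduce: sum the counts per key, as a MapReduce reducer would.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Roughly equivalent HiveQL: SELECT word, COUNT(*) FROM words GROUP BY word;
result = reduce_phase(map_phase(rows))
print(result)  # {'hive': 3, 'hbase': 1, 'hadoop': 1}
```

The point of Hive is that the user writes only the one-line query; the map and reduce functions are generated automatically.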
Hive Architecture

The Apache Hive components are:

Metastore – It stores metadata for each table, such as its schema and location. Hive
metadata helps the driver track the progress of the various datasets distributed over the
cluster. The metastore stores this data in a traditional RDBMS format.

Driver – It acts like a controller that receives the HiveQL statements. The driver starts
the execution of a statement by creating sessions, and it monitors the life cycle and progress
of the execution. The driver stores the necessary metadata generated during the execution of
a HiveQL statement, and it also acts as a collection point for the data or query results
obtained after the reduce operation.
Compiler – It performs the compilation of the HiveQL query, converting the query to an
execution plan. The plan contains the tasks, as well as the steps MapReduce needs to
perform to produce the output.

The compiler first converts the query to an Abstract Syntax Tree (AST). After checking
for compatibility and compile-time errors, it converts the AST to a Directed Acyclic
Graph (DAG).

Optimizer – It performs various transformations on the execution plan to produce an
optimized DAG. It aggregates transformations together, such as converting a pipeline of
joins into a single join, for better performance.

Executor – Once compilation and optimization are complete, the executor executes the
tasks. The executor takes care of pipelining the tasks.

CLI, UI, and Thrift Server – The CLI (command-line interface) provides a user interface
for an external user to interact with Hive. The Thrift server allows external clients to
interact with Hive over a network, similar to the JDBC or ODBC protocols.

Hive Shell:

Hive Shell is similar to the MySQL shell: it is the command-line interface for Hive, in
which users can run HQL queries. Like SQL, HiveQL is case-insensitive (except for string
comparisons).
The Hive Shell can be run in two modes: non-interactive mode and interactive mode.

Hive Data-Model:

Data in Apache Hive can be categorized into:

 Table
 Partition
 Bucket
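The table → partition → bucket hierarchy above determines where a record lands in storage: a directory per table, a subdirectory per partition value, and a bucket file chosen by hashing the bucketing key. The Python sketch below illustrates this layout; the `/warehouse` path, table name, and hash function are invented for illustration (Hive's actual hashing and file naming differ).

```python
def hash_key(key):
    # Simple deterministic hash for the sketch (Hive's real hash function differs).
    return sum(ord(c) for c in str(key))

def hive_storage_path(table, partition_col, partition_val, key, num_buckets):
    # Table directory -> partition subdirectory -> bucket file chosen by
    # hashing the bucketing key modulo the number of buckets.
    bucket = hash_key(key) % num_buckets
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"

path = hive_storage_path("sales", "year", 2023, "order-42", num_buckets=4)
print(path)  # /warehouse/sales/year=2023/bucket_00003
```

Partitioning prunes whole directories from a query's scan, while bucketing spreads each partition's rows evenly across a fixed number of files.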
2. What is Sharding?

Sharding is an important technique that lets a system distribute its data across different
resources.
The word “shard” means “a small part of a whole”. Hence, sharding means dividing a larger
part into smaller parts.
In a DBMS, sharding is a type of database partitioning in which a large database is divided,
or partitioned, into smaller pieces known as shards. These shards are not only smaller but
also faster, and hence more easily manageable.
Need for Sharding:
Consider a very large database that has not been sharded. For example, take a college
database in which all student records (present and past) for the whole college are
maintained in a single database. It would contain a very large number of records, say
100,000.
Each time we need to find a student in this database, around 100,000 records have to be
scanned, which is very costly.
Now consider the same student records divided into smaller shards based on year, so that
each shard holds only around 1,000–5,000 records. Not only does the database become much
more manageable, but the cost of each lookup is also reduced by a large factor. This is what
sharding achieves.
Features of Sharding:
 Sharding makes the Database smaller
 Sharding makes the Database faster
 Sharding makes the Database much more easily manageable
 Sharding can be a complex operation sometimes
 Sharding reduces the transaction cost of the Database
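The year-based example above can be sketched in a few lines of Python. The student data below is made up for illustration; the point is that a lookup scans only one shard rather than the whole table.

```python
# Shard student records by admission year, so a lookup touches one shard only.
students = [
    {"id": 1, "name": "Asha", "year": 2021},
    {"id": 2, "name": "Ravi", "year": 2022},
    {"id": 3, "name": "Meena", "year": 2021},
    {"id": 4, "name": "Karthik", "year": 2023},
]

def shard_by_year(records):
    # Build one shard (list of records) per distinct year.
    shards = {}
    for rec in records:
        shards.setdefault(rec["year"], []).append(rec)
    return shards

def find_student(shards, year, student_id):
    # Scan only the shard for the given year, not the whole dataset.
    for rec in shards.get(year, []):
        if rec["id"] == student_id:
            return rec
    return None

shards = shard_by_year(students)
print(find_student(shards, 2021, 3))  # {'id': 3, 'name': 'Meena', 'year': 2021}
```

With 100,000 records split into year shards of a few thousand each, the same lookup inspects only a small fraction of the data.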

3. What is HBase?
HBase is an open-source, column-oriented, distributed database system for a Hadoop
environment. It is modeled after Google's Bigtable and is primarily written in Java. Apache
HBase is needed for real-time Big Data applications.

HBase Unique Features

 HBase is built for low-latency operations
 HBase is used extensively for random read and write operations
 HBase stores large amounts of data in tables
 Provides linear and modular scalability over a cluster environment
 Strictly consistent reads and writes
 Automatic and configurable sharding of tables
 Automatic failover support between Region Servers
 Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
 Easy-to-use Java API for client access
 Block cache and Bloom filters for real-time queries
 Query predicate push-down via server-side filters

Storage Mechanism in HBase

 HBase is a column-oriented database in which data is stored in tables, and the tables are
sorted by row key. Each row key maps to the collection of column families present in the
table.
 The column families present in the schema hold key-value pairs. Each column family can
have multiple columns, and the column values are stored on disk. Each cell of the table
has its own metadata, such as a timestamp and other information.

In HBase, the following key terms describe the table schema:

 Table: Collection of rows.
 Row: Collection of column families.
 Column Family: Collection of columns.
 Column: Collection of key-value pairs.
 Namespace: Logical grouping of tables.
 Cell: A {row, column, version} tuple exactly specifies a cell definition in HBase.
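The {row, column, version} addressing of a cell can be modeled with nested dictionaries. This Python sketch is only an illustration of the logical model, not HBase's storage format; the table and column names are invented, and versions are keyed by timestamp as in HBase.

```python
# Logical model: row key -> "family:qualifier" -> {timestamp: value}.
table = {}

def put(table, row, family, qualifier, value, timestamp):
    # Store one versioned cell value under its {row, column, version} address.
    column = f"{family}:{qualifier}"
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get(table, row, family, qualifier):
    # Return the newest version of the cell, as HBase does by default.
    versions = table.get(row, {}).get(f"{family}:{qualifier}", {})
    if not versions:
        return None
    return versions[max(versions)]

put(table, "row1", "info", "name", "Asha", timestamp=100)
put(table, "row1", "info", "name", "Asha K", timestamp=200)
print(get(table, "row1", "info", "name"))  # Asha K
```

Note that both versions are retained; the read simply picks the one with the largest timestamp.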

Column-oriented vs Row-oriented storage

Column-oriented database:
 Used for processing and analytics, such as Online Analytical Processing (OLAP)
applications.
 Designed to store very large amounts of data, on the order of petabytes.

Row-oriented database:
 Used for online transaction processing (OLTP), such as in the banking and finance
domains.
 Designed for a small number of rows and columns.
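The contrast between the two layouts can be sketched with the same records stored both ways. The data below is invented for illustration; the point is that an analytic query over one column touches only that column in a columnar layout.

```python
# The same records, stored row-wise and column-wise (data is illustrative).
records = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 30},
    {"id": 3, "amount": 20},
]

# Row-oriented: all values of one record sit together.
# Good for OLTP, where a whole record is read or written at once.
row_store = [(r["id"], r["amount"]) for r in records]

# Column-oriented: all values of one column sit together.
# Good for OLAP, where a single column is scanned or aggregated.
column_store = {
    "id": [r["id"] for r in records],
    "amount": [r["amount"] for r in records],
}

# An analytic query (total amount) reads just one column here.
print(sum(column_store["amount"]))  # 60
```

Fetching record 2 in full, by contrast, is a single lookup in the row store but requires touching every column list in the column store.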

HBase Data Model


The HBase data model consists of the following elements:

 A set of tables
 Each table with column families and rows
 Each table must have an element defined as the primary key
 The row key acts as the primary key in HBase
 Any access to HBase tables uses this primary key
 Each column present in HBase denotes an attribute of the corresponding object

HMaster:

HMaster is the implementation of the Master server in the HBase architecture. It acts as a
monitoring agent that monitors all Region Server instances present in the cluster, and it
serves as the interface for all metadata changes. In a distributed cluster environment, the
Master runs on the NameNode. The Master runs several background threads.

HBase Regions Servers:

When a Region Server receives read and write requests from a client, it assigns the request
to the specific region where the actual column family resides.

The client can contact HRegion servers directly; HMaster's permission is not required for
the client to communicate with HRegion servers.

HBase Regions:

HRegions are the basic building elements of an HBase cluster. They hold the distribution of
tables and are composed of column families. Each region contains multiple stores, one per
column family, and each store consists of two main components: the MemStore and HFiles.

ZooKeeper:

In HBase, ZooKeeper is a centralized monitoring server that maintains configuration
information and provides distributed synchronization. Distributed synchronization means
giving the distributed applications running across the cluster access to coordination services
between nodes. If a client wants to communicate with regions, it has to approach ZooKeeper
first.

HDFS:-

HDFS is the Hadoop Distributed File System. As the name implies, it provides a distributed
environment for storage and is a file system designed to run on commodity hardware. It
stores each file as multiple blocks and, to maintain fault tolerance, replicates the blocks
across the Hadoop cluster.
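The block-and-replica scheme can be sketched as follows. The block size, node names, and round-robin placement below are invented for illustration (HDFS's default block size is 128 MB and its real placement policy is rack-aware).

```python
def split_into_blocks(file_size, block_size):
    # Split a file into (offset, length) blocks of at most block_size bytes.
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((offset, min(block_size, file_size - offset)))
        offset += block_size
    return blocks

def place_replicas(num_blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes, round-robin.
    # Real HDFS placement is rack-aware; this is a simplification.
    return [
        [nodes[(b + i) % len(nodes)] for i in range(replication)]
        for b in range(num_blocks)
    ]

blocks = split_into_blocks(file_size=300, block_size=128)
print(blocks)  # [(0, 128), (128, 128), (256, 44)]
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

With a replication factor of 3, losing any single node still leaves two copies of every block, which is what gives HDFS its fault tolerance.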
