Chapter 4 HBase Technical Principles
Foreword
1 Huawei Confidential
Objectives
Contents
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
Introduction to HBase
HBase is a column-based distributed storage system that features high
reliability, performance, and scalability.
HBase is suitable for storing data in a big table (the table can store billions of rows
and millions of columns) and allows real-time data access.
HBase uses HDFS (Hadoop Distributed File System) as its file storage system to
provide a distributed database system that supports real-time read and write
operations.
HBase uses ZooKeeper as its coordination service.
Comparison Between HBase and RDB
HBase differs from traditional relational databases in the following aspects:
Data indexing: A relational database can build multiple complex indexes on different columns to improve
data access performance. HBase has only one index, the row key. All data in HBase is accessed either by
row key or by a row key scan, which helps keep the system running efficiently.
Data maintenance: In a relational database, an update replaces the original value in the record with the
latest value, and the original value no longer exists after being overwritten. In HBase, an update
generates a new version and retains the original version.
Scalability: Relational databases are difficult to scale horizontally, and the room for vertical scaling
is limited. By contrast, distributed databases such as HBase and BigTable are designed for flexible
horizontal scaling. Their performance can be scaled by adding or removing hardware in a cluster.
HBase Application Scenarios
Figure: typical HBase scenarios, including user image storage, time series data, meteorological data,
and message/order storage.
Data Model
Put simply, applications store data in HBase as tables.
A table consists of rows and columns. Every column belongs to a column family.
The intersection of a row and a column is called a cell, and cells are versioned.
The content of a cell is an indivisible byte array.
The row key of a table is also a byte array, so anything can be used as a key, whether a
string or a number.
HBase tables are sorted by row key, and keys are compared byte by byte. Every table must have a
primary key (the row key).
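The byte-order sorting described above can be illustrated with a minimal sketch. It uses plain Python bytes rather than HBase itself, and the sample keys are made up for illustration.

```python
# Row keys are byte arrays, and rows are kept sorted by comparing those bytes.
# Python's bytes type compares the same way (unsigned, byte by byte).
row_keys = [b"20200303", b"20200301", b"9", b"10", b"20200302"]

sorted_keys = sorted(row_keys)
for key in sorted_keys:
    print(key)

# Numbers stored as decimal text sort as text, so b"10" comes before b"9".
# A fixed-width encoding restores numeric order.
padded = sorted(str(n).zfill(4).encode() for n in (9, 10))
print(padded)
```

Because the order is purely byte-wise, numeric keys are usually stored in a fixed-width form when numeric ordering matters.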
HBase Table Structure (1)
Column family: Info (columns: name, age, gender)
Row Key     name    age    gender
20200301    Tom     18     male
20200302    Jack    19     male
20200303    Lily    20     female
Cell: The cell has two timestamps, t1 and t2. One timestamp corresponds to one data version.
HBase Table Structure (2)
Table: HBase uses tables to organize data. A table consists of rows and columns, and the columns are
grouped into column families.
Row: Each HBase table consists of multiple rows, and each row is identified by a row key.
Column family: An HBase table is divided into multiple column families, which are the basic access
control units.
Column qualifier: Data in a column family is located by column qualifiers (or columns).
Cell: In an HBase table, a cell is determined by the row, column family, and column qualifier. Data
stored in a cell has no data type and is treated as a byte array (byte[]).
Timestamp: Each cell can store multiple versions of the same data. These versions are indexed by
timestamp.
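The terms above can be tied together in a minimal sketch that models a table as nested maps. This is an in-memory illustration only, not the HBase client API. The put and get helpers, the integer timestamps, and the updated age value are assumptions made for the example.

```python
# table[row_key][(family, qualifier)][timestamp] -> value (bytes)
table = {}

def put(row, family, qualifier, value, ts):
    """Store one version of a cell; existing versions are kept."""
    table.setdefault(row, {}).setdefault((family, qualifier), {})[ts] = value

def get(row, family, qualifier, ts=None):
    """Return the version written at ts, or the newest version if ts is None."""
    versions = table.get(row, {}).get((family, qualifier), {})
    if not versions:
        return None
    return versions.get(max(versions) if ts is None else ts)

put(b"20200301", b"Info", b"name", b"Tom", ts=1)
put(b"20200301", b"Info", b"age", b"18", ts=1)
put(b"20200301", b"Info", b"age", b"19", ts=2)   # an update adds a new version

print(get(b"20200301", b"Info", b"age"))         # newest version
print(get(b"20200301", b"Info", b"age", ts=1))   # older version is still readable
```

Every value is a byte array, so the application decides how to interpret it (here, as text).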
Conceptual View of Data Storage
There is a table named webtable that contains two column families: contents
and anchor. In this example, anchor has two columns (anchor:aa.com and
anchor:bb.com), and contents has only one column (contents:html).
Physical View of Data Storage
Although a table can be viewed conceptually as a collection of sparse rows, physically
it is stored by column family. New columns can be added to a column family without
being declared in advance.
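The difference between the two views can be sketched as follows, assuming a single row of the webtable example. The row key, timestamps, and cell values are made up for illustration, and physical_view is a hypothetical helper, not an HBase function.

```python
# Conceptual view: one sparse row of webtable (row key and values are made up).
conceptual_row = {
    "row": "com.example.www",
    "cells": {
        "contents:html": {6: "<html>v2", 5: "<html>v1"},
        "anchor:aa.com": {9: "AA"},
        "anchor:bb.com": {8: "BB"},
    },
}

def physical_view(row):
    """Split one conceptual row into per-column-family lists of stored cells."""
    stores = {}
    for column, versions in row["cells"].items():
        family = column.split(":", 1)[0]
        for ts, value in versions.items():
            stores.setdefault(family, []).append((row["row"], column, ts, value))
    return stores

stores = physical_view(conceptual_row)
for family, cells in stores.items():
    print(family, cells)
```

Only cells that actually hold a value appear in the physical lists, which is why a sparse table does not waste space on empty columns.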
Row-based Storage
Row-based storage refers to data stored by rows in an underlying file system.
Generally, a fixed amount of space is allocated to each row.
Advantage: Data can be added, modified, or read by row.
Disadvantage: Some unnecessary data is obtained when data in a column is queried.
Column-based Storage
Column-based storage refers to data stored by columns in an underlying file
system.
Advantage: Data can be read or calculated by column.
Disadvantage: When a row is read, multiple I/O operations may be required.
Figure: key-value pairs (K1, V1) through (K8, V8) as laid out in storage.
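The trade-off between the two layouts can be shown with a small sketch that uses in-memory lists as a stand-in for files. The sample rows reuse the values from the table structure slide.

```python
rows = [
    ("20200301", "Tom", 18, "male"),
    ("20200302", "Jack", 19, "male"),
    ("20200303", "Lily", 20, "female"),
]

# Row-based layout: all values of one row are stored next to each other.
row_layout = [value for row in rows for value in row]

# Column-based layout: all values of one column are stored next to each other.
column_layout = [list(column) for column in zip(*rows)]

# Reading one column touches a single contiguous list in the column layout...
ages = column_layout[2]
average_age = sum(ages) / len(ages)

# ...whereas rebuilding one full row needs one access per column.
second_row = tuple(column[1] for column in column_layout)

print(row_layout[:4], ages, average_age, second_row)
```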
HBase Architecture (1)
Figure: HBase architecture, showing HRegionServers that each hold an HLog and multiple HRegions, with
StoreFiles stored as HFiles.
HBase Architecture (2)
The HBase architecture consists of the following functional components:
Library functions (linked into each client)
HMaster
HRegionServer
HBase Architecture (3)
The HMaster server manages and maintains the partition information of HBase
tables, maintains the HRegionServer list, allocates regions, and balances loads.
The HRegionServer stores and maintains its allocated regions and processes read and write
requests from clients.
The client does not directly read data from HMaster. Instead, the client directly reads
data from HRegionServer after obtaining the storage location of the region.
The client does not depend on the HMaster. Instead, the client obtains the region
location through ZooKeeper. Most clients do not even communicate with the HMaster.
This design reduces the load of the HMaster.
HBase Architecture (4)
Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
− MemStore (MemStore for each Store for each Region for the table)
− StoreFile (StoreFiles for each Store for each Region for the table)
~ Block (Blocks within a StoreFile within a Store for each Region for the table)
Table and Region
Initially, an HBase table has only one region. As the data volume increases, the table is split
into multiple regions.
Region splitting is fast because, right after the split, the new regions still read the original storage file.
A region reads its own new file only after the storage file has been asynchronously rewritten into an independent file.
Figure: a table is divided by row key, in lexicographical order, into regions. As a region grows, it is
split into new regions.
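The relationship between tables, regions, and row key ranges can be sketched as below. This is a simplified model, not HBase code: the Table class is hypothetical, and it splits a region after a fixed number of rows, whereas HBase decides based on storage size.

```python
import bisect

class Table:
    """Toy table: each region holds a contiguous, sorted range of row keys."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [b""]   # one region covering all keys at first
        self.regions = [[]]       # regions[i] holds the keys >= start_keys[i]

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        region = self.regions[i]
        if row_key not in region:
            bisect.insort(region, row_key)
        if len(region) > self.max_rows:
            self.split(i)

    def split(self, i):
        region = self.regions[i]
        mid = len(region) // 2
        left, right = region[:mid], region[mid:]
        self.regions[i:i + 1] = [left, right]
        self.start_keys.insert(i + 1, right[0])

table = Table(max_rows_per_region=4)
for n in range(1, 7):
    table.put(f"row{n}".encode())

print(table.start_keys)
print(table.regions)
```

Because regions cover disjoint, sorted key ranges, locating the region for a row key is a binary search over the start keys.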
Region Positioning (1)
Regions are classified into Meta Regions and User Regions.
The Meta Region (the hbase:meta table) records the routing information of each User Region.
To obtain the routing information for reading or writing region data, perform the following steps:
Find the Meta Region address.
Find the User Region address based on the Meta Region.
Figure: the hbase:meta table points to the regions of user table 1 through user table N.
Region Positioning (2)
To speed up access, the hbase:meta table is saved in memory.
Assume that each row (a mapping entry) in the hbase:meta table occupies
about 1 KB in the memory, and the maximum size of each region is 128 MB.
In the two-layer structure, 2^17 (128 MB / 1 KB) regions can be addressed.
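The capacity estimate above can be checked with a few lines of arithmetic. The sizes are the assumptions stated on this slide (1 KB per hbase:meta row, 128 MB per region). The total-data figure is a derived illustration, not a statement from the slide.

```python
region_size = 128 * 1024 * 1024   # 128 MB, maximum size of one region
entry_size = 1024                 # 1 KB per hbase:meta row (one row per user region)

# Number of user regions that a single 128 MB meta region can describe.
addressable_regions = region_size // entry_size
print(addressable_regions, addressable_regions == 2 ** 17)   # 131072 True

# If every user region is also 128 MB, the total addressable data is:
total_bytes = addressable_regions * region_size
print(total_bytes // 2 ** 40, "TB")                          # 16 TB
```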
Client
The client contains the interface for accessing HBase and maintains the location
information of the accessed regions in the cache to accelerate subsequent data access.
The client queries the hbase:meta table first, and determines the location of the region.
After the required region is located, the client directly accesses the corresponding region
(without passing through the HMaster) and initiates a read/write request.
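The client-side caching described above can be sketched as follows. The Client class, the tuple layout of the meta rows, and the server names rs1 and rs2 are all hypothetical. A real client obtains region locations from hbase:meta through the HBase client library.

```python
# Stand-in for hbase:meta rows: (start_key inclusive, end_key exclusive, server).
# An end_key of None means "no upper bound".
META_ROWS = [(b"", b"m", "rs1"), (b"m", None, "rs2")]

class Client:
    def __init__(self, meta_rows):
        self._meta_rows = meta_rows
        self._cache = []          # region locations already looked up
        self.meta_lookups = 0     # how often the meta table had to be read

    @staticmethod
    def _covers(region, row_key):
        start, end, _server = region
        return start <= row_key and (end is None or row_key < end)

    def locate(self, row_key):
        """Return the server for row_key, reading meta only on a cache miss."""
        for region in self._cache:
            if self._covers(region, row_key):
                return region[2]
        self.meta_lookups += 1
        for region in self._meta_rows:
            if self._covers(region, row_key):
                self._cache.append(region)
                return region[2]
        raise KeyError(row_key)

client = Client(META_ROWS)
print(client.locate(b"apple"), client.meta_lookups)    # first lookup reads meta
print(client.locate(b"banana"), client.meta_lookups)   # same region, served from cache
print(client.locate(b"zebra"), client.meta_lookups)    # different region, reads meta again
```

Once the location is cached, later requests for the same region go straight to the HRegionServer, which is why the HMaster is not on the read/write path.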
HMaster HA
ZooKeeper helps elect one HMaster node as the primary management
node of the cluster and ensures that only one HMaster node is
active at any time, preventing the HMaster node from becoming a
single point of failure (SPOF).
HMaster
The HMaster server manages tables and regions by performing the
following operations:
Manages users' operations on tables, such as adding, deleting, modifying, and
querying.
Implements load balancing between different HRegionServers.
Adjusts the distribution of regions after they are split or merged.
Migrates the regions on the faulty HRegionServers.
HRegionServer
HRegionServer is the core module of HBase. It provides the following main
functions:
Maintains the regions allocated.
Responds to users' read and write requests.
Data Read and Write Process
When you write data, the request is routed to the corresponding HRegionServer
for execution.
Your data is first written to the HLog and then to the MemStore.
The commit() call returns to the client only after the operation
has been written to the HLog.
When you read data, the HRegionServer first checks the MemStore cache. If the
data is not found in the MemStore cache, the HRegionServer searches the StoreFiles on
disk.
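The write and read paths described above can be sketched with a toy in-memory model. MiniRegionServer is a hypothetical class: Python lists and dicts stand in for the HLog, MemStore, and StoreFiles, and no real persistence is involved.

```python
class MiniRegionServer:
    def __init__(self):
        self.hlog = []         # append-only log of operations
        self.memstore = {}     # in-memory cache of recent writes
        self.storefiles = []   # immutable files on "disk", oldest first

    def put(self, row, value):
        self.hlog.append((row, value))   # 1. record the operation in the log
        self.memstore[row] = value       # 2. apply it to the MemStore
        return True                      # 3. only now acknowledge the client

    def get(self, row):
        # Check the MemStore first, then the StoreFiles from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):
            if row in storefile:
                return storefile[row]
        return None

server = MiniRegionServer()
server.storefiles.append({"r0": "old"})   # pretend an earlier flush produced this file
server.put("r1", "new")
print(server.get("r1"), server.get("r0"), server.get("missing"))
```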
Cache Refreshing
The system periodically writes the content of the MemStore cache to a
StoreFile on disk, clears the cache, and writes a tag in the HLog.
A new StoreFile is generated each time the cache is flushed. Therefore, each
Store contains multiple StoreFiles.
Each HRegionServer has its own HLog file. Each time the HRegionServer
starts, it checks the HLog file to determine whether any write operation was
performed after the most recent cache flush. If such an update is
detected, the data is written to the MemStore and then to a StoreFile. Finally, the old
HLog file is deleted, and the HRegionServer starts serving requests.
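The flush and startup check can be sketched as follows. This is a simplified model under stated assumptions: RecoverableServer is a hypothetical class, the flush tag is a plain ("flush",) log entry, and the log lives in memory rather than on HDFS.

```python
class RecoverableServer:
    def __init__(self, hlog=None, storefiles=None):
        self.hlog = list(hlog or [])              # survives a crash in this model
        self.storefiles = list(storefiles or [])  # survives a crash in this model
        self.memstore = {}                        # lost on a crash

    def put(self, row, value):
        self.hlog.append(("put", row, value))
        self.memstore[row] = value

    def flush(self):
        """Write the MemStore to a new StoreFile, clear it, and tag the log."""
        if self.memstore:
            self.storefiles.append(dict(self.memstore))
            self.memstore.clear()
        self.hlog.append(("flush",))

    def recover(self):
        """On startup, replay writes logged after the last flush tag."""
        last_flush = max(
            (i for i, entry in enumerate(self.hlog) if entry[0] == "flush"),
            default=-1,
        )
        for _op, row, value in self.hlog[last_flush + 1:]:
            self.memstore[row] = value
        self.flush()       # persist the replayed data
        self.hlog = []     # the old log is no longer needed

before = RecoverableServer()
before.put("r1", "a")
before.flush()
before.put("r2", "b")      # not flushed when the server "crashes"

after = RecoverableServer(hlog=before.hlog, storefiles=before.storefiles)
after.recover()
print(after.storefiles)
```

The write to r2 was never flushed, yet it reappears after recovery because it was logged before it was cached.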
Merging StoreFiles
A new StoreFile is generated each time data is flushed, and a large number of
StoreFiles slows down lookups.
The Store.compact() function is used to combine multiple StoreFiles into one.
Because merging consumes a large amount of resources, the merge operation is
started only when the number of StoreFiles reaches a threshold.
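The threshold-triggered merge can be sketched as below. This is an illustration of the idea, not the Store.compact() implementation: the compact function, its threshold parameter, and the dict-based StoreFiles are assumptions made for the example.

```python
def compact(storefiles, threshold=3):
    """Merge all StoreFiles into one once their count reaches the threshold.

    storefiles is ordered oldest first, so newer values overwrite older ones.
    """
    if len(storefiles) < threshold:
        return storefiles                      # too few files, merging is not worth it
    merged = {}
    for storefile in storefiles:
        merged.update(storefile)
    return [dict(sorted(merged.items()))]      # one file, rows in key order

two_files = [{"a": 1}, {"a": 2, "b": 3}]
three_files = [{"a": 1}, {"b": 2}, {"a": 3}]

print(compact(two_files))     # unchanged, below the threshold
print(compact(three_files))   # merged into a single file
```

After the merge, a read has to check one file instead of three, at the cost of rewriting all the data once.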
Store Implementation
Store is the core of an HRegionServer.
Multiple StoreFiles are merged into one.
When a single StoreFile becomes too large, splitting is triggered. One parent
region is split into two sub-regions.
Figure: 64 MB StoreFiles (StoreFile1, StoreFile3, StoreFile4) are merged into StoreFile5 (256 MB), which
is then split into 128 MB StoreFiles (StoreFile5B, StoreFile6, StoreFile7).
HLog Implementation
In a distributed environment, you need to consider system errors. HBase uses
HLog to ensure system recovery.
The HBase system configures an HLog file for each HRegionServer, which is a
write-ahead log (WAL).
The updated data can be written to the MemStore cache only after the data is
written to logs. In addition, the cached data can be written to the disk only after
the logs corresponding to the data cached in the MemStore are written to the
disk.
Impact of Multiple HFiles
Compaction - 2
Figure: data written with put goes to the MemStore and is flushed to HFiles. Minor compaction and major
compaction then merge HFiles.
OpenScanner
In the OpenScanner process, two different scanners are created to read HFile
and MemStore data.
The scanner corresponding to HFile is StoreFileScanner.
The scanner corresponding to MemStore is MemStoreScanner.
Figure: a region with two column families. ColumnFamily-1 has a MemStore plus HFile-11 and HFile-12.
ColumnFamily-2 has a MemStore plus HFile-21 and HFile-22.
BloomFilter
BloomFilter is used to optimize some random read scenarios, that is, the Get
scenario. It can quickly determine whether a piece of user data exists
in a large data set (most of which cannot be loaded into memory).
A BloomFilter can produce false positives when it reports that a piece
of data exists. However, a result of "the data does not exist" is always
reliable.
BloomFilter data in HBase is stored in HFiles.
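The one-sided error of a Bloom filter can be seen in a minimal sketch. This is a generic Bloom filter written for illustration, not the HBase implementation. The bit-array size, the number of hash functions, and the use of SHA-256 are arbitrary choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))

row_filter = BloomFilter()
row_filter.add(b"20200301")

if row_filter.might_contain(b"20200301"):
    print("row may be in this file, so read it")
if not BloomFilter().might_contain(b"20200301"):
    print("an empty filter rules the row out, so the file can be skipped")
```

A Get can therefore skip every file whose filter answers False, and only occasionally reads a file that turns out not to hold the row.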
Row Key
Row keys are stored in lexicographical (byte) order. Therefore, when designing row keys,
make full use of this sorting so that data that is frequently read together, and
data that is likely to be accessed soon, is stored close together.
For example, if the data most recently written to the HBase table is the most likely
to be accessed, you can use the timestamp as part of the row key. Because
rows are sorted in ascending order, you can use Long.MAX_VALUE -
timestamp as the row key. In this way, newly written data sorts first and can be hit quickly
when it is read.
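The reversed-timestamp technique can be sketched as follows. The key layout (a prefix followed by an 8-byte big-endian value) and the sensor1 prefix are assumptions made for this example. LONG_MAX mirrors Java's Long.MAX_VALUE.

```python
import struct

LONG_MAX = 2 ** 63 - 1   # same value as Java's Long.MAX_VALUE

def reversed_ts_key(prefix, timestamp_ms):
    """Build a row key whose byte order is newest-first within a prefix."""
    # Big-endian fixed-width encoding keeps byte order equal to numeric order.
    return prefix + struct.pack(">q", LONG_MAX - timestamp_ms)

def timestamp_of(row_key):
    return LONG_MAX - struct.unpack(">q", row_key[-8:])[0]

keys = sorted(reversed_ts_key(b"sensor1|", ts) for ts in (1000, 3000, 2000))
newest_first = [timestamp_of(key) for key in keys]
print(newest_first)   # [3000, 2000, 1000]
```

A scan that starts at the prefix therefore returns the most recent entries first, without reading older rows.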
Creating HBase Secondary Index (1)
HBase has only one index, which is on the row key.
There are three methods for accessing rows in an HBase table:
Access through a single row key
Access through a row key range
Full table scan
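The three access methods can be sketched over a sorted in-memory list. The get and scan helpers are hypothetical stand-ins for the corresponding HBase operations, and the sample rows reuse the values from the table structure slide.

```python
import bisect

rows = sorted([(b"20200301", "Tom"), (b"20200302", "Jack"), (b"20200303", "Lily")])
keys = [key for key, _value in rows]

def get(row_key):
    """Single row key access: binary search on the sorted keys."""
    i = bisect.bisect_left(keys, row_key)
    return rows[i][1] if i < len(keys) and keys[i] == row_key else None

def scan(start=None, stop=None):
    """Row key range access (start inclusive, stop exclusive); no bounds = full scan."""
    lo = 0 if start is None else bisect.bisect_left(keys, start)
    hi = len(keys) if stop is None else bisect.bisect_left(keys, stop)
    return rows[lo:hi]

print(get(b"20200302"))
print(scan(b"20200301", b"20200303"))

# Searching by value instead of row key has no index to use: it is a full table scan.
print([key for key, name in scan() if name == "Lily"])
```

The last query shows why a secondary index is useful: any lookup that is not driven by the row key has to read every row.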
Creating HBase Secondary Index (2)
Hindex Secondary Index
Hindex is a Java-based HBase secondary index developed by Huawei and is
compatible with Apache HBase 0.94.8. The current features are as follows:
Multiple table indexes
Multiple column indexes
Index based on some column values
Common HBase Shell Commands
create: creating a table in HBase
list: listing all tables in HBase
put: adding data to the cell at a specified table, row, and column
scan: browsing information about a table
get: obtaining the value of a cell based on the table name, row, column,
timestamp, time range, and version number
enable/disable: enabling or disabling a table
drop: deleting a table
Summary
This course introduces the HBase database. HBase is an open
source implementation of BigTable. Similar to BigTable, HBase supports large
amounts of data and distributed concurrent data processing. It is easy to scale out,
supports dynamic scaling, and runs on inexpensive devices.
Additionally, this course describes the differences between the conceptual view and
physical view of HBase data. HBase is a sparse, multi-dimensional, persistent
mapping table. It is indexed by row key, column key, and
timestamp, and each value is an uninterpreted string.
Quiz
B. Long
C. String
D. Byte[]
Quiz
B. Column Family
C. Column
D. Cell
Recommendations
Thank you.
Bring digital to every person, home, and
organization for a fully connected,
intelligent world.