Chapter 4 HBase Technical Principles

Foreword

 This course describes HBase, the non-relational distributed database in the
Hadoop open-source community, which can meet the requirements of large-scale
and real-time data processing applications.

1 Huawei Confidential
Objectives

 On completion of this course, you will be familiar with:
 HBase system architecture and related concepts
 HBase key processes and prominent features
 HBase performance tuning
 Basic shell operations of HBase

Contents

1. Introduction to HBase

2. HBase Related Concepts

3. HBase Architecture

4. HBase Key Processes

5. HBase Highlights

6. HBase Performance Tuning

7. Common HBase Shell Commands

Introduction to HBase
 HBase is a column-based distributed storage system that features high
reliability, performance, and scalability.
 HBase is suitable for storing data in a big table (the table can store billions of rows
and millions of columns) and allows real-time data access.
 HBase uses HDFS (Hadoop Distributed File System) as its file storage system to
provide a distributed database system that supports real-time read and write
operations.
 HBase uses ZooKeeper as the collaboration service.

Comparison Between HBase and RDB
 HBase differs from traditional relational databases in the following aspects:
 Data indexing: A relational database can build multiple complex indexes on different columns to improve
data access performance. HBase has only one index, the row key. All data in HBase is accessed either by
row key or by a row key scan, which keeps the system from slowing down.
 Data maintenance: In a relational database, an update replaces the original value in the record with the
latest value, and the original value no longer exists after being overwritten. In HBase, an update
generates a new version and retains the original version.
 Scalability: Relational databases are difficult to scale horizontally, and their room for vertical scaling
is limited. In contrast, distributed databases such as HBase and BigTable are designed for flexible
horizontal scaling: performance can be scaled by adding or removing hardware in the cluster.

HBase Application Scenario

[Figure: HBase scenarios: data storage, user image, time series data, meteorological data,
message/order storage, feed flow, Cube analysis, NewSQL]

Data Model
 Put simply, applications store data in HBase as tables.
 A table consists of rows and columns. Every column belongs to a column family.
 The intersection of a row and a column is called a cell, and cells are versioned.
The content of a cell is an indivisible byte array.
 The row key of a table is also a byte array, so anything can be used as a row key,
whether a string or a number.
 Rows in an HBase table are sorted by row key, in byte order. Every table must have
a row key, which acts as its primary key.
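
Because rows are ordered by comparing row key bytes, numbers stored as text do not sort numerically. The following is a minimal sketch in plain Python (not HBase client code) of that effect, together with one common workaround, fixed-width big-endian encoding. The sample IDs are made up for illustration.

```python
# Row keys are compared byte by byte, so the encoding decides the sort order.
ids = [9, 10, 100, 2]

# Keys stored as decimal strings: b"10" sorts before b"2" because b"1" < b"2".
as_text = sorted(str(i).encode() for i in ids)

# Keys stored as fixed-width big-endian integers keep their numeric order.
as_fixed = sorted(i.to_bytes(8, "big") for i in ids)
decoded = [int.from_bytes(k, "big") for k in as_fixed]

print(as_text)   # [b'10', b'100', b'2', b'9']
print(decoded)   # [2, 9, 10, 100]
```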

HBase Table Structure (1)

Column family: Info (columns: name, age, gender)

Row Key     name    age    gender
20200301    Tom     18     male
20200302    Jack    19     male
20200303    Lily    20     female

Cell: the cell highlighted in the figure has two timestamps, t1 and t2.
One timestamp corresponds to one data version.

HBase Table Structure (2)
 Table: HBase uses tables to organize data. A table consists of rows and columns. Columns are
grouped into several column families.
 Row: Each HBase table consists of multiple rows, and each row is identified by a row key.
 Column family: An HBase table is divided into multiple column families, which are basic access
control units.
 Column qualifier: Data in a column family is located by column qualifiers (or columns).
 Cell: In an HBase table, a cell is determined by the row, column family, and column qualifier. Data
stored in a cell has no data type and is considered as a byte array byte[].
 Timestamp: Each cell can store multiple versions of the same data. These versions are indexed by
timestamp.
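
The concepts above can be pictured as a nested map: row key, then column family and qualifier, then timestamp, then an uninterpreted byte value. The sketch below is a toy in-memory model under that assumption. The class name `Table` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class Table:
    """Toy model: row key -> (family, qualifier) -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # A put with a new timestamp adds a version; older versions remain.
        self.rows[row][(family, qualifier)][ts] = value

    def get(self, row, family, qualifier, versions=1):
        cell = self.rows[row][(family, qualifier)]
        newest_first = sorted(cell.items(), reverse=True)
        return newest_first[:versions]


t = Table(["Info"])
t.put(b"20200301", "Info", "age", b"18", ts=1)
t.put(b"20200301", "Info", "age", b"19", ts=2)

latest = t.get(b"20200301", "Info", "age")
history = t.get(b"20200301", "Info", "age", versions=2)
print(latest)   # [(2, b'19')]
print(history)  # [(2, b'19'), (1, b'18')]
```

Qualifiers are not declared in advance in this model, only families, which mirrors the way new columns can be added to a column family at write time.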

Conceptual View of Data Storage
 There is a table named webtable that contains two column families: contents
and anchor. In this example, anchor has two columns (anchor:aa.com and
anchor:bb.com), and contents has only one column (contents:html).

Row Key          Time Stamp    ColumnFamily contents         ColumnFamily anchor
"com.cnn.www"    t9                                          anchor:aa.com="CNN"
"com.cnn.www"    t8                                          anchor:bb.com="CNN.com"
"com.cnn.www"    t6            contents:html="<html>..."
"com.cnn.www"    t5            contents:html="<html>..."
"com.cnn.www"    t3            contents:html="<html>..."

Physical View of Data Storage
 Although a table can be viewed conceptually as a collection of sparse rows,
it is physically stored by column family. New columns can be added to a
column family without being declared in advance.

Row Key Time Stamp ColumnFamily anchor


"com.cnn.www" t9 anchor:aa.com= "CNN"
"com.cnn.www" t8 anchor:bb.com= "CNN.com"

Row Key Time Stamp ColumnFamily contents


"com.cnn.www" t6 contents:html="<html>..."
"com.cnn.www" t5 contents:html="<html>..."
"com.cnn.www" t3 contents:html="<html>..."
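
The step from the conceptual view to the physical view can be sketched as grouping cells by column family. This is a minimal illustration in plain Python using the webtable rows above; the variable names are assumptions, and real HBase stores each family in its own files rather than in a dictionary.

```python
from collections import defaultdict

# Cells from the conceptual view: (row key, timestamp, column, value).
cells = [
    ("com.cnn.www", "t9", "anchor:aa.com", "CNN"),
    ("com.cnn.www", "t8", "anchor:bb.com", "CNN.com"),
    ("com.cnn.www", "t6", "contents:html", "<html>..."),
    ("com.cnn.www", "t5", "contents:html", "<html>..."),
    ("com.cnn.www", "t3", "contents:html", "<html>..."),
]

# Physical view: one separate store per column family.
# Empty cells of the sparse conceptual table are simply never stored.
stores = defaultdict(list)
for row, ts, column, value in cells:
    family = column.split(":", 1)[0]
    stores[family].append((row, ts, column, value))

for family, kvs in stores.items():
    print(family, len(kvs))
```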

Row-based Storage
 Row-based storage refers to data stored by rows in an underlying file system.
Generally, a fixed amount of space is allocated to each row.
 Advantage: A whole record can be added, modified, or read by row.
 Disadvantage: Querying a single column also reads the unneeded data in the other columns.

ID Name Phone Address

Column-based Storage
 Column-based storage refers to data stored by columns in an underlying file
system.
 Advantage: Data can be read or calculated by column.
 Disadvantage: When a row is read, multiple I/O operations may be required.

ID Name Phone Address

K1 V1 K2 V2 K3 V3 K4 V4
K5 V5 K6 V6 K7 V7 K8 V8
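
The trade-off between the two layouts can be illustrated by flattening the same records in both orders. This is a hedged sketch in plain Python, not a storage engine; the phone numbers and addresses are made-up sample values.

```python
# Three records with the fields shown on the slides: ID, Name, Phone, Address.
records = [
    (1, "Tom", "555-0001", "Addr A"),
    (2, "Jack", "555-0002", "Addr B"),
    (3, "Lily", "555-0003", "Addr C"),
]
n_rows, n_cols = len(records), len(records[0])

# Row-based layout: the values of one record sit next to each other.
row_layout = [value for record in records for value in record]

# Column-based layout: the values of one column sit next to each other.
col_layout = [value for column in zip(*records) for value in column]

# Reading the Name column: strided in the row layout, contiguous in the column layout.
names_from_rows = row_layout[1::n_cols]
names_from_cols = col_layout[1 * n_rows:2 * n_rows]

# Reading the second record: contiguous in the row layout, strided in the column layout.
record_from_rows = tuple(row_layout[1 * n_cols:2 * n_cols])
record_from_cols = tuple(col_layout[1::n_rows])
```

A strided read stands in for the extra I/O operations: in the layout that does not match the access pattern, the wanted values are scattered across the file.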

HBase Architecture (1)

[Figure: HBase architecture. The Client, ZooKeeper, and HMaster sit above the HRegionServers.
Each HRegionServer has an HLog and hosts HRegions. Each HRegion contains Stores, and each Store
has one MemStore and several StoreFiles (HFiles). The HRegionServers use DFS clients to write
to HDFS, which stores the data on DataNodes.]

HBase Architecture (2)
 The HBase architecture consists of the following functional components:
 Library functions (linked into each client)
 HMaster
 HRegionServer

HBase Architecture (3)
 The HMaster server manages and maintains the partition information of HBase
tables, maintains the HRegionServer list, allocates regions, and balances loads.
 Each HRegionServer stores and maintains its allocated regions and processes read and
write requests from clients.
 The client does not read data through the HMaster. Instead, it reads data directly
from the HRegionServer after obtaining the storage location of the region.
 The client does not depend on the HMaster to find a region either: it obtains the region
location through ZooKeeper. Most clients never communicate with the HMaster,
which reduces the load on the HMaster.

HBase Architecture (4)
 Table (HBase table)
 Region (Regions for the table)
 Store (Store per ColumnFamily for each Region for the table)
− MemStore (MemStore for each Store for each Region for the table)

− StoreFile (StoreFiles for each Store for each Region for the table)
~ Block (Blocks within a StoreFile within a Store for each Region for the table)

Table and Region
 Initially, an HBase table has only one region. As the data volume increases, the table is split
into multiple regions.
 Region splitting is fast because the new regions keep reading the original storage file right after the split.
They switch to their own files only after the data has been asynchronously rewritten into independent files.

[Figure: the rows of a table, sorted by row key in lexicographical order, are divided into regions.
As a region grows, it is split by row key into more regions.]

Region Positioning (1)
 Regions are classified into Meta Regions and User Regions.
 A Meta Region records the routing information of each User Region.
 To find the routing information needed to read or write region data, perform the following steps:
 Find the Meta Region address.
 Find the User Region address based on the Meta Region.
[Figure: the hbase:meta table points to the regions of user table 1 through user table N.]
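
The two lookup steps can be sketched as follows. This is a simplified model in plain Python: the split points and server names are made up, `locate_meta` and `locate_user_region` are hypothetical helper names, and a binary search over sorted region start keys stands in for reading the hbase:meta table.

```python
import bisect

# Hypothetical contents of hbase:meta: sorted region start keys -> hosting server.
meta_start_keys = [b"", b"g", b"p"]
meta_servers = ["rs1.example", "rs2.example", "rs3.example"]


def locate_meta():
    """Step 1: find the Meta Region (looked up through ZooKeeper in a real cluster)."""
    return meta_start_keys, meta_servers


def locate_user_region(row_key):
    """Step 2: find the User Region whose key range covers row_key."""
    start_keys, servers = locate_meta()
    # The covering region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return servers[idx]


print(locate_user_region(b"apple"))  # rs1.example
print(locate_user_region(b"g"))      # rs2.example
print(locate_user_region(b"zebra"))  # rs3.example
```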

Region Positioning (2)
 To speed up access, the hbase:meta table is kept in memory.
 Assume that each row (a mapping entry) in the hbase:meta table occupies
about 1 KB in memory, and the maximum size of each region is 128 MB.
 In the two-layer structure, one Meta Region can then hold the entries of
2^17 (128 MB / 1 KB = 131,072) User Regions.
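
The capacity estimate can be checked with a few lines of arithmetic. The entry size and region size are the assumptions stated on the slide, not fixed HBase constants.

```python
KB = 1024
MB = 1024 * KB

entry_size = 1 * KB      # assumed size of one hbase:meta row
region_size = 128 * MB   # assumed maximum size of one region

addressable_regions = region_size // entry_size
print(addressable_regions)             # 131072
print(addressable_regions == 2 ** 17)  # True
```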

Client
 The client contains the interface for accessing HBase and maintains the location
information of the accessed regions in the cache to accelerate subsequent data access.
 The client queries the hbase:meta table first, and determines the location of the region.
After the required region is located, the client directly accesses the corresponding region
(without passing through the HMaster) and initiates a read/write request.

HMaster HA
 ZooKeeper helps elect one HMaster node as the primary management
node of the cluster and ensures that only one HMaster node is active
at any time, which prevents the HMaster node from becoming a single
point of failure (SPOF).

HMaster
 The HMaster server manages tables and regions by performing the
following operations:
 Manages users' operations on tables, such as adding, deleting, modifying, and
querying.
 Implements load balancing between different HRegionServers.
 Adjusts the distribution of regions after they are split or merged.
 Migrates the regions on the faulty HRegionServers.

HRegionServer
 HRegionServer is the core module of HBase. It provides the following main
functions:
 Maintains the regions allocated.
 Responds to users' read and write requests.

Data Read and Write Process
 When you write data, the request is routed to the corresponding HRegionServer
for execution.
 Your data is first written to HLog and MemStore.
 The commit() call returns to the client only after the operation has been
written to HLog.
 When you read data, the HRegionServer first checks the MemStore cache. If the
data is not found there, the HRegionServer searches the StoreFiles on
disk.
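
The write and read paths described above can be sketched with a toy region server. The class `RegionServerSketch` and its methods are hypothetical names for illustration only, and the in-memory lists stand in for the HLog file and the StoreFiles on HDFS.

```python
class RegionServerSketch:
    def __init__(self):
        self.hlog = []        # write-ahead log, append-only
        self.memstore = {}    # in-memory cache of recent writes
        self.storefiles = []  # flushed files, oldest first

    def put(self, row, value):
        self.hlog.append((row, value))  # 1. log the operation first
        self.memstore[row] = value      # 2. then update the cache
        return True                     # 3. only now acknowledge the client

    def flush(self):
        # Each flush produces one new StoreFile and empties the MemStore.
        if self.memstore:
            self.storefiles.append(dict(self.memstore))
            self.memstore.clear()

    def get(self, row):
        if row in self.memstore:                     # check MemStore first
            return self.memstore[row]
        for storefile in reversed(self.storefiles):  # then the newest StoreFile first
            if row in storefile:
                return storefile[row]
        return None


rs = RegionServerSketch()
rs.put(b"r1", b"v1")
rs.put(b"r2", b"x")
rs.flush()
rs.put(b"r1", b"v2")

print(rs.get(b"r1"))  # b'v2' (served from MemStore)
print(rs.get(b"r2"))  # b'x'  (served from a StoreFile)
```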

Cache Refreshing
 The system periodically writes the content of the MemStore cache to a
StoreFile on disk, clears the cache, and writes a marker in the HLog.
 A new StoreFile is generated each time the cache is flushed. Therefore, each
Store contains multiple StoreFiles.
 Each HRegionServer has its own HLog file. Each time the HRegionServer
starts, it checks the HLog file to determine whether any write operation was
performed after the most recent cache flush. If an update is detected, the
data is written to MemStore and then to a StoreFile. Finally, the old
HLog file is deleted, and the HRegionServer starts providing services.
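
The startup check can be sketched as a replay of the log entries that follow the last flush marker. This is a simplified model in plain Python: `FLUSH_MARKER` and `recover` are hypothetical names, and the real HLog format is not shown on the slides.

```python
FLUSH_MARKER = "FLUSH"  # hypothetical marker written to the log after each flush


def recover(hlog):
    """Rebuild the MemStore contents that were not yet flushed before a restart."""
    # Everything up to the last flush marker is already in StoreFiles.
    last_flush = -1
    for i, entry in enumerate(hlog):
        if entry == FLUSH_MARKER:
            last_flush = i
    memstore = {}
    for row, value in hlog[last_flush + 1:]:
        memstore[row] = value  # replay later writes in log order
    return memstore


log = [(b"r1", b"a"), (b"r2", b"b"), FLUSH_MARKER, (b"r1", b"c"), (b"r3", b"d")]
restored = recover(log)
print(restored)  # {b'r1': b'c', b'r3': b'd'}
```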

Merging StoreFiles
 A new StoreFile is generated each time data is flushed. A large number of
StoreFiles slows down searches.
 The Store.compact() function is used to combine multiple StoreFiles into one.
 Because merging consumes a large amount of resources, the merge operation is
started only when the number of StoreFiles reaches a threshold.
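
A threshold-triggered merge can be sketched as a multi-way merge of sorted files. This is an illustration only, not the Store.compact() implementation: the threshold value and the function name `maybe_compact` are assumptions, and the sketch keeps just the newest value per row for simplicity.

```python
import heapq

COMPACTION_THRESHOLD = 3  # hypothetical value; the slides do not give one


def maybe_compact(storefiles):
    """Merge all StoreFiles into one once there are enough of them.

    Each StoreFile is a list of (row_key, sequence, value) tuples sorted by
    row key; a higher sequence number means a newer write.
    """
    if len(storefiles) < COMPACTION_THRESHOLD:
        return storefiles  # merging is costly, so wait for more files
    merged = {}
    for row, seq, value in heapq.merge(*storefiles):
        # Entries for the same row arrive in ascending sequence order,
        # so the last one seen is the newest.
        merged[row] = (row, seq, value)
    return [list(merged.values())]


files = [
    [(b"a", 1, b"x"), (b"c", 1, b"y")],
    [(b"a", 2, b"z")],
    [(b"b", 3, b"w")],
]
compacted = maybe_compact(files)
print(compacted)  # [[(b'a', 2, b'z'), (b'b', 3, b'w'), (b'c', 1, b'y')]]
```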

Store Implementation
 Store is the core of an HRegionServer.
 Multiple StoreFiles are merged into one larger StoreFile.
 When a single StoreFile becomes too large, splitting is triggered. One parent
region is split into two sub-regions.

[Figure: StoreFile1 to StoreFile4 (64 MB each) are combined into StoreFile5 (256 MB).
StoreFile5 is split into StoreFile5A and StoreFile5B (128 MB each), shown next to
StoreFile6 and StoreFile7 (128 MB each).]

HLog Implementation
 In a distributed environment, you need to consider system errors. HBase uses
HLog to ensure system recovery.
 The HBase system configures an HLog file for each HRegionServer, which is a
write-ahead log (WAL).
 The updated data can be written to the MemStore cache only after the data is
written to logs. In addition, the cached data can be written to the disk only after
the logs corresponding to the data cached in the MemStore are written to the
disk.

Impact of Multiple HFiles

 Read latency increases as the number of HFiles grows.


Compaction (1)
 Compaction is used to reduce the number of small files (HFiles) in the same
column family of the same region to improve the read performance.
 Compaction is classified into minor compaction and major compaction.
 Minor: indicates small-scale compaction. There are limits on the minimum and
maximum number of files. Generally, small files in a continuous time range are
merged.
 Major: indicates the compaction of all HFiles under the column family of the
region.
 Minor compaction follows a specific algorithm when selecting files.

Compaction (2)

[Figure: a put writes to MemStore, and flushes produce seven HFiles. Minor compaction merges
them into three HFiles, and major compaction merges those into one HFile.]

OpenScanner
 In the OpenScanner process, two different scanners are created to read HFile
and MemStore data.
 The scanner corresponding to HFile is StoreFileScanner.
 The scanner corresponding to MemStore is MemStoreScanner.

[Figure: a Region with two column families. ColumnFamily-1 has a MemStore, HFile-11, and
HFile-12. ColumnFamily-2 has a MemStore, HFile-21, and HFile-22.]

BloomFilter
 BloomFilter is used to optimize some random read scenarios, that is, the Get
scenario. It can quickly determine whether a piece of user data exists
in a large data set (most of which cannot be loaded into memory).
 BloomFilter can give a false positive when it reports that a piece of data
exists. However, a result of "The data xxxx does not exist" is always
reliable.
 BloomFilter data in HBase is stored in HFiles.
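
The one-sided guarantee can be seen in a minimal Bloom filter. This sketch is a generic textbook version in plain Python, not the filter HBase stores in HFiles; the bit count, the number of hash functions, and the use of salted SHA-256 are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means only "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
stored = [b"row-%d" % i for i in range(50)]
for key in stored:
    bf.add(key)

# Every stored key is reported as possibly present: there are no false negatives.
print(all(bf.might_contain(key) for key in stored))  # True
```

A Get can therefore skip any file whose filter answers False, and only has to read the files that answer True.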

Row Key
 Row keys are stored in lexicographical (byte) order. When designing row keys,
make full use of this sorting so that data frequently read together, and data
likely to be accessed soon, is stored close together.
 For example, if the data most recently written to the HBase table is the most likely
to be accessed, you can use the timestamp as part of the row key. Because
rows are sorted in ascending order, use Long.MAX_VALUE - timestamp in the row
key. In this way, newly written data sorts first and can be hit quickly
when read.
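
The reversed-timestamp technique can be checked with a short sketch. It is written in plain Python, so `LONG_MAX` stands in for Java's Long.MAX_VALUE; the function name and the sample timestamps are made up for illustration.

```python
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE


def reversed_ts_key(timestamp_ms):
    # Fixed-width big-endian bytes sort in the same order as the numbers they encode.
    return (LONG_MAX - timestamp_ms).to_bytes(8, "big")


timestamps = [1_600_000_000_000, 1_600_000_005_000, 1_600_000_001_000]

# Sorting the keys mimics the order in which a scan would return the rows.
keys = sorted(reversed_ts_key(ts) for ts in timestamps)
scan_order = [LONG_MAX - int.from_bytes(k, "big") for k in keys]
print(scan_order)  # newest first: [1600000005000, 1600000001000, 1600000000000]
```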

Creating HBase Secondary Index (1)
 HBase has only one index: the row key.
 There are three methods for accessing rows in an HBase table:
 Access through a single row key
 Access through a row key interval
 Full table scan
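
The three access methods can be modeled on a table kept sorted by row key. This is a plain Python sketch, not the HBase client API; the function names are hypothetical and the sample rows reuse the values from the table structure slide.

```python
import bisect

rows = {b"20200301": b"Tom", b"20200302": b"Jack", b"20200303": b"Lily"}
sorted_keys = sorted(rows)  # rows are kept ordered by row key


def get(row_key):
    """Access through a single row key."""
    return rows.get(row_key)


def scan_range(start, stop):
    """Access through a row key interval [start, stop)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]


def scan_all():
    """Full table scan."""
    return [(k, rows[k]) for k in sorted_keys]


print(get(b"20200302"))                      # b'Jack'
print(scan_range(b"20200302", b"20200304"))  # Jack and Lily
print(len(scan_all()))                       # 3
```

Any query that cannot be expressed as one of the first two forms falls back to the full scan, which is the motivation for secondary indexes.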

Creating HBase Secondary Index (2)
 Hindex Secondary Index
 Hindex is a Java-based HBase secondary index developed by Huawei and is
compatible with Apache HBase 0.94.8. The current features are as follows:
 Multiple table indexes
 Multiple column indexes
 Index based on some column values

Common HBase Shell Commands
 create: creating HBase tables
 list: listing all tables in HBase
 put: adding data to the cell specified by a table, row, and column
 scan: browsing the data in a table
 get: obtaining the value of a cell based on the table name, row, column,
timestamp, time range, and version number
 enable/disable: enabling or disabling a table
 drop: deleting a table

Summary

 This course introduces the HBase database. HBase is an open source
implementation of BigTable. Like BigTable, HBase supports large
data volumes and distributed concurrent data processing. It is easy to expand,
supports dynamic scaling, and runs on inexpensive devices.
 This course also describes the differences between the conceptual view and the
physical view of HBase data. HBase is a sparse, multi-dimensional, persistent
mapping table. It uses row keys, column keys, and timestamps for indexing,
and each value is an uninterpreted byte string.

Quiz

1. Which of the following types is used to store data in HBase? ( )

A. Int

B. Long

C. String

D. Byte[]

Quiz

2. What is the smallest storage unit of HBase? ( )


A. Region

B. Column Family

C. Column

D. Cell

Recommendations

 Huawei Cloud Official Web Link:


 https://www.huaweicloud.com/intl/en-us/
 Huawei MRS Documentation:
 https://www.huaweicloud.com/intl/en-us/product/mrs.html
 Huawei TALENT ONLINE:
 https://e.huawei.com/en/talent/#/

Thank you.
Bring digital to every person, home, and organization for a fully connected,
intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive
statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
