Module 02 HDFS - Hadoop Distributed File System
HDFS
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Contents
1. HDFS Overview and Application Scenarios
4. Key Features
Dictionary vs. File System
HDFS Overview
The Hadoop Distributed File System (HDFS) is developed based on the
Google File System (GFS) and runs on commodity hardware.
In addition to the features provided by other distributed file systems,
HDFS also provides the following features:
High fault tolerance: copes with unreliable hardware.
High throughput: supports applications that process large amounts of
data.
Large file storage: supports TB- and PB-scale data storage.
HDFS Application Scenarios
HDFS is the distributed file system of the Hadoop technical
framework. It manages files stored across multiple independent
physical servers.
Position of HDFS in FusionInsight
[Figure: position of HDFS in the FusionInsight architecture. Surviving labels: Application service layer; OpenAPI/SDK; REST/SNMP/Syslog.]
Basic System Architecture
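[Figure: HDFS architecture. The NameNode holds metadata (name, replicas, ...), for example /home/foo/data,3,... Clients send metadata operations to the NameNode and read blocks from DataNodes; the NameNode sends block operations to DataNodes; blocks are replicated between DataNodes in Rack 1 and Rack 2.]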
HDFS Data Write Process
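[Figure: HDFS data write process. A client node and three DataNodes; only step labels 4 and 5 survive, each appearing twice next to the DataNodes.]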
HDFS Data Read Process
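[Figure: HDFS data read process, as seen from the client node]
1: open (HDFS client to DistributedFileSystem)
2: get block location (DistributedFileSystem to NameNode)
3: read (HDFS client to FSDataInputStream)
4: read (FSDataInputStream to a DataNode)
5: read (FSDataInputStream to the next DataNode)
6: close (HDFS client to FSDataInputStream)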
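The read steps in the figure can be sketched as a toy, in-memory simulation. This is only an illustration, not the Hadoop client API: `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names, and the block IDs and file contents are made up. Real clients go through `DistributedFileSystem` and `FSDataInputStream`, as labeled in the figure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Hypothetical stand-in for a DataNode: maps block IDs to bytes."""
    name: str
    blocks: dict[str, bytes] = field(default_factory=dict)


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for the NameNode: holds metadata only, no file data."""
    # file path -> ordered list of (block ID, DataNodes holding a replica)
    files: dict[str, list[tuple[str, list[ToyDataNode]]]] = field(default_factory=dict)

    def get_block_locations(self, path: str) -> list[tuple[str, list[ToyDataNode]]]:
        return self.files[path]


def read_file(namenode: ToyNameNode, path: str) -> bytes:
    # Steps 1-2: open the file and ask the NameNode where its blocks live.
    locations = namenode.get_block_locations(path)
    data = bytearray()
    # Steps 3-5: read each block, in order, from a DataNode that holds a replica.
    for block_id, replicas in locations:
        data += replicas[0].blocks[block_id]
    # Step 6: close (nothing to release in this toy version).
    return bytes(data)


dn1 = ToyDataNode("dn1", {"blk_1": b"hello "})
dn2 = ToyDataNode("dn2", {"blk_1": b"hello ", "blk_2": b"hdfs"})
nn = ToyNameNode({"/home/foo/data": [("blk_1", [dn1, dn2]), ("blk_2", [dn2])]})
print(read_file(nn, "/home/foo/data"))
```

The point of the sketch is the division of labor: the NameNode answers only the "where" question, and the file bytes come from the DataNodes.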
Key Design of HDFS Architecture
HA
HDFS Data replication
HDFS High Availability (HA)
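[Figure: HDFS HA architecture. Components: three ZooKeeper nodes, two ZKFCs (one per NameNode), three JournalNodes (JN) holding the EditLog, an active NameNode, a standby NameNode, an HDFS client, and DataNodes. Labeled flows: heartbeat between ZKFC and ZooKeeper; the active NameNode writes the log to the JNs and the standby NameNode reads it; FSImage synchronization between the two NameNodes; metadata operations from the client to the NameNode; data read/write between the client and DataNodes; heartbeat and block operations between DataNodes and NameNodes; copy between DataNodes.]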
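One way to picture the shared EditLog in the figure: the active NameNode writes each edit to the JournalNodes (JN), and the edit counts as durable once a majority of them acknowledge it. The sketch below assumes that majority rule; the function name `is_edit_committed` is hypothetical, and ZKFC-driven failover is not modeled.

```python
def is_edit_committed(acks: list[bool]) -> bool:
    """Return True when a strict majority of JournalNodes acknowledged the write."""
    return sum(acks) > len(acks) // 2


# Three JournalNodes, as in the figure: one may be down without blocking writes.
print(is_edit_committed([True, True, False]))
print(is_edit_committed([True, False, False]))
```

With three JournalNodes, the first call reports a committed edit and the second does not, which is why JournalNodes are usually deployed in odd numbers.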
Metadata Persistence
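[Figure: metadata persistence between the active NameNode and the standby NameNode. Surviving labels: EditLog, FSImage, FSImage.ckpt, and step 5, "Rolls back FSImage". Steps 1 to 4 did not survive extraction.]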
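The checkpoint idea behind this figure can be sketched as follows: replay the EditLog on top of the last FSImage to produce a new image (the `.ckpt` file in the figure), which then takes the place of the old FSImage. The in-memory data model, the function name `apply_edits`, and the `mkdir`/`create`/`delete` operation tuples are hypothetical simplifications, not the on-disk formats.

```python
from __future__ import annotations


def apply_edits(fsimage: dict[str, str], editlog: list[tuple[str, str]]) -> dict[str, str]:
    """Replay logged operations on a copy of the image and return the new checkpoint."""
    checkpoint = dict(fsimage)  # never mutate the last good image
    for op, path in editlog:
        if op == "mkdir":
            checkpoint[path] = "dir"
        elif op == "create":
            checkpoint[path] = "file"
        elif op == "delete":
            checkpoint.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return checkpoint


fsimage = {"/": "dir", "/tmp": "dir"}
editlog = [("mkdir", "/home"), ("create", "/home/a.txt"), ("delete", "/tmp")]
fsimage_ckpt = apply_edits(fsimage, editlog)
print(fsimage_ckpt)
```

Keeping the old image untouched until the new one is complete mirrors the reason for writing to a separate `.ckpt` file first.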
HDFS Federation
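[Figure: HDFS Federation. Applications (Client-1 ... Client-k ... Client-n) access independent NameNodes (NN1 ... NN-k ... NN-n). Each NameNode manages its own namespace (NS1 ... NS-k ... NS-n) and its own block pool (Pool 1 ... Pool k ... Pool n). All block pools share common block storage on DataNode1, DataNode2, ... DataNodeN.]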
Data Replication
[Figure: replica placement policy in a data center with several racks of nodes (Node1 to Node3). Relative to the client, block B1 is at distance 0 (same node), B3 at distance 2 (same rack), and B2 and B4 at distance 4 (other racks).]
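The distances 0, 2, and 4 in the figure follow from treating the cluster as a tree (data center, rack, node) and counting the hops from each node up to their closest common ancestor. A minimal sketch, assuming locations are written as `/rack/node` paths; the path strings and the function name `network_distance` are illustrative, not taken from the slides.

```python
def network_distance(a: str, b: str) -> int:
    """Hops from a and from b up to their closest common ancestor in the topology tree."""
    parts_a = a.strip("/").split("/")
    parts_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    return (len(parts_a) - common) + (len(parts_b) - common)


client = "/rack1/node1"
print(network_distance(client, "/rack1/node1"))  # same node
print(network_distance(client, "/rack1/node3"))  # same rack, different node
print(network_distance(client, "/rack2/node2"))  # different rack
```

A smaller distance means a cheaper transfer, which is why reads prefer the closest replica.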
Configuring HDFS Data Storage Policies
By default, the HDFS NameNode automatically selects the DataNodes that
store data replicas. In practice, the following scenarios require explicit control:
Layered storage: select a suitable storage device from the multiple
devices on a DataNode.
Tag storage: select suitable DataNodes according to directory tags that indicate
data importance levels.
Node group storage: store key data in highly reliable node groups when the DataNode
cluster uses heterogeneous servers.
Configuring HDFS Data Storage Policies - Layered Storage
Configuring DataNodes with layered storage:
The HDFS layered storage architecture provides four types of storage devices: RAM_DISK
(in-memory virtual disk), DISK (mechanical hard disk), ARCHIVE (high-density,
low-cost storage media), and SSD (solid-state disk).
Storage policies for different scenarios are formed by combining these four types of storage
devices.
Policy ID  Name           Block Location (Number of Replicas)  Alternative Storage Policy  Alternative Replica Storage Policy
15         LAZY_PERSIST   RAM_DISK: 1, DISK: n-1               DISK                        DISK
12         ALL_SSD        SSD: n                               DISK                        DISK
10         ONE_SSD        SSD: 1, DISK: n-1                    SSD, DISK                   SSD, DISK
7          HOT (default)  DISK: n                              <none>                      ARCHIVE
5          WARM           DISK: 1, ARCHIVE: n-1                ARCHIVE, DISK               ARCHIVE, DISK
2          COLD           ARCHIVE: n                           <none>                      <none>
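The "Block Location" column of the table above can be read as a rule for assigning a storage type to each of the n replicas. The sketch below encodes only that column; the alternative (fallback) columns are ignored, and the names `POLICIES` and `replica_storage_types` are illustrative rather than part of any HDFS API.

```python
from __future__ import annotations

# Block placement per policy, from the table: a list of (storage type, count)
# pairs, where a count of None means "all remaining replicas".
POLICIES: dict[str, list[tuple[str, int | None]]] = {
    "LAZY_PERSIST": [("RAM_DISK", 1), ("DISK", None)],
    "ALL_SSD": [("SSD", None)],
    "ONE_SSD": [("SSD", 1), ("DISK", None)],
    "HOT": [("DISK", None)],
    "WARM": [("DISK", 1), ("ARCHIVE", None)],
    "COLD": [("ARCHIVE", None)],
}


def replica_storage_types(policy: str, n: int) -> list[str]:
    """Return the storage type chosen for each of the n replicas under a policy."""
    placed: list[str] = []
    for storage, count in POLICIES[policy]:
        remaining = n - len(placed)
        take = remaining if count is None else min(count, remaining)
        placed.extend([storage] * take)
    return placed


for name in POLICIES:
    print(name, replica_storage_types(name, 3))
```

For example, with three replicas WARM keeps one copy on DISK and two on ARCHIVE, while HOT keeps all three on DISK.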
Configuring HDFS Data Storage Policies - Tag Storage
Configuring HDFS Data Storage Policies - Node Group Storage
[Figure: node group storage across Rack group 1 (mandatory), Rack group 2, Rack group 3, and Rack group 4.]
Colocation
Colocation means storing associated data, or data that is going to be
associated, on the same storage node.
In the figure below, assume that file A and file D are going to be
associated with each other. Because they are stored on different nodes, the
association involves massive data migration. This data transmission consumes
considerable bandwidth, which greatly reduces the processing speed of massive
data and degrades system performance.
Colocation Benefits
HDFS colocation stores files that need to be associated with each
other on the same DataNode, so that data does not have to be fetched from
other nodes during associated computing. This greatly reduces network
bandwidth consumption.
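The bandwidth saving can be illustrated with a deliberately simplified cost model. The sketch assumes one replica per file and assumes that, when two files are on different nodes, the smaller one is shipped to the node holding the larger one. The function name, node names, and file sizes are all hypothetical.

```python
from __future__ import annotations


def bytes_moved_for_association(
    placement: dict[str, str], sizes: dict[str, int], file_a: str, file_b: str
) -> int:
    """Bytes that must cross the network to bring two files onto one node."""
    if placement[file_a] == placement[file_b]:
        return 0  # colocated: the computation reads both files locally
    return min(sizes[file_a], sizes[file_b])  # ship the smaller file


GB = 1024 ** 3
sizes = {"A": 10 * GB, "D": 4 * GB}

scattered = {"A": "datanode1", "D": "datanode3"}
colocated = {"A": "datanode1", "D": "datanode1"}

print(bytes_moved_for_association(scattered, sizes, "A", "D") // GB, "GB moved without colocation")
print(bytes_moved_for_association(colocated, sizes, "A", "D") // GB, "GB moved with colocation")
```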
HDFS Data Integrity Assurance
HDFS ensures the integrity of stored data and handles the failure of each component reliably.
Reconstructs data replicas from failed data disks.
Each DataNode periodically reports its block information to the NameNode. If a replica (block) fails, the
NameNode starts the procedure to recover the lost replica.
Ensures data balance among DataNodes.
HDFS provides a data balancing mechanism that keeps data evenly distributed
among all DataNodes.
Ensures metadata reliability.
Metadata operations are recorded through a log mechanism, and the metadata is stored on both the active and standby NameNodes.
The file system snapshot mechanism allows data to be recovered promptly after a
misoperation.
Provides safe mode.
HDFS provides a safe mode that prevents faults from spreading when a DataNode or hard disk is faulty.
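The replica reconstruction described above starts from the block reports. A minimal sketch of that bookkeeping, assuming each DataNode reports the set of block IDs it holds and the NameNode knows the expected replica count per block; the function name `under_replicated` and the sample data are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def under_replicated(
    block_reports: dict[str, set[str]], expected: dict[str, int]
) -> dict[str, int]:
    """Return {block ID: number of missing replicas} based on the latest block reports."""
    live = Counter(block for blocks in block_reports.values() for block in blocks)
    return {block: want - live[block] for block, want in expected.items() if live[block] < want}


reports = {
    "datanode1": {"blk_1", "blk_2"},
    "datanode2": {"blk_1"},  # the disk holding blk_2 on this node has failed
    "datanode3": {"blk_1", "blk_2"},
}
expected = {"blk_1": 3, "blk_2": 3}
print(under_replicated(reports, expected))
```

Every block listed in the result would be scheduled for copying from a surviving replica to another DataNode.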
Other Key Design Points of the HDFS Architecture
Unified file system:
HDFS presents itself externally as a single unified file system.
Space reclamation:
A recycle bin mechanism is provided, and the number of replicas can be set dynamically.
Data organization:
Data is stored in blocks in HDFS.
Access mode:
Data can be accessed through Java APIs, HTTP, or shell commands.
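Block-based storage and replication reduce to simple arithmetic. The sketch below assumes a 128 MB block size and a replication factor of 3; both are configurable and are assumptions here, not values stated in the slides. The function name `block_layout` is illustrative.

```python
def block_layout(
    file_size: int, block_size: int = 128 * 1024 * 1024, replication: int = 3
) -> tuple[int, int, int]:
    """Return (number of blocks, size of the last block, total raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    blocks = -(-file_size // block_size)  # ceiling division without floats
    last_block = file_size - (blocks - 1) * block_size
    return blocks, last_block, file_size * replication


MB = 1024 * 1024
blocks, last_block, raw = block_layout(300 * MB)
print(blocks, "blocks; last block", last_block // MB, "MB; raw storage", raw // MB, "MB")
```

Under these assumptions a 300 MB file occupies two full blocks plus one 44 MB block, and the cluster stores 900 MB in total across all replicas.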
Common Shell Commands
Type      Command         Description
dfs       -cat            Show the file contents
dfs       -ls             Show a directory listing
dfs       -rm             Delete files
dfs       -put            Upload directories/files to HDFS
dfs       -get            Download directories/files from HDFS
dfs       -mkdir          Create a directory
dfs       -chmod/-chown   Change the permissions or the owner and group of files
dfs       …               …
dfsadmin  -safemode       Safe mode operation
dfsadmin  -report         Report service status
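The commands in the table are options of `hdfs dfs` and `hdfs dfsadmin`. As a small sketch, the helper below only builds and prints the command lines, so it can be run without a cluster; the helper name `hdfs_command` and the paths `/user/demo` and `local.txt` are hypothetical.

```python
import shlex


def hdfs_command(tool: str, option: str, *args: str) -> list[str]:
    """Build the argument list for `hdfs dfs ...` or `hdfs dfsadmin ...` without running it."""
    if tool not in ("dfs", "dfsadmin"):
        raise ValueError(f"unsupported tool: {tool}")
    return ["hdfs", tool, option, *args]


commands = [
    hdfs_command("dfs", "-mkdir", "/user/demo"),
    hdfs_command("dfs", "-put", "local.txt", "/user/demo/"),
    hdfs_command("dfs", "-ls", "/user/demo"),
    hdfs_command("dfs", "-cat", "/user/demo/local.txt"),
    hdfs_command("dfsadmin", "-safemode", "get"),
    hdfs_command("dfsadmin", "-report"),
]
for argv in commands:
    print(shlex.join(argv))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(argv, check=True)` to execute it.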
Summary
This module describes the following aspects of HDFS:
basic concepts, application scenarios, technical architecture, and
key features.
Quiz
1. What is HDFS and what can it be used for?
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40