
Module 02 HDFS - Hadoop Distributed File System

The document provides an overview of the Hadoop Distributed File System (HDFS), including its application scenarios, system architecture, and key features. HDFS is designed for high fault tolerance, high throughput, and large file storage, making it suitable for managing files across multiple servers. It also covers data storage policies, read/write processes, and the importance of data integrity and colocation in optimizing performance.

Uploaded by

Lucas Oliveira

Technical Principles of HDFS

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
 Upon completion of this course, you will understand:
 HDFS application scenarios
 HDFS system architecture
 Key HDFS features

Contents
1. HDFS Overview and Application Scenarios

2. Position of HDFS in FusionInsight HD

3. HDFS System Architecture

4. Key Features

Dictionary vs. File System

Dictionary         File System
Character index    File name, metadata
Dictionary body    Data block

HDFS Overview
 The Hadoop Distributed File System (HDFS) is developed based on the
Google File System (GFS) and runs on commodity hardware.
 In addition to the features provided by other distributed file systems,
HDFS also provides the following features:
 High fault tolerance: resolves hardware unreliability problems.
 High throughput: supports applications that process large amounts of
data.
 Large file storage: supports TB-level and PB-level data storage.

HDFS is applicable to:
 Storing large files
 Streaming data access

HDFS is inapplicable to:
 Storing massive numbers of small files
 Random writes
 Low-latency reads

HDFS Application Scenarios
HDFS is a distributed file system of the Hadoop technical
framework and is used to manage files on multiple independent
physical servers.

It is applicable to the following scenarios:


 Website user behavior data storage
 Ecosystem data storage
 Meteorological data storage

Position of HDFS in FusionInsight
[Figure: FusionInsight architecture. Application service layer: OpenAPI/SDK, REST/SNMP/Syslog. Data, Information, Knowledge, Wisdom. Components: DataFarm, Porter, Miner, Farmer, Manager (system management, service governance, security management). Interfaces: Hadoop API, Plugin API. Engines: Hadoop (Hive, M/R, Spark, Storm, Flink, YARN/ZooKeeper, HDFS/HBase) and LibrA.]

As the Hadoop storage infrastructure, HDFS serves as a distributed, fault-tolerant
file system with linear scalability.

Basic System Architecture
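[Figure: HDFS architecture. The NameNode holds the metadata (name, replicas, ...), for example /home/foo/data,3,... Clients send metadata operations to the NameNode and read blocks from the DataNodes. The NameNode sends block operations to the DataNodes. Blocks are replicated between DataNodes in Rack 1 and Rack 2.]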

HDFS Data Write Process

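[Figure: HDFS data write process. The client node runs the HDFS client, DistributedFileSystem, and FSDataOutputStream; the NameNode and a pipeline of DataNodes run elsewhere.]
1. create: the HDFS client calls create on DistributedFileSystem.
2. create: DistributedFileSystem asks the NameNode to create the file.
3. write: the HDFS client writes data to FSDataOutputStream.
4. write packet: each packet is sent to the first DataNode and forwarded from DataNode to DataNode.
5. ack packet: acknowledgements travel back along the DataNodes to the client node.
6. close: the HDFS client closes FSDataOutputStream.
7. complete: the NameNode is notified that the file is complete.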
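
The packet and acknowledgement flow (steps 4 and 5) can be sketched in a few lines of Python. This is only an in-memory illustration, not the Hadoop client: the DataNode class, the function names, and the 4-byte packet size are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical in-memory stand-in for a DataNode."""
    name: str
    packets: list = field(default_factory=list)


def write_packet(pipeline, packet):
    """Step 4: pass the packet to every node in pipeline order.

    Step 5: return the node names in the order their acks reach the
    client, which is the reverse of the forwarding order.
    """
    for node in pipeline:
        node.packets.append(packet)
    return [node.name for node in reversed(pipeline)]


def write_file(data, pipeline, packet_size=4):
    """Split data into packets and push each one through the pipeline."""
    acks = []
    for start in range(0, len(data), packet_size):
        acks.append(write_packet(pipeline, data[start:start + packet_size]))
    return acks


pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
acks = write_file(b"hello hdfs", pipeline)
```

After the call, every node in the pipeline holds the same packets, and each entry of acks lists the nodes from the last one back to the first.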

HDFS Data Read Process

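[Figure: HDFS data read process. The client node runs the HDFS client, DistributedFileSystem, and FSDataInputStream; the NameNode and the DataNodes run elsewhere.]
1. open: the HDFS client calls open on DistributedFileSystem.
2. get block location: DistributedFileSystem obtains the block locations from the NameNode.
3. read: the HDFS client reads from FSDataInputStream.
4. read: FSDataInputStream reads a block from a DataNode.
5. read: FSDataInputStream reads the next block from another DataNode.
6. close: the HDFS client closes FSDataInputStream.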
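
A minimal sketch of the read flow follows, assuming the NameNode returns the replica holders of each block in nearest-first order. The NameNode class, the dictionaries, and the block and node names are hypothetical and exist only for this example.

```python
class NameNode:
    """Hypothetical metadata service: file path -> ordered block locations."""

    def __init__(self, block_map):
        self.block_map = block_map

    def get_block_locations(self, path):
        # Step 2: return, per block, the DataNodes that hold a replica.
        return self.block_map[path]


def read_file(namenode, datanodes, path):
    """Read every block of a file, in order, from a reachable replica."""
    chunks = []
    for block_id, holders in namenode.get_block_locations(path):
        # Steps 4 and 5: take the first listed DataNode that is reachable
        # and actually holds the block.
        for name in holders:
            if name in datanodes and block_id in datanodes[name]:
                chunks.append(datanodes[name][block_id])
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)


datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"hdfs"},
}
namenode = NameNode({"/demo.txt": [("blk_1", ["dn1", "dn2"]),
                                   ("blk_2", ["dn3", "dn2"])]})
content = read_file(namenode, datanodes, "/demo.txt")
```

In this example dn3 is not reachable, so the second block is read from dn2 instead.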

Key Design of HDFS Architecture

[Figure: key design points arranged around HDFS]
 NameNode/DataNode in master/slave mode
 Unified file system namespace
 Data replication
 Metadata persistence
 Robustness
 Space reclamation
 Multiple access modes
 HA
 Data storage policy
 Federation storage

HDFS High Availability (HA)
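[Figure: HDFS HA architecture. Three ZooKeeper nodes exchange heartbeats with two ZKFC processes, one next to each NameNode. The active NameNode writes the EditLog to three JournalNodes (JN), and the standby NameNode reads the log from them. FSImage synchronization runs between the active and standby NameNodes. The HDFS client sends metadata operations to the NameNode and reads and writes data on the DataNodes. Heartbeats and block operations flow between the NameNodes and the four DataNodes, and the DataNodes copy blocks among themselves.]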
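
The shared EditLog is what lets the standby NameNode stay close to the active one. The sketch below assumes the usual quorum arrangement, in which an edit counts as durable once a majority of JournalNodes have stored it; the slide itself only shows three JNs. The Standby class and the edit tuples are hypothetical.

```python
def quorum_size(journal_nodes):
    """Smallest number of JournalNodes that forms a majority."""
    return journal_nodes // 2 + 1


def edit_is_durable(acks):
    """acks holds one boolean per JournalNode, True if it stored the edit."""
    return sum(acks) >= quorum_size(len(acks))


class Standby:
    """Hypothetical standby that replays the edits it reads from the JNs."""

    def __init__(self):
        self.namespace = set()

    def replay(self, edits):
        for op, path in edits:
            if op == "mkdir":
                self.namespace.add(path)
            elif op == "delete":
                self.namespace.discard(path)


standby = Standby()
standby.replay([("mkdir", "/a"), ("mkdir", "/b"), ("delete", "/a")])
```

With three JournalNodes the quorum is two, so the log survives the loss of one JN.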

Metadata Persistence
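[Figure: metadata persistence between the active NameNode and the standby NameNode]
1. The active NameNode rolls the Editlog: the current Editlog is closed and new edits go to Editlog.new.
2. The standby NameNode obtains the Editlog and Fsimage from the active node. The Fsimage is downloaded only when the NameNode is initialized; the local Fsimage file is used afterwards.
3. The standby NameNode merges the Editlog and Fsimage into a new file, Fsimage.ckpt.
4. The standby NameNode uploads the new Fsimage (Fsimage.ckpt) to the active node.
5. The active NameNode rolls the Fsimage: Fsimage.ckpt replaces Fsimage, and Editlog.new becomes Editlog.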
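
The merge in step 3 amounts to replaying the logged operations on top of the last image. A minimal sketch, assuming a dict-based image and a log of (operation, path, value) tuples; neither is the real on-disk format.

```python
def apply_edits(fsimage, editlog):
    """Step 3: replay the edit log on top of a copy of the image."""
    merged = dict(fsimage)
    for op, path, value in editlog:
        if op == "create":
            merged[path] = value
        elif op == "delete":
            merged.pop(path, None)
    return merged


def checkpoint(active):
    """Run steps 1 to 5 on a hypothetical dict-based active NameNode."""
    # 1. Roll the edit log: new edits would now go to editlog_new.
    active["editlog_new"] = []
    # 2 and 3. The standby fetches the closed log and merges it.
    fsimage_ckpt = apply_edits(active["fsimage"], active["editlog"])
    # 4 and 5. Upload the merged image and roll it into place.
    active["fsimage"] = fsimage_ckpt
    active["editlog"] = active.pop("editlog_new")
    return active


active = {
    "fsimage": {"/a": 3},
    "editlog": [("create", "/b", 2), ("delete", "/a", None)],
}
checkpoint(active)
```

After the checkpoint the image already reflects every logged edit, so the edit log starts empty again.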

HDFS Federation
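[Figure: HDFS Federation. Application layer: Client-1 ... Client-k ... Client-n. HDFS layer: Namespace-1 ... Namespace-k ... Namespace-n. Namespace layer: each namespace (NS1 ... NS-k ... NS-n) is served by its own NameNode (NN1 ... NN-k ... NN-n). Block storage layer: each namespace has its own block pool (Pool 1 ... Pool k ... Pool n). The block pools share common storage on DataNode1, DataNode2, ..., DataNodeN.]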
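
The relationship between namespaces, block pools, and common storage can be sketched as follows. The class, the mapping, and the block ids are hypothetical; the point is only that one DataNode keeps a separate block set per pool, so namespaces do not have to coordinate their block ids.

```python
from collections import defaultdict


class FederatedDataNode:
    """Hypothetical DataNode that keeps one block set per block pool."""

    def __init__(self, name):
        self.name = name
        self.pools = defaultdict(set)   # pool id -> block ids

    def store(self, pool_id, block_id):
        self.pools[pool_id].add(block_id)


# One NameNode per namespace, each with its own block pool (names assumed).
namespace_to_pool = {"NS1": "Pool 1", "NS-k": "Pool k", "NS-n": "Pool n"}

dn = FederatedDataNode("DataNode1")
dn.store(namespace_to_pool["NS1"], "blk_1")
dn.store(namespace_to_pool["NS-n"], "blk_1")   # same id, different pool
dn.store(namespace_to_pool["NS-n"], "blk_2")
```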

Data Replication
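[Figure: replica placement policy in a data center with three racks (RACK1, RACK2, RACK3), each holding Node1 to Node5. Distances are measured from the client: Distance=0 to block B1 on the client's own node, Distance=2 to block B3 on another node in the same rack, and Distance=4 to blocks B2 and B4 on nodes in other racks.]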
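
The distances in the figure follow a simple rule: 0 for the same node, 2 for another node in the same rack, and 4 for a node in another rack of the same data center. A minimal sketch, assuming nodes are identified by hypothetical (rack, node) tuples:

```python
def network_distance(a, b):
    """Distance between two nodes given as (rack, node) tuples.

    0 = same node, 2 = same rack, 4 = different racks in one data center.
    """
    if a == b:
        return 0
    if a[0] == b[0]:
        return 2
    return 4


def sort_by_distance(client, replicas):
    """Order replica locations nearest-first, as a reader would prefer."""
    return sorted(replicas, key=lambda node: network_distance(client, node))


client = ("RACK1", "Node1")
replicas = [("RACK2", "Node2"), ("RACK1", "Node3"), ("RACK1", "Node1")]
ordered = sort_by_distance(client, replicas)
```

Sorting by this distance puts the local replica first and the off-rack replica last.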

Configuring HDFS Data Storage Policies
By default, the HDFS NameNode automatically selects DataNodes to
store data replicas. There are the following scenarios in practice:
 Select a proper storage device for layered data storage from multiple
devices on a DataNode.
 Select a proper DataNode according to directory tags that indicate
data importance levels.
 Store key data in highly reliable node groups because the DataNode
cluster uses heterogeneous servers.

Configuring HDFS Data Storage Policies -
Layered Storage
 Configuring DataNodes with layered storage:
 The HDFS layered storage architecture provides four types of storage devices: RAM_DISK
(a virtual disk backed by memory), DISK (mechanical hard disk), ARCHIVE (high-density,
low-cost storage media), and SSD (solid-state disk).

 Storage policies for different scenarios are formulated by combining the four types of storage
devices.

Policy ID  Name           Block Location (Number of Replicas)  Alternative Storage Policy  Alternative Replica Storage Policy
15         LAZY_PERSIST   RAM_DISK: 1, DISK: n-1               DISK                        DISK
12         ALL_SSD        SSD: n                               DISK                        DISK
10         ONE_SSD        SSD: 1, DISK: n-1                    SSD, DISK                   SSD, DISK
7          HOT (default)  DISK: n                              <none>                      ARCHIVE
5          WARM           DISK: 1, ARCHIVE: n-1                ARCHIVE, DISK               ARCHIVE, DISK
2          COLD           ARCHIVE: n                           <none>                      <none>
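
The Block Location column can be read as a rule for n replicas: one storage type for the first replica and one for the remaining n-1. A minimal sketch of that rule follows; it ignores the two alternative (fallback) columns, and the function name is made up for the example.

```python
# Policy name -> (storage type for the first replica, type for the rest).
POLICIES = {
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),
    "ALL_SSD": ("SSD", "SSD"),
    "ONE_SSD": ("SSD", "DISK"),
    "HOT": ("DISK", "DISK"),
    "WARM": ("DISK", "ARCHIVE"),
    "COLD": ("ARCHIVE", "ARCHIVE"),
}


def block_location(policy, n):
    """Return the preferred storage type of each of n replicas."""
    if n < 1:
        raise ValueError("n must be at least 1")
    first, rest = POLICIES[policy]
    return [first] + [rest] * (n - 1)
```

For example, block_location("WARM", 3) places one replica on DISK and two on ARCHIVE.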

Configuring HDFS Data Storage Policies -
Tag Storage

Configuring HDFS Data Storage Policies
- Node Group Storage
[Figure: node group storage. Rack group 1 (mandatory) holds Node 1 and Node 2, rack group 2 holds Node 3 and Node 4, rack group 3 holds Node 5 and Node 6, and rack group 4 holds Node 7 and Node 8. Example files: File 1 (number of replicas = 1), File 2 (number of replicas = 3), File 3 (number of replicas = 2).]

Colocation
Colocation stores associated data, or data that is going to be associated, on the
same storage node.

Assume that file A and file D are going to be associated with each other. Without
colocation, the association involves massive data migration. Data transmission
consumes much bandwidth, which greatly affects the processing speed of massive
data and system performance.

Colocation Benefits
 HDFS colocation stores files that need to be associated with each other on
the same DataNode, so that data does not have to be obtained from other
nodes during associated computing. This greatly reduces network bandwidth
consumption.

 When files A and D are joined with the colocation feature, resource consumption
is greatly reduced because the blocks of the associated files are distributed on
the same storage node.
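
The saving can be illustrated with a toy cost model. The sketch assumes that joining two files on different nodes ships the smaller file to the node holding the larger one; the file sizes and node names are invented, and a real job would move data block by block.

```python
def bytes_moved_for_join(placement, sizes, file_a, file_b):
    """Bytes that must cross the network to join two files.

    placement maps file name -> node; sizes maps file name -> bytes.
    Assumes the smaller file is shipped to the node holding the larger one.
    """
    if placement[file_a] == placement[file_b]:
        return 0
    return min(sizes[file_a], sizes[file_b])


sizes = {"A": 500, "D": 200}
scattered = {"A": "node1", "D": "node3"}
colocated = {"A": "node1", "D": "node1"}
```

With the scattered placement the join moves 200 bytes; with the colocated placement it moves none.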

HDFS Data Integrity Assurance
HDFS ensures the integrity of stored data and handles the failure of each component so that the
system stays reliable.
 Rebuilds the data replicas held on failed data disks.
 Each DataNode periodically reports its block information to the NameNode. If a replica (block) is lost or damaged, the
NameNode starts the procedure to recover the lost replica.
 Ensures data balance among DataNodes.
 The HDFS architecture includes a data balancing mechanism, which ensures the even distribution
of data among all DataNodes.
 Ensures metadata reliability.
 Metadata operations are recorded through a log mechanism, and the metadata is stored on both the active and standby NameNodes.
 The snapshot mechanism of the file system ensures that data can be recovered in a timely manner when a
misoperation occurs.
 Provides the safe mode.
 HDFS provides a unique safe mode to prevent faults from spreading when a DataNode or hard disk is faulty.
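
Replica recovery starts from the block reports: the NameNode counts how many DataNodes reported each block and compares the count with the target number of replicas. A minimal sketch, with hypothetical node and block names:

```python
from collections import Counter


def under_replicated(block_reports, target):
    """Find blocks with fewer live replicas than the target.

    block_reports maps DataNode name -> set of block ids it reported.
    Returns {block_id: number of replicas still to create}.
    """
    live = Counter()
    for blocks in block_reports.values():
        live.update(blocks)
    return {blk: target - count
            for blk, count in sorted(live.items()) if count < target}


reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1"},
    "dn3": {"blk_1", "blk_2"},
}
missing = under_replicated(reports, target=3)
```

The sketch only sees blocks that at least one DataNode still reports; a block with no surviving replica would have to be detected from the NameNode's own metadata.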

Other Key Design Points of the HDFS
Architecture
 Unified file system:
 HDFS presents itself as one unified file system externally.

 Space reclamation:
 The recycle bin mechanism is provided and the number of replicas can be dynamically set.

 Data organization:
 Data is stored in HDFS by block.

 Access mode:
 Data can be accessed through Java APIs, HTTP, or shell commands.

Common Shell Commands
Type      Command         Description
dfs       -cat            Show the file contents
dfs       -ls             Show a directory listing
dfs       -rm             Delete files
dfs       -put            Upload directories/files to HDFS
dfs       -get            Download directories/files from HDFS
dfs       -mkdir          Create a directory
dfs       -chmod/-chown   Change the permissions (-chmod) or the owner and group (-chown) of files
dfs       …               …
dfsadmin  -safemode       Safe mode operation
dfsadmin  -report         Report service status
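
The commands are invoked as hdfs dfs <option> or hdfs dfsadmin <option>. The sketch below only builds the command lines without running them, so it works without a cluster; the helper name and the paths /demo and local.txt are hypothetical.

```python
import shlex


def hdfs_command(kind, option, *args):
    """Build an argv list such as ['hdfs', 'dfs', '-ls', '/tmp']."""
    if kind not in ("dfs", "dfsadmin"):
        raise ValueError(f"unsupported command type: {kind}")
    return ["hdfs", kind, f"-{option}", *args]


commands = [
    hdfs_command("dfs", "mkdir", "/demo"),
    hdfs_command("dfs", "put", "local.txt", "/demo/"),
    hdfs_command("dfs", "cat", "/demo/local.txt"),
    hdfs_command("dfsadmin", "safemode", "get"),
    hdfs_command("dfsadmin", "report"),
]
lines = [shlex.join(cmd) for cmd in commands]
```

On a machine with a configured HDFS client, each argv list could be passed to subprocess.run to execute the command.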

Summary
 This module describes the basic concepts, application scenarios,
technical architecture, and key features of HDFS.

Quiz
1. What is HDFS and what can it be used for?

2. What are the design objectives of HDFS?

3. Describe the HDFS read and write processes.

More Information
 Training materials:
 http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
 Exam outline:
 http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
 Mock exam:
 http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
 Authentication process:
 http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Thank You
www.huawei.com
