4.2 HDFS Federation

HDFS Federation

This section covers the HDFS Federation feature and how to configure
and manage a federated cluster.
• HDFS has two main layers:
• Namespace
– Consists of directories, files and blocks.
– It supports all the namespace related file system operations such as
create, delete, modify and list files and directories.
• Block Storage Service, which has two parts:
– Block Management (performed in the Namenode)
• Provides Datanode cluster membership by handling registrations and
periodic heartbeats.
• Processes block reports and maintains location of blocks.
• Supports block related operations such as create, delete, modify and get
block location.
• Manages replica placement, block replication for under-replicated blocks, and
deletes blocks that are over-replicated.
– Storage - is provided by Datanodes by storing blocks on the local file
system and allowing read/write access.
• The prior HDFS architecture allows only a single namespace for the
entire cluster. In that configuration, a single Namenode manages
the namespace. HDFS Federation addresses this limitation by
adding support for multiple Namenodes/namespaces to HDFS.
• In order to scale the name service horizontally,
federation uses multiple independent
Namenodes/namespaces. The Namenodes are
federated; the Namenodes are independent and do
not require coordination with each other. The
Datanodes are used as common storage for blocks by
all the Namenodes. Each Datanode registers with all
the Namenodes in the cluster. Datanodes send
periodic heartbeats and block reports. They also
handle commands from the Namenodes.
• Users may use ViewFs to create personalized
namespace views. ViewFs is analogous to client side
mount tables in some Unix/Linux systems.
• Block Pool
• A Block Pool is a set of blocks that belong to a single
namespace. Datanodes store blocks for all the block
pools in the cluster. Each Block Pool is managed
independently. This allows a namespace to generate
Block IDs for new blocks without the need for
coordination with the other namespaces. A Namenode
failure does not prevent the Datanode from serving
other Namenodes in the cluster.
• A Namespace and its block pool together are called a
Namespace Volume. It is a self-contained unit of
management. When a Namenode/namespace is
deleted, the corresponding block pool at the
Datanodes is deleted. Each namespace volume is
upgraded as a unit during a cluster upgrade.
• ClusterID
• A ClusterID is used to identify all the nodes in the cluster.
When a Namenode is formatted, this identifier is either provided or
auto-generated. The same ID must be used when formatting the other
Namenodes in the cluster.
• Key Benefits
• Namespace Scalability - Federation adds horizontal scaling of the
namespace. Large deployments, or deployments with a lot of small files,
benefit from namespace scaling because more Namenodes can be
added to the cluster.
• Performance - File system throughput is not limited by a single
Namenode. Adding more Namenodes to the cluster scales the file
system read/write throughput.
• Isolation - A single Namenode offers no isolation in a multi user
environment. For example, an experimental application can
overload the Namenode and slow down production critical
applications. By using multiple Namenodes, different categories of
applications and users can be isolated to different namespaces.
• Federation Configuration
• Federation configuration is backward compatible and allows existing
single Namenode configurations to work without any change. The new
configuration is designed such that all the nodes in the cluster have the
same configuration without the need for deploying different
configurations based on the type of the node in the cluster.
• Federation adds a new NameServiceID abstraction. A Namenode and its
corresponding secondary/backup/checkpointer nodes all belong to a
NameServiceID. In order to support a single configuration file, the
Namenode and secondary/backup/checkpointer configuration
parameters are suffixed with the NameServiceID.
• Configuration:
• Step 1: Add the dfs.nameservices parameter to your configuration and
configure it with a list of comma-separated NameServiceIDs. This will be
used by the Datanodes to determine the Namenodes in the cluster.
• Step 2: For each Namenode and Secondary
Namenode/BackupNode/Checkpointer, add the following configuration
parameters, suffixed with the corresponding NameServiceID, to the
common configuration file (an example follows the table):
Daemon – Configuration Parameters
• Namenode – dfs.namenode.rpc-address, dfs.namenode.servicerpc-address,
dfs.namenode.http-address, dfs.namenode.https-address,
dfs.namenode.keytab.file, dfs.namenode.name.dir, dfs.namenode.edits.dir,
dfs.namenode.checkpoint.dir, dfs.namenode.checkpoint.edits.dir
• Secondary Namenode – dfs.namenode.secondary.http-address,
dfs.secondary.namenode.keytab.file
• BackupNode – dfs.namenode.backup.address,
dfs.secondary.namenode.keytab.file
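For example, a minimal hdfs-site.xml fragment for a federated cluster with two name services might look like the following sketch. The NameServiceIDs (ns1, ns2), host names and ports are placeholders chosen for illustration, not values prescribed by HDFS:

<configuration>
  <!-- Step 1: list all NameServiceIDs in the cluster -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <!-- Step 2: Namenode parameters, suffixed with the NameServiceID -->
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>nn-host1:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>snn-host1:9868</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>nn-host2:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>snn-host2:9868</value>
  </property>
</configuration>

With this one file deployed unchanged on every node, the Datanodes discover both Namenodes from dfs.nameservices, which is what makes a single common configuration possible.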
Formatting Namenodes
Step 1: Format a Namenode using the following command:
[hdfs]$ $HADOOP_HOME/bin/hdfs namenode -format [-clusterId <cluster_id>]
Choose a unique cluster_id that will not conflict with other clusters in
your environment. If a cluster_id is not provided, a unique one is
auto-generated.
Step 2: Format additional Namenodes using the following command:
[hdfs]$ $HADOOP_HOME/bin/hdfs namenode -format -clusterId <cluster_id>
Note that the cluster_id in Step 2 must be the same as the cluster_id in
Step 1. If they are different, the additional Namenodes will not be part
of the federated cluster.
Upgrading from an older release and configuring federation
Older releases support only a single Namenode. Upgrade the cluster to a
newer release in order to enable federation. During the upgrade you can
provide a ClusterID as follows:
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start namenode -upgrade -clusterId <cluster_ID>
If cluster_id is not provided, it is auto-generated.
Adding a new Namenode to an existing HDFS cluster
Perform the following steps:
•Add dfs.nameservices to the configuration.
•Update the configuration with the NameServiceID suffix. Configuration
key names changed after release 0.20; you must use the new
configuration parameter names in order to use federation.
•Add the configuration for the new Namenode to the configuration file
(see the sketch after these steps).
•Propagate the configuration file to all the nodes in the cluster.
•Start the new Namenode and its Secondary/Backup node.
•Refresh the Datanodes to pick up the newly added Namenode by
running the following command against all the Datanodes in the
cluster:
[hdfs]$ $HADOOP_HOME/bin/hdfs dfsadmin -refreshNamenodes <datanode_host_name>:<datanode_ipc_port>
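As a sketch, assuming the existing cluster has name services ns1 and ns2 and the new Namenode runs on a host called nn-host3 (all names chosen for illustration), the configuration change consists of extending dfs.nameservices and adding the suffixed parameters for the new NameServiceID:

  <!-- extend the list of name services with the new NameServiceID -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2,ns3</value>
  </property>
  <!-- parameters for the new Namenode, suffixed with its NameServiceID -->
  <property>
    <name>dfs.namenode.rpc-address.ns3</name>
    <value>nn-host3:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns3</name>
    <value>nn-host3:9870</value>
  </property>

After the configuration has been propagated and the new Namenode formatted with the existing ClusterID, the dfsadmin -refreshNamenodes command above makes each Datanode register with the new Namenode.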
Managing the cluster
Starting and stopping cluster
To start the cluster run the following command:
[hdfs]$ $HADOOP_HOME/sbin/start-dfs.sh
To stop the cluster run the following command:
[hdfs]$ $HADOOP_HOME/sbin/stop-dfs.sh
These commands can be run from any node
where the HDFS configuration is available. The
command uses the configuration to determine the
Namenodes in the cluster and then starts the
Namenode process on those nodes. The
Datanodes are started on the nodes specified in
the workers file. The script can be used as a
reference for building your own scripts to start
and stop the cluster.
• Balancer
• The Balancer has been changed to work with multiple
Namenodes. The Balancer can be run using the command:
• [hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start balancer [-policy <policy>]
• The policy parameter can be any of the following:
• datanode - this is the default policy. This balances the
storage at the Datanode level. This is similar to the balancing
policy of prior releases.
• blockpool - this balances the storage at the block pool level
which also balances at the Datanode level.
• Note that Balancer only balances the data and does not
balance the namespace.
• Decommissioning
• Decommissioning is similar to prior releases. The nodes that need
to be decommissioned are added to the exclude file at all of the
Namenodes. Each Namenode decommissions its Block Pool. When
all the Namenodes finish decommissioning a Datanode, the
Datanode is considered decommissioned.
• Step 1: To distribute an exclude file to all the Namenodes, use the
following command:
• [hdfs]$ $HADOOP_HOME/sbin/distribute-exclude.sh <exclude_file>
• Step 2: Refresh all the Namenodes to pick up the new exclude file:
• [hdfs]$ $HADOOP_HOME/sbin/refresh-namenodes.sh
• The above command uses HDFS configuration to determine the
configured Namenodes in the cluster and refreshes them to pick up
the new exclude file.
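• For reference, the exclude file itself is a plain text file with one Datanode host name per line, and each Namenode locates it through the dfs.hosts.exclude property in its configuration. The path below is an assumption for illustration:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.exclude</value>
    <!-- file listing the Datanodes to be decommissioned, one host per line -->
  </property>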
• Cluster Web Console
• Similar to the Namenode status web page, when using federation a
Cluster Web Console is available to monitor the federated cluster
at http://<any_nn_host:port>/dfsclusterhealth.jsp. Any Namenode
in the cluster can be used to access this web page.
• The Cluster Web Console provides the following information:
• A cluster summary that shows the number of files, number of
blocks, total configured storage capacity, and the available and
used storage for the entire cluster.
• A list of Namenodes and a summary that includes the number of
files, blocks, missing blocks, and live and dead data nodes for each
Namenode. It also provides a link to access each Namenode’s web
UI.
• The decommissioning status of Datanodes.
MapReduce Insights
• Restricted key-value model
– Same fine-grained operation (Map &
Reduce) repeated on big data
– Operations must be deterministic
– Operations must be idempotent/no
side effects
– Only communication is through the
shuffle
– Operation (Map & Reduce) output
saved (on disk)
What is MapReduce Used
For?
• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

• At Facebook:
– Data mining
– Ad optimization
– Spam detection
MapReduce Pros
• Distribution is completely transparent
– Not a single line of distributed programming (ease,
correctness)

• Automatic fault-tolerance
– Determinism enables running failed tasks somewhere else
again
– Saved intermediate data enables just re-running failed
reducers

• Automatic scaling
– As operations are side-effect free, they can be distributed
to any number of machines dynamically

• Automatic load-balancing
– Move tasks and speculatively execute duplicate copies of
slow tasks (stragglers)
MapReduce Cons
• Restricted programming model
– Not always natural to express problems
in this model
– Low-level coding necessary
– Little support for iterative jobs (lots of
disk access)
– High-latency (batch processing)

• Addressed by follow-up research
– Pig and Hive for high-level coding
– Spark for iterative and low-latency jobs
Hadoop YARN Architecture
• YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in
Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at
the time of its launch, but it has since evolved into a large-scale
distributed operating system used for Big Data processing.

• The YARN architecture separates the resource management layer from
the processing layer. With YARN, the responsibilities of the Hadoop 1.0 Job
Tracker are split between the Resource Manager and the per-application
Application Master.

• YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to
run and process data stored in HDFS (Hadoop Distributed File System)
thus making the system much more efficient. Through its various
components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite
necessary to manage the available resources properly so that every
application can leverage them.
• YARN Features: YARN gained popularity because of the
following features:
• Scalability: The scheduler in the Resource Manager of the YARN
architecture allows Hadoop to extend to and manage
thousands of nodes and clusters.
• Compatibility: YARN supports existing MapReduce
applications without disruption, making it compatible
with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the
cluster in Hadoop, which enables optimized cluster
utilization.
• Multi-tenancy: It allows multiple engines to access the same
cluster, giving organizations the benefit of multi-tenancy.
• The main components of YARN architecture include:
• Client: It submits map-reduce jobs.
• Resource Manager: It is the master daemon of YARN and is responsible for resource assignment
and management among all the applications. Whenever it receives a processing request, it
forwards it to the corresponding node manager and allocates resources for the completion of the
request accordingly. It has two major components:
– Scheduler: It performs scheduling based on the requirements of the applications and the available resources. It is a pure
scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a
restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to
partition the cluster resources (see the configuration sketch after this list).
– Application Manager: It is responsible for accepting the application and negotiating the first container from
the Resource Manager. It also restarts the Application Master container if a task fails.

• Node Manager: It takes care of an individual node in the Hadoop cluster and manages applications and
workflow on that particular node. Its primary job is to keep up with the Resource Manager. It
monitors resource usage, performs log management and also kills a container based on directions
from the Resource Manager. It is also responsible for creating the container process and starting it on
the request of the Application Master.

• Application Master: An application is a single job submitted to the framework. The Application
Master is responsible for negotiating resources with the Resource Manager, tracking the status
and monitoring the progress of a single application. The Application Master asks the Node Manager
to launch a container by sending it a Container Launch Context (CLC), which includes everything the
application needs to run. Once the application is started, it sends health reports to the Resource
Manager from time to time.
• Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node.
A container is launched via the Container Launch Context (CLC), which is a record that contains
information such as environment variables, security tokens, dependencies, etc.
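To connect these components to configuration, a minimal yarn-site.xml sketch is shown below. The Resource Manager host name is a placeholder, and the scheduler class selects the Capacity Scheduler mentioned above (the Fair Scheduler can be substituted):

<configuration>
  <!-- host running the Resource Manager (the YARN master daemon) -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host</value>
  </property>
  <!-- pluggable scheduler: the Capacity Scheduler in this sketch -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <!-- auxiliary service the Node Managers run for the MapReduce shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>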
Application workflow in Hadoop YARN:

• The client submits an application
• The Resource Manager allocates a container to start the Application Master
• The Application Master registers itself with the Resource Manager
• The Application Master negotiates containers from the Resource Manager
• The Application Master notifies the Node Manager to launch containers
• Application code is executed in the container
• The client contacts the Resource Manager/Application Master to monitor the application's status
• Once the processing is complete, the Application Master un-registers with the Resource Manager
