4.2 HDFS Federation

HDFS Federation

This section covers the HDFS Federation feature and how to configure
and manage a federated cluster.
• HDFS has two main layers:
• Namespace
– Consists of directories, files and blocks.
– It supports all the namespace related file system operations such as
create, delete, modify and list files and directories.
• Block Storage Service, which has two parts:
– Block Management (performed in the Namenode)
• Provides Datanode cluster membership by handling registrations and
periodic heartbeats.
• Processes block reports and maintains location of blocks.
• Supports block related operations such as create, delete, modify and get
block location.
• Manages replica placement, block replication for under-replicated blocks, and
deletes blocks that are over-replicated.
– Storage - is provided by Datanodes by storing blocks on the local file
system and allowing read/write access.
• The prior HDFS architecture allows only a single namespace for the
entire cluster. In that configuration, a single Namenode manages
the namespace. HDFS Federation addresses this limitation by
adding support for multiple Namenodes/namespaces to HDFS.
• In order to scale the name service horizontally,
federation uses multiple independent
Namenodes/namespaces. The Namenodes are
federated; the Namenodes are independent and do
not require coordination with each other. The
Datanodes are used as common storage for blocks by
all the Namenodes. Each Datanode registers with all
the Namenodes in the cluster. Datanodes send
periodic heartbeats and block reports. They also
handle commands from the Namenodes.
• Users may use ViewFs to create personalized
namespace views. ViewFs is analogous to client side
mount tables in some Unix/Linux systems.
• Block Pool
• A Block Pool is a set of blocks that belong to a single
namespace. Datanodes store blocks for all the block
pools in the cluster. Each Block Pool is managed
independently. This allows a namespace to generate
Block IDs for new blocks without the need for
coordination with the other namespaces. A Namenode
failure does not prevent the Datanode from serving
other Namenodes in the cluster.
• A Namespace and its block pool together are called a
Namespace Volume. It is a self-contained unit of
management. When a Namenode/namespace is
deleted, the corresponding block pool at the
Datanodes is deleted. Each namespace volume is
upgraded as a unit during a cluster upgrade.
• ClusterID
• A ClusterID is used to identify all the nodes in the cluster.
When a Namenode is formatted, this identifier is either provided or
auto-generated. The same ID must be used when formatting the other
Namenodes in the cluster.
• Key Benefits
• Namespace Scalability - Federation adds horizontal scaling of the
namespace. Large deployments, or deployments with a lot of small files,
benefit from namespace scaling because more Namenodes can be
added to the cluster.
• Performance - File system throughput is not limited by a single
Namenode. Adding more Namenodes to the cluster scales the file
system read/write throughput.
• Isolation - A single Namenode offers no isolation in a multi user
environment. For example, an experimental application can
overload the Namenode and slow down production critical
applications. By using multiple Namenodes, different categories of
applications and users can be isolated to different namespaces.
• Federation Configuration
• Federation configuration is backward compatible and allows existing
single Namenode configurations to work without any change. The new
configuration is designed such that all the nodes in the cluster have the
same configuration without the need for deploying different
configurations based on the type of the node in the cluster.
• Federation adds a new NameServiceID abstraction. A Namenode and its
corresponding secondary/backup/checkpointer nodes all belong to a
NameServiceID. In order to support a single configuration file, the
Namenode and secondary/backup/checkpointer configuration
parameters are suffixed with the NameServiceID.
• Configuration:
• Step 1: Add the dfs.nameservices parameter to your configuration and
configure it with a list of comma-separated NameServiceIDs. This will be
used by the Datanodes to determine the Namenodes in the cluster.
• Step 2: For each Namenode and Secondary
Namenode/BackupNode/Checkpointer, add the following configuration
parameters, suffixed with the corresponding NameServiceID, to the
common configuration file (an example follows the table):
Daemon – Configuration Parameters
• Namenode – dfs.namenode.rpc-address, dfs.namenode.servicerpc-address,
dfs.namenode.http-address, dfs.namenode.https-address,
dfs.namenode.keytab.file, dfs.namenode.name.dir, dfs.namenode.edits.dir,
dfs.namenode.checkpoint.dir, dfs.namenode.checkpoint.edits.dir
• Secondary Namenode – dfs.namenode.secondary.http-address,
dfs.secondary.namenode.keytab.file
• BackupNode – dfs.namenode.backup.address,
dfs.secondary.namenode.keytab.file
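For example, a minimal hdfs-site.xml fragment for a federated cluster with two name services might look like the following sketch. The NameServiceIDs (ns1, ns2), host names and ports are placeholders chosen for illustration, not values prescribed by HDFS:

<configuration>
  <!-- Step 1: list all NameServiceIDs in the cluster -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <!-- Step 2: Namenode parameters, suffixed with the NameServiceID -->
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>nn-host1:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>snn-host1:9868</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>nn-host2:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>snn-host2:9868</value>
  </property>
</configuration>

With this one file deployed unchanged on every node, the Datanodes discover both Namenodes from dfs.nameservices, which is what makes a single common configuration possible.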
Formatting Namenodes
Step 1: Format a Namenode using the following command:
[hdfs]$ $HADOOP_HOME/bin/hdfs namenode -format [-clusterId <cluster_id>]
Choose a unique cluster_id that will not conflict with other clusters in
your environment. If a cluster_id is not provided, a unique one is
auto-generated.
Step 2: Format additional Namenodes using the following command:
[hdfs]$ $HADOOP_HOME/bin/hdfs namenode -format -clusterId <cluster_id>
Note that the cluster_id in Step 2 must be the same as the cluster_id in
Step 1. If they are different, the additional Namenodes will not be part
of the federated cluster.
Upgrading from an older release and configuring federation
Older releases support only a single Namenode. Upgrade the cluster to a
newer release in order to enable federation. During the upgrade you can
provide a ClusterID as follows:
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start namenode -upgrade -clusterId <cluster_ID>
If cluster_id is not provided, it is auto-generated.
Adding a new Namenode to an existing HDFS cluster
Perform the following steps:
•Add dfs.nameservices to the configuration.
•Update the configuration with the NameServiceID suffix. Configuration
key names changed after release 0.20; you must use the new
configuration parameter names in order to use federation.
•Add the configuration for the new Namenode to the configuration file
(see the sketch after these steps).
•Propagate the configuration file to all the nodes in the cluster.
•Start the new Namenode and its Secondary/Backup node.
•Refresh the Datanodes to pick up the newly added Namenode by
running the following command against all the Datanodes in the
cluster:
[hdfs]$ $HADOOP_HOME/bin/hdfs dfsadmin -refreshNamenodes <datanode_host_name>:<datanode_ipc_port>
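As a sketch, assuming the existing cluster has name services ns1 and ns2 and the new Namenode runs on a host called nn-host3 (all names chosen for illustration), the configuration change consists of extending dfs.nameservices and adding the suffixed parameters for the new NameServiceID:

  <!-- extend the list of name services with the new NameServiceID -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2,ns3</value>
  </property>
  <!-- parameters for the new Namenode, suffixed with its NameServiceID -->
  <property>
    <name>dfs.namenode.rpc-address.ns3</name>
    <value>nn-host3:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns3</name>
    <value>nn-host3:9870</value>
  </property>

After the configuration has been propagated and the new Namenode formatted with the existing ClusterID, the dfsadmin -refreshNamenodes command above makes each Datanode register with the new Namenode.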
Managing the cluster
Starting and stopping cluster
To start the cluster run the following command:
[hdfs]$ $HADOOP_HOME/sbin/start-dfs.sh
To stop the cluster run the following command:
[hdfs]$ $HADOOP_HOME/sbin/stop-dfs.sh
These commands can be run from any node
where the HDFS configuration is available. The
command uses the configuration to determine the
Namenodes in the cluster and then starts the
Namenode process on those nodes. The
Datanodes are started on the nodes specified in
the workers file. The script can be used as a
reference for building your own scripts to start
and stop the cluster.
• Balancer
• The Balancer has been changed to work with multiple
Namenodes. The Balancer can be run using the command:
• [hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start balancer [-policy <policy>]
• The policy parameter can be any of the following:
• datanode - this is the default policy. This balances the
storage at the Datanode level. This is similar to the balancing
policy of prior releases.
• blockpool - this balances the storage at the block pool level
which also balances at the Datanode level.
• Note that Balancer only balances the data and does not
balance the namespace.
• Decommissioning
• Decommissioning is similar to prior releases. The nodes that need
to be decommissioned are added to the exclude file at all of the
Namenodes. Each Namenode decommissions its Block Pool. When
all the Namenodes finish decommissioning a Datanode, the
Datanode is considered decommissioned.
• Step 1: To distribute an exclude file to all the Namenodes, use the
following command:
• [hdfs]$ $HADOOP_HOME/sbin/distribute-exclude.sh <exclude_file>
• Step 2: Refresh all the Namenodes to pick up the new exclude file:
• [hdfs]$ $HADOOP_HOME/sbin/refresh-namenodes.sh
• The above command uses HDFS configuration to determine the
configured Namenodes in the cluster and refreshes them to pick up
the new exclude file.
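• For reference, the exclude file itself is a plain text file with one Datanode host name per line, and each Namenode locates it through the dfs.hosts.exclude property in its configuration. The path below is an assumption for illustration:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.exclude</value>
    <!-- file listing the Datanodes to be decommissioned, one host per line -->
  </property>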
• Cluster Web Console
• Similar to the Namenode status web page, when using federation a
Cluster Web Console is available to monitor the federated cluster
at http://<any_nn_host:port>/dfsclusterhealth.jsp. Any Namenode
in the cluster can be used to access this web page.
• The Cluster Web Console provides the following information:
• A cluster summary that shows the number of files, number of
blocks, total configured storage capacity, and the available and
used storage for the entire cluster.
• A list of Namenodes and a summary that includes the number of
files, blocks, missing blocks, and live and dead data nodes for each
Namenode. It also provides a link to access each Namenode’s web
UI.
• The decommissioning status of Datanodes.
MapReduce Insights
• Restricted key-value model
– Same fine-grained operation (Map &
Reduce) repeated on big data
– Operations must be deterministic
– Operations must be idempotent/no
side effects
– Only communication is through the
shuffle
– Operation (Map & Reduce) output
saved (on disk)
What is MapReduce Used
For?
• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

• At Facebook:
– Data mining
– Ad optimization
– Spam detection
MapReduce Pros
• Distribution is completely transparent
– Not a single line of distributed programming (ease,
correctness)

• Automatic fault-tolerance
– Determinism enables running failed tasks somewhere else
again
– Saved intermediate data enables just re-running failed
reducers

• Automatic scaling
– As operations are side-effect free, they can be distributed
to any number of machines dynamically

• Automatic load-balancing
– Move tasks and speculatively execute duplicate copies of
slow tasks (stragglers)
MapReduce Cons
• Restricted programming model
– Not always natural to express problems
in this model
– Low-level coding necessary
– Little support for iterative jobs (lots of
disk access)
– High-latency (batch processing)

• Addressed by follow-up research
– Pig and Hive for high-level coding
– Spark for iterative and low-latency jobs
Hadoop YARN Architecture
• YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in
Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at
the time of its launch, but it has since evolved into a large-scale
distributed operating system used for Big Data processing.

• The YARN architecture separates the resource management layer from
the processing layer. With YARN, the responsibilities of the Hadoop 1.0 Job
Tracker are split between the Resource Manager and the per-application
Application Master.

• YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to
run and process data stored in HDFS (Hadoop Distributed File System)
thus making the system much more efficient. Through its various
components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite
necessary to manage the available resources properly so that every
application can leverage them.
• YARN Features: YARN gained popularity because of the
following features:
• Scalability: The scheduler in the Resource Manager of the YARN
architecture allows Hadoop to extend to and manage
thousands of nodes and clusters.
• Compatibility: YARN supports existing MapReduce
applications without disruption, making it compatible
with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the
cluster in Hadoop, which enables optimized cluster
utilization.
• Multi-tenancy: It allows multiple engines to access the same
cluster, giving organizations the benefit of multi-tenancy.
• The main components of YARN architecture include:
• Client: It submits map-reduce jobs.
• Resource Manager: It is the master daemon of YARN and is responsible for resource assignment
and management among all the applications. Whenever it receives a processing request, it
forwards it to the corresponding node manager and allocates resources for the completion of the
request accordingly. It has two major components:
– Scheduler: It performs scheduling based on the requirements of the applications and the available resources. It is a pure
scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a
restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to
partition the cluster resources (see the configuration sketch after this list).
– Application Manager: It is responsible for accepting the application and negotiating the first container from
the Resource Manager. It also restarts the Application Master container if a task fails.

• Node Manager: It takes care of an individual node in the Hadoop cluster and manages applications and
workflow on that particular node. Its primary job is to keep up with the Resource Manager. It
monitors resource usage, performs log management and also kills a container based on directions
from the Resource Manager. It is also responsible for creating the container process and starting it on
the request of the Application Master.

• Application Master: An application is a single job submitted to the framework. The Application
Master is responsible for negotiating resources with the Resource Manager, tracking the status
and monitoring the progress of a single application. The Application Master asks the Node Manager
to launch a container by sending it a Container Launch Context (CLC), which includes everything the
application needs to run. Once the application is started, it sends health reports to the Resource
Manager from time to time.
• Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node.
A container is launched via the Container Launch Context (CLC), which is a record that contains
information such as environment variables, security tokens, dependencies, etc.
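To connect these components to configuration, a minimal yarn-site.xml sketch is shown below. The Resource Manager host name is a placeholder, and the scheduler class selects the Capacity Scheduler mentioned above (the Fair Scheduler can be substituted):

<configuration>
  <!-- host running the Resource Manager (the YARN master daemon) -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host</value>
  </property>
  <!-- pluggable scheduler: the Capacity Scheduler in this sketch -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <!-- auxiliary service the Node Managers run for the MapReduce shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>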
Application workflow in Hadoop YARN:

• The client submits an application
• The Resource Manager allocates a container to start the Application Master
• The Application Master registers itself with the Resource Manager
• The Application Master negotiates containers from the Resource Manager
• The Application Master notifies the Node Manager to launch containers
• Application code is executed in the container
• The client contacts the Resource Manager/Application Master to monitor the application's status
• Once the processing is complete, the Application Master un-registers with the Resource Manager
