4.2 HDFS Federation
In a federated cluster, each Namenode and Secondary Namenode has its own copy of these per-daemon configuration properties, suffixed with the corresponding nameservice ID:
• Namenode: dfs.namenode.rpc-address, dfs.namenode.servicerpc-address, dfs.namenode.http-address, dfs.namenode.https-address, dfs.namenode.keytab.file, dfs.namenode.name.dir, dfs.namenode.edits.dir, dfs.namenode.checkpoint.dir, dfs.namenode.checkpoint.edits.dir
• Secondary Namenode: dfs.namenode.secondary.http-address, dfs.secondary.namenode.keytab.file
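As a minimal sketch of the suffixing scheme, the following sets two of these properties per nameservice using the Hadoop Configuration API (in a real deployment they would live in hdfs-site.xml); the nameservice IDs, hostnames, and ports here are hypothetical:

import org.apache.hadoop.conf.Configuration;

// Sketch: a federated cluster with two Namenodes. Each Namenode gets
// its own copies of the per-Namenode properties listed above, suffixed
// with its nameservice ID. Hosts and ports are hypothetical.
public class FederationConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Register the nameservice IDs of the federated Namenodes.
        conf.set("dfs.nameservices", "ns1,ns2");
        // Per-nameservice copies of the Namenode properties.
        conf.set("dfs.namenode.rpc-address.ns1", "nn-host1:8020");
        conf.set("dfs.namenode.http-address.ns1", "nn-host1:9870");
        conf.set("dfs.namenode.rpc-address.ns2", "nn-host2:8020");
        conf.set("dfs.namenode.http-address.ns2", "nn-host2:9870");
        return conf;
    }
}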
• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail
• At Facebook:
– Data mining
– Ad optimization
– Spam detection
MapReduce Pros
• Distribution is completely transparent
– Not a single line of distributed code has to be written (ease,
correctness); see the WordCount sketch after this list
• Automatic fault-tolerance
– Determinism enables running failed tasks somewhere else
again
– Saved intermediate data enables just re-running failed
reducers
• Automatic scaling
– As operations are side-effect free, they can be distributed
to any number of machines dynamically
• Automatic load-balancing
– Move tasks and speculatively execute duplicate copies of
slow tasks (stragglers)
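To make the transparency claim concrete, here is the canonical WordCount job, essentially the example from the Apache Hadoop MapReduce tutorial: the user writes only map() and reduce(); partitioning, shuffling, fault tolerance, and load balancing are handled entirely by the framework.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic WordCount: no distributed programming appears anywhere in
// the user code; the framework distributes, restarts, and balances.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();           // sum all counts for this word
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}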
MapReduce Cons
• Restricted programming model
– Not always natural to express problems
in this model
– Low-level coding necessary
– Little support for iterative jobs (lots of
disk access)
– High-latency (batch processing)
• YARN allows different data-processing engines, such as graph processing,
interactive processing, and stream processing, as well as batch processing, to
run on and process data stored in HDFS (Hadoop Distributed File System),
thus making the system much more efficient. Through its various
components, it can dynamically allocate resources and schedule
application processing. For large-volume data processing, it is
necessary to manage the available resources properly so that every
application can leverage them.
• YARN Features: YARN gained popularity because of the
following features:
• Scalability: The scheduler in the Resource Manager of the YARN
architecture allows Hadoop to extend to and manage
thousands of nodes and clusters.
• Compatibility: YARN supports existing MapReduce
applications without disruption, thus making it compatible
with Hadoop 1.0 as well.
• Cluster utilization: YARN supports dynamic utilization
of the cluster in Hadoop, which enables optimized cluster
utilization.
• Multi-tenancy: It allows multiple engines to access the cluster,
thus giving organizations the benefit of multi-tenancy.
• The main components of YARN architecture include:
• Client: It submits MapReduce jobs.
• Resource Manager: It is the master daemon of YARN and is responsible for resource assignment
and management among all the applications. Whenever it receives a processing request, it
forwards the request to the corresponding Node Manager and allocates resources for the
completion of the request accordingly. It has two major components:
– Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure
scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a
restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair
Scheduler to partition the cluster resources (a queue-partitioning sketch follows this list).
– Application Manager: It is responsible for accepting the application and negotiating the first container from
the Resource Manager for running the application-specific Application Master. It also restarts the
Application Master container if it fails.
• Node Manager: It takes care of an individual node in a Hadoop cluster and manages applications and
workflow on that particular node. Its primary job is to keep itself up to date with the Resource Manager. It
monitors resource usage, performs log management, and kills a container based on directions
from the Resource Manager. It is also responsible for creating the container process and starting it at
the request of the Application Master.
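To illustrate how a Scheduler plugin partitions cluster resources, here is a minimal sketch that selects the Capacity Scheduler and splits capacity between two queues, expressed via the Hadoop Configuration API (in a real deployment these properties live in yarn-site.xml and capacity-scheduler.xml); the queue names "prod" and "dev" and the 70/30 split are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;

// Sketch: use the Capacity Scheduler and partition the cluster
// between two hypothetical queues, "prod" and "dev".
public class CapacitySchedulerSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Tell the Resource Manager to use the Capacity Scheduler plugin.
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler"
                 + ".capacity.CapacityScheduler");
        // Partition the root queue into two child queues.
        conf.set("yarn.scheduler.capacity.root.queues", "prod,dev");
        // Guarantee 70% of cluster capacity to prod and 30% to dev
        // (hypothetical split for illustration).
        conf.set("yarn.scheduler.capacity.root.prod.capacity", "70");
        conf.set("yarn.scheduler.capacity.root.dev.capacity", "30");
        return conf;
    }
}

Jobs submitted to a queue are then scheduled against that queue's guaranteed share, which is how multiple tenants or engines can share one cluster without starving each other.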