Introduction to Hadoop - Chapter 2
Doug Cutting, co-creator of Hadoop (image)
Why Hadoop?
Data Storage:
• Terabytes to petabytes of data
• Reliability challenges
Why Hadoop?
Analysis:
• Much of Big Data is unstructured. Traditional RDBMS/EDW systems cannot handle it.
HDFS Design
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• Very large
– In this context means files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes of data.
• Streaming data access
– HDFS is built around the idea that the most efficient data processing pattern is a write-
once, read-many-times pattern. A dataset is typically generated or copied from source,
then various analyses are performed on that dataset over time. Each analysis will
involve a large proportion, if not all, of the dataset, so the time to read the whole
dataset is more important than the latency in reading the first record.
• Commodity hardware
– Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to
run on clusters of commodity hardware (commonly available hardware from
multiple vendors) for which the chance of node failure across the cluster is high, at least
for large clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the face of such failure.
HDFS is not a good fit for:
• Low-latency data access
– Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS. Remember, HDFS is
optimized for delivering a high throughput of data, and this may be at the
expense of latency.
• Lots of small files
– Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode. As a rule of thumb, each file, directory, and block takes about 150
bytes. So, for example, if you had one million files, each taking one block, you
would need at least 300 MB of memory. While storing millions of files is
feasible, billions is beyond the capability of current hardware.
• Multiple writers, arbitrary file modifications
– Files in HDFS may be written to by a single writer. Writes are always made at
the end of the file. There is no support for multiple writers, or for
modifications at arbitrary offsets in the file.
HDFS Architecture and Concepts
• To read or write a file in HDFS, the client needs to interact with
NameNode. HDFS applications need a write-once-read-many access
model for files. A file, once created and written, cannot be edited.
• Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units.
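A minimal command-line sketch for inspecting and overriding the block size; the path /user/demo/data.txt and the 256 MB value are illustrative only.
# Print the cluster's configured default block size (in bytes)
hdfs getconf -confKey dfs.blocksize
# Show the block size (%o) and replication factor (%r) of an existing file
hadoop fs -stat "%o %r" /user/demo/data.txt
# Override the block size for a single upload (256 MB = 268435456 bytes)
hadoop fs -D dfs.blocksize=268435456 -put data.txt /user/demo/data.txt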
• NameNode stores metadata, and DataNodes store the actual data. The client interacts with the NameNode to perform any task, as the NameNode is the centerpiece of the cluster.
• There are several DataNodes in the cluster, which store HDFS data on local disk. Each DataNode periodically sends a heartbeat message to the NameNode to indicate that it is alive. It also replicates data to other DataNodes as per the replication factor.
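A short sketch of how an administrator might check DataNode health and adjust replication from the shell; /user/demo/data.txt is an illustrative path.
# List live/dead DataNodes, their capacity, and their last heartbeat time
hdfs dfsadmin -report
# Change the replication factor of a file to 2 (HDFS re-replicates in the background)
hadoop fs -setrep 2 /user/demo/data.txt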
• Secondary NameNode: takes periodic (by default, hourly) checkpoints of the filesystem metadata, not of the data itself. It merges the EditLogs with the fsimage and writes the result to a new fsimage file. If the NameNode fails or crashes, this checkpointed fsimage can be copied to a new machine to restore the namespace (minus any edits made after the last checkpoint).
• Major functions of the Secondary NameNode:
– It merges the EditLogs and fsimage obtained from the NameNode.
– It offloads this CPU- and memory-intensive merge work from the NameNode, which keeps the live metadata in RAM while checkpoints are persisted to disk.
• fsimage: a complete, on-disk snapshot of the filesystem namespace as of the last checkpoint.
• EditLogs: record all recent modifications (for example, those made in the last hour, since the previous checkpoint); they are also persisted on disk.
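A hedged sketch of how these checkpoint files can be inspected; the fsimage/edits file names shown are illustrative, and real names carry transaction IDs.
# Dump an fsimage checkpoint to XML with the Offline Image Viewer
hdfs oiv -p XML -i fsimage_0000000000000001234 -o fsimage.xml
# Dump an edit log segment with the Offline Edits Viewer
hdfs oev -i edits_0000000000000001235-0000000000000001300 -o edits.xml
# Force the NameNode to write a new checkpoint (requires safe mode)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave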
• Block Caching: Whenever a request is made to a DataNode to read a block, the DataNode reads the block from disk. If you know that a block will be read many times, it is a good idea to cache it in memory. Hadoop allows you to cache blocks: you can specify which file (or directory) to cache and for how long, and the blocks are kept in the DataNodes' off-heap cache.
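A minimal sketch of centralized cache management from the command line; the pool name analytics-pool and the path /user/demo/hot-data are illustrative.
# Create a cache pool to group cache directives
hdfs cacheadmin -addPool analytics-pool
# Cache a directory's blocks in DataNode memory for one hour
hdfs cacheadmin -addDirective -path /user/demo/hot-data -pool analytics-pool -ttl 1h
# List the active cache directives
hdfs cacheadmin -listDirectives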
• HDFS Federation: The namenode keeps a reference to every file and block
in the file system in memory, which means that on very large clusters with
many files, memory becomes the limiting factor for scaling.
• HDFS federation allows a cluster to scale by adding namenodes, each of
which manages a portion of the file system namespace. For example, one
namenode might manage all the files rooted under /user, say, and a
second namenode might handle files under /share.
• Under federation, each namenode manages a namespace volume, which
is made up of the metadata for the namespace, and a block pool
containing all the blocks for the files in the namespace.
• Namespace volumes are independent of each other, which means
namenodes do not communicate with one another, and furthermore the
failure of one namenode does not affect the availability of the
namespaces managed by other namenodes.
• Block pool storage is not partitioned, however, so datanodes register with
each namenode in the cluster and store blocks from multiple block pools.
• HDFS High Availability: The combination of replicating namenode
metadata on multiple filesystems and using the secondary namenode to
create checkpoints protects against data loss, but it does not provide high
availability of the filesystem. The namenode is still a single point of failure
(SPOF).
• If it did fail, all clients—including MapReduce jobs—would be unable to
read, write, or list files, because the namenode is the sole repository of
the metadata and the file-to-block mapping. In such an event, the whole
Hadoop system would effectively be out of service until a new namenode
could be brought online.
• To recover from a failed namenode in this situation, an administrator
starts a new primary namenode with one of the filesystem metadata
replicas and configures datanodes and clients to use this new namenode.
• The new namenode is not able to serve requests until it has (i) loaded its
namespace image into memory, (ii) replayed its edit log, and (iii) received
enough block reports from the datanodes to leave safe mode. On large
clusters with many files and blocks, the time it takes for a namenode to
start from cold can be 30 minutes or more.
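A small sketch, assuming an HDFS client shell on the cluster, for checking whether the NameNode has left safe mode after a cold start.
# Check whether the NameNode is still in safe mode
hdfs dfsadmin -safemode get
# Block until the NameNode leaves safe mode (useful in startup scripts)
hdfs dfsadmin -safemode wait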
• Failover: The transition from the active namenode to the standby is
managed by a new entity in the system called the failover controller.
There are various failover controllers, but the default implementation uses
ZooKeeper to ensure that only one namenode is active.
• Each namenode runs a lightweight failover controller process whose job it
is to monitor its namenode for failures (using a simple heartbeating
mechanism) and trigger a failover should a namenode fail.
• Failover may also be initiated manually by an administrator, for example,
in the case of routine maintenance. This is known as a graceful failover,
since the failover controller arranges an orderly transition for both
namenodes to switch roles.
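A hedged sketch of a manually initiated (graceful) failover; the NameNode IDs nn1 and nn2 are placeholders for whatever IDs the cluster's dfs.ha.namenodes setting defines.
# Check which NameNode is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Gracefully fail over from nn1 (active) to nn2 (standby)
hdfs haadmin -failover nn1 nn2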
• Fencing: In the case of an ungraceful failover, however, it is impossible to
be sure that the failed namenode has stopped running.
• For example, a slow network or a network partition can trigger a failover
transition, even though the previously active namenode is still running
and thinks it is still the active namenode.
• The high-availability (HA) implementation goes to great lengths to ensure that
the previously active namenode is prevented from doing any damage and
causing corruption—a method known as fencing.
MapReduce 1 and MapReduce 2 (YARN)
• There are two versions of MapReduce:
– Classic MapReduce (MapReduce 1)
– YARN (MapReduce 2)
• For very large clusters, in the region of 4,000 nodes and higher, the classic MapReduce system begins to hit scalability bottlenecks, so in 2010 a group at Yahoo! began to design the next generation of MapReduce.
• YARN addresses the scalability shortcomings of “classic” MapReduce.
Data Processing: YARN
• YARN is an acronym for Yet Another Resource Negotiator. It handles the
cluster of nodes and acts as Hadoop’s resource management unit.
• YARN/MapReduce 2 was introduced in Hadoop 2.0.
• YARN allocates CPU, memory, and other resources to different applications.
• It is a layer that separates the resource management layer from the processing components layer.
• MapReduce 2 moves resource management (such as the infrastructure to monitor nodes, allocate resources, and schedule jobs) into YARN.
Motivation for YARN:
• Scalability bottleneck caused by having a single JobTracker. According to
Yahoo!, the practical limits of such a design are reached with a cluster of
5000 nodes and 40000 tasks running concurrently.
• The computational resources on each slave node are divided by a cluster administrator into a fixed number of map and reduce slots.
YARN Architecture
YARN Components
• Client: submits MapReduce jobs (or other applications).
• Resource Manager (master): This is the master daemon. It manages the assignment of resources such as CPU, memory, and network bandwidth across the cluster.
• Container: the name given to a package of resources, including RAM, CPU, network, etc.
• Node Manager (slave): This is the slave daemon. It oversees the containers (each a fraction of the Node Manager's resource capacity) running on the cluster nodes and reports resource usage to the Resource Manager.
• Application Master: negotiates with the Resource Manager for resources and runs the application-specific processes (map or reduce tasks) in those containers.
Please note that YARN is a generic framework; it is not only meant to execute MapReduce jobs. It can be used to execute any application, say the main() of a Java application.
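A small sketch of inspecting these components from the command line on a running cluster.
# List the Node Managers registered with the Resource Manager,
# along with their container counts and memory/CPU usage
yarn node -list -all
# List the applications currently known to the Resource Manager
yarn application -list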
Job/Application Flow in YARN
• The client submits a job to YARN.
• The submitted job can be a MapReduce job or any other application/process.
• This job/application is picked up by the Resource Manager.
• Since multiple jobs/applications can be submitted to the Resource Manager, it checks its scheduling algorithm and the available capacity to see whether the submitted job/application can be launched.
• When the Resource Manager finds that it can launch the newly submitted job/application, it allocates a container. A container is the set of resources (CPU, memory, etc.) required to launch the job/application.
• It checks which node can take up this request; once it finds a node, it contacts the appropriate Node Manager.
• The Node Manager then allocates the resources required to execute the job/application and launches the Application Master process within a container.
• The Application Master process is the main process for job/application execution. Note that the Application Master is a framework-specific implementation; the MapReduce framework has its own implementation of the Application Master.
• The Application Master checks whether additional resources or containers are required to execute the job/application. This is the case when we submit a MapReduce job, where multiple mappers and reducers are required to accomplish the job.
• If additional resources are required, the Application Master negotiates with the Resource Manager to allocate those resources/containers. It is the responsibility of the Application Master to execute and monitor the individual tasks of an application/job.
• The request made by the Application Master to the Resource Manager is known as a Resource Request. The request contains the resources required to execute an individual task, plus a locality constraint. The locality constraint is needed because a task should run as close to its data as possible to conserve network bandwidth.
• If there are multiple mappers, there will be multiple task processes running (each in its own container) on multiple nodes. Each of them sends a heartbeat to its Application Master process. This is how the Application Master monitors the individual tasks it launches.
• The Application Master also sends its own heartbeat to the Resource Manager to indicate the status of the job/application execution.
• Once an application's execution is completed, the Application Master for that application is de-registered.
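A hedged end-to-end sketch of this flow from the command line, assuming the stock MapReduce examples jar that ships with Hadoop; the jar path, input/output paths, and application ID are illustrative.
# Submit a MapReduce job to YARN (the client asks the Resource Manager
# to launch an Application Master for it)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/demo/input /user/demo/output
# Watch the application's state as the Application Master reports back
yarn application -list
yarn application -status application_1700000000000_0001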
PIG:
• Pig was developed by Yahoo!. It uses Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring data flows and for processing and analyzing huge data sets.
HIVE:
• Using an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
Mahout:
• Mahout adds machine-learning capability to a system or application. Machine learning, as the name suggests, helps a system to improve itself based on patterns, user/environment interaction, or algorithms.
• It provides libraries for collaborative filtering, clustering, and classification, which are core machine-learning techniques.
Oozie:
• Oozie acts as a scheduler: it schedules jobs and binds them together as a single unit (a workflow).
Apache Spark:
• It is a platform that handles process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph processing, and visualization.
Zookeeper:
• Coordinating and synchronizing the resources and components of Hadoop was a major problem and often resulted in inconsistency. ZooKeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance services.
HDFS Commands
Cat:
Usage: hadoop fs -cat URI [URI …]
copyFromLocal:
Usage: hadoop fs -copyFromLocal <localsrc> URI
copyToLocal:
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Put:
Usage: hadoop fs -put <localsrc> ... <dst>
Rm:
Usage: hadoop fs -rm URI [URI …]
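A few illustrative invocations of the commands above; the paths shown are placeholders.
# Copy a local file into HDFS
hadoop fs -put localfile.txt /user/demo/localfile.txt
# Print the file's contents
hadoop fs -cat /user/demo/localfile.txt
# Copy it back to the local filesystem
hadoop fs -copyToLocal /user/demo/localfile.txt copy.txt
# Delete it from HDFS
hadoop fs -rm /user/demo/localfile.txt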