Introduction to Hadoop - Chapter 2

Introduction to Hadoop

• Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware. It provides
massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.
• In a Hadoop cluster, data is distributed to all the nodes of the cluster as it
is being loaded in. The Hadoop Distributed File System (HDFS), which is
written in Java, splits large data files into chunks that are managed by
different nodes in the cluster.
• In addition, each chunk is replicated across several machines, so
that a single machine failure does not make any data unavailable.
• An active monitoring system re-replicates the data in response to
system failures that would otherwise leave some blocks under-replicated.
• Even though the file chunks are replicated and distributed across several
machines, they form a single namespace, so their contents are universally
accessible.
History of Hadoop
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
• It was originally developed to support distribution for the
Nutch search engine project.
• Doug, who was working at Yahoo! at the time and is now Chief Architect
of Cloudera, named the project after his son's toy elephant.
• Cutting's son was 2 years old at the time and just beginning to talk. He called
his beloved stuffed yellow elephant "Hadoop"
• The project was funded by Yahoo!
Why Hadoop?
Data Storage:
• Terabytes to petabytes of data
• Distributed storage and distributed processing
• Reliability challenges
Why Hadoop?
Analysis:
• Much of Big Data is unstructured; traditional RDBMS/EDW systems
cannot handle it.
• Their limited scalability options and architectural limitations make
them unsuitable for Big Data analysis.
Components of Hadoop
Data Storage: HDFS
• HDFS is a Filesystem of Hadoop designed for storing very large files
running on a cluster of commodity hardware.
• It is designed on the principle of storing a small number of large
files rather than a huge number of small files.
• HDFS provides a fault-tolerant storage layer for Hadoop and its
other components. HDFS Replication of data helps us to attain this
feature.
• It stores data reliably, even in the case of hardware failure. It
provides high throughput access to application data by providing
the data access in parallel.
• HDFS consists of two core components i.e.
– Name node
– Data node
NameNode
• It is also known as the Master node. The NameNode does not store actual
data or datasets. It stores metadata: the number of blocks, their
locations, the rack and DataNode on which the data is stored, and
other details. The metadata describes the files and directories of the filesystem.

• This metadata is kept in memory on the master for faster retrieval of
data. A copy of the metadata is kept on the local disk for persistence.
NameNode memory should therefore be sized to the requirement, and the
NameNode is typically an enterprise-class server.

• Tasks of HDFS NameNode
– Manages the file system namespace.
– Regulates clients' access to files.
– Executes file system operations such as naming, closing, and opening
files and directories.
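As an illustration of these namespace tasks, here is a minimal sketch using the Hadoop FileSystem Java API; the cluster URI and paths are placeholders, not values from this text:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: namespace operations (create, rename, list) that a client
// negotiates with the NameNode; the actual bytes go to DataNodes.
public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; adjust to your cluster.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    Path dir = new Path("/user/demo");
    fs.mkdirs(dir);                                            // namespace change recorded by the NameNode
    fs.create(new Path(dir, "a.txt")).close();                 // open and close a file
    fs.rename(new Path(dir, "a.txt"), new Path(dir, "b.txt")); // naming
    for (FileStatus st : fs.listStatus(dir)) {
      System.out.println(st.getPath());                        // metadata served from NameNode memory
    }
    fs.close();
  }
}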
DataNode
• It is also known as the Slave node. The HDFS DataNode is responsible for storing
the actual data in HDFS. The DataNode performs read and write operations as per
the requests of the clients.
• At startup, each Datanode connects to its corresponding Namenode and
does handshaking.
• They perform block creation, deletion, and replication upon instruction
from the NameNode. Once a block is written on a DataNode, it is replicated
to other DataNodes, and the process continues until the required number
of replicas has been created.
• Hadoop enables you to use commodity machines as your data nodes.
• Tasks of HDFS DataNode
– DataNode performs operations like block replica creation, deletion,
and replication according to the instruction of NameNode.
– DataNode manages data storage of the system.
• The name node is responsible for the workings of the data nodes.
• The data nodes read, write, process, and replicate the data. They
also send signals, known as heartbeats, to the name node. These
heartbeats show the status of the data node.
• Replication of the data is performed three times by default. It is
done this way so that if a commodity machine fails, you can replace it
with a new machine that has the same data.
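Since the replication factor defaults to 3, it can also be inspected or changed per file through the client API. A small sketch, with a placeholder file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read and change the replication factor of one file.
public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up dfs.replication from hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/b.txt");       // placeholder path

    short current = fs.getFileStatus(file).getReplication();
    System.out.println("current replication = " + current);

    fs.setReplication(file, (short) 2);             // ask the NameNode to maintain 2 replicas instead
    fs.close();
  }
}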
Daemons of Hadoop
Distributed file management system – HDFS
 Name Node:
Centrally Monitors and controls the whole file
system
 Data Node:
Stores the data and constantly communicates
with Name Node.
 Secondary Name Node:
This just backs up the file system status from
the Name Node periodically.
• HDFS splits the data into multiple blocks, defaulting to a maximum of 128
MB. The default block size can be changed depending on the processing
speed and the data distribution. Let’s have a look at the example below:
• Suppose we have 300 MB of data. It is broken down into blocks of
128 MB, 128 MB, and 44 MB. The final block only occupies the
remaining space, so it does not have to be padded out to 128 MB. This is
how data gets stored in a distributed manner in HDFS.
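As a quick check of that arithmetic, a short sketch that computes how a file of a given size splits into 128 MB blocks:

// Sketch: how a 300 MB file maps to HDFS blocks with the default 128 MB block size.
public class BlockSplit {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;     // 128 MB
    long fileSize  = 300L * 1024 * 1024;     // 300 MB example from the text

    long fullBlocks = fileSize / blockSize;  // 2 full blocks of 128 MB
    long lastBlock  = fileSize % blockSize;  // 44 MB remainder, stored without padding

    System.out.println(fullBlocks + " x 128 MB blocks, last block = "
        + lastBlock / (1024 * 1024) + " MB");
  }
}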
Rack Awareness in Hadoop HDFS
• A rack is a collection of around 40-50 DataNodes connected to
the same network switch. If the switch's network goes down, the whole rack
becomes unavailable. A large Hadoop cluster is deployed across multiple racks.
• Hadoop runs on a cluster of computers spread commonly across many
racks. NameNode places replicas of a block on multiple racks for improved
fault tolerance.
• NameNode tries to place at least one replica of a block in a different rack,
so that even if a complete rack goes down, the system remains highly
available.
• Optimized replica placement distinguishes HDFS from most other distributed file
systems. The purpose of a rack-aware replica placement policy is to
improve data reliability, availability, and network bandwidth utilization.
• In a large Hadoop cluster, there are multiple racks. Each rack consists of
DataNodes. Communication between the DataNodes on the same rack is
more efficient as compared to the communication between DataNodes
residing on different racks.
• To reduce the network traffic during file read/write, NameNode chooses
the closest DataNode for serving the client read/write request.
• NameNode maintains the rack ID of each DataNode to obtain this rack
information. This concept of choosing the closest DataNode based on
rack information is known as Rack Awareness.
• The rack awareness policy says:
– No more than one replica is placed on any one node.
– No more than two replicas are placed on the same rack.
– The number of racks used for block replication should always be
smaller than the number of replicas.
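One way to observe this placement from a client is to ask for the block locations of a file: the API below returns the DataNode hosts and, where rack topology is configured, their topology paths. A sketch with a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which DataNodes (and racks, if topology is configured) hold each block.
public class BlockPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big.dat");      // placeholder path

    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset()
          + "  hosts " + String.join(",", b.getHosts())
          + "  racks " + String.join(",", b.getTopologyPaths()));
    }
    fs.close();
  }
}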
Data Storage: HBASE
• HBase is a NoSQL database which supports all kinds of data
and is thus capable of handling any kind of data in the Hadoop ecosystem. It
is a distributed database that was designed to store
structured data in tables that can have billions of rows and
millions of columns. HBase provides real-time access to read
or write data in HDFS.
• HBase tables can serve as input and output for MapReduce
jobs.
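A minimal sketch of that read/write path using the HBase Java client; the table name "users" and column family "info" are made up for illustration and are assumed to exist already:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: write one cell to an HBase table and read it back in real time.
public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {    // assumed table

      Put put = new Put(Bytes.toBytes("row1"));                        // row key
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
      table.put(put);                                                  // real-time write

      Result result = table.get(new Get(Bytes.toBytes("row1")));       // real-time read
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}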
Data Processing: MapReduce
• It is the core service of Hadoop and acts as its processing layer.
• This programming model is designed for processing large volumes of data
in parallel by dividing the work into a set of independent tasks.
• MapReduce makes use of two functions, Map() and Reduce(), whose
tasks are:
– Map() performs sorting and filtering of data, thereby organizing
it into groups. Map generates key-value pairs, which are later
processed by the Reduce() method.
– Reduce() aggregates the mapped data. In simple terms,
Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
• The first phase is the Map phase, where the data in each split is passed
to a mapping function to produce output values.
• In the shuffle and sort phase, the mapping phase’s output is taken
and grouped into blocks of similar data.
• Finally, the output values from the shuffling phase are aggregated.
It then returns a single output value.
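As a concrete example of these two functions, here is the classic word-count Mapper and Reducer pair, a minimal sketch written against the org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): emits (word, 1) for every word in a line of input.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer it = new StringTokenizer(value.toString());
    while (it.hasMoreTokens()) {
      word.set(it.nextToken());
      ctx.write(word, ONE);                 // intermediate key-value pair
    }
  }
}

// Reduce(): receives (word, [1, 1, ...]) after shuffle/sort and sums the counts.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    ctx.write(key, new IntWritable(sum));   // aggregated final output
  }
}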
Working of MapReduce
• Much of Hadoop's power and efficiency comes from MapReduce, because
processing is done in parallel.
• We only need to express the business logic in the way MapReduce works;
the rest is taken care of by the framework.
• MapReduce programs are written in a particular style influenced by
functional programming constructs, specifically idioms for processing lists of
data. In MapReduce, the input is a list, and the output is again a list.
• Map-Reduce divides the work into small parts, each of which can be done
in parallel on the cluster of servers. A problem is divided into a large
number of smaller problems each of which is processed to give individual
outputs. These individual outputs are further processed to give final
output.
• Hadoop Map-Reduce is scalable and can also be used across many
computers.
What is a MapReduce Job?
• A MapReduce job is an execution of two processing layers, the mapper and
the reducer. It is a unit of work that the client wants to be performed.
• It consists of the input data, the MapReduce program, and configuration
information. So the client needs to submit the input data, write the MapReduce
program, and set the configuration information (provided
during Hadoop setup in the configuration files).
• A task in MapReduce is an execution of a Mapper or a Reducer on a slice
of data. It is also called a Task-In-Progress (TIP), meaning that processing of
data is in progress on either a mapper or a reducer.
• A Task Attempt is a particular instance of an attempt to execute a task on a
node. There is always a possibility that any machine can go down at any time.
• For example, while processing data, if any node goes down, the framework
reschedules the task on some other node. This rescheduling of the task
cannot go on forever; there is an upper limit for it as well. The default
number of task attempts is 4. If a task (mapper or reducer) fails 4 times,
the job is considered a failed job. For high-priority or very large jobs, the
number of task attempts can be increased.
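The attempt limit is configurable per job. A small sketch that raises it through the job configuration (the properties mapreduce.map.maxattempts and mapreduce.reduce.maxattempts control the limit in MapReduce 2; the rest of the job setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: raise the maximum number of attempts for map and reduce tasks
// (default 4) for a high-priority or very large job.
public class AttemptConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 8);      // retry a failed map task up to 8 times
    conf.setInt("mapreduce.reduce.maxattempts", 8);   // same for reduce tasks
    Job job = Job.getInstance(conf, "resilient job");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}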
• The map takes a key/value pair as input. Whether the data is in a
structured or unstructured format, the framework converts the
incoming data into keys and values.
– Key is a reference to the input value.
– Value is the data set on which to operate.
• Map processing:
– A function defined by the user – the user can write custom business logic
according to their needs to process the data.
– It is applied to every value in the input.
• Map produces a new list of key/value pairs:
– The output of Map is called intermediate output.
– It can be of a different type from the input pair.
– The output of the map is stored on the local disk, from where it is shuffled
to the reduce nodes.
• Reduce takes intermediate key/value pairs as input and processes
the output of the mapper. Usually, in the reducer, we do
aggregation or summation-style computation.
– The input given to the reducer is generated by Map (intermediate output).
– The key/value pairs provided to Reduce are sorted by key.
• Reduce processing:
– A function defined by the user – here also the user can write custom business
logic and get the final output.
– An iterator supplies the values for a given key to the Reduce function.
• Reduce produces a final list of key/value pairs:
– The output of Reduce is called the final output.
– It can be of a different type from the input pair.
– The output of Reduce is stored in HDFS.
How Map and Reduce work Together?
• The input data given to the mapper is processed by a user-defined function
written at the mapper. All the required complex business logic is implemented
at the mapper level, so that heavy processing is done by the mappers in
parallel, as the number of mappers is much larger than the number
of reducers.
• The mapper generates an output, which is intermediate data, and this output
goes as input to the reducer.
• This intermediate result is then processed by a user-defined function
written at the reducer, and the final output is generated. Usually, very
light processing is done in the reducer. The final output is stored in HDFS, and
replication is done as usual.
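Wiring the two halves together is the job of a small driver that submits the MapReduce job. A sketch, assuming the TokenizerMapper and IntSumReducer classes sketched earlier and placeholder input/output paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: driver that wires the mapper and reducer into one MapReduce job.
public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);        // heavy lifting, run in parallel
    job.setReducerClass(IntSumReducer.class);         // light aggregation of intermediate data
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}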
Data Locality in MapReduce

• Data locality is the process of moving the computation close
to where the actual data resides on the node, instead of
moving large data to the computation. This minimizes network
congestion and increases the overall throughput of the system.

“Move computation close to the data rather than data
to computation”
Design of HDFS

• HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
• Very large files
– In this context, very large means files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes of data.
• Streaming data access
– HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied from source,
then various analyses are performed on that dataset over time. Each analysis will
involve a large proportion, if not all, of the dataset, so the time to read the whole
dataset is more important than the latency in reading the first record.
• Commodity hardware
– Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to
run on clusters of commodity hardware (commonly available hardware from
multiple vendors) for which the chance of node failure across the cluster is high, at least
for large clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the face of such failure.
HDFS is not a good fit in the following areas:
• Low-latency data access
– Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS. Remember, HDFS is
optimized for delivering a high throughput of data, and this may be at the
expense of latency.
• Lots of small files
– Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode. As a rule of thumb, each file, directory, and block takes about 150
bytes. So, for example, if you had one million files, each taking one block, you
would need at least 300 MB of memory (one million file objects plus one
million block objects, at roughly 150 bytes each). While storing millions of files
is feasible, billions is beyond the capability of current hardware.
• Multiple writers, arbitrary file modifications
– Files in HDFS may be written to by a single writer. Writes are always made at
the end of the file. There is no support for multiple writers, or for
modifications at arbitrary offsets in the file.
HDFS Architecture and concepts
• To read or write a file in HDFS, the client needs to interact with
NameNode. HDFS applications need a write-once-read-many access
model for files. A file, once created and written, cannot be edited.
• Blocks: a block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS
are broken into block-sized chunks, which are stored as independent units.
• The NameNode stores metadata, and the DataNodes store the actual data. The client
interacts with the NameNode for performing any task, as the NameNode is the
centerpiece of the cluster.
• There are several DataNodes in the cluster which store HDFS data in the
local disk. DataNode sends a heartbeat message to NameNode
periodically to indicate that it is alive. Also, it replicates data to other
DataNodes as per the replication factor.
• The Secondary NameNode is used for taking periodic (typically hourly) checkpoints
of the filesystem metadata. If the Hadoop cluster fails or crashes, the checkpoint,
stored in a file named fsimage, can be transferred to a new system to recover
the NameNode state.
• Major functions of the Secondary NameNode:
– It merges the edit logs with the fsimage from the NameNode.
– It periodically reads the metadata from the RAM of the NameNode and writes
it to the hard disk.
• Fsimage contains all the modifications ever made across the Hadoop
cluster since the start of the NameNode (stored on disk).
• EditLogs contain all the recent modifications – perhaps the last hour
(stored in RAM).
• Block Caching: Whenever a request is made to a DataNode to read a
block, the DataNode reads the block from disk. If you know that you
will read a block many times, it is a good idea to cache the block in
memory. Hadoop allows you to cache a block: you can specify which file (or
directory) to cache and for how long (the blocks will be cached in off-heap
caches).
• HDFS Federation: The namenode keeps a reference to every file and block
in the file system in memory, which means that on very large clusters with
many files, memory becomes the limiting factor for scaling.
• HDFS federation allows a cluster to scale by adding namenodes, each of
which manages a portion of the file system namespace. For example, one
namenode might manage all the files rooted under /user, say, and a
second namenode might handle files under /share.
• Under federation, each namenode manages a namespace volume, which
is made up of the metadata for the namespace, and a block pool
containing all the blocks for the files in the namespace.
• Namespace volumes are independent of each other, which means
namenodes do not communicate with one another, and furthermore the
failure of one namenode does not affect the availability of the
namespaces managed by other namenodes.
• Block pool storage is not partitioned, however, so datanodes register with
each namenode in the cluster and store blocks from multiple block pools.
• HDFS High Availability: The combination of replicating namenode
metadata on multiple filesystems and using the secondary namenode to
create checkpoints protects against data loss, but it does not provide high
availability of the filesystem. The namenode is still a single point of failure
(SPOF).
• If it did fail, all clients—including MapReduce jobs—would be unable to
read, write, or list files, because the namenode is the sole repository of
the metadata and the file-to-block mapping. In such an event, the whole
Hadoop system would effectively be out of service until a new namenode
could be brought online.
• To recover from a failed namenode in this situation, an administrator
starts a new primary namenode with one of the filesystem metadata
replicas and configures datanodes and clients to use this new namenode.
• The new namenode is not able to serve requests until it has (i) loaded its
namespace image into memory, (ii) replayed its edit log, and (iii) received
enough block reports from the datanodes to leave safe mode. On large
clusters with many files and blocks, the time it takes for a namenode to
start from cold can be 30 minutes or more.
• Failover: The transition from the active namenode to the standby is
managed by a new entity in the system called the failover controller.
There are various failover controllers, but the default implementation uses
ZooKeeper to ensure that only one namenode is active.
• Each namenode runs a lightweight failover controller process whose job it
is to monitor its namenode for failures (using a simple heartbeating
mechanism) and trigger a failover should a namenode fail.
• Failover may also be initiated manually by an administrator, for example,
in the case of routine maintenance. This is known as a graceful failover,
since the failover controller arranges an orderly transition for both
namenodes to switch roles.
• Fencing: In the case of an ungraceful failover, however, it is impossible to
be sure that the failed namenode has stopped running.
• For example, a slow network or a network partition can trigger a failover
transition, even though the previously active namenode is still running
and thinks it is still the active namenode.
• The high-availability implementation goes to great lengths to ensure that
the previously active namenode is prevented from doing any damage and
causing corruption, a method known as fencing.
• There are two types of Map Reduce
– Classic Map Reduce (Map Reduce 1)
– YARN (Map Reduce 2)
• For very large clusters in the region of 4000 nodes
and higher, the Map Reduce system described in the
previous section begins to hit scalability bottlenecks,
so in 2010 a group at Yahoo! began to design the
next generation of Map Reduce.
• YARN addresses the scalability shortcomings of “classic”
Map Reduce.
Data Processing: YARN
• YARN is an acronym for Yet Another Resource Negotiator. It handles the
cluster of nodes and acts as Hadoop’s resource management unit.
• YARN/MapReduce 2 was introduced in Hadoop 2.0.
• YARN allocates RAM, CPU, and other resources to the different
applications running on the cluster.
• It is a layer that separates the resource management layer from the
processing components layer.
• MapReduce 2 moves resource management (the infrastructure to monitor
nodes, allocate resources, and schedule jobs) into YARN.

Motivation for YARN:
• Scalability bottleneck caused by having a single JobTracker. According to
Yahoo!, the practical limits of such a design are reached with a cluster of
5000 nodes and 40000 tasks running concurrently.
• The computational resources on each slave node are divided by a cluster
administrator into a fixed number of map and reduce slots.
YARN Architecture
YARN Components
• Client: to submit map reduce jobs
• Resource manager (Master) - This is the master daemon. It manages the
assignment of resources such as CPU, memory, and network bandwidth across
the cluster.
• Container: Name given to a package of resources including RAM, CPU, Network
etc.
• Node Manager (Slave): This is the slave daemon. It oversees the containers (each a
fraction of the Node Manager's resource capacity) running on its cluster node and
reports the resource usage to the Resource Manager.
• Application Master: negotiates with the Resource Manager for
resources and runs the application-specific processes (map or reduce tasks) in
those containers.
 Please note that YARN is a generic framework; it is not only meant to execute
Map Reduce jobs. It can be used to execute any application, say the main() of a
Java application.
Job/Application Flow in YARN
• Client submits a Job to YARN.
• The submitted Job can be a Map Reduce Job or any other application/process
• This Job/application is picked up by the Resource Manager
• Since there can be multiple Jobs/applications submitted to the Resource
Manager, the Resource Manager checks the scheduling algorithm and the
available capacity to see whether the submitted Job/Application can be launched
• When the Resource Manager finds that it can launch a newly submitted
Job/Application, it allocates a Container. A Container is a set of resources
(CPU, memory etc.) required to launch the Job/Application
• It checks which Node can take up this request; once it finds a Node, it
contacts the appropriate Node Manager
• Node Manager will then actually allocate the resources required to execute
the Job/application and will then launch Application Master Process within
Container
• Application Master Process is the main process for Job/Application execution.
Please note that Application Master is Framework specific implementation. Map
Reduce Framework has its own implementation of Application Master.
• Application Master will check if additional resources or containers are required to
execute the Job/Application. This is the case when we submit a Map Reduce Job
where Multiple Mappers and Reducers will be required to accomplish the Job.
• If additional resources are required then Application Master will negotiate with
Resource Manager to allocate resources/containers. It will be responsibility of
Application Master to execute and monitor the individual tasks for an
application/job.
• The request made by the Application Master to the Resource Manager is known as a
Resource Request. The request contains the resources required to execute the
individual Task and a location constraint. The location constraint is needed because a
Task should run as close to its data as possible to conserve network
bandwidth.
• If there are multiple Mappers, then there will be multiple task processes
running (each in a container) on multiple Nodes. Each of them will send its
heartbeat to the Application Master Process. This is how the Application
Master monitors the individual Tasks it launches.
• The Application Master will also send its heartbeat signal to the Resource
Manager to indicate the status of Job/Application execution.
• Once any Application execution is completed then Application Master for
that application will be de-registered.
PIG:
• Pig was developed by Yahoo. It works with Pig Latin, a query-based
language similar to SQL.
• It is a platform for structuring the data flow, and for processing and
analyzing huge data sets.

HIVE:
• With the help of an SQL methodology and interface, Hive performs
reading and writing of large data sets. Its query
language is called HQL (Hive Query Language).
Mahout:
• Mahout adds machine learning capability to a system or application. Machine
learning, as the name suggests, helps a system to improve itself based on
patterns, user/environment interaction, or algorithms.
• It provides libraries for collaborative filtering,
clustering, and classification, which are core machine
learning techniques.

Oozie:
• Oozie simply performs the task of a scheduler, thus scheduling jobs and
binding them together as a single unit.
Apache Spark:
• It is a platform that handles processing-intensive tasks
such as batch processing, interactive or iterative real-time
processing, graph processing, and visualization.

Zookeeper:
• Coordination and synchronization among the resources and components of
Hadoop was hard to manage and often resulted in inconsistency. Zookeeper
overcame these problems by providing synchronization, inter-component
communication, grouping, and maintenance.
HDFS Commands
Cat:
Usage: hadoop fs -cat URI [URI …]
copyFromLocal:
Usage: hadoop fs -copyFromLocal <localsrc> URI
copyToLocal:
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Put:
Usage: hadoop fs -put <localsrc> ... <dst>
Rm:
Usage: hadoop fs -rm URI [URI …]
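The same operations are also available programmatically through the FileSystem Java API. A sketch of rough equivalents of put, cat, and rm, with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: programmatic equivalents of "hadoop fs -put", "-cat" and "-rm".
public class FsShellEquivalents {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // -put <localsrc> <dst>
    fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/demo/remote.txt"));

    // -cat URI
    try (FSDataInputStream in = fs.open(new Path("/user/demo/remote.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    // -rm URI
    fs.delete(new Path("/user/demo/remote.txt"), false);  // false = non-recursive
    fs.close();
  }
}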
