Unit - 2
HADOOP
Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is written primarily in Java, with some native code in C and shell scripts.
Hadoop is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
• It all started in the year 2002 with the Apache Nutch project.
• In 2002, Doug Cutting and Mike Cafarella were working on the Apache Nutch project, which aimed at building a web search engine that would crawl and index websites.
• After a lot of research, Mike Cafarella and Doug Cutting estimated that it
would cost around $500,000 in hardware with a monthly running cost of
$30,000 for a system supporting a one-billion-page index.
• This proved to be too expensive and therefore infeasible for indexing billions of webpages, so they started looking for a feasible solution that would reduce the cost.
2003
Meanwhile, in 2003, Google released a paper on the Google File System (GFS) that described its architecture and provided an idea for storing large datasets in a distributed environment. This paper solved the problem of storing the huge files generated as part of the web crawl and indexing process. But this was only half of the solution to their problem.
2004
In 2004, Nutch's developers set about writing an open-source implementation, the Nutch Distributed File System (NDFS).
In the same year, Google introduced MapReduce to the world by releasing a paper on it. This paper provided the solution for processing those large datasets and gave the Nutch developers the other half of their solution.
Google provided the ideas for distributed storage and MapReduce. The Nutch developers implemented MapReduce in the middle of 2004.
2006
The Apache community realized that the implementations of MapReduce and NDFS could be used for other tasks as well. In February 2006, they were moved out of Nutch to form an independent subproject of Lucene called "Hadoop" (named after the yellow toy elephant belonging to Doug Cutting's son).
As the Nutch project was limited to clusters of 20 to 40 nodes, Doug Cutting joined Yahoo in 2006 to scale the Hadoop project to clusters of thousands of nodes.
2007
In 2007, Yahoo started using Hadoop on a 1,000-node cluster.
2008
In January 2008, Hadoop confirmed its success by becoming a top-level project at Apache.
By this time, many other companies, such as Last.fm, Facebook, and the New York Times, had started using Hadoop.
Hadoop Distributed File System (HDFS)
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file
system designed to run on commodity hardware.
HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets; it enables streaming access to file system data in Apache Hadoop.
Nodes: An HDFS cluster is typically formed by master and slave nodes.
1. NameNode (Master Node):
1. Manages the file system metadata and knows which DataNode holds which blocks.
2. It should be deployed on reliable hardware with a high-end configuration, not on commodity hardware.
2. DataNode (Slave Node):
1. Actual worker nodes, which do the actual work such as reading, writing, and processing data.
2. They also perform block creation, deletion, and replication upon instruction from the master.
HDFS is the primary component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
HDFS consists of two core components:
NameNode
DataNode
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce():
Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates key-value pair results, which are later processed by the Reduce() method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
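As an illustration, a minimal word-count Mapper and Reducer written against the standard Hadoop Java MapReduce API might look like the sketch below (the class names are chosen for this sketch, and the Hadoop libraries are assumed to be on the classpath):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): splits each input line into words and emits a (word, 1) pair per word.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // key-value pair handed to the shuffle
        }
    }
}

// Reduce(): sums the counts produced by the map phase for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);            // (word, total count)
    }
}

Here Map() emits a (word, 1) pair for every word it sees, and Reduce() aggregates the pairs for each word into a single total.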
YARN (Yet Another Resource Negotiator), as the name implies, helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
YARN consists of three major components:
Resource Manager
Node Manager
Application Master
HOW DOES HDFS WORK
• The way HDFS works is by having a main "NameNode" and multiple "DataNodes" on a commodity hardware cluster.
• All the nodes are usually organized within the same physical rack in the data center. Data is then broken down into separate "blocks" that are distributed among the various DataNodes for storage. Blocks are also replicated across nodes to reduce the likelihood of failure.
• The NameNode is the "smart" node in the cluster. It knows exactly which DataNode contains which blocks and where the DataNodes are located within the machine cluster.
• The NameNode also manages access to the files, including reads, writes, creates, deletes, and replication of data blocks across the different DataNodes.
The NameNode operates in a "loosely coupled" way with the DataNodes. This means the elements of the cluster can dynamically adapt to the real-time demand for server capacity by adding or subtracting nodes as the system sees fit.
The DataNodes constantly communicate with the NameNode to see whether they need to complete a certain task. This constant communication ensures that the NameNode is aware of each DataNode's status at all times. Since the NameNode assigns tasks to the individual DataNodes, should it realize that a DataNode is not functioning properly, it can immediately reassign that node's task to a different node containing the same data block.
DataNodes also communicate with each other so they can cooperate during normal file operations. Clearly, the NameNode is critical to the whole system and should be replicated to prevent system failure.
Again, data blocks are replicated across multiple DataNodes, and access is managed by the NameNode. This means that when a DataNode no longer sends a "life signal" to the NameNode, the NameNode unmaps the DataNode from the cluster and keeps operating with the other DataNodes as if nothing had happened. When this DataNode comes back to life, or a different (new) DataNode is detected, that new DataNode is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several DataNodes, the failure of one server will not corrupt a file.
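To a client application, all of this block and replica management is invisible: the client simply creates, writes, and reads files through the HDFS API, and the NameNode resolves block locations behind the scenes. Below is a minimal sketch using the standard Java FileSystem API; the NameNode address and file path are placeholders invented for this example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create and write a file: the NameNode picks the DataNodes,
        // and the client streams the blocks to them.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back: block locations are resolved by the NameNode transparently.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}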
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should support hundreds of nodes per cluster to manage applications having huge datasets. HDFS accommodates applications that have data sets typically gigabytes to terabytes in size.
Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
Access to streaming data − HDFS is intended more for batch processing than interactive use, so the emphasis in the design is on high data throughput rates, which accommodate streaming access to data sets.
Coherence model − Applications that run on HDFS are required to follow the write-once-read-many approach. So a file, once created, need not be changed; however, it can be appended to and truncated.
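A short sketch of this write-once-read-many model using the Java FileSystem API is shown below; it assumes append is enabled on the cluster, and the file path is again a made-up example:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherenceModelSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/demo/events.log");        // hypothetical path

        // Write once: existing bytes in an HDFS file are never modified in place.
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.write("event-1\n".getBytes(StandardCharsets.UTF_8));
        }

        // New data may only be added at the end of the file, via append.
        try (FSDataOutputStream out = fs.append(log)) {
            out.write("event-2\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}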
Analysing data with Hadoop
Analyzing data with Hadoop involves using its ecosystem components to process, store, and analyze
large datasets. Here's an overview of the process:
Apache Spark
Apache Spark is an open-source processing engine that is designed for ease of analytics operations. It is a cluster computing platform that is designed to be fast and made for general-purpose use. Spark is designed to cover various batch applications, machine learning, streaming data processing, and interactive queries.
Features of Spark:
In-memory processing
Tight integration of components
Easy and inexpensive
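As a rough sketch of what a simple Spark batch job looks like through Spark's Java API (the local master URL, input file name, and filter condition are illustrative assumptions, and the spark-core dependency is required):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkAnalysisSketch {
    public static void main(String[] args) {
        // Local mode for illustration; on a cluster the master URL would differ.
        SparkConf conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load a text file (an hdfs:// path works the same way)
            // and process it in memory, in parallel.
            JavaRDD<String> lines = sc.textFile("input.txt");
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("Lines containing ERROR: " + errors);
        }
    }
}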
Map Reduce
MapReduce is just like an algorithm or a data structure that is based on the YARN framework.
The primary feature of MapReduce is to perform distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop work so fast, because when we are dealing with Big Data, serial processing is no longer of any use.
Features of Map-Reduce:
Scalable
Fault Tolerance
Parallel Processing
Tunable Replication
Load Balancing
Apache Hive
Apache Hive is a data warehousing tool that is built on top of Hadoop; data warehousing is nothing but storing, at a fixed location, data generated from various sources.
Hive is one of the best tools used for data analysis on Hadoop. Anyone who has knowledge of SQL can comfortably use Apache Hive. The query language of Hive is known as HQL or HiveQL.
Features of Hive:
Queries are similar to SQL queries.
Hive supports different storage types: HBase, ORC, plain text, etc.
Hive has built-in functions for data mining and other work.
Hive can operate on compressed data stored inside the Hadoop ecosystem.
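Because HQL looks like SQL, queries can also be issued programmatically through the HiveServer2 JDBC driver. The sketch below assumes a running HiveServer2 instance, the hive-jdbc dependency on the classpath, and a hypothetical web_logs table; the host, port, and table are not from the original notes:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are placeholders.
        String url = "jdbc:hive2://hiveserver-host:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             // HQL reads just like SQL.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}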
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Hadoop Streaming can be used with languages such as Python, Java, PHP, Scala, Perl, UNIX shell, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. It uses Unix streams as the interface between Hadoop and our MapReduce program, so any language that can read from standard input and write to standard output can be used to write the MapReduce program.
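For instance, a streaming mapper is just any executable that reads lines from standard input and writes tab-separated key-value pairs to standard output. A minimal word-count mapper written as a plain Java program might look like the sketch below (the class name is ours; the compiled program would be supplied to the hadoop-streaming jar via the -mapper option):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// A streaming mapper: reads raw text lines on stdin and
// emits "word<TAB>1" records on stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // key <TAB> value
                }
            }
        }
    }
}

The matching reducer would read the sorted key-value lines from standard input, sum the counts for each word, and print the totals to standard output.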
Features of Hadoop Streaming
Some of the key features associated with Hadoop Streaming are as follows:
Hadoop Streaming is a part of the Hadoop distribution.
It facilitates the easy writing of MapReduce programs and code.
Hadoop Streaming supports almost all types of programming languages, such as Python, C++, Ruby, Perl, etc.
Data formats used in Hadoop
Text files
A text file is the most basic and human-readable file format. It can be read or written in any programming language and is mostly delimited by commas or tabs.
The text file format consumes more space when a numeric value needs to be stored as a string. It is also difficult to represent binary data, such as an image, in a text file.
Sequence File
The sequence file format can be used to store an image in binary format. Sequence files store key-value pairs in a binary container format and are more efficient than text files. However, sequence files are not human-readable.
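As a rough illustration, binary payloads such as image bytes can be written as key-value pairs with Hadoop's SequenceFile writer; the output path and the dummy byte array below are invented for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/images.seq");      // hypothetical output path

        // Write (file name, binary payload) pairs into a binary container file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] imageBytes = {0x10, 0x20, 0x30};          // stand-in for real image bytes
            writer.append(new Text("photo-001.jpg"), new BytesWritable(imageBytes));
        }
    }
}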
Avro Files
The Avro file format has efficient storage due to optimized binary encoding. It is widely supported both inside and outside the Hadoop ecosystem.
The Avro file format is ideal for long-term storage of important data. It can be read and written in many languages, like Java, Scala, and so on. Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution can accommodate changes.
The Avro file format is considered the best choice for general-purpose storage in Hadoop.
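A minimal sketch of writing a record with Avro's generic Java API is shown below; the record schema, field names, and output file are made up for illustration, and the avro library is assumed to be on the classpath:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // The schema is embedded in the data file, so it stays readable later on.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Records are written with Avro's compact, optimized binary encoding.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}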
JSON Records:- JSON is a text format that stores metadata with the data, so it fully supports schema evolution: you can easily add or remove attributes for each datum. However, because it is a text format, it does not support block compression. JSON records are files in which each line is its own JSON datum. In the case of JSON files, the metadata is stored with the data and the file is also splittable, but again it does not support block compression.
MapReduce framework
MapReduce is a programming model and framework used for processing large datasets in a distributed and parallel manner. It is a core component of the Hadoop ecosystem, enabling efficient big data analysis by breaking tasks down into "map" and "reduce" phases.
•Distributed Processing: MapReduce allows you to process data across a cluster of computers (nodes) instead of relying on a single machine.
•Parallelism: The framework automatically handles the distribution and parallel execution of tasks, making it efficient for large datasets.
•Fault Tolerance: MapReduce is designed to be robust, meaning it can continue processing even if some nodes in the cluster fail.
At the highest level, there are five independent entities:
•The client, which submits the MapReduce job.
•The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
•The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
•The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
•The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
Job Submission:
•The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it.
•Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report.
•When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
•Asks the resource manager for a new application ID, which is used for the MapReduce job ID.
•Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
•Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
•Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID.
•Submits the job by calling submitApplication() on the resource manager.
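Tying these steps together, a typical driver program that configures and submits a job might look like the sketch below. It reuses the TokenizerMapper and IntSumReducer classes sketched earlier in this unit and takes the input and output paths from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);        // job JAR copied to the shared filesystem
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setNumReduceTasks(2);                        // sets the mapreduce.job.reduces property
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // used to compute the input splits
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        // waitForCompletion() submits the job and polls its progress once per second.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}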
Job Initialization:
•When the resource manager receives a call to its submitApplication() method, it hands off the
request to the YARN scheduler.
•The scheduler allocates a container, and the resource manager then
launches the application master’s process there, under the node
manager’s management.
•The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
•It initializes the job by creating a number of bookkeeping objects to
keep track of the job’s progress, as it will receive progress and
completion reports from the tasks.
•It retrieves the input splits computed in the client from the shared
filesystem.
•It then creates a map task object for each split, as well as a number of
reduce task objects determined by the mapreduce.job.reduces property (set by
the setNumReduceTasks() method on Job).
Task Assignment:
•If the job does not qualify for running as an uber task (a small job that the application master chooses to run in its own JVM), then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
•Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start.
•Requests for reduce tasks are not made until 5% of the map tasks have completed.
Task Execution:
•Once a task has been assigned resources for a container on a
particular node by the resource manager’s scheduler, the application
master starts the container by contacting the node manager.
•The task is executed by a Java application whose main class is
YarnChild. Before it can run the task, it localizes the resources that
the task needs, including the job configuration and JAR file, and
any files from the distributed cache.
•Finally, it runs the map or reduce task.
•Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.
•The Streaming task communicates with the process (which may be written in any language) using standard input and output streams.
•During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
•From the node manager's point of view, it is as if the child process ran the map or reduce code itself.
Progress and status updates :
•MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run.
•A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
•When a task is running, it keeps track of its progress (i.e., the proportion of the task completed).
•For map tasks, this is the proportion of the input that has been
processed.
•For reduce tasks, it’s a little more complex, but the system can still
estimate the proportion of the reduce input processed.
It does this by dividing the total progress into three parts,
corresponding to the three phases of the shuffle.
•As the map or reduce task runs, the child process communicates
with its parent application master through the umbilical interface.
•The task reports its progress and status (including counters) back to
its application master, which has an aggregate view of the job, every
three seconds over the umbilical interface.
•The resource manager web UI displays all the running applications
with links to the web UIs of their respective application masters,
each of which displays further details on the MapReduce job,
including its progress.
•During the course of the job, the client receives the latest status
by polling the application master every second (the interval is set
via mapreduce.client.progressmonitor.pollinterval).
Job Completion:
•When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to Successful.
•Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method.
•Finally, on job completion, the application master and the task containers clean up their working state, and the OutputCommitter's commitJob() method is called.
•Job information is archived by the job history server to enable later
interrogation by users if desired.