Module III Hadoop Framework
Distributed File Systems -HDFS concepts - Map Reduce Execution, Algorithms using Map
Reduce, Matrix-Vector Multiplication – Hadoop YARN.
A distributed file system (DFS) is a file system whose data is spread across multiple file servers and locations. It permits programs to access and store remote data in the same way as local files, and it allows a user to access files from any system. Network users can share information and files in a regulated and permitted manner, while the servers retain complete control over the data and decide access permissions for the users.
The primary goal of a DFS is to enable users of physically distributed systems to share resources and information through the Common File System (CFS). It runs as a part of the operating system, and its typical configuration is a set of workstations and mainframes connected by a LAN. The process of creating a namespace in DFS is transparent to the clients.
DFS has two components in its services, and these are as follows:
1. Local Transparency - achieved through the namespace component, which presents the clients with a single, unified folder structure regardless of where the data physically resides.
2. Redundancy - achieved through the file replication component, which keeps copies of the data on multiple servers.
In the case of failure or heavy load, these components work together to increase data availability
by allowing data from multiple places to be logically combined under a single folder known as
the "DFS root".
Features of DFS
1. Structure Transparency
The client does not need to know the number or location of file servers and storage devices. Multiple file servers should be provided for adaptability, dependability, and performance.
2. Naming Transparency
There should be no hint of the file's location in its name. The file name should not change when the file is moved from one node to another.
3. Access Transparency
Local and remote files must be accessible in the same way. The file system must automatically locate the accessed file and deliver it to the client.
4. Replication Transparency
When a file is replicated across multiple nodes, the existence of the copies and their locations must be hidden from the clients.
A DFS can be implemented in two ways, and these are as follows:
1. Standalone DFS namespace
It does not use Active Directory and only permits DFS roots that exist on the local computer. A standalone DFS can only be accessed on the computer on which it is created. It offers no fault tolerance and cannot be linked to any other DFS.
2. Domain-based DFS namespace
It stores the DFS configuration in Active Directory and creates the namespace root at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.
DFS namespace
Traditional file shares that are tied to a single server use SMB paths of the form:
\\<SERVER>\<path>\<subpath>
Domain-based DFS file share paths are distinguished by using the domain name in place of the server name, in the form:
\\<DOMAIN.NAME>\<dfsroot>\<path>
When users access such a share, either directly or by mapping a drive, their computer connects to one of the available servers associated with that share, based on rules defined by the network administrator. For example, the default behavior is for users to access the server nearest to them; however, this can be changed to prefer a certain server.
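For illustration, with a hypothetical server FILESERVER01 and a hypothetical domain corp.example.com, the same shared folder might be reached in either of the following ways:
\\FILESERVER01\Shared\Reports (traditional share tied to a single server)
\\corp.example.com\dfsroot\Reports (domain-based DFS path, resolved to whichever server currently hosts that folder)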
There are several applications of the distributed file system. Some of them are as follows:
Hadoop
Hadoop is a collection of open-source software services. It is a software framework that uses the MapReduce programming style to allow distributed storage and processing of large amounts of data. Hadoop consists of a storage component, known as the Hadoop Distributed File System (HDFS), and a processing component based on the MapReduce programming model.
NFS (Network File System)
It is a client-server architecture that enables a computer user to store, update, and view files remotely. It is one of several DFS standards for network-attached storage.
SMB (Server Message Block)
IBM developed the SMB protocol for file sharing. It was designed to permit systems to read and write files on a remote host across a LAN. The directories on the remote host made available through SMB are known as "shares".
NetWare
It is a discontinued computer network operating system developed by Novell, Inc. It mainly used cooperative multitasking to run various services on a computer system and relied on the IPX network protocol.
There are various advantages and disadvantages of the distributed file system. These are as
follows:
Advantages
There are various advantages of the distributed file system. Some of the advantages are as follows:
o It allows multiple users to access and share the same files remotely.
o It improves file availability, access time, and network efficiency.
o Replication keeps data available even if a server or disk fails.
o Storage capacity can be grown by adding more nodes.
Disadvantages
There are various disadvantages of the distributed file system. Some of the disadvantages are as follows:
o Nodes and connections need to be secured, so security is a concern.
o Messages and data may be lost in the network while moving from one node to another.
o Managing a distributed file system is more complicated than managing a single local file system.
o Overloading may occur if all nodes try to send data at once.
HDFS CONCEPTS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability for parallel applications.
It is cost effective because it runs on commodity hardware. It involves the concepts of blocks, data nodes and name nodes.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, a file in HDFS that is smaller than the block size does not occupy the full block's size; for example, a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations such as opening, closing and renaming are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also performs block creation, deletion and replication as instructed by the name node.
The HDFS should be formatted initially and then started in the distributed mode. Typical commands (using the paths from this example) are given below.
To format the name node: $ hdfs namenode -format
To start HDFS: $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:
$ hdfs dfs -copyFromLocal /usr/home/Desktop/data.txt /user/test
o Recursive deleting:
$ hdfs dfs -rm -r <directory>
Example:
$ hdfs dfs -rm -r /user/test
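Beyond the command line, HDFS can also be used programmatically. The following is a minimal Java sketch (not part of the module text) using Hadoop's FileSystem client API; the HDFS path, replication factor and block size shown are illustrative assumptions.

// Minimal sketch: write a file to HDFS with an explicit replication factor and
// block size, then ask the name node where each block of the file is stored.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/test/data.txt");    // assumed HDFS path
        short replication = 3;                          // copies kept on different data nodes
        long blockSize = 128 * 1024 * 1024L;            // 128 MB, the default block size

        // Create the file with the chosen replication factor and block size.
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }

        // List the block locations reported by the name node.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}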
Goals of HDFS
o Handling hardware failure - HDFS comprises many server machines. If any machine fails, the goal of HDFS is to recover from the failure quickly.
o Streaming data access - Applications that run on HDFS need streaming access to their data sets; they are not general-purpose applications that typically run on general-purpose file systems.
o Coherence model - Applications that run on HDFS follow the write-once-read-many approach, so a file, once created, need not be changed. However, it can be appended to and truncated.
Features of HDFS
o Highly scalable - HDFS is highly scalable, as a single cluster can scale to hundreds of nodes.
o Replication - Due to unfavourable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains copies of the data on different machines.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across the nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Map Reduce Execution
Map Reduce is a processing technique and a programming model for distributed computing based on Java. The Map Reduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name Map Reduce implies, the reduce task is always performed after the map job.
The major advantage of Map Reduce is that it is easy to scale data processing over multiple
computing nodes. Under the Map Reduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the Map Reduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
Map Reduce model.
The Algorithm
Generally, the Map Reduce paradigm is based on sending the computation to where the data resides.
A Map Reduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
During a Map Reduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
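As a concrete illustration of the map, shuffle and reduce stages described above, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API. It is not the module's own code: the class names are arbitrary, and the input and output paths are assumed to be passed as command-line arguments.

// Minimal word-count sketch: the map stage emits (word, 1) for every word, the
// framework shuffles and sorts by key, and the reduce stage sums the counts.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words and (word, 1) is emitted.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: the framework has already shuffled and sorted by key, so each
    // call receives one word together with all of its counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // assumed HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // assumed HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}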
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Job tracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple task trackers: act like slaves, each of them performing part of the job
For every job submitted for execution in the system, there is one job tracker that resides on the name node, and there are multiple task trackers that reside on the data nodes.
A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
Execution of an individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
The task tracker's responsibility is to send the progress report to the job tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker so as to notify it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
Terminology
Payload − The applications implement the Map and Reduce functions and form the core of the job.
Mapper − The mapper maps the input key/value pairs to a set of intermediate key/value pairs.
Name Node − The node that manages the Hadoop Distributed File System (HDFS).
Data Node − The node where the data is present in advance, before any processing takes place.
Master Node − The node where the Job Tracker runs and which accepts job requests from clients.
Slave Node − The node where the Map and Reduce programs run.
Job Tracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
Task Tracker − Tracks the task and reports status to the Job Tracker.
Job − An execution of a Mapper and a Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a slave node.
In Map Reduce, bulk tasks are divided into smaller tasks, which are then allotted to many systems.
The two important tasks in the Map Reduce algorithm are:
Map
Reduce
The Map task is always performed first and is then followed by the Reduce job. The Map task converts one data set into another data set, breaking individual elements into tuples.
The Reduce task combines those tuples of data into a smaller set of tuples; it uses the map output as its input.
Input Phase: A record reader translates each record in the input file into key-value pairs and sends them to the mapper.
Map Phase: The map is a user-defined function. It takes a sequence of key-value pairs, processes each of them, and generates zero or more key-value pairs.
Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.
Combiner: A combiner takes the mapper's intermediate keys as input and applies a user-defined function to aggregate the values within the small scope of one mapper.
Shuffle and Sort: Shuffle and sort is the first step of the reducer task. While the reducer is running, it downloads the relevant key-value pairs onto the local machine, where they are sorted by key into a larger data list. The data list groups equal keys together so that their values can be iterated over easily in the reducer task.
Reducer phase: The reducer takes the grouped key-value pairs as input and runs a reducer function on each group, producing zero or more key-value pairs. The data can be aggregated, filtered, and combined in a number of ways, and this may require a wide range of processing.
Output phase: An output formatter translates the final key-value pairs from the reducer function and writes them to a file using a record writer.
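A small illustrative trace of these phases, assuming the word-count job sketched earlier and the two input lines "it is what it is" and "what is it" (the byte offsets used as input keys are illustrative):

Input records (offset, line):   (0, "it is what it is")   (17, "what is it")
Map output:                     (it,1) (is,1) (what,1) (it,1) (is,1)   and   (what,1) (is,1) (it,1)
Combiner output (per mapper):   (is,2) (it,2) (what,1)   and   (is,1) (it,1) (what,1)
Shuffle and sort (grouped):     (is,{2,1})  (it,{2,1})  (what,{1,1})
Reducer output:                 (is,3) (it,3) (what,2)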
Algorithms Using MapReduce
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Indexing
TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.
Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class collects the matching valued keys as a collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs.
The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-value groups (K2, {V2, V2, …}) before they are presented to the Reducer.
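The sort order applied to the intermediate keys can itself be customized. Below is a minimal sketch, assuming the word-count job shown earlier, that uses Hadoop's WritableComparator (an implementation of RawComparator) to invert the natural ordering of Text keys; the class name and the choice of descending order are purely illustrative.

// Minimal sketch of overriding the intermediate-key sort order. Hadoop sorts
// mapper output by key before the reduce phase; a RawComparator, here provided
// through WritableComparator, can change that order.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true);        // true: create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);    // invert the natural (ascending) order
    }
}
// In the job driver (assuming a Job object as in the word-count sketch above):
//     job.setSortComparatorClass(DescendingTextComparator.class);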
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how searching works with the help of an example.
Example
The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because the employee data was imported from all the database tables repeatedly.
The Map phase processes each input file and provides the employee data as key-value pairs (<k, v> : <emp name, salary>).
The combiner phase (the searching technique) will accept the input from the Map phase as key-value pairs with employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee (treated as the max salary)
if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
} else {
    Continue checking;
}
In this way, the combiner produces, for each input file, one key-value pair for the highest-salaried employee in that file.
Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files.
The final output should be as follows −
<gopal, 50000>
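A minimal sketch of this searching idea in Hadoop's Java MapReduce API is given below, assuming each input record is a line of the form "name<TAB>salary"; the record format and class names are assumptions for illustration.

// Minimal sketch of the max-salary search described above (illustrative).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalary {

    // Map phase: emit every record under one constant key so that all
    // (name, salary) pairs meet in a single reduce group.
    public static class SalaryMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text MAX_KEY = new Text("max");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(MAX_KEY, value);   // value is the "name<TAB>salary" line
        }
    }

    // Used as both combiner and reducer: keeps only the highest salary seen.
    // As a combiner it finds the per-mapper (per-file) maximum; as the reducer it
    // finds the overall maximum, which also removes duplicate records.
    public static class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestName = null;
            long bestSalary = Long.MIN_VALUE;   // the first record seen becomes the initial max
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                long salary = Long.parseLong(parts[1].trim());
                if (salary > bestSalary) {
                    bestSalary = salary;
                    bestName = parts[0];
                }
            }
            // Emit in the same "name<TAB>salary" form so the class also works as a combiner.
            context.write(key, new Text(bestName + "\t" + bestSalary));
        }
    }
    // Driver wiring follows the word-count sketch, with
    // job.setCombinerClass(MaxSalaryReducer.class) and job.setReducerClass(MaxSalaryReducer.class).
}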
Indexing
Normally indexing is used to point to particular data and its address. It performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as inverted index. Search
engines like Google and Bing use inverted indexing technique. Let us try to understand how
Indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names, and their content is given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the
term "is" appears in the files T[0], T[1], and T[2].
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the number
of times a word appears in a document divided by the total number of words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF)
It measures the importance of a term. It is calculated by the number of documents in the text
database divided by the number of documents where a specific term appears.
While computing TF, all the terms are considered equally important; that is, TF counts the frequency of ordinary words like "is", "a", "what", etc. just as readily as rare ones. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it), where the logarithm in the example below is taken to base 10.
The algorithm is explained below with the help of a small example.
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1,000 of these. Then, the IDF is calculated as log10(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
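A tiny Java sketch reproducing this arithmetic is given below; the class and method names are arbitrary, and a base-10 logarithm is used so that the IDF comes out as 4, as in the example.

// Tiny sketch reproducing the TF-IDF arithmetic above.
public class TfIdf {
    static double tfIdf(long termCountInDoc, long totalTermsInDoc,
                        long totalDocs, long docsWithTerm) {
        double tf = (double) termCountInDoc / totalTermsInDoc;        // term frequency
        double idf = Math.log10((double) totalDocs / docsWithTerm);   // inverse document frequency
        return tf * idf;
    }

    public static void main(String[] args) {
        // 'hive' appears 50 times in a 1000-word document and in 1,000 of 10,000,000 documents.
        System.out.println(tfIdf(50, 1000, 10_000_000, 1_000));       // ≈ 0.20 (= 0.05 × 4)
    }
}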
Matrix-Vector Multiplication
Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. Notably, it is only defined when the length of the vector equals the number of columns of the matrix. For an m × n matrix M with entries m_ij and a vector v of length n with components v_j, the product x = Mv is a vector of length m whose i-th component is
x_i = m_i1·v_1 + m_i2·v_2 + … + m_in·v_n,   for i = 1, 2, …, m.
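One common way to express this computation in MapReduce, assuming the vector v is small enough to be loaded into memory by every mapper, is for the mapper to emit (i, m_ij × v_j) for each matrix entry and for the reducer to sum the partial products of each row i. The sketch below makes several assumptions for illustration: each input line holds one matrix entry as "i j value" with 1-based indices, and the HDFS path of the vector file (one component per line) is passed in the job configuration under the hypothetical key "vector.path".

// Sketch (illustrative) of matrix-vector multiplication in MapReduce.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixVector {

    // Map: for each matrix entry m_ij emit (i, m_ij * v_j).
    public static class MVMapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
        private final List<Double> v = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load the whole vector from HDFS into memory (the stated assumption).
            Path vectorPath = new Path(context.getConfiguration().get("vector.path"));
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(vectorPath)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    v.add(Double.parseDouble(line.trim()));
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");   // "i j m_ij"
            long i = Long.parseLong(parts[0]);
            int j = Integer.parseInt(parts[1]);                       // 1-based column index
            double mij = Double.parseDouble(parts[2]);
            context.write(new LongWritable(i), new DoubleWritable(mij * v.get(j - 1)));
        }
    }

    // Reduce: x_i is the sum over j of m_ij * v_j.
    public static class MVReducer extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
        @Override
        protected void reduce(LongWritable row, Iterable<DoubleWritable> products, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable p : products) {
                sum += p.get();
            }
            context.write(row, new DoubleWritable(sum));
        }
    }
    // Driver wiring follows the word-count sketch, after calling
    // conf.set("vector.path", "<hdfs path of the vector file>") on the job configuration.
}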
Hadoop YARN
The YARN architecture basically separates the resource management layer from the processing layer. The responsibilities that the Job Tracker had in Hadoop 1.0, namely resource management and job scheduling/monitoring, are split in YARN between the Resource Manager and the Application Master.
YARN Features: YARN gained popularity because of the following features-