
MODULE III HADOOP FRAMEWORK 9

Distributed File Systems -HDFS concepts - Map Reduce Execution, Algorithms using Map
Reduce, Matrix-Vector Multiplication – Hadoop YARN.

Distributed File Systems

A distributed file system (DFS) is a file system whose data is distributed across various file servers and locations. It permits programs to access and store remote data in the same way as local files, and it permits users to access files from any system. It allows network users to share information and files in a regulated, permission-based manner, while the servers retain complete control over the data and decide which users may access it.

DFS's primary goal is to enable users of physically distributed systems to share resources and information through the Common File System (CFS). It is a file system that runs as a part of the operating system, and its typical configuration is a set of workstations and mainframes connected by a LAN. The process of creating a namespace in DFS is transparent to the clients.

DFS has two components in its services, and these are as follows:

1. Local Transparency
2. Redundancy

Local Transparency

It is achieved via the namespace component.

Redundancy

It is achieved via a file replication component.

In the case of failure or heavy load, these components work together to increase data availability
by allowing data from multiple places to be logically combined under a single folder known as
the "DFS root".

There are mainly four types of transparency. These are as follows:

1. Structure Transparency

The client does not need to be aware of the number or location of file servers and storage devices. For structure transparency, multiple file servers must be provided for adaptability, dependability, and performance.
2. Naming Transparency

There should be no hint of the file's location in the file's name. When a file is transferred from one node to another, the file name should not change.

3. Access Transparency

Local and remote files must be accessible in the same way. The file system must automatically locate the accessed file and deliver it to the client.

4. Replication Transparency

When a file is replicated across various nodes, the copies and their locations must be hidden from the clients as the file moves from one node to the next.

Working of Distributed File System

There are two methods of DFS in which they might be implemented, and these are as follows:

1. Standalone DFS namespace


2. Domain-based DFS namespace

Standalone DFS namespace

It does not use Active Directory and only permits DFS roots that exist on the local system. A standalone DFS can only be accessed on the system that created it. It offers no fault tolerance and may not be linked to any other DFS.

Domain-based DFS namespace

It stores the DFS configuration in Active Directory and creates the namespace root at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.

DFS namespace

Traditional file shares, which are linked to a single server, use SMB paths of the form:

\\<SERVER>\<path>\<subpath>

Domain-based DFS file share paths are identified by using the domain name in place of the server's name, in the form:

\\<DOMAIN.NAME>\<dfsroot>\<path>
When users access such a share, either directly or through mapping a disk, their computer connects
to one of the accessible servers connected with that share, based on rules defined by the network
administrator. For example, the default behavior is for users to access the nearest server to them;
however, this can be changed to prefer a certain server.

Applications of Distributed File System

There are several applications of the distributed file system. Some of them are as follows:

Hadoop

Hadoop is a collection of open-source software utilities. It is a software framework that uses the MapReduce programming style to allow distributed storage and processing of large amounts of data. Hadoop is made up of a storage component, known as the Hadoop Distributed File System (HDFS), and a processing component based on the MapReduce programming model.

NFS (Network File System)

NFS uses a client-server architecture that enables a computer user to store, update, and view files remotely. It is one of several DFS standards for network-attached storage.

SMB (Server Message Block)

IBM developed the SMB protocol for file sharing. It was developed to permit systems to read and write files on a remote host across a LAN. The remote host's directories that can be accessed through SMB are known as "shares".

NetWare

NetWare is a discontinued computer network operating system developed by Novell, Inc. It mainly used the IPX network protocol and cooperative multitasking to run multiple services on a single computer system.

CIFS (Common Internet File System)

CIFS is a dialect of SMB. The CIFS protocol is Microsoft's implementation of the SMB protocol.

Advantages and Disadvantages of Distributed File System

There are various advantages and disadvantages of the distributed file system. These are as
follows:
Advantages

There are various advantages of the distributed file system. Some of the advantages are as follows:

1. It allows users to access and store data remotely.
2. It helps to improve access time, network efficiency, and the availability of files.
3. It provides transparency of data even if a server or disk fails.
4. It permits data to be shared remotely.
5. It enhances the ability to change the amount of data and to exchange data.

Disadvantages

There are various disadvantages of the distributed file system. Some of the disadvantages are as
follows:

1. In a DFS, the database connection is complicated.
2. In a DFS, database handling is also more complex than in a single-user system.
3. If all nodes try to transfer data simultaneously, there is a chance of overloading.
4. There is a possibility that messages and data will be lost in the network while moving from one node to another.

HDFS CONCEPTS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failures and high availability to parallel applications.

It is cost effective because it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.

Where to use HDFS


o Very Large Files: Files should be of hundreds of megabytes, gigabytes, or more.
o Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

o Low Latency Data Access: Applications that require very low latency to access the first record should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the first record.
o Lots of Small Files: The name node holds the metadata of all files in memory, so a very large number of small files consumes a disproportionate amount of the name node's memory, which is not feasible.
o Multiple Writes: It should not be used when we have to write to a file multiple times.

HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; for example, a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. The file system operations such as opening, closing, and renaming are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion, and replication as directed by the name node.

(Figure: HDFS DataNode and NameNode)

(Figure: HDFS Read)

(Figure: HDFS Write)

Starting HDFS

The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.

To Format $ hadoop namenode -format

To Start $ start-dfs.sh

HDFS Basic File Operations


1. Putting data into HDFS from the local file system
o First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

o Display the contents of the HDFS folder

$ hadoop fs -ls /user/test


2. Copying data from HDFS to local file system
o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and see that both are the same
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

Recursive deleting

o hadoop fs -rmr <arg>

Example:

o hadoop fs -rmr /user/sonoo/
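
The same basic file operations can also be performed programmatically. The following is a minimal sketch using the Hadoop FileSystem Java API; the fs.defaultFS address and the class name HdfsFileOps are assumptions for illustration, and the paths simply mirror the shell examples above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create the target folder in HDFS.
        fs.mkdirs(new Path("/user/test"));

        // Put data.txt from the local file system into HDFS.
        fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"),
                             new Path("/user/test/data.txt"));

        // List the contents of the HDFS folder.
        for (FileStatus status : fs.listStatus(new Path("/user/test"))) {
            System.out.println(status.getPath());
        }

        // Copy the file back to the local file system.
        fs.copyToLocalFile(new Path("/user/test/data.txt"),
                           new Path("/usr/bin/data_copy.txt"));

        // Recursively delete the HDFS folder (like hadoop fs -rmr).
        fs.delete(new Path("/user/test"), true);
        fs.close();
    }
}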

Goals of HDFS
o Handling hardware failure - HDFS runs on multiple server machines. If any machine fails, the goal of HDFS is to recover from it quickly.
o Streaming data access - Applications that run on HDFS are not general-purpose applications; they require streaming access to their data sets.
o Coherence model - Applications that run on HDFS follow the write-once, read-many approach, so a file once created need not be changed. However, it can be appended to and truncated.

Features of HDFS
o Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains copies of the data on different machines.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Map Reduce Execution

Map Reduce is a processing technique and a programming model for distributed computing based on Java. The Map Reduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name Map Reduce implies, the reduce task is always performed after the map job.
The major advantage of Map Reduce is that it is easy to scale data processing over multiple
computing nodes. Under the Map Reduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the Map Reduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
Map Reduce model.
The Algorithm
 Generally, the Map Reduce paradigm is based on sending the computation to where the data resides.
 A Map Reduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
 During a Map Reduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
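
To make this flow concrete, the following is a minimal job-driver sketch. It is only an illustration: WordCountMapper and WordCountReducer are hypothetical class names (a word-count mapper and reducer of this shape are sketched later in this module), and the input and output HDFS paths are taken from the command line. Hadoop then splits the input, schedules the map and reduce tasks across the cluster, and writes the result back to HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map stage: process the input splits line by line.
        job.setMapperClass(WordCountMapper.class);
        // Reduce stage: combine the intermediate pairs per key.
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input lives in HDFS; the output is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}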
Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)


2. Reduce tasks (Shuffling, Reducing)

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:

1. Job Tracker: Acts like a master (responsible for the complete execution of a submitted job)
2. Multiple Task Trackers: Act like slaves, each of them performing part of the job

For every job submitted for execution in the system, there is one Job Tracker that resides on the Name node, and there are multiple Task Trackers which reside on the Data nodes.

A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.

 It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
 Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
 The task tracker's responsibility is to send the progress report to the job tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker so as to notify it of the current state of the system.
 Thus the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job tracker can reschedule it on a different task tracker.

Terminology
 Payload − Applications implement the Map and the Reduce functions, and they form the core of the job.
 Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
 Name Node − Node that manages the Hadoop Distributed File System (HDFS).
 Data Node − Node where data is present in advance, before any processing takes place.
 Master Node − Node where the Job Tracker runs and which accepts job requests from clients.
 Slave Node − Node where the Map and Reduce programs run.
 Job Tracker − Schedules jobs and tracks the assigned jobs with the Task Tracker.
 Task Tracker − Tracks the task and reports status to the Job Tracker.
 Job − An execution of a Mapper and a Reducer across a dataset.
 Task − An execution of a Mapper or a Reducer on a slice of data.
 Task Attempt − A particular instance of an attempt to execute a task on a Slave Node.

Algorithms using Map Reduce

In Map Reduce, bulk tasks are divided into smaller tasks, which are then allotted to many systems. The two important tasks in the Map Reduce algorithm are

 Map
 Reduce

The map task is always performed first and is then followed by the reduce job. In the map task, one data set is converted into another data set, and individual elements are broken down into tuples.

The reduce task uses the map output as its input and combines those data tuples into a smaller set of tuples.
Input Phase: A record reader transforms every record of the input file into key-value pairs and sends them to the mapper.

Map Phase: A user-defined function that processes a sequence of key-value pairs and generates zero or more intermediate key-value pairs from each of them.

Intermediate Keys: The key-value pairs generated by the mapper are called intermediate keys.

Combiner: The combiner takes the mapper's intermediate key-value pairs as input and applies user-defined code to combine the values within the scope of a single mapper.

Shuffle and Sort: Shuffle and Sort is the first step of the reducer task. The reducer task downloads the relevant intermediate key-value pairs onto the local machine, where they are sorted by key into a larger data list. This data list groups equal keys together so that their values can be iterated easily in the reducer task.

Reducer phase: The reducer processes the grouped key-value data; the data can be combined, filtered, and aggregated in a number of ways, which may require a wide range of processing, and the phase produces zero or more final key-value pairs.
Output phase: An output formatter takes the final key-value pairs from the reducer function and writes them to a file using a record writer.
The Map Reduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as
input by Reducer class, which in turn searches matching pairs and reduces them.
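
As a concrete sketch of such a tokenizing Mapper class and a summing Reducer class, the following word-count example is included for illustration (assuming the input is plain text read line by line; these classes pair with the hypothetical WordCountDriver sketched earlier).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tokenizes each input line and emits (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and sums the counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}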

MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class collects the matching valued keys as a collection.
 To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs.
 The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
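
As a small, hedged illustration of controlling this sort order, a job can plug in its own comparator so that intermediate IntWritable keys reach the Reducer from largest to smallest; DescendingIntComparator below is a hypothetical class name.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Reverses the default IntWritable ordering used in the shuffle-and-sort phase.
public class DescendingIntComparator extends WritableComparator {
    public DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -((IntWritable) a).compareTo((IntWritable) b);
    }
}

// In the driver: job.setSortComparatorClass(DescendingIntComparator.class);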
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how searching works with the help of an example.
Example
The following example shows how MapReduce employs the searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset.
 Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because of importing the
employee data from all database tables repeatedly. See the following illustration.

 The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name, salary>). See the following illustration.

 The combiner phase (searching technique) will accept the input from the Map phase as key-value pairs with employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>

Max = salary of the first employee;   // treated as the max salary so far

// for each subsequent employee record:
if (v(next employee).salary > Max) {
    Max = v(salary);                  // a new highest salary has been found
} else {
    continue checking;                // keep the current maximum
}
The expected result is as follows –

 Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs, which come from the four input files. The final output should be as follows −
<gopal, 50000>
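
A hedged Java sketch of this searching logic is given below. It departs slightly from the illustration above for the sake of runnability: it assumes the mapper emits every record under a single constant key with the value encoded as "name<TAB>salary" text, so that all candidates reach one reduce call, where the maximum is kept. The same class can be set as the combiner to pre-filter each file's records.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives every "name<TAB>salary" value under one constant key
// and keeps only the highest-salaried employee.
class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String bestName = null;
        long bestSalary = Long.MIN_VALUE;
        for (Text value : values) {
            String[] parts = value.toString().split("\t");
            long salary = Long.parseLong(parts[1]);
            if (salary > bestSalary) {      // same comparison as in the snippet above
                bestSalary = salary;
                bestName = parts[0];
            }
        }
        context.write(new Text(bestName), new Text(Long.toString(bestSalary)));
    }
}

// Used as both combiner and reducer: job.setCombinerClass(MaxSalaryReducer.class);
//                                    job.setReducerClass(MaxSalaryReducer.class);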
Indexing
Normally, indexing is used to point to particular data and its address. MapReduce performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the
term "is" appears in the files T[0], T[1], and T[2].
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated as the number of times a word appears in a document divided by the total number of words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF)
It measures the importance of a term. It is calculated as the logarithm of the number of documents in the text database divided by the number of documents where a specific term appears.
While computing TF, all the terms are considered equally important. That means TF counts the term frequency even for common words like "is", "a", "what", etc. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it)
The algorithm is explained below with the help of a small example.
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1,000 of these. Then, the IDF is calculated as log10(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
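The arithmetic in this example can be captured in a small helper method (a sketch only; TfIdf and tfIdf are hypothetical names, and log base 10 is used so that the numbers match the worked example above).

public class TfIdf {
    // TF-IDF for a single term, matching the 'hive' example above:
    // tfIdf(50, 1000, 10_000_000, 1_000) = 0.05 * 4.0 = 0.20
    public static double tfIdf(long termCountInDoc, long totalTermsInDoc,
                               long totalDocs, long docsContainingTerm) {
        double tf = (double) termCountInDoc / totalTermsInDoc;
        double idf = Math.log10((double) totalDocs / docsContainingTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        System.out.println(tfIdf(50, 1000, 10_000_000L, 1_000L)); // prints approximately 0.2
    }
}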
Matrix-Vector Multiplication
Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. Notably, matrix-vector multiplication is only defined between a matrix and a vector where the length of the vector equals the number of columns of the matrix. For an m × n matrix A and an n-dimensional vector x, the product Ax is the m-dimensional vector whose i-th element is defined as follows:

(Ax)_i = Σ_{j=1..n} A_ij x_j,   for i = 1, …, m.

Matrix-vector multiplication can be viewed from multiple angles, at various levels of abstraction. These views come in handy when we attempt to conceptualize the various ways in which matrix-vector multiplication is used to model real-world problems. Below are three useful ways of conceptualizing matrix-vector multiplication, ordered from least to most abstract:

1. As a "row-wise", vector-generating process: Matrix-vector multiplication defines a process for creating a new vector from an existing vector, where each element of the new vector is "generated" by taking a weighted sum of the corresponding row of the matrix, using the elements of the vector as coefficients.
2. As taking a linear combination of the columns of a matrix: Matrix-vector multiplication is the process of taking a linear combination of the columns of a matrix, using the elements of the vector as the coefficients.
3. As evaluating a function between vector spaces: Matrix-vector multiplication allows a matrix to define a mapping between two vector spaces.
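
In the MapReduce setting of this module, the "row-wise" view translates directly into a map step and a reduce step. The sketch below is hedged: it assumes each input line holds one matrix entry in the form "i,j,value" with 0-based indices, and that the vector is small enough to be loaded into each mapper's memory from a hypothetical local file vector.txt (one component per line). The mapper emits the partial product A_ij * x_j keyed by row i, and the reducer sums the partial products to produce (Ax)_i.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for matrix entry (i, j, a_ij) emit (i, a_ij * x_j).
class MatrixVectorMapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
    private final List<Double> x = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Hypothetical local copy of the vector, one component per line (0-based order).
        try (BufferedReader reader = new BufferedReader(new FileReader("vector.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                x.add(Double.parseDouble(line.trim()));
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");   // "i,j,value"
        long i = Long.parseLong(parts[0].trim());
        int j = Integer.parseInt(parts[1].trim());
        double a = Double.parseDouble(parts[2].trim());
        context.write(new LongWritable(i), new DoubleWritable(a * x.get(j)));
    }
}

// Reducer: sum the partial products for row i to get (Ax)_i.
class MatrixVectorReducer
        extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable v : values) {
            sum += v.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}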
Hadoop YARN
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved into a large-scale distributed operating system used for Big Data processing.

The YARN architecture basically separates the resource management layer from the processing layer. The responsibility of the Hadoop 1.0 Job Tracker is split in YARN between the Resource Manager and the Application Master.
YARN Features: YARN gained popularity because of the following features −

 Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the benefit of multi-tenancy.
The main components of YARN architecture include:

 Client: It submits map-reduce jobs.


 Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
 Scheduler: It performs scheduling based on the requirements of the allocated applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
 Application manager: It is responsible for accepting the application and
negotiating the first container from the resource manager. It also restarts the
Application Master container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager, and for tracking the status and monitoring the progress of a single application. The Application Master asks the Node Manager to launch containers by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends a health report to the Resource Manager from time to time.
 Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are launched via a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:

 The client submits an application.
 The Resource Manager allocates a container to start the Application Master.
 The Application Master registers itself with the Resource Manager.
 The Application Master negotiates containers from the Resource Manager.
 The Application Master notifies the Node Manager to launch the containers.
 The application code is executed in the container.
 The client contacts the Resource Manager/Application Master to monitor the application's status.
 Once the processing is complete, the Application Master un-registers with the Resource Manager.
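
The monitoring step in this workflow can also be done programmatically. The following is a minimal sketch using the YARN client API; it assumes a YarnConfiguration that can locate the cluster's Resource Manager (for example via yarn-site.xml on the classpath) and simply lists the applications known to the Resource Manager along with their current states.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the Resource Manager for all known applications and print their states.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  "
                    + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}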
