Module III Hadoop Framework
Distributed File Systems -HDFS concepts - Map Reduce Execution, Algorithms using Map
Reduce, Matrix-Vector Multiplication – Hadoop YARN.
A distributed file system (DFS) is a file system whose data is spread across multiple file servers and locations. It permits programs to access and store remote data in the same way as local files, and it allows a user to access files from any system. Network users can share information and files in a regulated and permitted manner, while the servers retain complete control over the data and decide access permissions for the users.
The primary goal of a DFS is to enable users of physically distributed systems to share resources and information through the Common File System (CFS). It runs as a part of the operating system, and its typical configuration is a set of workstations and mainframes connected by a LAN. The process of creating a namespace in DFS is transparent to the clients.
DFS has two components in its services, and these are as follows:
1. Local Transparency - achieved through the namespace component, which presents the clients with a single, unified folder structure regardless of where the data physically resides.
2. Redundancy - achieved through the file replication component, which keeps copies of the data on multiple servers.
In the case of failure or heavy load, these components work together to increase data availability
by allowing data from multiple places to be logically combined under a single folder known as
the "DFS root".
Features of DFS
1. Structure Transparency
The client does not need to know the number or location of file servers and storage devices. Multiple file servers should be provided for adaptability, dependability, and performance.
2. Naming Transparency
There should be no hint of the file's location in its name. The file name should not change when the file is moved from one node to another.
3. Access Transparency
Local and remote files must be accessible in the same way. The file system must automatically locate the accessed file and deliver it to the client.
4. Replication Transparency
When a file is replicated across multiple nodes, the existence of the copies and their locations must be hidden from the clients.
A DFS can be implemented in two ways, and these are as follows:
1. Standalone DFS namespace
It does not use Active Directory and only permits DFS roots that exist on the local computer. A standalone DFS can only be accessed on the computer on which it is created. It offers no fault tolerance and cannot be linked to any other DFS.
2. Domain-based DFS namespace
It stores the DFS configuration in Active Directory and creates the namespace root at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.
DFS namespace
Traditional file shares that are tied to a single server use SMB paths of the form:
\\<SERVER>\<path>\<subpath>
Domain-based DFS file share paths are distinguished by using the domain name in place of the server name, in the form:
\\<DOMAIN.NAME>\<dfsroot>\<path>
When users access such a share, either directly or by mapping a drive, their computer connects to one of the available servers associated with that share, based on rules defined by the network administrator. For example, the default behavior is for users to access the server nearest to them; however, this can be changed to prefer a certain server.
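For illustration, with a hypothetical server FILESERVER01 and a hypothetical domain corp.example.com, the same shared folder might be reached in either of the following ways:
\\FILESERVER01\Shared\Reports (traditional share tied to a single server)
\\corp.example.com\dfsroot\Reports (domain-based DFS path, resolved to whichever server currently hosts that folder)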
There are several applications of the distributed file system. Some of them are as follows:
Hadoop
Hadoop is a collection of open-source software services. It is a software framework that uses the MapReduce programming style to allow distributed storage and processing of large amounts of data. Hadoop consists of a storage component, known as the Hadoop Distributed File System (HDFS), and a processing component based on the MapReduce programming model.
NFS (Network File System)
It is a client-server architecture that enables a computer user to store, update, and view files remotely. It is one of several DFS standards for network-attached storage.
SMB (Server Message Block)
IBM developed the SMB protocol for file sharing. It was designed to permit systems to read and write files on a remote host across a LAN. The directories on the remote host made available through SMB are known as "shares".
NetWare
It is a discontinued computer network operating system developed by Novell, Inc. It mainly used cooperative multitasking to run various services on a computer system and relied on the IPX network protocol.
There are various advantages and disadvantages of the distributed file system. These are as
follows:
Advantages
There are various advantages of the distributed file system. Some of the advantages are as follows:
o It allows multiple users to access and share the same files remotely.
o It improves file availability, access time, and network efficiency.
o Replication keeps data available even if a server or disk fails.
o Storage capacity can be grown by adding more nodes.
Disadvantages
There are various disadvantages of the distributed file system. Some of the disadvantages are as follows:
o Nodes and connections need to be secured, so security is a concern.
o Messages and data may be lost in the network while moving from one node to another.
o Managing a distributed file system is more complicated than managing a single local file system.
o Overloading may occur if all nodes try to send data at once.
HDFS CONCEPTS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability for parallel applications.
It is cost effective because it runs on commodity hardware. It involves the concepts of blocks, data nodes and name nodes.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, a file in HDFS that is smaller than the block size does not occupy the full block's size; for example, a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations such as opening, closing and renaming are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also performs block creation, deletion and replication as instructed by the name node.
The HDFS should be formatted initially and then started in the distributed mode. Typical commands (using the paths from this example) are given below.
To format the name node: $ hdfs namenode -format
To start HDFS: $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:
$ hdfs dfs -copyFromLocal /usr/home/Desktop/data.txt /user/test
o Recursive deleting:
$ hdfs dfs -rm -r <directory>
Example:
$ hdfs dfs -rm -r /user/test
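Beyond the command line, HDFS can also be used programmatically. The following is a minimal Java sketch (not part of the module text) using Hadoop's FileSystem client API; the HDFS path, replication factor and block size shown are illustrative assumptions.

// Minimal sketch: write a file to HDFS with an explicit replication factor and
// block size, then ask the name node where each block of the file is stored.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/test/data.txt");    // assumed HDFS path
        short replication = 3;                          // copies kept on different data nodes
        long blockSize = 128 * 1024 * 1024L;            // 128 MB, the default block size

        // Create the file with the chosen replication factor and block size.
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }

        // List the block locations reported by the name node.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}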
Goals of HDFS
o Handling hardware failure - HDFS comprises many server machines. If any machine fails, the goal of HDFS is to recover from the failure quickly.
o Streaming data access - Applications that run on HDFS need streaming access to their data sets; they are not general-purpose applications that typically run on general-purpose file systems.
o Coherence model - Applications that run on HDFS follow the write-once-read-many approach, so a file, once created, need not be changed. However, it can be appended to and truncated.
Features of HDFS
o Highly scalable - HDFS is highly scalable, as a single cluster can scale to hundreds of nodes.
o Replication - Due to unfavourable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains copies of the data on different machines.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across the nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Map Reduce Execution
Map Reduce is a processing technique and a programming model for distributed computing based on Java. The Map Reduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name Map Reduce implies, the reduce task is always performed after the map job.
The major advantage of Map Reduce is that it is easy to scale data processing over multiple
computing nodes. Under the Map Reduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the Map Reduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
Map Reduce model.
The Algorithm
Generally, the Map Reduce paradigm is based on sending the computation to where the data resides.
A Map Reduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
During a Map Reduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
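As a concrete illustration of the map, shuffle and reduce stages described above, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API. It is not the module's own code: the class names are arbitrary, and the input and output paths are assumed to be passed as command-line arguments.

// Minimal word-count sketch: the map stage emits (word, 1) for every word, the
// framework shuffles and sorts by key, and the reduce stage sums the counts.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words and (word, 1) is emitted.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: the framework has already shuffled and sorted by key, so each
    // call receives one word together with all of its counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // assumed HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // assumed HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}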
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Job tracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple task trackers: act like slaves, each of them performing part of the job
For every job submitted for execution in the system, there is one job tracker that resides on the name node, and there are multiple task trackers that reside on the data nodes.
A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
Execution of an individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
The task tracker's responsibility is to send the progress report to the job tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker so as to notify it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
Terminology
Payload − The applications implement the Map and Reduce functions and form the core of the job.
Mapper − The mapper maps the input key/value pairs to a set of intermediate key/value pairs.
Name Node − The node that manages the Hadoop Distributed File System (HDFS).
Data Node − The node where the data is present in advance, before any processing takes place.
Master Node − The node where the Job Tracker runs and which accepts job requests from clients.
Slave Node − The node where the Map and Reduce programs run.
Job Tracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
Task Tracker − Tracks the task and reports status to the Job Tracker.
Job − An execution of a Mapper and a Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a slave node.
In Map Reduce, bulk tasks are divided into smaller tasks, which are then allotted to many systems.
The two important tasks in the Map Reduce algorithm are:
Map
Reduce
The Map task is always performed first and is then followed by the Reduce job. The Map task converts one data set into another data set, breaking individual elements into tuples.
The Reduce task combines those tuples of data into a smaller set of tuples; it uses the map output as its input.
Input Phase: A record reader translates each record in the input file into key-value pairs and sends them to the mapper.
Map Phase: The map is a user-defined function. It takes a sequence of key-value pairs, processes each of them, and generates zero or more key-value pairs.
Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.
Combiner: A combiner takes the mapper's intermediate keys as input and applies a user-defined function to aggregate the values within the small scope of one mapper.
Shuffle and Sort: Shuffle and sort is the first step of the reducer task. While the reducer is running, it downloads the relevant key-value pairs onto the local machine, where they are sorted by key into a larger data list. The data list groups equal keys together so that their values can be iterated over easily in the reducer task.
Reducer phase: The reducer takes the grouped key-value pairs as input and runs a reducer function on each group, producing zero or more key-value pairs. The data can be aggregated, filtered, and combined in a number of ways, and this may require a wide range of processing.
Output phase: An output formatter translates the final key-value pairs from the reducer function and writes them to a file using a record writer.
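A small illustrative trace of these phases, assuming the word-count job sketched earlier and the two input lines "it is what it is" and "what is it" (the byte offsets used as input keys are illustrative):

Input records (offset, line):   (0, "it is what it is")   (17, "what is it")
Map output:                     (it,1) (is,1) (what,1) (it,1) (is,1)   and   (what,1) (is,1) (it,1)
Combiner output (per mapper):   (is,2) (it,2) (what,1)   and   (is,1) (it,1) (what,1)
Shuffle and sort (grouped):     (is,{2,1})  (it,{2,1})  (what,{1,1})
Reducer output:                 (is,3) (it,3) (what,2)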
Algorithms Using MapReduce
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Indexing
TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.
Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class collects the matching valued keys as a collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs.
The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-value groups (K2, {V2, V2, …}) before they are presented to the Reducer.
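The sort order applied to the intermediate keys can itself be customized. Below is a minimal sketch, assuming the word-count job shown earlier, that uses Hadoop's WritableComparator (an implementation of RawComparator) to invert the natural ordering of Text keys; the class name and the choice of descending order are purely illustrative.

// Minimal sketch of overriding the intermediate-key sort order. Hadoop sorts
// mapper output by key before the reduce phase; a RawComparator, here provided
// through WritableComparator, can change that order.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true);        // true: create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);    // invert the natural (ascending) order
    }
}
// In the job driver (assuming a Job object as in the word-count sketch above):
//     job.setSortComparatorClass(DescendingTextComparator.class);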
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how searching works with the help of an example.
Example
The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because the employee data was imported from all the database tables repeatedly.
The Map phase processes each input file and provides the employee data as key-value pairs (<k, v> : <emp name, salary>).
The combiner phase (the searching technique) will accept the input from the Map phase as key-value pairs with employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee (treated as the max salary)
if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
} else {
    Continue checking;
}
In this way, the combiner produces, for each input file, one key-value pair for the highest-salaried employee in that file.
Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files.
The final output should be as follows −
<gopal, 50000>
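A minimal sketch of this searching idea in Hadoop's Java MapReduce API is given below, assuming each input record is a line of the form "name<TAB>salary"; the record format and class names are assumptions for illustration.

// Minimal sketch of the max-salary search described above (illustrative).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalary {

    // Map phase: emit every record under one constant key so that all
    // (name, salary) pairs meet in a single reduce group.
    public static class SalaryMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text MAX_KEY = new Text("max");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(MAX_KEY, value);   // value is the "name<TAB>salary" line
        }
    }

    // Used as both combiner and reducer: keeps only the highest salary seen.
    // As a combiner it finds the per-mapper (per-file) maximum; as the reducer it
    // finds the overall maximum, which also removes duplicate records.
    public static class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestName = null;
            long bestSalary = Long.MIN_VALUE;   // the first record seen becomes the initial max
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                long salary = Long.parseLong(parts[1].trim());
                if (salary > bestSalary) {
                    bestSalary = salary;
                    bestName = parts[0];
                }
            }
            // Emit in the same "name<TAB>salary" form so the class also works as a combiner.
            context.write(key, new Text(bestName + "\t" + bestSalary));
        }
    }
    // Driver wiring follows the word-count sketch, with
    // job.setCombinerClass(MaxSalaryReducer.class) and job.setReducerClass(MaxSalaryReducer.class).
}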
Indexing
Normally indexing is used to point to particular data and its address. It performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as inverted index. Search
engines like Google and Bing use inverted indexing technique. Let us try to understand how
Indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names, and their content is given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the
term "is" appears in the files T[0], T[1], and T[2].
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the number
of times a word appears in a document divided by the total number of words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF)
It measures the importance of a term. It is calculated by the number of documents in the text
database divided by the number of documents where a specific term appears.
While computing TF, all the terms are considered equally important; that is, TF counts the frequency of ordinary words like "is", "a", "what", etc. just as readily as rare ones. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it), where the logarithm in the example below is taken to base 10.
The algorithm is explained below with the help of a small example.
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1,000 of these. Then, the IDF is calculated as log10(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
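A tiny Java sketch reproducing this arithmetic is given below; the class and method names are arbitrary, and a base-10 logarithm is used so that the IDF comes out as 4, as in the example.

// Tiny sketch reproducing the TF-IDF arithmetic above.
public class TfIdf {
    static double tfIdf(long termCountInDoc, long totalTermsInDoc,
                        long totalDocs, long docsWithTerm) {
        double tf = (double) termCountInDoc / totalTermsInDoc;        // term frequency
        double idf = Math.log10((double) totalDocs / docsWithTerm);   // inverse document frequency
        return tf * idf;
    }

    public static void main(String[] args) {
        // 'hive' appears 50 times in a 1000-word document and in 1,000 of 10,000,000 documents.
        System.out.println(tfIdf(50, 1000, 10_000_000, 1_000));       // ≈ 0.20 (= 0.05 × 4)
    }
}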
Matrix-Vector Multiplication
Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. Notably, it is only defined when the length of the vector equals the number of columns of the matrix. For an m × n matrix M with entries m_ij and a vector v of length n with components v_j, the product x = Mv is a vector of length m whose i-th component is
x_i = m_i1·v_1 + m_i2·v_2 + … + m_in·v_n,   for i = 1, 2, …, m.
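One common way to express this computation in MapReduce, assuming the vector v is small enough to be loaded into memory by every mapper, is for the mapper to emit (i, m_ij × v_j) for each matrix entry and for the reducer to sum the partial products of each row i. The sketch below makes several assumptions for illustration: each input line holds one matrix entry as "i j value" with 1-based indices, and the HDFS path of the vector file (one component per line) is passed in the job configuration under the hypothetical key "vector.path".

// Sketch (illustrative) of matrix-vector multiplication in MapReduce.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixVector {

    // Map: for each matrix entry m_ij emit (i, m_ij * v_j).
    public static class MVMapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
        private final List<Double> v = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load the whole vector from HDFS into memory (the stated assumption).
            Path vectorPath = new Path(context.getConfiguration().get("vector.path"));
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(vectorPath)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    v.add(Double.parseDouble(line.trim()));
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");   // "i j m_ij"
            long i = Long.parseLong(parts[0]);
            int j = Integer.parseInt(parts[1]);                       // 1-based column index
            double mij = Double.parseDouble(parts[2]);
            context.write(new LongWritable(i), new DoubleWritable(mij * v.get(j - 1)));
        }
    }

    // Reduce: x_i is the sum over j of m_ij * v_j.
    public static class MVReducer extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
        @Override
        protected void reduce(LongWritable row, Iterable<DoubleWritable> products, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable p : products) {
                sum += p.get();
            }
            context.write(row, new DoubleWritable(sum));
        }
    }
    // Driver wiring follows the word-count sketch, after calling
    // conf.set("vector.path", "<hdfs path of the vector file>") on the job configuration.
}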
Hadoop YARN
The YARN architecture basically separates the resource management layer from the processing layer. The responsibilities that the Job Tracker had in Hadoop 1.0, namely resource management and job scheduling/monitoring, are split in YARN between the Resource Manager and the Application Master.
YARN Features: YARN gained popularity because of the following features-