Unit - 2
HADOOP
Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is written primarily in Java, with some native code in C and shell scripts.
Hadoop is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
• It all started in the year 2002 with the Apache Nutch project.
• In 2002, Doug Cutting and Mike Cafarella were working on the Apache Nutch project, which aimed at building a web search engine that would crawl and index websites.
• After a lot of research, Mike Cafarella and Doug Cutting estimated that it
would cost around $500,000 in hardware with a monthly running cost of
$30,000 for a system supporting a one-billion-page index.
• This proved to be too expensive and therefore infeasible for indexing billions of webpages, so they started looking for a feasible solution that would reduce the cost.
2003
Meanwhile, in 2003, Google released a paper on the Google File System (GFS) that described its architecture and provided an idea for storing large datasets in a distributed environment. This paper solved the problem of storing the huge files generated as part of the web crawl and indexing process. But this was only half of the solution to their problem.
2004
In 2004, Nutch's developers set about writing an open-source implementation, the Nutch Distributed File System (NDFS).
In the same year, Google introduced MapReduce to the world by releasing a paper on it. This paper provided the solution for processing those large datasets and gave the Nutch developers the other half of their solution.
Google provided the ideas for distributed storage and MapReduce. The Nutch developers implemented MapReduce in the middle of 2004.
2006
The Apache community realized that the implementations of MapReduce and NDFS could be used for other tasks as well. In February 2006, they were moved out of Nutch to form an independent subproject of Lucene called "Hadoop" (named after the yellow toy elephant belonging to Doug Cutting's son).
As the Nutch project was limited to clusters of 20 to 40 nodes, Doug Cutting joined Yahoo in 2006 to scale the Hadoop project to clusters of thousands of nodes.
2007
In 2007, Yahoo started using Hadoop on a 1,000-node cluster.
2008
In January 2008, Hadoop confirmed its success by becoming a top-level project at Apache.
By this time, many other companies, such as Last.fm, Facebook, and the New York Times, had started using Hadoop.
Hadoop Distributed File System (HDFS)
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file
system designed to run on commodity hardware.
HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets; it enables streaming access to file system data in Apache Hadoop.
Nodes: An HDFS cluster is typically formed by master and slave nodes.
1. NameNode (Master Node):
1. Manages the file system metadata and knows which DataNode holds which blocks.
2. It should be deployed on reliable hardware with a high-end configuration, not on commodity hardware.
2. DataNode (Slave Node):
1. Actual worker nodes, which do the actual work such as reading, writing, and processing data.
2. They also perform block creation, deletion, and replication upon instruction from the master.
HDFS is the primary component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
HDFS consists of two core components:
NameNode
DataNode
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce():
Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates key-value pair results, which are later processed by the Reduce() method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
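As an illustration, a minimal word-count Mapper and Reducer written against the standard Hadoop Java MapReduce API might look like the sketch below (the class names are chosen for this sketch, and the Hadoop libraries are assumed to be on the classpath):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): splits each input line into words and emits a (word, 1) pair per word.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // key-value pair handed to the shuffle
        }
    }
}

// Reduce(): sums the counts produced by the map phase for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);            // (word, total count)
    }
}

Here Map() emits a (word, 1) pair for every word it sees, and Reduce() aggregates the pairs for each word into a single total.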
YARN (Yet Another Resource Negotiator), as the name implies, helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
YARN consists of three major components:
Resource Manager
Node Manager
Application Master
HOW DOES HDFS WORK
• The way HDFS works is by having a main "NameNode" and multiple "DataNodes" on a commodity hardware cluster.
• All the nodes are usually organized within the same physical rack in the data center. Data is then broken down into separate "blocks" that are distributed among the various DataNodes for storage. Blocks are also replicated across nodes to reduce the likelihood of failure.
• The NameNode is the "smart" node in the cluster. It knows exactly which DataNode contains which blocks and where the DataNodes are located within the machine cluster.
• The NameNode also manages access to the files, including reads, writes, creates, deletes, and replication of data blocks across the different DataNodes.
The NameNode operates in a "loosely coupled" way with the DataNodes. This means the elements of the cluster can dynamically adapt to the real-time demand for server capacity by adding or subtracting nodes as the system sees fit.
The DataNodes constantly communicate with the NameNode to see whether they need to complete a certain task. This constant communication ensures that the NameNode is aware of each DataNode's status at all times. Since the NameNode assigns tasks to the individual DataNodes, should it realize that a DataNode is not functioning properly, it can immediately reassign that node's task to a different node containing the same data block.
DataNodes also communicate with each other so they can cooperate during normal file operations. Clearly, the NameNode is critical to the whole system and should be replicated to prevent system failure.
Again, data blocks are replicated across multiple DataNodes, and access is managed by the NameNode. This means that when a DataNode no longer sends a "life signal" to the NameNode, the NameNode unmaps the DataNode from the cluster and keeps operating with the other DataNodes as if nothing had happened. When this DataNode comes back to life, or a different (new) DataNode is detected, that new DataNode is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several DataNodes, the failure of one server will not corrupt a file.
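To a client application, all of this block and replica management is invisible: the client simply creates, writes, and reads files through the HDFS API, and the NameNode resolves block locations behind the scenes. Below is a minimal sketch using the standard Java FileSystem API; the NameNode address and file path are placeholders invented for this example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create and write a file: the NameNode picks the DataNodes,
        // and the client streams the blocks to them.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back: block locations are resolved by the NameNode transparently.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}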
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should support hundreds of nodes per cluster to manage applications having huge datasets. HDFS accommodates applications that have data sets typically gigabytes to terabytes in size.
Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
Access to streaming data − HDFS is intended more for batch processing than interactive use, so the emphasis in the design is on high data throughput rates, which accommodate streaming access to data sets.
Coherence model − Applications that run on HDFS are required to follow the write-once-read-many approach. So a file, once created, need not be changed; however, it can be appended to and truncated.
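A short sketch of this write-once-read-many model using the Java FileSystem API is shown below; it assumes append is enabled on the cluster, and the file path is again a made-up example:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherenceModelSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/demo/events.log");        // hypothetical path

        // Write once: existing bytes in an HDFS file are never modified in place.
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.write("event-1\n".getBytes(StandardCharsets.UTF_8));
        }

        // New data may only be added at the end of the file, via append.
        try (FSDataOutputStream out = fs.append(log)) {
            out.write("event-2\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}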
Analysing data with Hadoop
Analyzing data with Hadoop involves using its ecosystem components to process, store, and analyze
large datasets. Here's an overview of the process:
Apache Spark
Apache Spark is an open-source processing engine that is designed for ease of analytics operations. It is a cluster computing platform that is designed to be fast and made for general-purpose use. Spark is designed to cover various batch applications, machine learning, streaming data processing, and interactive queries.
Features of Spark:
In-memory processing
Tight integration of components
Easy and inexpensive
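As a rough sketch of what a simple Spark batch job looks like through Spark's Java API (the local master URL, input file name, and filter condition are illustrative assumptions, and the spark-core dependency is required):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkAnalysisSketch {
    public static void main(String[] args) {
        // Local mode for illustration; on a cluster the master URL would differ.
        SparkConf conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load a text file (an hdfs:// path works the same way)
            // and process it in memory, in parallel.
            JavaRDD<String> lines = sc.textFile("input.txt");
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("Lines containing ERROR: " + errors);
        }
    }
}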
Map Reduce
MapReduce is just like an algorithm or a data structure that is based on the YARN framework.
The primary feature of MapReduce is to perform distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop work so fast, because when we are dealing with Big Data, serial processing is no longer of any use.
Features of Map-Reduce:
Scalable
Fault Tolerance
Parallel Processing
Tunable Replication
Load Balancing
Apache Hive
Apache Hive is a data warehousing tool that is built on top of Hadoop; data warehousing is nothing but storing, at a fixed location, data generated from various sources.
Hive is one of the best tools used for data analysis on Hadoop. Anyone who has knowledge of SQL can comfortably use Apache Hive. The query language of Hive is known as HQL or HiveQL.
Features of Hive:
Queries are similar to SQL queries.
Hive supports different storage types: HBase, ORC, plain text, etc.
Hive has built-in functions for data mining and other work.
Hive can operate on compressed data stored inside the Hadoop ecosystem.
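Because HQL looks like SQL, queries can also be issued programmatically through the HiveServer2 JDBC driver. The sketch below assumes a running HiveServer2 instance, the hive-jdbc dependency on the classpath, and a hypothetical web_logs table; the host, port, and table are not from the original notes:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are placeholders.
        String url = "jdbc:hive2://hiveserver-host:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             // HQL reads just like SQL.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}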
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Hadoop Streaming can be used with languages such as Python, Java, PHP, Scala, Perl, UNIX shell, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. It uses Unix streams as the interface between Hadoop and our MapReduce program, so any language that can read from standard input and write to standard output can be used to write the MapReduce program.
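For instance, a streaming mapper is just any executable that reads lines from standard input and writes tab-separated key-value pairs to standard output. A minimal word-count mapper written as a plain Java program might look like the sketch below (the class name is ours; the compiled program would be supplied to the hadoop-streaming jar via the -mapper option):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// A streaming mapper: reads raw text lines on stdin and
// emits "word<TAB>1" records on stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // key <TAB> value
                }
            }
        }
    }
}

The matching reducer would read the sorted key-value lines from standard input, sum the counts for each word, and print the totals to standard output.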
Features of Hadoop Streaming
Some of the key features associated with Hadoop Streaming are as follows:
Hadoop Streaming is a part of the Hadoop distribution.
It facilitates the easy writing of MapReduce programs and code.
Hadoop Streaming supports almost all types of programming languages, such as Python, C++, Ruby, Perl, etc.
Data formats used in Hadoop
Text files
A text file is the most basic and human-readable file format. It can be read or written in any programming language and is mostly delimited by commas or tabs.
The text file format consumes more space when a numeric value needs to be stored as a string. It is also difficult to represent binary data, such as an image, in a text file.
Sequence File
The sequence file format can be used to store an image in binary format. Sequence files store key-value pairs in a binary container format and are more efficient than text files. However, sequence files are not human-readable.
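As a rough illustration, binary payloads such as image bytes can be written as key-value pairs with Hadoop's SequenceFile writer; the output path and the dummy byte array below are invented for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/images.seq");      // hypothetical output path

        // Write (file name, binary payload) pairs into a binary container file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] imageBytes = {0x10, 0x20, 0x30};          // stand-in for real image bytes
            writer.append(new Text("photo-001.jpg"), new BytesWritable(imageBytes));
        }
    }
}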
Avro Files
The Avro file format has efficient storage due to optimized binary encoding. It is widely supported both inside and outside the Hadoop ecosystem.
The Avro file format is ideal for long-term storage of important data. It can be read and written in many languages, like Java, Scala, and so on. Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution can accommodate changes.
The Avro file format is considered the best choice for general-purpose storage in Hadoop.
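A minimal sketch of writing a record with Avro's generic Java API is shown below; the record schema, field names, and output file are made up for illustration, and the avro library is assumed to be on the classpath:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // The schema is embedded in the data file, so it stays readable later on.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Records are written with Avro's compact, optimized binary encoding.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}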
JSON Records:- JSON is a text format that stores metadata with the data, so it fully supports schema evolution: you can easily add or remove attributes for each datum. However, because it is a text format, it does not support block compression. JSON records are files in which each line is its own JSON datum. In the case of JSON files, the metadata is stored with the data and the file is also splittable, but again it does not support block compression.
MapReduce framework
MapReduce is a programming model and framework used for processing large datasets in a distributed and parallel manner. It is a core component of the Hadoop ecosystem, enabling efficient big data analysis by breaking tasks down into "map" and "reduce" phases.
•Distributed Processing: MapReduce allows you to process data across a cluster of computers (nodes) instead of relying on a single machine.
•Parallelism: The framework automatically handles the distribution and parallel execution of tasks, making it efficient for large datasets.
•Fault Tolerance: MapReduce is designed to be robust, meaning it can continue processing even if some nodes in the cluster fail.
At the highest level, there are five independent entities:
•The client, which submits the MapReduce job.
•The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
•The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
•The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
•The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
Job Submission:
•The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it.
•Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report.
•When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
•Asks the resource manager for a new application ID, which is used for the MapReduce job ID.
•Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
•Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
•Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID.
•Submits the job by calling submitApplication() on the resource manager.
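Tying these steps together, a typical driver program that configures and submits a job might look like the sketch below. It reuses the TokenizerMapper and IntSumReducer classes sketched earlier in this unit and takes the input and output paths from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);        // job JAR copied to the shared filesystem
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setNumReduceTasks(2);                        // sets the mapreduce.job.reduces property
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // used to compute the input splits
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        // waitForCompletion() submits the job and polls its progress once per second.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}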
Job Initialization:
•When the resource manager receives a call to its submitApplication() method, it hands off the
request to the YARN scheduler.
•The scheduler allocates a container, and the resource manager then
launches the application master’s process there, under the node
manager’s management.
•The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
•It initializes the job by creating a number of bookkeeping objects to
keep track of the job’s progress, as it will receive progress and
completion reports from the tasks.
•It retrieves the input splits computed in the client from the shared
filesystem.
•It then creates a map task object for each split, as well as a number of
reduce task objects determined by the mapreduce.job.reduces property (set by
the setNumReduceTasks() method on Job).
Task Assignment:
•If the job does not qualify for running as an uber task (a small job that the application master chooses to run in its own JVM), then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
•Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start.
•Requests for reduce tasks are not made until 5% of the map tasks have completed.
Task Execution:
•Once a task has been assigned resources for a container on a
particular node by the resource manager’s scheduler, the application
master starts the container by contacting the node manager.
•The task is executed by a Java application whose main class is
YarnChild. Before it can run the task, it localizes the resources that
the task needs, including the job configuration and JAR file, and
any files from the distributed cache.
•Finally, it runs the map or reduce task.
•Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.
•The Streaming task communicates with the process (which may be written in any language) using standard input and output streams.
•During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
•From the node manager's point of view, it is as if the child process ran the map or reduce code itself.
Progress and status updates :
•MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run.
•A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
•When a task is running, it keeps track of its progress (i.e., the proportion of the task completed).
•For map tasks, this is the proportion of the input that has been
processed.
•For reduce tasks, it’s a little more complex, but the system can still
estimate the proportion of the reduce input processed.
It does this by dividing the total progress into three parts,
corresponding to the three phases of the shuffle.
•As the map or reduce task runs, the child process communicates
with its parent application master through the umbilical interface.
•The task reports its progress and status (including counters) back to
its application master, which has an aggregate view of the job, every
three seconds over the umbilical interface.
•The resource manager web UI displays all the running applications
with links to the web UIs of their respective application masters,
each of which displays further details on the MapReduce job,
including its progress.
•During the course of the job, the client receives the latest status
by polling the application master every second (the interval is set
via mapreduce.client.progressmonitor.pollinterval).
Job Completion:
•When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to Successful.
•Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method.
•Finally, on job completion, the application master and the task containers clean up their working state, and the OutputCommitter's commitJob() method is called.
•Job information is archived by the job history server to enable later
interrogation by users if desired.