
Hadoop Presentation

Hadoop is an open-source framework developed in 2005 for handling large-scale data, consisting of two main components: HDFS for storage and MapReduce for processing. It is popular in the big data market due to its scalability, fault tolerance, and ability to handle various data types. HDFS offers features like data replication and high throughput, while MapReduce enables parallel processing of large datasets.


Hadoop and Its Two Critical Components

 Hadoop is an open-source framework from the Apache Software Foundation.
 Hadoop was developed by Doug Cutting and Mike Cafarella in 2005,
inspired by Google’s MapReduce and GFS papers.
 It began as part of the Nutch project to handle large-scale data.
 Yahoo supported its growth, and in 2009 it became a top-level Apache
project, forming the core of big data processing.
 Much of Hadoop’s code has been contributed by Yahoo, IBM (International
Business Machines), Cloudera, and others.
Evolution of Hadoop
Why should we use Hadoop?
The Hadoop solution is very popular; it has captured at least 90% of the big data
market.
Hadoop has some unique features that make it so popular:

1. Hadoop is scalable, so we can easily increase the amount of commodity
hardware in the cluster.

2. It is a fault-tolerant solution: when one node goes down, other
nodes can still process the data.

3. Data can be stored in structured (database), unstructured (text files,
images, PDF files, video), and semi-structured (XML) form, so it is more
flexible.
Two Core Components of Hadoop:
 Hadoop consists of two main components:
- HDFS (Hadoop Distributed File System) – Storage
- MapReduce – Processing
HDFS - Hadoop Distributed File System
 Hadoop comes with a distributed file system called HDFS.
 Using HDFS, data is stored on multiple data nodes and is also
replicated.
 When one data node goes down, the data can still be accessed from any other
data node that holds a replica (see the client sketch below).
 HDFS is very cost-effective.
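A minimal Java sketch of a client writing and then reading a file through the HDFS FileSystem API is shown below. It assumes the hadoop-client library is on the classpath; the NameNode address hdfs://namenode:9000, the file path, and the class name are placeholder assumptions, not values from this presentation.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the client streams bytes; HDFS splits them into blocks and
        // replicates each block to several DataNodes behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: if a DataNode holding a block is down, the client
        // transparently reads the replica from another DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}

The client only ever sees one logical file; splitting into blocks and replication happen inside HDFS.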

Concept of HDFS:
1. Data blocks:
 A data block is the smallest unit of data that can be read or written in one
operation.
 The default size of an HDFS block is 128 MB.
 The data is divided into blocks and stored across the cluster.
 When the data is smaller than the block size, it does not occupy the
whole block (see the worked example below).
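A small, self-contained Java sketch of the block arithmetic: the 300 MB example file, class name, and printed output are illustrative assumptions, while the 128 MB default block size matches the slide above.

public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size: 128 MB
        long fileSize  = 300L * 1024 * 1024;   // assumed example: a 300 MB file

        long fullBlocks = fileSize / blockSize; // 2 full 128 MB blocks
        long lastBlock  = fileSize % blockSize; // final block holds only 44 MB

        System.out.printf("blocks = %d full + 1 partial of %d MB%n",
                fullBlocks, lastBlock / (1024 * 1024));
        // Prints: blocks = 2 full + 1 partial of 44 MB
        // The 44 MB block occupies only 44 MB on disk, not the full 128 MB.
    }
}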
2. NameNode (MasterNode):
 The NameNode is the master node of HDFS.
 It acts as the controller or manager of HDFS.
 The NameNode does not store the data itself; it stores the
metadata of all files.

3. DataNode (SlaveNode):
 These nodes are the worker nodes of HDFS.
 They store the actual data blocks in the Hadoop Distributed File
System.
 They send block reports to the NameNode with details of the stored
blocks.
 Lost blocks are automatically re-replicated if a DataNode fails,
ensuring fault tolerance (see the sketch below).
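As a rough illustration of the NameNode/DataNode split, the Java sketch below asks for a file's block locations: the metadata query is answered by the NameNode, while the listed hosts are the DataNodes holding each replica. The cluster address, file path, and class name are assumptions made for the example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address and file path; adjust for your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        Path file = new Path("/user/demo/big-dataset.csv");

        // The NameNode answers this metadata query: which blocks make up
        // the file and which DataNodes hold each replica.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d replicas on %s%n",
                    b.getOffset(), b.getLength(), String.join(", ", b.getHosts()));
        }
        fs.close();
    }
}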
HDFS Architecture:
HDFS Features:
• Fault Tolerance – Automatically recovers data using replication if a node fails.

• High Scalability – Easily stores petabytes of data by adding more nodes.

• Distributed Storage – Stores data across multiple machines (nodes).

• Data Replication – Default 3 copies of each data block for reliability.

• High Throughput – Optimized for large data read/write access.

• Write Once, Read Many – Supports write-once and multiple-read operations.

• Block Storage – Splits large files into fixed-size blocks (default 128 MB).

• Cost-Effective – Runs on commodity hardware (no need for high-end servers).

• Support for Large Files – Handles files ranging from gigabytes to terabytes.
Advantages of HDFS:
1. Fault Tolerant – Data is replicated across nodes, so it's safe even if
one node fails.
2. Scalable – Easy to add more nodes to handle more data.
3. Cost-Effective – Works on low-cost commodity hardware.
4. High Throughput – Optimized for reading large datasets quickly.
5. Supports Big Data – Designed to store and manage huge volumes of
data.
Disadvantages of HDFS:
1. Not suitable for small files – Storing too many small files can
overload the NameNode.
2. Latency issues – Not ideal for real-time data access.
3. No data modification – HDFS allows data to be written only once;
it doesn’t support updates.
4. High memory usage on the NameNode – The NameNode stores metadata
in memory, which can become a bottleneck.
5. Complex setup and maintenance – Requires proper configuration
and monitoring.
MapReduce:
 MapReduce processes data in parallel.
 - Map Phase: Converts input into key-value pairs
 - Reduce Phase: Aggregates values by key
 MapReduce is one of the main components of the Hadoop
ecosystem.
 MapReduce is designed to process a large amount of data in
parallel by dividing the work into smaller, independent
tasks.
 The whole job is taken from the user, divided into smaller
tasks, and the tasks are assigned to the worker nodes.
 MapReduce programs take a list as input and produce a list
as output.
1. Map Task:
The map task takes a set of keys and values, i.e. key-value
pairs, as input. The data may be in structured or unstructured form.
The keys are references to the input, the values are the dataset, and
the map function is applied to every input value.

2. Reduce Task:
The reduce task takes the key-value pairs created by the
mappers as input. The key-value pairs are sorted and grouped by key. In the
reducer we perform aggregation or summation-type work on the grouped values.
The phases of MapReduce
 split:
data is partitioned across several compute
nodes
 map: apply a map function to each chunk of data
 sort & shuffle: the output of the mappers is
sorted and distributed to the reducers
 reduce: finally, a reduce function is applied to
the data and an output is produced
MapReduce Working Example
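The canonical word-count job illustrates these phases end to end. The Java sketch below follows the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class names and the input/output paths passed on the command line are assumptions for the example, not part of this presentation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every line of input, emit (word, 1) key-value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework has already sorted and grouped by key,
    // so each call receives one word plus all of its counts to sum up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be launched with something like
hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output,
where the output directory must not already exist.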
Real life example of MapReduce
Advantages of MapReduce:

1. Scalability
Easily processes petabytes of data across thousands of machines.

2. Fault Tolerance
Automatically handles node failures using data replication.

3. Cost-Effective
Can run on low-cost, commodity hardware.

4. Parallel Processing
Splits tasks across many nodes, increasing processing speed.

5. Simplicity of Programming Model
Developers only need to define Map() and Reduce() functions.
Disadvantages of MapReduce:
1. Not Suitable for All Problems
Inefficient for iterative and real-time processing tasks (e.g., machine
learning).

2. High Latency
Batch-oriented; not suitable for low-latency applications.

3. Complex Debugging
The distributed environment makes error tracing difficult.

4. Requires Expertise
Writing optimized MapReduce code requires an understanding of
distributed systems.
Difference Between HDFS and MapReduce
Summary
 - HDFS handles storage
 - MapReduce manages computation
 - Both provide scalable, fault-tolerant big data
solutions
 - Efficient for storage and analysis
