
Hadoop Presentation

Hadoop is an open-source framework developed in 2005 for handling large-scale data, consisting of two main components: HDFS for storage and MapReduce for processing. It is popular in the big data market due to its scalability, fault tolerance, and ability to handle various data types. HDFS offers features like data replication and high throughput, while MapReduce enables parallel processing of large datasets.


Hadoop and Its Two Critical Components

 Hadoop is an open-source framework from the Apache Software Foundation.
 Hadoop was developed by Doug Cutting and Mike Cafarella in 2005,
inspired by Google’s MapReduce and GFS papers.
 It began as part of the Nutch project to handle large-scale data.
 Yahoo supported its growth, and in 2009 it became a top-level Apache
project, forming the core of big data processing.
 Much of Hadoop’s code has been contributed by Yahoo, IBM (International
Business Machines), Cloudera, and others.
Evolution of Hadoop
Why should we use Hadoop?
The Hadoop solution is very popular; it has captured at least 90% of the big data
market.
Hadoop has some unique features that make it so popular:

1. Hadoop is scalable, so we can easily increase the amount of commodity
hardware in the cluster.

2. It is a fault-tolerant solution: when one node goes down, other
nodes can still process the data.

3. Data can be stored in structured (database), unstructured (text files,
images, PDF files, video), and semi-structured (XML) form, so it is more
flexible.
Two Core Components of Hadoop:
 Hadoop consists of two main components:
- HDFS (Hadoop Distributed File System) – Storage
- MapReduce – Processing
HDFS - Hadoop Distributed File System
 Hadoop comes with a distributed file system called HDFS.
 Using HDFS, data is stored on multiple data nodes and is also
replicated.
 When one data node goes down, the data can still be accessed from any other
data node that holds a replica (see the client sketch below).
 HDFS is very cost-effective.
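A minimal Java sketch of a client writing and then reading a file through the HDFS FileSystem API is shown below. It assumes the hadoop-client library is on the classpath; the NameNode address hdfs://namenode:9000, the file path, and the class name are placeholder assumptions, not values from this presentation.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the client streams bytes; HDFS splits them into blocks and
        // replicates each block to several DataNodes behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: if a DataNode holding a block is down, the client
        // transparently reads the replica from another DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}

The client only ever sees one logical file; splitting into blocks and replication happen inside HDFS.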

Concept of HDFS:
1. Data blocks:
 A data block is the smallest unit of data that can be read or written in one
operation.
 The default size of an HDFS block is 128 MB.
 The data is divided into blocks and stored across the cluster.
 When the data is smaller than the block size, it does not occupy the
whole block (see the worked example below).
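A small, self-contained Java sketch of the block arithmetic: the 300 MB example file, class name, and printed output are illustrative assumptions, while the 128 MB default block size matches the slide above.

public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size: 128 MB
        long fileSize  = 300L * 1024 * 1024;   // assumed example: a 300 MB file

        long fullBlocks = fileSize / blockSize; // 2 full 128 MB blocks
        long lastBlock  = fileSize % blockSize; // final block holds only 44 MB

        System.out.printf("blocks = %d full + 1 partial of %d MB%n",
                fullBlocks, lastBlock / (1024 * 1024));
        // Prints: blocks = 2 full + 1 partial of 44 MB
        // The 44 MB block occupies only 44 MB on disk, not the full 128 MB.
    }
}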
2. NameNode (MasterNode):
 The NameNode is the master node of HDFS.
 It acts as the controller or manager of HDFS.
 The NameNode does not store the data itself; it stores the
metadata of all files.

3. DataNode (SlaveNode):
 These nodes are the worker nodes of HDFS.
 They store the actual data blocks in the Hadoop Distributed File
System.
 They send block reports to the NameNode with details of the stored
blocks.
 Lost blocks are automatically re-replicated if a DataNode fails,
ensuring fault tolerance (see the sketch below).
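As a rough illustration of the NameNode/DataNode split, the Java sketch below asks for a file's block locations: the metadata query is answered by the NameNode, while the listed hosts are the DataNodes holding each replica. The cluster address, file path, and class name are assumptions made for the example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address and file path; adjust for your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        Path file = new Path("/user/demo/big-dataset.csv");

        // The NameNode answers this metadata query: which blocks make up
        // the file and which DataNodes hold each replica.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d replicas on %s%n",
                    b.getOffset(), b.getLength(), String.join(", ", b.getHosts()));
        }
        fs.close();
    }
}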
HDFS Architecture:
HDFS Features:
• Fault Tolerance – Automatically recovers data using replication if a node fails.

• High Scalability – Easily stores petabytes of data by adding more nodes.

• Distributed Storage – Stores data across multiple machines (nodes).

• Data Replication – Default 3 copies of each data block for reliability.

• High Throughput – Optimized for large data read/write access.

• Write Once, Read Many – Supports write-once and multiple-read operations.

• Block Storage – Splits large files into fixed-size blocks (default 128 MB).

• Cost-Effective – Runs on commodity hardware (no need for high-end servers).

• Support for Large Files – Handles files ranging from gigabytes to terabytes.
Advantages of HDFS:
1. Fault Tolerant – Data is replicated across nodes, so it's safe even if
one node fails.
2. Scalable – Easy to add more nodes to handle more data.
3. Cost-Effective – Works on low-cost commodity hardware.
4. High Throughput – Optimized for reading large datasets quickly.
5. Supports Big Data – Designed to store and manage huge volumes of
data.
Disadvantages of HDFS:
1. Not suitable for small files – Storing too many small files can
overload the NameNode.
2. Latency issues – Not ideal for real-time data access.
3. No data modification – HDFS allows data to be written only once;
it doesn’t support updates.
4. High memory usage on the NameNode – The NameNode stores metadata
in memory, which can become a bottleneck.
5. Complex setup and maintenance – Requires proper configuration
and monitoring.
MapReduce:
 MapReduce processes data in parallel.
 - Map Phase: Converts input into key-value pairs
 - Reduce Phase: Aggregates values by key
 MapReduce is one of the main components of the Hadoop
ecosystem.
 MapReduce is designed to process a large amount of data in
parallel by dividing the work into smaller, independent
tasks.
 The whole job is taken from the user, divided into smaller
tasks, and the tasks are assigned to the worker nodes.
 MapReduce programs take a list as input and produce a list
as output.
1. Map Task:
The map task takes a set of keys and values, i.e. key-value
pairs, as input. The data may be in structured or unstructured form.
The keys are references to the input, the values are the dataset, and
the map function is applied to every input value.

2. Reduce Task:
The reduce task takes the key-value pairs created by the
mappers as input. The key-value pairs are sorted and grouped by key. In the
reducer we perform aggregation or summation-type work on the grouped values.
The phases of MapReduce
 split:
data is partitioned across several compute
nodes
 map: apply a map function to each chunk of data
 sort & shuffle: the output of the mappers is
sorted and distributed to the reducers
 reduce: finally, a reduce function is applied to
the data and an output is produced
MapReduce Working Example
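The canonical word-count job illustrates these phases end to end. The Java sketch below follows the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class names and the input/output paths passed on the command line are assumptions for the example, not part of this presentation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every line of input, emit (word, 1) key-value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework has already sorted and grouped by key,
    // so each call receives one word plus all of its counts to sum up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be launched with something like
hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output,
where the output directory must not already exist.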
Real life example of MapReduce
Advantages of MapReduce:

1. Scalability
Easily processes petabytes of data across thousands of machines.

2. Fault Tolerance
Automatically handles node failures using data replication.

3. Cost-Effective
Can run on low-cost, commodity hardware.

4. Parallel Processing
Splits tasks across many nodes, increasing processing speed.

5. Simplicity of Programming Model
Developers only need to define Map() and Reduce() functions.
Disadvantages of MapReduce:
1. Not Suitable for All Problems
Inefficient for iterative and real-time processing tasks (e.g., machine
learning).

2. High Latency
Batch-oriented; not suitable for low-latency applications.

3. Complex Debugging
The distributed environment makes error tracing difficult.

4. Requires Expertise
Writing optimized MapReduce code requires an understanding of
distributed systems.
Difference Between HDFS and MapReduce
Summary
 - HDFS handles storage
 - MapReduce manages computation
 - Both provide scalable, fault-tolerant big data
solutions
 - Efficient for storage and analysis
