Hadoop Presentation
Hadoop has two critical components: HDFS for distributed storage and MapReduce for distributed computation.
Concept of HDFS:
1. Data blocks:
A data block is the minimum amount of data that can be read or written in one operation.
The default size of an HDFS block is 128 MB.
Files are divided into blocks and stored across the cluster.
When the data is smaller than the block size, it does not occupy a whole block.
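For example, with the default 128 MB block size, a 300 MB file is split into three blocks: 128 MB + 128 MB + 44 MB, where the final block occupies only 44 MB. A minimal Java sketch of that arithmetic (the 300 MB file size is illustrative):

    public class BlockMath {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // default HDFS block size (128 MB)
            long fileSize = 300L * 1024 * 1024;  // a hypothetical 300 MB file
            long fullBlocks = fileSize / blockSize;                    // 2 full blocks
            long lastBlockMb = (fileSize % blockSize) / (1024 * 1024); // 44 MB remainder
            long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0); // 3 blocks in total
            System.out.println(totalBlocks + " blocks; last block holds " + lastBlockMb + " MB");
        }
    }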
2. NameNode (MasterNode):
The NameNode is the master node of HDFS.
It acts as the controller and manager of the file system.
The NameNode does not store the data itself; it stores the metadata of all files, such as names, permissions, and block locations.
3. DataNode (SlaveNode):
These nodes are the worker nodes of HDFS.
They store the actual data blocks in the Hadoop Distributed File System.
Each DataNode sends block reports to the NameNode with details of its stored blocks.
If a DataNode fails, its blocks are automatically re-replicated from the remaining copies, ensuring fault tolerance.
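To illustrate the split between metadata and data: a client asks the NameNode (through the FileSystem API) where a file's blocks live, then reads the actual bytes from the DataNodes. A minimal sketch, assuming a reachable cluster and an existing file at the illustrative path /data/sample.txt:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf); // metadata requests go to the NameNode
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            // Metadata only: each block's offset, length, and the DataNodes holding its replicas.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }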
HDFS Architecture:
HDFS Features:
• Fault Tolerance – Automatically recovers data using replication if a node fails (a configuration sketch follows this list).
• Support for Large Files – Handles files ranging from gigabytes to terabytes.
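Fault tolerance rests on the replication factor (3 by default); it and the block size are ordinary configuration properties, usually set in hdfs-site.xml. A minimal client-side sketch of the same settings (the values shown are the defaults):

    import org.apache.hadoop.conf.Configuration;

    public class HdfsSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "3");       // keep 3 copies of every block
            conf.set("dfs.blocksize", "134217728"); // 128 MB block size, in bytes
            System.out.println("replication = " + conf.get("dfs.replication"));
        }
    }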
Advantages of HDFS:
1. Fault Tolerant – Data is replicated across nodes, so it's safe even if
one node fails.
2. Scalable – Easy to add more nodes to handle more data.
3. Cost-Effective – Works on low-cost commodity hardware.
4. High Throughput – Optimized for reading large datasets quickly.
5. Supports Big Data – Designed to store and manage huge volumes of data.
Disadvantages of HDFS:
1. Not suitable for small files – Storing too many small files can overload the NameNode.
2. Latency issues – Not ideal for real-time data access.
3. No in-place modification – HDFS follows a write-once model; data already written cannot be updated.
4. High memory usage on the NameNode – The NameNode stores metadata in memory, which can become a bottleneck.
5. Complex setup and maintenance – Requires proper configuration and monitoring.
MapReduce:
MapReduce processes data in parallel.
- Map Phase: Converts input into key-value pairs
- Reduce Phase: Aggregates values by key
MapReduce is one of the main components of the Hadoop ecosystem.
It is designed to process large amounts of data in parallel by dividing the work into smaller, independent tasks.
The whole job is taken from the user, divided into smaller tasks, and assigned to the worker nodes.
MapReduce programs take a list as input and produce a list as output.
1. Map Task:
The map task takes a set of keys and values, i.e. key-value pairs, as input. The data may be in structured or unstructured form. The keys are references to the input files, and the values are the datasets. The map task is applied to every input value.
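To make this concrete, here is a minimal word-count mapper sketch using Hadoop's Java MapReduce API (the class name TokenizerMapper is illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits the key-value pair (word, 1) for every word in each input line.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // key = word, value = a count of 1
            }
        }
    }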
2. Reduce Task:
The reduce task takes the key-value pairs created by the mapper as input. These key-value pairs are sorted by key before they reach the reducer. In the reducer we perform aggregation or summation-style work on the values grouped under each key.
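The matching reducer sketch sums the counts for each word (again, IntSumReducer is an illustrative name):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives (word, [1, 1, ...]) grouped by key and emits (word, total).
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // key = word, value = total count
        }
    }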
The phases of MapReduce:
• Split: the input data is partitioned across several compute nodes.
• Map: a map function is applied to each chunk of data.
• Sort & shuffle: the output of the mappers is sorted by key and distributed to the reducers.
• Reduce: finally, a reduce function is applied to the grouped data and an output is produced.
MapReduce Working Example
Real-life example of MapReduce
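A classic real-life example is word count: each mapper counts the words in its own input split, and the reducers sum the counts per word. A minimal driver sketch wiring together the mapper and reducer sketched above (input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Tracing the phases on the input "to be or not to be": the mappers emit (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); sort & shuffle groups them into (be,[1,1]), (not,[1]), (or,[1]), (to,[1,1]); and the reducer outputs be 2, not 1, or 1, to 2.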
Advantages of MapReduce:
1. Scalability – Easily processes petabytes of data across thousands of machines.
2. Fault Tolerance – Automatically handles node failures using data replication.
3. Cost-Effective – Can run on low-cost, commodity hardware.
4. Parallel Processing – Splits tasks across many nodes, increasing processing speed.
Disadvantages of MapReduce:
1. High Latency – Batch-oriented; not suitable for low-latency applications.
2. Complex Debugging – The distributed environment makes error tracing difficult.
3. Requires Expertise – Writing optimized MapReduce code requires an understanding of distributed systems.
Difference Between HDFS and MapReduce
Summary
- HDFS handles storage
- MapReduce manages computation
- Both provide scalable, fault-tolerant big data solutions
- Efficient for storage and analysis