The document provides a comprehensive overview of the Hadoop Distributed File System (HDFS), including its history, key features, architecture, and commands. It highlights the evolution of Hadoop from its inception in 2003 to the introduction of cloud-based services and Apache Spark. Additionally, it compares HDFS with other file systems and outlines the process for scaling out Hadoop clusters.


Hadoop Distributed File System
History of Hadoop
Year  Event
2003  Google publishes the MapReduce and Google File System (GFS) papers that later inspire Hadoop.
2005  Doug Cutting and Mike Cafarella begin the work that becomes Hadoop, as part of the Nutch project.
2006  Hadoop becomes a standalone project, named after Doug Cutting's son's toy elephant.
2008  Hadoop becomes a top-level project of the Apache Software Foundation.
2010  Hadoop 1.0 is released.
2011  Ecosystem projects such as Hive, Pig, HBase, and ZooKeeper mature around Hadoop.
2012  Hadoop becomes a mainstream platform for big data processing.
2014  Hadoop 2.0 arrives with YARN and an improved HDFS.
2015  Cloud-based Hadoop services emerge, such as Amazon EMR, Google Dataproc, and Azure HDInsight.
2019  Apache Spark is widely adopted alongside, and often in place of, Hadoop MapReduce.


HDFS
 HDFS stands for Hadoop Distributed File System.
 HDFS is fault-tolerant and designed to be deployed on low-cost,
commodity hardware.
 HDFS provides high throughput and is suited to applications with large
data sets that need streaming access to their data.
HDFS vs. Other File Systems

Feature | Database | NTFS | EXT4 | APFS | HDFS
Purpose | Manages structured data with queries | General-purpose file system | General-purpose file system | Optimized for Apple devices | Distributed storage for big data
Primary Use Case | Applications needing tables/queries | Windows storage | Linux storage | macOS/iOS storage | Big data analytics
File Size Limit | Depends on DB engine/schema | 16 EB (theoretical) | 16 TB | 8 EB | No fixed limit (block-based)
Partition Size | Depends on DB engine | 256 TB | 1 EB | 8 EB | Scales across multiple nodes
Key Features of HDFS
 Distributed Storage
 Fault Tolerance
 High Throughput
 Scalability
 Write-Once, Read-Many
 Large File Support
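Several of these features map directly to settings in hdfs-site.xml; a minimal fragment is sketched below (the values shown are the common defaults, included only for illustration, not as tuning advice):

```xml
<!-- hdfs-site.xml (fragment) -->
<configuration>
  <!-- Fault tolerance: each block is replicated on 3 DataNodes by default. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Large file support: files are split into 128 MB blocks (bytes). -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
```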
Key Components of HDFS
 DataNode
 NameNode
 Secondary NameNode
 Client
 Backup Node
 Replication management
 Rack awareness
 Read & write operations
Key Differences

Feature | Secondary NameNode | Backup Node | Standby Node
Primary Role | Checkpointing and merging edit logs with fsimage. | Maintaining an in-memory replica of the NameNode's metadata. | High Availability (HA): provides automatic failover.
Data Handling | Merges edit logs and fsimage to reduce log size. | Holds a replica of the NameNode's in-memory metadata. | Synchronized with the active NameNode for failover.
Failover Capability | Does not provide failover capability. | Does not automatically handle failover; requires manual promotion. | Provides automatic failover in case of NameNode failure.
Use Case | Reduces NameNode recovery time and prevents large logs. | Can be promoted to NameNode if the NameNode fails. | Ensures NameNode availability by switching roles in case of failure.
Interaction with NameNode | Periodically merges logs and fsimage to reduce load. | Can take over from the NameNode in case of failure, but manually promoted. | Continuously synchronized and takes over automatically.
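With HA configured, the active and standby roles can be inspected and switched with `hdfs haadmin`; a brief sketch (the service IDs `nn1`/`nn2` are illustrative, and correspond to whatever is defined under `dfs.ha.namenodes.*` in hdfs-site.xml):

```shell
# Check which NameNode is currently active and which is standby.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually initiate a graceful failover from nn1 to nn2.
hdfs haadmin -failover nn1 nn2
```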
HDFS Architecture
(Diagram slide: a single NameNode manages the filesystem namespace and block locations, while multiple DataNodes store the actual data blocks; clients contact the NameNode for metadata and read/write blocks directly from DataNodes.)
Scaling Out in Hadoop
 Scaling out refers to adding more nodes to a cluster to increase its
capacity for handling larger datasets and processing workloads.
 This is in contrast to scaling up, which involves upgrading the existing
hardware with more powerful components (e.g., more CPU, memory,
or storage).
Steps for Scaling Out
 Add New Nodes to the Cluster
 Install Hadoop on New Nodes
 Update Configuration Files
 Start Hadoop Services
 Rebalance Data Across Nodes
 Monitor the Cluster
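The steps above can be sketched as a short command sequence on a typical Hadoop 3.x install (the hostname and file locations are illustrative and vary by distribution):

```shell
# 1. Register the new node in the workers file on the NameNode host
#    (this file is named 'slaves' on Hadoop 2.x installs).
echo "newnode.example.com" >> $HADOOP_HOME/etc/hadoop/workers

# 2. On the new node, after installing Hadoop and copying the cluster's
#    core-site.xml and hdfs-site.xml, start the DataNode daemon:
hdfs --daemon start datanode

# 3. From any cluster node, rebalance blocks across DataNodes;
#    -threshold is the allowed deviation (%) in disk usage per node.
hdfs balancer -threshold 10

# 4. Verify the new node is listed and check per-node capacity.
hdfs dfsadmin -report
```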
Hadoop Streaming
 A Hadoop-specific utility that allows users to write MapReduce programs in any
language that can read from stdin and write to stdout.
 It is not for real-time streaming; it operates on batch processing of large datasets
stored in HDFS.
 Allows users to process data in parallel using the MapReduce framework.

Key Features:
 Executes scripts in non-Java languages for batch processing.
 Part of Hadoop's MapReduce ecosystem.
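A minimal sketch of a streaming word count in Python. In a real job the mapper and reducer are separate scripts reading stdin and writing stdout; here they are written as plain functions so the logic can be run locally (all file names and paths below are illustrative):

```python
# Hypothetical Hadoop Streaming word count: the mapper emits one
# "word\t1" pair per word; the reducer receives pairs sorted by key
# (Hadoop's shuffle phase guarantees this) and sums consecutive counts.
# In a real mapper.py you would write: for p in map_lines(sys.stdin): print(p)

def map_lines(lines):
    """Mapper: emit one tab-separated 'word\\t1' pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_pairs(pairs):
    """Reducer: sum counts of consecutive identical words."""
    current, total = None, 0
    for pair in pairs:
        word, count = pair.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Local dry run; sorted() stands in for Hadoop's shuffle phase.
    for out in reduce_pairs(sorted(map_lines(["the cat sat", "the dog"]))):
        print(out)
```

On a cluster, such scripts would be submitted with something like `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/hadoop/data -output /user/hadoop/out -mapper mapper.py -reducer reducer.py` (the jar location varies by install).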
HDFS COMMANDS

mkdir ls get

put cat
HDFS COMMANDS
mkdir:
hdfs dfs -mkdir /path/to/directory
hdfs dfs -mkdir /user/hadoop/data

ls:
hdfs dfs -ls /path/to/directory
hdfs dfs -ls /user/hadoop

put:
hdfs dfs -put /local/path/file /hdfs/path
hdfs dfs -put /home/user/data.txt /user/hadoop/data

cat:
hdfs dfs -cat /path/to/file
hdfs dfs -cat /user/hadoop/data/data.txt

get:
hdfs dfs -get /hdfs/path/file /local/path
hdfs dfs -get /user/hadoop/data/data.txt /home/user
HDFS COMMANDS

cp mv rm

chown touchz
HDFS COMMANDS
 cp:
hdfs dfs -cp /source/path /destination/path
hdfs dfs -cp /user/hadoop/data.txt /user/hadoop/backup/data.txt

 mv:
hdfs dfs -mv /source/path /destination/path
hdfs dfs -mv /user/hadoop/data.txt /user/hadoop/old_data.txt

 rm:
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory

 chown:
hdfs dfs -chown [user]:[group] /path/to/file_or_directory
hdfs dfs -chown hadoop:supergroup /user/hadoop/data.txt

 touchz:
hdfs dfs -touchz /path/to/file
hdfs dfs -touchz /user/hadoop/empty_file.txt
HDFS COMMANDS

du df setrep

clear stat
HDFS COMMANDS
 du:
hdfs dfs -du [-s] [-h] /path
hdfs dfs -du /user/hadoop/project

 df:
hdfs dfs -df [path]
hdfs dfs -df /
Filesystem             Size    Used   Available  Use%
hdfs://localhost:9000  1000GB  400GB  600GB      40%

 setrep:
hdfs dfs -setrep -w [replication_factor] /path
hdfs dfs -setrep -w 3 /user/hadoop/project/data.txt

 stat:
hdfs dfs -stat [format] /path
hdfs dfs -stat "%n %b %r %y" /user/hadoop/project/data.txt

 clear:
clear
(Clears the terminal screen; this is a shell command, not an HDFS one.)
Thank you
