Unit 3 - BD - Hadoop Ecosystem
Hadoop
Apache open-source software framework
Inspired by:
- Google MapReduce
- Google File System
Let’s look at a few statistics to get an idea of how much data gets generated every day, every
minute, and every second.
Every day
NYSE generates 1.5 billion shares and trade data
Facebook stores 2.7 billion comments and likes
Google processes about 24 petabytes of data
Every minute
Facebook users share nearly 2.5 million pieces of content.
Amazon generates over $80,000 in online sales.
Twitter users tweet nearly 300,000 times.
Instagram users post nearly 220,000 new photos
Apple users download nearly 50,000 apps.
Email users send over 2,000 million messages.
YouTube users upload 72 hours of new video content.
Every second
Banking applications process more than 10,000 credit card
transactions.
Data Challenges
To process, analyze, and make sense of these different kinds of data, a system is
needed that scales and addresses these challenges.
Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search
library). Hadoop began as part of Apache Nutch (an open-source web search engine,
itself a Lucene subproject) and was later developed at Yahoo!. The name Hadoop is not an
acronym; it’s a made-up name.
Key Aspects of Hadoop
Data Management
Data Access
Data Processing
Data Storage
[Diagram: HDFS and YARN as the storage and processing layers underpinning these aspects]
HDFS
[Diagram: an HDFS cluster consisting of one NameNode and multiple DataNodes]
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
HDFS holds very large amounts of data and employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
To store such huge data, files are spread across multiple machines and
stored redundantly, so that the system can recover from data loss if a
machine fails.
It runs on commodity hardware.
Unlike many other distributed systems, HDFS is highly fault-tolerant and is
designed to run on low-cost hardware.
1. The metadata stored about a file consists of the file name, file path, number of
blocks, block IDs, and replication level.
2. This metadata is stored on the NameNode's local disk. The NameNode uses two
files for storing it: FsImage and EditLog.
3. The NameNode also keeps in its memory the locations of the DataNodes
that store the blocks of any given file. Using that information, the NameNode
can reconstruct the whole file by getting the locations of all its blocks.
Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
[Diagram: NameNode and Secondary NameNode]
With Hadoop 2.0, HDFS has automated failover with a hot standby and
full-stack resiliency built into the platform.
1. Automated Failover: Hadoop pro-actively detects NameNode host and
process failures and will automatically switch to the standby NameNode to
maintain availability for the HDFS service. There is no need for human
intervention in the process – System Administrators can sleep in peace!
2. Hot Standby: Both the Active and the Standby NameNode have up-to-date HDFS
metadata, ensuring seamless failover even for large clusters – which means
no downtime for your HDP cluster!
3. Full Stack Resiliency: The entire Hadoop stack (MapReduce, Hive, Pig,
HBase, Oozie etc.) has been certified to handle a NameNode failure scenario
without losing data or job progress. This ensures that long-running jobs which
are critical to complete on schedule are not adversely affected by a
NameNode failure.
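From a client's point of view, the hot standby is visible only through configuration. Below is a minimal, illustrative sketch of the client-side settings for an HA-enabled cluster; the nameservice name mycluster and the hosts nn1-host/nn2-host are assumptions for this example, not values from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Use a logical nameservice instead of a single NameNode host (assumed name: mycluster)
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");

        // Two NameNodes participate: one active, one hot standby (assumed hosts)
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1-host:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2-host:8020");

        // Proxy provider that fails over to the standby when the active NameNode is unreachable
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client code never needs to know which NameNode is currently active
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}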
All machines in a rack are connected through the same network switch, so if that
switch goes down, all machines in that rack are out of service – the whole
rack is down. Rack Awareness was introduced in Apache Hadoop to overcome
this issue. With Rack Awareness, the NameNode chooses DataNodes that are in
the same rack or a nearby rack. The NameNode maintains the rack ID of each
DataNode and chooses DataNodes based on this rack information. The NameNode
also ensures that not all replicas of a block are stored on the same rack. The
default replication factor is 3. Therefore, according to the Rack Awareness
algorithm (a small illustrative sketch follows these rules):
When the Hadoop framework creates a new block, it places the first replica on
the local node, the second on a node in a different rack, and the third on a
different node in that same remote rack.
When re-replicating a block, if the number of existing replicas is one, place
the second on a different rack.
When the number of existing replicas is two and both are in the same rack,
place the third one on a different rack.
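The three placement rules above can be condensed into a few lines of code. The sketch below is illustrative only: it mimics the default placement behaviour described on this slide for replication factor 3, and does not reproduce Hadoop's actual BlockPlacementPolicyDefault implementation.

import java.util.ArrayList;
import java.util.List;

public class RackAwarePlacementSketch {

    // Simplified cluster node: just a name and a rack ID
    record Node(String name, String rackId) {}

    // Choose target nodes for a new block with replication factor 3
    static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();

        // Rule 1: first replica on the local (writer's) node
        targets.add(writer);

        // Rule 2: second replica on a node in a different rack
        Node remote = cluster.stream()
                .filter(n -> !n.rackId().equals(writer.rackId()))
                .findFirst()
                .orElseThrow();
        targets.add(remote);

        // Rule 3: third replica on a different node in the same remote rack
        Node sameRemoteRack = cluster.stream()
                .filter(n -> n.rackId().equals(remote.rackId()) && !n.name().equals(remote.name()))
                .findFirst()
                .orElseThrow();
        targets.add(sameRemoteRack);

        return targets;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("DN1", "r1"), new Node("DN2", "r1"),
                new Node("DN3", "r2"), new Node("DN4", "r2"));
        // Writer sits on DN1 in rack r1; expect replicas on DN1, DN3, DN4
        System.out.println(chooseTargets(new Node("DN1", "r1"), cluster));
    }
}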
Rack Awareness & Replication
[Diagram: three racks, each containing DataNodes DN1–DN4; blocks B1, B2, and B3 are each replicated on DataNodes in different racks]
HDFS follows a write-once, read-many model: files already stored in HDFS
cannot be edited, but data can be appended by reopening the file.
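A hedged sketch of such an append through the public FileSystem API is shown below; the path /user/demo/log.txt is an assumed example, and append must be supported and enabled on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/log.txt"); // assumed example path

        // The existing contents cannot be edited in place, but the file can be
        // reopened in append mode and new bytes written at the end.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("one more line\n");
        }
        fs.close();
    }
}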
Anatomy of File Write
[Diagram: the client calls create() on DistributedFileSystem (1. create), which asks the NameNode to create the file (2. create); the client writes to the returned FSDataOutputStream (3. write); data packets are streamed through a pipeline of DataNode1 → DataNode2 → DataNode3 (4. write packet) and acknowledgements flow back (5. acknowledge packet); the client then closes the stream (6. close) and the NameNode is notified that the file is complete (7. complete)]
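From the client's point of view, the whole pipeline above is hidden behind a single output stream. A minimal sketch of a client-side write, assuming an illustrative path /user/demo/sample.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Steps 1-2: DistributedFileSystem asks the NameNode to create the file entry
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // assumed example path

        // Step 3: create() returns an FSDataOutputStream; packets written to it
        // are streamed through the DataNode pipeline (steps 4-5) transparently.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.writeBytes("Hello I am expert in Big Data\n");
        } // Steps 6-7: close() flushes the last packets and the file is marked complete

        fs.close();
    }
}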
Anatomy of File Read
[Diagram: the client node reads blocks directly from the DataNodes (4.1 read, 4.2 read, 4.3 read)]
1. The client opens the file that it wishes to read from by calling open() on the
DistributedFileSystem
2. DistributedFileSystem communicates with the NameNode to get the
location of the data blocks. NameNode returns the address of the
DataNodes that the data blocks are stored on. Subsequent to this,
DistributedFileSystem returns DFSInputStream (i.e. a class) to the client to
read from the file.
3. The client then calls read() on the DFSInputStream, which has the addresses
of the DataNodes holding the first few blocks of the file, and connects to the
closest DataNode for the first block in the file.
4. Client calls read() repeatedly to stream the data from the DataNode.
5. When the end of the block is reached, DFSInputStream closes the
connection with the DataNode. It repeats the steps to find the best
DataNode for the next block and subsequent blocks.
6. When the client completes the reading of the file, it calls close() on
FSDataInputStream to close the connection.
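The same walkthrough, seen from client code: open() returns a stream that internally asks the NameNode for block locations and reads from the closest DataNodes. A minimal sketch, assuming the illustrative path /user/demo/sample.txt:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);           // step 1: client-side file system handle

        Path file = new Path("/user/demo/sample.txt");  // assumed example path

        // Steps 2-3: open() obtains block locations from the NameNode and returns a stream
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // Steps 4-5: read() streams data from the closest DataNode, block by block
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // Step 6: close() tears down the DataNode connection

        fs.close();
    }
}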
HDFS Commands
Let’s assume that this sample.txt file contains a few lines of text. The content of the file is
as follows:
Hello I am expert in Big Data
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let’s assume that while storing this
file in Hadoop, HDFS broke it into four parts and named the parts first.txt,
second.txt, third.txt, and fourth.txt. So the file is divided into four equal parts,
each containing 2 lines: the first two lines go into first.txt, the next two into second.txt,
the next two into third.txt, and the last two into fourth.txt. All these parts are stored on
DataNodes, and the NameNode holds the metadata about them. All of this is the task of HDFS.
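To make the split between data (on DataNodes) and metadata (on the NameNode) concrete, the sketch below copies a local sample.txt into HDFS and then asks for the locations of its blocks; the paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("sample.txt");              // assumed local file
        Path remote = new Path("/user/demo/sample.txt");  // assumed HDFS path

        // HDFS splits the file into blocks and replicates them across DataNodes
        fs.copyFromLocalFile(local, remote);

        // The NameNode's metadata tells us which DataNodes hold each block
        FileStatus status = fs.getFileStatus(remote);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}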