
Big Data (CS-3032)

Kalinga Institute of Industrial Technology


Deemed to be University
Bhubaneswar-751024

School of Computer Engineering

Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission

3 Credit Lecture Note


Course Contents

Sr #  Major and Detailed Coverage Area                                      Hrs
3     Hadoop Ecosystem:                                                      8
      Introduction to Hadoop, Hadoop Ecosystem, Hadoop Distributed File
      System, MapReduce, YARN, Hive, Pig and Pig Latin, Jaql, Zookeeper,
      HBase, Cassandra, Oozie, Lucene, Avro, Mahout


Introduction to Hadoop

Hadoop is an open-source project of the Apache Foundation. Apache Hadoop, written in Java, is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data, and uses Google's MapReduce and the Google File System as its foundation.

Hadoop
Apache open-source software framework
Inspired by:
- Google MapReduce
- Google File System

Hadoop provides various tools and technologies, collectively termed the Hadoop ecosystem, to enable the development and deployment of Big Data solutions. It accomplishes two tasks, namely i) massive data storage, and ii) faster data processing.
Flood of data

Let's look at a few statistics to get an idea of how much data gets generated every day, every minute, and every second.
 Every day
 NYSE generates 1.5 billion shares and trade data
 Facebook stores 2.7 billion comments and likes
 Google processes about 24 petabytes of data
 Every minute
 Facebook users share nearly 2.5 million pieces of content
 Amazon generates over $80,000 in online sales
 Twitter users tweet nearly 300,000 times
 Instagram users post nearly 220,000 new photos
 Apple users download nearly 50,000 apps
 Email users send over 2,000 million messages
 YouTube users upload 72 hours of new video content
 Every second
 Banking applications process more than 10,000 credit card transactions
Data Challenges

To process, analyze, and make sense of these different kinds of data, a system is needed that scales and addresses the following challenges:

"I am flooded with data." How to store terabytes of mounting data?

"I have data in various sources, data that is rich in variety: structured, semi-structured and unstructured." How to work with data that is so very different?

"I need this data to be processed quickly. My decision is pending." How to access the information quickly?


Why Hadoop

Hadoop is chosen for its capability to handle massive amounts of data, and different categories of data, fairly quickly.


Hadoop History

Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search library). Hadoop began as part of Apache Nutch (an open-source web search engine) and was thus also part of the Lucene project; its development later continued at Yahoo. The name Hadoop is not an acronym; it's a made-up name.
Key Aspects of Hadoop


Hadoop Components

Hadoop Core Components:
 HDFS
 Storage component
 Distributes data across several nodes
 Natively redundant
 MapReduce
 Computational framework
 Splits a task across multiple nodes
 Processes data in parallel
Hadoop Ecosystem: these are support projects that enhance the functionality of the Hadoop core components. The projects are as follows:
 Hive
 Pig
 Sqoop
 Flume
 Oozie
 Mahout
 HBase


Hadoop Ecosystem

The ecosystem tools are grouped into four layers: Data Management, Data Access, Data Processing, and Data Storage.


Version of Hadoop

There are 3 versions of Hadoop available: Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x.

Hadoop 1.x vs. Hadoop 2.x

Layer                     Hadoop 1.x                         Hadoop 2.x
Data processing           MapReduce (also handles            MapReduce and other data
                          resource management)               processing frameworks
Resource management       (part of MapReduce)                YARN
Distributed file storage  HDFS (redundant, reliable          HDFS2 (redundant, highly-available,
                          storage)                           reliable storage)


Hadoop 2.x vs. Hadoop 3.x

Characteristic                  Hadoop 2.x                               Hadoop 3.x
Minimum supported Java version  Java 7                                   Java 8
Fault tolerance                 Handled by replication (which            Handled by erasure coding
                                wastes space)
Data balancing                  Uses the HDFS balancer                   Uses the intra-DataNode balancer,
                                                                         invoked via the HDFS disk
                                                                         balancer CLI
Storage scheme                  Uses a 3x replication scheme, e.g.       Supports erasure encoding in HDFS,
                                6 blocks occupy 18 blocks of space       e.g. 6 blocks occupy 9 blocks of
                                because of replication                   space: 6 for data and 3 for parity
Scalability                     Scales up to 10,000 nodes per            Scales to more than 10,000 nodes
                                cluster                                  per cluster


Hadoop Distributors

The top 8 vendors offering Big Data Hadoop solutions are:
 Integrated Hadoop solutions
 Cloudera
 HortonWorks
 Amazon Web Services Elastic MapReduce Hadoop Distribution
 Microsoft
 MapR
 IBM InfoSphere Insights
 Cloud-based Hadoop solutions
 Amazon Web Services
 Google BigQuery

High Level Hadoop 2.0 Architecture

Hadoop is a distributed master-slave architecture. A client talks to two subsystems: HDFS for distributed data storage and YARN for distributed data processing.

HDFS (distributed data storage)
 HDFS master node: Active NameNode, plus a Standby NameNode / Secondary NameNode
 HDFS slave nodes: DataNode 1 ... DataNode n

YARN (distributed data processing)
 YARN master node: ResourceManager
 YARN slave nodes: NodeManager 1 ... NodeManager n


High Level Hadoop 2.0 Architecture cont'd

In a running cluster, YARN consists of one ResourceManager coordinating a NodeManager on each worker machine, while the HDFS cluster consists of a NameNode coordinating the DataNodes; a NodeManager and a DataNode typically run side by side on each slave node.


Hadoop HDFS

 The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications.
 HDFS holds very large amounts of data and employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
 To store such huge data, the files are spread across multiple machines.
 These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure.
 HDFS runs on commodity hardware.
 Unlike some other distributed systems, HDFS is highly fault-tolerant even though it is designed to use low-cost hardware.


Hadoop HDFS Key Points

Some key points of HDFS are as follows:
1. Storage component of Hadoop.
2. Distributed file system.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS leverages a large block size and moves computation to where data is stored).
5. A file can be replicated a configurable number of times, which makes it tolerant to both software and hardware failure.
6. Re-replicates data blocks automatically for nodes that have failed.
7. Sits on top of the native file system.


HDFS Daemons

Key components of HDFS are as follows:
1. NameNode
2. DataNodes
3. Secondary NameNode
4. Standby NameNode

Blocks: User data is stored in the files of HDFS. HDFS breaks a large file into smaller pieces called blocks; in other words, a block is the minimum amount of data that HDFS can read or write. By default the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x, but it can be increased as needed by changing the HDFS configuration (see the sketch below).

Example: a 200 MB file abc.txt
 Hadoop 2.x: 128 MB (Block 1) + 72 MB (Block 2)
 Hadoop 1.x: ?

Why is the block size large?
1. It reduces the cost of seek time, and 2. it makes proper use of the storage space.
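The block size is a per-file property as well as a cluster-wide default. As a minimal sketch using the Java FileSystem API (the path, buffer size, and 256 MB block size are illustrative), a client can request a specific block size when creating a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // create(path, overwrite, bufferSize, replication, blockSize):
        // this file gets 256 MB blocks; the cluster default (dfs.blocksize) is unchanged.
        try (FSDataOutputStream out = fs.create(new Path("/sample/abc.txt"),
                true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.writeBytes("sample content\n");
        }
    }
}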
Rack

A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks. A Hadoop cluster is a collection of racks: Rack 1 through Rack N, each holding Node 1 through Node N behind its own switch.
NameNode

1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode stores only the metadata of HDFS: the directory tree of all files in the file system. It tracks the files across the cluster.
4. NameNode does not store the actual data or the dataset; the data itself is stored in the DataNodes.
5. NameNode knows the list of blocks and their locations for any given file in HDFS. With this information, NameNode knows how to construct the file from blocks.
6. NameNode is usually configured with a lot of memory (RAM).
7. NameNode is so critical to HDFS that when the NameNode is down, the HDFS/Hadoop cluster is inaccessible and considered down.
8. NameNode is a single point of failure in a Hadoop cluster (addressed by the Standby NameNode, discussed later).

Typical configuration
Processors: 2 quad-core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1 TB SATA
Network: 10 Gigabit Ethernet


NameNode Metadata

1. The metadata stored about a file consists of the file name, file path, number of blocks, block IDs, and replication level.
2. This metadata is stored on the local disk. NameNode uses two files for storing it: FsImage and EditLog.
3. NameNode also keeps in memory the locations of the DataNodes that store the blocks for any given file. Using that information, NameNode can reconstruct the whole file by getting the locations of all the blocks of a given file.

Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
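A client can ask the NameNode for this block-to-DataNode mapping directly. A minimal sketch using the Java FileSystem API, assuming the example file above exists in HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/in4072/data/part-0"));
        // The NameNode answers with one entry per block: its offset, its length,
        // and the DataNodes currently holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}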


DataNode

1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave.
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability of data or the cluster. NameNode will arrange replication for the blocks managed by the DataNode that is not available.
6. DataNode is usually configured with a lot of hard disk space, because the actual data is stored in the DataNode.

Typical configuration
Processors: 2 quad-core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1 TB SATA
Network: 10 Gigabit Ethernet


Secondary NameNode

1. The Secondary NameNode in Hadoop is more of a helper to the NameNode; it is not a backup NameNode server that can quickly take over in case of NameNode failure.
2. EditLog: all the file write operations done by client applications are first recorded in the EditLog.
3. FsImage: this file has the complete information about the file system metadata when the NameNode starts. All the operations after that are recorded in the EditLog.
4. When the NameNode is restarted, it first takes the metadata information from the FsImage and then applies all the transactions recorded in the EditLog. NameNode restarts don't happen frequently, so the EditLog grows quite large. That means merging the EditLog into the FsImage at startup takes a lot of time, keeping the whole file system offline during that process.
5. The Secondary NameNode takes over this job of merging FsImage and EditLog, keeping the FsImage current to save a lot of time. Its main function is to checkpoint the file system metadata stored on the NameNode.
Secondary NameNode cont'd

The process followed by the Secondary NameNode to periodically merge the FsImage and EditLog files is as follows:
1. The Secondary NameNode pulls the latest FsImage and EditLog files from the primary NameNode.
2. The Secondary NameNode applies each transaction from the EditLog file to the FsImage to create a new merged FsImage file.
3. The merged FsImage file is transferred back to the primary NameNode.

This checkpoint cycle repeats periodically; in effect, the Secondary NameNode tells the NameNode: "It's been an hour, provide your metadata."


Standby NameNode

With Hadoop 2.0, HDFS has automated failover built into the platform, with a hot standby and full stack resiliency.
1. Automated failover: Hadoop proactively detects NameNode host and process failures and will automatically switch to the Standby NameNode to maintain availability for the HDFS service. There is no need for human intervention in the process: system administrators can sleep in peace!
2. Hot standby: both the Active and Standby NameNodes have up-to-date HDFS metadata, ensuring seamless failover even for large clusters, which means no downtime for your HDP cluster!
3. Full stack resiliency: the entire Hadoop stack (MapReduce, Hive, Pig, HBase, Oozie, etc.) has been certified to handle a NameNode failure scenario without losing data or job progress. This is vital to ensure that long-running jobs that are critical to complete on schedule will not be adversely affected during a NameNode failure scenario.


Replication

HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, and it is configurable. Therefore, if a 128 MB file is stored in HDFS using the default configuration, it occupies a total of 384 MB (3 x 128 MB), as the blocks are replicated three times and each replica resides on a different DataNode.
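Replication is likewise a per-file attribute. A small sketch (the path is illustrative) that reads a file's replication factor and asks the NameNode to change it through the Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/sample/abc.txt"); // hypothetical existing file
        // The current replication factor is recorded in the file's metadata.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication = " + current);
        // Ask the NameNode to re-replicate this file to 2 copies per block.
        fs.setReplication(file, (short) 2);
    }
}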


Rack Awareness

All machines in a rack are connected using the same network switch, and if that switch goes down then all machines in that rack are out of service: the rack is down. Rack Awareness was introduced by Apache Hadoop to overcome this issue. With Rack Awareness, the NameNode chooses DataNodes that are in the same rack or a nearby rack. The NameNode maintains the rack ID of each DataNode to obtain this rack information, and chooses DataNodes based on it. The NameNode also ensures that all the replicas are not stored on the same rack or a single rack. The default replication factor is 3, and according to the Rack Awareness algorithm:
 When the Hadoop framework creates a new block, it places the first replica on the local node, the second one on a node in a different (remote) rack, and the third one on a different node in the same remote rack.
 When re-replicating a block, if the number of existing replicas is one, place the second on a different rack.
 When the number of existing replicas is two and the two replicas are in the same rack, place the third one on a different rack.
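When rack awareness is configured, the resulting placement is visible from the client side. A sketch, assuming the cluster has been given topology information and the file path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RackPlacementDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/sample/abc.txt"));
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            // With rack awareness configured, topology paths carry the rack,
            // e.g. /rack1/datanode-host; otherwise the default rack appears.
            System.out.println(String.join(" ", b.getTopologyPaths()));
        }
    }
}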
Rack Awareness & Replication

Example: a file consists of blocks B1, B2, and B3, each replicated three times across three racks of four DataNodes (DN 1-4):
 B1: DN 2 of Rack 1; DN 1 and DN 3 of Rack 2
 B2: DN 2 of Rack 2; DN 1 and DN 3 of Rack 3
 B3: DN 1 and DN 3 of Rack 1; DN 2 of Rack 3
Each block has replicas on exactly two racks (one replica on one rack, two on another), so no single rack failure can lose a block.


Rack Awareness Advantages

 Provides higher bandwidth and lower latency: this policy maximizes network bandwidth by transferring blocks within a rack rather than between racks. YARN is able to optimize MapReduce job performance by assigning tasks to nodes that are closer to their data in terms of network topology.
 Provides data protection against rack failure: the NameNode assigns the 2nd and 3rd replicas of a block to nodes in a different rack from the first replica. Thus, it provides data protection even against rack failure. However, this is possible only if Hadoop was configured with knowledge of its rack configuration.
 Minimizes write cost and maximizes read speed: the rack awareness policy directs read/write requests to replicas that are in the same rack. Thus, it minimizes write cost and maximizes read speed.


Anatomy of File Write

HDFS follows a write-once, read-many model, so files already stored in HDFS cannot be edited, but data can be appended by reopening the file (see the sketch below). The write path connects the client, the NameNode, and a pipeline of DataNodes:

1. Create: the HDFS client calls DistributedFileSystem.
2. Create: DistributedFileSystem issues an RPC to the NameNode.
3. Write: the client writes to an FSDataOutputStream.
4. Write packet: packets flow down the DataNode pipeline (DataNode1 to DataNode2 to DataNode3).
5. Acknowledge packet: acknowledgements flow back up the pipeline.
6. Close: the client closes the stream.
7. Complete: the NameNode is informed that the file is complete.
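As noted above, an existing file can be reopened for append. A minimal sketch using the Java FileSystem API; the path is illustrative and the cluster must permit appends:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Reopen an existing HDFS file for append; its original contents are untouched.
        try (FSDataOutputStream out = fs.append(new Path("/sample/test.txt"))) {
            out.writeBytes("one more line\n");
        }
    }
}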
Anatomy of File Write cont'd

1. The client calls the create function on DistributedFileSystem (a class that extends FileSystem) to create a file.
2. The RPC call to the NameNode happens through the DistributedFileSystem to create a new file. The NameNode performs various checks (e.g. that the file does not already exist) before creating the new file. Initially, the NameNode creates the file without associating any data blocks to it. The DistributedFileSystem returns an FSDataOutputStream (i.e. a class instance) to the client to perform the write.
3. As the client writes data, the data is split into packets by DFSOutputStream (i.e. a class), which are then written to an internal queue, called the data queue. DataStreamer (i.e. a class) consumes the data queue. The DataStreamer requests the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store the replicas. The list of DataNodes makes a pipeline. With the default replication factor of 3, there will be 3 nodes in the pipeline for the first block.


Anatomy of File Write cont'd

4. DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. In the same way, the second DataNode stores the packet and forwards it to the third DataNode in the pipeline.
5. In addition to the internal data queue, DFSOutputStream also manages an "ack queue" of packets that are waiting to be acknowledged by DataNodes. A packet is removed from the ack queue only when it has been acknowledged by all the DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the relevant acknowledgements before communicating with the NameNode to inform the client that the creation of the file is complete.
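From the client's point of view, all seven steps hide behind a few calls. A minimal sketch of the write path using the Java FileSystem API (path and content are illustrative):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteDemo {
    public static void main(String[] args) throws Exception {
        // For an hdfs:// default filesystem this is a DistributedFileSystem.
        FileSystem fs = FileSystem.get(new Configuration());
        // Steps 1-2: create() asks the NameNode to record the new file (no blocks yet).
        try (FSDataOutputStream out = fs.create(new Path("/sample/test.txt"))) {
            // Steps 3-5: writes are packetized and streamed through the DataNode pipeline.
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        } // Steps 6-7: close() flushes remaining packets and completes the file.
    }
}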


Anatomy of File Read

1. Open: the HDFS client calls DistributedFileSystem.
2. Get block locations: DistributedFileSystem issues an RPC to the NameNode.
3. Read: the client reads from an FSDataInputStream.
4. Read from DataNodes: the stream reads block after block (4.1 from DataNode1, 4.2 from DataNode2, 4.3 from DataNode3).
5. Close: the client closes the stream.


Anatomy of File Read cont'd

1. The client opens the file that it wishes to read by calling open() on the DistributedFileSystem.
2. DistributedFileSystem communicates with the NameNode to get the locations of the data blocks. The NameNode returns the addresses of the DataNodes that the data blocks are stored on. Subsequent to this, the DistributedFileSystem returns a DFSInputStream (i.e. a class) to the client to read from the file.
3. The client then calls read() on the stream DFSInputStream, which has the addresses of the DataNodes for the first few blocks of the file, and connects to the closest DataNode for the first block in the file.
4. The client calls read() repeatedly to stream the data from the DataNode.
5. When the end of a block is reached, DFSInputStream closes the connection with the DataNode. It repeats these steps to find the best DataNode for the next block and subsequent blocks.
6. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
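The corresponding client-side read path, again as a minimal sketch with an illustrative path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Steps 1-2: open() fetches the block locations from the NameNode.
        try (FSDataInputStream in = fs.open(new Path("/sample/test.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // Steps 3-5: read() streams data from the closest DataNode, block by block.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // Step 6: close() ends the connection.
    }
}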
HDFS Commands

 To get the list of directories and files at the root of HDFS:
hadoop fs -ls /
 To create a directory (say, sample) in HDFS:
hadoop fs -mkdir /sample
 To copy a file from the local file system to HDFS:
hadoop fs -put /root/sample/test.txt /sample/test.txt
 To copy a file from HDFS to the local file system:
hadoop fs -get /sample/test.txt /root/sample/test.txt
 To display the contents of an HDFS file on the console:
hadoop fs -cat /sample/test.txt
 To copy a file from one directory to another on HDFS:
hadoop fs -cp /sample/test.txt /sample1
 To remove a directory from HDFS:
hadoop fs -rm -r /sample1
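Each shell command above also has a programmatic counterpart. A sketch of the equivalents through the Java FileSystem API (paths match the shell examples above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus s : fs.listStatus(new Path("/"))) {      // hadoop fs -ls /
            System.out.println(s.getPath());
        }
        fs.mkdirs(new Path("/sample"));                          // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),  // hadoop fs -put
                new Path("/sample/test.txt"));
        fs.copyToLocalFile(new Path("/sample/test.txt"),         // hadoop fs -get
                new Path("/root/sample/test.txt"));
        fs.delete(new Path("/sample1"), true);                   // hadoop fs -rm -r /sample1
    }
}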


HDFS Example

Let's assume that a file sample.txt contains a few lines of text. The content of the file is as follows:
Hello I am expert in Big Data
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let's assume that while storing this file in Hadoop, HDFS broke it into four parts and named the parts first.txt, second.txt, third.txt, and fourth.txt. So you can easily see that the file is divided into four equal parts, each containing 2 lines: the first two lines in first.txt, the next two lines in second.txt, the next two in third.txt, and the last two lines in fourth.txt. All these parts are stored in the DataNodes, and the NameNode contains the metadata about them. All this is the task of HDFS.
