Slide 2: GFS and Hadoop
(Hadoop)
Instructor: Trong-Hop Do
July 4th, 2021
S3Lab – Smart Software System Laboratory
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Distributed File System
Recap – Big data 5V’s
Big data case study – Google File System
The 2003 Google File System paper
What is Google File System?
GFS design consideration: Commodity hardware
GFS design consideration: Large files
GFS design consideration: File operations
GFS design consideration: Chunks
GFS design consideration: Replicas
GFS design consideration: Large files
GFS architecture
GFS: Read
GFS: Write
GFS: Heartbeat
GFS: Ensure chunk replica count
GFS: Single Master for multi-TB cluster
GFS: Operation logs
Use case of data processing in GFS
Map Reduce in GFS
Map + Reduce
Hadoop
Hadoop architecture
Introduction
● Hadoop is a software framework written in Java for distributed processing of large datasets (terabytes to petabytes of data) across large clusters (thousands of nodes) of computers. It includes the following key components (a small job-driver sketch follows the list):
○ Hadoop Common: common utilities shared by the other modules
○ Hadoop Distributed File System (HDFS) (storage component): a distributed file system that provides high-throughput access to application data
○ Hadoop YARN (scheduling): a framework for job scheduling and cluster resource management (available from Hadoop 2.x)
○ Hadoop MapReduce (processing): a YARN-based system for parallel processing of large data sets
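To see how these components fit together, here is a minimal sketch of a MapReduce job driver using the standard org.apache.hadoop.mapreduce API: HDFS holds the input and output paths, YARN schedules the job, and MapReduce executes it. The WordCountMapper and WordCountReducer classes are placeholders here (a mapper sketch appears in the Map Reduce section later).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // placeholder mapper class
        job.setReducerClass(WordCountReducer.class);  // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}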
Introduction
● Hadoop is a large and active ecosystem.
● Hadoop emerged as a solution for big data problems.
● Open source under the friendly Apache License
● Originally built as infrastructure for the “Nutch” project.
● Based on Google’s MapReduce and Google File System.
Features
Hadoop 1.x architecture
Core components
5 Hadoop daemons: NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker
Architecture
Multi-Node Cluster
What makes Hadoop unique
HDFS
● Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
● Designed to reliably store data on commodity hardware (which fails all the time)
● Intended for large files and batch inserts
● Splits each large data file into blocks
● Each block is replicated on multiple (slave) nodes
● The HDFS component is divided into two sub-components: NameNode and DataNode (see the client-side sketch below)
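From a client’s point of view, HDFS behaves like an ordinary file system. A minimal sketch of writing and reading a file through the org.apache.hadoop.fs API; the NameNode address and the file path here are assumptions for illustration:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally taken from core-site.xml; the address is assumed here.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}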
HDFS
● NameNode:
○ Master of the system; the daemon runs on the master machine
○ Maintains, monitors, and manages the blocks present on the DataNodes
○ Records the metadata of files, such as block locations, file size, permissions, and hierarchy
○ Captures all changes to the metadata, such as deletion, creation, and renaming of files, in edit logs
○ Regularly receives heartbeats and block reports from the DataNodes
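That metadata is exactly what a client gets back when it lists a directory. A minimal sketch (the /user/demo directory is a hypothetical example; the size, replication factor, and permissions printed here are all answered from the NameNode’s metadata):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s  %d bytes  replication=%d  %s%n",
                st.getPath(), st.getLen(), st.getReplication(), st.getPermission());
        }
    }
}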
48
Big Data
HDFS
● All of the Hadoop server processes (daemons) serve a web UI. For the NameNode, it is on port 50070 by default.
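Besides the browser pages, the same embedded web server exposes daemon metrics as JSON at the /jmx endpoint. A small sketch using Java 11’s HttpClient, assuming the NameNode runs on localhost:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NameNodeJmx {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:50070/jmx"))  // NameNode web port
            .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());  // JSON dump of NameNode metrics
    }
}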
HDFS
Secondary namenode
● The secondary namenode daemon performs periodic housekeeping functions for the namenode.
● It creates checkpoints of the filesystem metadata (fsimage) by merging the edits logfile and the fsimage file from the namenode daemon.
● If the namenode daemon fails, this checkpoint can be used to rebuild the filesystem metadata.
● Checkpoints are taken at intervals, so checkpoint data can be slightly outdated; rebuilding the fsimage file from such a checkpoint can lead to data loss.
● For large clusters, it is recommended to host the secondary namenode daemon on a separate machine.
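How often checkpoints happen is configurable. A small sketch reading the interval through the Configuration API; the property name dfs.namenode.checkpoint.period and its one-hour default are the Hadoop 2.x convention (Hadoop 1.x used fs.checkpoint.period):

import org.apache.hadoop.conf.Configuration;

public class CheckpointInterval {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Falls back to 3600 seconds (1 hour) if hdfs-site.xml does not set it.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        System.out.println("Checkpoint every " + periodSeconds + " seconds");
    }
}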
HDFS
● DataNode:
○ The DataNode daemon runs on the slave machines.
○ It stores the actual business data.
○ It serves read and write requests from users.
○ The DataNode does the ground work of creating, replicating, and deleting blocks on the command of the NameNode.
○ Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of HDFS (an illustrative sketch of the pattern follows).
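The heartbeat mechanism itself is internal to Hadoop, but the pattern is easy to illustrate. The following is a conceptual sketch only, not actual DataNode code; reportToNameNode() is a stand-in:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatSketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fire a heartbeat every 3 seconds, matching the HDFS default interval.
        scheduler.scheduleAtFixedRate(HeartbeatSketch::reportToNameNode, 0, 3, TimeUnit.SECONDS);
    }

    private static void reportToNameNode() {
        // The real message carries node status and storage reports; we just log.
        System.out.println("heartbeat at " + System.currentTimeMillis());
    }
}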
HDFS
Architecture
HDFS
Write & Read files
Map Reduce
Map Reduce
● The Mapper:
○ Each block is processed in isolation by a map task called a mapper
○ The map task runs on the node where the block is stored
○ It iterates over a large number of records
○ It extracts something of interest from each record (see the sketch after this list)
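As a concrete example of “extract something of interest from each record”, the classic word-count mapper emits a (word, 1) pair for every word in its input, using the standard org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // The framework calls map() once per record (here, one line of text)
    // in the block assigned to this map task.
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // the "something of interest": the word
        }
    }
}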
Map Reduce
Combined Hadoop architecture
Map Reduce
Good for ...
Map Reduce
Not good for ...
● Low-latency jobs
● Jobs on small datasets
● Finding individual records
Map Reduce
Limitations
● Scalability
○ Maximum cluster size of ~4,500 nodes and ~40,000 concurrent tasks
○ Coarse-grained synchronization in the JobTracker
● Availability
○ A JobTracker failure kills all queued and running jobs
Hadoop 1.x
Hadoop 2.x
Following are the four main improvements in Hadoop 2.x over Hadoop 1.x, each paired below with the Hadoop 1.x limitation it addresses: HDFS Federation, NameNode High Availability, YARN, and support for non-MapReduce applications.
There are additional features such as the Capacity Scheduler (enabling multi-tenancy support in Hadoop), data snapshots, support for Windows, and NFS access, all enabling increased Hadoop adoption in industry to solve big data problems.
Limitation of Hadoop 1.x
No Horizontal Scalability of NameNode
Hadoop 2.x
HDFS Federation
Limitation of Hadoop 1.x
NameNode – Single point of failure
Hadoop 2.x
NameNode High Availability
Limitation of Hadoop 1.x
Overburdened JobTracker
Hadoop 2.x
YARN (Yet Another Resource Negotiator)
Limitation of Hadoop 1.x
Unutilized data in HDFS
Hadoop 2.x
Non-MapReduce Big Data Application
Limitation of Hadoop 1.x
No Multi-tenancy Support
Hadoop 2.x
Capacity Scheduler – Multi-tenancy Support
Ecosystem
Hadoop Ecosystem History
The Google Stack
Hadoop Ecosystem History
LinkedIn
Install Cloudera Quickstart VM
• Download Cloudera Quickstart VM for VirtualBox
https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
Install Cloudera Quickstart VM
Choose “cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf”
Install Cloudera Quickstart VM
VMware
● Download Cloudera Quickstart VM for VMware
https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.13.0-0-vmware.zip
Projects
● Data classification
● Data anomaly detection
● Tourist behaviour analysis
● Credit scoring
● Price forecasting
● Streaming data
● Traffic data analysis
● Time series data
Q&A