Slide 2: GFS and Hadoop

This document provides an overview of the Google File System (GFS) and how it influenced the development of Hadoop. GFS was designed by Google to reliably store very large files on low-cost commodity hardware. It uses a master-slave architecture with a single master server to manage metadata and chunk replicas. GFS inspired both HDFS for storage and MapReduce for distributed processing in Hadoop.


Big Data (Hadoop)
Instructor: Trong-Hop Do
July 4th, 2021
S3Lab (Smart Software System Laboratory)
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.” – Chris Lynch, Vertica Systems
Distributed File System
Google File System
Recap – Big data 5V’s (Volume, Velocity, Variety, Veracity, Value)
Big data case study – Google File System

The 2003 Google File System paper
What is Google File System?
GFS design consideration: Commodity hardware
GFS design consideration: Large files
GFS design consideration: File operations
GFS design consideration: Chunks
GFS design consideration: Replicas

GFS architecture
GFS: Read
GFS: Write
GFS: Heartbeat
GFS: Ensure chunk replica count
GFS: Single Master for multi-TB cluster
GFS: Operation logs

Use case of data processing in GFS
Map Reduce in GFS
Map + Reduce
Hadoop
Hadoop architecture
Introduction
● Hadoop is a software framework written in Java for distributed processing of large datasets (terabytes or petabytes of data) across large clusters (thousands of nodes) of computers. It includes the following key components:
○ Hadoop Common: common utilities
○ Hadoop Distributed File System (HDFS) (storage component): a distributed file system that provides high-throughput access to application data
○ Hadoop YARN (scheduling): a framework for job scheduling and cluster resource management (available from Hadoop 2.x)
○ Hadoop MapReduce (processing): a YARN-based system for parallel processing of large data sets
Introduction
● Hadoop is a large and active ecosystem.
● Hadoop emerged as a solution to big data problems.
● It is open source under the friendly Apache License.
● It was originally built as infrastructure for the “Nutch” project.
● It is based on Google’s MapReduce and the Google File System.
Features

Hadoop 1.x architecture
Core components

HDFS and MapReduce are known as the “Two Pillars” of Hadoop 1.x.

5 Hadoop daemons: NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker

Architecture
Multi-Node Cluster
What makes Hadoop unique

● Data locality and shared nothing: computation moves to the data, instead of data moving to the computation. Each node can independently process a much smaller subset of the entire dataset without needing to communicate with the others.
● Simplified programming model: allows users to quickly write and test distributed applications.
● Schema-on-read system (like NoSQL platforms), as opposed to a schema-on-write system (sketched below).
● Automatic distribution of data and work across machines.
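To make “schema-on-read” concrete, here is a minimal Java sketch. The PageView record and the tab-separated layout are hypothetical assumptions for illustration: the raw lines sit in HDFS untyped, and the schema is imposed only by the code that reads them.

// Schema-on-read: raw lines are stored as-is; this hypothetical reader
// imposes the schema (userId, url, timestamp) only at processing time.
record PageView(String userId, String url, long timestampMillis) {
    static PageView parse(String rawLine) {
        String[] f = rawLine.split("\t");   // assumed tab-separated input
        return new PageView(f[0], f[1], Long.parseLong(f[2]));
    }
}

In a schema-on-write system, by contrast, the same data would have to match a table schema before it could be loaded at all.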
Architecture
Architecture from different perspectives
HDFS

● Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
● Designed to reliably store data on commodity hardware (which crashes all the time)
● Intended for large files and batch inserts
● Distributes large data files into blocks
● Each block is replicated on multiple (slave) nodes
● The HDFS component is divided into two sub-components: NameNode and DataNode (a minimal client sketch follows)
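The sketch below uses the standard org.apache.hadoop.fs client API to write and read one file. The NameNode URI and the paths are illustrative assumptions, not values from these slides.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; use your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");
            // Write: the client asks the NameNode where to place blocks,
            // then streams bytes to a pipeline of DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }
            // Read: block locations come from the NameNode; the bytes
            // themselves are read directly from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}

Note how the NameNode is only consulted for metadata; actual data never flows through it.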
HDFS

● NameNode:
○ Master of the system; this daemon runs on the master machine
○ Maintains, monitors, and manages the blocks present on the DataNodes
○ Records the metadata of files, such as the location of blocks, file size, permissions, and hierarchy
○ Captures all changes to the metadata, such as deletion, creation, and renaming of files, in edit logs
○ Regularly receives heartbeats and block reports from the DataNodes
HDFS
● All of the Hadoop server processes (daemons) serve a web UI. For the NameNode, it is on port 50070 (e.g., http://<namenode-host>:50070; in Hadoop 3.x it moved to port 9870).
HDFS
Secondary namenode
● The secondary namenode daemon performs periodic housekeeping functions for the namenode.
● It creates checkpoints of the filesystem metadata (fsimage) present on the namenode by merging the edits logfile and the fsimage file from the namenode daemon (sketched below).
● If the namenode daemon fails, this checkpoint can be used to rebuild the filesystem metadata.
● Checkpoints are taken at intervals, so checkpoint data can be slightly outdated; rebuilding the fsimage file from such a checkpoint can therefore lose the most recent changes.
● It is recommended that the secondary namenode daemon be hosted on a separate machine for large clusters.
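Conceptually, a checkpoint is the last fsimage with every logged edit replayed on top of it. The types below are hypothetical stand-ins for illustration only, not Hadoop's internal API.

// Hypothetical types; not Hadoop's internal checkpoint code.
interface EditOp { void applyTo(FsImage image); }

class FsImage {
    // ... in-memory namespace: files, blocks, permissions ...
    void apply(EditOp op) { op.applyTo(this); }
}

class Checkpointer {
    // A checkpoint = last fsimage + replay of every logged edit.
    FsImage checkpoint(FsImage lastImage, Iterable<EditOp> editLog) {
        for (EditOp op : editLog) {
            lastImage.apply(op);
        }
        return lastImage; // persisted, and shipped back to the namenode
    }
}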
HDFS

● DataNode:
○ The DataNode daemon runs on the slave machines.
○ It stores the actual business data.
○ It serves read/write requests from users.
○ It does the ground work of creating, replicating, and deleting blocks on the command of the NameNode.
○ Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of HDFS (see the sketch below).
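A rough sketch of that heartbeat loop, with hypothetical names; the real logic lives inside Hadoop's DataNode internals, and the 3-second default corresponds to the dfs.heartbeat.interval setting.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only; not the actual DataNode implementation.
class HeartbeatSender {
    interface NameNodeClient { void heartbeat(String dataNodeId, long freeBytes); }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(NameNodeClient nameNode, String dataNodeId) {
        // Default interval: 3 seconds (dfs.heartbeat.interval).
        scheduler.scheduleAtFixedRate(
                () -> nameNode.heartbeat(dataNodeId, freeDiskBytes()),
                0, 3, TimeUnit.SECONDS);
    }

    private long freeDiskBytes() {
        return new java.io.File("/").getFreeSpace();
    }
}

If heartbeats stop arriving, the NameNode marks the DataNode dead and schedules re-replication of its blocks.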
HDFS
Data node

HDFS
Architecture
HDFS
Write & read files
Map Reduce

● A programming model for distributed computations at a massive scale
● An execution framework for organizing and performing such computations
● Data locality is king
● The MapReduce component is divided into two sub-components: JobTracker and TaskTracker
● JobTracker: takes care of all job scheduling and assigns tasks to TaskTrackers
● TaskTracker: a node in the cluster that accepts tasks (map, reduce, and shuffle operations) from the JobTracker
Map Reduce
JobTracker
● The jobtracker daemon accepts job requests from a client and schedules/assigns tasktrackers with tasks to be performed.
● The jobtracker daemon tries to assign each task to the tasktracker daemon on the node where the data to be processed is stored. This feature is called data locality.
● If that is not possible, it will at least try to assign tasks to tasktrackers within the same physical server rack (see the sketch below).
● If the node hosting the datanode and tasktracker daemons fails, the jobtracker daemon assigns the task to another tasktracker daemon on a node holding a replica of the data. This is possible because of the HDFS replication factor, under which data blocks are replicated across multiple datanodes, and it ensures that a job does not fail even if a node fails within the cluster.
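The preference order described above (node-local, then rack-local, then anywhere with a free slot) can be sketched as follows. All types here are hypothetical, not Hadoop's actual JobTracker code.

import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch of data-locality-aware task placement.
class LocalityAwareScheduler {
    record TaskTracker(String host, String rack, int freeSlots) {}
    record Split(List<String> replicaHosts, Set<String> replicaRacks) {}

    Optional<TaskTracker> pick(Split split, List<TaskTracker> trackers) {
        Optional<TaskTracker> nodeLocal = first(trackers,
                t -> split.replicaHosts().contains(t.host()));   // 1. node-local
        if (nodeLocal.isPresent()) return nodeLocal;
        Optional<TaskTracker> rackLocal = first(trackers,
                t -> split.replicaRacks().contains(t.rack()));   // 2. rack-local
        if (rackLocal.isPresent()) return rackLocal;
        return first(trackers, t -> true);                       // 3. remote read
    }

    private Optional<TaskTracker> first(List<TaskTracker> trackers,
                                        java.util.function.Predicate<TaskTracker> extra) {
        return trackers.stream()
                .filter(t -> t.freeSlots() > 0)  // only trackers with capacity
                .filter(extra)
                .findFirst();
    }
}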
Map Reduce
TaskTracker
● The tasktracker daemon accepts tasks (map, reduce, and shuffle) from the jobtracker daemon.
● It is the daemon that performs the actual tasks during a MapReduce operation.
● It periodically sends a heartbeat message to the jobtracker daemon to notify it that it is alive.
● Along with the heartbeat, it also reports the free slots it has available for processing tasks.
● It starts and monitors the map and reduce tasks, and sends progress/status information back to the jobtracker daemon.
Map Reduce
Flow
Map Reduce
● The Mapper:
○ Each block is processed in isolation by a map task called a mapper
○ The map task runs on the node where the block is stored
○ Iterates over a large number of records
○ Extracts something of interest from each

● Shuffle and sort intermediate results

● The Reducer:
○ Consolidates results from different mappers
○ Aggregates intermediate results
○ Produces the final output (a complete word-count example follows)
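As a concrete example of this mapper/shuffle/reducer split, here is the classic word-count job written against the standard org.apache.hadoop.mapreduce API. It is a minimal sketch: input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs where the block lives; emits (word, 1) for each token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) after shuffle/sort; sums the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical run, with illustrative paths: hadoop jar wordcount.jar WordCount /input /output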
Map Reduce
Combined Hadoop architecture
Map Reduce
Good for ...

● Embarrassingly parallel algorithms
● Summing, grouping, filtering, joining
● Off-line batch jobs on massive data sets
● Analyzing an entire large dataset
Map Reduce
Not good for ...

● Jobs that need shared state/coordination
○ Tasks are shared-nothing
○ Shared state requires a scalable state store
● Iterative jobs (e.g., graph algorithms)
○ Each iteration must read/write data to disk
○ The I/O and latency cost of each iteration is high
● Low-latency jobs
● Jobs on small datasets
● Finding individual records
Map Reduce
Limitations

● Scalability
○ Maximum cluster size of roughly 4,500 nodes; roughly 40,000 concurrent tasks
○ Coarse synchronization in the JobTracker
● Availability
○ A JobTracker failure kills all queued and running jobs
● Hard partition of resources into map and reduce slots
○ Low resource utilization
● Lacks support for alternate paradigms and services
○ Iterative applications implemented using MapReduce are about 10x slower
Hadoop 1.x

In small clusters, the namenode and jobtracker daemons reside on the same node. However, in larger clusters, there are dedicated nodes for the namenode and jobtracker daemons.
Hadoop 2.x

HDFS2, YARN, and MapReduce are known as the “Three Pillars” of Hadoop 2.
Hadoop 2.x

The following are the four main improvements in Hadoop 2.x over Hadoop 1.x:

● HDFS Federation: horizontal scalability of the NameNode
● NameNode High Availability: the NameNode is no longer a single point of failure
● YARN: the ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and GIRAPH
● ResourceManager: splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons, a global ResourceManager and a per-application ApplicationMaster

There are additional features, such as the Capacity Scheduler (multi-tenancy support in Hadoop), data snapshots, support for Windows, and NFS access, enabling increased Hadoop adoption in industry to solve big data problems.
Limitation of Hadoop 1.x
No horizontal scalability of the NameNode

Hadoop 2.x
HDFS Federation

Limitation of Hadoop 1.x
NameNode – single point of failure

Hadoop 2.x
NameNode High Availability

Limitation of Hadoop 1.x
Overburdened JobTracker

Hadoop 2.x
YARN (Yet Another Resource Negotiator)

Limitation of Hadoop 1.x
Unutilized data in HDFS

Hadoop 2.x
Non-MapReduce big data applications

Limitation of Hadoop 1.x
No multi-tenancy support

Hadoop 2.x
Capacity Scheduler – multi-tenancy support
Ecosystem
Hadoop Ecosystem History
The Google Stack
Linked

Ecosystem
Install Cloudera Quickstart VM
VirtualBox
● Download the Cloudera Quickstart VM for VirtualBox:
https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
● Open Oracle VM VirtualBox. Click File → Import Appliance.
● Choose “cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf”.

Install Cloudera Quickstart VM
VMware
● Download the Cloudera Quickstart VM for VMware:
https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.13.0-0-vmware.zip
Projects

● Data classification
● Data anomaly detection
● Tourist behaviour analysis
● Credit scoring
● Price forecasting
● Streaming data
● Traffic data analysis
● Time series data
Q&A

Cảm ơn đã theo dõi


Chúng tôi hy vọng cùng nhau đi đến thành công.

96
Big Data
