Slide 2: GFS and Hadoop

This document provides an overview of the Google File System (GFS) and how it influenced the development of Hadoop. GFS was designed by Google to reliably store very large files on low-cost commodity hardware. It uses a master-slave architecture with a single master server to manage metadata and chunk replicas. GFS inspired both HDFS for storage and MapReduce for distributed processing in Hadoop.


Big Data (Hadoop)
Instructor: Trong-Hop Do
July 4th, 2021
S3Lab (Smart Software System Laboratory)
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.” – Chris Lynch, Vertica Systems
Distributed File System
Google File System
Recap – Big data 5V’s (Volume, Velocity, Variety, Veracity, Value)
Big data case study – Google File System

The 2003 Google File System paper
What is Google File System?
GFS design consideration: Commodity hardware
GFS design consideration: Large files
GFS design consideration: File operations
GFS design consideration: Chunks
GFS design consideration: Replicas

GFS architecture
GFS: Read
GFS: Write
GFS: Heartbeat
GFS: Ensure chunk replica count
GFS: Single Master for multi-TB cluster
GFS: Operation logs

Use case of data processing in GFS
Map Reduce in GFS
Map + Reduce
Hadoop
Hadoop architecture
Introduction
● Hadoop is a software framework written in Java for distributed processing of large datasets (terabytes or petabytes of data) across large clusters (thousands of nodes) of computers. It includes the following key components:
○ Hadoop Common: common utilities
○ Hadoop Distributed File System (HDFS) (storage component): a distributed file system that provides high-throughput access to application data
○ Hadoop YARN (scheduling): a framework for job scheduling and cluster resource management (available from Hadoop 2.x)
○ Hadoop MapReduce (processing): a YARN-based system for parallel processing of large data sets
Introduction
● Hadoop is a large and active ecosystem.
● Hadoop emerged as a solution to big data problems.
● It is open source under the friendly Apache License.
● It was originally built as infrastructure for the “Nutch” project.
● It is based on Google’s MapReduce and the Google File System.
Features

Hadoop 1.x architecture
Core components

HDFS and MapReduce are known as the “Two Pillars” of Hadoop 1.x.

5 Hadoop daemons: NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker

Architecture
Multi-Node Cluster
What makes Hadoop unique

● Data locality and shared nothing: computation moves to the data, instead of data moving to the computation. Each node can independently process a much smaller subset of the entire dataset without needing to communicate with the others.
● Simplified programming model: allows users to quickly write and test distributed applications.
● Schema-on-read system (like NoSQL platforms), as opposed to a schema-on-write system (sketched below).
● Automatic distribution of data and work across machines.
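To make “schema-on-read” concrete, here is a minimal Java sketch. The PageView record and the tab-separated layout are hypothetical assumptions for illustration: the raw lines sit in HDFS untyped, and the schema is imposed only by the code that reads them.

// Schema-on-read: raw lines are stored as-is; this hypothetical reader
// imposes the schema (userId, url, timestamp) only at processing time.
record PageView(String userId, String url, long timestampMillis) {
    static PageView parse(String rawLine) {
        String[] f = rawLine.split("\t");   // assumed tab-separated input
        return new PageView(f[0], f[1], Long.parseLong(f[2]));
    }
}

In a schema-on-write system, by contrast, the same data would have to match a table schema before it could be loaded at all.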
Architecture
Architecture from different perspectives
HDFS

● Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
● Designed to reliably store data on commodity hardware (which crashes all the time)
● Intended for large files and batch inserts
● Distributes large data files into blocks
● Each block is replicated on multiple (slave) nodes
● The HDFS component is divided into two sub-components: NameNode and DataNode (a minimal client sketch follows)
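The sketch below uses the standard org.apache.hadoop.fs client API to write and read one file. The NameNode URI and the paths are illustrative assumptions, not values from these slides.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; use your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");
            // Write: the client asks the NameNode where to place blocks,
            // then streams bytes to a pipeline of DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }
            // Read: block locations come from the NameNode; the bytes
            // themselves are read directly from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}

Note how the NameNode is only consulted for metadata; actual data never flows through it.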
HDFS

● NameNode:
○ Master of the system; this daemon runs on the master machine
○ Maintains, monitors, and manages the blocks present on the DataNodes
○ Records the metadata of files, such as the location of blocks, file size, permissions, and hierarchy
○ Captures all changes to the metadata, such as deletion, creation, and renaming of files, in edit logs
○ Regularly receives heartbeats and block reports from the DataNodes
HDFS
● All of the Hadoop server processes (daemons) serve a web UI. For the NameNode, it is on port 50070 (e.g., http://<namenode-host>:50070; in Hadoop 3.x it moved to port 9870).
HDFS
Secondary namenode
● The secondary namenode daemon performs periodic housekeeping functions for the namenode.
● It creates checkpoints of the filesystem metadata (fsimage) present on the namenode by merging the edits logfile and the fsimage file from the namenode daemon (sketched below).
● If the namenode daemon fails, this checkpoint can be used to rebuild the filesystem metadata.
● Checkpoints are taken at intervals, so checkpoint data can be slightly outdated; rebuilding the fsimage file from such a checkpoint can therefore lose the most recent changes.
● It is recommended that the secondary namenode daemon be hosted on a separate machine for large clusters.
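Conceptually, a checkpoint is the last fsimage with every logged edit replayed on top of it. The types below are hypothetical stand-ins for illustration only, not Hadoop's internal API.

// Hypothetical types; not Hadoop's internal checkpoint code.
interface EditOp { void applyTo(FsImage image); }

class FsImage {
    // ... in-memory namespace: files, blocks, permissions ...
    void apply(EditOp op) { op.applyTo(this); }
}

class Checkpointer {
    // A checkpoint = last fsimage + replay of every logged edit.
    FsImage checkpoint(FsImage lastImage, Iterable<EditOp> editLog) {
        for (EditOp op : editLog) {
            lastImage.apply(op);
        }
        return lastImage; // persisted, and shipped back to the namenode
    }
}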
HDFS

● DataNode:
○ The DataNode daemon runs on the slave machines.
○ It stores the actual business data.
○ It serves read/write requests from users.
○ It does the ground work of creating, replicating, and deleting blocks on the command of the NameNode.
○ Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of HDFS (see the sketch below).
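A rough sketch of that heartbeat loop, with hypothetical names; the real logic lives inside Hadoop's DataNode internals, and the 3-second default corresponds to the dfs.heartbeat.interval setting.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only; not the actual DataNode implementation.
class HeartbeatSender {
    interface NameNodeClient { void heartbeat(String dataNodeId, long freeBytes); }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(NameNodeClient nameNode, String dataNodeId) {
        // Default interval: 3 seconds (dfs.heartbeat.interval).
        scheduler.scheduleAtFixedRate(
                () -> nameNode.heartbeat(dataNodeId, freeDiskBytes()),
                0, 3, TimeUnit.SECONDS);
    }

    private long freeDiskBytes() {
        return new java.io.File("/").getFreeSpace();
    }
}

If heartbeats stop arriving, the NameNode marks the DataNode dead and schedules re-replication of its blocks.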
HDFS
Data node

HDFS
Architecture
HDFS
Write & read files
Map Reduce

● A programming model for distributed computations at a massive scale
● An execution framework for organizing and performing such computations
● Data locality is king
● The MapReduce component is divided into two sub-components: JobTracker and TaskTracker
● JobTracker: takes care of all job scheduling and assigns tasks to TaskTrackers
● TaskTracker: a node in the cluster that accepts tasks (map, reduce, and shuffle operations) from the JobTracker
Map Reduce
JobTracker
● The jobtracker daemon accepts job requests from a client and schedules/assigns tasktrackers with tasks to be performed.
● The jobtracker daemon tries to assign each task to the tasktracker daemon on the node where the data to be processed is stored. This feature is called data locality.
● If that is not possible, it will at least try to assign tasks to tasktrackers within the same physical server rack (see the sketch below).
● If the node hosting the datanode and tasktracker daemons fails, the jobtracker daemon assigns the task to another tasktracker daemon on a node holding a replica of the data. This is possible because of the HDFS replication factor, under which data blocks are replicated across multiple datanodes, and it ensures that a job does not fail even if a node fails within the cluster.
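The preference order described above (node-local, then rack-local, then anywhere with a free slot) can be sketched as follows. All types here are hypothetical, not Hadoop's actual JobTracker code.

import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch of data-locality-aware task placement.
class LocalityAwareScheduler {
    record TaskTracker(String host, String rack, int freeSlots) {}
    record Split(List<String> replicaHosts, Set<String> replicaRacks) {}

    Optional<TaskTracker> pick(Split split, List<TaskTracker> trackers) {
        Optional<TaskTracker> nodeLocal = first(trackers,
                t -> split.replicaHosts().contains(t.host()));   // 1. node-local
        if (nodeLocal.isPresent()) return nodeLocal;
        Optional<TaskTracker> rackLocal = first(trackers,
                t -> split.replicaRacks().contains(t.rack()));   // 2. rack-local
        if (rackLocal.isPresent()) return rackLocal;
        return first(trackers, t -> true);                       // 3. remote read
    }

    private Optional<TaskTracker> first(List<TaskTracker> trackers,
                                        java.util.function.Predicate<TaskTracker> extra) {
        return trackers.stream()
                .filter(t -> t.freeSlots() > 0)  // only trackers with capacity
                .filter(extra)
                .findFirst();
    }
}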
Map Reduce
TaskTracker
● The tasktracker daemon accepts tasks (map, reduce, and shuffle) from the jobtracker daemon.
● It is the daemon that performs the actual tasks during a MapReduce operation.
● It periodically sends a heartbeat message to the jobtracker daemon to notify it that it is alive.
● Along with the heartbeat, it also reports the free slots it has available for processing tasks.
● It starts and monitors the map and reduce tasks, and sends progress/status information back to the jobtracker daemon.
Map Reduce
Flow
Map Reduce
● The Mapper:
○ Each block is processed in isolation by a map task called a mapper
○ The map task runs on the node where the block is stored
○ Iterates over a large number of records
○ Extracts something of interest from each

● Shuffle and sort intermediate results

● The Reducer:
○ Consolidates results from different mappers
○ Aggregates intermediate results
○ Produces the final output (a complete word-count example follows)
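As a concrete example of this mapper/shuffle/reducer split, here is the classic word-count job written against the standard org.apache.hadoop.mapreduce API. It is a minimal sketch: input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs where the block lives; emits (word, 1) for each token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) after shuffle/sort; sums the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical run, with illustrative paths: hadoop jar wordcount.jar WordCount /input /output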
Map Reduce
Combined Hadoop architecture
Map Reduce
Good for ...

● Embarrassingly parallel algorithms
● Summing, grouping, filtering, joining
● Off-line batch jobs on massive data sets
● Analyzing an entire large dataset
Map Reduce
Not good for ...

● Jobs that need shared state/coordination
○ Tasks are shared-nothing
○ Shared state requires a scalable state store
● Iterative jobs (e.g., graph algorithms)
○ Each iteration must read/write data to disk
○ The I/O and latency cost of each iteration is high
● Low-latency jobs
● Jobs on small datasets
● Finding individual records
Map Reduce
Limitations

● Scalability
○ Maximum cluster size of roughly 4,500 nodes; roughly 40,000 concurrent tasks
○ Coarse synchronization in the JobTracker
● Availability
○ A JobTracker failure kills all queued and running jobs
● Hard partition of resources into map and reduce slots
○ Low resource utilization
● Lacks support for alternate paradigms and services
○ Iterative applications implemented using MapReduce are about 10x slower
Hadoop 1.x

In small clusters, the namenode and jobtracker daemons reside on the same node. However, in larger clusters, there are dedicated nodes for the namenode and jobtracker daemons.
Hadoop 2.x

HDFS2, YARN, and MapReduce are known as the “Three Pillars” of Hadoop 2.
Hadoop 2.x

The following are the four main improvements in Hadoop 2.x over Hadoop 1.x:

● HDFS Federation: horizontal scalability of the NameNode
● NameNode High Availability: the NameNode is no longer a single point of failure
● YARN: the ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and GIRAPH
● ResourceManager: splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons, a global ResourceManager and a per-application ApplicationMaster

There are additional features, such as the Capacity Scheduler (multi-tenancy support in Hadoop), data snapshots, support for Windows, and NFS access, enabling increased Hadoop adoption in industry to solve big data problems.
Limitation of Hadoop 1.x
No horizontal scalability of the NameNode

Hadoop 2.x
HDFS Federation

Limitation of Hadoop 1.x
NameNode – single point of failure

Hadoop 2.x
NameNode High Availability

Limitation of Hadoop 1.x
Overburdened JobTracker

Hadoop 2.x
YARN (Yet Another Resource Negotiator)

Limitation of Hadoop 1.x
Unutilized data in HDFS

Hadoop 2.x
Non-MapReduce big data applications

Limitation of Hadoop 1.x
No multi-tenancy support

Hadoop 2.x
Capacity Scheduler – multi-tenancy support
Ecosystem
Hadoop Ecosystem History
The Google Stack
Linked

Ecosystem
Install Cloudera Quickstart VM
VirtualBox
● Download the Cloudera Quickstart VM for VirtualBox:
https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
● Open Oracle VM VirtualBox. Click File → Import Appliance.
● Choose “cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf”.

Install Cloudera Quickstart VM
VMware
● Download the Cloudera Quickstart VM for VMware:
https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.13.0-0-vmware.zip
Projects

● Data classification
● Data anomaly detection
● Tourist behaviour analysis
● Credit scoring
● Price forecasting
● Streaming data
● Traffic data analysis
● Time series data
Q&A

Cảm ơn đã theo dõi


Chúng tôi hy vọng cùng nhau đi đến thành công.

96
Big Data
