T5 Failure Detectors
Talo Harrison
Lecture 5: Failure Detection and Membership, Grids
A Challenge
• You’ve been put in charge of a datacenter, and your
manager has told you, “Oh no! We don’t have any failures
in our datacenter!”
When you have 120 servers in the DC, the mean time to the next machine
failure (MTTF) is about 1 month.
When you have 12,000 servers in the DC, a machine fails about once every
7.2 hours! (A quick check of these numbers is sketched below.)
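A quick back-of-the-envelope check of those figures (a sketch; the ~10-year per-machine MTTF is an assumption chosen to be consistent with the numbers above):

```python
# Sketch: with N machines, the expected time to the *next* failure anywhere
# in the DC is roughly (per-machine MTTF) / N.
per_machine_mttf_months = 120          # assumption: ~10 years per machine

for n_servers in (120, 12_000):
    cluster_mttf_hours = (per_machine_mttf_months / n_servers) * 30 * 24
    print(f"{n_servers} servers -> a failure about every {cluster_mttf_hours:.1f} hours")

# 120 servers    -> ~720 hours (about 1 month)
# 12,000 servers -> ~7.2 hours
```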
1. Hire 1,000 people, each to monitor one machine in the datacenter and
report to you when it fails.
2. Write a distributed failure detector program that automatically detects
failures and reports to your workstation.
Target Settings
• Process ‘group’-based systems
– Clouds/Datacenters
– Replicated servers
– Distributed databases
• 1000s of processes
• Unreliable communication network
Group Membership Protocol
[Figure: three components, operating over an unreliable communication network]
I. pj crashes
II. Failure Detector: some process pi finds out quickly that pj crashed
III. Dissemination: the information spreads to the rest of the group
Assumption: fail-stop failures only
Next
• How do you design a group membership
protocol?
I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception
• Frequency goes up linearly with the size of the datacenter
II. Distributed Failure Detectors: Desirable Properties
• Completeness = each failure is detected
• Accuracy = there is no mistaken detection
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
Distributed Failure Detectors: Properties
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
Completeness and accuracy are impossible to guarantee together in lossy
networks [Chandra and Toueg]. If both were possible, we could solve consensus
(but consensus is known to be unsolvable in asynchronous systems).
What Real Failure Detectors Prefer
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
– Time to first detection of a failure, i.e., time until some non-faulty
process detects the failure
• Scale
– Equal Load on each member (no bottlenecks/single point of failure)
– Network Message Load
Failure Detector Properties
These properties must hold in spite of arbitrary simultaneous process failures:
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
Centralized Heartbeating
[Figure: every member pi sends <pi, heartbeat seq. ++> to a single central member pj]
• Heartbeats sent periodically
• If a heartbeat is not received from pi within a timeout, pj marks pi as failed
• Drawback: the central member pj is a hotspot
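A minimal sketch of the central monitor's bookkeeping (the timeout value and function names are illustrative, not from the slides):

```python
import time

TIMEOUT = 3.0          # illustrative: seconds without a heartbeat before declaring failure
last_heartbeat = {}    # member id -> (highest heartbeat seq seen, local receipt time)

def on_heartbeat(member_id, seq):
    """Called when a heartbeat <member_id, seq> arrives at the central monitor."""
    prev_seq, _ = last_heartbeat.get(member_id, (-1, 0.0))
    if seq > prev_seq:                        # ignore stale or duplicated heartbeats
        last_heartbeat[member_id] = (seq, time.time())

def check_failures():
    """Called periodically; returns members whose heartbeats have timed out."""
    now = time.time()
    return [m for m, (_, t) in last_heartbeat.items() if now - t > TIMEOUT]
```

Every heartbeat converges on the one monitor, which is exactly the hotspot noted above: its message load grows linearly with group size.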
Ring Heartbeating
[Figure: members arranged in a virtual ring; each pi sends <pi, heartbeat seq. ++> to its ring neighbor(s)]
• Drawback: unpredictable under simultaneous multiple failures
All-to-All Heartbeating
[Figure: each member pi sends <pi, heartbeat seq. ++> to every other member]
• Equal load per member
• Drawback: a single heartbeat loss causes a false detection
Next
• How do we increase the robustness of all-to-all
heartbeating?
Gossip-style Heartbeating
[Figure: each member pi periodically sends its array of heartbeat seq. numbers to a subset of members]
• Good accuracy properties
Gossip-Style Failure Detection
[Figure: nodes 1–4, each holding a membership list with one row per member:
(address, heartbeat counter, local time)]
Protocol:
• Nodes periodically gossip their membership lists
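A sketch of the per-node table and merge rule suggested by the figure, assuming the usual gossip-style rule of keeping the higher heartbeat counter per member (Tfail and Tcleanup appear in the discussion below; their values here are illustrative):

```python
import time

T_FAIL = 10.0      # illustrative: seconds without a counter increase before marking failed
T_CLEANUP = 20.0   # illustrative: additional wait before deleting the entry

membership = {}    # address -> [heartbeat counter, local time of last counter increase]

def merge(received):
    """Merge a gossiped membership list, keeping the higher counter per member."""
    now = time.time()
    for addr, counter in received.items():
        entry = membership.get(addr)
        if entry is None or counter > entry[0]:
            membership[addr] = [counter, now]   # counter increased: refresh local time

def detect_and_cleanup():
    """Mark members failed / delete them based on how stale their counters are."""
    now = time.time()
    failed = [a for a, (_, t) in membership.items() if now - t > T_FAIL]
    for a, (_, t) in list(membership.items()):
        if now - t > T_FAIL + T_CLEANUP:
            del membership[a]                   # garbage-collect long-dead entries
    return failed
```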
Analysis/Discussion
• Well-known result: a gossip takes O(log(N)) time to propagate.
• So: Given sufficient bandwidth, a single heartbeat takes O(log(N)) time to
propagate.
• So: N heartbeats take:
– O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
– O(N log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
– What about O(k) bandwidth?
• What happens if the gossip period Tgossip is decreased?
• What happens to Pmistake (the false positive rate) as Tfail and Tcleanup are increased?
• Tradeoff: false positive rate vs. detection time vs. bandwidth
Next
• So, is this the best we can do? What is the best
we can do?
Failure Detector Properties Are Application-Defined Requirements
• Completeness: guarantee always
• Accuracy: probability PM(T) of a mistake
• Speed
– Time to first detection of a failure: T time units
• Scale
– Equal Load on each member
– Network Message Load: compare N*L across protocols
All-to-All Heartbeating
• Each member sends <pi, heartbeat seq. ++> to all others every T time units
• Load per member: L = N/T
Gossip-style Heartbeating
• Every tg time units (the gossip period), each member sends its O(N)-size
array of heartbeat seq. numbers to a member subset
• Detection time: T = log(N) * tg
• Load per member: L = N/tg = N*log(N)/T
What’s the Best/Optimal We Can Do?
• Worst case load L* per member in the group (messages per second)
– as a function of T, PM(T), N
– Independent message loss probability pml
• L* = [log(PM(T)) / log(pml)] * (1/T)
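For intuition, a sketch that plugs illustrative numbers into the bound and compares it with the N/T load of all-to-all heartbeating (every parameter value below is assumed, not from the slides):

```python
import math

N = 1000        # group size (illustrative)
T = 10.0        # required detection time in seconds (illustrative)
PM_T = 1e-6     # required probability of mistake within T (illustrative)
p_ml = 0.05     # independent message loss probability (illustrative)

L_star = (math.log(PM_T) / math.log(p_ml)) / T   # optimal worst-case load, msgs/s/member
L_all_to_all = N / T                             # all-to-all heartbeating load

print(f"L*          ~ {L_star:.2f} msgs/s per member (independent of N)")
print(f"all-to-all  ~ {L_all_to_all:.0f} msgs/s per member (grows with N)")
```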
Heartbeating
• Optimal L is independent of N (!)
• All-to-all and gossip-based heartbeating are sub-optimal:
– L = O(N/T)
– they try to achieve simultaneous detection at all processes
– they fail to distinguish the Failure Detection and Dissemination components
• Can we reach this bound?
• Key:
– Separate the two components
– Use a non-heartbeat-based Failure Detection component
Next
• Is there a better failure detector?
SWIM Failure Detector Protocol
[Figure: one protocol period (= T’ time units) at process pi]
• pi picks a random member pj and sends it a ping; pj replies with an ack
• If no ack arrives, pi picks K random processes and sends each a ping-req
targeting pj; they ping pj and relay pj’s ack back to pi
• If pi still has no ack for pj by the end of the protocol period, it reports pj as failed
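A sketch of one protocol period at pi; the network helpers are stubs, since the slides do not specify a message-passing layer:

```python
import random

# Stub network layer (assumed; a real implementation would send/receive UDP messages).
def send_ping(target): pass
def send_ping_req(helper, target): pass
def wait_for_ack(target, timeout): return False   # stub: pretend no ack arrived

def protocol_period(members, K, ping_timeout, period):
    """One SWIM protocol period at pi; returns the member to report failed, or None."""
    pj = random.choice(members)                    # pick one random target
    send_ping(pj)
    if wait_for_ack(pj, timeout=ping_timeout):     # direct ack received
        return None
    # No direct ack: ask K other random members to ping pj on pi's behalf.
    others = [m for m in members if m != pj]
    for helper in random.sample(others, k=min(K, len(others))):
        send_ping_req(helper, target=pj)
    if wait_for_ack(pj, timeout=period - ping_timeout):   # any indirect ack?
        return None
    return pj                                      # still no ack by end of period
```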
Detection Time
• Prob. of being pinged in T’ = 1 − (1 − 1/N)^(N−1) → 1 − e^(−1)
• E[T] = T’ · e/(e − 1)
• Completeness: Any alive member detects failure
– Eventually
– By using a trick: within worst case O(N) protocol periods
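Plugging in numbers (a quick sketch; N and T’ are illustrative):

```python
import math

N = 1000         # group size (illustrative)
T_prime = 1.0    # protocol period in seconds (illustrative)

p_pinged = 1 - (1 - 1/N) ** (N - 1)      # prob. a given member is pinged in one period
E_T = T_prime * math.e / (math.e - 1)    # expected detection time

print(f"P(pinged in one period) ~ {p_pinged:.3f}  (limit 1 - 1/e ~ {1 - 1/math.e:.3f})")
print(f"E[detection time]       ~ {E_T:.2f} s  (e/(e-1) ~ 1.58 protocol periods)")
```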
Accuracy, Load
• L/L* ≤ 28, E[L]/L* ≤ 8, for up to 15% message loss rates
SWIM Failure Detector
[Table: SWIM vs. heartbeating, parameter by parameter; SWIM’s first detection
time is constant, independent of group size]
Group Membership Protocol (recap)
[Figure: the same three components over an unreliable communication network]
I. pj crashes
II. Failure Detector: some process pi finds out quickly that pj crashed
III. Dissemination: how does the information reach the rest of the group?
Assumption: fail-stop failures only
Dissemination Options
• Multicast (Hardware / IP)
– unreliable
– multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
– expensive
• Zero extra messages: piggyback on Failure Detector messages
– Infection-style Dissemination
Infection-style Dissemination
[Figure: the same SWIM ping / ping-req / ack exchange (protocol period = T time
units), with membership information piggybacked on every message]
Infection-style Dissemination
• Epidemic/gossip-style dissemination
– After λ·log(N) protocol periods, only N^(-(2λ-2)) processes would not
have heard about an update
• Maintain a buffer of recently joined/evicted processes
– Piggyback from this buffer
– Prefer recent updates
• Buffer elements are garbage collected after a while
– After λ·log(N) protocol periods, i.e., once they’ve propagated through
the system; this defines weak consistency
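A sketch of the piggyback buffer described above; the buffer structure and the rule of counting how many times an update has been piggybacked are assumptions consistent with "prefer recent updates" and "garbage collect after λ·log(N) protocol periods":

```python
import math

LAMBDA = 3   # illustrative constant for the lambda*log(N) garbage-collection rule

class UpdateBuffer:
    """Buffer of recent membership updates, piggybacked on ping/ping-req/ack messages."""
    def __init__(self, n_members):
        self.max_count = int(LAMBDA * math.log(max(n_members, 2)))
        self.updates = []                       # [update, times piggybacked so far]

    def add(self, update):
        self.updates.append([update, 0])        # a recently joined/evicted process

    def piggyback(self, max_items):
        """Choose updates to attach to the next message, preferring recent ones."""
        self.updates.sort(key=lambda u: u[1])   # least-sent (i.e., most recent) first
        chosen = self.updates[:max_items]
        for u in chosen:
            u[1] += 1
        # Garbage-collect updates piggybacked ~lambda*log(N) times already.
        self.updates = [u for u in self.updates if u[1] < self.max_count]
        return [u[0] for u in chosen]
```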
Suspicion Mechanism
• False detections, due to
– Perturbed processes
– Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• Key: suspect a process before declaring it as
failed in the group
[Figure: the suspicion mechanism at pi]
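A sketch of the per-member state this mechanism maintains (alive, suspected, failed); the transition triggers and timeout value are assumptions based on "suspect before declaring failed":

```python
import time

SUSPECT_TIMEOUT = 5.0   # illustrative: how long a member may stay suspected

state = {}              # member id -> ("alive" | "suspect" | "failed", since)

def on_no_ack(member):
    """Direct and indirect pings both failed: suspect, but do not declare failure yet."""
    if state.get(member, ("alive", 0.0))[0] == "alive":
        state[member] = ("suspect", time.time())

def on_ack_or_refutation(member):
    """The member answered (or refuted the suspicion): move it back to alive."""
    if state.get(member, ("alive", 0.0))[0] != "failed":
        state[member] = ("alive", time.time())

def expire_suspicions():
    """Suspected members that never refuted within the timeout are declared failed."""
    now = time.time()
    for m, (s, since) in state.items():
        if s == "suspect" and now - since > SUSPECT_TIMEOUT:
            state[m] = ("failed", now)
```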
Wrap Up
• Failures the norm, not the exception in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service
“A Cloudy History of Time”
[Figure: timeline, 1940–2012]
• The first datacenters!
• Timesharing companies & data processing industry
• PCs (not distributed!)
• Clusters
• Grids
• Peer-to-peer systems
• Clouds and datacenters
“A Cloudy History of Time”
[Figure: annotated timeline, 1940–2012]
• First large datacenters: ENIAC, ORDVAC, ILLIAC
– Many used vacuum tubes and mechanical relays
• Data Processing Industry: 1968: $70 M; 1978: $3.15 Billion
• Timesharing Industry (1975):
– Market share: Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10%
– Systems: Honeywell 6000 & 635, IBM 370/168, Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108
• Berkeley NOW Project; Supercomputers; Server Farms (e.g., Oceano)
• P2P Systems (90s–00s)
– Many millions of users
– Many GB per day
• Grids (1980s–2000s):
– GriPhyN (1970s–80s)
– Open Science Grid and Lambda Rail (2000s)
– Globus & other standards (1990s–2000s)
• Clouds
Example: Rapid Atmospheric Modeling
System, ColoState U
• Hurricane Georges, 17 days in Sept 1998
– “RAMS modeled the mesoscale convective complex that dropped
so much rain, in good agreement with recorded data”
– Used 5 km spacing instead of the usual 10 km
– Ran on 256+ processors
• Computation-intensive computing (or HPC = high performance computing)
• Can one run such a program without access to a
supercomputer?
Distributed Computing Resources
[Figure: geographically distributed sites, e.g., Wisconsin, MIT, NCSA]
An Application Coded by a Physicist
[Figure: a workflow of four jobs running across sites such as MIT and NCSA]
• Output files of Job 0 are input to Job 2
• Jobs 1 and 2 can be concurrent
• Job 3 runs after them
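A sketch of that dependency structure as a tiny topological job runner; the Job 0 → Job 1 and Job 3 edges are assumptions, since the slide only states that Job 0's output feeds Job 2, that Jobs 1 and 2 can run concurrently, and that Job 3 comes last:

```python
# Hypothetical job DAG; in a real grid this is what a workflow manager
# (e.g., HTCondor DAGMan) would drive across sites.
deps = {
    "job0": [],
    "job1": ["job0"],          # assumed edge
    "job2": ["job0"],          # stated: Job 0's output files are Job 2's input
    "job3": ["job1", "job2"],  # assumed edge
}

def submit(job):
    print(f"submitting {job}")  # placeholder: a real system submits to a site scheduler

done = set()
while len(done) < len(deps):
    ready = [j for j, d in deps.items() if j not in done and all(x in done for x in d)]
    for j in ready:             # everything in `ready` could run concurrently
        submit(j)
        done.add(j)
```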
2-level Scheduling Infrastructure
[Figure: Jobs 0–3 submitted at Wisconsin; the HTCondor protocol handles
scheduling within a site, while the Globus protocol moves jobs between sites
(e.g., Job 0 sent on to MIT or NCSA)]
Inter-site Protocol
[Figure: jobs such as Job 0 and Job 3 transferred between sites (e.g.,
Wisconsin), each site with its own internal structure]
Globus Toolkit
• Open-source
• Consists of several components
– GridFTP: Wide-area transfer of bulk data
– GRAM5 (Grid Resource Allocation Manager): submit, locate, cancel, and
manage jobs
• GRAM5 itself is not a scheduler
• Globus communicates with intra-site schedulers such as HTCondor or
Portable Batch System (PBS)
– RLS (Replica Location Service): Naming service that translates from a
file/dir name to a target location (or another file/dir name)
– Libraries like XIO to provide a standard API for all Grid IO functionalities
– Grid Security Infrastructure (GSI)
Security Issues
• Important in Grids because they are federated, i.e., no single entity controls the entire
infrastructure
• Single sign-on: collective job set should require once-only user authentication
• Mapping to local security mechanisms: some sites use Kerberos, others use Unix
• Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to
job 1
• Community authorization: e.g., third-party authentication
• These are also important in clouds, but less so because clouds are typically
run under central control
• In clouds the focus is instead on failures, scale, and the on-demand nature
of resources
Summary
• Grid computing focuses on computation-intensive computing
(HPC)
• Though grids are often federated, their architecture and key concepts have a
lot in common with those of clouds
• Are Grids/HPC converging towards clouds?
– E.g., Compare OpenStack and Globus