Lecture 4 - Failure Detection and Membership
Computing
Fall 2022
Dr. Zeshan Iqbal
A Challenge
• You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!”
Failures are the Norm
… not the exception, in datacenters.
• When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month.
• When you have 12,000 servers in the DC, the MTTF of the next machine drops to about 7.2 hours! (A rough calculation is sketched after the list below.)
Two options for detecting these failures:
1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a distributed failure detector program that automatically detects failures and reports to your workstation.
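A back-of-the-envelope check of the numbers above, under an illustrative assumption: failures are independent and each machine has an MTTF of roughly 10 years (120 months), which is consistent with the 120-server figure.

```latex
% Assumption (not from the slides): per-machine MTTF of ~120 months, independent failures.
\[
\mathrm{MTTF}_{\mathrm{DC}} \approx \frac{\mathrm{MTTF}_{\mathrm{machine}}}{N}
\]
\[
N = 120:\ \frac{120\ \text{months}}{120} = 1\ \text{month},
\qquad
N = 12{,}000:\ \frac{120\ \text{months}}{12{,}000} = 0.01\ \text{month} \approx 7.2\ \text{hours}.
\]
```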
Target Settings
• Process ‘group’-based systems
– Clouds/Datacenters
– Replicated servers
– Distributed databases
Two sub-protocols
[Figure: each application process pi maintains a Group Membership List of the other processes pj in the group (1000’s of processes), communicating over the unreliable communication network.]
Group Membership Protocol
[Figure: the protocol’s components over the unreliable communication network:
I. pj crashes;
II. Failure Detector: some process pi finds out quickly;
III. Dissemination of the failure information to the rest of the group.
Fail-stop failures only.]
Next
• How do you design a group membership protocol?
I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception
• Frequency goes up linearly with size of datacenter
Distributed Failure Detectors: Properties
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
Completeness and accuracy are impossible to guarantee together in lossy networks [Chandra and Toueg]. If they were both achievable, we could solve consensus (but consensus is known to be unsolvable in asynchronous systems).
What Real Failure Detectors Prefer
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
– Time to first detection of a failure (i.e., time until some process detects the failure)
• Scale
– Equal Load on each member
– Network Message Load
Failure Detector Properties
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
These properties should hold in spite of arbitrary, simultaneous process failures.
Centralized Heartbeating
[Figure: every process pi periodically sends (pi, Heartbeat Seq. l++) to a single central process pj.]
• Heartbeats sent periodically
• If a heartbeat is not received from pi within the timeout, mark pi as failed
☹ The central process is a hotspot
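The timeout rule above is shared by the centralized, ring, and all-to-all heartbeating variants on these slides; a minimal receiver-side sketch in Python follows. Names such as HeartbeatTracker and the TIMEOUT value are illustrative assumptions, not from the lecture.

```python
import time

TIMEOUT = 3.0  # seconds without a heartbeat before declaring failure (illustrative value)

class HeartbeatTracker:
    """Receiver-side bookkeeping: remember the last heartbeat seen from each process."""

    def __init__(self):
        self.last_seen = {}  # process id -> local time of the last accepted heartbeat
        self.last_seq = {}   # process id -> highest heartbeat sequence number seen

    def on_heartbeat(self, pid, seq):
        # Accept only fresh heartbeats (sequence numbers increase monotonically).
        if seq > self.last_seq.get(pid, -1):
            self.last_seq[pid] = seq
            self.last_seen[pid] = time.time()

    def failed_processes(self):
        # Any process whose last heartbeat is older than TIMEOUT is marked as failed.
        now = time.time()
        return [pid for pid, t in self.last_seen.items() if now - t > TIMEOUT]
```

In centralized heartbeating only the central process runs this tracker (hence the hotspot); in all-to-all heartbeating every member runs it for every other member, which equalizes the load but lets a single lost heartbeat cause a false detection.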
Ring Heartbeating
[Figure: processes arranged in a virtual ring; each process pi periodically sends (pi, Heartbeat Seq. l++) to its neighbor(s) pj on the ring.]
☹ Unpredictable on simultaneous multiple failures
All-to-All Heartbeating
[Figure: each process pi periodically sends (pi, Heartbeat Seq. l++) to every other process pj.]
☺ Equal load per member
☹ Single heartbeat loss → false detection
Next
• How do we increase the robustness of all-to-all heartbeating?
Gossip-style Heartbeating
[Figure: each process pi maintains an array of (member, Heartbeat Seq. l) entries and periodically sends it to a randomly chosen subset of members.]
☺ Good accuracy properties
Gossip-Style Failure Detection
Each membership list entry is (Address, Heartbeat Counter, Time (local)).
Membership list at node 1:
  1  10120  66
  2  10103  62
  3  10098  63
  4  10111  65
Membership list at node 2, before receiving node 1's gossip:
  1  10118  64
  2  10110  64
  3  10090  58
  4  10111  65
Membership list at node 2, after merging node 1's list (local time 70):
  1  10120  70
  2  10110  64
  3  10098  70
  4  10111  65
Protocol:
• Nodes periodically gossip their membership list: pick random nodes, send them the list
• On receipt, the received list is merged with the local list: for each member, the higher heartbeat counter wins, and updated entries are stamped with the local time
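A minimal sketch of the gossip and merge steps in Python. The dictionary layout, the GOSSIP_FANOUT value, and the method names are illustrative assumptions; the merge rule itself (keep the higher heartbeat counter and stamp the entry with local time) is the one shown in the example lists above.

```python
import random
import time

GOSSIP_FANOUT = 2  # number of random peers gossiped to in each period (illustrative)

class GossipMembership:
    def __init__(self, my_addr, members):
        self.my_addr = my_addr
        now = time.time()
        # member address -> [heartbeat counter, local time of last update]
        self.table = {m: [0, now] for m in members}

    def tick(self):
        """Once per gossip period: bump own heartbeat, pick random peers, return (peers, counters)."""
        self.table[self.my_addr][0] += 1
        self.table[self.my_addr][1] = time.time()
        peers = [m for m in self.table if m != self.my_addr]
        targets = random.sample(peers, min(GOSSIP_FANOUT, len(peers)))
        return targets, {m: hb for m, (hb, _) in self.table.items()}

    def merge(self, received_counters):
        """Merge a received {member: heartbeat counter} map into the local list."""
        now = time.time()
        for member, hb in received_counters.items():
            if hb > self.table.get(member, [-1, now])[0]:
                self.table[member] = [hb, now]  # higher counter wins; stamp with local time
```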
Gossip-Style Failure Detection
• What if an entry pointing to a failed node is deleted right after Tfail (=24) seconds?
[Figure: membership lists at nodes 1 and 2; the current time at node 2 is 75, and member 3's entry (heartbeat counter 10098, last updated at local time 50) has not been updated for more than Tfail seconds.]
• If the entry were deleted as soon as it times out, a subsequent gossip from a node that has not yet timed out member 3 would re-insert the entry as if the member were alive; hence a timed-out entry is first marked as failed and only deleted after a further waiting period.
SWIM Failure Detector Protocol
[Figure: during each protocol period of T' time units, process pi pings a random member pj; if no ack arrives, pi sends ping-req(pj) to K randomly selected processes, which ping pj on pi's behalf and relay any ack back to pi.]
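A minimal, single-threaded sketch of one SWIM protocol period. The helpers send_ping(), send_ping_req(), and wait_for_ack() are hypothetical stand-ins for the network layer, and the timeout values are illustrative; only the ping / ping-req(K) structure comes from the slide.

```python
import random

K = 3                 # number of helpers for indirect pinging (illustrative)
PING_TIMEOUT = 0.5    # seconds to wait for a direct ack (illustrative)
PERIOD_TIMEOUT = 2.0  # remainder of the protocol period T' (illustrative)

def protocol_period(pi, members, send_ping, send_ping_req, wait_for_ack):
    """One SWIM protocol period run by process pi. Returns the pinged member and a verdict."""
    # 1. Pick a random ping target pj (not ourselves).
    pj = random.choice([m for m in members if m != pi])

    # 2. Direct ping.
    send_ping(pi, pj)
    if wait_for_ack(pj, timeout=PING_TIMEOUT):
        return pj, "alive"

    # 3. No direct ack: ask K random other members to ping pj on our behalf.
    helpers = random.sample([m for m in members if m not in (pi, pj)],
                            k=min(K, len(members) - 2))
    for h in helpers:
        send_ping_req(pi, h, target=pj)

    # 4. If any helper relays an ack before the period ends, pj is alive;
    #    otherwise pi declares pj failed (or suspected; see the suspicion mechanism later).
    if wait_for_ack(pj, timeout=PERIOD_TIMEOUT):
        return pj, "alive"
    return pj, "failed"
```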
Time-bounded Completeness
• Key: select each membership element once as a ping target in a traversal
– Round-robin pinging
– Random permutation of the list after each traversal
• Each failure is detected in worst case 2N-1 (local) protocol periods
• Preserves FD properties
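A small sketch of the target-selection rule above: traverse the membership list round-robin and re-permute it randomly after each full traversal. The generator form is an illustrative choice, not from the lecture.

```python
import random

def ping_targets(membership_list):
    """Yield ping targets round-robin, reshuffling the order after each full traversal."""
    order = list(membership_list)
    while True:
        random.shuffle(order)    # new random permutation for this traversal
        for member in order:
            yield member         # every member is selected exactly once per traversal
```

Because every member is selected exactly once per traversal, a crashed process is pinged within at most two traversals (at worst it was just visited when it crashed and lands last in the next permutation), which is where the worst-case bound of 2N-1 protocol periods comes from.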
Next
• How do failure detectors fit into the big picture of a group membership protocol?
• What are the missing blocks?
III. Dissemination
[Figure: the group membership protocol components again, over the unreliable communication network (fail-stop failures only), now asking HOW component III, dissemination, is achieved.]
Dissemination Options
• Multicast (Hardware / IP)
– unreliable
– multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
– expensive
• Zero extra messages: Piggyback on Failure Detector messages
– Infection-style Dissemination
Infection-style Dissemination
[Figure: the same SWIM ping / ack / ping-req(K) exchanges as before (protocol period = T time units), but with membership information piggybacked on each message.]
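A sketch of the piggybacking idea, assuming a hypothetical bounded buffer of recent membership updates; the fixed RETRANSMIT_BUDGET and MAX_PIGGYBACK values are illustrative simplifications, not from the lecture.

```python
from collections import deque

MAX_PIGGYBACK = 6       # max updates attached to one message (illustrative)
RETRANSMIT_BUDGET = 10  # how many times each update is piggybacked before being dropped (illustrative)

class DisseminationBuffer:
    """Buffer of recent membership updates to piggyback on failure-detector messages."""

    def __init__(self):
        self.updates = deque()  # each item: [update, remaining piggyback count]

    def add(self, update):
        # e.g. update = ("failed", "pj") or ("joined", "pk")
        self.updates.append([update, RETRANSMIT_BUDGET])

    def piggyback(self):
        """Return up to MAX_PIGGYBACK updates to attach to the next ping/ping-req/ack."""
        chosen = []
        for item in list(self.updates)[:MAX_PIGGYBACK]:
            chosen.append(item[0])
            item[1] -= 1
            if item[1] == 0:
                self.updates.remove(item)
        return chosen
```

Every outgoing ping, ping-req, and ack carries the result of piggyback(); receivers apply the updates to their membership lists, so failure information spreads infection-style with zero extra messages.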
Suspicion Mechanism
• False detections, due to
– Perturbed processes
– Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• Key: suspect a process before declaring it as
failed in the group
Suspicion Mechanism
[Figure: the suspicion mechanism at process pi.]
Suspicion Mechanism
• Distinguish multiple suspicions of a process
– Per-process incarnation number
– Inc # for pi can be incremented only by pi
• e.g., when it receives a (Suspect, pi) message
– Somewhat similar to DSDV (routing protocol in ad-hoc networks)
• Higher inc # notifications override lower inc #'s
• Within an inc #: (Suspect, inc #) > (Alive, inc #)
• (Failed, inc #) overrides everything else
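A sketch of the override rules just listed, using an illustrative numeric ranking of message kinds; only the ordering rules themselves come from the slide.

```python
# Precedence of message kinds *within the same incarnation number*:
# Suspect overrides Alive; Failed overrides everything regardless of inc #.
PRECEDENCE = {"Alive": 0, "Suspect": 1}

def overrides(new_msg, old_msg):
    """True if new_msg = (kind, inc #) should override old_msg for some member pi."""
    new_kind, new_inc = new_msg
    old_kind, old_inc = old_msg
    if new_kind == "Failed":            # (Failed, inc #) overrides everything else
        return True
    if old_kind == "Failed":
        return False
    if new_inc != old_inc:              # higher inc # notifications override lower inc #'s
        return new_inc > old_inc
    return PRECEDENCE[new_kind] > PRECEDENCE[old_kind]   # within an inc #: Suspect > Alive
```

For example, overrides(("Suspect", 5), ("Alive", 5)) is True, and overrides(("Alive", 6), ("Suspect", 5)) is also True, because pi refutes a suspicion by incrementing its own incarnation number.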
Wrap Up
• Failures are the norm, not the exception, in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service