Notes
James Aspnes
2016-01-30 23:20
Contents
Table of contents i
Preface xviii
Syllabus xix
1 Introduction 1
1.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
I Message passing 7
2 Model 8
2.1 Basic message-passing model . . . . . . . . . . . . . . . . . . 8
2.1.1 Formal details . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Network structure . . . . . . . . . . . . . . . . . . . . 10
2.2 Asynchronous systems . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Example: client-server computing . . . . . . . . . . . . 11
2.3 Synchronous systems . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Drawing message-passing executions . . . . . . . . . . . . . . 12
2.5 Complexity measures . . . . . . . . . . . . . . . . . . . . . . . 14
3 Coordinated attack 16
3.1 Formal description . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Impossibility proof . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Randomized coordinated attack . . . . . . . . . . . . . . . . . 18
3.3.1 An algorithm . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.2 Why it works . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.3 Almost-matching lower bound . . . . . . . . . . . . . . 21
6 Leader election 35
6.1 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Leader election in rings . . . . . . . . . . . . . . . . . . . . . 37
6.2.1 The Le-Lann-Chang-Roberts algorithm . . . . . . . . 37
6.2.1.1 Proof of correctness for synchronous executions 38
6.2.1.2 Performance . . . . . . . . . . . . . . . . . . 38
6.2.2 The Hirschberg-Sinclair algorithm . . . . . . . . . . . 39
6.2.3 Peterson’s algorithm for the unidirectional ring . . . . 39
6.2.4 A simple randomized O(n log n)-message algorithm . . 42
6.3 Leader election in general networks . . . . . . . . . . . . . . . 42
6.4 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.4.1 Lower bound on asynchronous message complexity . . 43
6.4.2 Lower bound for comparison-based algorithms . . . . 44
7 Logical clocks 47
7.1 Causal ordering . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2.1 Lamport clock . . . . . . . . . . . . . . . . . . . . . . 49
7.2.2 Neiger-Toueg-Welch clock . . . . . . . . . . . . . . . . 50
8 Synchronizers 55
8.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.2.1 The alpha synchronizer . . . . . . . . . . . . . . . . . 57
8.2.2 The beta synchronizer . . . . . . . . . . . . . . . . . . 57
8.2.3 The gamma synchronizer . . . . . . . . . . . . . . . . 58
8.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.4 Limitations of synchronizers . . . . . . . . . . . . . . . . . . . 59
8.4.1 Impossibility with crash failures . . . . . . . . . . . . 59
8.4.2 Unavoidable slowdown with global synchronization . . 60
9 Synchronous agreement 62
9.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . 62
9.2 Lower bound on rounds . . . . . . . . . . . . . . . . . . . . . 63
9.3 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.3.1 Flooding . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.4 Exponential information gathering . . . . . . . . . . . . . . . 66
9.4.1 Basic invariants . . . . . . . . . . . . . . . . . . . . . . 67
9.4.2 Stronger facts . . . . . . . . . . . . . . . . . . . . . . . 68
9.4.3 The payoff . . . . . . . . . . . . . . . . . . . . . . . . 68
9.4.4 The real payoff . . . . . . . . . . . . . . . . . . . . . . 68
9.5 Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
10 Byzantine agreement 69
10.1 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
10.1.1 Minimum number of rounds . . . . . . . . . . . . . . . 69
10.1.2 Minimum number of processes . . . . . . . . . . . . . 69
10.1.3 Minimum connectivity . . . . . . . . . . . . . . . . . . 71
10.1.4 Weak Byzantine agreement . . . . . . . . . . . . . . . 72
10.2 Upper bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
10.2.1 Exponential information gathering gets n = 3f + 1 . . 74
10.2.1.1 Proof of correctness . . . . . . . . . . . . . . 74
10.2.2 Phase king gets constant-size messages . . . . . . . . . 76
10.2.2.1 The algorithm . . . . . . . . . . . . . . . . . 76
12 Paxos 85
12.1 Motivation: replicated state machines . . . . . . . . . . . . . 85
12.2 The Paxos algorithm . . . . . . . . . . . . . . . . . . . . . . . 86
12.3 Informal analysis: how information flows between rounds . . 88
12.4 Safety properties . . . . . . . . . . . . . . . . . . . . . . . . . 88
12.5 Learning the results . . . . . . . . . . . . . . . . . . . . . . . 90
12.6 Liveness properties . . . . . . . . . . . . . . . . . . . . . . . . 90
13 Failure detectors 92
13.1 How to build a failure detector . . . . . . . . . . . . . . . . . 93
13.2 Classification of failure detectors . . . . . . . . . . . . . . . . 93
13.2.1 Degrees of completeness . . . . . . . . . . . . . . . . . 93
13.2.2 Degrees of accuracy . . . . . . . . . . . . . . . . . . . 93
13.2.3 Boosting completeness . . . . . . . . . . . . . . . . . . 94
13.2.4 Failure detector classes . . . . . . . . . . . . . . . . . . 95
13.3 Consensus with S . . . . . . . . . . . . . . . . . . . . . . . . . 96
13.3.1 Proof of correctness . . . . . . . . . . . . . . . . . . . 97
13.4 Consensus with ♦S and f < n/2 . . . . . . . . . . . . . . . . 98
13.4.1 Proof of correctness . . . . . . . . . . . . . . . . . . . 100
13.5 f < n/2 is still required even with ♦P . . . . . . . . . . . . . 101
13.6 Relationships among the classes . . . . . . . . . . . . . . . . . 102
15 Model 112
15.1 Atomic registers . . . . . . . . . . . . . . . . . . . . . . . . . 112
15.2 Single-writer versus multi-writer registers . . . . . . . . . . . 113
15.3 Fairness and crashes . . . . . . . . . . . . . . . . . . . . . . . 114
15.4 Concurrent executions . . . . . . . . . . . . . . . . . . . . . . 114
15.5 Consistency properties . . . . . . . . . . . . . . . . . . . . . . 115
15.6 Complexity measures . . . . . . . . . . . . . . . . . . . . . . . 116
15.7 Fancier registers . . . . . . . . . . . . . . . . . . . . . . . . . 117
22 Common2 190
22.1 Test-and-set and swap for two processes . . . . . . . . . . . . 191
22.2 Building n-process TAS from 2-process TAS . . . . . . . . . . 192
22.3 Single-use swap objects . . . . . . . . . . . . . . . . . . . . . 192
24 Renaming 211
24.1 Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
24.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
24.3 Order-preserving renaming . . . . . . . . . . . . . . . . . . . 213
24.4 Deterministic renaming . . . . . . . . . . . . . . . . . . . . . 213
24.4.1 Wait-free renaming with 2n − 1 names . . . . . . . . . 214
24.4.2 Long-lived renaming . . . . . . . . . . . . . . . . . . . 215
24.4.3 Renaming without snapshots . . . . . . . . . . . . . . 216
24.4.3.1 Splitters . . . . . . . . . . . . . . . . . . . . . 216
24.4.3.2 Splitters in a grid . . . . . . . . . . . . . . . 217
24.4.4 Getting to 2n − 1 names in polynomial space . . . . . 219
24.4.5 Renaming with test-and-set . . . . . . . . . . . . . . . 220
24.5 Randomized renaming . . . . . . . . . . . . . . . . . . . . . . 220
24.5.1 Randomized splitters . . . . . . . . . . . . . . . . . . . 221
24.5.2 Randomized test-and-set plus sampling . . . . . . . . 221
24.5.3 Renaming with sorting networks . . . . . . . . . . . . 222
26 Obstruction-freedom 233
26.1 Why build obstruction-free algorithms? . . . . . . . . . . . . 234
26.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
26.2.1 Lock-free implementations . . . . . . . . . . . . . . . . 234
26.2.2 Double-collect snapshots . . . . . . . . . . . . . . . . . 234
26.2.3 Software transactional memory . . . . . . . . . . . . . 235
26.2.4 Obstruction-free test-and-set . . . . . . . . . . . . . . 235
26.2.5 An obstruction-free deque . . . . . . . . . . . . . . . . 237
26.3 Boosting obstruction-freedom to wait-freedom . . . . . . . . . 239
26.3.1 Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
26.4 Lower bounds for lock-free protocols . . . . . . . . . . . . . . 244
26.4.1 Contention . . . . . . . . . . . . . . . . . . . . . . . . 244
26.4.2 The class G . . . . . . . . . . . . . . . . . . . . . . . . 245
26.4.3 The lower bound proof . . . . . . . . . . . . . . . . . . 247
26.4.4 Consequences . . . . . . . . . . . . . . . . . . . . . . . 251
26.4.5 More lower bounds . . . . . . . . . . . . . . . . . . . . 251
26.5 Practical considerations . . . . . . . . . . . . . . . . . . . . . 252
27 BG simulation 253
27.1 Safe agreement . . . . . . . . . . . . . . . . . . . . . . . . . . 253
27.2 The basic simulation algorithm . . . . . . . . . . . . . . . . . 255
27.3 Effect of failures . . . . . . . . . . . . . . . . . . . . . . . . . 256
27.4 Inputs and outputs . . . . . . . . . . . . . . . . . . . . . . . . 256
27.5 Correctness of the simulation . . . . . . . . . . . . . . . . . . 257
30 Self-stabilization 284
34 Self-assembly 288
Appendix 290
A Assignments 290
A.1 Assignment 1: due Wednesday, 2016-02-17, at 5:00pm . . . . 290
A.2 Assignment 2: due Wednesday, 2016-03-09, at 5:00pm . . . . 290
A.3 Assignment 3: due Wednesday, 2016-04-20, at 5:00pm . . . . 291
Bibliography 347
Index 364
List of Figures
List of Tables
List of Algorithms
These are notes for the Spring 2016 semester version of the Yale course CPSC
465/565 Theory of Distributed Systems. This document also incorporates
the lecture schedule and assignments, as well as some sample assignments
from previous semesters. Because this is a work in progress, it will be
updated frequently over the course of the semester.
Notes from Fall 2011 can be found at http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2011.pdf.
Notes from Spring 2014 can be found at http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2014.pdf.
Notes from earlier semesters can be found at http://pine.cs.yale.edu/pinewiki/465/.
Much of the structure of the course follows the textbook, Attiya and
Welch’s Distributed Computing [AW04], with some topics based on Lynch’s
Distributed Algorithms [Lyn96] and additional readings from the research
literature. In most cases you’ll find these materials contain much more
detail than what is presented here, so it is better to consider this document
a supplement to them than to treat it as your primary source of information.
Acknowledgments
Many parts of these notes were improved by feedback from students taking
various versions of this course. I’d like to thank Mike Marmar and Hao Pan
in particular for suggesting improvements to some of the posted solutions.
I’d also like to apologize to the many other students who should be thanked
here but whose names I didn’t keep track of in the past.
Syllabus
Description
Models of asynchronous distributed computing systems. Fundamental con-
cepts of concurrency and synchronization, communication, reliability, topo-
logical and geometric constraints, time and space complexity, and distributed
algorithms.
Meeting times
Lectures are MW 2:30–3:45 in AKW 200.
Staff
The instructor for the course is James Aspnes. Office: AKW 401. Email:
james.aspnes@gmail.com. URL: http://www.cs.yale.edu/homes/aspnes/.
Office hours can be found in the course calendar at Google Calendar,
which can also be reached through James Aspnes’s web page.
Textbook
Hagit Attiya and Jennifer Welch, Distributed Computing: Fundamentals,
Simulations, and Advanced Topics, second edition. Wiley, 2004. QA76.9.D5
A75X 2004 (LC). ISBN 0471453242.
On-line version: http://dx.doi.org/10.1002/0471478210. (This may
not work outside Yale.)
Errata: http://www.cs.technion.ac.il/~hagit/DC/2nd-errata.html.
Course requirements
If you are taking this as CPSC 465: Three homework assignments (60% of
the semester grade) plus a final exam (40%).
If you are taking this as CPSC 565: Three homework assignments (45%
of the semester grade), a presentation (15%) and a final exam (40%).
Each presentation will be a short description of the main results in a
relevant paper chosen in consultation with the instructor, and will be done
in front of the class during one of the last few lecture slots. If numbers and
time permit, it may be possible to negotiate doing a presentation even if
you are taking this as CPSC 465.
Late assignments
Late assignments will not be accepted without a Dean’s Excuse.
As always, the future is uncertain, so you should take parts of the schedule
that haven’t happened yet with a grain of salt. Readings refer to chapters
or sections in the course notes, except for those specified as in AW, which
refer to the course textbook Attiya and Welch [AW04].
LECTURE SCHEDULE xxiii
Introduction
1.1 Models
The global state consisting of all process states is called a configuration,
and we think of the system as a whole as passing from one global state
or configuration to another in response to each event. When this occurs
the processes participating in the event update their states, and the other
processes do nothing. This does not model concurrency directly; instead,
we interleave potentially concurrent events in some arbitrary way. The ad-
vantage of this interleaving approach is that it gives us essentially the same
behavior as we would get if we modeled simultaneous events explicitly, but
still allows us to consider only one event at a time and use induction to
prove various properties of the sequence of configurations we might reach.
We will often use lowercase Greek letters for individual events or se-
quences of events. Configurations are typically written as capital Latin
letters (often C). An execution is an alternating sequence of
configurations and events C0 σ0 C1 σ1 C2 . . . , where Ci+1 is the configuration
and write operations, but which could be more complex hardware
primitives like compare-and-swap (§18.1.3), load-linked/store-
conditional (§18.1.3), atomic queues, or more exotic objects from
the seldom-visited theoretical depths. Practical shared-memory sys-
tems may be implemented as distributed shared-memory (Chap-
ter 16) on top of a message-passing system in various ways.
Like message-passing systems, shared-memory systems must also deal
with issues of asynchrony and failures, both in the processes and in
the shared objects.
Realistic shared-memory systems have additional complications, in
that modern CPUs allow out-of-order execution in the absence of spe-
cial (and expensive) operations called fences or memory barriers.
We will effectively be assuming that our shared-memory code is liber-
ally sprinkled with these operations to ensure atomicity, but this is not
always true of real production code, and indeed there is work in the
theory of distributed computing literature on algorithms that don’t
require unlimited use of memory barriers.[[[ citation needed ]]]
We’ll see many of these at some point in this course, and examine which
of them can simulate each other under various conditions.
1.2 Properties
Properties we might want to prove about a system include:
of Busy and Main is ever green.” Such properties are typically proved
using invariants, properties of the state of the system that are true
initially and that are preserved by all transitions; this is essentially a
disguised induction proof.
There are some basic proof techniques that we will see over and over
again in distributed computing.
For lower bound and impossibility proofs, the main tool is the in-
distinguishability argument. Here we construct two (or more) executions
in which some process has the same input and thus behaves the same way,
regardless of what algorithm it is running. This exploitation of a process's
ignorance is what makes impossibility results possible in distributed com-
puting despite being notoriously difficult in most areas of computer science.2
For safety properties, statements that some bad outcome never occurs,
the main proof technique is to construct an invariant. An invariant is es-
sentially an induction hypothesis on reachable configurations of the system;
an invariant proof shows that the invariant holds in all initial configurations,
and that if it holds in some configuration, it holds in any configuration that
is reachable in one step.
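To make the invariant-as-induction idea concrete, here is a small Python sketch (not from the course) that checks a safety invariant over every reachable configuration of a toy token-passing system. The system, its event set, and all names are invented for illustration; the search over reachable states is exactly the "base case plus one-step preservation" structure described above.

```python
from collections import deque

# Toy system: one token circulates among N processes. A configuration is a
# tuple of booleans recording who holds the token; the only event is
# "process i passes the token to process (i + 1) mod N".
N = 3
initial = (True, False, False)

def successors(config):
    # All configurations reachable in one step.
    for i in range(N):
        if config[i]:
            nxt = list(config)
            nxt[i] = False
            nxt[(i + 1) % N] = True
            yield tuple(nxt)

def invariant(config):
    # Safety property: exactly one token exists.
    return sum(config) == 1

# Disguised induction proof: the invariant holds initially, and holds in any
# configuration reachable in one step from a configuration where it holds.
seen, frontier = {initial}, deque([initial])
assert invariant(initial)
while frontier:
    c = frontier.popleft()
    for s in successors(c):
        assert invariant(s)
        if s not in seen:
            seen.add(s)
            frontier.append(s)
```

For this toy system the search visits only the three single-token configurations, so the exhaustive check is cheap; for real protocols the same structure is carried out as a proof rather than a search.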
Induction is also useful for proving termination and liveness proper-
ties, statements that some good outcome occurs after a bounded amount of
time. Here we typically structure the induction hypothesis as a progress
measure, showing that some sort of partial progress holds by a particular
time, with the full guarantee implied after the time bound is reached.
²An exception might be lower bounds for data structures, which also rely on a
process's ignorance.
Part I
Message passing
Chapter 2
Model
We’re going to use a simplified version of the model in [AW04, Chapter 2].
The main difference is that Attiya and Welch include an inbuf component in
the state of each process for holding incoming messages, and have delivery
events only move messages from one process’s outbuf to another’s inbuf. We
will have delivery events cause the receiving process to handle the incoming
message immediately, which is closer to how most people write message-
passing algorithms, and which serves as an example of a good policy for
anybody dealing with a lot of email. It’s not hard to show that this doesn’t
change what we can or can’t do in the system, because (in one direction) the
inbuf-less processes can always simulate an inbuf, and (in the other direction)
we can pretend that messages stored in the inbuf are only really delivered
when they are processed. We will still allow local computation events that
don’t require a message to be delivered, to allow processes to take action
spontaneously, but we won’t use them as much.
its own outbuf) or a computation event (some process updates its state
and possibly adds new messages to its outbuf). An execution segment
is a sequence of alternating configurations and events C0 , φ1 , C1 , φ2 , . . . , in
which each triple Ci φi+1 Ci+1 is consistent with the transition rules for the
event φi+1 , and the last element of the sequence (if any) is a configuration.
If the first configuration C0 is an initial configuration of the system, we
have an execution. A schedule is an execution with the configurations
removed.
• A process can’t tell when its outgoing messages are delivered, because
the outbuf i variables aren’t included in the accessible state used as
input to the transition function.
S|i. In particular, this means that i will have the same accessible
state after any two schedules S and S′ where S|i = S′|i, and thus
will take the same actions in both schedules. This is the basis for
indistinguishability proofs (§3.2), a central technique in obtaining
lower bounds and impossibility results.
1 initially do
2 send request to server
Algorithm 2.1: Client-server computation: client code
The interpretation of Algorithm 2.1 is that the client sends request (by
adding it to its outbuf) in its very first computation event (after which it does
nothing). The interpretation of Algorithm 2.2 is that in any computation
event where the server observes request in its inbuf, it sends response.
We want to claim that the client eventually receives response in any
admissible execution. To prove this, observe that:
1. After finitely many steps, the client carries out a computation event.
This computation event puts request in its outbuf.
2. After finitely many more steps, a delivery event occurs that delivers
request to the server. This causes the server to send response.
3. After finitely many more steps, a delivery event delivers response to
the client, causing it to process response (and do nothing, given that
we haven’t included any code to handle this response).
Each step of the proof is justified by the constraints on admissible execu-
tions. If we could run for infinitely many steps without a particular process
doing a computation event or a particular message being delivered, we’d
violate those constraints.
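The three-step argument above can be mirrored in a few lines of executable code. The following Python sketch is a hypothetical rendering of the client and server of Algorithms 2.1 and 2.2 in our inbuf-less model, where a delivery event makes the recipient handle the message immediately; the scheduler and the dictionary names are scaffolding for illustration, not part of the formal model.

```python
# Each process has an outbuf of (destination, message) pairs; `delivered`
# records what each process has handled.
outbuf = {"client": [], "server": []}
delivered = {"client": [], "server": []}

def handle(process, msg):
    # Computation triggered by a delivery event. The server answers a
    # request; the client has no handler for response, matching the text.
    delivered[process].append(msg)
    if process == "server" and msg == "request":
        outbuf["server"].append(("client", "response"))

def deliver(sender):
    # Delivery event: move one message out of sender's outbuf and have the
    # recipient process it immediately.
    dest, msg = outbuf[sender].pop(0)
    handle(dest, msg)

# An admissible execution: the client's first computation event sends
# request, then both messages are eventually delivered.
outbuf["client"].append(("server", "request"))
deliver("client")   # request reaches the server, which sends response
deliver("server")   # response reaches the client
assert delivered["client"] == ["response"]
```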
Most of the time we will not attempt to prove the correctness of a pro-
tocol at quite this level of tedious detail. But if you are only interested in
distributed algorithms that people actually use, you have now seen a proof
of correctness for 99.9% of them, and do not need to read any further.
[Figure: an execution of processes p1, p2, and p3, plotted against time]
same order that p sends them (this can be simulated by a non-FIFO channel
by adding a sequence number to each message, and queuing messages at
the receiver until all previous messages have been processed).
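The sequence-number simulation sketched in parentheses might look like the following Python fragment; the class and its names are invented for illustration. Messages may arrive in any order, but the receiver buffers them and processes them strictly by sequence number.

```python
# FIFO delivery simulated on top of a non-FIFO channel: the sender tags each
# message with a sequence number, and the receiver queues out-of-order
# arrivals until all earlier messages have been processed.
class FifoReceiver:
    def __init__(self):
        self.next_seq = 0     # next sequence number to process
        self.buffer = {}      # seq -> payload, held until its turn
        self.processed = []   # messages handled, in FIFO order

    def on_receive(self, seq, payload):
        self.buffer[seq] = payload
        # Drain the buffer in strict sequence-number order.
        while self.next_seq in self.buffer:
            self.processed.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

r = FifoReceiver()
# The channel reorders: messages 2 and 1 arrive before 0.
for seq, payload in [(2, "c"), (1, "b"), (0, "a")]:
    r.on_receive(seq, payload)
assert r.processed == ["a", "b", "c"]
```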
If we go as far as to assume synchrony, we get the execution in Figure 2.3.
Now all messages take exactly one time unit to arrive, and computation
events follow each other in lockstep.
[Figures: further executions of processes p1, p2, and p3 plotted against time; the last shows a synchronous execution proceeding in numbered rounds]
Coordinated attack
Validity If all processes have the same input x, and no messages are lost,
all processes produce output x. (If processes start with different inputs
or one or more messages are lost, processes can output 0 or 1 as long
as they all agree.)
Sadly, there is no protocol that satisfies all three conditions. We show
this in the next section.
So far, pretty dull. But now let’s consider a chain of hypothetical exe-
cutions A = A0 A1 . . . Ak = B, where each Ai is indistinguishable from Ai+1
for some process pi . Suppose also that we are trying to solve an agreement
¹Bounded means that there is a fixed upper bound on the length of any execution.
We could also demand merely that all processes terminate in a finite number of rounds.
In general, finite is a weaker requirement than bounded, but if the number of possible
outcomes at each step is finite (as it is in this case), they're equivalent. The reason is
that if we build a tree of all configurations, each configuration has only finitely many
successors, and the length of each path is finite, then König's lemma (see
http://en.wikipedia.org/wiki/Konig's_lemma) says that there are only finitely many
paths. So we can take the length of the longest of these paths as our fixed bound.
[BG97, Lemma 3.1]
²Without making additional assumptions, always a caveat when discussing impossibility.
task, where every process must output the same value. Then since pi out-
puts the same value in Ai and Ai+1 , every process outputs the same value
in Ai and Ai+1 . By induction on k, every process outputs the same value in
A and B, even though A and B may be very different executions.
This gives us a tool for proving impossibility results for agreement: show
that there is a path of indistinguishable executions between two executions
that are supposed to produce different output. Another way to picture
this: consider a graph whose nodes are all possible executions with an edge
between any two indistinguishable executions; then the set of output-0 exe-
cutions can’t be adjacent to the set of output-1 executions. If we prove the
graph is connected, we prove the output is the same for all executions.
For coordinated attack, we will show that no protocol satisfies all of
agreement, validity, and termination using an indistinguishability argument.
The key idea is to construct a path between the all-0-input and all-1-input
executions with no message loss via intermediate executions that are indis-
tinguishable to at least one process.
Let’s start with A = A0 being an execution in which all inputs are 1 and
all messages are delivered. We’ll build executions A1 , A2 , etc. by pruning
messages. Consider Ai and let m be some message that is delivered in the
last round in which any message is delivered. Construct Ai+1 by not deliv-
ering m. Observe that while Ai is distinguishable from Ai+1 by the recipient
of m, on the assumption that n ≥ 2 there is some other process that can’t
tell whether m was delivered or not (the recipient can't let that other pro-
cess know, because no subsequent messages it sends are delivered in either
execution). Continue until we reach an execution Ak in which all inputs are
1 and no messages are sent. Next, let Ak+1 through Ak+n be obtained by
changing one input at a time from 1 to 0; each such execution is indistin-
guishable from its predecessor by any process whose input didn’t change.
Finally, construct Ak+n through A2k+n by adding back messages in the re-
verse of the process used for A0 through Ak . This gets us to an execution A2k+n
in which all processes have input 0 and no messages are lost. If agreement
holds, then the indistinguishability of adjacent executions to some process
means that the common output in A0 is the same as in A2k+n . But validity
requires that A0 outputs 1 and A2k+n outputs 0: so validity is violated.
3.3.1 An algorithm
Here’s an algorithm that gives ε = 1/r. (See [Lyn96, §5.2.2] for details
or [VL92] for the original version.) A simplifying assumption is that the network
is complete, although a strongly-connected network with r greater than or
equal to the diameter also works.
• So now we have that level^r_i[i] is in {ℓ, ℓ + 1}, where ℓ is some fixed
value uncorrelated with key. The only way to get some process
to decide 1 while others decide 0 is if ℓ + 1 ≥ key but ℓ < key.
(If ℓ = 0, a process at this level doesn’t know key, but it can still
reason that 0 < key since key is in [1, r].) This can only occur if
key = ℓ + 1, which occurs with probability at most 1/r since key
was chosen uniformly.
4.1 Flooding
Flooding is about the simplest of all distributed algorithms. It’s dumb and
expensive, but easy to implement, and gives you both a broadcast mecha-
nism and a way to build rooted spanning trees.
We’ll give a fairly simple presentation of flooding roughly following Chap-
ter 2 of [AW04].
1 initially do
2 if pid = root then
3 seen-message ← true
4 send M to all neighbors
5 else
6 seen-message ← false
7 upon receiving M do
8 if seen-message = false then
9 seen-message ← true
10 send M to all neighbors
Note that the time complexity proof also demonstrates correctness: every
process receives M at least once.
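For concreteness, here is one possible Python rendering of Algorithm 4.1, driven by a synchronous round-by-round scheduler; the driver loop and the adjacency-list representation are our own scaffolding for illustration, not part of the algorithm itself.

```python
# Flooding on an undirected graph given as an adjacency list. `seen` plays
# the role of seen-message; `pending` holds in-flight (src, dst) messages.
def flood(adj, root):
    seen = {root}                  # seen-message = true only at the root
    pending = [(root, nbr) for nbr in adj[root]]
    rounds = 0
    while pending:
        rounds += 1
        nxt = []
        for _, dst in pending:
            if dst not in seen:    # first copy of M: mark and forward
                seen.add(dst)
                nxt.extend((dst, nbr) for nbr in adj[dst])
        pending = nxt              # all of this round's sends, delivered next round
    return seen, rounds

# A 4-cycle: 0-1, 0-2, 1-3, 2-3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
seen, rounds = flood(adj, 0)
assert seen == {0, 1, 2, 3}        # every process receives M at least once
```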
As written, this is a one-shot algorithm: you can’t broadcast a sec-
ond message even if you wanted to. The obvious fix is for each process
to remember which messages it has seen and only forward the new ones
(which costs memory) and/or to add a time-to-live (TTL) field on each
message that drops by one each time it is forwarded (which may cost ex-
tra messages and possibly prevents complete broadcast if the initial TTL
is too small). The latter method is what was used for searching in
http://en.wikipedia.org/wiki/Gnutella, an early peer-to-peer system. An
interesting property of Gnutella was that since the application of flooding
was to search for huge (multiple MiB) files using tiny (∼100 byte) query mes-
sages, the actual bit complexity of the flooding algorithm was not especially
large relative to the bit complexity of sending any file that was found.
We can optimize the algorithm slightly by not sending M back to the
node it came from; this will slightly reduce the message complexity in many
cases but makes the proof a sentence or two longer. (It’s all a question of
what you want to optimize.)
1 initially do
2 if pid = root then
3 parent ← root
4 send M to all neighbors
5 else
6 parent ← ⊥
We can easily prove that Algorithm 4.2 has the same termination prop-
erties as Algorithm 4.1 by observing that if we map parent to seen-message
by the rule ⊥ → false, anything else → true, then we have the same al-
gorithm. We would like one additional property, which is that when the
algorithm quiesces (has no outstanding messages), the set of parent point-
ers form a rooted spanning tree. For this we use induction on time:
Lemma 4.1.2. At any time during the execution of Algorithm 4.2, the
following invariant holds:
1. If u.parent ≠ ⊥, then u.parent.parent ≠ ⊥ and following parent point-
ers gives a path from u to root.
At the end of the algorithm, the invariant shows that every process has
a path to the root, i.e., that the graph represented by the parent pointers is
connected. Since this graph has exactly |V| − 1 edges (if we don’t count the
self-loop at the root), it’s a tree.
Though we get a spanning tree at the end, we may not get a very good
spanning tree. For example, suppose our friend the adversary picks some
Hamiltonian path through the network and delivers messages along this
path very quickly while delaying all other messages for the full allowed 1
time unit. Then the resulting spanning tree will have depth |V| − 1, which
might be much worse than D. If we want the shallowest possible spanning
tree, we need to do something more sophisticated: see the discussion of
distributed breadth-first search in Chapter 5. However, we may be
happy with the tree we get from simple flooding: if the message delay on
each link is consistent, then it’s not hard to prove that we in fact get a
shortest-path tree. As a special case, flooding always produces a BFS tree
in the synchronous model.
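A possible Python sketch of the parent-pointer variant (Algorithm 4.2) follows; the FIFO queue here is a stand-in for one particular delivery schedule, and with this schedule the tree that comes out happens to be a shortest-path tree, consistent with the synchronous-model observation above. An adversarial schedule could instead be modeled by popping messages in a different order.

```python
from collections import deque

# Flooding that records, at each process, the sender of the first copy of M
# it receives. parent = ⊥ is modeled by absence from the dictionary; the
# root points at itself, as in the text.
def spanning_tree(adj, root):
    parent = {root: root}
    queue = deque((root, nbr) for nbr in adj[root])
    while queue:
        src, dst = queue.popleft()     # one delivery event
        if dst not in parent:          # parent = ⊥: adopt the sender
            parent[dst] = src
            queue.extend((dst, nbr) for nbr in adj[dst])
    return parent

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
parent = spanning_tree(adj, 0)
# |V| - 1 tree edges, not counting the root's self-loop:
tree_edges = [(v, p) for v, p in parent.items() if v != p]
assert len(tree_edges) == len(adj) - 1
```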
Note also that while the algorithm works in a directed graph, the parent
pointers may not be very useful if links aren’t two-way.
4.1.3 Termination
See [AW04, Chapter 2] for further modifications that allow the processes to
detect termination. In a sense, each process can terminate as soon as it is
done sending M to all of its neighbors, but this still requires some mecha-
nism for clearing out the inbuf; by adding acknowledgments as described in
[AW04], we can terminate with the assurance that no further messages will
be received.
4.2 Convergecast
A convergecast is the inverse of broadcast: instead of a message propa-
gating down from a single root to all nodes, data is collected from outlying
nodes to the root. Typically some function is applied to the incoming data
at each node to summarize it, with the goal being that eventually the root
obtains this function of all the data in the entire system. (Examples would
be counting all the nodes or taking an average of input values at all the
nodes.)
A basic convergecast algorithm is given in Algorithm 4.3; it propagates
information up through a previously-computed spanning tree.
1 initially do
2 if I am a leaf then
3 send input to parent
• If input = 1 for all nodes and f is sum, then we count the number of
nodes in the system.
• If input is arbitrary and f is sum, then we get a total of all the input
values.
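As an illustration of both bullet points, here is a Python sketch of convergecast over a precomputed rooted tree; the tree, the inputs, and the function names are made up, and recursion stands in for the upward message pattern (each recursive return is one message from a child to its parent).

```python
# Convergecast on a rooted tree: each node waits for summaries from all of
# its children, applies f to its own input together with theirs, and sends
# the result to its parent. The root ends up with f over all inputs.
def convergecast(children, inputs, root, f=sum):
    def up(v):
        # Value sent upward from v: f over v's input and its children's
        # summaries (leaves have no children, so they just send their input).
        return f([inputs[v]] + [up(c) for c in children.get(v, [])])
    return up(root)

children = {0: [1, 2], 1: [3, 4]}        # tree rooted at 0; 2, 3, 4 are leaves
inputs = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}  # input = 1 everywhere: count the nodes
assert convergecast(children, inputs, 0) == 5
```

Replacing f by max or min, or the inputs by arbitrary values, gives the other aggregates mentioned above.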
1 initially do
2 children ← ∅
3 nonChildren ← ∅
4 if pid = root then
5 parent ← root
6 send init to all neighbors
7 else
8 parent ← ⊥
Distributed breadth-first search
1 initially do
2 if pid = initiator then
3 distance ← 0
4 send distance to all neighbors
5 else
6 distance ← ∞
The initiator sends exactly(0) to all neighbors at the start of the protocol
(these are the only messages the initiator sends).
My distance will be the unique distance that I am allowed to send in an
exactly(d) message. Note that this algorithm terminates in the sense that
every node learns its distance at some finite time.
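The core distance-relaxation behavior can be sketched in Python as follows. This compresses the exactly/more-than bookkeeping of the full algorithm into a single worklist of in-flight distance announcements, popped in an arbitrary (non-FIFO) order; it is only meant to illustrate why every node's estimate converges to its true distance in finite time.

```python
# Each node adopts any strictly smaller distance estimate it hears about and
# re-announces estimate + 1 to its neighbors. The worklist plays the role of
# the asynchronous message system.
def bfs_distances(adj, initiator):
    INF = float("inf")
    distance = {v: INF for v in adj}
    distance[initiator] = 0
    work = [(nbr, 1) for nbr in adj[initiator]]
    while work:
        v, d = work.pop()          # arbitrary delivery order (LIFO here)
        if d < distance[v]:        # smaller estimate: update and re-send
            distance[v] = d
            work.extend((nbr, d + 1) for nbr in adj[v])
    return distance

# A 4-cycle: 0-1, 1-2, 2-3, 3-0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
dist = bfs_distances(adj, 0)
assert dist == {0: 0, 1: 1, 2: 2, 3: 1}
```

Each node's estimate only ever decreases, and every decrease is to a value achievable by some path, so the process terminates with true shortest-path distances regardless of delivery order.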
If you read the discussion of synchronizers in Chapter 8, this algorithm
essentially corresponds to building the alpha synchronizer into the syn-
chronous BFS algorithm, just as the layered model builds in the beta syn-
chronizer. See [AW04, §11.3.2] for a discussion of BFS using synchronizers.
The original approach of applying synchronizers to get BFS is due to Awer-
buch [Awe85].
We now show correctness. Under the assumption that local computation
takes zero time and message delivery takes at most 1 time unit, we'll show
that if d(initiator, p) = d, (a) p sends more-than(d′) for any d′ < d by time
d′, (b) p sends exactly(d) by time d, (c) p never sends more-than(d′) for any
d′ ≥ d, and (d) p never sends exactly(d′) for any d′ ≠ d. For parts (c) and
(d) we use induction on d′; for (a) and (b), induction on time. This is not
terribly surprising: (c) and (d) are safety properties, so we don't need to
talk about time. But (a) and (b) are liveness properties, so time comes in.
Let’s start with (c) and (d). The base case is that the initiator never
sends any more-than messages at all, and so never sends more-than(0), and
any non-initiator never sends exactly(0). For larger d′, observe that if a
non-initiator p sends more-than(d′) for d′ ≥ d, it must first have received
1
In an earlier version of these notes, these messages were called distance(d) and
not-distance(d); the more self-explanatory exactly and more-than terminology is taken from
[BDLP08].
Chapter 6

Leader election
6.1 Symmetry
A system exhibits symmetry if we can permute the nodes without changing
the behavior of the system. More formally, we can define a symmetry as an
equivalence relation on processes, where we have the additional properties
that all processes in the same equivalence class run the same code; and
whenever p is equivalent to p′, each neighbor q of p is equivalent to the
corresponding neighbor q′ of p′.
An example of a network with a lot of symmetries would be an anony-
mous ring, which is a network in the form of a cycle (the ring part) in
which every process runs the same code (the anonymous part). In this case
all nodes are equivalent. If we have a line, then we might or might not have
any non-trivial symmetries: if each node has a sense of direction that
tells it which neighbor is to the left and which is to the right, then we can
identify each node uniquely by its distance from the left edge. But if the
nodes don’t have a sense of direction, we can flip the line over and pair up
nodes that map to each other.1
Symmetries are convenient for proving impossibility results, as observed
by Angluin [Ang80]. The underlying theme is that without some mecha-
nism for symmetry breaking, a message-passing system cannot escape from
a symmetric initial configuration. The following lemma holds for determin-
istic systems, basically those in which processes can't flip coins:
same state, then p and p′ receive the same messages from their neighbors
and can proceed to the same state (including outgoing messages) in the next
round.
Formally, we’ll let the state space for each process i consist of two vari-
ables: leader, initially 0, which is set to 1 if i decides it’s a leader; and maxId,
the largest id seen so far. We assume that i denotes i’s position rather than
its id, which we’ll write as idi . We will also treat all positions as values mod
n, to simplify the arithmetic.
Code for the LCR algorithm is given in Algorithm 6.1.
1 initially do
2 leader ← 0
3 maxId ← idi
4 send idi to clockwise neighbor
5 upon receiving j do
6 if j = idi then
7 leader ← 1
8 if j > maxId then
9 maxId ← j
10 send j to clockwise neighbor
6.2.1.2 Performance
It’s immediate from the correctness proof that the protocols terminates after
exactly n rounds.
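The behavior of Algorithm 6.1 can be checked with a small round-based simulation; this is a hypothetical sketch (names invented for the example), with the whole ring simulated in lockstep rather than by real message passing.

```python
# Hypothetical round-based simulation of LCR on a ring: each process first
# sends its own id clockwise, forwards only ids larger than any it has seen,
# swallows smaller ones, and declares itself leader when its own id returns.

def lcr(ids):
    n = len(ids)
    max_id = list(ids)
    outgoing = list(ids)                # round 0: everyone sends its own id
    leader = None
    for _ in range(n):                  # n rounds suffice: the max id returns home
        incoming = [outgoing[(i - 1) % n] for i in range(n)]
        outgoing = [None] * n
        for i, j in enumerate(incoming):
            if j is None:
                continue
            if j == ids[i]:
                leader = i              # my id survived the whole ring
            elif j > max_id[i]:
                max_id[i] = j
                outgoing[i] = j         # forward only new maxima
    return leader

print(lcr([3, 7, 2, 9, 4]))  # 3 (the position holding the maximum id, 9)
```

Only the maximum id makes it all the way around, so exactly one process sets leader, matching the correctness argument above.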
its nearest neighbors to the left and right; if its ID is larger than the IDs of
both neighbors, it survives to the next phase. Non-candidates act as relays
passing messages between candidates. As in Hirschberg and Sinclair (§6.2.2),
the probing operations in each phase take O(n) messages, and at least half
of the candidates drop out in each phase. The last surviving candidate wins
when it finds that it’s its own neighbor.
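The candidate-halving idea in the 2-way case can be sketched in a few lines of Python; this is a hypothetical global simulation of the phase structure only (it ignores the O(n)-message probing mechanics, and is not Peterson's unidirectional relay scheme).

```python
# Hypothetical sketch of the 2-way candidate-halving idea: in each phase a
# candidate survives only if its id exceeds the ids of its nearest candidate
# neighbors on both sides. At most half the candidates survive a phase, so
# only O(log n) phases are needed before a single candidate remains.

def elect_by_halving(ids):
    candidates = list(ids)              # surviving candidates, in ring order
    while len(candidates) > 1:
        k = len(candidates)
        candidates = [c for i, c in enumerate(candidates)
                      if c > candidates[(i - 1) % k]
                      and c > candidates[(i + 1) % k]]
    return candidates[0]

print(elect_by_halving([3, 7, 2, 9, 4]))  # 9
```

With distinct ids the global maximum is always a local maximum, so at least one candidate survives each phase and the loop terminates.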
To make this work in a 1-way ring, we have to simulate 2-way communi-
cation by moving the candidates clockwise around the ring to catch up with
their unsendable counterclockwise messages. Peterson’s algorithm does this
with a two-hop approach that is inspired by the 2-way case above; in each
phase k, a candidate effectively moves two positions to the right, allowing it
to look at the ids of three phase-k candidates before deciding to continue in
phase k + 1 or not. Here is a very high-level description; it assumes that we
can buffer and ignore incoming messages from the later phases until we get
to the right phase, and that we can execute sends immediately upon receiv-
ing messages. Doing this formally in terms of I/O automata or the model of
§2.1 means that we have to build explicit internal buffers into our processes,
which we can easily do but won’t do here (see [Lyn96, pp. 483–484] for the
right way to do this.)
We can use a similar trick to transform any bidirectional-ring algorithm
into a unidirectional-ring algorithm: alternate between phases where we
send a message right, then send a virtual process right to pick up any left-
going messages deposited for us. The problem with this trick is that it
requires two messages per process per phase, which gives us a total message
complexity of O(n2 ) if we start with an O(n)-time algorithm. Peterson’s
algorithm avoids this by only propagating the surviving candidates.
Pseudocode for Peterson’s algorithm is given in Algorithm 6.2.
Note: the phase arguments in the probe messages are useless if one has
FIFO channels, which is why [Lyn96] doesn’t use them. Note also that the
algorithm does not elect the process with the highest ID, but the process
that is carrying the sole surviving candidate in the last phase.
Proof of correctness is essentially the same as for the 2-way algorithm.
For any pair of adjacent candidates, at most one of their current IDs survives
to the next phase. So we get a sole survivor after lg n phases. Each process
sends or relays at most 2 messages per phase, so we get at most 2n lg n
total messages.
1 procedure candidate()
2 phase ← 0
3 current ← pid
4 while true do
5 send probe(phase, current)
6 wait for probe(phase, x)
7 id2 ← x
8 send probe(phase, current)
9 wait for probe(phase, x)
10 id3 ← x
11 if id2 = current then
12 I am the leader!
13 return
14 else if id2 > current and id2 > id3 do
15 current ← id2
16 phase ← phase + 1
17 else
18 switch to relay()
19 procedure relay()
20 upon receiving probe(p, i) do
21 send probe(p, i)
“time-bounded” means that the running time can’t depend on the size of
the ID space. See [AW04, §3.4.2] or [Lyn96, §3.7] for the textbook version,
or [FL87, §7] for the original result.
The intuition is that for any fixed protocol, if the ID space is large
enough, then there exists a subset of the ID space where the protocol
acts like a comparison-based protocol. So the existence of an O(f (n))-
message time-bounded protocol implies the existence of an O(f (n))-message
comparison-based protocol, and from the previous lower bound we know
f (n) is Ω(n log n). Note that time-boundedness is necessary: we can’t prove
the lower bound for non-time-bounded algorithms because of the i · n trick.
with all edges between them the same color, you will no longer be able to once the graph
is large enough (for any fixed k). See [GRS90] for much more on the subject of Ramsey
theory.
Chapter 7
Logical clocks
same process.
2. All pairs (e, e′) where e is a send event and e′ is the receive event for
the same message.
3. All pairs (e, e′) where there exists a third event e″ such that e ⇒S e″
and e″ ⇒S e′. (In other words, we take the transitive closure of the
relation defined by the previous two cases.)
It is not terribly hard to show that this gives a partial order; the main
observation is that if e ⇒S e′, then e precedes e′ in S. So ⇒S is a subset of
the total order <S given by the order of events in S.
A causal shuffle S′ of a schedule S is a permutation of S that is consis-
tent with the happens-before relation on S; that is, if e happens-before e′ in
S, then e precedes e′ in S′. The importance of the happens-before relation
follows from the following lemma, which says that the causal shuffles of S
are precisely the schedules S′ that are similar to S.
Lemma 7.1.1. Let S′ be a permutation of the events in S. Then the fol-
lowing two statements are equivalent:
1. S′ is a causal shuffle of S.
2. S′ is the schedule of an execution fragment of a message-passing system
with S′|p = S|p for all p.
Proof. (1 ⇒ 2). We need to show both similarity and that S′ corresponds
to some execution fragment. We'll show similarity first. Pick some p; then
every event at p in S also occurs in S′, and they must occur in the same order
by the first case of the definition of the happens-before relation. This gets
us halfway to showing S′ is the schedule of some execution fragment, since
it says that any events initiated by p are consistent with p's programming.
To get the rest of the way, observe that any other events are receive events.
For each receive event e′ in S, there must be some matching send event e
also in S; thus e and e′ are both in S′ and occur in the right order by the
second case of the definition of happens-before.
(2 ⇒ 1). First observe that since every event e in S′ occurs at some
process p, if S′|p = S|p for all p, then there is a one-to-one correspondence
between events in S′ and S, and thus S′ is a permutation of S. Now we
need to show that S′ is consistent with ⇒S. Let e ⇒S e′. There are three
cases.
1. e and e′ are events of the same process p and e <S e′. But then e <S′ e′
because S|p = S′|p.
What this means: if I tell you ⇒S , then you know everything there is
to know about the order of events in S that you can deduce from reports
from each process together with the fact that messages don’t travel back in
time. But ⇒S is a pretty big relation (Θ(|S|²) bits with a naive encoding),
and seems to require global knowledge of <S to compute. So we can ask if
there is some simpler, easily computable description that works almost as
well. This is where logical clocks come in.
7.2 Implementations
The basic idea of a logical clock is to compute a timestamp for each event,
so that comparing timestamps gives information about ⇒S . Note that these
timestamps need not be totally ordered. In general, we will have a relation
<L between timestamps such that e ⇒S e′ implies e <L e′, but it may be
that there are some pairs of events that are ordered by the logical clock
despite being incomparable in the happens-before relation.
Examples of logical clocks that use small timestamps but add extra or-
dering are Lamport clocks [Lam78], discussed in §7.2.1; and Neiger-Toueg-
Welch clocks [NT87, Wel87], discussed in §7.2.2. These both assign integer
timestamps to events and may order events that are not causally related.
The main difference between them is that Lamport clocks do not alter the
underlying execution, but may allow arbitrarily large jumps in the logical
clock values; while Neiger-Toueg-Welch clocks guarantee small increments
at the cost of possibly delaying parts of the system.1 A more restricted type
of logical clock is the vector clock [Fid91, Mat93], discussed in §7.2.3, which
uses n-dimensional vectors of integers to capture ⇒S exactly, at the cost of
much higher overhead.
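A Lamport clock is short enough to sketch directly; this is a hypothetical Python rendering (class and method names invented), with the three update rules as methods.

```python
# Hypothetical Lamport clock sketch: each process keeps an integer counter,
# bumps it on every local or send event, and timestamps each message; on
# receipt the counter jumps to max(local, message) + 1. This guarantees
# that e happens-before e' implies L(e) < L(e'), but not the converse.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        self.time += 1
        return self.time            # timestamp carried on the message

    def recv_event(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
t = p.send_event()                  # p's clock advances to 1
r = q.recv_event(t)                 # q jumps past the sender's timestamp
assert r > t                        # receive is ordered after the matching send
```

The jump in recv_event is exactly the "arbitrarily large jumps" mentioned above; a Neiger-Toueg-Welch clock would instead delay the delivery until the local clock catches up.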
Proof. Let e <L e′ if e has a lower clock value than e′. If e and e′ are two
events of the same process, then e <L e′. If e and e′ are send and receive
events of the same message, then again e <L e′. So for any events e, e′, if
e ⇒S e′, then e <L e′. Now apply Lemma 7.1.1.
Proof. Again, we have that (a) all events at the same process occur in in-
creasing order (since the event count rises even if the clock value doesn’t,
and we assume that the clock value doesn’t drop) and (b) all receive events
occur later than the corresponding send event (since we force them to). So
Lemma 7.1.1 applies.
Theorem 7.2.3. Fix a schedule S; then for any e, e′, VC(e) < VC(e′) if
and only if e ⇒S e′.
2
As I write this, my computer reports that its clock is an estimated 289 microseconds
off from the timeserver it is synchronized to, which is less than a tenth of the round-trip
delay to machines on the same local-area network and a tiny fraction of the round-trip
delay to machines elsewhere, including the timeserver machine.
CHAPTER 7. LOGICAL CLOCKS 52
Proof. The if part follows immediately from the update rules for the vector
clock. For the only if part, suppose e does not happen-before e′. Then e and
e′ are events of distinct processes p and p′. For VC(e) < VC(e′) to hold, we
must have VC(e)p < VC(e′)p; but this can occur only if the value of VC(e)p
is propagated to p′ by some sequence of messages starting at p and ending
at p′ at or before e′ occurs. In this case we have e ⇒S e′.
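The vector-clock update rules behind Theorem 7.2.3 can also be sketched; this is a hypothetical Python version (names invented), with the componentwise comparison written out explicitly.

```python
# Hypothetical vector clock sketch for n processes: process i increments
# entry i on each event and takes a componentwise max on receipt, so
# VC(e) < VC(e') (componentwise <=, unequal somewhere) holds exactly
# when e happens-before e'; concurrent events are incomparable.

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n
        self.i = i

    def event(self):
        self.v[self.i] += 1
        return tuple(self.v)        # timestamp of this event

    def recv(self, msg_v):
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        return self.event()         # receiving is itself an event

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

p, q = VectorClock(2, 0), VectorClock(2, 1)
e1 = p.event()                      # (1, 0)
e2 = q.recv(p.event())              # q merges p's timestamp: (2, 1)
e3 = q.event()                      # (2, 2)
assert happens_before(e1, e3)       # causally related via the message
assert not happens_before(e1, (0, 5))  # concurrent: incomparable
```

The cost is the n-entry timestamp on every message, the "much higher overhead" noted earlier.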
7.3 Applications
7.3.1 Consistent snapshots
A consistent snapshot of a message-passing computation is a description
of the states of the processes (and possibly messages in transit, but we
can reduce this down to just states by keeping logs of messages sent and
received) that gives the global configuration at some instant of a schedule
that is a consistent reordering of the real schedule (a consistent cut in
the terminology of [AW04, §6.1.2]). Without shutting down the protocol
before taking a snapshot, this is about the best we can hope for in a
message-passing system.
Logical time can be used to obtain consistent snapshots: pick some logi-
cal time and have each process record its state at this time (i.e. immediately
after its last step before the time or immediately before its first step after
the time). We have already argued that logical time gives a consistent re-
ordering of the original schedule, so the set of values recorded is just the
configuration at the end of an appropriate prefix of this reordering. In other
words, it’s a consistent snapshot.
If we aren’t building logical clocks anyway, there is a simpler consistent
snapshot algorithm due to Chandy and Lamport [CL85]. Here some central
initiator broadcasts a snap message, and each process records its state and
immediately forwards the snap message to all neighbors when it first receives
a snap message. To show that the resulting configuration is a configuration
of some consistent reordering, observe that (with FIFO channels) no process
receives a message before receiving snap that was sent after the sender sent
snap: thus causality is not violated by lining up all the pre-snap operations
before all the post-snap ones.
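The snap-propagation part of this idea can be sketched in Python; this is a hypothetical simulation (names invented) that only models state recording, not the channel-sweeping marker machinery or real FIFO channels.

```python
# Hypothetical sketch of snap propagation: the initiator broadcasts snap;
# each process records its local state the first time it receives snap and
# forwards snap to all neighbors, ignoring later copies. With FIFO channels
# the recorded states form a consistent cut.

from collections import deque

def snapshot(adj, state, initiator):
    recorded = {}
    pending = deque([initiator])              # processes with a snap in flight
    while pending:
        p = pending.popleft()
        if p in recorded:
            continue                          # only the first snap matters
        recorded[p] = state[p]                # record local state immediately
        for q in adj[p]:
            pending.append(q)                 # forward snap to all neighbors
    return recorded

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(snapshot(adj, {0: 'a', 1: 'b', 2: 'c'}, 0))  # {0: 'a', 1: 'b', 2: 'c'}
```

The FIFO assumption is what this sketch cannot show: it is what prevents a post-snap message from overtaking the snap itself.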
The full Chandy-Lamport algorithm adds a second marker message that
is used to sweep messages in transit out of the communications channels,
which avoids the need to keep logs if we want to reconstruct what messages
are in transit (this can also be done with the logical clock version). The
idea is that when a process records its state after receiving the snap mes-
holds, we will eventually start the snapshot protocol after it holds and obtain
a configuration (which again may not correspond to any global configuration
that actually occurs) in which P holds.
Chapter 8

Synchronizers
8.1 Definitions
Formally, a synchronizer sits between the underlying network and the pro-
cesses and does one of two things:
In both cases the synchronizer packages all the incoming round r mes-
sages m for a single process together and delivers them as a single action
recv(p, m, r). Similarly, a process is required to hand over all of its outgoing
round-r messages to the synchronizer as a single action send(p, m, r)—this
prevents a process from changing its mind and sending an extra round-r
message or two. It is easy to see that the global synchronizer produces ex-
ecutions that are effectively indistinguishable from synchronous executions,
assuming that a synchronous execution is allowed to have some variability
in exactly when within a given round each process does its thing. The local
synchronizer only guarantees an execution that is locally indistinguishable
from an execution of the global synchronizer: an individual process can’t
tell the difference, but comparing actions at different (especially widely sep-
arated) processes may reveal some process finishing round r + 1 while others
are still stuck in round r or earlier. Whether this is good enough depends
on what you want: it’s bad for coordinating simultaneous missile launches,
but may be just fine for adapting a synchronous message-passing algorithm
(e.g. for distributed breadth-first search as described in Chapter 5) to an
asynchronous system, if we only care about the final states of the processes
and not when precisely those states are reached.
Formally, the relation between global and local synchronization is de-
scribed by the following lemma:
8.2 Implementations
These all implement at least a local synchronizer (the beta synchronizer is
global). The names were chosen by their inventor, Baruch Awerbuch [Awe85].
The main difference between them is the mechanism used to determine
when round-r messages have been delivered.
In the alpha synchronizer, every node sends a message to every neigh-
bor in every round (possibly a dummy message if the underlying protocol
doesn’t send a message); this allows the receiver to detect when it’s gotten
all its round-r messages (because it expects to get a message from every
neighbor) but may produce huge blow-ups in message complexity in a dense
graph.
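The alpha synchronizer's round-detection rule is simple enough to sketch; this is a hypothetical single-round simulation in Python (names invented), with missing payloads standing in for the dummy messages.

```python
# Hypothetical sketch of alpha-synchronizer round detection: every node
# sends one message (real payload or explicit dummy) to every neighbor each
# round, so a node knows round r is complete once it holds exactly one
# round-r message from each neighbor.

def alpha_round(adj, outbox):
    """adj: node -> neighbor list; outbox[u][v]: u's payload for v
    (missing entries become dummy messages, encoded as None)."""
    inbox = {u: {} for u in adj}
    for u in adj:
        for v in adj[u]:
            inbox[v][u] = outbox.get(u, {}).get(v)  # payload or dummy
    # Safe to advance: every node has heard from all of its neighbors.
    assert all(set(inbox[u]) == set(adj[u]) for u in adj)
    return inbox

inbox = alpha_round({0: [1], 1: [0, 2], 2: [1]}, {0: {1: 'hello'}})
print(inbox[1])  # {0: 'hello', 2: None}
```

The dummy messages are what produce the message-complexity blow-up in dense graphs: every edge carries a message every round whether or not the underlying protocol is talking.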
In the beta synchronizer, messages are acknowledged by their receivers
(doubling the message complexity), so the senders can detect when all of
their messages are delivered. But now we need a centralized mechanism to
collect this information from the senders and distribute it to the receivers,
since any particular receiver doesn’t know which potential senders to wait
for. This blows up time complexity, as we essentially end up building a
global synchronizer with a central leader.
The gamma synchronizer combines the two approaches at different lev-
els to obtain a trade-off between messages and time that depends on the
structure of the graph and how the protocol is organized.
Details of each synchronizer are given below.
• When the root of a tree gets all acks and OK, it sends ready to the
roots of all adjacent trees (and itself). Two trees are adjacent if any
of their members are adjacent.
• When the root collects ready from itself and all adjacent roots, it broad-
casts go through its own tree.
8.3 Applications
See [AW04, §11.3.2] or [Lyn96, §16.5]. The one we have seen is distributed
breadth-first search, where the two asynchronous algorithms we described
in Chapter 5 were essentially the synchronous algorithms with the beta and
alpha synchronizers embedded in them. But what synchronizers give us
in general is the ability to forget about problems resulting from asynchrony
provided we can assume no failures (which may be a very strong assumption)
and are willing to accept a bit of overhead.
detect termination).
We now want to perform a causal shuffle on β that leaves it with only
s − 1 sessions. The first step is to chop β into at most s − 1 segments
β1 , β2 , . . . of at most D rounds each. Because the diameter of the network
is D, there exist processes p0 and p1 such that no chain of messages starting
at p0 within some segment reaches p1 before the end of the segment. It
follows that for any events e0 of p0 and e1 of p1 in the same segment βi , it
is not the case that e0 ⇒βδ e1 . So there exists a causal shuffle of βi that
puts all events of p0 after all events of p1 . By a symmetrical argument, we
can similarly put all events of p1 after all events of p0 . In both cases the
resulting schedule is indistinguishable by all processes from the original.
So now we apply these shuffles to each of the segments βi in alternating
order: p0 goes first in the even-numbered segments and p1 goes first in the
odd-numbered segments, yielding a sequence of shuffled segments βi′. This
has the effect of putting the p0 events together, as in this example with
(s − 1) = 4:
βδ|(p0, p1) = β1β2β3β4δ|(p0, p1)
= β1′β2′β3′β4′δ|(p0, p1)
= (p1p0)(p0p1)(p1p0)(p0p1)δ
= p1(p0p0)(p1p1)(p0p0)p1δ
Chapter 9

Synchronous agreement
Validity If all processes start with the same input, all non-faulty processes
decide it.
For lower bounds, we’ll replace validity with non-triviality (often called
validity in the literature):
Non-triviality follows from validity but doesn’t imply validity; for exam-
ple, a non-trivial algorithm might have the property that if all non-faulty
processes start with the same input, they all decide something else. We’ll
start by using non-triviality, agreement, and termination to show a lower
bound on the number of rounds needed to solve the problem.
Now for the proof. To simplify the argument, let’s assume that all ex-
ecutions terminate in exactly f rounds (we can always have processes send
pointless chitchat to pad out short executions) and that every process
sends a message to every other process in every round where it has not
crashed (more pointless chitchat). Formally, this means we have a sequence
of rounds 0, 1, 2, . . . , f −1 where each process sends a message to every other
process (assuming no crashes), and a final round f where all processes decide
on a value (without sending any additional messages).
We now want to take any two executions A and B and show that both
produce the same output. To do this, we’ll transform A’s inputs into B’s
inputs one process at a time, crashing processes to hide the changes. The
problem is that just crashing the process whose input changed might change
the decision value—so we have to crash later witnesses carefully to maintain
indistinguishability all the way across the chain.
Let’s say that a process p crashes fully in round r if it crashes in round
r and no round-r messages from p are delivered. The communication
pattern of an execution describes which messages are delivered between
processes without considering their contents—in particular, it tells us which
processes crash and what other processes they manage to talk to in the
round in which they crash.
With these definitions, we can state and prove a rather complicated
induction hypothesis:
Lemma 9.2.1. For any f-round protocol with n ≥ f + 2 processes permitting
up to f crash failures; any process p; and any execution A in which at
most one process crashes per round in rounds 0 . . . r − 1, p crashes fully in
round r + 1, and no other processes crash; there is a sequence of executions
A = A0 A1 . . . Ak such that each Ai is indistinguishable from Ai+1 by some
process, each Ai has at most one crash per round, and the communication
pattern in Ak is identical to A except that p crashes fully in round r.
Proof. By induction on f − r. If r = f , we just crash p in round r and
nobody else notices. For r < f , first crash p in round r instead of r + 1, but
deliver all of its round-r messages anyway (this is needed to make space for
some other process to crash in round r + 1). Then choose some message m
sent by p in round r, and let p′ be the recipient of m. We will show that we
can produce a chain of indistinguishable executions between any execution
in which m is delivered and the corresponding execution in which it is not.
If r = f − 1, this is easy; only p′ knows whether m has been delivered,
and since n ≥ f + 2, there exists another non-faulty p″ that can't distinguish
between these two executions, since p′ sends no messages in round f or later.
If r < f − 1, we have to make sure p′ doesn't tell anybody about the missing
message.
By the induction hypothesis, there is a sequence of executions starting
with A and ending with p′ crashing fully in round r + 1, such that each exe-
cution is indistinguishable from its predecessor. Now construct the sequence
The first and last step apply the induction hypothesis; the middle one yields
indistinguishable executions since only p′ can tell the difference between m
arriving or not and its lips are sealed.
We’ve shown that we can remove one message through a sequence of
executions where each pair of adjacent executions is indistinguishable to
some process. Now paste together n − 1 such sequences (one per message)
to prove the lemma.
The rest of the proof: Crash some process fully in round 0 and then
change its input. Repeat until all inputs are changed.
9.3 Solutions
Here we give two solutions to synchronous agreement with crash failures.
The first, due to Dolev and Strong [DS83], is more practical but does not gen-
eralize well to Byzantine failures. The second is a variant on the exponential
information gathering algorithm of Pease, Shostak, and Lamport [PSL80],
which propagates enough information that it can in principle simulate any
other possible algorithm; it is mostly of interest because it can be used for
the Byzantine case as well.
9.3.1 Flooding
We’ll now show an algorithm that gets agreement, termination, and validity.
Validity here is stronger than the non-triviality condition used in the lower
bound, but the lower bound still applies: we can’t do this in less than f + 1
rounds.
So let’s do it in exactly f +1 rounds. There are two standard algorithms,
one of which generalizes to Byzantine processes under good conditions. We’ll
start with a simple approach based on flooding. This algorithm is described
Lemma 9.3.1. After f + 1 rounds, all non-faulty processes have the same
set.
Proof. Let S_i^r be the set held by process i after r rounds. What we'll really show
is that if there are no failures in round k, then S_i^r = S_j^r = S_i^{k+1} for all i,
j, and r > k. To show this, observe that no faults in round k means that
all processes that are still alive at the start of round k send their message
to all other processes. Let L be the set of live processes in round k. At the
end of round k, for i in L we have S_i^{k+1} = ⋃_{j∈L} S_j^k = S. Now we'll consider
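The flooding algorithm itself is easy to simulate globally; this is a hypothetical Python sketch (names invented) with a deliberately crude crash model: a crashing process delivers to nobody in its crash round, which is one of the adversary's allowed choices.

```python
# Hypothetical simulation of flooding agreement with crash failures: for
# f + 1 rounds every live process sends its current set of known inputs to
# everyone. With at most f crashes, some round is failure-free, after which
# all live processes hold the same set and a common rule (here: min) decides.

def flood_agree(inputs, f, crash_round):
    """crash_round[p]: the round in which p crashes (None = never); for
    simplicity a crashing process delivers to nobody in its crash round."""
    n = len(inputs)
    known = [{v} for v in inputs]
    for r in range(f + 1):
        live = [p for p in range(n)
                if crash_round[p] is None or crash_round[p] > r]
        msgs = [set(known[p]) for p in live]   # round-r broadcasts, pre-update
        for p in live:
            for s in msgs:
                known[p] |= s
    survivors = [p for p in range(n) if crash_round[p] is None]
    return {min(known[p]) for p in survivors}

print(flood_agree([3, 1, 2], f=1, crash_round=[0, None, None]))  # {1}
```

The returned set is a singleton exactly when the survivors agree; with at most f crashes in f + 1 rounds, some round is clean and Lemma 9.3.1 applies.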
inputs, but a tree describing who it heard what from. We build this tree
out of pairs of the form (id-sequence, input) where id-sequence is a sequence
of intermediaries with no repetitions and input is some input. A process’s
state at each round is just a set of such pairs.
This is not really an improvement on flooding for crash failures, but it
can be used as a basis for building an algorithm for Byzantine agreement
(Chapter 10). Also useful as an example of a full-information algorithm,
in which every process records all that it knows about the execution; in
principle this allows the algorithm to simulate any other algorithm, which
can sometimes be useful for proving lower bounds.
See [AW04, §5.2.4] or [Lyn96, §6.2.3] for more details than we provide
here. The original exponential information-gathering algorithm (for Byzan-
tine processes) is due to Pease, Shostak, and Lamport [PSL80].
Initial state is (⟨⟩, myInput).
At round r, process i broadcasts all pairs (w, v) where |w| = r and i
does not appear in w (these restrictions make the algorithm slightly less
exponential). Upon receiving (w, v) from j, i adds (wj, v) to its list. If no
message arrives from j in round r, i adds (wj, ⊥) to its list for all non-
repeating w with |w| = r (this step can also be omitted).
A tree structure is obtained by letting w be the parent of wj for each j.
At round f + 1, apply some fixed decision rule to the set of all values
that appear in the tree (e.g. take the max, or decide on a default value v0
if there is more than one value in the tree). That this works follows pretty
much immediately from the fact that the set of node labels propagates just
as in the flooding algorithm (which is why EIG isn’t really an improvement).
But there are some complications from the messages that aren't sent due to
the i-not-in-w restriction on sending. So we have to write out a real proof;
the presentation below basically follows [Lyn96].
Let val(w, i) be the value v such that (w, v) appears in i’s list at the
end of round f + 1. We don’t worry about what earlier round it appears in
because we can compute that round as |w| + 1.
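The tree-building rounds can be sketched as a global simulation; this is a hypothetical Python rendering (names invented) for the crash-free case, storing each process's tree as a dict keyed by the id-sequence w.

```python
# Hypothetical sketch of EIG tree construction (no failures shown): each
# process keeps pairs (w, v) keyed by id-sequence w; in round r it relays
# every pair with |w| = r whose sequence does not contain its own id, and
# a recipient j records (w + (i,), v) when it hears (w, v) from i.

def eig(inputs, rounds):
    n = len(inputs)
    tree = [{(): inputs[i]} for i in range(n)]    # val(<>, i) = i's own input
    for r in range(rounds):
        for i in range(n):                        # i broadcasts eligible pairs
            outgoing = [(w, v) for w, v in tree[i].items()
                        if len(w) == r and i not in w]
            for j in range(n):
                for w, v in outgoing:
                    tree[j][w + (i,)] = v         # j records who it heard from
    return tree

trees = eig([0, 1, 1], rounds=2)
print(sorted(set(trees[0].values())))  # [0, 1]
```

With no failures every process builds the same tree, mirroring the observation that the node labels propagate just as in the flooding algorithm; the ⊥ entries for missed messages would be added on top of this skeleton.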
9.5 Variants
So far we have described binary consensus, since all inputs are 0 or 1. We
can also allow larger input sets. With crash failures, this allows a stronger
validity condition: the output must be equal to some input. Note that this
stronger condition doesn’t work if we have Byzantine failures. (Exercise:
why not?)
Chapter 10
Byzantine agreement
[Figure: covering-graph diagrams for the Byzantine agreement lower bound, showing rings of process copies labeled A0, B0, C0, C1, A1, B1 and A0, B0, C0, D0, A1, B1, C1, D1; the diagrams did not survive extraction.]
and let each of the n = 3 processes simulate one group, with everybody in
the group getting the same input, which can only make things easier. Then
we get a protocol for n = 3 and f = 1, an impossibility.
non-faulty nodes to decide their inputs or violate validity. But then doing
the same thing with B0 and B1 yields an execution that violates agreement.
Conversely, if we have connectivity 2f + 1, then the processes can simulate
a complete graph by sending each other messages along 2f + 1 predetermined
vertex-disjoint paths and taking the majority value as the correct message.
Since the f Byzantine processes can only corrupt one path each (assuming
the non-faulty processes are careful about who they forward messages from),
we get at least f + 1 good copies overwhelming the f bad copies. This
reduces the problem on a general graph with sufficiently high connectivity
to the problem on a complete graph, allowing Byzantine agreement to be
solved if the other lower bounds are met.
Lemma 10.2.1. If i, j, and k are all non-faulty then for all w, val(wk, i) =
val(wk, j) = val(w, k).
Lemma 10.2.2. If j is non-faulty then val′(wj, i) = val(wj, i) for all non-
faulty i and all w.
We call a node w common if val′(w, i) = val′(w, j) for all non-faulty i, j.
Lemma 10.2.2 says that wk is common if k is non-faulty. We can also show
that any node whose children are all common is also common, whether or
not the last process in its label is faulty.
Proof. Recall that, for |w| < f + 1, val′(w, i) is the majority value among
all val′(wk, i). If all wk are common, then val′(wk, i) = val′(wk, j) for all
non-faulty i and j, so i and j compute the same majority values and get
val′(w, i) = val′(w, j).
Agreement: Observe that every path has a common node on it, since
a path travels through f + 1 nodes and one of them is good. If we then
suppose that the root is not common: by Lemma 10.2.3, it must have a
not-common child, that node must have a not-common child, etc. But this
constructs a path from the root to a leaf with no common nodes, which
we just proved can't happen.
2. The process i takes its value from the phase king. We’ve already shown
that i then agrees with any j that sees a big majority; but since the
phase king is non-faulty, process i will agree with any process j that
also takes its new preference from the phase king.
This shows that after any phase with a non-faulty king, all processes
agree. The proof that the non-faulty processes continue to agree is the same
as for validity.
Chapter 11

Impossibility of asynchronous agreement
11.1 Agreement
Usual rules: agreement (all non-faulty processes decide the same value),
termination (all non-faulty processes eventually decide some value), va-
lidity (for each possible decision value, there is an execution in which that
value is chosen). Validity can be tinkered with without affecting the proof
much.
To keep things simple, we assume the only two decision values are 0 and
1.
11.2 Failures
A failure is an internal action after which all send operations are disabled.
The adversary is allowed one failure per execution. Effectively, this means
that any group of n − 1 processes must eventually decide without waiting
for the n-th, because it might have failed.
11.3 Steps
The FLP paper uses a notion of steps that is slightly different from the
send and receive actions of the asynchronous message-passing model we’ve
been using. Essentially a step consists of receiving zero or more messages
followed by doing a finite number of sends. To fit it into the model we’ve
been using, we’ll define a step as either a pair (p, m), where p receives
message m and performs zero or more sends in response, or (p, ⊥), where
p receives nothing and performs zero or more sends. We assume that the
processes are deterministic, so the messages sent (if any) are determined by
p’s previous state and the message received. Note that these steps do not
correspond precisely to delivery and send events or even pairs of delivery
and send events, because what message gets sent in response to a particular
delivery may change as the result of delivering some other message; but this
won’t affect the proof.
The fairness condition essentially says that if (p, m) or (p, ⊥) is contin-
uously enabled it eventually happens. Since messages are not lost, once
(p, m) is enabled in some configuration C, it is enabled in all successor con-
figurations until it occurs; similarly (p, ⊥) is always enabled. So to ensure
fairness, we have to ensure that any non-faulty process eventually performs
any enabled step.
2. Now suppose e and e′ are steps of the same process p. Again we let
both go through in either order. It is not the case now that Dee′ =
De′e, since p knows which step happened first (and may have sent
messages telling the other processes). But now we consider some finite
sequence of steps e1 e2 . . . ek in which no message sent by p is delivered
and some process decides in Dee1 . . . ek (this occurs since the other
processes can't distinguish Dee′ from the configuration in which p
died in D, and so have to decide without waiting for messages from
p). This execution fragment is indistinguishable to all processes except
p from De′ee1 . . . ek , so the deciding process decides the same value i
in both executions. But Dee′ is 0-valent and De′e is 1-valent, giving
a contradiction.
It follows that our assumption was false, and there is some reachable
bivalent configuration C′e.
Now to construct a fair execution that never decides, we start with a
bivalent configuration, choose the oldest enabled action and use the above
to make it happen while staying in a bivalent configuration, and repeat.
Chapter 12
Paxos
common choice, although other choices of consensus protocols will work too.
then does a second phase of voting, where it sends accept(n, v) to all accepters
and wins if it receives a majority of votes.
So for each proposal, the algorithm proceeds as follows:
Invariant 2 For any v and n, if a proposal with value v and number n has
been issued (by sending accept messages), then there is a majority of
accepters S such that either (a) no accepter in S has accepted any
proposal numbered less than n, or (b) v is the value of the highest-
numbered proposal among all proposals numbered less than n accepted
by at least one accepter in S.
The proof of the first invariant is immediate from the rule for issuing
acks.
The proof of the second invariant follows from the first invariant and the
proposer’s rule for issuing proposals: it can only do so after receiving ack
from a majority of accepters—call this set S—and the value it issues is either
the proposal’s initial value if all responses are ack(n, ⊥, 0), or the maximum
value sent in by accepters in S if some responses are ack(n, v, nv ). In the
first case we have case (a) of the invariant: nobody accepted any proposals
numbered less than n before responding, and they can’t afterwards. In the
second case we have case (b): the maximum response value is the maximum-
numbered accepted value within S at the time of each response, and again no
new values numbered less than n will be accepted afterwards. Amazingly,
none of this depends on the temporal ordering of different proposals or
messages: the accepters enforce that their acks are good for all time by
refusing to change their mind about earlier rounds later.
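As a concrete illustration of this never-change-your-mind rule, here is a minimal single-decree acceptor sketch; the class and method names are ours, and persistence, networking, and failure handling are all omitted.

```python
class Acceptor:
    """Sketch of a Paxos accepter.  Once it acks proposal number n, it
    refuses every proposal numbered below n, forever -- this is what
    makes its acks "good for all time"."""
    def __init__(self):
        self.promised = 0       # highest proposal number acked
        self.accepted_n = 0     # number of the highest accepted proposal
        self.accepted_v = None  # its value

    def prepare(self, n):
        # respond ack(n, v, n_v), promising to reject anything below n
        if n > self.promised:
            self.promised = n
            return ('ack', n, self.accepted_v, self.accepted_n)
        return ('nack', n)

    def accept(self, n, v):
        # accept only if this does not violate an earlier promise
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return ('accepted', n, v)
        return ('rejected', n)

a = Acceptor()
assert a.prepare(1)[0] == 'ack'
assert a.accept(1, 'x') == ('accepted', 1, 'x')
assert a.prepare(2) == ('ack', 2, 'x', 1)   # reports highest accepted pair
assert a.accept(1, 'y') == ('rejected', 1)  # the promise for 2 blocks n = 1
```

A proposer would collect such acks from a majority and, per the rule above, adopt the maximum-numbered reported value if any ack carries one.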
So now we suppose that some value v is eventually accepted by a majority
T with number n. Then we can show by induction on proposal number that
all proposals issued with higher numbers have the same value (even if they
were issued earlier). For any proposal accept(v′, n′) with n′ > n, there is
a majority S (which must overlap with T ) for which either case (a) holds
(a contradiction—once the overlapping accepter finally accepts, it violates
the requirement that no proposal numbered less than n′ has been accepted) or case
(b) holds (in which case, by the induction hypothesis, v′ is the value of some
earlier proposal with number at least n, implying v′ = v).
starting a new round; thus two or more will eventually start far enough
apart in time that one will get done without interference.
A more abstract solution is to assume some sort of weak leader election
mechanism, which tells each accepter who the “legitimate” proposer is at
each time. The accepters then discard messages from illegitimate proposers,
which prevents conflict at the cost of possibly preventing progress. Progress
is however obtained if the mechanism eventually reaches a state where a
majority of the accepters bow to the same non-faulty proposer long enough
for the proposal to go through.
Such a weak leader election method is an example of a more general
class of mechanisms known as failure detectors, in which each process
gets hints about what other processes are faulty that eventually converge to
reality. The particular failure detector in this case is known as the Ω failure
detector; there are other still weaker ones that we will talk about later that
can also be used to solve consensus. We will discuss failure detectors in
detail in Chapter 13.
Chapter 13
Failure detectors
Note that “strong” and “weak” mean different things for accuracy vs
completeness: for accuracy, we are quantifying over suspects, and for com-
pleteness, we are quantifying over suspectors. Even a weakly-accurate failure
detector guarantees that all processes trust the one visibly good process.
1 initially do
2 suspects ← ∅
3 while true do
4 Let S be the set of all processes my weak detector suspects.
5 Send S to all processes.
6 upon receiving S from q do
7 suspects ← (suspects ∪ S) \ {q}
Algorithm 13.1: Boosting completeness
It’s not hard to see that this boosts completeness: if p crashes, some-
body’s weak detector eventually suspects it, this process tells everybody
else, and p never contradicts it. So eventually everybody suspects p.
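A tiny round-based simulation (the harness and parameter names are ours) illustrates the boosting rule: one weak detector suspecting the crashed process is enough for everybody to suspect it.

```python
def boost(n, dead, weak_suspects):
    """One broadcast round of the boosting algorithm: every live process
    sends the set its weak detector suspects; on receiving S from q, a
    process adds S to its suspects and removes q (q is evidently alive)."""
    suspects = {p: set() for p in range(n) if p != dead}
    for q in range(n):
        if q == dead:
            continue  # the crashed process sends nothing
        S = weak_suspects[q]
        for p in suspects:
            suspects[p] = (suspects[p] | S) - {q}
    return suspects

# only process 0's weak detector suspects the crashed process 3,
# yet after one round every live process suspects 3 and nobody else
out = boost(4, 3, {0: {3}, 1: set(), 2: set(), 3: set()})
assert all(out[p] == {3} for p in (0, 1, 2))
```

Removing the sender from the suspect set is what preserves accuracy: a trusted process that keeps sending messages keeps clearing itself.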
What is slightly trickier is showing that it preserves accuracy. The es-
sential idea is this: if there is some good-guy process p that everybody trusts
forever (as in weak accuracy), then nobody ever reports p as suspect—this
also covers strong accuracy since the only difference is that now every non-
faulty process falls into this category. For eventual weak accuracy, wait for
everybody to stop suspecting p, wait for every message ratting out p to be
delivered, and then wait for p to send a message to everybody. Now everybody
trusts p, and nobody ever suspects p again. Eventual strong accuracy
is again similar.
This will justify ignoring the weakly-complete classes.
Jumping to the punch line: P can simulate any of the others, S and
♦P can both simulate ♦S but can’t simulate P or each other, and ♦S can’t
simulate any of the others (See Figure 13.1—we’ll prove all of this later.)
Thus ♦S is the weakest class of failure detectors in this list. However, ♦S is
strong enough to solve consensus, and in fact any failure detector (whatever
Figure 13.1: Partial order of failure detector classes (S and ♦P above ♦S).
Higher classes can simulate lower classes.
1 Vp ← {⟨p, vp⟩}
2 δp ← {⟨p, vp⟩}
// Phase 1
3 for i ← 1 to n − 1 do
4 Send ⟨i, δp⟩ to all processes.
5 Wait to receive ⟨i, δq⟩ from all q I do not suspect.
6 δp ← (⋃q δq) \ Vp
7 Vp ← (⋃q δq) ∪ Vp
// Phase 2
8 Send ⟨n, δp⟩ to all processes.
9 Wait to receive ⟨n, δq⟩ from all q I do not suspect.
10 Vp ← (⋂q Vq) ∩ Vp
// Phase 3
11 return some input from Vp chosen via a consistent rule.
Algorithm 13.2: Consensus with a strong failure detector
faulty process that gets stuck eventually is informed by the S-detector that
the process it is waiting for is dead.
For agreement, we must show that in phase 3, every Vp is equal; in
particular, we’ll show that every Vp = Vc . First it is necessary to show that
at the end of phase 1, Vc ⊆ Vp for all p. This is done by considering two
cases:
1 procedure broadcast(m)
2 send m to all processes.
3 upon receiving m do
4 if I haven’t seen m before then
5 send m to all processes
6 deliver m to myself
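The "retransmit on first receipt" rule is what makes the broadcast reliable: if any non-faulty process delivers m, everyone does, even if the originator crashes mid-send. A small simulation (the harness is ours) of the flooding procedure above:

```python
from collections import deque

def flood(n, first_recipients, crashed):
    """Simulate the flooding broadcast of one message 'm'.  The sender
    crashes after reaching only first_recipients; every live process
    that sees 'm' for the first time re-sends it to all, then delivers."""
    delivered = set()
    queue = deque((p, 'm') for p in first_recipients)
    while queue:
        p, m = queue.popleft()
        if p in crashed or p in delivered:
            continue
        # haven't seen m before: send to all processes, deliver to self
        for q in range(n):
            queue.append((q, m))
        delivered.add(p)
    return delivered

# process 0 crashes after sending to process 1 only: still, every
# live process ends up delivering the message
assert flood(5, [1], {0}) == {1, 2, 3, 4}
# if the crash happens before any send, nobody delivers -- also allowed
assert flood(5, [], {0}) == set()
```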
• Each process keeps track of a preference (initially its own input) and a
timestamp, the round number in which it last updated its preference.
1 preference ← input
2 timestamp ← 0
3 for round ← 1 . . . ∞ do
4 Send ⟨round, preference, timestamp⟩ to coordinator
5 if I am the coordinator then
6 Wait to receive ⟨round, preference, timestamp⟩ from majority of
processes.
7 Set preference to value with largest timestamp.
8 Send ⟨round, preference⟩ to all processes.
9 Wait to receive ⟨round, preference′⟩ from coordinator or to suspect
coordinator.
10 if I received ⟨round, preference′⟩ then
11 preference ← preference′
12 timestamp ← round
13 Send ack(round) to coordinator.
14 else
15 Send nack(round) to coordinator.
16 if I am the coordinator then
17 Wait to receive ack(round) or nack(round) from a majority of
processes.
18 if I received no nack(round) messages then
19 Broadcast preference using reliable broadcast.
why we need the nacks in phase 3). The loophole here is that processes
that decide stop participating in the protocol; but because any non-faulty
process retransmits the decision value in the reliable broadcast, if a process
is waiting for a response from a non-faulty process that already terminated,
eventually it will get the reliable broadcast instead and terminate itself.
In phase 3, a process might get stuck waiting for a dead coordinator, but
the strong completeness of ♦S means that it suspects the dead coordinator
eventually and escapes. So at worst we do finitely many rounds.
Now suppose that after some time t there is a process c that is never
suspected by any process. Then in the next round in which c is the co-
ordinator, in phase 3 all surviving processes wait for c and respond with
ack, c decides on the current estimate, and triggers the reliable broadcast
protocol to ensure everybody else decides on the same value. Since reli-
able broadcast guarantees that everybody receives the message, everybody
decides this value or some value previously broadcast—but in either case
everybody decides.
Agreement is the tricky part. It’s possible that two coordinators both
initiate a reliable broadcast and some processes choose the value from the
first and some the value from the second. But in this case the first coordi-
nator collected acks from a majority of processes in some round r, and all
subsequent coordinators collected estimates from an overlapping majority
of processes in some round r0 > r. By applying the same induction argu-
ment as for Paxos, we get that all subsequent coordinators choose the same
estimate as the first coordinator, and so we get agreement.
accurate for non-leaders. For other similar problems see the paper.
Chapter 14
Quorum systems
14.1 Basics
In the past few chapters, we’ve seen many protocols that depend on the fact
that if I talk to more than n/2 processes and you talk to more than n/2 pro-
cesses, the two groups overlap. This is a special case of a quorum system,
a family of subsets of the set of processes with the property that any two
subsets in the family overlap. By choosing an appropriate family, we may
be able to achieve lower load on each system member, higher availability,
defense against Byzantine faults, etc.
The exciting thing from a theoretical perspective is that these turn a
systems problem into a combinatorial problem: this means we can ask com-
binatorialists how to solve it.
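The defining property is easy to state in code; this is a sketch of the definition itself, not of any particular construction:

```python
from itertools import combinations

def is_quorum_system(quorums):
    """True iff every pair of subsets in the family intersects."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

# majorities of a 5-element set pairwise intersect...
majorities = [set(c) for c in combinations(range(5), 3)]
assert is_quorum_system(majorities)

# ...but two disjoint halves do not form a quorum system
assert not is_quorum_system([{0, 1}, {2, 3}])
```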
• Dynamic quorum systems: get more than half of the most recent copy.
14.3 Goals
• Minimize quorum size.
Naor and Wool [NW98] describe trade-offs between these goals (some of
these were previously known, see the paper for citations):
• load ≥ max(c/n, 1/c) where c is the minimum quorum size. The first
case is obvious: if every access hits c nodes, spreading them out as
evenly as possible still hits each node c/n of the time. The second is
trickier: Naor and Wool prove it using LP duality, but the argument
essentially says that if we have some quorum Q of size c, then since
every other quorum Q0 intersects Q in at least one place, we can show
that every Q0 adds at least 1 unit of load in total to the c members of
Q. So if we pick a random quorum Q0 , the average load added to all of
Q is at least 1, so the average load added to some particular element
of Q is at least 1/|Q| = 1/c. Combining the two cases, we can't hope
to get load better than 1/√n, and to get this load we need quorums
of size at least √n.
Figure 14.1: Figure 2 from [NW98]. Solid lines are G(3); dashed lines are
G∗ (3).
G(d) grid and one from the G∗(d) grid (the star indicates that G∗(d) is the
dual graph of G(d)). A quorum consists of a set of servers that produce an
LR path in G(d) and a TB path in G∗(d). Quorums intersect, because any
LR path in G(d) must cross some TB path in G∗(d) at some server (in fact,
each pair of quorums intersects in at least two places). The total number of
elements n is (d + 1)² and the minimum size of a quorum is 2d + 1 = Θ(√n).
The symmetry of the mesh implies that there exists an LR path in the
mesh if and only if there does not exist a TB path in its complement, the
graph that has an edge exactly where the mesh doesn't. For a mesh with failure
probability p < 1/2, the complement is a mesh with failure probability
q = 1 − p > 1/2. Using results in percolation theory, it can be shown that
for failure probability q > 1/2, the probability that there exists a left-to-
right path is exponentially small in d (formally, for each p there is a constant
φ(p) such that Pr[∃LR path] ≤ exp(−φ(p)d)). It follows that the failure
probability of this system is exponentially small for any fixed p < 1/2.
See the paper [NW98] for more details.
tion w supplied by the quorum system designer, with the property that
Pr[Q1 ∩ Q2 = ∅] ≤ ε when Q1 and Q2 are chosen independently according
to their weights.
14.6.1 Example
Let a quorum be any set of size k√n for some k, and let all quorums be
chosen uniformly at random. Pick some quorum Q1; what is the probability
that a random Q2 does not intersect Q1? Imagine we choose the elements
of Q2 one at a time. The chance that the first element x1 of Q2 misses Q1
is exactly (n − k√n)/n = 1 − k/√n, and conditioning on x1 through xi−1
missing Q1, the probability that xi also misses it is (n − k√n − i + 1)/(n −
i + 1) ≤ (n − k√n)/n = 1 − k/√n. So taking the product over all i gives
Pr[all miss Q1] ≤ (1 − k/√n)^(k√n) ≤ exp(−(k√n)(k/√n)) = exp(−k²). So by
setting k = Θ(ln(1/ε)), we can get our desired ε-intersecting system.
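The product bound can be checked numerically; the helper below computes the exact miss probability via the same product and confirms it is dominated by exp(−k²) (the code and parameter choices are ours).

```python
import math

def miss_probability(n, m):
    """Exact Pr[Q2 misses Q1] for uniformly random quorums of size m:
    the product over i of (n - m - i)/(n - i), as derived above."""
    p = 1.0
    for i in range(m):
        p *= (n - m - i) / (n - i)
    return p

# with n = 400 and k = 2, quorums have size k*sqrt(n) = 40, and the
# derived bound is exp(-k^2) = exp(-4); each factor is at most
# 1 - k/sqrt(n), so the exact probability sits below the bound
n, k = 400, 2
m = k * math.isqrt(n)
assert 0 < miss_probability(n, m) <= math.exp(-k * k)
```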
14.6.2 Performance
Failure probabilities, if naively defined, can be made arbitrarily small: add
low-probability singleton quorums that are hardly ever picked unless massive
failures occur. But the resulting system is still ε-intersecting.
One way to look at this is that it points out a flaw in the ε-intersecting
definition: ε-intersecting quorums may cease to be ε-intersecting conditioned
on a particular failure pattern (e.g., when all the non-singleton quorums are
knocked out by massive failures). But Malkhi et al. [MRWW01] address the
problem in a different way, by considering only survival of high quality
quorums, where a particular quorum Q is δ-high-quality if Pr[Q1 ∩ Q2 =
∅ | Q1 = Q] ≤ δ and high quality if it's √ε-high-quality. It's not hard to show
that a random quorum is δ-high-quality with probability at least 1 − ε/δ, so
a high quality quorum is one that fails to intersect a random quorum with
probability at most √ε, and a high quality quorum is picked with probability
at least 1 − √ε.
We can also consider load; Malkhi et al. [MRWW01] show that essentially
the same bounds on load for strict quorum systems also hold for ε-intersecting
quorum systems: load(S) ≥ max(E(|Q|)/n, (1 − √ε)²/E(|Q|)),
where E(|Q|) is the expected size of a quorum. The left-hand branch of the
max is just the average load applied to a uniformly-chosen server. For the
right-hand side, pick some high quality quorum Q′ with size less than or
equal to (1 − √ε)E(|Q|) and consider the load applied to its most loaded
member by its nonempty intersection (which occurs with probability at least
1 − √ε) with a random quorum.
Shared memory
Chapter 15
Model
1 leftIsDone ← read(leftDone)
2 rightIsDone ← read(rightDone)
3 write(done, leftIsDone ∧ rightIsDone)
where after any prefix of the execution, every response corresponds to some
preceding invocation, and there is at most one invocation for each pro-
cess—always the last—that does not have a corresponding response. How
a concurrent execution may or may not relate to a sequential execution
depends on the consistency properties of the implementation, as described
below.
Sticky bits (or sticky registers) With a sticky bit or sticky regis-
ter [Plo89], once the initial empty value is overwritten, all further
writes fail. The writer is not notified that the write fails, but may
be able to detect this fact by reading the register in a subsequent
operation.
Bank accounts Replace the write operation with deposit, which adds a
non-negative amount to the state, and withdraw, which subtracts a
non-negative amount from the state provided the result would not go
below 0; otherwise, it has no effect.
These solve problems that are hard for ordinary read/write registers un-
der bad conditions. Note that they all have to return something in response
to an invocation.
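For concreteness, here is what the bank-account object might look like as sequential code. The rendering is ours, and returning the old balance is an assumption: the text only requires that every operation return something.

```python
class BankAccount:
    """Sketch of the bank-account RMW object described above: deposit
    adds a non-negative amount; withdraw subtracts one only when the
    balance would stay non-negative, and otherwise has no effect.
    Each operation returns the old balance (our choice of response)."""
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        old = self.balance
        self.balance += amount
        return old

    def withdraw(self, amount):
        old = self.balance
        if self.balance >= amount:
            self.balance -= amount
        # otherwise: no effect, but we still return a response
        return old

b = BankAccount()
assert b.deposit(5) == 0
assert b.withdraw(10) == 5 and b.balance == 5   # rejected, no effect
assert b.withdraw(3) == 5 and b.balance == 2
```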
There are also blocking objects like locks or semaphores, but these don’t
fit into the RMW framework.
We can also consider generic read-modify-write registers that can com-
pute arbitrary functions (passed as an argument to the read-modify-write
operation) in the modify step. Here we typically assume that the read-
modify-write operation returns the old value of the register. Generic read-
modify-write registers are not commonly found in hardware but can be easily
simulated (in the absence of failures) using mutual exclusion (see Chapter 17).
Chapter 16
Distributed shared memory
It then responds to p with ack(v, t), whether or not it updated its local
copy. A process will also respond to a message read(u) with a response
ack(value, timestamp, u); here u is a nonce³ used to distinguish between
different read operations so that a process can't be confused by out-of-date
acknowledgments.
To write a value, the writer increments its timestamp, updates its value
and sends write(value, timestamp) to all other processes. The write opera-
tion terminates when the writer has received acknowledgments containing
the new timestamp value from a majority of processes.
To read a value, a reader does two steps:
(Any extra messages, messages with the wrong nonce, etc. are dis-
carded.)
Both reads and writes cost Θ(n) messages (Θ(1) per process).
Intuition: Nobody can return from a write or a read until they are sure
that subsequent reads will return the same (or a later) value. A process
can only be sure of this if it knows that the values collected by a read will
include at least one copy of the value written or read. But since majorities
overlap, if a majority of the processes have a current copy of v, then the
majority read quorum will include it. Sending write(v, t) to all processes
and waiting for acknowledgments from a majority is just a way of ensuring
that a majority do in fact have timestamps that are at least t.
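The overlap argument shows up clearly in a toy sequential simulation of the majority protocol (the harness is ours; real executions are concurrent, which this sketch deliberately ignores):

```python
import random

def majority_write(replicas, value, t):
    """Install (t, value) at a random majority; stragglers keep old data."""
    n = len(replicas)
    for i in random.sample(range(n), n // 2 + 1):
        if t > replicas[i][0]:          # replicas keep the newest pair
            replicas[i] = (t, value)

def majority_read(replicas):
    """Collect from a random majority, take the max-timestamp pair, and
    write it back (the second phase of a read) before returning."""
    n = len(replicas)
    quorum = random.sample(range(n), n // 2 + 1)
    t, v = max(replicas[i] for i in quorum)
    for i in quorum:                    # the write-back phase
        if t > replicas[i][0]:
            replicas[i] = (t, v)
    return v

# any read majority overlaps the write majority, so no matter which
# random quorums are chosen, every read sees the new value
replicas = [(0, None)] * 5
majority_write(replicas, 'x', 1)
assert all(majority_read(replicas) == 'x' for _ in range(20))
```

The assertion holds for every random choice, not just on average: two majorities of 5 processes always share a member.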
If we omit the write stage of a read operation, we may violate lineariz-
ability. An example would be a situation where two values (1 and 2, say),
have been written to exactly one process each, with the rest still holding
the initial value ⊥. A reader that observes 1 and (n − 1)/2 copies of ⊥
will return 1, while a reader that observes 2 and (n − 1)/2 copies of ⊥ will
return 2. In the absence of the write stage, we could have an arbitrarily
long sequence of readers return 1, 2, 1, 2, . . . , all with no concurrency. This
³A nonce is any value that is guaranteed to be used at most once (the term originally
comes from cryptography, which in turn got it from linguistics). In practice, a reader will
most likely generate a nonce by combining its process id with a local timestamp.
would not be consistent with any sequential execution in which 1 and 2 are
only written once.
4. none of the other cases applies, and we feel like putting π1 first.
The intent is that we pick some total ordering that is consistent with both
<T and the timestamp ordering (with writes before reads when timestamps
are equal). To make this work we have to show (a) that these two orderings
are in fact consistent, and (b) that the resulting ordering produces values
consistent with an atomic register: in particular, that each read returns the
value of the last preceding write.
Part (b) is easy: since timestamps only increase in response to writes,
each write is followed by precisely those reads with the same timestamp,
which are precisely those that returned the value written.
For part (a), suppose that π1 <T π2. The first case is when π2 is a read.
Then before the end of π1, a set S of more than n/2 processes send the π1
process an ack(v1, t1) message. Since local timestamps only increase, from
this point on any ack(v2, t2, u) message sent by a process in S has t2 ≥ t1.
Let S′ be the set of processes sending ack(v2, t2, u) messages processed by
π2. Since |S| > n/2 and |S′| > n/2, the intersection S ∩ S′ is nonempty, and so S′
includes a process that sent ack(v2, t2, u) with t2 ≥ t1. So π2 is serialized after
3. Send write(v, t) to all processes, and wait for a response ack(v, t) from
a majority of processes.
This increases the cost of a write by a constant factor, but in the end
we still have only a linear number of messages. The proof of linearizability
Chapter 17
Mutual exclusion
17.2 Goals
(See also [AW04, §4.2], [Lyn96, §10.2].)
Core mutual exclusion requirements:
Note that the protocol is not required to guarantee that processes leave
the critical or remainder state, but we generally have to insist that the
processes at least leave the critical state on their own to make progress.
Additional useful properties (not satisfied by all mutual exclusion
protocols; see [Lyn96, §10.4]):
1 oldValue ← read(bit)
2 write(bit, 1)
3 return oldValue
Typically there is also a second reset operation for setting the bit back
to zero. For some implementations, this reset operation may only be used
safely by the last process to get 0 from the test-and-set bit.
Because a test-and-set operation is atomic, if two processes both try to
perform test-and-set on the same bit, only one of them will see a return value
of 0. This is not true if each process simply executes the above code on a
stock atomic register: there is an execution in which both processes read
0, then both write 1, then both return 0 to whatever called the non-atomic
test-and-set subroutine.
Test-and-set provides a trivial implementation of mutual exclusion, shown
in Algorithm 17.1.
1 while true do
// trying
2 while testAndSet(lock) = 1 do nothing
// critical
3 (do critical section stuff)
// exiting
4 reset(lock)
// remainder
5 (do remainder stuff)
Algorithm 17.1: Mutual exclusion using test-and-set
It is easy to see that this code provides mutual exclusion, as once one
process gets a 0 out of lock, no other can escape the inner while loop until
that process calls the reset operation in its exiting state. It also provides
progress (assuming the lock is initially set to 0); the only part of the code
that is not straight-line code (which gets executed eventually by the fairness
condition) is the inner loop, and if lock is 0, some process escapes it, while
if lock is 1, some process is in the region between the testAndSet call and
the reset call, and so it eventually gets to reset and lets the next process
in (or itself, if it is very fast).
The algorithm does not provide lockout-freedom: nothing prevents a
single fast process from scooping up the lock bit every time it goes through
the outer loop, while the other processes ineffectually grab at it just after it
is taken away. Lockout-freedom requires a more sophisticated turn-taking
strategy.
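The same loop can be exercised with real threads. In the sketch below the atomic test-and-set is emulated with an ordinary lock, so this is a stand-in for the hardware primitive (a demonstration of the protocol, not a way to build test-and-set from plain registers):

```python
import threading

class TASBit:
    """Test-and-set bit; a private lock emulates the atomic RMW step."""
    def __init__(self):
        self._guard = threading.Lock()
        self._bit = 0

    def test_and_set(self):
        with self._guard:
            old, self._bit = self._bit, 1
            return old

    def reset(self):
        self._bit = 0   # only the lock holder calls this

lock = TASBit()
count = 0

def worker():
    global count
    for _ in range(200):
        while lock.test_and_set() == 1:
            pass                      # trying: spin until we see 0
        count += 1                    # critical section
        lock.reset()                  # exiting

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert count == 800                   # mutual exclusion held every time
```

Without the mutex, the unprotected `count += 1` could lose updates; the assertion passing shows each increment ran alone in the critical section.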
Note that this requires a queue that supports a head operation. Not all
implementations of queues have this property.
1 while true do
// trying
2 enq(Q, myId)
3 while head(Q) ≠ myId do nothing
// critical
4 (do critical section stuff)
// exiting
5 deq(Q)
// remainder
6 (do remainder stuff)
Algorithm 17.2: Mutual exclusion using a queue
Here the proof of mutual exclusion is that only the process whose id is at
the head of the queue can enter its critical section. Formally, we maintain an
invariant that any process whose program counter is between the inner while
loop and the call to deq(Q) must be at the head of the queue; this invariant
is easy to show because a process can’t leave the while loop unless the test
fails (i.e., it is already at the head of the queue), no enq operation changes
the head value (if the queue is nonempty), and the deq operation (which
does change the head value) can only be executed by a process already at
the head (from the invariant).
Deadlock-freedom follows from proving a similar invariant that every
element of the queue is the id of some process in the trying, critical, or
exiting states, so eventually the process at the head of the queue passes the
inner loop, executes its critical section, and dequeues its id.
Lockout-freedom follows from the fact that once a process is at position
k in the queue, every execution of a critical section reduces its position by 1;
when it reaches the front of the queue (after some finite number of critical
sections), it gets the critical section itself.
each process in the queue itself; instead, we can hand out numerical tickets
to each process and have the process take responsibility for remembering
where its place in line is.
The RMW register has two fields, first and last, both initially 0. In-
crementing last simulates an enqueue, while incrementing first simulates a
dequeue. The trick is that instead of testing if it is at the head of the queue,
a process simply remembers the value of the last field when it “enqueued”
itself, and waits for the first field to equal it.
Algorithm 17.3 shows the code from Algorithm 17.2 rewritten to use this
technique. The way to read the RMW operations is that the first argument
specifies the variable to update and the second specifies an expression for
computing the new value. Each RMW operation returns the old state of the
object, before the update.
1 while true do
// trying
2 position ← RMW(V, ⟨V.first, V.last + 1⟩)
// enqueue
3 while RMW(V, V).first ≠ position.last do
4 nothing
// critical
5 (do critical section stuff)
// exiting
6 RMW(V, ⟨V.first + 1, V.last⟩)
// dequeue
// remainder
7 (do remainder stuff)
Algorithm 17.3: Mutual exclusion using read-modify-write
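The first/last scheme is exactly a ticket lock. Here is a sketch with the RMW register emulated by a private lock (names and harness are ours):

```python
import threading

class TicketLock:
    """first = next ticket to serve, last = next ticket to issue;
    the private lock stands in for the atomic RMW register V."""
    def __init__(self):
        self._rmw = threading.Lock()
        self.first = 0
        self.last = 0

    def acquire(self):
        with self._rmw:              # RMW: take <V.first, V.last + 1>
            ticket = self.last
            self.last += 1
        while True:                  # spin until first reaches our ticket
            with self._rmw:          # RMW: read the pair unchanged
                if self.first == ticket:
                    return

    def release(self):
        with self._rmw:              # RMW: take <V.first + 1, V.last>
            self.first += 1

lock = TicketLock()
count = 0

def worker():
    global count
    for _ in range(100):
        lock.acquire()
        count += 1                   # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert count == 300 and lock.first == lock.last == 300
```

Unlike the test-and-set lock, tickets are served strictly in order, which is where the lockout-freedom of the simulated queue comes from.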
shared data:
1 waiting, initially arbitrary
2 present[i] for i ∈ {0, 1}, initially 0
3 Code for process i:
4 while true do
// trying
5 present[i] ← 1
6 waiting ← i
7 while true do
8 if present[¬i] = 0 then break
10 if waiting ≠ i then break
// critical
12 (do critical section stuff)
// exiting
13 present[i] ← 0
// remainder
14 (do remainder stuff)
Algorithm 17.4: Peterson’s mutual exclusion algorithm for two pro-
cesses
1. p0 sets present[0] ← 1
2. p0 sets waiting ← 0
3. p0 reads present[1] = 0 and enters the critical section
4. p1 sets present[1] ← 1
5. p1 sets waiting ← 1
6. p1 reads present[0] = 1 and waiting = 1, and loops
7. p0 sets present[0] ← 0
8. p1 reads present[0] = 0 and enters the critical section
The idea is that if I see a 0 in your present variable, I know that you
aren’t playing, and can just go in.
Here’s a more interleaved execution where the waiting variable decides
the winner:
1. p0 sets present[0] ← 1
2. p0 sets waiting ← 0
3. p1 sets present[1] ← 1
4. p1 sets waiting ← 1
5. p0 reads present[1] = 1
6. p1 reads present[0] = 1
9. p0 sets present[0] ← 0
Note that it’s the process that set the waiting variable last (and thus
sees its own value) that stalls. This is necessary because the earlier process
might long since have entered the critical section.
Sadly, examples are not proofs, so to show that this works in general, we
need to formally verify each of mutual exclusion and lockout-freedom. Mu-
tual exclusion is a safety property, so we expect to prove it using invariants.
The proof in [Lyn96] is based on translating the pseudocode directly into
Lemma 17.4.2. If pi is at Line 12, and p¬i is at Line 8, 10, or 12, then
waiting = ¬i.
Proof. We’ll do the case i = 0; the other case is symmetric. The proof is by
induction on the schedule. We need to check that any event that makes the
left-hand side of the invariant true or the right-hand side false also makes
the whole invariant true. The relevant events are:
We can now read mutual exclusion directly off of Lemma 17.4.2: if both
p0 and p1 are at Line 12, then we get waiting = 1 and waiting = 0, a
contradiction.
To show progress, observe that the only place where both processes can
get stuck forever is in the loop at Lines 8 and 10. But then waiting isn’t
changing, and so some process i reads waiting = ¬i and leaves. To show
lockout-freedom, observe that if p0 is stuck in the loop while p1 enters the
critical section, then after p1 leaves it sets present[1] to 0 in Line 13 (which
lets p0 in if p0 reads present[1] in time), but even if it then sets present[1]
back to 1 in Line 5, it still sets waiting to 1 in Line 6, which lets p0 into
the critical section. With some more tinkering this argument shows that p1
enters the critical section at most twice while p0 is in the trying state, giving
2-bounded bypass; see [Lyn96, Lemma 10.12]. With even more tinkering we
get a constant time bound on the waiting time for process i to enter the
critical section, assuming the other process never spends more than O(1)
time inside the critical section.
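Since the state space here is tiny, the mutual exclusion invariant can also be checked mechanically by exhausting all interleavings. The encoding below is ours: each numbered line of the algorithm becomes one atomic step.

```python
def reachable():
    """All states of Peterson's two-process algorithm under every
    interleaving.  pc meanings (our encoding): 0 = set present[i],
    1 = set waiting, 2 = test present[1-i], 3 = test waiting,
    4 = critical section (stepping from 4 exits and loops around)."""
    def step(state, i):
        pc, present, waiting = list(state[0]), list(state[1]), state[2]
        if pc[i] == 0:
            present[i], pc[i] = 1, 1
        elif pc[i] == 1:
            waiting, pc[i] = i, 2
        elif pc[i] == 2:                 # if present[not-i] = 0: break
            pc[i] = 4 if present[1 - i] == 0 else 3
        elif pc[i] == 3:                 # if waiting != i: break
            pc[i] = 4 if waiting != i else 2
        else:                            # exiting: present[i] <- 0
            present[i], pc[i] = 0, 0
        return (tuple(pc), tuple(present), waiting)

    frontier = [((0, 0), (0, 0), w) for w in (0, 1)]  # waiting arbitrary
    seen = set(frontier)
    while frontier:
        s = frontier.pop()
        for i in (0, 1):
            t = step(s, i)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

states = reachable()
# mutual exclusion: no reachable state has both processes in the
# critical section, and each process can in fact get there
assert all(s[0] != (4, 4) for s in states)
assert any(s[0][0] == 4 for s in states)
```

This complements rather than replaces the invariant proof: it certifies only this particular finite encoding of the pseudocode.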
shared data:
1 atomic register race, big enough to hold an id, initially ⊥
2 atomic register door, big enough to hold a bit, initially open
3 procedure splitter(id)
4 race ← id
5 if door = closed then
6 return right
7 door ← closed
8 if race = id then
9 return stop
10 else
11 return down
Lemma 17.4.3. After each time that door is set to open, at most one
process running Algorithm 17.5 returns stop.
Proof. To simplify the argument, we assume that each process calls splitter
at most once.
Let t be some time at which door is set to open (−∞ in the case of the
initial value). Let St be the set of processes that read open from door after
time t and before the next time at which some process writes closed to door,
and that later return stop by reaching Line 9.
Then every process in St reads door before any process in St writes door.
It follows that every process in St writes race before any process in St reads
race. If some process p is not the last process in St to write race, it will not
see its own id, and will not return stop. But only one process can be the
last process in St to write race.
Lemma 17.4.4. If a process runs Algorithm 17.5 by itself, with no other
process taking steps during its execution, then it returns stop.
Proof. Follows from examining a solo execution: the process sets race to id,
reads open from door, then reads id from race. This causes it to return stop
as claimed.
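Because the splitter is so short, it is easy to simulate directly. Below is a sketch in Python (all names are ours) that breaks each shared-memory access into a separate atomic step using a generator, and then checks that a solo caller returns stop and that no interleaving of two callers produces more than one stop.

```python
import itertools

# Sketch of the splitter: each yield marks a boundary between atomic
# accesses to the shared variables race and door.
def splitter(state, pid):
    state['race'] = pid              # race <- id
    yield
    if state['door'] == 'closed':    # read door
        return 'right'
    state['door'] = 'closed'         # door <- closed
    yield
    # read race: stop only if nobody overwrote our id in the meantime
    return 'stop' if state['race'] == pid else 'down'

def run(schedule, n):
    # run processes 0..n-1 under a given interleaving (a list of pids)
    state = {'race': None, 'door': 'open'}
    procs = {i: splitter(state, i) for i in range(n)}
    results = {}
    for i in schedule:
        if i in results:
            continue
        try:
            next(procs[i])
        except StopIteration as done:
            results[i] = done.value
    return results

# a solo caller stops; under any interleaving at most one caller stops
assert run([0, 0, 0], 1) == {0: 'stop'}
for schedule in itertools.product([0, 1], repeat=6):
    outcomes = list(run(list(schedule), 2).values())
    assert outcomes.count('stop') <= 1
```

Note that in some interleavings nobody stops (one caller goes right, the other down), which the splitter's guarantees permit.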
left in the system. The simplest way to do this is to have each process mark
a bit in an array to show it is present, and have each slow-path process,
while still holding all the mutexes, check on its way out if the door bit is
set and no processes claim to be present. If it sees all zeros (except for
itself) after seeing door = closed, it can safely conclude that there is no
fast-path process and reset the splitter itself. The argument then is that the
last slow-path process to leave will do this, re-enabling the fast path once
there is no contention again. This approach is taken implicitly in Lamport’s
original algorithm, which combines the splitter and the mutex algorithms
into a single miraculous blob.
shared data:
1 choosing[i], an atomic bit for each i, initially 0
2 number[i], an unbounded atomic register, initially 0
3 Code for process i:
4 while true do
// trying
5 choosing[i] ← 1
6 number[i] ← 1 + maxj≠i number[j]
7 choosing[i] ← 0
8 for j ≠ i do
9 loop until choosing[j] = 0
10 loop until number[j] = 0 or ⟨number[i], i⟩ < ⟨number[j], j⟩
// critical
11 (do critical section stuff)
// exiting
12 number[i] ← 0
// remainder
13 (do remainder stuff)
Algorithm 17.6: Lamport’s Bakery algorithm
Note that several of these lines are actually loops; this is obvious for
Lines 9 and 10, but is also true for Line 6, which includes an implicit loop
to read all n − 1 values of number[j].
Intuition for mutual exclusion is that if you have a lower number than I
do, then I block waiting for you; for lockout-freedom, eventually I have the
smallest number. (There are some additional complications involving the
choosing bits that we are sweeping under the rug here.) For a real proof
see [AW04, §4.4.1] or [Lyn96, §10.7].
Selling point is a strong near-FIFO guarantee and the use of only single-
writer registers (which need not even be atomic—it’s enough that they re-
turn correct values when no write is in progress). Weak point is unbounded
registers.
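A runnable sketch of the bakery algorithm, using Python threads; this is a demonstration rather than a production lock, and it leans on the fact that CPython's global interpreter lock makes individual list reads and writes atomic, standing in for the algorithm's safe registers. All names here are ours. A lost update to the shared counter would reveal a mutual-exclusion failure.

```python
import sys
import threading

sys.setswitchinterval(1e-4)   # switch threads often so spinning is cheap
N = 3
choosing = [False] * N
number = [0] * N
counter = 0

def lock(i):
    choosing[i] = True
    number[i] = 1 + max(number)          # Line 6, with its implicit loop
    choosing[i] = False
    for j in range(N):
        if j == i:
            continue
        while choosing[j]:               # wait for j to finish choosing
            pass
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass                         # j holds a smaller ticket: wait

def unlock(i):
    number[i] = 0

def worker(i):
    global counter
    for _ in range(100):
        lock(i)
        tmp = counter                    # unprotected read-modify-write:
        counter = tmp + 1                # a lost update would reveal a race
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == N * 100
```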
The final result follows from the fact that when k = n we have n covered
registers; this implies that the algorithm uses at least n registers.
1 C[side(i)] ← i
2 T ← i
3 P[i] ← 0
4 rival ← C[¬side(i)]
5 if rival ≠ ⊥ and T = i then
6     if P[rival] = 0 then
7         P[rival] ← 1
8     while P[i] = 0 do spin
9     if T = i then
10         while P[i] ≤ 1 do spin
When I want to enter my critical section, I first set C[side(i)] so you can
find me; this also has the same effect as setting present[side(i)] in Peterson’s
algorithm. I then point T to myself and look for you. I’ll block if I see
C[¬side(i)] ≠ ⊥ and T = i. This can occur in two ways: one is that I really
write T after you did, but the other is that you only wrote C[¬side(i)] but
haven’t written T yet. In the latter case, you will signal to me that T may
have changed by setting P [i] to 1. I have to check T again (because maybe
I really did write T later), and if it is still i, then I know that you are ahead
of me and will succeed in entering your critical section. In this case I can
safely spin on P [i] waiting for it to become 2, which signals that you have
left.
There is a proof that this actually works in [YA95], but it’s 27 pages
of very meticulously-demonstrated invariants (in fairness, this includes the
entire algorithm, including the tree parts that we omitted here). For intu-
ition, this is not much more helpful than having a program mechanically
check all the transitions, since the algorithm for two processes is effectively
finite-state if we ignore the issue with different processes i jumping into the
role of side(i).
A slightly less rigorous proof but more human-accessible proof would be
analogous to the proof of Peterson’s algorithm. We need to show two things:
first, that no two processes ever both enter the critical section, and second,
that no process gets stuck.
For the first part, consider two processes i and j, where side(i) = 0 and
side(j) = 1. We can’t have both i and j skip the loops, because whichever
one writes T last sees itself in T . Suppose that this is process i and that
j skips the loops. Then T = i and P [i] = 0 as long as j is in the critical
section, so i blocks. Alternatively, suppose i writes T last but does so after
j first reads T . Now i and j both enter the loops. But again i sees T = i on
its second test and blocks on the second loop until j sets P [i] to 2, which
doesn’t happen until after j finishes its critical section.
Now let us show that i doesn’t get stuck. Again we’ll assume that i
wrote T second.
If j skips the loops, then j sets P [i] = 2 on its way out as long as T = i;
this falsifies both loop tests. If this happens after i first sets P [i] to 0, only
i can set P[i] back to 0, so i escapes its first loop, and any j′ that enters
from the 1 side will see P[i] = 2 before attempting to set P[i] to 1, so P[i]
remains at 2 until i comes back around again. If j sets P[i] to 2 before i
sets P[i] to 0 (or doesn't set it at all because T = j), then C[side(j)] is set
to ⊥ before i reads it, so i skips the loops.
If j doesn’t skip the loops, then P [i] and P [j] are both set to 1 after i
and j enter the loopy part. Because j waits for P[j] ≠ 0, when it looks at
T the second time it will see T = i 6= j and will skip the second loop. This
causes it to eventually set P [i] to 2 or set C[side(j)] to ⊥ before i reads it
as in the previous case, so again i eventually reaches its critical section.
Since the only operations inside a loop are on local variables, the algo-
rithm has O(1) RMR complexity. For the full tree this becomes O(log n).
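The remark about mechanically checking the transitions can be taken literally: for fixed sides the two-process algorithm is finite-state, so a small breadth-first search over all interleavings can verify mutual exclusion. The sketch below (all names ours) assumes side(i) = i, treats each line as one atomic step, and reconstructs the exit protocol, which is not shown in this excerpt, from the proof above: set C[side(i)] back to ⊥, read T, and set P[rival] to 2 if T points at the rival.

```python
from collections import deque

BOT = None

def step(s, i):
    # one atomic step of process i; state s = (C, T, P, pc, rival)
    C, T, P, pc, rival = list(s[0]), s[1], list(s[2]), list(s[3]), list(s[4])
    j, k = 1 - i, s[3][i]
    if k == 0:   C[i] = i                          # C[side(i)] <- i
    elif k == 1: T = i                             # T <- i
    elif k == 2: P[i] = 0                          # P[i] <- 0
    elif k == 3: rival[i] = C[j]                   # rival <- C[not side(i)]
    elif k == 4:                                   # test rival and T
        pc[i] = 5 if (rival[i] is not BOT and T == i) else 10
    elif k == 5: pc[i] = 6 if P[rival[i]] == 0 else 7
    elif k == 6: P[rival[i]] = 1                   # signal possible T change
    elif k == 7: pc[i] = 7 if P[i] == 0 else 8     # spin while P[i] = 0
    elif k == 8: pc[i] = 9 if T == i else 10       # re-check T
    elif k == 9: pc[i] = 9 if P[i] <= 1 else 10    # spin while P[i] <= 1
    elif k == 10: pass                             # critical section
    elif k == 11: C[i] = BOT                       # exit (reconstructed)
    elif k == 12: rival[i] = T
    elif k == 13:
        if rival[i] != i:
            P[rival[i]] = 2                        # hand off to the rival
    else:
        return None                                # k == 14: done
    if k not in (4, 5, 7, 8, 9):
        pc[i] = k + 1
    return (tuple(C), T, tuple(P), tuple(pc), tuple(rival))

def model_check():
    init = ((BOT, BOT), BOT, (0, 0), (0, 0), (BOT, BOT))
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        for i in (0, 1):
            t = step(s, i)
            if t is None:
                continue
            # mutual exclusion: never both at the critical-section line
            assert not (t[3][0] == 10 and t[3][1] == 10)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    assert any(s[3] == (14, 14) for s in seen)     # both can finish
    return len(seen)

states = model_check()
assert states > 0
```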
CHAPTER 18. THE WAIT-FREE HIERARCHY 145
• x and y are both reads. Then x and y commute: Cxy = Cyx, and we
get a contradiction.
• x and y are both writes. Now py can’t tell the difference between Cxy
and Cy, so we get the same decision value for both, again contradicting
that Cx is 0-valent and Cy is 1-valent.
not the identity (otherwise RMW is just read). Then there is some value v
such that f (v) 6= v. To solve two-process consensus, have each process pi first
write its preferred value to a register ri , then execute the non-trivial RMW
operation on the RMW object initialized to v. The first process to execute
its operation sees v and decides its own value. The second process sees f (v)
and decides the first process’s value (which it reads from the register). It
follows that a non-trivial RMW object has consensus number at least 2.
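The protocol above can be sketched concretely; here we use fetch-and-add as the non-trivial RMW operation (f(v) = v + 1, so f(v) ≠ v for every v), with a Python lock standing in for the hardware's atomicity. All names are ours.

```python
import threading

class FetchAndAdd:
    # the lock stands in for the RMW operation's built-in atomicity
    def __init__(self, v=0):
        self._v, self._lock = v, threading.Lock()
    def fetch_and_add(self, delta):
        with self._lock:
            old, self._v = self._v, self._v + delta
            return old

def make_consensus():
    prefs = [None, None]         # register r_i holding each preference
    obj = FetchAndAdd(0)         # RMW object initialized to v = 0
    def decide(i, preference):
        prefs[i] = preference    # write my preference first
        if obj.fetch_and_add(1) == 0:
            return preference    # saw the initial value v: I went first
        return prefs[1 - i]      # saw f(v): adopt the first process's value
    return decide

decide = make_consensus()
assert decide(0, 'a') == 'a'     # p0's RMW executes first, so p0 wins
assert decide(1, 'b') == 'a'     # p1 sees f(0) = 1 and adopts p0's value
```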
In many cases, this is all we get. Suppose that the operations of some
RMW type T are interfering in a way analogous to the previous definition,
where now we say that x and y commute if they leave the object in the
same state (regardless of what values are returned) and that y overwrites x
if the object is always in the same state after both x and xy (again regard-
less of what is returned). The two processes px and py that carry out x and
y know what happened, but a third process pz doesn't. So if we run pz
to completion we get the same decision value after both Cx and Cy, which
means that Cx and Cy can’t be 0-valent and 1-valent. It follows that no
collection of RMW registers with interfering operations can solve 3-process
consensus, and thus all such objects have consensus number 2. Examples
of these objects include test-and-set bits, fetch-and-add registers, and
swap registers that support an operation swap that writes a new value and
returns the previous value.
There are some other objects with consensus number 2 that don’t fit this
pattern. Define a wait-free queue as an object with enqueue and dequeue
operations (like normal queues), where dequeue returns ⊥ if the queue is
empty (instead of blocking). To solve 2-process consensus with a wait-free
queue, initialize the queue with a single value (it doesn’t matter what the
value is). We can then treat the queue as a non-trivial RMW register where
a process wins if it successfully dequeues the initial value and loses if it gets
empty.
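The queue-based protocol is short enough to sketch directly (names ours; `queue.Queue` here plays the role of the wait-free queue, with `get_nowait` returning "empty" instead of blocking):

```python
import queue

def make_queue_consensus():
    q = queue.Queue()
    q.put('token')               # the single initial value; what it is
                                 # doesn't matter
    prefs = [None, None]
    def decide(i, preference):
        prefs[i] = preference
        try:
            q.get_nowait()       # dequeued the initial value: winner
            return preference
        except queue.Empty:      # dequeue came up empty: loser
            return prefs[1 - i]  # adopt the winner's preference
    return decide

decide = make_queue_consensus()
assert decide(1, 'y') == 'y'     # p1 dequeues the token and wins
assert decide(0, 'x') == 'y'
```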
However, enqueue operations are non-interfering: if px enqueues vx and
py enqueues vy , then any third process can detect which happened first;
similarly we can distinguish enq(x)deq() from deq()enq(x). So to show we
can’t do three process consensus we do something sneakier: given a bivalent
state C with allegedly 0- and 1-valent successors Cenq(x) and Cenq(y),
consider both Cenq(x)enq(y) and Cenq(y)enq(x) and run px until it does
a deq() (which it must, because otherwise it can’t tell what to decide) and
then stop it. Now run py until it also does a deq() and then stop it. We’ve
now destroyed the evidence of the split and poor hapless pz is stuck. In the
case of Cdeq()enq(x) and Cenq(x)deq() on a non-empty queue we can kill
the initial dequeuer immediately and then kill whoever dequeues x or the
value it replaced, and if the queue is empty only the dequeuer knows. In
either case we reach indistinguishable states after killing only 2 witnesses,
and the queue has consensus number at most 2.
Similar arguments work on stacks, deques, and so forth—these all have
consensus number exactly 2.
Queue with peek Has operations enq(x) and peek(), which returns the
first value enqueued. (Maybe also deq(), but we don’t need it for
consensus). Protocol is to enqueue my input and then peek and return
the first value in the queue.
Fetch-and-cons Returns old cdr and adds new car on to the head of a list.
Use preceding protocol where peek() = tail(car :: cdr).
Sticky bit Has a write operation that has no effect unless register is in
the initial ⊥ state. Whether the write succeeds or fails, it returns
nothing. The consensus protocol is to write my input and then return
result of a read.
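The sticky-bit protocol is the simplest of the three; here is a sketch (names ours) of the text's sticky bit generalized to a multivalued sticky register, where only the first write takes effect:

```python
class StickyRegister:
    def __init__(self):
        self._v = None           # initial ⊥ state
    def write(self, v):
        if self._v is None:      # writes after the first have no effect
            self._v = v
    def read(self):
        return self._v

def decide(sticky, preference):
    sticky.write(preference)     # may or may not take effect
    return sticky.read()         # everyone returns the first write

s = StickyRegister()
assert decide(s, 3) == 3
assert decide(s, 7) == 3         # later writers lose and adopt the value
assert decide(s, 9) == 3
```

Note that nothing here limits the number of participants: every process returns whatever the first write left in the register.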
Algorithm 18.1 requires 2-register writes, and will give us a protocol for 2
processes (since the reader above has to participate somewhere to make the
first case work). For m processes, we can do the same thing with m-register
writes. We have a register rpq = rqp for each pair of distinct processes p
4 Or use any other rule that all processes apply consistently.
1 v1 ← r1
2 v2 ← r2
3 if v1 = v2 = ⊥ then
4 return no winner
5 if v1 = 1 and v2 = ⊥ then
// p1 went first
6 return 1
// read r1 again
7 v1′ ← r1
8 if v2 = 2 and v1′ = ⊥ then
// p2 went first
9 return 2
// both p1 and p2 wrote
10 if rshared = 1 then
11 return 2
12 else
13 return 1
Algorithm 18.1: Determining the winner of a race between 2-register
writes. The assumption is that p1 and p2 each wrote their own ids to ri
and rshared simultaneously. This code can be executed by any process
(including but not limited to p1 or p2 ) to determine which of these
2-register writes happened first.
Now suppose we have 2m − 1 processes. The first part says that each of
the pending operations (x, y, all of the zi ) writes to 1 single-writer register
and at least k two-writer registers where k is the number of processes leading
to a different univalent value. This gives k + 1 total registers simultaneously
written by this operation. Now observe that with 2m − 1 processes, there is
some set of m processes whose operations all lead to a b-valent state; so
for any process to get to a (¬b)-valent state, it must write m + 1 registers
simultaneously. It follows that with only m simultaneous writes we can only
do (2m − 2)-consensus.
1 procedure apply(π)
// announce my intended operation
2 op[i] ← π
3 while true do
// find a recent round
4 r ← maxj round[j]
// obtain the history as of that round
5 if hr = ⊥ then
6 hr ← consensus(c[r], ⊥)
7 if π ∈ hr then
8 return value π returns in hr
// else attempt to advance
9 h′ ← hr
10 for each j do
11 if op[j] ∉ h′ then
12 append op[j] to h′
loop, so in principle it could run forever. But we can argue that no process
executes the loop more than twice. The reason is that a process p puts
its operation in op[p] before it calculates r; so any process that writes r0 > r
to round sees p’s operation before the next round. It follows that p’s value
gets included in the history no later than round r + 2. (We’ll see this sort
of thing again when we do atomic snapshots in Chapter 19.)
Building a consistent shared history is easier with some particular objects
that solve consensus. For example, a fetch-and-cons object that supplies
an operation that pushes a new head onto a linked list and returns the old
head trivially implements the common history above without the need for
helping. One way to implement fetch-and-cons is with a swap object; to
add a new element to the list, create a cell with its next pointer pointing to
itself, then swap the next field with the head pointer for the entire list.
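The swap trick above can be sketched in a few lines (names ours): the new cell briefly points to itself, then its next field is exchanged with the head pointer.

```python
class Swap:
    def __init__(self, v=None):
        self._v = v
    def swap(self, new):         # atomically write new, return old value
        old, self._v = self._v, new
        return old
    def read(self):
        return self._v

class Cell:
    def __init__(self, value):
        self.value, self.next = value, None

def fetch_and_cons(head, value):
    cell = Cell(value)
    cell.next = cell             # self-loop until the cell is linked in
    cell.next = head.swap(cell)  # the old head becomes our tail
    return cell.next             # the old cdr

head = Swap(None)
for v in (1, 2, 3):
    fetch_and_cons(head, v)
out, node = [], head.read()
while node is not None:
    out.append(node.value)
    node = node.next
assert out == [3, 2, 1]          # most recently consed value comes first
```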
The solutions we’ve described here have a number of deficiencies that
make them impractical in a real system (even more so than many of the
algorithms we’ve described). If we store entire histories in a register, the
register will need to be very, very wide. If we store entire histories as a linked
list, it will take an unbounded amount of time to read the list. For solutions
to these problems, see [AW04, §15.3] or the papers of Herlihy [Her91b] and
Plotkin [Plo89].
Chapter 19
Atomic snapshots
We’ve seen in the previous chapter that there are a lot of things we can’t
make wait-free with just registers. But there are a lot of things we can.
Atomic snapshots are a tool that let us do a lot of these things easily.
An atomic snapshot object acts like a collection of n single-writer
multi-reader atomic registers with a special snapshot operation that returns
(what appears to be) the state of all n registers at the same time. This is
easy without failures: we simply lock the whole register file, read them all,
and unlock them to let all the starving writers in. But it gets harder if
we want a protocol that is wait-free, where any process can finish its own
snapshot or write even if all the others lock up.
We’ll give the usual sketchy description of a couple of snapshot algo-
rithms. More details on early snapshot results can be found in [AW04,
§10.3] or [Lyn96, §13.3]. There is also a reasonably recent survey by Fich
on upper and lower bounds for the problem [Fic05].
terminate if there are a lot of writers around.1 So we need some way to slow
the writers down, or at least get them to do snapshots for us.
in order to prevent case (a) from holding, the adversary has to supply at
least one new value in each collect after the first. But it can only supply one
new value for each of the n − 1 processes that aren’t doing collects before
case (b) is triggered (it’s triggered by the first process that shows up with
a second new value). Adding up all the collects gives 1 + (n − 1) + 1 =
n + 1 collects before one of the cases holds. Since each collect takes n − 1
read operations (assuming the process is smart enough not to read its own
register), a snapshot operation terminates after at most n² − 1 reads.
19.2.1 Linearizability
We now need to argue that the snapshot vectors returned by the Afek et al.
algorithm really work, that is, that between each matching invoke-snapshot
and respond-snapshot there was some actual time where the registers in
the array contained precisely the values returned in the respond-snapshot
1. The toggle bit for some process q is unchanged between the two snap-
shots taken by p. Since the bit is toggled with each update, this means
that an even number of updates to q's segment occurred during the
interval between p’s writes. If this even number is 0, we are happy: no
updates means no call to tryHandshake by q, which means we don’t
see any change in q’s segment, which is good, because there wasn’t
any. If this even number is 2 or more, then we observe that each of
these events precedes the following one:
It follows that q both reads and writes the handshake bits in between
p’s calls to tryHandshake and checkHandshake, so p correctly sees
that q has updated its segment.
2. The toggle bit for q has changed. Then q did an odd number of updates
(i.e., at least one), and p correctly detects this fact.
What does p do with this information? Each time it sees that q has done
a scan, it updates a count for q. If the count reaches 3, then p can determine
that q’s last scanned value is from a scan that is contained completely within
the time interval of p’s scan. Either this is a direct scan, where q actually
performs two collects with no changes between them, or it’s an indirect
scan, where q got its value from some other scan completely contained
within q’s scan. In the first case p is immediately happy; in the second,
we observe that this other scan is also contained within the interval of p’s
scan, and so (after chasing down a chain of at most n − 1 indirect scans) we
eventually reach a direct scan contained within it that provided the actual
value. In either case p returns the value of a pair of adjacent collects with
no changes between them that occurred during the execution of its scan
operation, which gives us linearizability.
1 procedure scan()
// First attempt
2 Ri ← r ← max(R1 . . . Rn , Ri + 1)
3 collect ← read(S1 . . . Sn )
4 view ← LAr (collect)
5 if max(R1 . . . Rn ) > Ri then
// Fall through to second attempt
6 else
7 Vir ← view
8 return Vir
// Second attempt
9 Ri ← r ← max(R1 . . . Rn , Ri + 1)
10 collect ← read(S1 . . . Sn )
11 view ← LAr (collect)
12 if max(R1 . . . Rn ) > Ri then
13 Vir ← some nonempty Vjr
14 return Vir
15 else
16 Vir ← view
17 return Vir
than once; if the same thing happens on my second attempt, I can use an
indirect view as in [AAD+ 93], knowing that it is safe to do so because any
collect that went into this indirect view started after I did.
The update operation is the usual update-and-scan procedure; for com-
pleteness this is given as Algorithm 19.3. To make it easier to reason about
the algorithm, we assume that an update returns the result of the embedded
scan.
1. All views returned by the scan operation are comparable; that is, there
exists a total order on the set of views (which can be extended to a
total order on scan operations by breaking ties using the execution
order).
2. Each view returned by a scan operation includes any update that com-
pletes before the scan starts.
3. The total order on views respects the execution order: if π1 and π2 are
scan operations that return v1 and v2 , then π1 <S π2 implies v1 ≤ v2 .
(This gives us linearization.)
Let’s start with comparability. First observe that any view returned is
either a direct view (obtained from LAr ) or an indirect view (obtained from
Vjr for some other process j). In the latter case, following the chain of
indirect views eventually reaches some direct view. So all views returned for
a given round are ultimately outputs of LAr and thus satisfy comparability.
But what happens with views from different rounds? The lattice-agreement
objects only operate within each round, so we need to ensure that any view
returned in round r is included in any subsequent rounds. This is where
checking round numbers after calling LAr comes in.
Suppose some process i returns a direct view; that is, it sees no higher
round number in either its first attempt or its second attempt. Then at
the time it starts checking the round number in Line 5 or 12, no process
has yet written a round number higher than the round number of i’s view
(otherwise i would have seen it). So no process with a higher round number
has yet executed the corresponding collect operation. When such a process
does so, it obtains values that are at least as current as those fed into LAr ,
and i’s round-r view is less than or equal to the vector of these values by
upward validity of LAr and thus less than or equal to the vector of values
returned by LAr′ for r′ > r by upward validity. So we have comparability
of all direct views, which implies comparability of all indirect views as well.
To show that each view returned by scan includes the preceding update,
we observe that either a process returns its first-try scan (which includes
the update by downward validity) or it returns the results of a scan in the
second-try round (which includes the update by downward validity in the
later round, since any collect in the second-try round starts after the update
occurs). So no updates are missed.
Now let’s consider two scan operations π1 and π2 where π1 precedes π2
in the execution. We want to show that, for the views v1 and v2 that these
scans return, v1 ≤ v2 . From the comparability property, the only way this
can fail is if v2 < v1 ; that is, there is some update included in v2 that is
not included in v1 . But this can’t happen; if π2 starts after π1 finishes, it
starts after any update π1 sees is already in one of the Sj registers, and so
π2 will include this update in its initial collect. (Slightly more formally, if s
is the contents of the registers at some time between when π1 finishes and
π2 starts, then v1 ≤ s by upward validity and s ≤ v2 by downward validity
of the appropriate LA objects.)
3. If the values obtained are the same in both collects, call WriteSet on
the current node to store the union of the two sets and proceed to the
parent node. Otherwise repeat the preceding step.
1 procedure WriteSet(S)
2 for i ← |S| down to 1 do
3 a[i] ← S
4 procedure ReadSet()
// update p to last nonempty position
5 while true do
6 s ← a[p]
7 if p = m or a[p + 1] = ∅ then
8 break
9 else
10 p←p+1
11 return s
Algorithm 19.4: Increasing set data structure
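A sequential sketch of Algorithm 19.4 in Python (names ours): a[i] always holds a set of size at least i, and the reader advances its persistent pointer p to the last nonempty slot.

```python
class IncreasingSet:
    def __init__(self, m):
        self.m = m
        self.a = [None] * (m + 1)    # 1-indexed; None plays the role of ∅
        self.p = 1                   # the reader's persistent position

    def write_set(self, S):
        for i in range(len(S), 0, -1):   # write a[|S|] down to a[1]
            self.a[i] = frozenset(S)

    def read_set(self):
        while True:
            s = self.a[self.p]
            if self.p == self.m or self.a[self.p + 1] is None:
                return s             # last nonempty position reached
            self.p += 1

s = IncreasingSet(5)
s.write_set({1})
s.write_set({1, 2, 3})
assert s.read_set() == {1, 2, 3}
s.write_set({1, 2, 3, 4})
assert s.read_set() == {1, 2, 3, 4}
```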
Naively, one might think that we could just write directly to a[|S|] and
skip the previous ones, but this makes it harder for a reader to detect that
1 procedure scan()
2 currSeq ← currSeq + 1
3 for j ← 0 to n − 1 do
4 h ← memory[j].high
5 if h.seq < currSeq then
6 view[j] ← h.value
7 else
8 view[j] ← memory[j].low.value
1 procedure update()
2 seq ← currSeq
3 h ← memory[i].high
4 if h.seq ≠ seq then
5 memory[i].low ← h
6 memory[i].high ← (value, seq)
Algorithm 19.6: Single-scanner snapshot: update
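A sequential sketch of the single-scanner snapshot (Algorithms 19.5 and 19.6) in Python, with names of our own choosing: each component keeps a high and a low slot, and the scanner's sequence number tells updaters whether high is safe to overwrite.

```python
class SingleScannerSnapshot:
    def __init__(self, n):
        self.n = n
        self.curr_seq = 0
        self.high = [(None, 0)] * n      # (value, seq) pairs
        self.low = [(None, 0)] * n

    def scan(self):                      # only one process may call this
        self.curr_seq += 1
        view = []
        for j in range(self.n):
            value, seq = self.high[j]
            if seq < self.curr_seq:      # high written before this scan
                view.append(value)
            else:                        # high is too new: use the copy
                view.append(self.low[j][0])   # preserved in low
        return view

    def update(self, i, value):
        seq = self.curr_seq
        h = self.high[i]
        if h[1] != seq:                  # first write in this scan interval:
            self.low[i] = h              # preserve old value for the scanner
        self.high[i] = (value, seq)

s = SingleScannerSnapshot(2)
s.update(0, 'a')
s.update(1, 'b')
assert s.scan() == ['a', 'b']
s.update(0, 'c')
assert s.scan() == ['c', 'b']
```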
19.5 Applications
Here we describe a few things we can do with snapshots.
Chapter 20
Lower bounds on perturbable objects
Being able to do snapshots in linear time means that we can build lineariz-
able counters, generalized counters, max registers, etc. in linear time, by
having each reader take a snapshot and combine the contributions of each
updater using the appropriate commutative and associative operation. A
natural question is whether we can do better by exploiting the particular
features of these objects.
Unfortunately, the Jayanti-Tan-Toueg [JTT00] lower bound for per-
turbable objects says each of these objects requires n − 1 space and n − 1
steps for a read operation in the worst case, for any solo-terminating imple-
mentation from historyless objects.1
Here perturbable means that the object has a particular property that
makes the proof work, essentially that the outcome of certain special exe-
cutions can be changed by stuffing lots of extra update operations in the
middle (see below for details). Solo-terminating means that a process
finishes its current operation in a finite number of steps if no other process
takes steps in between; it is a much weaker condition, for example, than
wait-freedom. Historyless objects are those for which any operation that
changes the state overwrites all previous operations (i.e., those for which
covering arguments work, as long as the covering processes never report
back what they say). Atomic registers are the typical example, while swap
objects (with a swap operation that writes a new state while returning the
old state) are the canonical example since they can implement any other
1 A caveat is that we may be able to make almost all read operations cheaper, although we won't be able to do anything about the space bound. See Chapter 21.
historyless object (and even have consensus number 2, showing that even
extra consensus power doesn’t necessarily help here).
Below is a sketch of the proof. See the original paper [JTT00] for more
details.
– For a max register, let γ include a bigger write than all the others.
– For a counter, let γ include at least n increments. The same
works for a mod-m counter if m is at least 2n.
∗ Why n increments? With fewer increments, we can make
Π return the same value by being sneaky about when the
partial increments represented in Σk are linearized.
– In contrast, historyless objects (including atomic registers) are
not perturbable: if Σk includes a write that sets the value of the
object, no set of operations inserted before it will change this
value. (This is good, because we know that it only takes one
atomic register to implement an atomic register.)
• Find a γ 0 that writes to the first uncovered register that Π looks at (if
none exists, the reader is wasting a step), truncate before that write,
and prepend the write to Σk .
Chapter 21
Restricted-use objects
Here we are describing work by Aspnes, Attiya, and Censor [AAC09], plus
some extensions by Aspnes et al. [AACHE12] and Aspnes and Censor-
Hillel [ACH13]. The idea is to place restrictions on the size of objects that
would otherwise be subject to the Jayanti-Tan-Toueg bound [JTT00] (see
Chapter 20), in order to get cheap implementations.
The central object that is considered in this work is a max register,
for which a read operation returns the largest value previously written, as
opposed to the last value previously written. So after writes of 0, 3, 5, 2, 6,
11, 7, 1, 9, a read operation will return 11.
These are perturbable objects in the sense of the Jayanti-Tan-Toueg
bound, so in the worst case a max-register read will have to read at least
n−1 distinct atomic registers, giving an n−1 lower bound on both individual
work and space. But we can get around this by considering bounded max
registers (which only hold values in some range 0 . . . m − 1); these are not
perturbable because once one hits its upper bound we can no longer insert
new operations to change the value returned by a read.
1 procedure read(r)
2 if switch = 0 then
3 return 0(read(left))
4 else
5 return 1(read(right))
The intuition is that the max register is really a big tree of switch vari-
ables, and we store a particular bit-vector in the max register by setting to 1
the switches needed to make read follow the path corresponding to that bit-
vector. The procedure for writing 0x tests switch first, because once switch
gets set to 1, any 0x values are smaller than the largest value, and we don’t
want them getting written to left where they might confuse particularly slow
readers into returning a value we can’t linearize. The procedure for writing
1x sets switch second, because (a) it doesn’t need to test switch, since 1x
always beats 0x, and (b) it’s not safe to send a reader down into right until
some value has actually been written there.
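The tree of switches can be sketched recursively (names ours): a max register for values in [0, 2^k) is a switch bit over two max registers for values in [0, 2^(k−1)), following exactly the read and write rules described above.

```python
class MaxRegister:
    def __init__(self, k):
        self.k = k               # holds values 0 .. 2**k - 1
        self.switch = 0
        if k > 0:
            self.left = MaxRegister(k - 1)
            self.right = MaxRegister(k - 1)

    def write(self, v):
        if self.k == 0:
            return               # only the value 0 fits: nothing to record
        half = 1 << (self.k - 1)
        if v < half:
            if self.switch == 0:     # test switch first for small values
                self.left.write(v)
        else:
            self.right.write(v - half)
            self.switch = 1          # set switch second for large values

    def read(self):
        if self.k == 0:
            return 0
        if self.switch == 0:
            return self.left.read()
        return (1 << (self.k - 1)) + self.right.read()

r = MaxRegister(4)
best = 0
for v in [0, 3, 5, 2, 6, 11, 7, 1, 9]:   # the example sequence from the text
    r.write(v)
    best = max(best, v)
    assert r.read() == best
assert r.read() == 11
```

Each read or write touches one switch per level, matching the claimed cost of one operation per bit.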
It’s easy to see that read and write operations both require exactly
one operation per bit of the value read or written. To show that we get
linearizability, we give an explicit linearization ordering (see the paper for a
full proof that this works):
(a) Within this pile, we sort operations using the linearization order-
ing for left.
(a) Within this pile, operations that touch right are ordered using
the linearization ordering for right. Operations that don’t (which
are the “do nothing” writes for 0x values) are placed consistently
with the actual execution order.
To show that this gives a valid linearization, we have to argue first that
any read operation returns the largest earlier write argument and that we
don’t put any non-concurrent operations out of order.
For the first part, any read in the 0 pile returns 0read(left), and read(left)
returns (assuming left is a linearizable max register) the largest value pre-
viously written to left, which will be the largest value linearized before the
read, or the all-0 vector if there is no such value. In either case we are
happy. Any read in the 1 pile returns 1read(right). Here we have to guard
against the possibility of getting an all-0 vector if no write operations lin-
earize before the read. But any write operation that writes 1x doesn’t set
switch to 1 until after it writes to right, so no read operation ever starts
read(right) until after at least one write to right has completed, implying
that that write to right linearizes before the read from right. So in this case
as well all the second-pile operations linearize.
all the values represented by earlier max registers in the chain. Formally,
this is equivalent to encoding values using an Elias gamma code, tweaked
slightly by changing the prefixes from 0k 1 to 1k 0 to get the ordering right.
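The tweaked code is easy to render concretely (our sketch): a value v ≥ 1 becomes k ones, a zero, then the k low-order bits of v, where k = ⌊lg v⌋. Flipping the usual 0^k 1 prefix to 1^k 0 is exactly what makes lexicographic order on the codewords agree with numeric order.

```python
def encode(v):
    # tweaked Elias gamma code: 1^k 0 followed by the k low-order bits
    k = v.bit_length() - 1
    return '1' * k + '0' + format(v, 'b')[1:]   # drop the leading 1 bit

codes = [encode(v) for v in range(1, 100)]
assert codes[0] == '0' and codes[1] == '100' and codes[2] == '101'
assert codes == sorted(codes)    # string order matches numeric order
```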
write values ≤ k. Let t be the smallest value such that some execution in
St writes to r (there must be some such t, or our reader can omit reading r,
which contradicts the assumption that it is optimal).
We’ve shown the recurrence T(m, n) ≥ min_t(max(T(t, n − 1), T(m − t, n))) + 1,
with base cases T(1, n) = 0 and T(m, 1) = 0. The solution to this recur-
rence is exactly min(⌈lg m⌉, n − 1), which is the same, except for a constant
factor on n, as the upper bound we got by choosing between a balanced
tree for small m and a snapshot for m ≥ 2^{n−1}. For small m, the recursive
split we get is also the same as in the tree-based algorithm: call the register
r switch and you can extract a tree from whatever algorithm somebody
gives you. So this says that the tree-based algorithm is (up to choice of the
tree) essentially the unique optimal bounded max register implementation
for m ≤ 2^{n−1}.
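A quick dynamic program confirms the claimed closed form, assuming the recurrence reduces the number of processes in its first branch, i.e. T(m, n) ≥ min over t of max(T(t, n − 1), T(m − t, n)) + 1; which branch loses the process is our assumption here (with n in both branches, the recurrence would solve to ⌈lg m⌉ alone, with no n − 1 cap).

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(m, n):
    # base cases: T(1, n) = 0 and T(m, 1) = 0
    if m == 1 or n == 1:
        return 0
    return min(1 + max(T(t, n - 1), T(m - t, n)) for t in range(1, m))

for m in range(1, 33):
    for n in range(1, 7):
        # ceil(lg m) computed exactly as (m-1).bit_length()
        assert T(m, n) == min((m - 1).bit_length(), n - 1)
```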
It is also possible to show lower bounds on randomized implementations
of max registers and other restricted-use objects. See [AAC09, AACHH12]
for examples.
1 procedure write(a, i, v)
2 if i = 0 then
3 if v < k1 then
4 if a.switch = 0 then
5 write(a.left, 0, v)
6 else
7 write(a.right, 0, v − k1 )
8 a.switch ← 1
9 else
10 write(a.tail, v)
11 procedure read(a)
12 x ← read(a.tail)
13 if a.switch = 0 then
14 write(a.left, 1, x)
15 return read(a.left)
16 else
17 x ← read(a.tail)
18 write(a.right, 1, x)
19 return ⟨k1 , 0⟩ + read(a.right)
21.5.1 Linearizability
In broad outline, the proof of linearizability follows the proof for a simple
max register. But as with snapshots, we have to show that the ordering of
the head and tail components are consistent.
The key observation is the following lemma.
Proof. Both vleft [1] and vright [1] are values that were previously written to
their respective max arrays by read(a) operations (such writes necessar-
ily exist because any process that reads a.left or a.right writes a.left[1] or
a.right[1] first). From examining the code, we have that any value written
to a.left[1] was read from a.tail before a.switch was set to 1, while any value
written to a.right[1] was read from a.tail after a.switch was set to 1. Since
max-register reads are non-decreasing, we have that any value written to
a.left[1] is less than or equal to any value written to a.right[1], proving the
claim.
Theorem 21.5.2. If a.left and a.right are linearizable max arrays, and a.tail
is a linearizable max register, then Algorithm 21.3 implements a linearizable
max array.
it is the root. The pointers themselves are non-decreasing indices into ar-
rays of values that consist of ordinary (although possibly very wide) atomic
registers.
When a process writes a new value to its component of the snapshot
object, it increases the pointer value in its leaf and then propagates the
new value up the tree by combining together partial snapshots at each step,
using 2-component max arrays to ensure linearizability. The resulting algo-
rithm is similar in many ways to the lattice agreement procedure of Inoue et
al. [IMCT94] (see §19.3.5), except that it uses a more contention-tolerant
snapshot algorithm than double collects and we allow processes to update
their values more than once. It is also similar to some constructions of
Jayanti [Jay02] for efficient computation of array aggregates (sum, min,
max, etc.) using LL/SC, the main difference being that because the index
values are non-decreasing, max arrays can substitute for LL/SC.
Each node in the tree except the root is represented by one component
of a 2-component max array that we can think of as being owned by its
parent, with the other component being the node’s sibling in the tree. To
propagate a value up the tree, at each level the process takes a snapshot
of the two children of the node and writes the sum of the indices to the
node’s component in its parent’s max array (or to an ordinary max register
if we are at the root). Before doing this last write, a process will combine
the partial snapshots from the two child nodes and write the result into
a separate array indexed by the sum. In this way any process that reads
the node’s component can obtain the corresponding partial snapshot in a
single register operation. At the root this means that the cost of obtaining
a complete snapshot is dominated by the cost of the max-register read, at
O(log v), where v is the number of updates ever performed.
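The propagation scheme just described can be sketched sequentially in Python. This ignores all concurrency (the 2-component max arrays become plain reads and writes), so it only illustrates the bookkeeping: indices are non-decreasing, each internal node stores partial snapshots in a table indexed by the sum of its children's indices, and a scan is one root-index read plus one table lookup. All names here are invented for the sketch.

```python
class Node:
    """Tree node; 'index' is a non-decreasing pointer into 'table',
    whose entries are partial snapshots."""
    def __init__(self, left=None, right=None):
        self.left, self.right, self.parent = left, right, None
        self.index = 0
        self.table = {}

def build(width):
    leaves = [Node() for _ in range(width)]
    for leaf in leaves:
        leaf.table[0] = (None,)
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            parent = Node(level[i], level[i + 1])
            level[i].parent = level[i + 1].parent = parent
            parent.table[0] = level[i].table[0] + level[i + 1].table[0]
            nxt.append(parent)
        level = nxt
    return leaves, level[0]

def update(leaves, p, v):
    leaf = leaves[p]
    leaf.index += 1
    leaf.table[leaf.index] = (v,)
    node = leaf.parent
    while node is not None:
        # read both children (the 2-component max array, trivial here),
        # store the combined partial snapshot under the sum of the indices,
        # then advance the node's own index (a max-register write)
        li, ri = node.left.index, node.right.index
        node.table[li + ri] = node.left.table[li] + node.right.table[ri]
        node.index = li + ri
        node = node.parent

def scan(root):
    # one max-register read plus one array read
    return root.table[root.index]

leaves, root = build(4)
update(leaves, 0, 'a')
update(leaves, 2, 'b')
update(leaves, 0, 'c')
assert scan(root) == ('c', None, 'b', None)
```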
A picture of this structure, adapted from [AACHE12], appears in Fig-
ure 21.1. The figure depicts an update in progress, with red values being the
new values written as part of the update. Only some of the tables associated
with the nodes are shown.
The cost of an update is dominated by the O(log n) max-array operations
needed to propagate the new value to the root. This takes O(log² v · log n)
steps.
The linearizability proof is trivial: linearize each update by the time at
which a snapshot containing its value is written to the root (which neces-
sarily occurs within the interval of the update, since we don’t let an update
finish until it has propagated its value to the top), and linearize reads by
when they read the root. This immediately gives us an O(log³ n)
implementation—as long as we only want to use it polynomially many times—of any-
CHAPTER 21. RESTRICTED-USE OBJECTS 189
[Figure 21.1: A snapshot object built from a tree of 2-component max arrays, adapted from [AACHE12]; the figure itself is not reproduced here.]
Chapter 22

Common2
procedure TAS2()
    if Consensus2(myId) = myId then
        return 0
    else
        return 1
Once we have test-and-set for two processes, we can easily get one-shot
swap for two processes. The trick is that a one-shot swap object always
returns ⊥ to the first process to access it and returns the other process’s value
to the second process. We can distinguish these two roles using test-and-set
and add a register to send the value across. Pseudocode is in Algorithm 22.2.
procedure swap(v)
    a[myId] ← v
    if TAS2() = 0 then
        return ⊥
    else
        return a[¬myId]

Algorithm 22.2: Two-process one-shot swap
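A quick executable illustration of the swap construction, with the two-process test-and-set simulated by a lock (an assumption made purely for the demo; the point of Common2 is that TAS2 itself needs no locks):

```python
import threading

class TAS2:
    """Two-process test-and-set, simulated with a lock for the demo."""
    def __init__(self):
        self._lock = threading.Lock()
        self._set = False
    def test_and_set(self):
        with self._lock:
            old = self._set
            self._set = True
            return 1 if old else 0    # 0 = winner (first to access)

a = [None, None]          # registers used to send the values across
tas = TAS2()
results = [None, None]

def swap(my_id, v):
    a[my_id] = v
    if tas.test_and_set() == 0:
        results[my_id] = '⊥'           # first process gets bottom
    else:
        results[my_id] = a[1 - my_id]  # second gets the other's value

t0 = threading.Thread(target=swap, args=(0, 'x'))
t1 = threading.Thread(target=swap, args=(1, 'y'))
t0.start(); t1.start(); t0.join(); t1.join()

winner = results.index('⊥')                 # exactly one process sees ⊥
assert results[1 - winner] == a[winner]     # the other gets the winner's value
```

Whichever thread wins the test-and-set plays the role of the "first" process; since both write their registers before calling TAS2, the loser always finds the winner's value waiting.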
procedure compete(i)
    // check the gate
    if gate ≠ ⊥ then
        return gate
    gate ← i
    // do tournament, returning id of whoever I lose to
    node ← leaf for i
    while node ≠ root do
        for each j whose leaf is below sibling of node do
            if TAS2(t[i, j]) = 1 then
                return j
        node ← node.parent
    // I win!
    return ⊥
Algorithm 22.3: Tournament algorithm with gate
locked down to round k − 1 (and thread itself behind some other process at
round k−1); this is done using a “trap” object implemented with a 2-process
swap. If the target process escapes by calling the trap object first, it will
leave behind the id of the process it lost to at round k − 1, allowing the
round-k winner to try again. If the round-k winner fails to trap anybody, it
will eventually thread itself behind the round-(k − 1) winner, who is stuck
at round k − 1.
Only those processes that are ancestors of the process that beat the
round-k winner may get trapped in round k − 1; everybody else will escape
and try again in a later round.
Pseudocode for the trap object is given in Algorithm 22.4. There are two
operations. The pass operation is called by the process trying to escape;
if it executes first, this process successfully escapes, but leaves behind the
identity of somebody else to try. The block operation locks the target down
so that pass fails. The shared data for a trap t consists of a two-process
swap object t[i, j] for each process i trying to block a process j. A utility
procedure passAll is included that calls pass on all potential blockers until
it fails.
It is not hard to see from the code that Algorithm 22.4 has the desired
properties: if the passer reaches the swap object first, it is not blocked but
leaves behind its value v for the blocker; while if the blocker reaches the
procedure block(t, j)
    return swap(t[j, i], blocked)

procedure pass(t, j, v)
    if swap(t[i, j], v) = blocked then
        return false
    else
        return true

procedure passAll(t, v)
    for j ← 1 to n do
        if ¬pass(t, j, v) then
            return false
    return true
Algorithm 22.4: Trap implementation from [AWW93]
procedure findValue(k, t)
    if k = 0 then
        return ⊥
    else
        repeat
            x ← block(trap[k], t)
            if x ≠ ⊥ then t ← x
        until x = ⊥
        return input[t]
CHAPTER 23. RANDOMIZED CONSENSUS AND TEST-AND-SET
based on knowledge of the state of the protocol and its past evolution. How
much knowledge we give the adversary affects its power. Several classes of
adversaries have been considered in the literature; ranging from strongest
to weakest, we have:
1. An adaptive adversary. This adversary is a function from the state
of the system to the set of processes; it can see everything that has
happened so far (including coin-flips internal to processes that have not
yet been revealed to anybody else), but can’t predict the future. It’s
known that an adaptive adversary can force any randomized consensus
protocol to take Θ(n²) total steps [AC08]. The adaptive adversary is
also called a strong adversary following a foundational paper of
Abrahamson [Abr88].
2. An intermediate adversary or weak adversary [Abr88] is one
that limits the adversary’s ability to observe or control the system in
some way, without completely eliminating it. For example, a content-
oblivious adversary [Cha96] or value-oblivious adversary [Aum97]
is restricted from seeing the values contained in registers or pending
write operations and from observing the internal states of processes
directly. A location-oblivious adversary [Asp12b] can distinguish
between values and the types of pending operations, but can’t discrim-
inate between pending operations based on which register they are
operating on. These classes of adversaries are modeled by imposing
an equivalence relation on partial executions and insisting that the
adversary make the same choice of processes to go next in equivalent
situations. Typically they arise because somebody invented a consen-
sus protocol for the oblivious adversary below, and then looked for the
next most powerful adversary that still let the protocol work.
Weak adversaries often allow much faster consensus protocols than
adaptive adversaries. Each of the above adversaries permits consensus
to be achieved in O(log n) expected individual work using an appropri-
ate algorithm. But from a mathematical standpoint, weak adversaries
are a bit messy, and once you start combining algorithms designed for
different weak adversaries, it’s natural to move all the way down to
the weakest reasonable adversary, the oblivious adversary described
below.
3. An oblivious adversary has no ability to observe the system at all;
instead, it fixes a sequence of process ids in advance, and at each step
the next process in the sequence runs.
23.2 History
The use of randomization to solve consensus in an asynchronous system
with crash failures was proposed by Ben-Or [Ben-Or1983] for a message-
passing model. Chor, Israeli, and Li [CIL94] gave the first wait-free consen-
sus protocol for a shared-memory system, which assumed a particular kind
of weak adversary. Abrahamson [Abr88] defined strong and weak adver-
saries and gave the first wait-free consensus protocol for a strong adversary;
its expected step complexity was Θ(2^(n²)). After failing to show that expo-
nential time was necessary, Aspnes and Herlihy [AH90a] showed how to do
consensus in O(n⁴) total work, a value that was soon reduced to O(n² log n)
by Bracha and Rachman [BR91]. This remained the best known bound for
the strong-adversary model until Attiya and Censor [AC08] showed match-
ing Θ(n²) upper and lower bounds for the problem; subsequent work [AC09]
showed that it was also possible to get an O(n) bound on individual work.
For weak adversaries, the best known upper bound on individual step
complexity was O(log n) for a long time [Cha96, Aum97, Asp12b], with
an O(n) bound on total step complexity for some models [Asp12b]. More
recent work has lowered the bound to O(log log n), under the assumption of
an oblivious adversary [Asp12a]. No non-trivial lower bound on expected
individual step complexity is known, although there is a known lower bound
on the distribution of the individual step complexity [ACH10].
preference ← input
for r ← 1 . . . ∞ do
    (b, preference) ← AdoptCommit(AC[r], preference)
    if b = commit then
        return preference
    else
        do something to generate a new preference

Algorithm 23.1: Consensus using adopt-commit objects and conciliators
The idea is that the adopt-commit takes care of ensuring that once some-
body returns a value (after receiving commit), everybody else who doesn’t
return adopts the same value (follows from coherence). Conversely, if ev-
erybody already has the same value, everybody returns it (follows from
convergence). The only missing piece is the part where we try to shake all
the processes into agreement. For this we need a separate object called a
conciliator.
23.3.2 Conciliators
Conciliators are a weakened version of randomized consensus that replace
agreement with probabilistic agreement: it’s OK if the processes disagree
sometimes as long as they agree with constant probability despite interfer-
ence by the adversary. An algorithm that satisfies termination, validity, and
probabilistic agreement is called a conciliator.1
The important feature of conciliators is that if we plug a conciliator that
guarantees agreement with probability at least δ into Algorithm 23.1, then
on average we only have to execute the loop 1/δ times before every process
agrees. This gives an expected cost equal to 1/δ times the total cost of
AdoptCommit and the conciliator. Typically we will aim for constant δ.
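The expected-1/δ-rounds claim is easy to check numerically if we model each round's conciliator as a black box that produces agreement with probability δ (everything below is a made-up Monte-Carlo check, not part of any protocol):

```python
import random

# Rounds until agreement is geometric with parameter delta,
# so the mean should be close to 1/delta.
random.seed(1)
delta = 0.25
trials = 20000
total_rounds = 0
for _ in range(trials):
    rounds = 1
    while random.random() >= delta:   # conciliator failed this round
        rounds += 1
    total_rounds += rounds
avg = total_rounds / trials
assert abs(avg - 1 / delta) < 0.2     # close to 1/delta = 4
```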
and (maybe) writing to the register; if a process reads a non-null value from
the register, it returns it. Any other process that reads the same non-null
value will agree with the first process; the only way that this can’t happen
is if some process writes a different value to the register before it notices the
first write.
The random choice of whether to write the register or not avoids this
problem. The idea is that even though the adversary can schedule a write
at a particular time, because it’s oblivious, it won’t be able to tell if the
process wrote (or was about to write) or did a no-op instead.
The basic version of this algorithm, due to Chor, Israeli, and Li [CIL94],
uses a fixed probability 1/(2n) of writing to the register. So once some process
writes to the register, the chance that any of the remaining n − 1 processes
write to it before noticing that it's non-null is at most (n − 1)/(2n) < 1/2. It's also
not hard to see that this algorithm uses O(n) total operations, although it
may be that one single process running by itself has to go through the loop
2n times before it finally writes the register and escapes.
Using increasing probabilities avoids this problem, because any process
that executes the main loop dlg ne + 1 times will write the register. This
establishes the O(log n) per-process bound on operations. At the same time,
an O(n) bound on total operations still holds, since each write has at least
a 1/(2n) chance of succeeding. The price we pay for the improvement is that
we increase the chance that an initial value written to the register gets
overwritten by some high-probability write. But the intuition is that the
probabilities can’t grow too much, because the probability that I write on
my next write is close to the sum of the probabilities that I wrote on my
previous writes—suggesting that if I have a high probability of writing next
time, I should have done a write already.
Formalizing this intuition requires a little bit of work. Fix the schedule,
and let pi be the probability that the i-th write operation in this schedule
succeeds. Let t be the least value for which ∑_{i=1}^{t} p_i ≥ 1/4. We're going to
argue that with constant probability one of the first t writes succeeds, and
that the next n − 1 writes by different processes all fail.
The probability that none of the first t writes succeed is

    ∏_{i=1}^{t} (1 − p_i) ≤ ∏_{i=1}^{t} e^{−p_i} = exp(−∑_{i=1}^{t} p_i) ≤ e^{−1/4}.
Now observe that if some process q writes at or before the t-th write,
then any process with a pending write either did no writes previously, or its
last write was among the first t − 1 writes, whose probabilities sum to less
than 1/4. In the first case, the process has a 1/(2n) chance of writing on its
next attempt. In the second, it has a ∑_{i∈S_q} p_i + 1/(2n) chance of writing on its
next attempt, where S_q is the set of indices in 1 . . . t − 1 where q attempts
to write.
Summing up these probabilities over all processes gives a total of
(n − 1)/(2n) + ∑_q ∑_{i∈S_q} p_i ≤ 1/2 + 1/4 = 3/4. So with probability at least
e^{−1/4} · (1 − 3/4) = e^{−1/4}/4, we get agreement.
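A small simulation of a CIL-style conciliator makes the constant-probability agreement concrete. The sketch below is a simplification of the actual protocol: each process alternates reads of the single shared register with (on seeing ⊥) a 1/(2n)-probability decision to write on its next step, and the oblivious adversary is modeled by choosing the next process to move independently of all coin-flips.

```python
import random

def run_cil(n, rng):
    """One execution of a Chor-Israeli-Li-style conciliator (a sketch,
    not the exact pseudocode from [CIL94])."""
    inputs = list(range(n))
    r = None                       # the shared register; None plays the role of ⊥
    phase = ['read'] * n
    out = [None] * n
    while any(o is None for o in out):
        p = rng.randrange(n)       # oblivious: independent of protocol state
        if out[p] is not None:
            continue
        if phase[p] == 'read':
            if r is not None:
                out[p] = r         # adopt the value we saw
            elif rng.random() < 1 / (2 * n):
                phase[p] = 'write' # decided to write; the write is a later step
        else:
            r = inputs[p]          # the racing write
            out[p] = inputs[p]
    assert all(o in inputs for o in out)   # validity
    return len(set(out)) == 1              # did we agree?

rng = random.Random(0)
agree = sum(run_cil(8, rng) for _ in range(500))
assert agree / 500 > 0.25   # agreement with constant probability
```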
23.6 Sifters
[[[ This is at least two papers out of date: there is the faster
sifter of Giakkoupis and Woelfel, and the O(log n)-space TAS of
Giakkoupis, Helmi, Higham, and Woelfel from STOC 2015. ]]]
A faster conciliator can be obtained using a sifter, which is a mechanism
for rapidly discarding processes using randomization [AA11] while keeping
at least one process around. The idea of a sifter is to have each process either
write a register (with low probability) or read it (with high probability); all
writers and all readers that see ⊥ continue to the next stage of the protocol,
while all readers who see a non-null value drop out. An appropriately-
tuned sifter will reduce n processes to at most 2√n processes on average; by
iterating this mechanism, the expected number of remaining processes can
be reduced to 1 + ε after O(log log n + log(1/ε)) phases.
As with previous implementations of test-and-set (see Algorithm 22.3),
it’s often helpful to have a sifter return not only that a process lost but which
process it lost to. This gives the implementation shown in Algorithm 23.4.
procedure sifter(p, r)
    with probability p do
        r ← id
        return ⊥
    else
        return r

Algorithm 23.4: A sifter
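Simulating a single sifter round shows both guarantees at once: some process always survives, and with p ≈ 1/√n the average number of survivors is about 2√n. The schedule below is a uniformly random order, one legal choice for an oblivious adversary; all names are invented for the sketch.

```python
import random

def sift(n, p, rng):
    """One sifter round: each process, in schedule order, either writes
    its id (probability p) or reads; readers that see a non-null value
    are discarded."""
    r = None
    survivors = []
    order = list(range(n))
    rng.shuffle(order)
    for pid in order:
        if rng.random() < p:
            r = pid
            survivors.append(pid)      # writers always survive
        elif r is None:
            survivors.append(pid)      # readers seeing ⊥ survive
    return survivors

rng = random.Random(42)
n = 10000
p = 1 / n ** 0.5
counts = [len(sift(n, p, rng)) for _ in range(20)]
assert min(counts) >= 1                # at least one process always survives
avg = sum(counts) / len(counts)
assert avg <= 3 * n ** 0.5             # expected survivors ≈ 2·sqrt(n)
```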
processes that are likely to use it. This is because of the following lemma:

Lemma 23.6.1. Fix p, and suppose that X processes execute a sifter with
parameter p. Let Y be the number of processes for which the sifter returns
⊥. Then

    E[Y | X] ≤ pX + 1/p.    (23.6.1)
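Balancing the two terms of the lemma's bound explains where the 2√n figure above comes from; a one-line optimization (taking the bound (23.6.1) as given):

```latex
\operatorname{E}[Y \mid X] \le pX + \frac{1}{p},
\qquad
\frac{d}{dp}\left(pX + \frac{1}{p}\right) = X - \frac{1}{p^{2}} = 0
\;\Rightarrow\;
p = \frac{1}{\sqrt{X}},
\qquad
\operatorname{E}[Y \mid X] \le 2\sqrt{X}.
```

Iterating takes the expected number of survivors from X to roughly 2√X per round, which is what drives the O(log log n) round count.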
if gate ≠ ⊥ then
    return 1
else
    gate ← myId
for i ← 1 . . . ⌈log log n⌉ + ⌈log_{4/3}(7 log n)⌉ do
    with probability min(1/2, 2^{1−2^{−i+1}} · n^{−2^{−i}}) do
        r_i ← myId
    else
        w ← r_i
        if w ≠ ⊥ then
            return 1
Pseudocode for this algorithm is given in Algorithm 23.6. Note that the
loop body is essentially the same as the code in Algorithm 23.4, except that
the random choice is replaced by a lookup in persona.chooseWrite.
To show that this works, we need to argue that having multiple copies
of a persona around doesn’t change the behavior of the sifter. In each
round, we will call the first process with a given persona p to access ri
the representative of p, and argue that a persona survives round i in
this algorithm precisely when its representative would survive round i in
a corresponding test-and-set sifter with the schedule restricted only to the
representatives.
There are three cases:
1. The representative of p writes. Then at least one copy of p survives.
2. The representative of p reads a null value. Again at least one copy of
p survives.
3. The representative of p reads a non-null value. Then no copy of p
survives: all subsequent reads by processes carrying p also read a non-
null value and discard p, and since no process with p writes, no other
process adopts p.
procedure conciliator(input)
    Let R = ⌈log log n⌉ + ⌈log_{4/3}(7/ε)⌉
    Let chooseWrite be a vector of R independent random Boolean
      variables with Pr[chooseWrite[i] = 1] = p_i, where
      p_i = 2^{1−2^{−i+1}} · n^{−2^{−i}} for i ≤ ⌈log log n⌉ and p_i = 1/2 for larger i
    persona ← ⟨input, chooseWrite, myId⟩
    for i ← 1 . . . R do
        if persona.chooseWrite[i] = 1 then
            r_i ← persona
        else
            v ← r_i
            if v ≠ ⊥ then
                persona ← v
    return persona.input

Algorithm 23.6: Sifting conciliator (from [Asp12a])
From the preceding analysis for test-and-set, we have that after O(log log n +
log(1/ε)) rounds with appropriate probabilities of writing, at most 1 + ε values
survive on average. This gives a probability of at most ε of disagreement. By
alternating these conciliators with adopt-commit objects, we get agreement
in O(log log n + log m/ log log m) expected time, where m is the number of
possible input values.
I don’t think the O(log log n) part of this expression is optimal, but I
don’t know how to do better.
Proof. For the first part, observe that any process that picks the largest
value of r among all processes will survive; since the number of processes is
finite, there is at least one such survivor.
For the second part, let Xi be the number of survivors with r = i. Then
E [Xi ] is bounded by n · 2^{−i}, since no process survives with r = i without
first choosing r = i. But we can also argue that E [Xi ] ≤ 3 for any value of
n, by considering the sequence of write operations in the execution.
Because the adversary is oblivious, the location of these writes is uncor-
related with their ordering. If we assume that the adversary is trying to
maximize the number of survivors, its best strategy is to allow each process
to read immediately after writing, as delaying this read can only increase the
probability that A[r + 1] is nonzero. So in computing Xi , we are counting
the number of writes to A[i] before the first write to A[i + 1]. Let’s ignore
all writes to other registers; then the j-th write to either of A[i] or A[i + 1]
has a conditional probability of 2/3 of landing on A[i] and 1/3 on A[i + 1].
We are thus looking at a geometric distribution with parameter 1/3, which
has expectation 3.
because once n · 2^{−i} drops below 3, the remaining terms form a geometric
series.
it personally won or lost. It is not clear whether the techniques used for this
problem could carry across to consensus.
Chapter 24
Renaming
We will start by following the presentation in [AW04, §16.3]. This mostly de-
scribes results of the original paper of Attiya et al. [ABND+ 90] that defined
the renaming problem and gave a solution for message-passing; however, it’s
now more common to treat renaming in the context of shared-memory, so
we will follow Attiya and Welch’s translation of these results to a shared-
memory setting.
24.1 Renaming
In the renaming problem, we have n processes, each of which starts with a
name from some huge namespace, and we'd like to assign each of them a
unique name from a much smaller namespace.
algorithms that assume that the processes are given contiguous numbers,
e.g. the various collect or atomic snapshot algorithms in which each process
is assigned a unique register and we have to read all of the registers. With
renaming, instead of reading a huge pile of registers in order to find the few
that are actually used, we can map the processes down to a much smaller
set.
Formally, we have a decision problem where each process has input xi
(its original name) and output yi , with the requirements:
Uniqueness If p_i ≠ p_j, then y_i ≠ y_j.
Anonymity The code executed by any process depends only on its input
xi : for any execution of processes p1 . . . pn with inputs x1 . . . xn , and
24.2 Performance
Conventions on counting processes:
discussing lower bounds on the namespace follow the approach of Herlihy and
Shavit and quote lower bounds that are generally 2 higher than the minimum
number of names needed for n processes. This requires a certain amount of
translation when comparing these lower bounds with upper bounds, which
use the more natural convention.
procedure getName()
    s ← 1
    while true do
        a[i] ← s
        view ← snapshot(a)
        if view[j] = s for some j ≠ i then
            r ← |{j : view[j] ≠ ⊥ ∧ j ≤ i}|
            s ← r-th smallest positive integer not in
              {view[j] : j ≠ i ∧ view[j] ≠ ⊥}
        else
            return s
The array a holds proposed names for each process (indexed by the
original names), or ⊥ for processes that have not proposed a name yet. If a
process proposes a name and finds that no other process has proposed the
same name, it takes it; otherwise it chooses a new name by first computing
its rank r among the active processes and then choosing the r-th smallest
name that hasn’t been proposed by another process. Because the rank is at
most n and there are at most n − 1 names proposed by the other processes,
this always gives proposed names in the range [1 . . . 2n − 1]. But it remains
to show that the algorithm satisfies uniqueness and termination.
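A coarse-grained simulation (each call to getName runs atomically, which is one legal schedule) illustrates the uniqueness and range claims; the helper names below are invented for the sketch.

```python
def get_name(i, a):
    """One process's getName loop, run atomically for the demo.
    'a' maps original names to proposed names (None = no proposal yet)."""
    s = 1
    while True:
        a[i] = s
        view = dict(a)                       # atomic snapshot
        if any(j != i and view[j] == s for j in view):
            # rank among active processes, by original name
            r = len([j for j in view if view[j] is not None and j <= i])
            taken = {view[j] for j in view if j != i and view[j] is not None}
            # r-th smallest positive integer not proposed by anyone else
            s, count = 0, 0
            while count < r:
                s += 1
                if s not in taken:
                    count += 1
        else:
            return s

ids = [17, 42, 99]                           # original names, big namespace
a = {i: None for i in ids}
names = [get_name(i, a) for i in ids]
assert len(set(names)) == len(names)         # uniqueness
assert all(1 <= s <= 2 * len(ids) - 1 for s in names)   # range [1, 2n-1]
```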
For uniqueness, consider two processes with original names i and j. Sup-
pose that i and j both decide on s. Then i sees a view in which a[i] = s and
a[j] 6= s, after which it no longer updates a[i]. Similarly, j sees a view in
which a[j] = s and a[i] 6= s, after which it no longer updates a[j]. If i’s view
is obtained first, then j can’t see a[i] 6= s, but the same holds if j’s view is
procedure releaseName()
    a[i] ← ⊥

Algorithm 24.2: Releasing a name
24.4.3.1 Splitters
The Moir-Anderson renaming protocol uses a network of splitters, which
we last saw providing a fast path for mutual exclusion in §17.4.2. Each
splitter is a widget, built from a pair of atomic registers, that assigns to
each process that arrives at it the value right, down, or stop. As discussed
previously, the useful properties of splitters are that if at least one process
arrives at a splitter, then (a) at least one process returns right or stop; and
(b) at least one process returns down or stop; (c) at most one process returns
stop; and (d) any process that runs by itself returns stop.
We proved the last two properties in §17.4.2; we’ll prove the first two
here. Another way of describing these properties is that of all the processes
that arrive at a splitter, some process doesn’t go down and some process
doesn’t go right. By arranging splitters in a grid, this property guarantees
that every row or column that gets at least one process gets to keep it—which
means that with k processes, no process reaches row k + 1 or column k + 1.
Algorithm 24.3 gives the implementation of a splitter (it’s identical to
Algorithm 17.5, but it will be convenient to have another copy here).
Lemma 24.4.1. If at least one process completes the splitter, at least one
process returns stop or right.
Proof. Suppose no process returns right; then every process sees open in
door, which means that every process writes its id to race before any process
closes the door. Some process writes its id last: this process will see its own
id in race and return stop.
shared data:
    atomic register race, big enough to hold an id, initially ⊥
    atomic register door, big enough to hold a bit, initially open

procedure splitter(id)
    race ← id
    if door = closed then
        return right
    door ← closed
    if race = id then
        return stop
    else
        return down
Lemma 24.4.2. If at least one process completes the splitter, at least one
process returns stop or down.
Proof. First observe that if no process ever writes to door, then no process
completes the splitter, because the only way a process can finish the splitter
without writing to door is if it sees closed when it reads door (which must
have been written by some other process). So if at least one process finishes,
at least one process writes to door. Let p be any such process. From the
code, having written door, it has already passed up the chance to return
right; thus it either returns stop or down.
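Since a splitter has only a handful of atomic steps, we can exhaustively check Lemmas 24.4.1 and 24.4.2 (and the two properties proved earlier) for two processes by enumerating every interleaving. In the sketch below, each segment between yields is one atomic shared-memory operation.

```python
from itertools import combinations

def splitter_process(shared, pid, results):
    """Generator: each segment between yields is one atomic step."""
    shared['race'] = pid                  # race ← id
    yield
    if shared['door'] == 'closed':        # read door
        results[pid] = 'right'
        return
    yield
    shared['door'] = 'closed'             # door ← closed
    yield
    if shared['race'] == pid:             # read race
        results[pid] = 'stop'
    else:
        results[pid] = 'down'

def run(schedule):
    shared = {'race': None, 'door': 'open'}
    results = {}
    procs = {0: splitter_process(shared, 0, results),
             1: splitter_process(shared, 1, results)}
    for pid in schedule:
        try:
            next(procs[pid])
        except StopIteration:
            pass
    return results

# every interleaving of two processes (5 steps each is more than enough)
for a_positions in combinations(range(10), 5):
    schedule = [0 if i in a_positions else 1 for i in range(10)]
    res = run(schedule)
    assert len(res) == 2                                   # both complete
    assert list(res.values()).count('stop') <= 1           # at most one stop
    assert any(v in ('stop', 'right') for v in res.values())
    assert any(v in ('stop', 'down') for v in res.values())

assert run([0] * 5)[0] == 'stop'   # a process running alone stops
```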
(see Figure 24.2, also taken from [Asp10]). Each splitter on this path must
handle at least two processes (or p would have stopped at that splitter, by
Lemma 17.4.4). So some other process leaves on the other output wire, either
right or down. If we draw a path from each of these wires that continues right
or down to the end of the grid, then along each of these m disjoint paths
either some splitter stops a process, or some process reaches a final output
wire, each of which is at a distinct splitter. But this gives m processes in
addition to p, for a total of m + 1 processes. It follows that:
parators that compare two values coming in from the left and swap the
larger value to the bottom. A network of comparators is a sorting network
if the sequence of output values is always sorted no matter what the order
of values on the inputs is.
The depth of a sorting network is the maximum number of comparators
on any path from an input to an output. The width is the number of wires;
equivalently, the number of values the network can sort. The sorting network
in Figure 24.3 has depth 3 and width 4.
Explicit constructions of sorting networks with width n and depth O(log² n)
are known [Bat68]. It is also known that sorting networks with depth
O(log n) exist [AKS83], but no explicit construction of such a network is
known.
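As a concrete example, here is a width-4, depth-3 network (Batcher's even-odd merge for four inputs, which may or may not be the exact network in Figure 24.3), together with a brute-force correctness check:

```python
from itertools import permutations

NETWORK = [[(0, 1), (2, 3)],   # depth 1
           [(0, 2), (1, 3)],   # depth 2
           [(1, 2)]]           # depth 3

def run_network(layers, values):
    values = list(values)
    for layer in layers:
        for i, j in layer:      # comparators within a layer act in parallel
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
    return values

# the 0-1 principle says checking all 0-1 inputs suffices, but with
# width 4 we can simply check every permutation
assert all(run_network(NETWORK, p) == sorted(p)
           for p in permutations(range(4)))
```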
we have now moved from 1/c to 1/(2c²). The 2 gives us some room to reduce
the number of names in the next round, to cn/2, say, while still keeping a
1/c² ratio of survivors to names.
So the actual renaming algorithm consists of allocating cn/2^i names to
round i, and squaring the ratio of survivors to names in each round. It only
takes O(log log n) rounds to knock the ratio of survivors to names below 1/n,
so at this point it is likely that all processes will have finished. At the same
time, the sum over all rounds of the allocated names forms a geometric
series, so only O(n) names are needed altogether.
Swept under the carpet here is a lot of careful analysis of the probabili-
ties. Unlike what happens with sifters (see §23.6), Jensen’s inequality goes
the wrong way here, so some additional technical tricks are needed (see the
paper for details). But the result is that only O(log log n) rounds are needed
to assign every process a name with high probability, which is the best value
currently known.
There is a rather weak lower bound in the Alistarh et al. paper that
shows that Ω(log log n) steps are needed for some process in the worst case,
under the assumption that the renaming algorithm uses only test-and-set
objects and that a process acquires a name as soon as it wins some test-and-
set object. This does not give a lower bound on the problem in general, and
indeed the renaming-network based algorithms discussed previously do not
have this property. So the question of the exact complexity of randomized
loose renaming is still open.
Chapter 25
Software transactional
memory
starvation.
25.1 Motivation
Some selling points for software transactional memory:
1. We get atomic operations without having to use our brains much.
Unlike hand-coded atomic snapshots, counters, queues, etc., we have
a universal construction that converts any sequential data structure
built on top of ordinary memory into a concurrent data structure.
This is useful since most programmers don’t have very big brains. We
also avoid burdening the programmer with having to remember to lock
things.
2. We can build large shared data structures with the possibility of con-
current access. For example, we can implement atomic snapshots so
that concurrent updates don’t interfere with each other, or an atomic
queue where enqueues and dequeues can happen concurrently so long
as the queue always has a few elements in it to separate the enqueuers
and dequeuers.
register only if our id is still attached, and clears any other id’s that might
also be attached. It’s easy to build a 1-register CAS (CAS1) out of this,
though Shavit and Touitou exploit some additional power of LL/SC.
if LL(status) = ⊥ then
    if LL(r) = oldValue then
        if SC(status, ⊥) = true then
            SC(r, newValue)
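Here is a sequential Python sketch of LL/SC and of building CAS1 from it, as the text suggests. The version counter stands in for the hardware's link flag; since nothing interleaves in this sketch, the spurious-failure cases of real LL/SC do not show up.

```python
class LLSC:
    """Sequential sketch of a load-linked/store-conditional register:
    SC succeeds only if no successful SC intervened since this
    process's LL."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.linked = {}                 # pid -> version seen at LL

    def ll(self, pid):
        self.linked[pid] = self.version
        return self.value

    def sc(self, pid, new):
        if self.linked.get(pid) == self.version:
            self.value = new
            self.version += 1
            return True
        return False

def cas1(reg, pid, old, new):
    """One-register CAS built from LL/SC."""
    if reg.ll(pid) == old:
        return reg.sc(pid, new)
    return False

r = LLSC(0)
assert cas1(r, 'p', 0, 1)        # succeeds: value matches
assert not cas1(r, 'q', 0, 2)    # fails: value is now 1
assert r.value == 1
```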
transaction, who will be the only process working on the transaction until
it starts acquiring locks.
1. Initialize the record rec for the transaction. (Only the initiator does
this.)
Note that only an initiator helps; this avoids a long chain of helping
and limits the cost of each attempted transaction to the cost of doing two
full transactions, while (as shown below) still allowing some transaction to
finish.
25.4 Improvements
One downside of the Shavit and Touitou protocol is that it uses LL/SC
very aggressively (e.g. with overlapping LL/SC operations) and uses non-
trivial (though bounded, if you ignore the ever-increasing version numbers)
amounts of extra space. Subsequent work has aimed at knocking these down;
for example a paper by Harris, Fraser, and Pratt [HFP02] builds multi-
register CAS out of single-register CAS with O(1) extra bits per register.
The proofs of these later results can be quite involved; Harris et al., for
example, base their algorithm on an implementation of 2-register CAS whose
correctness has been verified only by machine (which may be a plus in some
views).
25.5 Limitations
There has been a lot of practical work on STM designed to reduce over-
head on real hardware, but there’s still a fair bit of overhead. On the
theory side, a lower bound of Attiya, Hillel, and Milani [AHM09] shows that
any STM system that guarantees non-interference between non-overlapping
RMW transactions has the undesirable property of making read-only trans-
actions as expensive as RMW transactions: this conflicts with the stated
goals of many practical STM implementations, where it is assumed that
most transactions will be read-only (and hopefully cheap). So there is quite
a bit of continuing research on finding the right trade-offs.
Chapter 26
Obstruction-freedom
26.2 Examples
26.2.1 Lock-free implementations
Pretty much anything built using compare-and-swap or LL/SC ends up
being lock-free. A simple example would be a counter, where an increment
operation does
x ← LL(C)
SC(C, x + 1)
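For a counter whose increments must all land, the SC has to be retried when it fails (which is what makes the operation lock-free rather than wait-free). A threaded Python sketch, with the LL/SC register's individual operations made atomic by a lock purely for the simulation:

```python
import threading

class LLSC:
    """LL/SC register; a lock makes each ll/sc step atomic for the demo."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self.linked = {}
        self.lock = threading.Lock()
    def ll(self, pid):
        with self.lock:
            self.linked[pid] = self.version
            return self.value
    def sc(self, pid, new):
        with self.lock:
            if self.linked.get(pid) == self.version:
                self.value, self.version = new, self.version + 1
                return True
            return False

C = LLSC(0)

def increment(pid, times):
    for _ in range(times):
        while True:               # retry until our SC lands
            x = C.ll(pid)
            if C.sc(pid, x + 1):
                break

threads = [threading.Thread(target=increment, args=(i, 1000))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert C.value == 4000            # no increment is ever lost
```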
x ← 0
while true do
    δ ← x − a[1 − i]
    if δ = 2 (mod 5) then
        return 0
    else if δ = −1 (mod 5) then
        return 1
    else
        x ← (x + 1) mod 5
        a[i] ← x

Algorithm 26.1: Obstruction-free test-and-set for two processes
case, you return 1 immediately; in the former, you return after one more
increment (and more importantly, you can’t return 0). Alternatively, if I
ever observe δ = −1, your next read will show you either δ = 1 or δ = 2;
in either case, you will eventually return 0. (We chose 5 as a modulus
because this is the smallest value that makes the cases δ = 2 and δ = −2
distinguishable.)
We can even show that this is linearizable, by considering a solo execution
in which the lone process takes two steps and returns 0 (with two processes,
solo executions are the only interesting case for linearizability).
However, Algorithm 26.1 is not wait-free or even lock-free: if both pro-
cesses run in lockstep, they will see δ = 0 forever. But it is obstruction-free.
If I run by myself, then whatever value of δ I start with, I will see −1 or 2
after at most 6 operations.1
This gives an obstruction-free step complexity of 6, where the
obstruction-free step complexity is defined as the maximum number of op-
erations any process can take after all other processes stop. Note that our
usual wait-free measures of step complexity don’t make a lot of sense for
obstruction-free algorithms, as we can expect a sufficiently cruel adversary
to be able to run them up to whatever value he likes.
Building a tree of these objects as in §22.2 gives n-process test-and-set
with obstruction-free step complexity O(log n).
1 The worst case is where an increment by my fellow process leaves δ = −1 just before my increment.
1 procedure rightPush(v)
2     while true do
3         k ← oracle(right)
4         prev ← a[k − 1]
5         next ← a[k]
6         if prev.value ≠ RN and next.value = RN then
7             if CAS(a[k − 1], prev, [prev.value, prev.version + 1]) then
8                 if CAS(a[k], next, [v, next.version + 1]) then
9                     we win, go home
10 procedure rightPop()
11     while true do
12         k ← oracle(right)
13         cur ← a[k − 1]
14         next ← a[k]
15         if cur.value ≠ RN and next.value = RN then
16             if cur.value = LN and a[k − 1] = cur then
17                 return empty
18             else if CAS(a[k], next, [RN, next.version + 1]) then
19                 if CAS(a[k − 1], cur, [RN, cur.version + 1]) then
20                     return cur.value
previous values (top, RN) with (top, value) in rightPush or (top, value) with
(top, RN) in rightPop; in either case the operation preserves the invariant.
So the only way we get into trouble is if, for example, a rightPush does a
CAS on a[k −1] (verifying that it is unmodified and incrementing the version
number), but then some other operation changes a[k − 1] before the CAS on
a[k]. If this other operation is also a rightPush, we are happy, because it
must have the same value for k (otherwise it would have failed when it saw
a non-null in a[k − 1]), and only one of the two right-pushes will succeed
in applying the CAS to a[k]. If the other operation is a rightPop, then it
can only change a[k − 1] after updating a[k]; but in this case the update to
a[k] prevents the original right-push from changing a[k]. With some more
tedious effort we can similarly show that any interference from leftPush or
leftPop either causes the interfering operation or the original operation to
fail. This covers 4 of the 16 cases we need to consider. The remaining cases
will be brushed under the carpet to avoid further suffering.
steps without you doing anything, I can reasonably conclude that you are
dead—the semisynchrony assumption thus acts as a failure detector.
The fact that R is unknown might seem to be an impediment to using
this failure detector, but we can get around this. The idea is to start with
a small guess for R; if a process is suspected but then wakes up again, we
increment the guess. Eventually, the guessed value is larger than the correct
value, so no live process will be falsely suspected after this point. Formally,
this gives an eventually perfect (♦P ) failure detector, although the algorithm
does not specifically use the failure detector abstraction.
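The adaptive-guess idea can be sketched directly. In this hypothetical interface (the class and method names are mine), `guess` plays the role of the unknown bound R: a process silent for more than `guess` of our own steps is suspected, and a false suspicion increments the guess, so eventually no live process is falsely suspected.

```python
class EventuallyPerfectFD:
    """Sketch of the eventually-perfect (diamond-P) construction above."""

    def __init__(self, processes):
        self.guess = 1                         # small initial guess for R
        self.silent_for = {p: 0 for p in processes}
        self.suspected = set()

    def observe_activity(self, p):
        self.silent_for[p] = 0
        if p in self.suspected:                # p "woke up again": false suspicion
            self.suspected.discard(p)
            self.guess += 1                    # increment the guess

    def tick(self):                            # one step of our own execution
        for p in self.silent_for:
            self.silent_for[p] += 1
            if self.silent_for[p] > self.guess:
                self.suspected.add(p)
```

Once `guess` exceeds the true bound, every live process is observed within the window and suspicion becomes permanent only for crashed processes.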
To arrange for a solo execution, when a process detects a conflict (be-
cause its operation didn’t finish quickly), it enters into a “panic mode” where
processes take turns trying to finish unmolested. A fetch-and-increment reg-
ister is used as a timestamp generator, and only the process with the smallest
timestamp gets to proceed. However, if this process is too sluggish, other
processes may give up and overwrite its low timestamp with ∞, temporarily
ending its turn. If the sluggish process is in fact alive, it can restore its low
timestamp and kill everybody else, allowing it to make progress until some
other process declares it dead again.
The simulation works because eventually the mechanism for detecting
dead processes stops suspecting live ones (using the technique described
above), so the live process with the winning timestamp finishes its operation
without interference. This allows the next process to proceed, and eventually
all live processes complete any operation they start, giving the wait-free
property.
The actual code is in Algorithm 26.3. It’s a rather long algorithm but
most of the details are just bookkeeping.
The preamble before entering PANIC mode is a fast-path computation
that allows a process that actually is running in isolation to skip testing
any timestamps or doing any extra work (except for the one register read of
PANIC). The assumption is that the constant B is set high enough that any
process generally will finish its operation in B steps without interference. If
there is interference, then the timestamp-based mechanism kicks in: we grab
a timestamp out of the convenient fetch-and-add register and start slugging
it out with the other processes.
(A side note: while the algorithm as presented in the paper assumes a
fetch-and-add register, any timestamp generator that delivers increasing val-
ues over time will work. So if we want to limit ourselves to atomic registers,
we could generate timestamps by taking snapshots of previous timestamps,
Stockmeyer [DLS88].
1 if ¬PANIC then
2     execute up to B steps of the underlying algorithm
3     if we are done then return
4 PANIC ← true // enter panic mode
5 myTimestamp ← fetchAndIncrement()
6 A[i] ← 1 // reset my activity counter
7 while true do
8     T[i] ← myTimestamp
9     minTimestamp ← myTimestamp; winner ← i
10     for j ← 1 . . . n, j ≠ i do
11         otherTimestamp ← T[j]
12         if otherTimestamp < minTimestamp then
13             T[winner] ← ∞ // not looking so winning any more
14             minTimestamp ← otherTimestamp; winner ← j
15         else if otherTimestamp < ∞ then
16             T[j] ← ∞
17     if i = winner then
18         repeat
19             execute up to B steps of the underlying algorithm
20             if we are done then
21                 T[i] ← ∞
22                 PANIC ← false
23                 return
24             else
25                 A[i] ← A[i] + 1
26                 PANIC ← true
27         until T[i] = ∞
28     repeat
29         a ← A[winner]
30         wait a steps
31         winnerTimestamp ← T[winner]
32     until a = A[winner] or winnerTimestamp ≠ minTimestamp
33     if winnerTimestamp = minTimestamp then
34         T[winner] ← ∞ // kill winner for inactivity
The next step is to show that if there is some process i with a minimum
timestamp that executes infinitely many operations, it increments A[i] in-
finitely often (thus eventually making the failure detector stop suspecting
it). This gives us Lemma 2 from the paper:
Lemma 26.3.2 ([FLMS05, Lemma 2]). Consider the set of all processes that
execute infinitely many operations without completing an operation. Suppose
this set is non-empty, and let i hold the minimum timestamp of all these
processes. Then i is not active infinitely often.
Proof. Suppose that from some time on, i is active forever, i.e., it never
leaves the active loop. Then T [i] < ∞ throughout this interval (or else i
leaves the loop), so for any active j, T[j] = ∞ by the preceding lemma. It
follows that any active j leaves the active loop after B + O(1) steps of j
(and thus at most R(B + O(1)) steps of i). Can j re-enter? If j’s timestamp
is less than i’s, then j will set T [i] = ∞, contradicting our assumption. But
if j’s timestamp is greater than i’s, j will not decide it’s the winner and
will not re-enter the active loop. So now we have i alone in the active loop.
It may still be fighting with processes in the initial fast path, but since i
sets PANIC every time it goes through the loop, and no other process resets
PANIC (since no other process is active), no process enters the fast path after
some bounded number of i’s steps, and every process in the fast path leaves
after at most R(B + O(1)) of i’s steps. So eventually i is in the loop alone
forever—and obstruction-freedom means that it finishes its operation and
leaves. This contradicts our initial assumption that i is active forever.
So now we want to argue that our previous assumption that there exists
a bad process that runs forever without winning leads to a contradiction,
by showing that the particular i from Lemma 26.3.2 actually finishes (note
that Lemma 26.3.2 doesn’t quite do this—we only show that i finishes if it
stays active long enough, but maybe it doesn’t stay active).
Suppose i is as in Lemma 26.3.2. Then i leaves the active loop infinitely
often. So in particular it increments A[i] infinitely often. After some finite
number of steps, A[i] exceeds the limit R(B +O(1)) on how many steps some
other process can take between increments of A[i]. For each other process j,
either j has a lower timestamp than i, and thus finishes in a finite number of
steps (from the premise of the choice of i), or j has a higher timestamp than
i. Once we have cleared out all the lower-timestamp processes, we follow the
same logic as in the proof of Lemma 26.3.2 to show that eventually (a) i sets
T [i] < ∞ and PANIC = true, (b) each remaining j observes T [i] < ∞ and
PANIC = true and reaches the waiting loop, (c) all such j wait long enough
(since A[i] is now very big) that i can finish its operation. This contradicts
the assumption that i never finishes the operation and completes the proof.
26.3.1 Cost
If the parameters are badly tuned, the potential cost of this construction is
quite bad. For example, the slow increment process for A[i] means that the
time a process spends in the active loop even after it has defeated all other
processes can be as much as the square of the time it would normally take
to complete an operation alone—and every other process may pay R times
this cost waiting. This can be mitigated to some extent by setting B high
enough that a winning process is likely to finish in its first unmolested pass
through the loop (recall that it doesn’t detect that the other processes have
reset T [i] until after it makes its attempt to finish). An alternative might
be to double A[i] instead of incrementing it at each pass through the loop.
However, it is worth noting (as the authors do in the paper) that nothing
prevents the underlying algorithm from incorporating its own contention
management scheme to ensure that most operations complete in B steps
and PANIC mode is rarely entered. So we can think of the real function of
the construction as serving as a backstop to some more efficient heuristic
approach that doesn’t necessarily guarantee wait-free behavior in the worst
case.
26.4.1 Contention
A limitation of real shared-memory systems is that physics generally won’t
permit more than one process to do something useful to a shared object
at a time. This limitation is often ignored in computing the complexity of
a shared-memory distributed algorithm (and one can make arguments for
ignoring it in systems where communication costs dominate update costs
in the shared-memory implementation), but it is useful to recognize it if we
can’t prove lower bounds otherwise. Complexity measures that take the cost
of simultaneous access into account go by the name of contention.
4 The result first appeared in FOCS in 2005 [FHS05], with a small but easily fixed bug in the definition of the class of objects the proof applies to. We'll use the corrected definition from the journal version.
1. φ is an instance of Op executed by p,
2. no operation in A or A0 is executed by p,
then there exists a sequence of operations Q by q such that for every sequence
HφH 0 where
1. p does nothing in E,
So this definition includes both the fact that p incurs k stalls and some
other technical details that make the proof go through. The fact that p
incurs k stalls follows from observing that it incurs |Sj | stalls in each segment
σj , since all processes in Sj access Oj just before p does.
Note that the empty execution is a 0-stall execution (with i = 0) by the
definition. This shows that a k-stall execution exists for some k.
Note also that the weird condition is pretty strong: it claims not only
that there are no non-trivial operations on O1 . . . Oi in τ, but also that there
are no non-trivial operations on any objects accessed in σ1 . . . σi , which may
include many more objects accessed by p.6
We’ll now show that if a k-stall execution exists, for k ≤ n − 2, then
a (k + k 0 )-stall execution exists for some k 0 > 0. Iterating this process
eventually produces an (n − 1)-stall execution.
Start with some k-stall execution Eσ1 . . . σi . Extend this execution by
a sequence of operations σ in which p runs in isolation until it finishes its
operation φ (which it may start in σ if it hasn’t done so already), then each
process in S runs in isolation until it completes its operation. Now linearize
6 And here is where I screwed up in class on 2011-11-14, by writing the condition as the weaker requirement that nobody touches O1 . . . Oi.
if our previous choice was in fact maximal, the weird condition still holds,
and we have just constructed a (k + k 0 )-stall execution. This concludes the
proof.
26.4.4 Consequences
We’ve just shown that counters and snapshots have (n − 1)-stall executions,
because they are in the class G. A further, rather messy argument (given
in the Ellen et al. paper) extends the result to stacks and queues, obtaining
a slightly weaker bound of n total stalls and operations for some process in
the worst case.7 In both cases, we can’t expect to get a sublinear worst-case
bound on time under the reasonable assumption that both a memory stall
and an actual operation take at least one time unit. This puts an inherent
bound on how well we can handle hot spots for many practical objects, and
means that in an asynchronous system, we can’t solve contention at the
object level in the worst case (though we may be able to avoid it in our
applications).
But there might be a way out for some restricted classes of objects. We
saw in Chapter 21 that we could escape from the Jayanti-Tan-Toueg [JTT00]
lower bound by considering bounded objects. Something similar may hap-
pen here: the Fich-Herlihy-Shavit bound on fetch-and-increments requires
executions with n(n − 1)d + n increments to show n − 1 stalls for some fetch-
and-increment if each fetch-and-increment only touches d objects, and even
for d = log n this is already superpolynomial. The max-register construction
of a counter [AAC09] doesn’t help here, since everybody hits the switch bit
at the top of the max register, giving n − 1 stalls if they all hit it at the
same time. But there might be some better construction that avoids this.
Chapter 27
BG simulation
nate only if there are no failures by any process during an initial unsafe
section of its execution. Each process i starts the agreement protocol with a
proposei (v) event for its input value v. At some point during the execution
of the protocol, the process receives a notification safei , followed later (if
the protocol finishes) by a second notification agreei (v 0 ) for some output
value v 0 . It is guaranteed that the protocol terminates as long as all pro-
cesses continue to take steps until they receive the safe notification, and
that the usual validity (all outputs equal some input) and agreement (all
outputs equal each other) conditions hold. There is also a wait-free progress
condition that the safei notices do eventually arrive for any process that
doesn’t fail, no matter what the other processes do (so nobody gets stuck
in their unsafe section).
Pseudocode for a safe agreement object is given in Algorithm 27.1. This
is a translation of the description of the algorithm in [BGLR01], which is
specified at a lower level using I/O automata.
// proposei(v)
1 A[i] ← ⟨v, 1⟩
2 if snapshot(A) contains ⟨j, 2⟩ for some j ≠ i then
      // Back off
3     A[i] ← ⟨v, 0⟩
4 else
      // Advance
5     A[i] ← ⟨v, 2⟩
// safei
6 repeat
7     s ← snapshot(A)
8 until s does not contain ⟨j, 1⟩ for any j
// agreei
9 return s[j].value where j is smallest index with s[j].level = 2
Algorithm 27.1: Safe agreement (adapted from [BGLR01])
The safei transition occurs when the process leaves level 1 (no matter
which way it goes). This satisfies the progress condition, since there is
no loop before this, and guarantees termination if all processes leave their
unsafe interval, because no process can then wait forever for the last 1 to
disappear.
To show agreement, observe that at least one process advances to level 2
(because the only way a process doesn’t is if some other process has already
advanced to level 2), so any process i that terminates observes a snapshot
s that contains at least one level-2 tuple and no level-1 tuples. This means
that any process j whose value is not already at level 2 in s can at worst
reach level 1 after s is taken. But then j sees a level-2 tuple and backs
off. It follows that any other process i0 that takes a later snapshot s0 that
includes no level-1 tuples sees the same level-2 tuples as i, and computes the
same return value. (Validity also holds, for the usual trivial reasons.)
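As a sanity check, here is a sequential Python rendering of Algorithm 27.1. The class and method names are mine; "snapshot" is just a copy of the array, and the safe/agree phases are merged into one method, so this only exercises executions in which every process has already left its unsafe section.

```python
class SafeAgreement:
    """Sequential rendering of Algorithm 27.1 (names are hypothetical)."""

    def __init__(self, n):
        self.A = [(None, 0) for _ in range(n)]   # (value, level) per process

    def propose(self, i, v):
        # propose_i(v): the unsafe section.
        self.A[i] = (v, 1)
        snap = list(self.A)                      # snapshot(A)
        if any(level == 2 for j, (_, level) in enumerate(snap) if j != i):
            self.A[i] = (v, 0)                   # back off
        else:
            self.A[i] = (v, 2)                   # advance

    def agree(self, i):
        # safe_i occurs when i leaves level 1; agree_i additionally waits
        # until nobody is at level 1.  Here we just assert that instead.
        snap = list(self.A)
        assert not any(level == 1 for (_, level) in snap), "still unsafe"
        j = min(j for j, (_, level) in enumerate(snap) if level == 2)
        return snap[j][0]
```

The first proposer advances to level 2; any later proposer that sees a level-2 entry backs off, so every agree call returns the value at the smallest level-2 index.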
1. The process makes an initial guess for sjr by taking a snapshot of A and
taking the value with the largest round number for each component
A[−][k].
2. The process initiates the safe agreement protocol Sjr using this guess.
It continues to run Sjr until it leaves the unsafe interval.
1 The underlying assumption is that all simulated processes alternate between taking snapshots and doing updates. This assumption is not very restrictive, because two snapshots with no intervening update are equivalent to two snapshots separated by an update that doesn't change anything, and two updates with no intervening snapshot can be replaced by just the second update, since the adversary could choose to schedule them back-to-back anyway.
4. If Sjr terminates, the process computes a new value vjr for j to write
based on the simulated snapshot returned by Sjr, and updates A[i][j]
with ⟨vjr, r⟩.
input, after each i proposes its own input vector for all j based on its own
input to the simulator protocol. For outputs, i waits for at least n − t of the
simulated processes to finish, and computes its own output based on what
it sees.
One issue that arises here is that we can only use the simulation to
solve colorless tasks, which are decision problems where any process can
take the output of any other process without causing trouble.2 This works
for consensus or k-set agreement, but fails pretty badly for renaming. The
extended BG simulation, due to Gafni [Gaf09], solves this problem by
mapping each simulating process p to a specific simulated process qp , and
using a more sophisticated simulation algorithm to guarantee that qp doesn’t
crash unless p does. Details can be found in Gafni’s paper; there is also a
later paper by Imbs and Raynal [IR09] that simplifies some details of the
construction. Here, we will limit ourselves to the basic BG simulation.
before this snapshot, since the s-th write operation by k will be represented
in the snapshot if and only if the first instance of the s-th write operation
by k occurs before it. The only tricky bit is that process i’s snapshot for
Sjr might include some operations that can’t possibly be included in Sjr ,
like j’s round-r write or some other operation that depends on it. But this
can only occur if some other process finished Sjr before process i takes its
snapshot, in which case i’s snapshot will not win Sjr and will be discarded.
Chapter 28
Topological methods
to obtain insight into how topological techniques might help for other prob-
lems. The advantage is that (unlike these notes) the resulting text includes
actual proofs instead of handwaving.
Example: For 2-process binary consensus with processes 0 and 1, the in-
put complex, which describes all possible combinations of inputs, consists
of the sets
{{}, {p0}, {q0}, {p1}, {q1}, {p0, q0}, {p0, q1}, {p1, q0}, {p1, q1}} ,
As a picture, this omits two of the edges (1-dimensional simplexes) from the
input complex:
(Picture: the output complex, consisting of the two disjoint edges p0–q0 and p1–q1.)
One thing to notice about this output complex is that it is not con-
nected: there is no path from the p0–q0 component to the q1–p1 compo-
nent.
Here is a simplicial complex describing the possible states of two pro-
cesses p and q, after each writes 1 to its own bit then reads the other process’s
bit. Each node in the picture is labeled by a sequence of process ids. The
first id in the sequence is the process whose view this node represents; any
other process ids are processes this first process sees (by seeing a 1 in the
other process’s register). So p is the view of process p running by itself,
while pq is the view of process p running in an execution where it reads q’s
register after q writes it.
(Picture: the path p — qp — pq — q.)
The edges express the constraint that if we both write before we read,
then if I don’t see your value you must see mine (which is why there is no
p–q edge), but all other combinations are possible. Note that this complex
is connected: there is a path between any two points.
Here’s a fancier version in which each process writes its input (and re-
members it), then reads the other process’s register (i.e., a one-round full-
information protocol). We now have final states that include the process’s
own id and input first, then the other process’s id and input if it is visible.
For example, p1 means p starts with 1 but sees a null and q0p1 means q
starts with 0 but sees p’s 1. The general rule is that two states are compat-
ible if p either sees nothing or q’s actual input and similarly for q, and that
at least one of p or q must see the other’s input. This gives the following
simplicial complex:
(Picture: the resulting complex, a cycle in which each input edge is subdivided into three edges; its vertices are p0, q0p0, p0q0, q0, q1p0, p1q0, p0q1, q0p1, q1, p1q1, q1p1, and p1.)
(Pictures: the edge p — q, and its subdivision p — qp — pq — q.)
Here (pq)(qp) is the view of p after seeing pq in the first round and seeing
that q saw qp in the first round.
28.3.2 Subdivisions
In the simple write-then-read protocol above, we saw a single input edge turn
into 3 edges. Topologically, this is an example of a subdivision, where we
represent a simplex using several new simplexes pasted together that cover
exactly the same points.
Certain classes of protocols naturally yield subdivisions of the input
complex. The iterated immediate snapshot (IIS) model, defined by
Borowsky and Gafni [BG97], considers executions made up of a sequence
of rounds (the iterated part) where each round is made up of one or more
mini-rounds in which some subset of the processes all write out their cur-
rent views to their own registers and then take snapshots of all the registers
(the immediate snapshot part). The two-process protocols of the previous
section are special cases of this model.
Within each round, each process p obtains a view vp that contains the
previous-round views of some subset of the processes. We can represent the
views as a subset of the processes, which we will abbreviate in pictures by
putting the view owner first: pqr will be the view {p, q, r} as seen by p, while
qpr will be the same view as seen by q. The requirements on these views
are that (a) every process sees its own previous view: p ∈ vp for all p; (b)
all views are comparable: vp ⊆ vq or vq ⊆ vp ; and (c) if I see you, then I see
everything you see: q ∈ vp implies vq ⊆ vp . This last requirement is called
immediacy and follows from the assumption that writes and snapshots are
done in the same mini-round: if I see your write, then your snapshot takes
place no later than mine does.
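The three conditions on views are easy to state as executable checks. A small Python validator (the function name and the dict-of-sets representation are mine, not the notes'):

```python
def is_valid_iis_round(views):
    """Check one IIS round: views maps each process to the set of
    processes it sees (its view, represented as a set of ids)."""
    for p, vp in views.items():
        if p not in vp:                          # (a) self-inclusion
            return False
        for q in vp:
            if not views[q] <= vp:               # (c) immediacy
                return False
    for vp in views.values():
        for vq in views.values():
            if not (vp <= vq or vq <= vp):       # (b) comparability
                return False
    return True
```

For example, the views {p}, {p, q}, {p, q, r} for p, q, r satisfy all three conditions, while giving p the view {p, q} when q's own view is {p, q, r} violates immediacy: p sees q without seeing everything q sees.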
The IIS model does not correspond exactly to a standard shared-memory
model (or even a standard shared-memory model augmented with cheap
snapshots). There are two reasons for this: standard snapshots don’t provide
(Pictures: two copies of the subdivided triangle on processes p, q, r, with vertices labeled by views such as pq, qp, pr, rp, qr, rq, pqr, qpr, and rpq; and a Sperner coloring of a subdivided triangle using the colors 1, 2, 3.)
We now run into Sperner’s Lemma [Spe28], which says that, for any
subdivision of a simplex into smaller simplexes, if each corner of the original
simplex has a different color, and each corner that appears on some face of
the original simplex has a color equal to the color of one of the corners of
that face, then within the subdivision there are an odd number of simplexes
whose corners are all colored differently.3
How this applies to k-set agreement: Suppose we have n = k+1 processes
in a wait-free system (corresponding to allowing up to k failures). With
the cooperation of the adversary, we can restrict ourselves to executions
consisting of ` rounds of iterated immediate snapshot for some ` (termination
comes in here to show that ` is finite). This gives a subdivision of a simplex,
where each little simplex corresponds to some particular execution and each
corner some process’s view. Color all the corners of the little simplexes in
this subdivision with the output of the process holding the corresponding
view. Validity means that these colors satisfy the requirements of Sperner’s
Lemma. Sperner’s Lemma then says that some little simplex has all k + 1
colors, giving us a bad execution with more than k distinct output values.
The general result says that we can’t do k-set agreement with k failures
for any n > k. We haven’t proved this result, but it can be obtained from
the n = k + 1 version using a simulation of k + 1 processes with k failures
by n processes with k failures due to Borowsky and Gafni [BG93].
One thing we could conclude from the fact that the output complex for
consensus was not connected but the ones describing our simple protocols
were was that we can’t solve consensus (non-trivially) using these protocols.
The reason is that to solve consensus using such a protocol, we would need
to have a mapping from states to outputs (this is just whatever rule tells
each process what to decide in each state) with the property that if some
collection of states are consistent, then the outputs they are mapped to are
consistent.
In simplicial complex terms, this means that the mapping from states
to outputs is a simplicial map, a function f from points in one simplicial
complex C to points in another simplicial complex D such that for any
simplex A ∈ C, f (A) = {f (x)|x ∈ A} gives a simplex in D. (Recall that
consistency is represented by including a simplex, in both the state complex
and the output complex.) A mapping from states to outputs that satisfies
the consistency requirements encoded in the output complex is always a
simplicial map, with the additional requirement that it preserves process ids
(we don’t want process p to decide the output for process q). Conversely,
any id-preserving simplicial map gives an output function that satisfies the
consistency requirements.
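The definition can be checked mechanically. A tiny Python predicate (the representation of complexes as sets of frozensets of vertices is mine): note that a simplicial map may collapse a simplex to something smaller, but the image must itself be a simplex of D.

```python
def is_simplicial_map(C, D, f):
    """C, D: simplicial complexes given as sets of frozensets of vertices.
    f: dict mapping vertices of C to vertices of D.
    Checks that f(A) = {f(x) | x in A} is a simplex of D for every A in C."""
    return all(frozenset(f[x] for x in A) in D for A in C)
```

For the consensus discussion above, the point is that no simplicial map can send the edge {x, y} anywhere unless its image is an edge (or a single vertex) of the output complex, which is what forces a connected image.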
Simplicial maps are examples of continuous functions, which have all
sorts of nice topological properties. One nice property is that a continuous
function can’t separate a connected space into disconnected components. We
can prove this directly for simplicial maps: if there is a path of 1-simplexes
{x1 , x2 }, {x2 , x3 }, . . . {xk−1 , xk } from x1 to xk in C, and f : C → D is a
simplicial map, then there is a path of 1-simplexes {f (x1 ), f (x2 )}, . . . from
f (x1 ) to f (xk ). Since being connected just means that there is a path
between any two points,4 if C is connected we’ve just shown that f (C) is as
well.
Getting back to our consensus example, it doesn’t matter what simplicial
map f you pick to map process states to outputs; since the state complex C
is connected, so is f (C), so it lies entirely within one of the two connected
components of the output complex. This means in particular that everybody
always outputs 0 or 1: the protocol is trivial.
Protocol implies map Even though we don’t get a subdivision with the
full protocol, there is a restricted set of executions that does give a
Map implies protocol This requires an algorithm. The idea here is that
the participating set algorithm, originally developed to solve k-set
agreement [BG93], produces precisely the standard chromatic subdivi-
sion used in the ACT proof. In particular, it can be used to solve the
problem of simplex agreement, the problem of getting the processes
to agree on a particular simplex contained within the subdivision of
their original common input simplex. This is a little easier to explain,
so we’ll do it.
The following theorem shows that the return values from participating
set have all the properties we want for iterated immediate snapshot:
Theorem 28.6.2. Let Si be the output of the participating set algorithm for
process i. Then all of the following conditions hold:
Proof. Self-inclusion is trivial, but we will have to do some work for the
other two properties.
The first step is to show that Algorithm 28.1 neatly sorts the processes
out into levels, where each process that returns at level k returns precisely
the set of processes at level k and below.
For each process i, let Si be defined as above, let ℓi be the final value of
level[i] when i returns, and let S′i = {j | ℓj ≤ ℓi}. Our goal is to show that
S′i = Si, justifying the above claim.
Because no process ever increases its level, if process i observes level[j] ≤
ℓi in its last snapshot, then ℓj ≤ level[j] ≤ ℓi. So S′i is a superset of Si. We
thus need to show only that no extra processes sneak in; in particular, we
will show that Si = S′i, by showing that |Si| and |S′i| both equal ℓi.
The first step is to show that |S′i| ≥ |Si| ≥ ℓi. The first inequality follows
from the fact that S′i ⊇ Si; the second follows from the code (if not, i would
have stayed in the loop).
The second step is to show that |S′i| ≤ ℓi. Suppose not; that is, suppose
that |S′i| > ℓi. Then there are at least ℓi + 1 processes with level ℓi or less, all
of which take a snapshot on level ℓi + 1. Let i′ be the last of these processes
to take a snapshot while on level ℓi + 1. Then i′ sees at least ℓi + 1 processes
at level ℓi + 1 or less and exits, contradicting the assumption that it reaches
level ℓi. So |S′i| ≤ ℓi.
The atomic snapshot property follows immediately from the fact that if
ℓi ≤ ℓj, then ℓk ≤ ℓi implies ℓk ≤ ℓj, giving Si = S′i ⊆ S′j = Sj. Similarly,
for immediacy we have that if i ∈ Sj, then ℓi ≤ ℓj, giving Si ⊆ Sj by the
same argument.
The missing piece for turning this into IIS is that in Algorithm 28.1, I
only learn the identities of the processes I am supposed to include but not
their input values. This is easily dealt with by adding an extra register for
each process, to which it writes its input before executing participating set.
28.7.1 k-connectivity
Define the m-dimensional disk to be the set of all points at most 1 unit away
from the origin in Rm , and the m-dimensional sphere to be the surface of
the (m + 1)-dimensional disk (i.e., all points exactly 1 unit away from the
origin in Rm+1 ). Note that what we usually think of as a sphere (a solid
body), topologists call a disk, leaving the term sphere for just the outside
part.
An object is k-connected if any continuous image of an m-dimensional
sphere can be extended to a continuous image of an (m + 1)-dimensional
disk, for all m ≤ k.5 This is a roundabout way of saying that if we can
draw something that looks like a deformed sphere inside our object, we can
always include the inside as well: there are no holes that get in the way.
The punch line is that continuous functions preserve k-connectivity: if we
map an object with no holes into some other object, the image had better
not have any holes either.
Ordinary path-connectivity is the special case when k = 0; here, the
0-sphere consists of two points and the 1-disk is the path between them. So
0-connectivity says that for any two points, there is a path between them.
For 1-connectivity, if we draw a loop (a path that returns to its origin), we
can include the interior of the loop somewhere. One way to thinking about
this is to say that we can shrink the loop to a point without leaving the object
(the technical term for this is that the path is null-homotopic, where a
homotopy is a way to transform one thing continuously into another thing
over time and the null path sits on a single point). An object that is
1-connected is also called simply connected.
5 This definition is for the topological version of k-connectivity. It is not related in any
way to the definition of k-connectivity in graph theory, where a graph is k-connected if
there are k disjoint paths between any two points.
Chapter 29
Approximate agreement
Validity Every process returns an output within the range of inputs. Formally, for all i, it holds that (minj xj ) ≤ yi ≤ (maxj xj ).
CHAPTER 29. APPROXIMATE AGREEMENT 279
Algorithm 54] but with a slight bug fix;¹ pseudocode appears in Algorithm 29.1.²
The algorithm carries out a sequence of asynchronous rounds in which
processes adopt new values, such that the spread of the vector of all values
Vr in round r, defined as spread(Vr ) = max Vr − min Vr , drops by a factor of 2
per round. This is done by having each process choose a new value in each
round by taking the midpoint (average of min and max) of all the values it
sees in the previous round. Slow processes will jump to the maximum round
they see rather than propagating old values up from ancient rounds; this
is enough to guarantee that latecomer values that arrive after some process
writes in round 2 are ignored.
The algorithm uses a single snapshot object A to communicate, and each
process stores its initial input and a round number along with its current
preference. We assume that the initial values in this object all have round
number 0, and that log2 0 = −∞ (which avoids a special case in the termination test).
A[i] ← ⟨xi , 1, xi ⟩
repeat
    ⟨x′1 , r1 , v1 ⟩ . . . ⟨x′n , rn , vn ⟩ ← snapshot(A)
    rmax ← maxj rj
    v ← midpoint{vj | rj = rmax }
    A[i] ← ⟨xi , rmax + 1, v⟩
until rmax ≥ 2 and rmax ≥ log2 (spread({x′j })/ε)
return v

Algorithm 29.1: Approximate agreement
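The spread-halving behavior can be checked with a small sequential sketch (Python; the helper names are made up, and the nested prefix views stand in only for the containment property of snapshots, not for the full asynchronous algorithm):

```python
import random

def midpoint(vals):
    # Average of min and max, as in the midpoint step of Algorithm 29.1.
    return (min(vals) + max(vals)) / 2

def spread(vals):
    return max(vals) - min(vals)

def one_round(values):
    # Each process computes the midpoint of a view of the round's values.
    # Snapshot views are totally ordered by containment, which we model
    # by giving every process a nonempty prefix of one shuffled copy.
    perm = values[:]
    random.shuffle(perm)
    return [midpoint(perm[:random.randint(1, len(perm))]) for _ in values]

random.seed(1)
vals = [0.0, 1.0, 0.3, 0.7]
for _ in range(10):
    new_vals = one_round(vals)
    # Lemma 29.1.1: nested views force the spread to at least halve.
    assert spread(new_vals) <= spread(vals) / 2 + 1e-12
    vals = new_vals
```

After ten rounds the spread is at most 2^-10 times the original, which matches the log2(spread/ε) round bound used in the termination test.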
To show this works, we want to show that the midpoint operation guar-
antees that the spread shrinks by a factor of 2 in each round. Let Vr be the
set of all values v that are ever written to the snapshot object with round
1 The original algorithm from [AW04] does not include the test rmax ≥ 2. This allows
for bad executions in which process 1 writes its input of 0 in round 1 and takes a snapshot
that includes only its own input, after which process 2 runs the algorithm to completion
with input 1. Here process 2 will see 0 and 1 in round 1, and will write ⟨1, 2, 1/2⟩ to
A[2]; on subsequent iterations, it will see only the value 1/2 in the maximum round, and
after ⌈log2 (1/ε)⌉ rounds it will decide on 1/2. But if we now wake process 1 up, it will
decide 0 immediately based on its snapshot, which includes only its own input and gives
spread(x) = 0. Adding the extra test prevents this from happening, as new values that
arrive after somebody writes round 2 will be ignored.
2 Showing that this particular algorithm works takes a lot of effort. If I were to do this
over, I’d probably go with a different algorithm due to Schenk [Sch95].
number r. Let Ur ⊆ Vr be the set of values that are ever written to the snap-
shot object with round number r before some process writes a value with
round number r + 1 or greater; the intuition here is that Ur includes only
those values that might contribute to the computation of some round-(r + 1)
value.
Lemma 29.1.1. For all r for which Vr+1 is nonempty,
spread(Vr+1 ) ≤ spread(Ur )/2.
Proof. Let Uri be the set of round-r values observed by process i in the
iteration in which it sees rmax = r, if such an iteration exists.
exists. Note that Uri ⊆ Ur , because if some value with round r + 1 or greater
is written before i’s snapshot, then i will compute a larger value for rmax .
Given two processes i and j, we can argue from the properties of snapshot
that either Uri ⊆ Urj or Urj ⊆ Uri . The reason is that if i’s snapshot comes
first, then j sees at least as many round-r values as i does, because the only
way for a round-r value to disappear is if it is replaced by a value in a later
round. But in this case, process j will compute a larger value for rmax and
will not get a view for round r. The same holds in reverse if j’s snapshot
comes first.
Observe that if Uri ⊆ Urj , then
|midpoint(Uri ) − midpoint(Urj )| ≤ spread(Urj )/2.
This holds because midpoint(Uri ) lies within the interval [min Urj , max Urj ],
and every point in this interval is within spread(Urj )/2 of midpoint(Urj ). The
same holds if Urj ⊆ Uri . So any two values written in round r + 1 are within
spread(Ur )/2 of each other.
In particular, the minimum and maximum values in Vr+1 are within
spread(Ur )/2 of each other, so spread(Vr+1 ) ≤ spread(Ur )/2.
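The containment observation can be sanity-checked by brute force over all nested pairs of subsets of a small value set (a Python sketch; the particular values are arbitrary):

```python
from itertools import combinations

def midpoint(s):
    return (min(s) + max(s)) / 2

def spread(s):
    return max(s) - min(s)

values = [0.0, 0.125, 0.5, 0.8, 1.0]
checked = 0
# Enumerate every nested pair A ⊆ B of nonempty subsets of the values
# and verify |midpoint(A) − midpoint(B)| ≤ spread(B)/2.
for k in range(1, len(values) + 1):
    for B in combinations(values, k):
        for j in range(1, k + 1):
            for A in combinations(B, j):
                assert abs(midpoint(A) - midpoint(B)) <= spread(B) / 2 + 1e-12
                checked += 1
```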
Corollary 29.1.2. For all r ≥ 2 for which Vr is nonempty,
spread(Vr ) ≤ spread(U1 )/2^(r−1) .
Proof. By induction on r. For r = 2, this is just Lemma 29.1.1. For larger
r, use the fact that Ur−1 ⊆ Vr−1 and thus spread(Ur−1 ) ≤ spread(Vr−1 ) to
compute
spread(Vr ) ≤ spread(Ur−1 )/2 ≤ spread(Vr−1 )/2 ≤ (spread(U1 )/2^(r−2) )/2 = spread(U1 )/2^(r−1) .
Let i be some process that finishes in the fewest number of rounds. Process i can’t finish until it reaches round rmax + 1, where rmax ≥ log2 (spread({x′j })/ε)
for a vector of input values x′ that it reads after some process writes round
2 or greater. We have spread({x′j }) ≥ spread(U1 ), because every value in
U1 is included in x′ . So rmax ≥ log2 (spread(U1 )/ε), and spread(Vrmax +1 ) ≤
spread(U1 )/2^rmax ≤ spread(U1 )/(spread(U1 )/ε) = ε. Since any value returned is either included in Vrmax +1 or some later Vr′ ⊆ Vrmax +1 , this gives
us that the spread of all the outputs is at most ε: Algorithm 29.1 solves
approximate agreement.
The cost of Algorithm 29.1 depends on the cost of the snapshot operations, on ε, and on the initial input spread D. For linear-cost snapshots,
this works out to O(n log(D/ε)).
follows that after k steps the best spread we can get is D/3^k , requiring
k ≥ log3 (D/ε) steps to get ε-agreement.
Herlihy uses this result to show that there are decision problems that
have wait-free but not bounded wait-free deterministic solutions using registers. Curiously, the lower bound says nothing about the dependence on the
number of processes; it is conceivable that there is an approximate agreement protocol with running time that depends only on D/ε and not n.
Part III
Direct interaction
Chapter 30
Self-stabilization
Chapter 31
Population protocols
Chapter 32
Chapter 33
Mobile robots
Chapter 34
Self-assembly
Appendix
Appendix A
Assignments
APPENDIX A. ASSIGNMENTS 291
1. Your name.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
APPENDIX B. SAMPLE ASSIGNMENTS FROM SPRING 2014 293
evil, and knows the identities of all of its neighbors. However, the processes
do not know the number of processes n or the diameter of the network D.
Give a protocol that allows every process to correctly return the number
of evil processes no later than time D. Your protocol should only return a
value once for each process (no converging to the correct answer after an
initial wrong guess).
Solution
There are a lot of ways to do this. Since the problem doesn’t ask about
message complexity, we’ll do it in a way that optimizes for algorithmic sim-
plicity.
At time 0, each process initiates a separate copy of the flooding algorithm
(Algorithm 4.1). The message ⟨p, N (p), e⟩ it distributes consists of its own
identity, the identities of all of its neighbors, and whether or not it is evil.
In addition to the data for the flooding protocol, each process tracks a
set I of all processes it has seen that initiated a protocol and a set N of all
processes that have been mentioned as neighbors. The initial values of these
sets for process p are {p} and N (p), the neighbors of p.
Upon receiving a message ⟨q, N (q), e⟩, a process adds q to I and N (q) to
N . As soon as I = N , the process returns a count of all processes for which
e = true.
Termination by D: Follows from the same analysis as flooding. Any
process at distance d from p has p ∈ I by time d, so I is complete by time
D.
Correct answer: Observe that N = ∪i∈I N (i) always. Suppose that
there is some process q that is not in I. Since the graph is connected, there
is a path from p to q. Let r be the last node in this path in I, and let s be
the following node. Then s ∈ N \ I and N ≠ I. By contraposition, if I = N
then I contains all nodes in the network, and so the count returned at this
time is correct.
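The protocol and its I = N termination test can be exercised in a synchronous round simulation (a Python sketch; the graph and the evil set are made up for illustration):

```python
def evil_count_protocol(adj, evil):
    # Flood (id, neighbor-set, evil-bit) triples; a process answers as
    # soon as the initiators I it has seen equal the mentioned
    # neighbors N, as in the solution above.
    nodes = list(adj)
    known = {p: {(p, frozenset(adj[p]), evil[p])} for p in nodes}
    answered = {}
    t = 0
    while len(answered) < len(nodes):
        t += 1
        inbox = {p: set() for p in nodes}
        for p in nodes:            # everyone forwards everything it knows
            for q in adj[p]:
                inbox[q] |= known[p]
        for p in nodes:
            known[p] |= inbox[p]
            I = {pid for pid, _, _ in known[p]}
            N = set().union(*(nbrs for _, nbrs, _ in known[p]))
            if p not in answered and I == N:
                answered[p] = (t, sum(1 for _, _, e in known[p] if e))
    return answered

# A path a-b-c-d has diameter 3; b and d are evil.
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
answers = evil_count_protocol(adj, {'a': False, 'b': True,
                                    'c': False, 'd': True})
```

Every process answers 2, and no answer takes longer than the diameter.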
its neighbors as its parent, and following the parent pointers always gives a
path of minimum total weight to the initiator.1
Give a protocol that solves this problem with reasonable time, message,
and bit complexity, and show that it works.
Solution
There’s an ambiguity in the definition of total weight: does it include the
weight of the initiator and/or the initial node in the path? But since these
values are the same for all paths to the initiator from a given process, they
don’t affect which is lightest.
If we don’t care about bit complexity, there is a trivial solution: Use an
existing BFS algorithm followed by convergecast to gather the entire struc-
ture of the network at the initiator, run your favorite single-source shortest-
path algorithm there, then broadcast the results. This has time complexity
O(D) and message complexity O(DE) if we use the BFS algorithm from
§5.3. But the last couple of messages in the convergecast are going to be
pretty big.
A solution by reduction: Suppose that we construct a new graph G0
where each weight-2 node u in G is replaced by a clique of nodes u1 , u2 , . . . uk ,
with each node in the clique attached to a different neighbor of u. We then
run any breadth-first search protocol of our choosing on G0 , where each
weight-2 node simulates all members of the corresponding clique. Because
any path that passes through a clique picks up an extra edge, each path in
the breadth-first search tree has a length exactly equal to the sum of the
weights of the nodes other than its endpoints.
A complication is that if I am simulating k nodes, between them they
may have more than one parent pointer. So we define u.parent to be ui .parent
where ui is a node at minimum distance from the initiator in G0 . We also
re-route any incoming pointers to uj 6= ui to point to ui instead. Because
ui was chosen to have minimum distance, this never increases the length of
any path, and the resulting modified tree is still a shortest-path tree.
Adding nodes blows up |E 0 |, but we don’t need to actually send messages
between different nodes ui represented by the same process. So if we use the
§5.3 algorithm again, we only send up to D messages per real edge, giving
O(D) time and O(DE) messages.
1 Clarification added 2014-01-26: The actual number of hops is not relevant for the
construction of the shortest-path tree. By shortest path, we mean path of minimum total
weight.
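The reduction itself can be prototyped sequentially (Python; all names here are made up, and the distributed BFS is replaced by an ordinary centralized BFS on the transformed graph G′):

```python
from collections import deque
import heapq

def to_unit_graph(adj, weight):
    # Replace each weight-2 node u by a clique with one "port" node per
    # neighbor of u, as in the reduction; weight-1 nodes keep one port.
    def ports(u):
        return [(u, v) for v in adj[u]] if weight[u] == 2 else [(u, None)]
    g = {p: set() for u in adj for p in ports(u)}
    for u in adj:
        for p in ports(u):         # clique edges inside u's gadget
            for q in ports(u):
                if p != q:
                    g[p].add(q)
        for v in adj[u]:           # one edge per original edge, port to port
            p = (u, v) if weight[u] == 2 else (u, None)
            q = (v, u) if weight[v] == 2 else (v, None)
            g[p].add(q)
    return g

def bfs_from(adj, weight, src):
    # Unweighted BFS on G', started from all of src's ports at once
    # (the initiator simulates its whole clique).
    g = to_unit_graph(adj, weight)
    dist = {p: 0 for p in g if p[0] == src}
    q = deque(dist)
    while q:
        x = q.popleft()
        for y in g[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return {u: min(d for p, d in dist.items() if p[0] == u) for u in adj}

def interior_weight(adj, weight, src):
    # Reference answer: node-weighted Dijkstra, counting interior nodes only.
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v in adj[u]:
            nd = d + weight[v]
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return {v: dist[v] - (weight[v] if v != src else 0) for v in dist}

# Diamond graph: s can reach t through m (weight 2) or a (weight 1).
adj = {'s': ['m', 'a'], 'm': ['s', 't'], 'a': ['s', 't'], 't': ['m', 'a']}
w = {'s': 1, 'm': 2, 'a': 1, 't': 1}
hops = bfs_from(adj, w, 's')
interior = interior_weight(adj, w, 's')
```

On this example each BFS distance in G′ is exactly one more than the interior weight of the lightest path, so choosing parents by BFS distance in G′ picks out minimum-weight paths in G.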
If we don’t like reductions, we could also tweak one of our existing al-
gorithms. Gallager’s layered BFS (§5.2) is easily modified by changing the
depth bound for each round to a total-weight bound. The synchronizer-
based BFS can also be modified to work, but the details are messy.
Solution
The par solution for this is an Ω(√f) lower bound and O(f ) upper bound.
I don’t know if it is easy to do better than this.
For the lower bound, observe that the adversary can simulate an ordinary
crash failure by jamming a process in every round starting in the round it
crashes in. This means that in an r-round protocol, we can simulate k crash
failures with kr jamming faults. From the Dolev-Strong lower bound [DS83]
(see also Chapter 9), we know that there is no r-round protocol with k = r
crash failures, so there is no r-round protocol with r² jamming faults.
2 Clarifications added 2014-02-10: We assume that processes don’t know that they are
being jammed or which messages are lost (unless the recipient manages to tell them that
a message was not delivered). As in the original model, we assume a complete network
and that all processes have known identities.
This gives a lower bound of √f + 1 on the number of rounds needed to
solve synchronous agreement with f jamming faults.³
For the upper bound, have every process broadcast its input every round.
After f +1 rounds, there is at least one round in which no process is jammed,
so every process learns all the inputs and can take, say, the majority value.
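The upper-bound argument is easy to simulate (a Python sketch; the jamming schedule below is one arbitrary adversary, not a worst case):

```python
def broadcast_rounds(inputs, f, jam_schedule):
    # Every process broadcasts its input in each of f+1 rounds.  The
    # adversary silences (jams) at most f process-rounds in total, so
    # at least one round goes through completely.
    n = len(inputs)
    assert sum(map(len, jam_schedule)) <= f
    heard = [{i: inputs[i]} for i in range(n)]
    for r in range(f + 1):
        jammed = jam_schedule[r] if r < len(jam_schedule) else set()
        for sender in range(n):
            if sender in jammed:
                continue           # sender's messages are lost this round
            for receiver in range(n):
                heard[receiver][sender] = inputs[sender]
    return heard

inputs = [0, 1, 1]
schedule = [{0}, {1}, {2}, {0}]    # 4 jamming faults over rounds 0-3
heard = broadcast_rounds(inputs, f=4, jam_schedule=schedule)
decisions = [1 if 2 * sum(h.values()) > len(h) else 0 for h in heard]
```

Because round 4 is unjammed, every process ends up with all three inputs and the majority decisions agree.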
Solution
The relevant bound here is the requirement that the network have enough
connectivity that the adversary can’t take over half of a vertex cut (see
§10.1.3). This is complicated slightly by the requirement that the faulty
nodes be contiguous.
The smallest vertex cut in a sufficiently large torus consists of the four
neighbors of a single node; however, these nodes are not connected. But we
can add a third node to connect two of them (see Figure B.1).
By adapting the usual lower bound we can use this construction to show
that f = 3 faults are enough to prevent agreement when m ≥ 3. The
question then is whether f = 2 faults is enough.
By a case analysis, we can show that any two nodes in a sufficiently
large torus are either adjacent themselves or can be connected by three
paths, where no two paths have adjacent vertices. Assume without loss of
generality that one of the nodes is at position (0, 0). Then any other node
is covered by one of the following cases:
3 Since Dolev-Strong only needs to crash one process per round, we don’t really need
the full r jamming faults for processes that crash late. This could be used to improve the
constant for this argument.
4 Problem modified 2014-02-03. In the original version, it asked to compute f for all
m, but there are some nasty special cases when m is small.
2. Nodes at (0, i) or (i, 0). These cases are symmetric, so we’ll describe
the solution for (0, i). Run one path directly north: (0, 1), (0, 2), . . . , (0, i − 1).
Similarly, run a second path south: (0, −1), (0, −2), . . . , (0, i + 1).
For the third path, take two steps east and then run north and back
west: (1, 0), (2, 0), (2, 1), (2, 2), . . . , (2, i), (1, i). These paths are all
non-adjacent as long as m ≥ 4.
3. Nodes at (±1, i) or (i, ±1), where i is not −1, 0, or 1. Suppose the node
is at (1, i). Run one path east then north through (1, 0), (1, 1), . . . , (1, i−
1). The other two paths run south and west, with a sideways jog in the
middle as needed. This works for m sufficiently large to make room
for the sideways jogs.
Solution
We can tolerate f < n/2, but no more.
If f < n/2, the following algorithm works: Run Paxos, where each
process i waits to learn that it is non-faulty, then acts as a proposer for
proposal number i. The highest-numbered non-faulty process then carries
out a proposal round that succeeds because no higher proposal is ever issued,
and both the proposer (which is non-faulty) and a majority of accepters
participate.
If f ≥ n/2, partition the processes into two groups of size bn/2c, with
any leftover process crashing immediately. Make all of the processes in both
groups non-faulty, and tell each of them this at the start of the protocol.
Now do the usual partitioning argument: Run group 0 with inputs 0 with no
messages delivered from group 1 until all processes decide 0 (we can do this
because the processes can’t distinguish this execution from one in which
the group 1 processes are in fact faulty). Run group 1 similarly until all
processes decide 1. We have then violated agreement, assuming we didn’t
previously violate termination or validity.
Solution
First observe that ♦S can simulate ♦Sk for any k by having n − k processes
ignore the output of their failure detectors. So we need f < n/2 by the
usual lower bound on ♦S.
If f ≥ k, we are also in trouble. The f > k case is easy: If there exists
a consensus protocol for f > k, then we can transform it into a consensus
procedure inc
    ci [i] ← ci [i] + 1
    Send ci [i] to all processes.
    Wait to receive ack(ci [i]) from a majority of processes.

upon receiving c from j do
    ci [j] ← max(ci [j], c)
    Send ack(c) to j.

procedure read
    ri ← ri + 1
    Send read(ri ) to all processes.
    Wait to receive respond(ri , cj ) from a majority of processes j.
    return Σk maxj cj [k]

upon receiving read(r) from j do
    Send respond(r, ci ) to j

Algorithm B.1: Counter algorithm for Problem B.4.2.
Unlike mutex, a concurrency detector does not enforce that only one
process is in the critical section at a time; instead, exiti returns 1 if the
interval between it and the previous enteri overlaps with some interval
between an enterj and corresponding exitj for some j ≠ i, and returns 0 if
there is no overlap.
Is there a deterministic linearizable wait-free implementation of a con-
currency detector from atomic registers? If there is, give an implementation.
If there is not, give an impossibility proof.
Solution
It is not possible to implement this object using atomic registers.
Suppose that there were such an implementation. Algorithm B.2 implements two-process consensus using two atomic registers and a single
concurrency detector, initialized to the state following enter1 .
return its own value and process 2 to return the contents of r1 . These
will equal process 1’s value, because process 2’s read follows its call to
enter2 , which follows exit1 and thus process 1’s write to r1 .
Solution
If n = 2, then a two-writer sticky bit is equivalent to a sticky bit, so we can
solve consensus.
If n ≥ 3, suppose that we maneuver our processes as usual to a bivalent
configuration C with no bivalent successors. Then there are three pending
operations x, y, and z, that among them produce both 0-valent and 1-valent
configurations. Without loss of generality, suppose that Cx and Cy are both
0-valent and Cz is 1-valent. We now consider what operations these might
be.
Solution
The necessary part is easier, although we can’t use JTT (Chapter 20) di-
rectly because having write operations means that our rotate register is not
perturbable. Instead, we argue that if we initialize the register to 1, we
1. Show that any return values of the protocol are consistent with a
linearizable, single-use test-and-set.
procedure write(A, v)
    s ← snapshot(A)
    A[id] ← ⟨maxi s[i].timestamp + 1, id, v, 0⟩

procedure RotateLeft(A)
    s ← snapshot(A)
    Let i maximize ⟨s[i].timestamp, s[i].process⟩
    if s[i].timestamp = A[id].timestamp and s[i].process = A[id].process then
        // Increment my rotation count
        A[id].rotations ← A[id].rotations + 1
    else
        // Reset and increment my rotation count
        A[id] ← ⟨s[i].timestamp, s[i].process, s[i].value, 1⟩

procedure read(A)
    s ← snapshot(A)
    Let i maximize ⟨s[i].timestamp, s[i].process⟩
    Let r = Σ{j : s[j].timestamp = s[i].timestamp ∧ s[j].process = s[i].process} s[j].rotations
    return s[i].value rotated r times.

Algorithm B.3: Implementation of a rotate register
procedure TASi ()
    while true do
        with probability 1/2 do
            ri ← ri + 1
        else
            ri ← ri
        s ← r¬i
        if s > ri then
            return 1
        else if s < ri − 1 then
            return 0
Solution
1. To show that this implements a linearizable test-and-set, we need to
show that exactly one process returns 0 and the other 1, and that
if one process finishes before the other starts, the first process to go
returns 0.
Suppose that pi finishes before p¬i starts. Then pi reads only 0 from
r¬i , and cannot observe ri < r¬i : pi returns 0 in this case.
We now show that the two processes cannot return the same value.
Suppose that both processes terminate. Let i be such that pi reads r¬i
for the last time before p¬i reads ri for the last time. If pi returns 0,
then it observes ri ≥ r¬i + 2 at the time of its read; p¬i can increment
r¬i at most once before reading ri again, and so observes r¬i < ri and
returns 1.
Alternatively, if pi returns 1, it observed ri < r¬i . Since pi performs
no more increments on ri , p¬i also observes ri < r¬i in all subsequent
reads, and so cannot also return 1.
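The argument can be checked by a quick randomized simulation (Python; one scheduler step is one register access, and both the scheduler and the coins are random):

```python
import random

def run_tas(seed):
    random.seed(seed)
    r = [0, 0]                       # the two shared registers
    phase = ['write', 'write']       # next access for each process
    result = [None, None]
    for _ in range(10 ** 6):
        live = [i for i in (0, 1) if result[i] is None]
        if not live:
            return result
        i = random.choice(live)
        if phase[i] == 'write':
            if random.random() < 0.5:    # increment with probability 1/2
                r[i] += 1
            phase[i] = 'read'
        else:
            s = r[1 - i]                 # read the other process's register
            if s > r[i]:
                result[i] = 1
            elif s < r[i] - 1:
                result[i] = 0
            else:
                phase[i] = 'write'       # inconclusive: flip another coin
    raise RuntimeError('did not terminate within the step bound')
```

For every seed and schedule tried, one process returns 0 and the other returns 1, as the proof requires.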
2. Let’s run the protocol with an oblivious adversary, and track the value
of r0t − r1t over time, where rit is the value of ri after t writes (to either
register). Each write to r0 increases this value by 1/2 on average, with
a change of 0 or 1 equally likely, and each write to r1 decreases it by
1/2 on average.
To make things look symmetric, let ∆t be the change caused by the
t-th write and write ∆t as ct + X t where ct = ±1/2 is a constant
determined by whether p0 or p1 does the t-th write and X t = ±1/2 is
a random variable with expectation 0. Observe that the X t variables
are independent of each other and the constants ct (which depend only
on the schedule).
For the protocol to run forever, at every time t it must hold that
|r0t − r1t | ≤ 3; otherwise, even after one or both processes does its
next write, we will have |r0t′ − r1t′ | ≥ 2, and the next process to read will
terminate. But

    r0t − r1t = Σs=1..t ∆s = Σs=1..t (cs + X s ) = Σs=1..t cs + Σs=1..t X s .
The left-hand sum is a constant, while the right-hand sum has a bi-
nomial distribution. For any fixed constant, the probability that a
binomial distribution lands within ±2 of the constant goes to zero in
the limit as t → ∞, so with probability 1 there is some t for which
this event does not occur.
size of the ring. We would like the processes to each compute the maximum
input. As usual, each process may only return an output once, and must do
so after a finite number of rounds, although it may continue to participate
in the protocol (say, by relaying messages) even after it returns an output.
Prove or disprove: It is possible to solve this problem in this model.
Solution
It’s not possible.
Consider an execution with n = 3 processes, each with input 0. If the
protocol is correct, then after some finite number of rounds t, each process
returns 0. By symmetry, the processes all have the same states and send
the same messages throughout this execution.
Now consider a ring of size 2(t + 1) where every process has input 0,
except for one process p that has input 1. Let q be the process at maximum
distance from p. By induction on r, we can show that after r rounds of
communication, every process that is more than r + 1 hops away from p has
the same state as all of the processes in the 3-process execution above. So
in particular, after t rounds, process q (at distance t + 1) is in the same state
as it would be in the 3-process execution, and thus it returns 0. But—as it
learns to its horror, one round too late—the correct maximum is 1.
Solution
Test-and-sets are (a) historyless, and (b) have consensus number 2, so n is
at least 2.
To show that no historyless object can solve wait-free 3-process consen-
sus, consider an execution that starts in a bivalent configuration and runs
to a configuration C with two pending operations x and y such that Cx is
0-valent and Cy is 1-valent. By the usual arguments x and y must both be
Solution
Consider an execution in which the client orders ham. Run the northern
server together with the client until the server is about to issue a launch
action (if it never does so, the client receives no ham when the southern
server is faulty).
Now run the client together with the southern server. There are two
cases:
1. If the southern server ever issues launch, execute both this and the
northern server’s launch actions: the client gets two hams.
2. If the southern server never issues launch, never run the northern
server again: the client gets no hams.
In either case, the one-ham rule is violated, and the protocol is not
correct.5
procedure mutex()
    predecessor ← swap(s, myId)
    while r ≠ predecessor do
        try again
    // Start of critical section
    . . .
    // End of critical section
    r ← myId

Algorithm B.5: Mutex using a swap object and register
Solution
Because processes use the same id if they try to access the mutex twice, the
algorithm doesn’t work.
Here’s an example of a bad execution:
2. Process 2 swaps 2 into s and gets 1, reads 1 from r, and enters the
critical section.
I believe this works if each process adopts a new id every time it calls
mutex, but the proof is a little tricky.6
6 The simplest proof I can come up with is to apply an invariant that says that (a)
the processes that have executed swap(s, myId) but have not yet left the while loop have
predecessor values that form a linked list, with the last pointer either equal to ⊥ (if no
process has yet entered the critical section) or the last process to enter the critical section;
(b) r is ⊥ if no process has yet left the critical section, or the last process to leave the
critical section otherwise; and (c) if there is a process that is in the critical section, its
predecessor field points to the last process to leave the critical section. Checking the effects
of each operation shows that this invariant is preserved through the execution, and (a)
combined with (c) show that we can’t have two processes in the critical section at the
same time. Additional work is still needed to show starvation-freedom. It’s a good thing
this algorithm doesn’t work as written.
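The bad execution can be replayed in a few lines (a Python sketch; registers are plain globals, and each yield marks a point where the scheduler may switch processes):

```python
s = None          # swap object, initially ⊥
r = None          # register, initially ⊥
in_cs = set()     # who is currently inside the critical section

def swap(new):
    global s
    old, s = s, new
    return old

def mutex_steps(pid):
    global r
    predecessor = swap(pid)
    yield                                  # scheduler may switch here
    while r != predecessor:
        yield                              # spin on r
    in_cs.add(pid)                         # start of critical section
    yield
    in_cs.discard(pid)                     # end of critical section
    r = pid

for _ in mutex_steps(1):                   # process 1 runs to completion
    pass                                   # afterwards s = r = 1

p1 = mutex_steps(1)                        # process 1 calls mutex again...
next(p1)                                   # ...reusing id 1: predecessor = 1

p2 = mutex_steps(2)
next(p2)                                   # process 2 swaps 2 into s, gets 1
next(p2)                                   # reads r = 1 and enters the CS
next(p1)                                   # process 1 also sees r = 1: enters too
```

At this point both processes are inside the critical section, reproducing the violation described above.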
Appendix C
Sample assignments from fall 2011
1. Your name.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
APPENDIX C. SAMPLE ASSIGNMENTS FROM FALL 2011 313
Solution
Disproof: Consider two executions, one in an n × m torus and one in an
m × n torus where n > m and both n and m are at least 2.² Using the same
argument as in Lemma 6.1.1, show by induction on the round number that,
for each round r, all processes in both executions have the same state. It
follows that if the processes correctly detect n > m in the n × m execution,
then they incorrectly report m > n in the m × n execution.
C.1.2 Clustering
Suppose that k of the nodes in an asynchronous message-passing network
are designated as cluster heads, and we want to have each node learn the
identity of the nearest head. Give the most efficient algorithm you can for
this problem, and compute its worst-case time and message complexities.
You may assume that processes have unique identifiers and that all pro-
cesses know how many neighbors they have.3
Solution
The simplest approach would be to run either of the efficient distributed
breadth-first search algorithms from Chapter 5 simultaneously starting at
all cluster heads, and have each process learn the distance to all cluster heads
at once and pick the nearest one. This gives O(D²) time and O(k(E + V D))
messages if we use layering and O(D) time and O(kDE) messages using
local synchronization.
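Simulated centrally, the first approach collapses to a multi-source BFS (a Python sketch; tie-breaking among equidistant heads here just follows queue order):

```python
from collections import deque

def nearest_heads(adj, heads):
    # Start a BFS from every cluster head simultaneously; the first head
    # to reach a node is (one of) its nearest head(s).
    best = {h: (0, h) for h in heads}
    q = deque(sorted(heads))
    while q:
        u = q.popleft()
        d, h = best[u]
        for v in adj[u]:
            if v not in best:
                best[v] = (d + 1, h)
                q.append(v)
    return best

# A path 1-2-3-4-5 with cluster heads at both ends.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = nearest_heads(adj, {1, 5})
```

Each node ends up labeled with its distance to, and the identity of, a nearest head.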
We can get rid of the dependence on k in the local-synchronization algo-
rithm by running it almost unmodified, with the only difference being the
attachment of a cluster head id to the exactly messages. The simplest way to
show that the resulting algorithm works is to imagine coalescing all cluster
1 Clarification added 2011-09-28.
2 This last assumption is not strictly necessary, but it avoids having to worry about
what it means when a process sends a message to itself.
3 Clarification added 2011-09-26.
C.1.3 Negotiation
Two merchants A and B are colluding to fix the price of some valuable
commodity, by sending messages to each other for r rounds in a synchronous
message-passing system. To avoid the attention of antitrust regulators, the
merchants are transmitting their messages via carrier pigeons, which are
unreliable and may become lost. Each merchant has an initial price pA or
pB , which are integer values satisfying 0 ≤ p ≤ m for some known value
m, and their goal is to choose new prices p′A and p′B , where |p′A − p′B | ≤ 1.
If pA = pB and no messages are lost, they want the stronger goal that
p′A = p′B = pA = pB .
Prove the best lower bound you can on r, as a function of m, for all
protocols that achieve these goals.
Solution
This is a thinly-disguised version of the Two Generals Problem from Chap-
ter 3, with the agreement condition p′A = p′B replaced by an approximate
agreement condition |p′A − p′B | ≤ 1. We can use a proof based on the
indistinguishability argument in §3.2 to show that r ≥ m/2.
Fix r, and suppose that in a failure-free execution both processes send
messages in all rounds (we can easily modify an algorithm that does not
have this property to have it, without increasing r). We will start with a
sequence of executions with pA = pB = 0. Let X0 be the execution in which
no messages are lost, X1 the execution in which A’s last message is lost,
X2 the execution in which both A and B’s last messages are lost, and so
on, with Xk for 0 ≤ k ≤ 2r losing k messages split evenly between the two
processes, breaking ties in favor of losing messages from A.
When i is even, Xi is indistinguishable from Xi+1 by A; it follows that
p′A is the same in both executions. Because we no longer have agreement,
it may be that p′B (Xi ) and p′B (Xi+1 ) are not the same as p′A in either execution; but since both are within 1 of p′A , the difference between them is
at most 2. Next, because Xi+1 and Xi+2 are indistinguishable to B, we have
p′B (Xi+1 ) = p′B (Xi+2 ), which we can combine with the previous claim to get
|p′B (Xi ) − p′B (Xi+2 )| ≤ 2. A simple induction then gives p′B (X2r ) ≤ 2r, where
Suppose that we augment the system so that senders are notified imme-
diately when their messages are delivered. We can model this by making the
delivery of a single message an event that updates the state of both sender
and recipient, both of which may send additional messages in response. Let
us suppose that this includes attempted deliveries to faulty processes, so
that any non-faulty process that sends a message m is eventually notified
that m has been delivered (although it might not have any effect on the
recipient if the recipient has already crashed).
1. Show that this system can solve consensus with one faulty process
when n = 2.
2. Show that this system cannot solve consensus with two faulty processes
when n = 3.
Solution
1. To solve consensus, each process sends its input to the other. Whichever
input is delivered first becomes the output value for both processes.
2. To show impossibility with n = 3 and two faults, run the usual FLP
proof until we get to a configuration C with events e′ and e such that
Ce is 0-valent and Ce′e is 1-valent (or vice versa). Observe that e
and e′ may involve two processes each (sender and receiver), for up
to four processes total, but only a process that is involved in both e
and e0 can tell which happened first. There can be at most two such
processes. Kill both, and get that Ce′e is indistinguishable from Cee′
for the remaining process, giving the usual contradiction.
Solution
There is an easy reduction to FLP that shows f ≤ n/2 is necessary (when n
is even), and a harder reduction that shows f < 2√n − 1 is necessary. The
easy reduction is based on crashing every other process; now no surviving
process can suspect any other survivor, and we are back in an asynchronous
message-passing system with no failure detector and 1 remaining failure (if
f is at least n/2 + 1).
The harder reduction is to crash every (√n)-th process. This partitions
the ring into √n segments of length √n − 1 each, where there is no failure
detector in any segment that suspects any process in another segment. If an
algorithm exists that solves consensus in this situation, then it does so even
if (a) all processes in each segment have the same input, (b) if any process
in one segment crashes, all √n − 1 processes in the segment crash, and (c) if
any process in a segment takes a step, all take a step, in some fixed order.
Under these additional conditions, each segment can be simulated by a single
process in an asynchronous system with no failure detectors, and the extra
√n − 1 failures in 2√n − 1 correspond to one failure in the simulation. But
we can’t solve consensus in the simulating system (by FLP), so we can’t
solve it in the original system either.
On the other side, let’s first boost completeness of the failure detector,
by having any process that suspects another transmit this suspicion by
reliable broadcast. So now if any non-faulty process i suspects i + 1, all the
non-faulty processes will suspect i + 1. Now with up to t failures, whenever
I learn that process i is faulty (through a broadcast message passing on the
suspicion of the underlying failure detector), I will suspect processes i + 1
through i + t − f as well, where f is the number of failures I have heard
about directly. I don’t need to suspect process i + t − f + 1 (unless there is
some intermediate process that has also failed), because the only way that
this process will not be suspected eventually is if every process in the range
i to i + t − f is faulty, which can’t happen given the bound t.
Now if t is small enough that I can’t cover the entire ring with these
segments, then there is some non-faulty process that is far enough away
from the nearest preceding faulty process that it is never suspected: this
gives us an eventually strong failure detector, and we can solve consensus
using the standard Chandra-Toueg ♦S algorithm from §13.4 or [CT96]. The
inequality I am looking for is f(t − f) < n, where the left-hand side is
maximized by setting f = t/2, which gives t²/4 < n or t < 2√n. This leaves
a gap of about √2 between the upper and lower bounds; I don’t know which
one can be improved.
APPENDIX C. SAMPLE ASSIGNMENTS FROM FALL 2011 318
I am indebted to Hao Pan for suggesting the Θ(√n) upper and lower
bounds, which corrected an error in my original draft solution to this
problem.
Termination If at some time an odd number of sensors are active, and from
that point on no sensor changes its state, then some process eventually
sets off an alarm.
Solution
It is feasible to solve the problem for n < 3.
For n = 1, the unique process sets off its alarm as soon as its sensor
becomes active.
For n = 2, have each process send a message to the other containing
its sensor state whenever the sensor state changes. Let s1 and s2 be the
states of the two processes’ sensors, with 0 representing inactive and 1 active,
and let pi set off its alarm if it receives a message s such that s ⊕ si = 1.
This satisfies termination, because if we reach a configuration with an odd
number of active sensors, the last sensor to change causes a message to be
sent to the other process that will cause it to set off its alarm. It satisfies
no-false-positives, because if pi sets off its alarm, then s¬i = s because at
most one time unit has elapsed since p¬i sent s; it follows that s¬i ⊕ si = 1
and an odd number of sensors are active.
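The n = 2 protocol can be sketched as a sequential simulation with immediate message delivery; the `Process` class and method names here are invented for illustration:

```python
# Sketch of the n = 2 alarm protocol: each process reports its sensor
# state to the other whenever it changes, and sets off its alarm when
# the received state s satisfies s XOR s_i = 1 (an odd number of
# active sensors).

class Process:
    def __init__(self):
        self.sensor = 0      # 0 = inactive, 1 = active
        self.alarm = False

    def set_sensor(self, state, other):
        """Change sensor state and notify the other process."""
        if state != self.sensor:
            self.sensor = state
            other.receive(state)

    def receive(self, s):
        # Alarm iff the reported state and our own state differ,
        # i.e., exactly one of the two sensors is active.
        if s ^ self.sensor == 1:
            self.alarm = True

p1, p2 = Process(), Process()
p1.set_sensor(1, p2)         # one active sensor: p2 raises its alarm
```

When both sensors become active, the second state change delivers a matching state and triggers no alarm, matching the no-false-positives argument above.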
• enq(Q) always pushes the identity of the current process onto the tail
of the queue.
• deq(Q) tests if the queue is nonempty and its head is equal to the
identity of the current process. If so, it pops the head and returns
true. If not, it does nothing and returns false.
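The restricted semantics can be rendered as a sequential sketch (the class and method names are mine, not part of the problem statement):

```python
from collections import deque

# Restricted queue: enq always pushes the caller's id; deq succeeds
# only when the caller's own id is at the head of the queue.

class RestrictedQueue:
    def __init__(self):
        self.q = deque()

    def enq(self, pid):
        self.q.append(pid)             # push the caller's identity

    def deq(self, pid):
        if self.q and self.q[0] == pid:
            self.q.popleft()           # pop our own id
            return True
        return False                   # no-op when head isn't the caller
```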
The rationale for these restrictions is that this is the minimal version of
a queue needed to implement a starvation-free mutex using Algorithm 17.2.
What is the consensus number of this object?
Solution
The restricted queue has consensus number 1.
Suppose we have 2 processes, and consider all pairs of operations on Q
that might get us out of a bivalent configuration C. Let x be an operation
carried out by p that leads to a b-valent state, and y an operation by q that
leads to a (¬b)-valent state. There are three cases:
• One enq and one deq operation. Suppose x is an enq and y a deq. If
Q is empty or the head is not q, then y is a no-op: p can’t distinguish
Cx from Cyx. If the head is q, then x and y commute. The same
holds in reverse if x is a deq and y an enq.
• Two enq operations. This is a little tricky, because Cxy and Cyx
are different states. However, if Q is nonempty in C, whichever pro-
cess isn’t at the head of Q can’t distinguish them, because any deq
operation returns false and never reaches the newly-enqueued values.
This leaves the case where Q is empty in C. Run p until it is poised
to do x′ = deq(Q) (if this never happens, p can’t distinguish Cxy
from Cyx); then run q until it is poised to do y′ = deq(Q) as well
(same argument as for p). Now allow both deq operations to proceed
in whichever order causes them both to succeed. Since the processes
can’t tell which deq happened first, they can’t tell which enq happened
first either. Slightly more formally, if we let α be the sequence
of operations leading up to the two deq operations, we’ve just shown
Cxyαx′y′ is indistinguishable from Cyxαy′x′ to both processes.
In all cases, we find that we can’t escape bivalence. It follows that Q can’t
solve 2-process consensus.
Solution
We’ll use a snapshot object a to control access to an infinite array f of fetch-
and-increments, where each time somebody writes to the implemented ob-
ject, we switch to a new fetch-and-increment. Each cell in a holds (timestamp, base),
where base is the starting value of the simulated fetch-and-increment. We’ll
also use an extra fetch-and-increment T to hand out timestamps.
Code is in Algorithm C.1.
Since this is all straight-line code, it’s trivially wait-free.
Proof of linearizability is by grouping all operations by timestamp, us-
ing s[i].timestamp for FetchAndIncrement operations and t for write opera-
tions, then putting write before FetchAndIncrement, then ordering FetchAndIncrement
by return value. Each group will consist of a write(v) for some v followed by
zero or more FetchAndIncrement operations, which will return increasing
procedure FetchAndIncrement()
    s ← snapshot(a)
    i ← argmax_i(s[i].timestamp)
    return FetchAndIncrement(f[s[i].timestamp]) + s[i].base

procedure write(v)
    t ← FetchAndIncrement(T)
    a[myId] ← ⟨t, v⟩

Algorithm C.1: Resettable fetch-and-increment
values starting at v since they are just returning values from the underlying
FetchAndIncrement object; the implementation thus meets the specifica-
tion.
To show consistency with the actual execution order, observe that time-
stamps only increase over time and that the use of snapshot means that
any process that observes or writes a timestamp t does so at a time later
than any process that observes or writes any t0 < t; this shows the group
order is consistent. Within each group, the write writes a[myId] before
any FetchAndIncrement reads it, so again we have consistency between the
write and any FetchAndIncrement operations. The FetchAndIncrement
operations are linearized in the order in which they access the underlying
f [. . . ] object, so we win here too.
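A sequential rendering of Algorithm C.1 may help: here the snapshot object a, the timestamp source T, and the per-timestamp fetch-and-increments f are ordinary Python data, so the sketch shows the logic but not the concurrency.

```python
from collections import defaultdict

class ResettableFAI:
    def __init__(self, n):
        self.a = [(0, 0)] * n         # a[i] = (timestamp, base)
        self.T = 0                    # fetch-and-increment handing out timestamps
        self.f = defaultdict(int)     # "infinite array" of fetch-and-increments

    def fetch_and_increment(self):
        s = list(self.a)                               # snapshot(a)
        i = max(range(len(s)), key=lambda j: s[j][0])  # argmax timestamp
        t, base = s[i]
        v = self.f[t]                 # fetch-and-increment on f[t]
        self.f[t] += 1
        return v + base

    def write(self, my_id, v):
        self.T += 1                   # t <- FetchAndIncrement(T)
        self.a[my_id] = (self.T, v)
```

Each write moves everybody to a fresh underlying fetch-and-increment whose values start at the written base, as in the linearizability argument above.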
Solution
Let b be the box object. Represent b by a snapshot object a, where a[i]
holds a pair (∆wi , ∆hi ) representing the number of times process i has
Solution
The consensus number is ∞; a single lockable register solves consensus for
any number of processes. Code is in Algorithm C.2.
write(r, input)
lock(r)
return read(r)

Algorithm C.2: Consensus using a lockable register
Termination and validity are trivial. Agreement follows from the fact
that whatever value is in r when lock(r) is first called will never change,
and thus will be read and returned by all processes.
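A sequential stand-in for the lockable register and Algorithm C.2 (names invented; in the real object each operation is atomic):

```python
class LockableRegister:
    def __init__(self):
        self.value = None
        self.locked = False

    def write(self, v):
        if not self.locked:        # writes after lock() have no effect
            self.value = v

    def lock(self):
        self.locked = True

    def read(self):
        return self.value

def propose(r, my_input):
    r.write(my_input)              # write our input
    r.lock()                       # freeze whatever is in r now
    return r.read()                # everybody reads the frozen value
```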
Solution
It is possible to solve the problem for all n except n = 3. For n = 1, there are
no non-faulty processes, so the specification is satisfied trivially. For n = 2,
there is only one non-faulty process: it can just keep its own counter and
return an increasing sequence of timestamps without talking to the other
process at all.
For n = 3, it is not possible. Consider an execution in which messages
between non-faulty processes p and q are delayed indefinitely. If the Byzan-
tine process r acts to each of p and q as it would if the other had crashed,
this execution is indistinguishable to p and q from an execution in which r
is correct and the other is faulty. Since there is no communication between
p and q, it is easy to construct an execution in which the specification is
violated.
For n ≥ 4, the protocol given in Algorithm C.3 works.
The idea is similar to the Attiya, Bar-Noy, Dolev distributed shared
memory algorithm [ABND95]. A process that needs a timestamp polls n − 1
other processes for the maximum values they’ve seen and adds 1 to it; before
returning, it sends the new timestamp to all other processes and waits to
receive n − 1 acknowledgments. The Byzantine process may choose not to
answer, but this is not enough to block completion of the protocol.
procedure getTimestamp()
    ci ← ci + 1
    send probe(ci) to all processes
    wait to receive response(ci, vj) from n − 1 processes
    vi ← (maxj vj) + 1
    send newTimestamp(ci, vi) to all processes
    wait to receive ack(ci) from n − 1 processes
    return vi
To show the timestamps are increasing, observe that after the completion
of any call by i to getTimestamp, at least n − 2 non-faulty processes j have
a value vj ≥ vi . Any call to getTimestamp that starts later sees at least
n − 3 > 0 of these values, and so computes a max that is at least as big as
vi and then adds 1 to it, giving a larger value.
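The core of the protocol can be sketched as a toy sequential model, where `skip` is the index of the one process (possibly Byzantine) whose answer we don't wait for; all names are invented for illustration.

```python
# `values` holds the largest timestamp each process has acknowledged.
# getTimestamp polls n - 1 processes, takes the max plus one, and
# disseminates the new timestamp before returning it.

def get_timestamp(values, skip):
    polled = [v for i, v in enumerate(values) if i != skip]  # n-1 responses
    ts = max(polled) + 1               # max seen, plus one
    for i in range(len(values)):       # newTimestamp broadcast + acks
        if i != skip:
            values[i] = max(values[i], ts)
    return ts
```

Even if a different process fails to answer each time, a later call still polls at least n − 3 of the processes updated by an earlier call, so its timestamp is strictly larger.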
Solution
Yes. With f < n/2 and ♦S, we can solve consensus using Chandra-
Toueg [CT96]. Since this gives a unique decision value, it solves k-set
agreement as well.
Solution
Algorithm C.4 implements a counter from a set object, where the counter
read consists of a single call to size(S). The idea is that each increment
is implemented by inserting a new element into S, so |S| is always equal to
the number of increments.
procedure inc(S)
    nonce ← nonce + 1
    add(S, ⟨myId, nonce⟩)

procedure read(S)
    return size(S)

Algorithm C.4: Counter from set object
Appendix D
This appendix contains final exams from previous times the course was of-
fered, and is intended to give a rough guide to the typical format and content
of a final exam. Note that the topics covered in past years were not neces-
sarily the same as those covered this year.
APPENDIX D. ADDITIONAL SAMPLE FINAL EXAMS 327
of your choosing, and that the design of the consensus protocol can depend
on the number of processes N .
Solution
The consensus number is 2.
To implement 2-process wait-free consensus, use a single fetch-and-subtract
register initialized to 1 plus two auxiliary read/write registers to hold the
input values of the processes. Each process writes its input to its own regis-
ter, then performs a fetch-and-subtract(1) on the fetch-and-subtract register.
Whichever process gets 1 from the fetch-and-subtract returns its own input;
the other process (which gets 0) returns the winning process’s input (which
it can read from the winning process’s read/write register.)
To show that the consensus number is at most 2, observe that any
two fetch-and-subtract operations commute: starting from state x, after
fetch-and-subtract(k1 ) and fetch-and-subtract(k2 ) the value in the fetch-
and-subtract register is max(0, x − k1 − k2 ) regardless of the order of the
operations.
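The 2-process protocol above can be sketched sequentially (class and function names are invented):

```python
# A fetch-and-subtract register initialized to 1 decides the winner;
# a dict stands in for the two read/write input registers.

class FetchAndSubtract:
    def __init__(self, value):
        self.value = value

    def fetch_and_subtract(self, k):
        old = self.value
        self.value = max(0, old - k)   # saturating subtraction
        return old

def make_consensus():
    fas = FetchAndSubtract(1)
    inputs = {}                        # stand-in for the read/write registers

    def propose(pid, other, v):
        inputs[pid] = v                # write our input first
        if fas.fetch_and_subtract(1) == 1:
            return v                   # got the 1: decide our own input
        return inputs[other]           # got 0: decide the winner's input
    return propose
```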
Solution
Upper bound
Because there are no failures, we can appoint a leader and have it decide.
The natural choice is some process near the middle, say p⌊(N+1)/2⌋. Upon
receiving an input, either directly through an input event or indirectly from
another process, the process sends the input value along the line toward the
leader. The leader takes the first input it receives and broadcasts it back
out in both directions as the decision value. The worst case is when the
protocol is initiated at pN; then we pay 2(N − ⌊(N + 1)/2⌋) time to send all
messages out and back, which is N time units when N is even and N − 1
time units when N is odd.
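As a sanity check, the worst-case formula can be evaluated directly (a quick sketch, not part of the original solution):

```python
def worst_case_time(N):
    # time for the initiator p_N's value to reach the leader at
    # position floor((N+1)/2) and for the decision to travel back
    return 2 * (N - (N + 1) // 2)

# N even gives N; N odd gives N - 1
print([worst_case_time(N) for N in (4, 5, 6, 7)])  # → [4, 4, 6, 6]
```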
Lower bound
Proving an almost-matching lower bound of N − 1 time units is trivial: if
p1 is the only initiator and it starts at time t0 , then by an easy induction
argument, in the worst case pi doesn’t learn of any input until time t0 + (i − 1),
and in particular pN doesn’t find out until after N − 1 time units. If pN
nonetheless decides early, its decision value will violate validity in some
executions.
But we can actually prove something stronger than this: that N time
units are indeed required when N is even. Consider two slow executions Ξ0
and Ξ1 , where (a) all messages are delivered after exactly one time unit in
each execution; (b) in Ξ0 only p1 receives an input and the input is 0; and
(c) in Ξ1 only pN receives an input and the input is 1. For each of the
executions, construct a causal ordering on events in the usual fashion: a
send is ordered before a receive, two events of the same process are ordered
by time, and other events are partially ordered by the transitive closure of
this relation.
Now consider for Ξ0 the set of all events that precede the decide(0)
event of p1 and for Ξ1 the set of all events that precede the decide(1) event
of pN . Consider further the sets of processes S0 and S1 at which these events
occur; if these two sets of processes do not overlap, then we can construct
an execution in which both sets of events occur, violating Agreement.
Because S0 and S1 overlap, we must have |S0 | + |S1 | ≥ N + 1, and so at
least one of the two sets has size at least ⌈(N + 1)/2⌉, which is N/2 + 1 when
N is even. Suppose that it is S0 . Then in order for any event to occur at
pN/2+1 at all some sequence of messages must travel from the initial input
to p1 to process pN/2+1 (taking N/2 time units), and the causal ordering
In either case, the solution should work for arbitrarily many processes—solving
mutual exclusion when N = 1 is not interesting. You are also not required
in either case to guarantee lockout-freedom.
Solution
1. Disproof: With append registers only, it is not possible to solve mutual
exclusion. To prove this, construct a failure-free execution in which
the processes never break symmetry. In the initial configuration, all
processes have the same state and thus execute either the same read
operation or the same append operation; in either case we let all N
operations occur in some arbitrary order. If the operations are all
reads, all processes read the same value and move to the same new
state. If the operations are all appends, then no values are returned
and again all processes enter the same new state. (It’s also the case
that the processes can’t tell from the register’s state which of the
identical append operations went first, but we don’t actually need to
use this fact.)
2. Since the processes are anonymous, any solution that depends on them
having identifiers isn’t going to work. But there is a simple solution
that requires only appending single bits to the register.
Each process trying to enter a critical section repeatedly executes an
append-and-fetch operation with argument 0; if the append-and-fetch
operation returns either a list consisting only of a single 0 or a list
whose second-to-last element is 1, the process enters its critical section.
To leave the critical section, the process does append-and-fetch(1).
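A sequential sketch of the append-and-fetch register and the entry/exit rule just described (names invented):

```python
# Anonymous mutex from an append-and-fetch register: to enter, append
# a 0 and enter if the returned list is just [0] or its second-to-last
# element is 1; to leave, append a 1.

class AppendAndFetch:
    def __init__(self):
        self.log = []

    def append_and_fetch(self, bit):
        self.log.append(bit)
        return list(self.log)          # contents after the append

def try_enter(r):
    """One attempt; True means the critical section is entered."""
    contents = r.append_and_fetch(0)
    return contents == [0] or contents[-2] == 1

def leave(r):
    r.append_and_fetch(1)
```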
Solution
Pick some leader node to implement the object. To execute an operation,
send the operation to the leader node, then have the leader carry out the
operation (sequentially) on its copy of the object and send the results back.
each i less than k − 1 and a[k − 1] ← v; and (b) returns a snapshot of the
new contents of the array (after the shift).
What is the consensus number of this object as a function of k?
Solution
We can clearly solve consensus for at least k processes: each process calls
shift-and-fetch on its input, and returns the first non-null value in the buffer.
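The k-process protocol can be sketched sequentially (the `RingBuffer` class is a stand-in for the shared shift-and-fetch object):

```python
class RingBuffer:
    def __init__(self, k):
        self.a = [None] * k            # k-element buffer, initially null

    def shift_and_fetch(self, v):
        self.a = self.a[1:] + [v]      # a[i] <- a[i+1]; a[k-1] <- v
        return list(self.a)            # snapshot of the new contents

def propose(buf, v):
    snap = buf.shift_and_fetch(v)
    return next(x for x in snap if x is not None)  # first non-null value
```

Note that a (k+1)-th proposal would shift the first input out of the buffer, which is exactly where the protocol stops working for k + 1 processes.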
So now we want to show that we can’t solve consensus for k+1 processes.
Apply the usual FLP-style argument to get to a bivalent configuration C
where each of the k + 1 processes has a pending operation that leads to
a univalent configuration. Let e0 and e1 be particular operations leading
to 0-valent and 1-valent configurations, respectively, and let e2 . . . ek be the
remaining k − 1 pending operations.
We need to argue first that no two distinct operations ei and ej are
operations of different objects. Suppose that Cei is 0-valent and Cej is
1-valent; then if ei and ej are on different objects, Cei ej (still 0-valent) is
indistinguishable by all processes from Cej ei (still 1-valent), a contradiction.
Alternatively, if ei and ej are both b-valent, there exists some (1−b)-valent ek
such that ei and ej both operate on the same object as ek , by the preceding
argument. So all of e0 . . . ek are operations on the same object.
By the usual argument we know that this object can’t be a register. Let’s
show it can’t be a ring buffer either. Consider the configurations Ce0 e1 . . . ek
and Ce1 . . . ek . These are indistinguishable to the process carrying out ek
(because it sees only the inputs to e1 through ek in its snapshot). So they
must have the same valence, a contradiction.
It follows that the consensus number of a k-element ring buffer is exactly
k.
Solution
First observe that each row and column of the torus is a bidirectional ring,
so we can run e.g. Hirschberg and Sinclair’s O(n log n)-message protocol
within each of these rings to find the smallest identifier in the ring. We’ll
use this to construct the following algorithm:
1. Run Hirschberg-Sinclair in each row to get a local leader for each row;
this takes n × O(n log n) = O(n² log n) messages. Use an additional n
messages per row to distribute the identifier for the row leader to all
nodes and initiate the next stage of the protocol.
2. Run Hirschberg-Sinclair in each column with each node adopting the
row leader identifier as its own. This costs another O(n² log n) messages;
at the end, every node knows the minimum identifier of all nodes
in the torus.
The total message complexity is O(n² log n). (I suspect this is optimal,
but I don’t have a proof.)
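The two-stage structure can be sketched with Python's `min` standing in for a Hirschberg-Sinclair run on each ring (a toy that ignores message passing entirely):

```python
# `ids` is an n x n matrix of identifiers on the torus.

def torus_leader(ids):
    row_leaders = [min(row) for row in ids]  # stage 1: one leader per row
    # stage 2: each column ring contains one node from every row, each
    # carrying its row leader's id, so every column computes the global min
    return min(row_leaders)
```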
3. Give the best lower bound you can on the total message complexity of
the pre-processing and search algorithms in the case above.
Solution
1. Run depth-first search to find the matching key and return the corre-
sponding value back up the tree. Message complexity is O(|E|) = O(n)
(since each node has only O(1) links).
2. Basic idea: give each node a copy of all key-value pairs, then searches
take zero messages. To give each node a copy of all key-value pairs we
could do convergecast followed by broadcast (O(n) message complexity)
or just flood each pair (O(n²)). Either is fine since we don’t care
about the message complexity of the pre-processing stage.
Solution
No protocol for two: turn an anti-consensus protocol with outputs in {0, 1}
into a consensus protocol by having one of the processes always negate its
output.
A protocol for three: Use a splitter.
Solution
Here is an impossibility proof. Suppose there is such an algorithm, and let
it correctly decide “odd” on a ring of size 2k + 1 for some k and some set
of leader inputs. Now construct a ring of size 4k + 2 by pasting two such
rings together (assigning the same values to the leader bits in each copy)
and run the algorithm on this ring. By the usual symmetry argument,
every corresponding process sends the same messages and makes the same
decisions in both rings, implying that the processes incorrectly decide the
ring of size 4k + 2 is odd.
Solution
Disproof: Let s1 and s2 be processes carrying out snapshots and let w1
and w2 be processes carrying out writes. Suppose that each wi initiates a
write of 1 to a[wi], but all of its messages to other processes are delayed
after it updates its own local copy of a[wi]. Now let each si receive responses
from 3n/4 − 1 processes not otherwise mentioned plus wi . Then s1 will
return a vector with a[w1 ] = 1 and a[w2 ] = 0 while s2 will return a vector
with a[w1 ] = 0 and a[w2 ] = 1, which is inconsistent. The fact that these
vectors are also disseminated throughout at least 3n/4 other processes is a
red herring.
Solution
The consensus number is 2. The proof is similar to that for a queue.
To show we can do consensus for n = 2, start with a priority queue with
a single value in it, and have each process attempt to dequeue this value. If
a process gets the value, it decides on its own input; if it gets null, it decides
on the other process’s input.
To show we can’t do consensus for n = 3, observe first that starting from
any state C of the queue, given any two operations x and y that are both
enqueues or both dequeues, the states Cxy and Cyx are identical. This
means that a third process can’t tell which operation went first, meaning
that a pair of enqueues or a pair of dequeues can’t get us out of a bivalent
configuration in the FLP argument. We can also exclude any split involving
two operations on different queues (or other objects). But we still need to
consider the case of a dequeue operation d and an enqueue operation e on
the same queue Q. This splits into several subcases, depending on the state
C of the queue in some bivalent configuration:
1. C = {}. Then Ced = Cd = {}, and a third process can’t tell which of
d or e went first.
to dequeue from Q, then we have already won, since the survivors can’t
distinguish Ced from Cde). Now the state of all objects is the same
after Cedσ and Cdeσ, and only pd and pe have different states in these
two configurations. So any third process is out of luck.
Appendix E
I/O automata
All output actions of the components are also output actions of the
composition. An input action of a component is an input of the composition
only if some other component doesn’t supply it as an output; in this case
Note that infinite (but countable) compositions are permitted.
E.1.5 Fairness
I/O automata come with a built-in definition of fair executions, where an
execution of A is fair if, for each equivalence class C of actions in task(A),
at least one of the following holds:
1. the execution is finite and no action in C is enabled in the final state;
2. the execution is infinite and contains infinitely many occurrences of
actions in C; or
3. the execution is infinite and there are infinitely many states in which
no action in C is enabled.
E.2.1 Example
A property we might demand of the spambot above (or some other ab-
straction of a message channel) is that it only delivers messages that have
previously been given to it. As a trace property this says that in any trace
t, if tk = spam(m), then tj = setMessage(m) for some j < k. (As a set, this
is just the set of all sequences of external spambot-actions that have this
property.) Call this property P .
To prove that the spambot automaton given above satisfies P, we might
argue that for any execution s0a0s1a1 . . . , each si equals the argument m of
the last setMessage action preceding si, or ⊥ if there is no such action. This
is easily proved by induction on i. It then follows, since spam(m) can only
transmit the current state, that any spam(m) action follows some earlier
setMessage(m), as claimed.
However, there are traces that satisfy P that don’t correspond to execu-
tions of the spambot; for example, consider the trace setMessage(0)setMessage(1)spam(0).
This satisfies P (0 was previously given to the automaton spam(0)), but the
automaton won’t generate it because the 0 was overwritten by the later
setMessage(1) action. Whether this indicates a problem with our automaton
not being nondeterministic enough or our trace property being too weak
is a question about what we really want the automaton to do.
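The property P itself is easy to state as a predicate on finite traces (a sketch; actions are modeled as (name, argument) pairs):

```python
# P: in any trace, every spam(m) is preceded by a setMessage(m).

def satisfies_P(trace):
    delivered = set()
    for name, m in trace:
        if name == "setMessage":
            delivered.add(m)
        elif name == "spam" and m not in delivered:
            return False
    return True

# The trace from the text satisfies P, even though the spambot
# automaton itself would never generate it (0 was overwritten).
print(satisfies_P([("setMessage", 0), ("setMessage", 1), ("spam", 0)]))
```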
1. P is nonempty.
2. P is prefix-closed: if xy is in P, then x is in P.
3. P is limit-closed: if every finite prefix of an infinite sequence is in P,
then the sequence itself is in P.
Because of the last two restrictions, it’s enough to prove that P holds for all
finite traces of A to show that it holds for all traces (and thus for all fair
traces), since any trace is a limit of finite traces. Conversely, if there is some
trace or fair trace for which P fails, limit-closure says that P fails on some
finite prefix of the trace, so again looking at only finite prefixes is enough.
The spambot property mentioned above is a safety property.
Safety properties are typically proved using invariants, properties that
are shown by induction to hold in all reachable states.
task(A), then the spambot doesn’t satisfy the liveness property: in an exe-
cution that alternates setMessage(m1 )setMessage(m2 )setMessage(m1 )setMessage(m2 ) . . .
there are infinitely many states in which spam(m1 ) is not enabled, so fairness
doesn’t require doing it even once, and similarly for spam(m2 ).
E.2.3.1 Example
Consider two spambots A1 and A2 where we identify the spam(m) operation
of A1 with the setMessage(m) operation of A2 ; we’ll call this combined action
spam1 (m) to distinguish it from the output actions of A2 . We’d like to
argue that the composite automaton A1 + A2 satisfies the safety property
(call it Pm ) that any occurrence of spam(m) is preceded by an occurrence
of setMessage(m), where the signature of Pm includes setMessage(m) and
spam(m) for some specific m but no other operations. (This is an example
of where trace property signatures can be useful without being limited to
actions of any specific component automaton.)
To do so, we’ll prove a stronger property P′m, which is Pm modified
to include the spam1(m) action in its signature. Observe that P′m is the
intersection of two properties: one saying that any trace that includes
spam1(m) has a previous setMessage(m), and the other saying that any
trace that includes spam(m) has a previous spam1(m). Since these properties
hold for the individual A1 and A2, their product, and thus the restriction
P′m, holds for A1 + A2, and so Pm (as a further restriction) holds for
A1 + A2 as well.
Now let’s prove the liveness property for A1 + A2 , that at least one
occurrence of setMessage yields infinitely many spam actions. Here we
let L1 = {at least one setMessage action ⇒ infinitely many spam1 actions}
and L2 = {at least one spam1 action ⇒ infinitely many spam actions}. The
product of these properties is all sequences with (a) no setMessage actions
or (b) infinitely many spam actions, which is what we want. This product
holds if the individual properties L1 and L2 hold for A1 + A2 , which will be
the case if we set task(A1 ) and task(A2 ) correctly.
E.2.4.1 Example
A single spambot A can simulate the conjoined spambots A1 + A2. Proof: Let
f(s) = (s, s). Then f(⊥) = (⊥, ⊥) is a start state of A1 + A2. Now consider
a transition (s, a, s′) of A; the action a is either (a) setMessage(m), giving
s′ = m; here we let x = setMessage(m)spam1(m), with trace(x) = trace(a)
since spam1(m) is internal, and f(s′) = (m, m) the result of applying x; or (b)
a = spam(m), which does not change s or f(s); the matching x is spam(m),
which also does not change f(s) and has the same trace.
A different proof could take advantage of f being a relation by defining
f(s) = {(s, s′) | s′ ∈ states(A2)}. Now we don’t care about the state of
A2 , and treat a setMessage(m) action of A as the sequence setMessage(m)
in A1 + A2 (which updates the first component of the state correctly) and
treat a spam(m) action as spam1 (m)spam(m) (which updates the second
component—which we don’t care about—and has the correct trace.) In
some cases an approach of this sort is necessary because we don’t know
which simulated state we are heading for until we get an action from A.
Note that the converse doesn’t work: A1 + A2 does not simulate A, since
there are traces of A1 + A2 (e.g. setMessage(0)spam1(0)setMessage(1)spam(0))
that don’t restrict to traces of A. See [Lyn96, §8.5.5] for a more complicated
example of how one FIFO queue can simulate two FIFO queues and vice
versa (a situation called bisimulation).
Since we are looking at traces rather than fair traces, this kind of simula-
tion doesn’t help much with liveness properties, but sometimes the connec-
tion between states plus a liveness proof for B can be used to get a liveness
proof for A (essentially we have to argue that A can’t do infinitely many
actions without triggering a B-action in an appropriate task class). Again
see [Lyn96, §8.5.5].
Bibliography
[AAC09] James Aspnes, Hagit Attiya, and Keren Censor. Max registers,
counters, and monotone circuits. In Proceedings of the 28th An-
nual ACM Symposium on Principles of Distributed Computing,
PODC 2009, Calgary, Alberta, Canada, August 10-12, 2009,
pages 36–45, August 2009.
[AAD+93] Yehuda Afek, Hagit Attiya, Danny Dolev, Eli Gafni, Michael
Merritt, and Nir Shavit. Atomic snapshots of shared memory.
J. ACM, 40(4):873–890, 1993.
[AAG+10] Dan Alistarh, Hagit Attiya, Seth Gilbert, Andrei Giurgiu, and
Rachid Guerraoui. Fast randomized test-and-set and renam-
ing. In Nancy A. Lynch and Alexander A. Shvartsman, editors,
Distributed Computing, 24th International Symposium, DISC
2010, Cambridge, MA, USA, September 13-15, 2010. Proceed-
ings, volume 6343 of Lecture Notes in Computer Science, pages
94–108. Springer, 2010.
[AAGG11] Dan Alistarh, James Aspnes, Seth Gilbert, and Rachid Guer-
raoui. The complexity of renaming. In Fifty-Second Annual
IEEE Symposium on Foundations of Computer Science, pages
718–727, October 2011.
[ABND+90] Hagit Attiya, Amotz Bar-Noy, Danny Dolev, David Peleg, and
Rüdiger Reischuk. Renaming in an asynchronous environment.
J. ACM, 37(3):524–548, 1990.
[ABND95] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing mem-
ory robustly in message-passing systems. Journal of the ACM,
42(1):124–142, 1995.
[AC08] Hagit Attiya and Keren Censor. Tight bounds for asynchronous
randomized consensus. Journal of the ACM, 55(5):20, October
2008.
[ACH10] Hagit Attiya and Keren Censor-Hillel. Lower bounds for ran-
domized consensus under a weak adversary. SIAM J. Comput.,
39(8):3885–3904, 2010.
[ACHS13] Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. Are lock-
free concurrent algorithms practically wait-free? arXiv preprint
arXiv:1311.3200, 2013.
[AE11] James Aspnes and Faith Ellen. Tight bounds for anonymous
adopt-commit objects. In 23rd Annual ACM Symposium on
Parallelism in Algorithms and Architectures, pages 317–324,
June 2011.
[AF01] Hagit Attiya and Arie Fouren. Adaptive and efficient algo-
rithms for lattice agreement and renaming. SIAM Journal on
Computing, 31(2):642–664, 2001.
[AG91] Yehuda Afek and Eli Gafni. Time and message bounds for
election in synchronous and asynchronous complete networks.
SIAM Journal on Computing, 20(2):376–394, 1991.
[AGTV92] Yehuda Afek, Eli Gafni, John Tromp, and Paul M. B. Vitányi.
Wait-free test-and-set (extended abstract). In Adrian Segall
and Shmuel Zaks, editors, Distributed Algorithms, 6th Inter-
national Workshop, WDAG ’92, Haifa, Israel, November 2-4,
1992, Proceedings, volume 647 of Lecture Notes in Computer
Science, pages 85–94. Springer, 1992.
[AHM09] Hagit Attiya, Eshcar Hillel, and Alessia Milani. Inherent lim-
itations on disjoint-access parallel implementations of trans-
actional memory. In Friedhelm Meyer auf der Heide and
Michael A. Bender, editors, SPAA 2009: Proceedings of the
21st Annual ACM Symposium on Parallelism in Algorithms
and Architectures, Calgary, Alberta, Canada, August 11-13,
2009, pages 69–78. ACM, 2009.
[AHS94] James Aspnes, Maurice Herlihy, and Nir Shavit. Counting net-
works. Journal of the ACM, 41(5):1020–1048, September 1994.
[AHW08] Hagit Attiya, Danny Hendler, and Philipp Woelfel. Tight RMR
lower bounds for mutual exclusion and other problems. In Pro-
ceedings of the 40th annual ACM symposium on Theory of com-
puting, STOC ’08, pages 217–226, New York, NY, USA, 2008.
ACM.
BIBLIOGRAPHY 351
[AKP+06] Hagit Attiya, Fabian Kuhn, C. Greg Plaxton, Mirjam Watten-
hofer, and Roger Wattenhofer. Efficient adaptive collect using
randomization. Distributed Computing, 18(3):179–188, 2006.
[AM99] Yehuda Afek and Michael Merritt. Fast, wait-free (2k − 1)-
renaming. In PODC, pages 105–112, 1999.
[Att14] Hagit Attiya. Lower bounds and impossibility results for transactional memory computing. Bulletin of the European Association for Theoretical Computer Science, 112:38–52, February 2014.
[Bel03] S. Bellovin. The Security Flag in the IPv4 Header. RFC 3514
(Informational), April 2003.
[Cha93] Soma Chaudhuri. More choices allow more faults: Set consen-
sus problems in totally asynchronous systems. Inf. Comput.,
105(1):132–158, 1993.
[CIL94] Benny Chor, Amos Israeli, and Ming Li. Wait-free consensus
using asynchronous hardware. SIAM J. Comput., 23(4):701–
712, 1994.
[FHS98] Faith Ellen Fich, Maurice Herlihy, and Nir Shavit. On the space
complexity of randomized synchronization. J. ACM, 45(5):843–
862, 1998.
[FHS05] Faith Ellen Fich, Danny Hendler, and Nir Shavit. Linear lower
bounds on real-world implementations of concurrent objects.
In Foundations of Computer Science, Annual IEEE Sympo-
sium on, pages 165–173, Los Alamitos, CA, USA, 2005. IEEE
Computer Society.
[FL06] Rui Fan and Nancy A. Lynch. An Ω(n log n) lower bound on the
cost of mutual exclusion. In Eric Ruppert and Dahlia Malkhi,
editors, Proceedings of the Twenty-Fifth Annual ACM Sym-
posium on Principles of Distributed Computing, PODC 2006,
Denver, CO, USA, July 23-26, 2006, pages 275–284. ACM,
2006.
[FLMS05] Faith Ellen Fich, Victor Luchangco, Mark Moir, and Nir
Shavit. Obstruction-free algorithms can be practically wait-
free. In Pierre Fraigniaud, editor, Distributed Computing,
19th International Conference, DISC 2005, Cracow, Poland,
September 26-29, 2005, Proceedings, volume 3724 of Lecture
Notes in Computer Science, pages 78–92. Springer, 2005.
[GW12a] George Giakkoupis and Philipp Woelfel. On the time and space
complexity of randomized test-and-set. In Darek Kowalski and
Alessandro Panconesi, editors, ACM Symposium on Principles
of Distributed Computing, PODC ’12, Funchal, Madeira, Por-
tugal, July 16-18, 2012, pages 19–28. ACM, 2012.
[JTT00] Prasad Jayanti, King Tan, and Sam Toueg. Time and space
lower bounds for nonblocking implementations. SIAM J. Com-
put., 30(2):438–456, 2000.
[NT87] Gil Neiger and Sam Toueg. Substituting for real time and
common knowledge in asynchronous distributed systems. In
Proceedings of the sixth annual ACM Symposium on Principles
of distributed computing, PODC ’87, pages 281–293, New York,
NY, USA, 1987. ACM.
[NW98] Moni Naor and Avishai Wool. The load, capacity, and avail-
ability of quorum systems. SIAM J. Comput., 27(2):423–447,
1998.
[Oka99] Chris Okasaki. Purely Functional Data Structures. Cambridge
University Press, 1999.
[Pet81] Gary L. Peterson. Myths about the mutual exclusion problem.
Inf. Process. Lett., 12(3):115–116, 1981.
[Pet82] Gary L. Peterson. An O(n log n) unidirectional algorithm for
the circular extrema problem. ACM Trans. Program. Lang.
Syst., 4(4):758–762, 1982.
[PF77] Gary L. Peterson and Michael J. Fischer. Economical solu-
tions for the critical section problem in a distributed system
(extended abstract). In John E. Hopcroft, Emily P. Fried-
man, and Michael A. Harrison, editors, Proceedings of the 9th
Annual ACM Symposium on Theory of Computing, May 4-6,
1977, Boulder, Colorado, USA, pages 91–97. ACM, 1977.
[Plo89] S. A. Plotkin. Sticky bits and universality of consensus. In
Proceedings of the eighth annual ACM Symposium on Princi-
ples of distributed computing, PODC ’89, pages 159–175, New
York, NY, USA, 1989. ACM.
[PSL80] M. Pease, R. Shostak, and L. Lamport. Reaching agreement
in the presence of faults. Journal of the ACM, 27(2):228–234,
April 1980.
[PW95] David Peleg and Avishai Wool. The availability of quorum
systems. Inf. Comput., 123(2):210–223, 1995.
[PW97a] David Peleg and Avishai Wool. The availability of crumbling
wall quorum systems. Discrete Applied Mathematics, 74(1):69–
83, 1997.
[PW97b] David Peleg and Avishai Wool. Crumbling walls: A class of
practical and efficient quorum systems. Distributed Computing,
10(2):87–97, 1997.
[RST01] Yaron Riany, Nir Shavit, and Dan Touitou. Towards a practical
snapshot algorithm. Theor. Comput. Sci., 269(1-2):163–201,
2001.
INDEX 365
barrier
    memory, 4
beta synchronizer, 32
BFS, 29
BG simulation, 253
    extended, 257
big-step, 116
binary consensus, 68
biologically inspired systems, 4
birthday paradox, 220
bisimulation, 346
bit complexity, 117
bivalence, 82
bivalent, 82
Borowsky-Gafni simulation, 253
bounded, 17
bounded bypass, 126
bounded fetch-and-subtract, 326
bounded wait-free, 278
breadth-first search, 29
broadcast
    reliable, 99
    terminating reliable, 102
busy-waiting, 116
Byzantine agreement, 69
    weak, 72
Byzantine failure, 3, 69

cache-coherent, 139
capacity, 105
CAS, 149
causal ordering, 47
causal shuffle, 48
Chandra-Toueg consensus protocol, 98
channel
    FIFO, 12
chemical reaction networks, 4
chromatic subdivision, 272
class G, 245
client, 11
client-server, 11
clock
    logical, 47
    Lamport, 49
    Neiger-Toueg-Welch, 50
coherence, 199
collect, 124, 157
    adaptive, 221
    coordinated, 171
colorless task, 257
common node, 75
common2, 190
communication pattern, 64
commuting object, 190
commuting operations, 147
comparability, 163
comparators, 222
compare-and-swap, 4, 117, 149
comparison-based algorithm, 44
complement, 107
completeness, 93, 190
complex
    input, 263
    output, 263
    protocol, 272
    simplicial, 260
complexity
    bit, 117
    message, 15
    obstruction-free step, 236
    space, 117
    step, 14
        individual, 14, 116
        per-process, 116
        total, 14, 116
    time, 14, 116
composite register, 158
computation event, 9
conciliator, 200
concurrency detector, 300
triangulation, 266
trying, 125
Two Generals, 5, 16
two-writer sticky bit, 302
unidirectional ring, 37
uniform, 43
univalent, 82
universality of consensus, 154
unknown-bound semisynchrony, 239
unsafe, 254
upward validity, 163