Fault Tolerant THESIS
BY
TAYLOR T. JOHNSON
THESIS
Urbana, Illinois
Adviser:
ACKNOWLEDGMENTS
First and foremost, I thank my adviser, Sayan Mitra, for countless hours
spent solving problems with me, teaching me what research is and is not,
helping me to improve my research skills, and being a superb and always
supportive mentor. Without his advice and help this thesis would not have
been realized.
With equal importance, I thank my family—especially Mom, Dad, and
Brock—for without them I would not be here. I also especially thank my
cousin Tommy Hoherd for his support, which has helped to make graduate
school a reality for me.
I would like to thank all teachers everywhere, but specifically the ones
who have taught me, particularly Mark Capps, Paul Hester, Yih-Chun
Hu, Viraj Kumar, Daniel Liberzon, Pat Nelson, Lui Sha, James Taylor,
Nitin Vaidya, Jan Bigbee Weesner, and Geoff Winningham. I give special
thanks to undergraduate advisers who encouraged me to pursue graduate
studies, including Brent Houchens, Fathi Ghorbel, Dung Nguyen, and
Karthik Mohanram, as well as my advisers from Schlumberger, Albert
Hoefel and Peter Swinburne.
Without friends to let loose and relax with on occasion, life would
be boring, so I acknowledge Alan Gostin, Brian Proulx, Daniel Rein-
hardt, Emily Williams, Frank Havlak, John Stafford, Josh Langsfeld, Navid
Aghasadeghi, Paul Rancuret, Rakesh Kumar, Sarah Lohman, Stanley Bak,
among many others, especially my friends and fellow lab mates, Bere
Carrasco, Karthik Manamcheri, and Sridhar Duggirala. I acknowledge a
recently acquired friend, Ellen Prince, for providing both motivation and
distraction in the final phase of the thesis.
Lastly, I acknowledge anyone else I interact with whom I may have made
the unfortunate mistake of forgetting to mention.
TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Failures in DCPS
1.2 Modeling Techniques
1.3 Approach for Achieving Fault-Tolerance in DCPS
1.4 Case Studies
1.5 Key Insights
1.6 Thesis Organization

4.3.3 Progress of Entities Towards the Target
4.4 Simulation
4.5 Conclusion

CHAPTER 6 CONCLUSION
6.1 Future Work
6.2 Conclusions

REFERENCES
CHAPTER 1
INTRODUCTION
1.1 Failures in DCPS
A fundamental issue in the design of distributed computing systems is
ensuring reliable operation even though the systems are composed of
unreliable components. Similarly, designs for reliable DCPS must take into account failures
of all their components, which include the computers, software, and com-
munication channels as in distributed computing systems, and addition-
ally, sensors and actuators.
Even when considering only distributed computing systems, there are
broad classes of failures. A computer can fail, as when it crashes and never
makes another transition. Additionally, failures could occur somewhere
in the communication channel between computers, as a result of which
messages are lost, delivered out of order, or corrupted. When considering
DCPS, failures may also occur in sensors or actuators, such as an actuator
becoming stuck at some value forever. All of the previous distributed com-
puting failures may be applicable, as are failures of the agents’ components
which interact with the physical environment.
Broadly, as a result of this thesis, we believe that there are three classes
of failures based on the location of the failure:
(a) cyber failures: failures in the hardware or software of the agents’ com-
puters,
(b) physical failures: failures in the agents’ interfaces to the physical world,
such as sensors and actuators, and
Failures from one of these classes can now manifest as a behavior in another
domain, such as a cyber failure of a mobile robot resulting in a collision
between the robot and another adjacent robot in the physical world.
[Figure: two agents, each composed of computer(s), sensor(s), and actuator(s), exchanging messages (msgs) over a communication channel.]
Either
(a) the expressiveness of the model must be expanded to capture the continuous dynamics of the physical environment, or
(b) assumptions about the behavior of the agents within the physical environment must be made.
For instance, the first avenue of expanding the expressiveness of the model
may be accomplished through the use of timed automata [13], timed in-
put/output automata (TIOA) [14], hybrid automata [15, 16], or hybrid in-
put/output automata (HIOA) [17].
In this thesis, however, the second route is more frequently traversed:
the continuous dynamics are abstracted in such a way that they may be
treated as discrete transitions. The use of shared variables and synchrony
simplifies the analysis of the distributed computation, and discrete
abstractions of continuous behavior simplify the analysis of the dynamical
systems.
state are involved. For instance, with only software state, a standard recov-
ery procedure is to reset, but this has no reasonable analogy for physical
state.
The approach taken by this thesis is the following. A definition of
fault-tolerance for DCPS is given as stabilization in Chapter 2. Roughly,
without failures, a DCPS remains in a set of legal states that satisfy some
desired system property. Note that the set of legal states is the only set
from which progress can be made towards satisfying the desired system
property. However, upon failure events occurring, the DCPS may leave
the set of legal states and go into a set of illegal states.
Synchrony is assumed so that the actions of all the agents are composed
into a single discrete transition system with a synchronous update that
modifies the state of all the agents in the system based on each agent's
local state and the states of adjacent agents. Failures are represented as
events which may modify the state of some agents. When failure events
stop occurring, and without any other event occurring aside from the
synchronous system update for all agents, the DCPS may or may not be
guaranteed to eventually return to the set of legal states. If it can be
guaranteed that the DCPS automatically returns to the set of legal states
without any event other than the synchronous update, then the DCPS is
said to be self-stabilizing. However, it may be necessary for the DCPS
to rely on a failure detector for the occurrence of a failure to be realized,
where the failure detector is modeled as an event. Upon detecting such
a failure, the DCPS may then perform some mitigation routine in the
synchronous update which allows it to return to the set of legal states. This
second case of converting a non-self-stabilizing DCPS to a self-stabilizing
DCPS is analogous to the use of a stabilizer—a failure detector and a
method for state reset to ensure eventual progress—for converting a non-
self-stabilizing algorithm to a self-stabilizing one [4]. Both of these cases
allow the DCPS to make progress towards satisfying the desired system
property. See Figure 1.2 for a graphical depiction.
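The recovery behavior described above can be illustrated with a small, hypothetical Python sketch. It is not the algorithm of any chapter of this thesis; the line-graph topology, the distance-to-target state, and all names are invented for illustration. Each agent's legal state is its true hop distance to a target; the synchronous update recomputes every state from the neighbors' states, and after a failure event corrupts one agent's state, failure-free rounds alone return the system to the set of legal states:

```python
from collections import deque

def true_distances(neighbors, target):
    """Reference BFS hop distances to the target; these define the legal set."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def synchronous_update(state, neighbors, target):
    """One synchronous round: every agent recomputes its state from its
    neighbors' states; the target always reports distance zero."""
    return {i: 0 if i == target else 1 + min(state[j] for j in neighbors[i])
            for i in state}

# Line graph 0-1-2-3-4 with the target at cell 0.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
legal = true_distances(neighbors, 0)

state = dict(legal)
state[2] = 7           # a failure event corrupts agent 2's local state
for _ in range(10):    # failure-free rounds: only the synchronous update runs
    state = synchronous_update(state, neighbors, 0)

assert state == legal  # the DCPS has returned to the set of legal states
```

Because recovery here requires no event other than the synchronous update, this toy system is self-stabilizing in the sense used above; no failure detector is needed.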
[Figure 1.2: diagram showing the sets of Legal and Illegal states with transitions labeled normal, failure, SS, and DM.]
Figure 1.2: Under normal operation the DCPS state remains in the set of
Legal states, but upon failures, the state of the DCPS may enter the set of
Illegal states. The SS labels indicate that self-stabilization can be achieved
from these states. The SS arrow on the left indicates that once failure
events cease, the system automatically returns to Legal states. The arrow
on the right labeled DM indicates a transition of the failure detector,
which must occur before the DCPS can recover to the set of Legal states
from the set of Illegal states.
plane where the movements of all entities (vehicles) within each partition
(cell) are tightly coupled. Each of these cells is controlled by a computer.
A self-stabilizing algorithm, called a distributed traffic control protocol, is
presented which guarantees
(a) minimum separation between vehicles at all times, even when some
cells’ control software may fail permanently by crashing, and
(b) once failures cease, a route to the target cell stabilizes and the vehicles
with feasible paths to the target cell make progress towards it.
However, some agents' actuators may fail permanently and become stuck
at a value, causing the failed agents to move forever according to this
value. Without the use of failure detection and mitigation, the algorithm
is fault-intolerant and critical system properties like avoiding collisions or
ensuring progress to the flock or goal may be violated. Thus, the algorithm
incorporates failure detection, when it is possible, for which the detection
time is of the same order as the number of rounds it takes for the agents to
reach the set of states which satisfy flocking. Then upon detecting failed
agents, non-failed agents migrate to adjacent lanes to avoid collisions and
to make progress towards the flock and destination.
1.5 Key Insights
The main contribution of this thesis is the general method of using self-
stabilization to ensure fault-tolerance of DCPS, and the general method
for converting non-self-stabilizing DCPS to self-stabilizing ones by use
of a failure detector. The thesis relies on two case studies which utilize
these techniques to ensure correct operation in spite of failures of agents’
components in the cyber and physical domains.
As a discussion point, in the distributed traffic control problem, a failure
detector is implicitly provided by agents no longer reporting their distances
to the target. While this problem does not require a failure detector to
ensure self-stabilization, it inherently has a method for detecting failures
by virtue of the synchronous update transition. Because the algorithm
used to locate the destination is self-stabilizing, returning to a state from
which vehicles can make progress to the destination occurs automatically.
In the distributed flocking problem, failure detection is explicitly pro-
vided by a failure detector. Then an additional mechanism is incorporated
by the synchronous update transition of the system, which allows all non-
faulty agents to (a) avoid collisions, (b) avoid falsely following agents
which are not moving towards states which satisfy the flocking condition,
and (c) avoid falsely following agents moving away from the destination.
These case studies show that it is possible to utilize stabilization-based
methods for achieving fault-tolerance in DCPS which are analogous to
those used for ensuring fault-tolerance in distributed computing systems.
Specifically, the distributed traffic control algorithm shows that it is possi-
ble to develop a self-stabilizing DCPS which automatically recovers from
failures. The distributed flocking case study shows that it is possible to
develop a self-stabilizing DCPS by combining a non-self-stabilizing DCPS
with a failure detector.
Chapter 2 gives a definition of what it formally means for a system to
exhibit fault-tolerance. Chapter 3 presents a literature review, primarily
from the field of fault-tolerance for distributed systems, but also briefly
mentions fault tolerance from related fields. Chapter 4 presents the first case study, the distributed cellular
flows problem, in which a graph is given representing a network of roads
or waypoints, along which some physical entities such as vehicles travel.
Chapter 5 presents the second case study, the safe flocking problem on
lanes, in which a group of mobile agents form a roughly equally spaced
distribution and travel towards a destination without collision. Each of
the case studies utilizes fault tolerance to ensure correct operation of the
DCPS in spite of failures. Chapter 6 presents future directions for work and
concludes the thesis.
CHAPTER 2
2.1 Preliminaries
The sets of natural, real, positive real, and nonnegative real numbers are
denoted by N, R, R+, and R≥0, respectively. For K ∈ N, [K] denotes the set
{0, . . . , K}. For a set K, let K⊥ ≜ K ∪ {⊥} and K∞ ≜ K ∪ {∞}.
A variable is a name with an associated type. For a variable x, its type is
denoted by type(x) and it is the set of values that it can take. A valuation
for a set of variables X, denoted by x, is a function that maps each x ∈ X
to a point in type(x). Given a valuation x for X, the valuation for a variable
v ∈ X, denoted by x.v, is the restriction of x to {v}. The set of all possible
valuations of X is denoted by val(X).
Example 2.1. For example, consider a DCPS of three mobile robots positioned on
the Euclidean plane. Each robot has an identifier i in the set {0, 1, 2}. A variable
for each robot could be the position pi of that robot, each of which has a type of R2 .
The set of variables X is {p0 , p1 , p2 }. The valuation x for X is a function that maps
each of p0 , p1 , and p2 to a point in R2 , that is, a function which maps the position
of each robot to a point in the plane.
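The notions of variables, types, and valuations translate directly into code. The following minimal Python sketch mirrors Example 2.1; the specific coordinates are invented for illustration:

```python
# A valuation x for the variable set X maps each variable to a value of its
# type; here each variable p0, p1, p2 has type R^2, modeled as a float pair.
X = {"p0", "p1", "p2"}

x = {"p0": (0.0, 0.0), "p1": (1.0, 0.0), "p2": (0.5, 0.5)}  # a valuation for X

assert set(x) == X                           # x is defined on every x in X
assert all(len(v) == 2 for v in x.values())  # each value is a point in R^2

# x.p1, the restriction of x to {p1}, picks out robot 1's position.
assert x["p1"] == (1.0, 0.0)
```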
[Figure: two interacting agents, Agent i and Agent j.]
occur between update transitions. There may be other transitions in A
which update other states—potentially of individual agents—of System.
Example 2.3. Continuing with the example of three mobile robots in the Euclidean
plane, the set of variables X is {p0 , p1 , p2 }. If initially the robots are specified to start
with position variables in a closed unit-circle in the Euclidean plane, then Q0 is
{(xi , yi ) : x2i + y2i ≤ 1, i ∈ {0, 1, 2}} where pi = (xi , yi ). Then, presume the variables
pi are used to coordinate the robots to form some shape in the plane. Specifically,
the coordination among the robots could allow them to form an equilateral triangle
where the distance between any two of the three robots is equal to some constant
s. For simplicity assume this triangle has corners in the set {(0, 0), (s, 0), (s/2, √3s/2)}.
Then, update could specify that each of the variables pi is set to an element in this
set of corners.
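Example 2.3 can be sketched in Python as follows. The side length, initial positions, and the assignment of robot i to corner i are invented details; the sketch only illustrates that a single update transition takes valuations from the initial set Q0 to the triangle configuration:

```python
import math

s = 1.0  # the prescribed side length (an arbitrary choice here)
corners = [(0.0, 0.0), (s, 0.0), (s / 2, math.sqrt(3) * s / 2)]

def update(positions):
    """Synchronous update in the spirit of Example 2.3: robot i moves to corner i."""
    return {i: corners[i] for i in positions}

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Initial positions lie inside the closed unit circle (the set Q0).
x = {0: (0.3, 0.1), 1: (-0.5, 0.2), 2: (0.0, -0.9)}
x = update(x)

# After the update, the robots form an equilateral triangle of side s.
assert all(abs(dist(x[i], x[j]) - s) < 1e-9
           for i, j in [(0, 1), (1, 2), (0, 2)])
```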
the executions are indistinguishable. Indistinguishability of executions is
frequently used in lower-bound proofs on the amount of time required for
the states of System to satisfy some property.
In distributed systems, two kinds of properties are of paramount impor-
tance. A safety property captures the notion that some “bad” property is
never satisfied, such as processors agreeing on incorrect values in consen-
sus. Equivalently, it means that some “good” property is always satisfied.
Safety properties are generally established by use of a potentially simpler
invariant, or several invariants each of which successively refines the state
space. A liveness property captures the notion that some “good” property
will eventually be satisfied, such as processors eventually agreeing on a
common value in consensus. However, it is not known, given a state, how
far in the future the good property is satisfied. For correct algorithms, one
would like to have termination, but this is not always possible. A progress
property is a stronger notion than liveness: it guarantees a priori that, given
any state, there is a constant amount of time k within which the good
property is satisfied.
System is stable with respect to a set S ⊆ val(X) if, for each transition
x →a x′, x ∈ S implies that x′ ∈ S. System is invariant with respect to a set S ⊆ val(X) if
all reachable states are contained in S, that is ReachSystem ⊆ S. System is
said to stabilize to S if S is stable and every execution fragment has a suffix
ending with a state in S.
A predicate P defines a set of states SP ⊆ val(X). If the set S defined
by a predicate P is respectively stable or invariant, then the predicate is
respectively called stable or invariant.
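For finite-state systems these definitions can be checked exhaustively. The following hypothetical sketch (states, edges, and initial set are invented) tests stability of a set directly from the definition and invariance via reachability:

```python
# A tiny finite transition system: transitions[x] lists the states x' with
# a transition x -> x'; Q0 is the set of initial states.
transitions = {0: [1], 1: [2], 2: [2], 3: [0]}
Q0 = {0}

def reachable(Q0, transitions):
    """All states reachable from Q0 (the set Reach_System)."""
    seen, frontier = set(Q0), list(Q0)
    while frontier:
        x = frontier.pop()
        for x2 in transitions[x]:
            if x2 not in seen:
                seen.add(x2)
                frontier.append(x2)
    return seen

def is_stable(S, transitions):
    """S is stable iff every transition from a state in S stays in S."""
    return all(x2 in S for x in S for x2 in transitions[x])

assert is_stable({2}, transitions)              # {2} is stable
assert not is_stable({0, 1}, transitions)       # the transition 1 -> 2 leaves it
assert reachable(Q0, transitions) <= {0, 1, 2}  # {0, 1, 2} is invariant
```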
The standard method to show that some predicate P is an invariant is
by induction on the number of completed rounds k in some execution α,
beginning with a base case of k = 0. Such assertional reasoning is used to
establish properties of the DCPS. Similarly, compositional reasoning about
one, or few, of the agents in a composite System is employed to simplify
establishing properties of the entire System. Finally, hierarchical proofs
involving a successive refinement of invariants are also used.
2.3 Failure Model
A faili transition represents the failure due to some exogenous event of the
ith agent in System, where i ∈ ID. This transition sets the variable failedi
to true permanently—it may never be set to false once being set to true—
and may have other effects depending on the failure model considered.
For instance if an actuator failure occurs, faili may set velocity of i to be a
constant.
For a state x, let

F(x) ≜ {i ∈ ID : x.failedi = true}

be the set of failed identifiers and

NF(x) ≜ ID \ F(x)

be the set of non-faulty identifiers. The terminology failed agent and non-
faulty agent refers to the agents with identifiers in the failed identifiers or
non-faulty identifiers, respectively.
Let

ANF ≜ A \ {faili}
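A minimal encoding of this failure model, with the state representation invented for illustration, might look as follows. Each agent carries a failed flag that the fail transition sets permanently, and the sets F(x) and NF(x) are computed directly from the definitions above:

```python
# Hypothetical state: each agent i in ID carries a permanent failed_i flag.
ID = {0, 1, 2, 3}
state = {i: {"failed": False} for i in ID}

def fail(x, i):
    """The fail_i transition: failed_i is set to true and never cleared."""
    x[i]["failed"] = True

def F(x):
    """Failed identifiers: F(x) = {i in ID : x.failed_i = true}."""
    return {i for i in ID if x[i]["failed"]}

def NF(x):
    """Non-faulty identifiers: NF(x) = ID \\ F(x)."""
    return ID - F(x)

fail(state, 1)
fail(state, 3)
assert F(state) == {1, 3}
assert NF(state) == {0, 2}
```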
2.3.2 Self-Stabilizing DCPS
Define self-stabilization as follows [18].
Definition 2.4. If S is a stable set of states for System, called the legal states,
then System is self-stabilizing for S if and only if there exists a set of states T
for System, called the illegal states, such that
(i) S ⊆ T,
(ii) T is invariant,
(iii) S is stable for any failure-free execution, that is, for any transition in ANF ,
(iv) There exists a reachable state in S along any failure-free execution fragment
α which begins with any state in T.
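Definition 2.4 can be exercised on a deliberately tiny, hypothetical instance (the states, update rule, and sets S and T are all invented). The failure-free update decrements the state towards 0, so every condition of the definition can be checked exhaustively:

```python
# A tiny instance of Definition 2.4 checked by enumeration.
def update(x):          # the only transition in A_NF in this sketch
    return max(0, x - 1)

T = {0, 1, 2, 3}        # illegal (but recoverable) states: invariant
S = {0}                 # legal states

assert S <= T                           # condition (i):  S is a subset of T
assert all(update(x) in T for x in T)   # condition (ii): T is invariant
assert all(update(x) in S for x in S)   # condition (iii): S is stable for A_NF

def recovers(x, bound=len(T)):
    """Condition (iv): a failure-free fragment from x reaches a state in S."""
    for _ in range(bound):
        if x in S:
            return True
        x = update(x)
    return x in S

assert all(recovers(x) for x in T)      # the sketch system is self-stabilizing
```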
[Figure: the sets Q0, S, and T of Definition 2.4, with failure transitions (αf) leaving S and failure-free transitions (αnf) returning to S.]
the shared variables.
The failure detector algorithm could be implemented by some other
external oracle. Such a failure detector must still rely solely on the in-
formation communicated by the agents. However, this does not prevent
such a failure detector from keeping a history of all messages sent in an
execution.
Along these lines, the failure detector for System, if one is necessary, is
modeled through a special transition called suspecti ∈ A for each agent in
System, and agent i has a state variable called Suspectedi , which is the set
of other agent identifiers which agent i believes to be failed.
There is a precondition on the suspecti transition being executed, which
may be a predicate on the states of System. Additionally, the precondition
quantifies over an agent j being checked for failure. It is assumed that upon
this precondition being satisfied, the transition is taken. Upon execution
of suspecti , agent i adds agent j to the set Suspectedi . See Chapter 5 for an
example of this.
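A sketch of a suspect transition might look as follows. The precondition here compares each agent j's commanded move against its actual move, in the spirit of the stuck-at actuator detection of Chapter 5; the field names and observation structure are invented for illustration:

```python
# Hypothetical suspect_i transition: the precondition is a predicate on the
# observable states of System, quantified over an agent j being checked.
def precondition(obs, j):
    """Enabled when agent j commanded a nonzero move but did not move."""
    return obs[j]["commanded_move"] != 0 and obs[j]["actual_move"] == 0

def suspect(agent, obs):
    """Once enabled, the transition is taken: j joins Suspected_i."""
    for j in obs:
        if j != agent["id"] and precondition(obs, j):
            agent["Suspected"].add(j)

agent0 = {"id": 0, "Suspected": set()}
obs = {1: {"commanded_move": 1, "actual_move": 0},   # stuck actuator
       2: {"commanded_move": 1, "actual_move": 1}}   # behaving normally
suspect(agent0, obs)
assert agent0["Suspected"] == {1}
```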
2.4 Conclusion
This chapter introduced a model for DCPS, a definition of stabilization, and
failure detectors. Through the use of stabilization, fault-tolerant DCPS can
be constructed which allow discussion of invariant sets of states describing
safety properties and stable sets of states under normal system operation
describing progress properties. This is the traditional use of the term
stabilization [18]. The key difference in the use of stabilization and failure
detectors in DCPS is that physical state may now provide a means of
identifying when the system is behaving badly.
The case study in Chapter 5 exemplifies this point through the creation of
a fault-tolerant DCPS by combining a fault-intolerant DCPS with a failure
detector, which allows for the DCPS to satisfy stabilization. Specifically
the failure detector relies on physical state—the position of an agent—and
cyber state—a computed position where an agent would like to move—
to detect that an agent's actuators have failed. Upon all failures being
detected, an invariant safety property (the set T in Definition 2.4) in the
physical domain is satisfied which is that collisions do not occur, and that
eventually states are reached from which progress can be made (the set S
in Definition 2.4). As this case study exemplifies, we believe it is possible
to develop general methods for designing fault-tolerant DCPS.
CHAPTER 3
RELATED WORK
The related work summarized in this chapter addresses failure classes and
models, as well as methods for ensuring fault-tolerant operation of sys-
tems. Fault-tolerance has been widely studied in a variety of engineering
and computer science disciplines related to the work of this thesis, such
as control theory [21–24], reliability [25], artificial intelligence [26], dis-
tributed computing systems [1–4, 18, 20, 27, 28], embedded and real-time
systems [29], and combinations of these [30].
by physical laws, there exists the possibility of developing smarter failure
detection algorithms.
A crash failure is modeled as an agent ceasing to take transitions, and
if the crash is not clean, then at the agent’s final step, it might succeed
in sending only a subset of the messages it was supposed to send. A
Byzantine failure is modeled as agents changing software state arbitrarily
and sending messages with arbitrary content; note that continuous state
is not included, as arbitrary changes of continuous state could require
infinite amounts of energy to complete in finite time. A classical result
from distributed computing is that for many problems, such as consensus,
f crash failures can be tolerated with at least f + 1 agents in f + 1 rounds,
and that t Byzantine failures can be tolerated in t + 1 rounds with at least 3t + 1
agents. Furthermore, in a combined failure model where both crash and
Byzantine failures can occur, where f are crash failures and t are Byzantine
failures, in f +t+1 rounds, at least 3t+ f +1 agents suffice to solve consensus
in a synchronous setting [11].
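These resource bounds can be restated as a pair of trivial helper functions; the function names are invented, and the formulas simply transcribe the combined crash-and-Byzantine bounds quoted above from [11]:

```python
# Synchronous consensus with f crash and t Byzantine failures:
# f + t + 1 rounds, at least 3t + f + 1 agents.
def rounds_needed(f, t):
    return f + t + 1

def min_agents(f, t):
    return 3 * t + f + 1

assert rounds_needed(2, 0) == 3    # crash-only: f + 1 rounds
assert min_agents(0, 1) == 4       # Byzantine-only: 3t + 1 agents
assert min_agents(1, 2) == 8       # combined model: 3t + f + 1 agents
```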
Physical processes may have failures with regard to sensors, actuators,
and control surfaces [21] which may affect both physical state and software
state. Actuators and sensors may become stuck at a certain value, although
it should be noted that one can utilize physical constraints such as satu-
ration to limit the effect such a fault has on a system. That is to say that
the actuator and sensors’ behaviors are constrained due to physical limita-
tions, which may prove useful in detecting and mitigating faults: they do
not have the ability to behave arbitrarily badly like Byzantine failures in the
cyber domain.
in an arbitrary state, but stop occurring after some period of time [35]. This
can represent a computer crashing and subsequently recovering, or a com-
puter losing power and then being restarted [32]. Intermittent failures make
one or several agents of System behave erratically for some time and may
occur at any time, but are generally rare. This can represent a processor
temporarily having Byzantine behavior, or cause a communication service
to lose, duplicate, reorder, or modify messages in transit [32]. Incessant
failures behave like intermittent failures except that they may occur with
regularity rather than rarity [33].
Given that effectively all DCPS must maintain some notion of the current
state of the system with regard to time to be able to interact with the phys-
ical world, the real-time systems community has analyzed faults. When
implemented as real-time systems, there is a possibility for timing failures,
where a process misses some deadlines specified by worst-case execution
time (WCET) analysis [29]. Giotto [36] and its extensions allow for analysis
of programs to ensure that no timing failures (missing deadlines) can oc-
cur in the virtual machine these programs are executed on. Etherware [37]
utilizes a middleware layer and shows the ability of a distributed real-time
control system to maintain safety and liveness in spite of communications
link failures.
The Simplex architecture for supervisory control allows for the automatic
mitigation of certain faults by concurrently executing several controllers,
one of which is thoroughly tested, and then choosing the control output
from the safe controllers if the other controllers issue commands that would
take the system to an unsafe set of states [38]. While this slows progress,
it guarantees a notion of safety, and eventually upon returning far enough
within a good set of states (far from the bad states), a faster response can
be utilized. In some systems, a degradation of a safety property, such as
moving from very safe states to less safe states, could potentially be used
to detect faults—this is similar to how the Simplex architecture switches
between controllers, and this idea is employed in the failure detection in
the distributed flocking problem in Chapter 5. More recent work on this
utilizing a field-programmable gate array (FPGA) based safety controller
in the system Simplex architecture allows the avoidance of even further
faults that may have occurred due to the operating system [39].
ence of certain types of failures, and also establish lower bounds about
impossibility of solving those problems with certain resource constraints.
What requirements a failure detector must satisfy to be able to solve a
problem is theoretically interesting and frequently studied [40]. Specifi-
cally, several classes of failure detectors have been defined according to
the nature and the quality of the information that they provide [20]. Al-
gorithms for implementing these failure detectors have been incorporated
in practical fault-tolerant systems [41, 42]. On the theoretical side, fail-
ure detectors of different quality are used to characterize the hardness of
different distributed computing problems [43], and more directly, failure
detectors of certain quality are used to solve other problems, such as dis-
tributed consensus. There exist failure detectors for classes of transient
failures [35].
The general model is that the failure detector is acting as an oracle or
outside service and suspects agents to have failed. Implementation can
be done in several ways, such as agents occasionally sending an alive
message to the failure detector, which then removes that agent from the
list of suspects if it was there, or otherwise resets a timeout; such a method
is a push. Alternatively, in a pull method, the failure detector occasionally
queries agents to check whether they are still operating. One desired
property is completeness of detecting failures,
which means that if an agent has failed, then eventually it is suspected by
the failure detector. A competing metric is accuracy, in that, if an agent is
suspected of having failed, then it has in fact failed.
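A push-style detector of the kind described above can be sketched with a few lines of Python; the class name, timeout policy, and agent labels are invented. With discrete time, completeness and accuracy can be observed directly: a silent agent is eventually suspected, while an agent that keeps sending alive messages is not (choosing the timeout too small would trade accuracy for faster detection):

```python
# Hypothetical push-style failure detector: agents periodically send "alive"
# messages; an agent whose timeout has expired is suspected.
class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def alive(self, agent, now):
        """Push: receiving an alive message resets the agent's timeout."""
        self.last_seen[agent] = now

    def suspects(self, now):
        """Agents whose alive messages stopped arriving are suspected."""
        return {a for a, t in self.last_seen.items() if now - t > self.timeout}

fd = HeartbeatDetector(timeout=3)
fd.alive("a", now=0)
fd.alive("b", now=0)
fd.alive("a", now=4)                   # only "a" keeps reporting

assert fd.suspects(now=5) == {"b"}     # completeness: silent "b" is suspected
assert "a" not in fd.suspects(now=5)   # accuracy: live "a" is not suspected
```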
Similar to the notion of failure detectors is the fault diagnosis (or fault
detection and identification) problem from controls, which is composed of
three steps.
Real-time systems often utilize failure detectors through watchdog timers.
If a response is not received from one processor by another, a flag is raised
that the processor may have reached an illegal state, and the other processor
may have an ability to reset it [29].
The control-theoretic literature deals with detecting faults in the context
of a given plant dynamics. Typically faults are modeled as additive or
multiplicative dynamics that cause perturbations in the evolution of the
plant [44], and failure detectors rely on techniques such as signature gener-
ation, residual generation, observer designs [23], and statistical testing [21].
For instance, it is shown in Chapter 5 that it is possible to model actua-
tor stuck-at failures as additive dynamics for a switched system. First,
fault detection results in a binary decision of whether something is wrong
in the system. Second, fault isolation locates which component is faulty.
Third, fault identification determines the magnitude of the fault and/or the
time the fault occurred. Fault detection and isolation together are called
fault diagnosis [44]. Practical implementations usually only rely on fault
detection and isolation, and are together called fault detection and isolation
(FDI). Other notions of failure detection in the controls community can be
applied through observers [23], or more frequently in a more probabilistic
way, such as using Kalman filters to diagnose faults [21].
Diagnosis techniques have also been specifically developed for discrete
event dynamical systems (DEDS) [45, 46]. These methods include central-
ized detection approaches as well as distributed ones [24, 47]. Here faults
can be modeled as uncontrollable transitions, specifically that the transi-
tions are caused by some exogenous actor and not the system [48]. Likewise
faults can be modeled as unobservable transitions, and the occurrence of
the transition must be deduced [45, 49].
Safe diagnosability [50] implies that for some systems, mitigation must
occur before some bounded time, as otherwise the system can reach states
that violate safety. Safe diagnosability applies to the flocking case study
in Chapter 5 because if failures are not detected and corrected quickly,
the system may reach states which violate safety and progress. These
techniques are applicable to dynamical systems without any notions of
communication.
3.2.2 Self-Stabilization
The concept of self-stabilization was introduced by Dijkstra [51]. Self-
stabilizing algorithms are those that from an arbitrary starting state even-
tually converge to a legal state and remain in the set of legal states [4];
see Figure 1.2. The two necessary properties of self-stabilizing algorithms
are closure and convergence [18]. From any state (legal or not) the system
must converge in a finite number of steps to a legal state. The set of legal
states must then be closed, in that only failures may take the system to a
set of illegal states. The design of self-stabilizing failure detectors has been
investigated [52].
As defined above, self-stabilizing algorithms implement a form of non-
masking fault tolerance: the fault may be observable, since the system
is no longer in a legal state, but the system automatically returns, in a
finite number of steps, to the set of legal states. Such protocols rely
on the assumption that the programs do not fail, and that only state and
data may become corrupted due to failures. It should also be noted that
due to the closure property, a composition of self-stabilizing algorithms
can be utilized to solve a complex task. For instance, if from an arbitrary
state an algorithm A takes the system in TA steps to a legal state xLA, then
some algorithm B can operate that takes the system in TB steps to another
legal state xLB, and so on.
3.2.3 Stabilizers
The use of a stabilizer provides a general method to convert a fault-
intolerant algorithm to a fault-tolerant one through composition of other
algorithms. One mechanism monitors system consistency—such as com-
bining a self-stabilizing distributed snapshot algorithm [53] with a self-
stabilizing failure detector [3]. The other mechanism repairs the system to a
consistent state upon inconsistency being detected—such as self-stabilizing
distributed reset [54].
The first stabilizer collected distributed global snapshots [53] of the com-
posite system and checked whether the snapshots were legal, where the
distributed snapshot did not interfere with the activity of the algorithm,
so the composed algorithm trivially satisfied closure [55]. Thus, such
stabilizers rely on utilizing a composition or hierarchy of self-stabilizing
algorithms. The detectors and correctors of [56] are analogous to stabilizers
and also to the detection and mitigation of Chapter 5. The paradigm is that a
fault-tolerant system is constructed out of a fault-intolerant system and a set
of components for fault-tolerance (detectors and correctors).
Rather than relying on predicates on global system state to detect incon-
sistency, it is possible to detect inconsistent global state by checking if local
state is inconsistent. This is the idea of local detection [57, 58]: if a global property is
violated (such as the global system not being in a legal state), then some
local property must also be violated. Local checking and correction were in-
troduced in the design of a self-stabilizing communications protocol with
a self-stabilizing network reset [59] where global inconsistency is detected
by analyzing local state. Local detection and checking are analogous to the
detection method used in Chapter 5 and local correction is analogous to the
mitigation method. The local stabilizer of [60] takes a distributed algorithm
and transforms it into a self-stabilizing synchronous algorithm which tol-
erates transient faults through local detection in O(1) time and local repair
of the inconsistent system state, resulting in an algorithm which tolerates
f faults in O( f ) time.
Similar to the notion of a stabilizer in distributed systems and the case
study in Chapter 5 is the control theoretic paper [61], where a motion probe,
or a specific control applied for some time, is used to detect failures of
individual agents solving a consensus problem. Upon detection of failures
through the use of motion probes, the non-faulty agents stop utilizing the
values of faulty agents to ensure progress.
CHAPTER 4
4.1 Introduction
This chapter is based upon previous work [62].
Highway and air traffic flows are nonlinear dynamical systems that give
rise to complex phenomena such as abrupt phase transitions from fast
to sluggish flow [63–65]. The ability to monitor, predict, and avoid such
phenomena can have a significant impact on the reliability and the capacity
of traffic networks. Traditional traffic protocols, such as those implemented
for air-traffic control are centralized [66]—a coordinator periodically collects
information from the vehicles, decides and disseminates the waypoints,
and subsequently the vehicles try to blindly follow a path to the waypoint.
The advent of wireless vehicular networks [67] presents a new opportunity
for distributed traffic monitoring [68] and control. Distributed protocols
should scale and be less vulnerable to failures compared to their centralized
counterparts. In this case study, such a distributed traffic control protocol
is presented, as is an analysis of its behavior.
A traffic control protocol is a set of rules that determines the routing and
movement of certain physical entities, such as cars and packages, over an
underlying graph, such as a road network, air-traffic network, or a ware-
house conveyor system. Any traffic control protocol should guarantee:
(a) (safety) that the entities maintain some minimum physical separation,
and (b) (progress) that the entities arrive at a given destination (or target)
vertex. In a distributed traffic control protocol each entity determines its
own next-waypoint, or each vertex in the underlying graph determines the
next-waypoints for the entities in an appropriately defined neighborhood.
The idea of distributed traffic control has been around for some time but
most of the work has focused on human-factors issues [69, 70], collision
avoidance [71–75], and platooning [76–78]. A notable exception is [79],
which presents a distributed algorithm (executed by entities, vehicles in
this case) for controlling a highway intersection without any stop signs.
The distributed traffic control problem is studied in a partitioned plane
where the motions of entities within a partition are coupled. The problem
can be described as follows (refer to Figure 4.1). The geographical space of
interest is partitioned into regions or cells. There is a designated target cell
which consumes entities and some source cells that produce new entities.
The entities within a cell are coupled, in the sense that they all either
move identically or they remain static (the motivation for this is discussed
below). If a cell moves such that some entities within it touch the boundary
of a neighboring cell, those entities are transferred to the neighboring cell. Thus,
the role of the distributed traffic control protocol is to control the motion
of the cells so that the entities (a) always have the required separation, and
(b) they reach the target, when feasible.
The coupling mentioned above which requires entities within a cell to
move identically may appear surprising at first sight. After all, under
low traffic conditions, individual drivers control the movement of their
cars within a particular region of the highway, somewhat independently
of the other drivers in that region. However, on highways under high-
traffic, high-velocity conditions, it is known that coupling may emerge
spontaneously, whereby the vehicles form a fixed lattice structure and
move with zero relative speed [64, 80]. In other scenarios coupling arises
because passive entities are moved around by active cells, for example,
packages being routed on a grid of multi-directional conveyors [81], and
molecules moving on a medium according to some controlled chemical
gradient. Finally, even where the entities are active and cells are not,
the entities can cooperate to emulate a virtual active cell expressly for
the purposes of distributed coordination. This idea has been explored for
mobile robot coordination in [82] using a cooperation strategy called virtual
stationary automata [83, 84].
The distributed traffic control protocol guarantees safety at all times, even
when some cells fail permanently by crashing. The protocol also guar-
antees eventual progress of entities towards the target, provided that there
exists a path through non-faulty cells to the target. Specifically, the protocol
is self-stabilizing [4], in that after failures stop occurring, the composed sys-
Figure 4.1: Example System with 4 × 4 unit-length square cells, where
tid = 2, 2 (in very light gray), SID = { 1, 0 } (in light gray), and
failed2,1 = true (in black). The gray arrows represent next variables. The
smaller squares are entities, with the safety region specified by rs represented
by the gray border and the length region specified by l represented by the
white interior.
tem automatically returns to a state from which progress can be made. The
algorithm relies on two mechanisms: (a) a rule to maintain local routing
tables at each non-faulty cell, and (b) a (more interesting) rule for signaling
among neighbors which guarantees safety while preventing deadlocks.
Roughly speaking, the signaling mechanism at some cell fairly chooses
among its neighboring cells which contain entities, indicating if it is safe
for one of these cells to apply a movement in the direction of the signal-
ing cell. This permission-to-move policy turns out to be necessary, because
movement of neighboring cells may otherwise result in a violation of safety
in the signaling cell, if entity transfers occur.
These safety and progress properties are established through systematic
assertional reasoning. These proofs may serve as a template for the analysis
of other distributed traffic control protocols and also can be mechanized
using automated theorem proving tools, for example [85].
The throughput analysis of this algorithm, and in fact any distributed
traffic control algorithm, remains a challenge. Simulation results are pre-
sented that illustrate the influence (or the lack thereof) of several factors
on throughput:
(a) entity velocity v,
(b) the minimum safety spacing rs ,
(c) the number of turns along the path to the target, and
(d) failure-recovery rates, under a model where crash failures are not per-
manent and cells may recover from crashing.
When an entity touches an edge of a cell, it is instantaneously transferred
to the neighboring cell.
The software of a cell implements the distributed traffic control protocol.
At each round, every cell exchanges messages bearing state information
with their neighbors. Based on this, the cells update their software state
and decide their (possibly zero) velocities. Until the beginning of the next
round, the cells continue to operate according to this velocity—this may
lead to entity transfers.
Recall from Chapter 2 the modeling assumptions that messages are de-
livered within bounded time and computations are instantaneous. Under
these assumptions, the system can be modeled as a SSDCPS. Further as-
sume, for simplicity of presentation only, that all the entities have the same
size, and if moving, any cell does so with the same constant velocity.
Now follows the SSDCPS model.
(iii) v: cell velocity, or distance by which an entity may move over one
round.
It is required that
(ii) rs + l < 1.
The former is required to ensure cells do not violate the gap requirement
from one round to the next when new entities enter a cell. The latter is
required so that entities cover at most the same area of the Euclidean plane
as the cell in which they are contained, since cells are squares of unit length.
Define the total center spacing requirement as d ≜ rs + l.
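The parameter constraints can be captured as a small sanity check; the numeric values below are hypothetical, chosen only to satisfy the stated requirements.

```python
# Hypothetical parameter values; only the constraints come from the text.
rs = 0.05   # minimum safety spacing between entity edges
l = 0.25    # entity side length
v = 0.2     # distance a cell's entities move in one round

assert rs + l < 1          # entities cover at most the area of a unit cell
d = rs + l                 # total center spacing requirement
assert v < l               # assumed later in the safety argument (Lemma 4.4)
```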
(v) tokeni,j : a token used for mutual exclusion to indicate which neighbor
may move,
When clear from context, the subscripts in the names of the variables are
dropped. A state of Celli, j refers to a valuation of all these variables, that
is, a function that maps each variable to a value of the corresponding type.
The complete system is an automaton, called System as in Chapter 2,
consisting of the ensemble of all the cells. A state of System is a valuation
of all the variables for all the cells.

variables
    Membersi,j : Set[P] := {}
    NEPrevi,j : Set[ID⊥ ] := {}
    nexti,j , signali,j , tokeni,j : ID⊥ := ⊥
    disti,j : N∞ := ∞
    failedi,j : B := false

transitions
    faili,j
        eff failedi,j := true; disti,j := ∞; nexti,j := ⊥

    updatei,j
        eff Route; Signal; Move

Figure 4.2: Variables and transitions of Celli,j .

Recall from Chapter 2 that states of
System are referred to by bold letters x, x′, etc.
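The per-cell state just listed can be transcribed as a record; this Python rendering is an illustrative assumption (the thesis models cells as automata, not objects), with ⊥ modeled as None.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

ID = Tuple[int, int]             # a cell identifier (i, j)
Entity = Tuple[float, float]     # an entity's center (px, py)
INF = float("inf")

@dataclass
class Cell:
    """State variables of Cell_{i,j}; ⊥ is modeled as None."""
    Members: Set[Entity] = field(default_factory=set)
    NEPrev: Set[ID] = field(default_factory=set)
    next: Optional[ID] = None
    signal: Optional[ID] = None
    token: Optional[ID] = None
    dist: float = INF
    failed: bool = False

    def fail(self) -> None:
        """The fail_{i,j} transition: crash this cell."""
        self.failed = True
        self.dist = INF
        self.next = None
```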
Variables tokeni,j , failedi,j , and NEPrevi, j are private to Celli, j , while disti, j ,
nexti, j , and signali, j can be read by neighboring cells of Celli, j , and Membersi,j
can be both read from and written to by neighboring cells of Celli, j . See
Figure 4.3. Recall from Chapter 2 that this has the following interpretation
for an actual message-passing implementation. At the beginning of each
round, Celli, j broadcasts messages containing the values of these variables
and receives similar values from its neighbors. Then, the computation of
this round updates the local variables for each cell based on the values
collected from its neighbors. Variable Membersi, j is a special variable, in
that it can also be written to by the neighbors of Celli, j . This is how
transferal of entities among cells is modeled. Consider an entity p ∈
x.Membersi,j for a state x and i, j ∈ ID, and let p′ denote the same entity
(p′ = p) in the next state, so that p′ ∈ x′.Membersm,n where x →a x′ for some
a ∈ A and m, n ∈ ID. If a transfer does not occur, then m, n = i, j ; if a
transfer occurs, then m, n ∈ Nbrsi,j .
System has two types of state transitions: fails and updates. A faili,j
transition models the crash failure of the i, j th cell and sets failedi,j to true,
disti,j to ∞, and nexti,j to ⊥. A cell i, j is called failed if failedi,j is true;
otherwise it is called non-faulty. The sets of identifiers of all failed and non-
faulty cells at a state x are denoted by F(x) and NF(x), respectively. A failed
cell does nothing; it never moves and it never communicates.¹

¹ disti,j = ∞ can be interpreted as its neighbors not receiving a timely response from
i, j .
Figure 4.3: The shared and private variables of Celli,j and a neighboring
cell Cellm,n : Members, dist, next, signal, token, NEPrev, and failed.
routing table for each cell that relies only on neighbors’ estimates of dis-
tance to the target. Recall that failed cells have dist set to ∞. From a state x,
for each i, j ∈ NF(x), the variable disti, j is updated as 1 plus the minimum
value of dist among the neighbors of i, j . If this results in disti, j being
infinity, then nexti,j is set to ⊥; otherwise, it is set to the identifier of a
neighbor with the minimum dist, with ties broken by neighbor identifiers.
if ¬failedi,j ∧ ⟨i, j⟩ ≠ tid then                                          1
    disti,j := ( min⟨m,n⟩∈Nbrsi,j distm,n ) + 1                             2
    if disti,j = ∞ then nexti,j := ⊥                                       3
    else nexti,j := argmin⟨m,n⟩∈Nbrsi,j ⟨distm,n , ⟨m, n⟩⟩                 4

Figure 4.4: The Route function for Celli,j .
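The Route rule can be sketched in Python; the dictionary-of-cells representation, the field names, and the sequential update order are illustrative assumptions, not the thesis's automaton formalism.

```python
INF = float("inf")

def route(cells, ij, tid):
    """One Route step for cell ij: dist := 1 + min of neighbors' dist;
    next := a neighbor attaining the minimum, ties broken by identifier."""
    c = cells[ij]
    if c["failed"] or ij == tid:
        return
    i, j = ij
    nbrs = [mn for mn in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
            if mn in cells]
    c["dist"] = min(cells[mn]["dist"] for mn in nbrs) + 1
    if c["dist"] == INF:
        c["next"] = None                       # ⊥
    else:
        # argmin over (dist, identifier): the identifier breaks ties
        c["next"] = min(nbrs, key=lambda mn: (cells[mn]["dist"], mn))

# A 2 x 2 grid with the target at (0, 0): repeated rounds make dist
# converge to the hop distance to the target.
cells = {ij: {"dist": INF, "next": None, "failed": False}
         for ij in ((0, 0), (0, 1), (1, 0), (1, 1))}
tid = (0, 0)
cells[tid]["dist"] = 0
for _ in range(3):
    for ij in sorted(cells):
        route(cells, ij, tid)
```

After convergence, following the next pointers from any cell with finite dist traces a shortest path of non-faulty cells to the target.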
The Signal function in Figure 4.5 executes after Route and is the key part
of the protocol for both maintaining safe entity separations and ensuring
progress of entities to the target. Roughly, each cell implements this by
following two policies: (a) accept new entities from a neighbor only when
this is safe, and (b) provide opportunities infinitely often for each nonempty
neighbor to make progress. First i, j sets NEPrevi, j to be the subset of
Nbrsi, j for which next has been set to i, j and Members is nonempty. If
tokeni, j is ⊥, then it is set to some arbitrary value in NEPrevi, j ; it continues to
be ⊥ if NEPrevi, j is empty. Otherwise, tokeni, j = m, n , which is a neighbor
of i, j with nonempty Members. It is checked if there is a gap of length
d on Celli, j in the direction of m, n . This is accomplished through the
conditional in Lines 4–7 as a step in guaranteeing fairness. If there is not
enough gap, then signali, j is set to ⊥, which blocks m, n from moving its
entities in the direction of i, j , thus preventing entity transfers. On the
other hand, if there is sufficient gap, then signali, j is set to tokeni, j which
enables m, n to move its entities towards i, j . Finally, tokeni,j is updated
to a value in NEPrevi,j that is different from its previous value, if that is
possible according to the rules just described (Lines 10–12).
Finally, the Move function in Figure 4.6 models the physical movement
of entities over a given round. For cell i, j , let m, n be nexti,j . The entities
in Membersi, j move in the direction of m, n if and only if signalm,n is set to
i, j . In that case, all the entities in Membersi, j are shifted in the direction of
cell m, n . This may lead to some entities crossing the boundary of Celli,j
if ¬failedi,j then                                                                1
    NEPrevi,j := {⟨m, n⟩ ∈ Nbrsi,j : nextm,n = ⟨i, j⟩ ∧ Membersm,n ≠ ∅}            2
    if tokeni,j = ⊥ then tokeni,j := choose from NEPrevi,j                         3
    if ((tokeni,j = ⟨i + 1, j⟩ ∧ ∀p ∈ Membersi,j : px + l/2 ≤ i + 1 − d)           4
        ∨ (tokeni,j = ⟨i − 1, j⟩ ∧ ∀p ∈ Membersi,j : px − l/2 ≥ i + d)             5
        ∨ (tokeni,j = ⟨i, j + 1⟩ ∧ ∀p ∈ Membersi,j : py + l/2 ≤ j + 1 − d)         6
        ∨ (tokeni,j = ⟨i, j − 1⟩ ∧ ∀p ∈ Membersi,j : py − l/2 ≥ j + d))            7
    then                                                                           8
        signali,j := tokeni,j                                                      9
        if |NEPrevi,j | > 1 then                                                  10
            tokeni,j := choose from NEPrevi,j \ {tokeni,j }                       11
        elseif |NEPrevi,j | = 1 then tokeni,j := choose from NEPrevi,j            12
        else tokeni,j := ⊥                                                        13
    else signali,j := ⊥; tokeni,j := tokeni,j                                     14

Figure 4.5: The Signal function for Celli,j .
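The gap test of Lines 4–7 can be sketched as a standalone predicate; representing entities by their center coordinates and a cell by its lower-left integer corner is an illustrative assumption.

```python
def gap_toward(members, i, j, token, l, d):
    """Is there a gap of length d on cell (i, j), whose area is
    [i, i+1] x [j, j+1], at the boundary facing neighbor `token`?
    Mirrors the disjunction of Figure 4.5, Lines 4-7."""
    m, n = token
    if m == i + 1:
        return all(px + l / 2 <= i + 1 - d for px, py in members)
    if m == i - 1:
        return all(px - l / 2 >= i + d for px, py in members)
    if n == j + 1:
        return all(py + l / 2 <= j + 1 - d for px, py in members)
    if n == j - 1:
        return all(py - l / 2 >= j + d for px, py in members)
    return False
```

An empty cell grants the gap toward any side, while an entity parked near one boundary blocks only the neighbor on that side.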
into Cellm,n , in which case, such entities are removed from Membersi, j . If
m, n is not the target, then the removed entities are added to Membersm,n .
In this case (Lines 13–20), the transferred entities are placed at the edge of
Cellm,n . However, if m, n is the target, then the removed entities are not
added to any cell and thus no longer exist in System.
if ¬failedi,j ∧ signalnexti,j = ⟨i, j⟩ then                                        1
    let ⟨m, n⟩ = nexti,j                                                           2
    for each p ∈ Membersi,j                                                        3
        px := px + v(m − i)                                                        4
        py := py + v(n − j)                                                        5
                                                                                   6
        if (m = i + 1 ∧ px + l/2 > i + 1) ∨ (m = i − 1 ∧ px − l/2 < i)             7
            ∨ (n = j + 1 ∧ py + l/2 > j + 1) ∨ (n = j − 1 ∧ py − l/2 < j)          8
        then                                                                       9
            Membersi,j := Membersi,j \ {p}                                        10
            if ⟨m, n⟩ ≠ tid                                                       11
            then Membersm,n := Membersm,n ∪ {p}                                   12
                if m = i + 1 ∧ px + l/2 > i + 1                                   13
                then px := m + l/2                                                14
                elseif m = i − 1 ∧ px − l/2 < i                                   15
                then px := m − l/2                                                16
                elseif n = j + 1 ∧ py + l/2 > j + 1                               17
                then py := n + l/2                                                18
                elseif n = j − 1 ∧ py − l/2 < j                                   19
                then py := n − l/2                                                20

Figure 4.6: The Move function for Celli,j .
The source cells i, j ∈ SID, in addition to the above, add at most one
entity in each round to Membersi, j such that the addition of an entity does
not violate the minimum gap between entities at Celli,j .
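The Move step can be sketched as a pure function on one cell's entity set; returning the transferred entities, rather than writing a neighbor's Members directly, is an illustrative simplification of the shared-variable model.

```python
def move(members, i, j, m, n, v, l):
    """One Move step (Figure 4.6) for cell (i, j) with next = (m, n),
    assuming the signal at (m, n) names (i, j): shift every entity by v
    toward (m, n); entities whose edge crosses the boundary are snapped
    to the near edge of cell (m, n) and returned separately."""
    staying, transferred = set(), set()
    for px, py in members:
        px, py = px + v * (m - i), py + v * (n - j)
        crossed = ((m == i + 1 and px + l / 2 > i + 1) or
                   (m == i - 1 and px - l / 2 < i) or
                   (n == j + 1 and py + l / 2 > j + 1) or
                   (n == j - 1 and py - l / 2 < j))
        if not crossed:
            staying.add((px, py))
        else:
            # snap the transferred entity to the edge of cell (m, n)
            if m == i + 1:
                px = m + l / 2
            elif m == i - 1:
                px = m - l / 2
            elif n == j + 1:
                py = n + l / 2
            elif n == j - 1:
                py = n - l / 2
            transferred.add((px, py))
    return staying, transferred
```

The caller would add the transferred set to Members of cell (m, n), unless (m, n) is the target, in which case those entities are consumed.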
4.3 Analysis
In this section we present an analysis of System with regard to safety and
progress properties. Roughly, the safety property is an invariant that for all
reachable states there is a minimum gap between entities, and the progress
property requires that all entities which reside on cells with feasible paths
to the target, eventually reach the target.
See Figure 4.7 for a graphical outline of the properties.
Figure 4.7: Graphical outline of the properties: Safe, Stable Routes, and
Progress, reached from the initial states Q0 .
4.3.1 Safety Analysis
A state is safe if for every cell, the distance between the centers of any two
entities along either coordinate is at least d. Thus, in a safe state, the edges
of all entities in a cell are separated by a distance of rs . However, the entities
in two adjacent cells may have edges spaced apart by less, although their
centers will be spaced by at least l.
For any state x of System, define:
Safei,j (x) ≜ ∀p, q ∈ x.Membersi,j , p ≠ q, (|px − qx | ≥ d) ∨ (|py − qy | ≥ d), and
Safe(x) ≜ ∀ ⟨i, j⟩ ∈ ID, Safei,j (x).
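The two predicates translate directly into code; representing entities as center-coordinate pairs and the global state as a map from identifiers to entity sets is an illustrative assumption.

```python
def safe_cell(members, d):
    """Safe_{i,j}: any two distinct entities in one cell have centers
    at least d apart along the x or the y coordinate."""
    ms = list(members)
    return all(abs(p[0] - q[0]) >= d or abs(p[1] - q[1]) >= d
               for a, p in enumerate(ms) for q in ms[a + 1:])

def safe(members_by_cell, d):
    """Safe(x): Safe_{i,j} holds for every cell."""
    return all(safe_cell(ms, d) for ms in members_by_cell.values())
```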
The safety property is that Safe(x) is an invariant and thus satisfied for all
reachable states. We proceed by proving some preliminary properties of
System which will be used for establishing the desired safety property. The
following invariant asserts that no entities exist between the boundaries of
cells. This is a consequence of transferring entities upon an entity’s edge
touching an edge of a cell, and then resetting the entity’s position to be
within the new cell.
Invariant 4.1. In any reachable state x, ∀ ⟨i, j⟩ ∈ ID, ∀p ∈ x.Membersi,j ,
i + l/2 ≤ px ≤ i + 1 − l/2, and
j + l/2 ≤ py ≤ j + 1 − l/2.
The next invariant states that cells’ Members are disjoint. This is imme-
diate from the Move function since entities are only added to one cell’s
Members upon being removed from a different cell’s Members.
Invariant 4.2. In any reachable state x, for any distinct i, j , m, n ∈ ID,
x.Membersi, j ∩ x.Membersm,n = ∅.
H(x) ≜ ∀ ⟨i, j⟩ ∈ ID, ∀ ⟨m, n⟩ ∈ Nbrsi,j ,
if x.signali,j = ⟨m, n⟩ then exactly one of the following holds:
m = i + 1 ∧ ∀p ∈ x.Membersi,j , px + l/2 ≤ i + 1 − d, or
m = i − 1 ∧ ∀p ∈ x.Membersi,j , px − l/2 ≥ i + d, or
n = j + 1 ∧ ∀p ∈ x.Membersi,j , py + l/2 ≤ j + 1 − d, or
n = j − 1 ∧ ∀p ∈ x.Membersi,j , py − l/2 ≥ j + d.
H(x) is not an invariant property because once entities move the property
may be violated. However, for proving safety all that needs to be estab-
lished is that at the point of computation of the signal variable this property
holds. The next key lemma states this.
Lemma 4.3. For all reachable states x, H(x) ⇒ H(xS ), where xS is the state
obtained by applying the Route and Signal functions to x.
Proof : Fix a reachable state x, an i, j ∈ ID, and an m, n ∈ Nbrsi,j such
that x.signali,j = m, n . Let xR be the state obtained by applying the Route
function of Figure 4.4 to x and xS be the state obtained by applying the
Signal function of Figure 4.5 to xR .
Without loss of generality, assume m, n = i − 1, j , so if x.signali,j =
i − 1, j , then ∀p ∈ x.Membersi,j , px − l/2 ≥ i + d. First, observe that H(xR ).
This is because the Route function does not change any of the variables
involved in the definition of H(·). Next, we show that H(xR ) implies H(xS ).
There are two possible cases. First, if xS .signali,j ≠ m, n , then the statement
holds vacuously. Second, when xS .signali,j = i − 1, j , the second condition
in H(xR ) and Figure 4.5, Line 5 implies H(xS ). The cases where m, n takes
the other values in Nbrsi,j follow by symmetry.
The following lemma asserts that if there is a cycle of length two formed
by the signal variables, then entity transfers cannot occur between the
involved cells in that round.
Lemma 4.4. Let x be any reachable state and x′ be a state that is reached from x after
a single update transition (round). If x.signali,j = m, n and x.signalm,n = i, j ,
then x.Membersi,j = x′.Membersi,j and x.Membersm,n = x′.Membersm,n .
Proof : No entities enter either x′.Membersi,j or x′.Membersm,n from any other
a, b ∈ Nbrsi,j or c, d ∈ Nbrsm,n , since x.signali,j = m, n and x.signalm,n =
i, j . Assume without loss of generality that m, n = i − 1, j . It remains
to be established that there is no p ∈ x.Membersi−1,j such that p′ ∈ x′.Membersi,j
where p′ = p, or vice-versa. For the transfer to occur, px must be such that
p′x + l/2 = px + v + l/2 > i by Figure 4.6, Line 4. But for x.signali−1,j = i, j to be
satisfied, it must have been the case that px + l/2 ≤ i − d by Figure 4.5,
Lines 4–7, and since v < l < d, a contradiction is reached. The vice-versa
direction is symmetric.
Theorem 4.5. Safe is an invariant of System; that is, for every reachable state x,
Safe(x) holds.
Proof : The proof is by standard induction over the length of any execution
of System. The base case is satisfied by the initialization assumption. For
the inductive step, consider reachable states x, x′ and an action a ∈ A such
that x →a x′. Fix i, j ∈ ID and, assuming Safei,j (x), show that Safei,j (x′).
If a = faili,j , then clearly Safe(x′), as none of the entities move.
For a = update, there are two cases to consider. First, x′.Membersi,j ⊆
x.Membersi,j . There are two sub-cases: if x′.Membersi,j = x.Membersi,j , then
all entities in x.Membersi,j move identically and the spacing between two dis-
tinct entities p′, q′ ∈ x′.Membersi,j is unchanged. That is, ∀p, q ∈ x.Membersi,j ,
∀p′, q′ ∈ x′.Membersi,j such that p′ = p and q′ = q and where p ≠ q, |p′x − q′x | =
|px + vc − (qx + vc)|, where c is a constant. It follows that |p′x − q′x | ≥ d. By
similar reasoning it follows that |p′y − q′y | is also at least d. The second sub-
case arises if x′.Membersi,j ⊊ x.Membersi,j , in which case Safei,j (x′) is either
vacuously satisfied or satisfied by the same argument as above.
The second case is when x′.Membersi,j ⊈ x.Membersi,j , that is, there exists
some entity p′ ∈ x′.Membersi,j that was not in x.Membersi,j . There are two
sub-cases. The first sub-case is when p′ was added to x′.Membersi,j because
i, j ∈ SID. In this case, the specification of the source cells states that
the entity p′ was added to x′.Membersi,j without violating Safei,j (x′), and the
proof is complete. Otherwise, p′ was added to x′.Membersi,j by a neighbor,
so p ∈ x.Membersi′,j′ for some i′, j′ ∈ Nbrsi,j . Without loss of generality,
assume that i′ = i − 1 and j′ = j. That is, p was transferred to Celli,j from
its left neighbor. From Line 14 of Figure 4.6 it follows that p′x = i + l/2.
The fact that p transferred from Celli′,j′ in x to Celli,j in x′ implies that
x.nexti′,j′ = i, j and x.signali,j = i′, j′ ; these are necessary conditions for
the transfer. Thus, applying at state x the second inequality from H(x)
and Lemma 4.3, it follows that for every q ∈ x.Membersi,j , qx ≥ i + d + l/2.
It must be established that if p′ is transferred to x′.Membersi,j , then every
q′ ∈ x′.Membersi,j , where q′ ≠ p′, satisfies q′x ≥ i + d + l/2, which means that
q did not move towards p. This follows by application of Lemma 4.4,
which states that if entities on adjacent cells move towards one another
simultaneously, then a transfer of entities cannot occur. This implies that
all entities q′ in x′.Membersi,j have edges farther than rs from the edges of any
such entity p′, implying Safei,j (x′), since p′x = i + l/2 and q′x ≥ i + d + l/2, so
q′x − p′x ≥ d. Finally, since i, j was chosen arbitrarily, Safe(x′).
TC(x) ≜ { ⟨i, j⟩ : ρ(x, ⟨i, j⟩) < ∞}
as the set of cell identifiers that are connected to the target through non-
faulty cells.
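TC(x) can be computed by a breadth-first search from the target over non-faulty cells; the N × N grid bounds and the set-of-failed-identifiers input are illustrative assumptions.

```python
from collections import deque

def target_connected(failed, N, tid):
    """Identifiers with a path of non-faulty cells to the target tid on
    an N x N grid; the rho map realizes the hop distance ρ(x, (i, j))."""
    rho = {tid: 0}
    frontier = deque([tid])
    while frontier:
        i, j = frontier.popleft()
        for mn in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            m, n = mn
            if (0 <= m < N and 0 <= n < N
                    and mn not in rho and mn not in failed):
                rho[mn] = rho[(i, j)] + 1
                frontier.append(mn)
    return set(rho)
```

For example, a single failed interior cell does not disconnect any other cell, whereas failing both neighbors of a corner target isolates it.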
The analysis relies on the following assumptions on the environment of
System which controls the occurrence of fail transitions and the insertion
of entities by the source.
(a) The target cell does not fail.
(b) Source cells s, t ∈ SID place entities in Memberss,t without blocking
any of their nonempty non-faulty neighbors perpetually. That is, for
any execution α of System, if there exists an i, j ∈ Nbrss,t , such that
for every state x in α after a certain round, i, j ∈ x.NEPrevs,t , then
eventually signals,t becomes equal to i, j in some round of α.
disti, j = h, and
nexti, j = in , jn , where ρ(x, in , jn ) = h − 1.
Now consider i, j such that ρ(y, i, j ) = ρ(y′, i, j ) = h + 1. In order to
show that S is closed, we have to assume that y.disti,j = h + 1 and y.nexti,j =
m, n , and show that the same holds for y′. Since ρ(y′, i, j ) = h + 1, i, j
does not have a neighbor with path distance smaller than h. The required
result follows from applying the inductive hypothesis to m, n and from
Lines 2 and 4 of Figure 4.4.
Next, we have to show that starting from x, α enters S within h rounds.
Once again, this is established by inducting on h, which is ρ(x, i, j ). The
base case only includes the paths satisfying h = ρ(x, i, j ) = 1 and follows
by instantiating in , jn = tid. For the inductive case, assume that at round
h, disti′,j′ = h and nexti′,j′ = in , jn such that ρ(x, in , jn ) = h − 1 and in , jn is
the minimum identifier among all such cells. Observe that one such i′, j′ ∈
Nbrsi,j exists by the definition of TC. Then at round h + 1, by Lines 2 and 4 of
Figure 4.4, disti,j = disti′,j′ + 1 = h + 1.
The following corollary of Lemma 4.6 states that after new failures cease
occurring, all target connected cells get their next variables set correctly
within at most O(N²) rounds. It follows since the value of h in Lemma 4.6
for any target connected cell is at most N².
Corollary 4.7. Consider any execution of System with an arbitrary but finite se-
quence of fail transitions. Within O(N²) rounds of the last fail transition, every
target connected cell i, j in System has nexti,j fixed permanently to the identifier
of the next cell along such a path.
Let x f be the state at the round after the last failure, and α′ be the infinite
failure-free execution fragment x f , x f +1 , . . . of α starting from x f . Observe
that TC(x f ) = TC(x f +1 ) = · · · , so define TC to be TC(x f ).
Lemma 4.8. For any i, j ∈ TC, k > f , if xk .signalm,n = i, j and xk .nexti,j =
m, n , then for every p ∈ xk .Membersi,j with p′ = p, either
p′ ∈ xk+1 .Membersi,j and |p′x − m| < |px − m| or |p′y − n| < |py − n|, or
p′ ∈ xk+1 .Membersm,n and m ≤ p′x ≤ m + 1 or n ≤ p′y ≤ n + 1.
Proof : The first case is when no entity transfers from i, j to m, n in
the (k + 1)th round. In this case, the result follows since velocity is applied
towards m, n by Move in Figure 4.6, Lines 4–5. The second case is when
some entity p transfers from i, j to m, n , in which case p′x ∈ [m, m + 1] or
p′y ∈ [n, n + 1] by Figure 4.6, Lines 13–20.
Lemma 4.9. For any i, j ∈ TC \ {tid} and all k > f , if xk .Membersi,j ≠ ∅, then
∃k′ > k such that xk′ .signalnexti,j = i, j .
Proof : Since i, j ∈ TC, there exists h < ∞ such that for all k > f , ρ(xk , i, j ) = h.
We prove the lemma by induction on h. The base case is h = 1. Fix i, j
and instantiate k = f + 4. By Lemma 4.6, for all non-faulty i, j ∈ Nbrstid ,
xk .nexti,j = tid since k > f . For all k > f , if xk .Membersi,j ≠ ∅, then signaltid
changes to a different neighbor with entities every round. It is thus the
case that |xk .NEPrevtid | ≤ 4 and, since Memberstid = ∅ always, exactly one
of Figure 4.5, Lines 4–7 is satisfied in any round; thus, within 4 rounds,
signaltid = i, j .
For the inductive case, let ks = k + h be the step in α after which all
non-faulty a, b ∈ Nbrsi,j have xks .nexta,b = i, j by Lemma 4.6. Also by
Lemma 4.6, ∃ m, n ∈ Nbrsi,j such that xks .distm,n < xks .disti,j , implying that
after ks , |xks .NEPrevi,j | ≤ 3 since xks .nexti,j = m, n and xks .nextm,n ≠ i, j .
By the inductive hypothesis, xks .signalnexti,j = i, j infinitely often. If i, j ∈
SID, then entity initialization does not prevent xk .signali,j = a, b from
being satisfied infinitely often, by the second assumption introduced in
Subsection 4.3.2. It remains to be established that signali,j = a, b infinitely
often. Let a, b ∈ xks .NEPrevi,j where ρ(xks , a, b ) = h + 1.
If |xks .NEPrevi,j | = 1, then because the inductive hypothesis satisfies
signalnexti,j = i, j infinitely often, Lemma 4.8 applies infinitely often,
and thus Membersi,j = ∅ infinitely often, finally implying that signali,j = a, b
infinitely often.
If |xks .NEPrevi,j | > 1, there are two sub-cases. The first sub-case is when
no entity enters i, j from some c, d ≠ a, b in xks .NEPrevi,j , which follows
by the same reasoning used in the |xks .NEPrevi,j | = 1 case. The second
sub-case is when an entity enters i, j from c, d , in which case it must
be established that signali,j = a, b infinitely often. This follows since if
xk′ .tokeni,j = a, b where k′ > kt > ks and kt is the round at which an
entity entered i, j from c, d , and the appropriate case of Lemma 4.3 is
not satisfied, then xk′+1 .signali,j = ⊥ and xk′+1 .tokeni,j = a, b by Figure 4.5,
Line 14. This implies that no more entities enter i, j from any cell c, d
satisfying c, d ≠ a, b . Thus tokeni,j = a, b infinitely often, and the result
follows by the same reasoning as the |xks .NEPrevi,j | = 1 case.
4.4 Simulation
We have performed several simulation studies of the algorithm for eval-
uating its throughput performance. In this section, we discuss the main
findings with illustrative examples taken from the simulation results. Let
the K-round throughput of System be the total number of entities arriving
at the target over K rounds, divided by K. We define the average throughput
(henceforth throughput) as the limit of K-round throughput for large K.
All simulations start at a state where all cells are empty and subsequently
entities are added to the source cells.
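The K-round throughput definition translates directly into code; the per-round arrival log is a hypothetical output of such a simulation.

```python
def k_round_throughput(arrivals, K):
    """Entities absorbed at the target over the first K rounds, divided
    by K; arrivals[k] counts absorptions in round k."""
    return sum(arrivals[:K]) / K
```

The average throughput is then the limit of this quantity for large K.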
Figure 4.8: Throughput versus safety spacing rs for several values of v
(0.05, 0.1, 0.2, and 0.25), for K = 2500 and l = 0.25, for System with
8 × 8 cells.
of System. The parameters are set to l = 0.25, SID = { 1, 0 }, tid = 1, 7 ,
and K = 2500. The entities move along the path β ≜ 1, 0 , 1, 1 , 1, 2 ,
1, 3 , 1, 4 , 1, 5 , 1, 6 , 1, 7 with length 8. For the most part, the in-
verse relationship with v holds as expected: all other factors remaining
the same, a lower velocity makes each entity take longer to move away
from the boundary, which causes the predecessor cell to be blocked more
frequently, and thus fewer entities reach tid from any element of SID in the
same number of rounds. In cases with low velocity (for example v = 0.1)
and for very small rs , however, the throughput can actually be greater than
that at a slightly higher velocity. We conjecture that this somewhat sur-
prising effect appears because at very small safety spacing, the potential
for safety violation is higher with faster speeds, and therefore there are
many more blocked cells per round. We also observe that the throughput
saturates at a certain value of rs (≈ 0.55). This situation arises when there
is roughly only one entity in each cell.
Figure 4.9: Throughput versus number of turns along a path of length 8,
where K = 2500, rs = 0.05, and each of l and v is varied, with
(v, l) ∈ {(0.2, 0.2), (0.1, 0.2), (0.1, 0.1), (0.05, 0.1)}, for System with
8 × 8 cells.
4.5 Conclusion
This case study presented a self-stabilizing distributed traffic control pro-
tocol for the partitioned plane where each partition controls the motion
of all entities within that partition. The algorithm guarantees separation
between entities in the face of crash failures of the software controlling a
partition. Once new failures cease to occur, it guarantees progress of all
entities that are not isolated by failed partitions to the target. Through
simulations, throughput was estimated as a function of velocity, minimum
separation, path complexity, and failure-recovery rates. The algorithm is
presented for a two-dimensional square-grid partition; however, an exten-
sion to three-dimensional cube partitions follows in an obvious way.
Figure 4.10: Throughput versus failure rate p f for several recovery rates pr
(0.05, 0.1, 0.15, and 0.2), with an initial path of length 8, where K = 20000,
rs = 0.05, l = 0.2, and v = 0.2, for System with 8 × 8 cells.
CHAPTER 5
5.1 Introduction
The goal of the safe flocking problem is to ensure that a collection of agents:
(a) always maintain a minimum safe separation (that is, the agents avoid
collisions),
(b) eventually form a flock by converging to a desired formation, and
(c) eventually reach a specified destination.
The flocking problem has a rich body of literature (see, for example [86–89],
and the references therein) and has several applications in robotics and
automation, such as robotic swarms and the automated highway system.
This case study considers flocking in one dimension where some agents
may fail.
In order to allow non-faulty agents to avoid colliding with faulty ones,
there must be a way for the non-faulty agents to go around them. In this
thesis, this is addressed by allowing different agents to reside in different
lanes; see Figure 5.1 on page 55. A lane is a real line and there are finitely
many lanes. Informally, a non-faulty agent can then avoid collisions by mi-
grating to a different lane appropriately. Several agents can, and normally
do, reside in a single lane.
Each agent periodically computes its target based on the messages received from its
neighbors and moves towards the target with some arbitrary but bounded
velocity. The targets are computed such that the agents preserve safe sep-
aration and they eventually form a weak flock configuration. Once a weak
flock is formed it remains invariant, and progress is ensured to a tighter
strong flock. Once a strong flock is attained by the set of agents, this property
can be detected through the use of a distributed snapshot algorithm [53].
Once the snapshot algorithm detects that the global state of the system sat-
isfies the strong flock predicate, the detecting agent makes a move towards
the destination, sacrificing the strong flock, but still preserving the weak
flock.
Actuator failures are modeled as exogenous events that set the velocity of
a non-failed agent to an arbitrary but constant value. This could correspond
to a robot’s motors being stuck at an input voltage, causing the robot to
forever move in a given direction with a constant speed. Likewise it could
correspond to a control surface being stuck in a given position, resulting
in movement forever in a given direction. After failure, the failed agent
continues to compute targets, send and receive messages, but its actuators
simply ignore all this and continue to move with the failure velocity.
Certain failures lead to immediate violation of safety, while others, such
as failing with zero velocity at the destination, are undetectable. The
algorithm determines only the direction in which an agent should move,
based on neighbor information. The speed with which it moves is left
as a non-deterministic choice. Thus, the only way of detecting failures
is to observe that an agent has moved in the wrong direction. Under
some assumptions about the system parameters, a simple lower bound
is established, indicating that no detection algorithm can detect failures
in less than O(N) rounds. A failure detector is presented that utilizes
this idea in detecting certain classes of failures in O(N) rounds. Finally,
it is shown that the failure detector can be combined with the flocking
algorithm to guarantee the required safety and progress properties in the
face of a restricted class of actuator failures.
5.1.2 Literature on Flocking and Consensus in Distributed
Computing and Controls
The distributed computing consensus problem, that of a set of processors
agreeing upon some common value based on some inputs, under a variety
of communications constraints (synchronous, partially synchronous, or
totally asynchronous) and failure situations, has been studied extensively
by the distributed systems community [11, 34, 91]. In the consensus problem, every agent has an input from a well-ordered set, and the agents must satisfy the following conditions:
(a) an agreement condition, that all non-faulty agents decide upon the same value,
(b) a termination condition, that all non-faulty agents eventually decide, and
(c) a validity condition, that if all inputs to all agents are the same, then the value decided by all non-faulty agents must be the common input.
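As a minimal, hypothetical illustration (the names and code are not from the thesis), these three conditions can be phrased as predicates over the inputs and decided values of a run:

```python
def agreement(decisions, faulty):
    # All non-faulty agents that have decided chose the same value.
    values = {v for i, v in decisions.items() if i not in faulty}
    return len(values) <= 1

def validity(inputs, decisions, faulty):
    # If all inputs share a common value v, every non-faulty decision is v.
    if len(set(inputs.values())) != 1:
        return True  # vacuously satisfied when inputs differ
    (v,) = set(inputs.values())
    return all(d == v for i, d in decisions.items() if i not in faulty)

def termination(agents, decisions, faulty):
    # Every non-faulty agent eventually decides (here: has a decision).
    return all(i in decisions for i in agents if i not in faulty)
```

For example, with agents = {0, 1, 2}, common input 5, decisions {0: 5, 1: 5}, and faulty = {2}, all three predicates hold.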
reaching a neighborhood of the fixed point, that is, by allowing the error
to approach a set about the equilibrium instead of the equilibrium, giving
a finite-time termination. Such constraints have been imposed on this
problem from the controls community, normally through quantization of
sensor or actuator values [98, 99].
Some attention has been given to the problem of failure detection in the
flocking problem. Most closely related to this case study is [61], which
works with a similar model of actuator failures. However, this work dis-
cusses using the developed motion probes in failure detection scenarios, but
has no stated bounds on detection time as more effort was spent ensuring
convergence to the failure-free centroid assuming that failure detection has
occurred within some time. To the best of the author’s knowledge, there
has been no work on provable avoidance of collisions with such a failure
model, only detection of such failures and mitigation to ensure progress
(convergence).
Based on these messages, the agents update their software state and decide
their (possibly zero) velocities. Until the beginning of the next round, the
agents continue to operate according to this velocity. However, an agent
may fail, that is, it may get stuck with a (possibly zero) velocity, in spite
of different computed velocities. The key novelty of this case study is
that the algorithm incorporates failure detection and collision prevention
mechanisms.
Assume that the messages are delivered within bounded time and computations are instantaneous. Recall from Chapter 2 that under these assumptions, the system can be modeled as an SSDCPS as defined in that chapter. Refer to an individual agent as Agenti. The SSDCPS model now follows.
[Figure 5.1: Agents on lanes 0–2 of length L, each with communication radius rc; the head agent H(x) = 2 is nearest the goal and the tail agent T(x) = 8 is farthest from it.]
5.2.2 Formal System Model
Let ID ≜ [N − 1] be the set of identifiers for all possible agents that may be present in the system. Each agent has a unique identifier i ∈ ID. Each agent is positioned on a lane with an identifier in the set IDL ≜ [NL − 1].
The following constant parameters are used throughout this chapter:
(i) rs : minimum required inter-agent gap or safety distance when there
are no faulty agents in the system,
(ii) rr : reduced safety distance when there are faulty agents in the system,
(vii) vmin , vmax : minimum and maximum velocity, or minimum and maxi-
mum distance by which an agent may move over one round.
State Variables. Each Agenti has the following state variables, where
initial values of the variables are shown in Figure 5.2 using the ‘:=’ notation.
(a) x, xo: position and old position (from the previous round) of agent i on
the real line,
(b) u, uo: target position and old target position (from the previous round)
of agent i on the real line,
(c) lane: the parallel real line upon which agent i is physically located,
variables
  x, xo : R
  u, uo : R := x
  lane : IDL := 1
  snaprun : B := false
  gsf : B := false
  failed : B := false
  vf : R⊥ := ⊥
  Suspected : Set[ID] := {}
  Nbrs : Set[ID] := Nbrs(x, i)
  L : Nbrs := LS(x, i)
  R : Nbrs := RS(x, i)

Figure 5.2: Variables of Agenti.
Neighbors. Agenti is said to be a neighbor of a different Agentj at state x if and only if |x.xi − x.xj| ≤ rc, where rc > 0. The set of identifiers of all neighbors of Agenti at state x is denoted by

Nbrs(x, i) ≜ {j ∈ ID : i ≠ j ∧ |x.xi − x.xj| ≤ rc}.
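As a sketch (hypothetical Python, with positions as a dict from identifier to coordinate), the neighbor set is simply:

```python
def nbrs(x, i, rc):
    """Nbrs(x, i): identifiers within communication radius rc of agent i."""
    return {j for j in x if j != i and abs(x[i] - x[j]) <= rc}
```

For instance, with x = {0: 0.0, 1: 0.5, 2: 2.0} and rc = 1.0, nbrs(x, 0, 1.0) yields {1}.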
Let L(x, i) (and symmetrically R(x, i)) be the nearest non-failed neighbor of
Agenti at state x such that x.xL(x,i) ≤ x.xi (symmetrically xR(x,i) ≥ x.xi ) or ⊥ if
no such neighbor exists (ties are broken by the unique agent identifiers).
So L(x, i) and R(x, i) take values from {⊥} ∪ Nbrs(x, i) \ F(x). Let LS(x, i) (and symmetrically RS(x, i)) be the nearest neighbor not suspected by Agenti at state x such that x.xLS(x,i) ≤ x.xi (symmetrically x.xRS(x,i) ≥ x.xi), or ⊥ if no such neighbor exists. So LS(x, i) and RS(x, i) take values from {⊥} ∪ Nbrs(x, i) \ x.Suspectedi, and thus LS(x, i) (and RS(x, i)) is the identifier of the nearest non-suspected agent positioned to the left (right) of i on the real line. LS(x, i) or RS(x, i) changes for some i only upon failures occurring and these failed agents subsequently becoming suspected. We denote by NR(x, i)
and NL(x, i) the number of non-failed agents located to the right, and
respectively to the left, of Agenti at state x.
If Agenti has both left and right neighbors, it is said to be a middle agent.
If Agenti does not have a right neighbor, it is said to be a tail agent. If Agenti
does not have a left neighbor it is said to be a head agent. For a state x, let
Heads(x) ≜ {i ∈ NF(x) : LS(x, i) = ⊥},
Tails(x) ≜ {i ∈ NF(x) : LS(x, i) ≠ ⊥ ∧ RS(x, i) = ⊥},
Mids(x) ≜ NF(x) \ (Heads(x) ∪ Tails(x)), and
RMids(x) ≜ Mids(x) \ {R(x, H(x))}.

The identifier of the non-suspected agent closest to the goal (the origin) is denoted by H(x) ≜ min NS(x), and the identifier of the non-suspected agent farthest from the goal by T(x) ≜ max NS(x).
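These sets can be computed directly from positions and suspicions. The following sketch (hypothetical code; a fixed radius RC and None standing in for ⊥ are assumptions) treats all agents as non-failed:

```python
RC = 1.0  # assumed communication radius

def left_ns(x, i, suspected=frozenset()):
    # Nearest non-suspected neighbor at or left of agent i, else None (⊥).
    cands = [j for j in x if j != i and j not in suspected
             and abs(x[i] - x[j]) <= RC and x[j] <= x[i]]
    return max(cands, key=lambda j: x[j]) if cands else None

def right_ns(x, i, suspected=frozenset()):
    # Nearest non-suspected neighbor at or right of agent i, else None (⊥).
    cands = [j for j in x if j != i and j not in suspected
             and abs(x[i] - x[j]) <= RC and x[j] >= x[i]]
    return min(cands, key=lambda j: x[j]) if cands else None

def heads(x):
    return {i for i in x if left_ns(x, i) is None}

def tails(x):
    return {i for i in x if left_ns(x, i) is not None
            and right_ns(x, i) is None}

def mids(x):
    return set(x) - heads(x) - tails(x)
```

With positions {0: 0.0, 1: 0.5, 2: 1.0}, agent 0 is the head, agent 2 the tail, and agent 1 the only middle agent.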
Neighbor Variables. Each agent i has the following variables,
(a) Nbrs: this variable is the set of identifiers of agents which are neighbors
of agent i at the pre-state x of any transition, so it is Nbrs(x, i), and
(b) L and R: these variables are the identifiers of the neighbor with the
nearest left and right position, respectively, at the pre-state x of any
transition, so the agent j with x.x j nearest to x.xi from the left and right,
respectively.
[Figure 5.3: Shared state between neighboring agents: each Agenti's variables xi, xoi, ui, uoi, lanei, Suspectedi, snapruni, gsfi, failedi, vfi, and Nbrsi are readable by its neighbors.]
Failure Detection. The first stage of making this algorithm fault-tolerant
is the detection of failures described below. Recall from Chapter 2 that the
detection time of a failure detector is the number of rounds until each failed
agent has been suspected by each of its non-faulty neighbors.
Definition 5.1. For any execution α, let xf ∈ α be a state with failed agents F(xf). Assuming no further failures occur, let xd be a state in α reachable from xf such that ∀i ∈ NF(xd), F(xf) ∩ Nbrs(xd, i) ⊆ xd.Suspectedi. Then the detection time kd is d − f rounds, where f and d are the rounds of xf and xd.
Assumption 5.2. Assume there exists a constant kd that satisfies the above
statement for all executions and for all x f .
The results in Section 5.4 will rely on this assumption. Then Subsec-
tion 5.4.6 will introduce the conditions under which it is possible for a
failure detector to match this number of rounds and hence guarantee all
properties previously proven under this assumption in Section 5.4.
Such a detection is only based on the messages received by i from j,
and hence the shared variables described above. This is modeled by i
having access to some of j’s state, respectively current and old positions
x j and xo j and current and old targets u j and uo j . Assume that any failure
detection service has access only to these shared variables. Alternatively,
an agent could report itself as being suspected. However, it is ideal for
other agents to detect failures, as in the case of adversarial failures where
an agent could falsely (or not) report itself as having failed. While the
model we are utilizing relies on messages from an agent that may be
failed, the quantities used could be estimated from physical state by the
agents performing failure detection. Hence in essence, i’s knowledge of j
is for clear presentation only. When the conditional in Figure 5.4, Line 7
is satisfied for some neighbor j, then j is added to the Suspectedi set. This
conditional roughly states that, at a state x for some agent j ∈ Nbrs(x, i), agent i suspects j when i learns through the shared memory that j wanted to move in one direction, as specified by its target x.uj, but in fact moved in the other direction, as specified by its new position x′.xj, which is in the direction opposite x.uj.
transitions                                                        1
faili(v)                                                           2
  eff failed := true;                                              3
      vf := v;                                                     4

suspecti                                                           6
  pre ∃ j ∈ Nbrs, (j ∉ Suspected ∧ ((|xoj − uoj| ≤ β ∧ xj − uoj ≠ 0) ∨    7
      (|xoj − uoj| > β ∧ sgn(xj − xoj) ≠ sgn(uoj − xoj))))         8
  eff Suspected := Suspected ∪ {j}                                 9

snapStarti                                                        11
  pre L = ⊥ ∧ ¬snaprun                                            12
  eff snaprun := true // global snapshot invoked                  13

snapEndi(GS)                                                      15
  eff gsf := GS // global snapshot terminated, giving if strong flock satisfied    16
      snaprun := false                                            17

updatei                                                           19
  eff uo := u;                                                    20
      xo := x;                                                    21
      for each j ∈ Nbrs                                           22
        Suspected := Suspected ∪ Suspectedj // share suspected sets    23
      end                                                         24
      Mitigate:                                                   25
      for each {s ∈ Suspected : lanes = lane}                     26
        if (∃ L ∈ IDL : ∀ j ∈ Nbrs, lanej = L ⇒ xj ∉ [x − 2kd vmax, x + 2kd vmax]) then    27
          lane := L; fi                                           28
      end                                                         29
      Target:                                                     30
      if L = ⊥ then                                               31
        if gsf then u := x − min{x, δ/2}; gsf := false;           32
        else u := x fi                                            33
      elseif R = ⊥ then u := (xL + x + rf)/2                      34
      else u := (xL + xR)/2 fi                                    35
      Quant: if |u − x| < β then u := x; fi                       36
      Move:                                                       37
      if failed then x := x + vf                                  38
      else x := x + sgn(u − x) choose [vmin, vmax]; fi            39

Figure 5.4: Transitions of Agenti.
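The suspicion conditional (Figure 5.4, Line 7) can be sketched in Python as follows (a hypothetical rendering of the predicate, not the thesis's code); xo_j and uo_j are agent j's shared old position and old target, and x_j its new position:

```python
from math import copysign

def sgn(v):
    return 0.0 if v == 0 else copysign(1.0, v)

def suspect(xo_j, uo_j, x_j, beta):
    """True iff agent j's observed move contradicts its old target.

    If the old target was within the quantization band beta of the old
    position, j should not have moved at all; otherwise j should have
    moved in the direction of its old target.
    """
    if abs(xo_j - uo_j) <= beta:
        return x_j - uo_j != 0
    return sgn(x_j - xo_j) != sgn(uo_j - xo_j)
```

For example, suspect(0.0, 1.0, -0.2, 0.05) is True (the agent moved left while targeting right), while suspect(0.0, 1.0, 0.2, 0.05) is False.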
Failure Mitigation. Agents are aligned on lanes, which are parallel real
lines. Agents cannot collide or violate safety unless they reside in the same
lane. To mitigate failures to ensure safety and progress properties, non-
failed agents will pass failed agents that are moving incorrectly by entering
a different lane. This is accomplished by the Mitigate subroutine of the
update transition.
State Transitions. The state transitions are fails, snapStarts, snapEnds,
suspects, and updates. A faili (v) transition where i ∈ NF(x) for a state x
models the permanent failure of Agenti . As a result of this transition, failedi
is set to true and vf i is set to v. This causes Agenti to move forever with
velocity v. Assume that |v| ≤ vmax , which is reasonable due to actuation
constraints.
The snapStart and snapEnd transitions model the periodic initialization
and termination of a distributed global state snapshot protocol, such as
Chandy and Lamport’s snapshot algorithm [53]. This global state snap-
shot is used in the update transition to detect a stable global predicate as
described below. We model the initialization of this algorithm by snapStart
and the termination as snapEnd. Termination is guaranteed since the run-
ning time of the algorithm is O(N) rounds. This is ensured by Assump-
tion 5.4, which states that within O(N) rounds of a snapStart transition,
a snapEnd input transition occurs with a Boolean parameter GS which
specifies whether the global state satisfied the specified stable predicate.
We note that the assumptions to apply Chandy-Lamport’s algorithm are
satisfied here since
The computations of Mitigate, Target, Quant, and Move are all assumed
to be instantaneous. There is a slight separation from physical state evolu-
tion here as Move is abstractly capturing the duration of time required to
move agents by their specified velocities and is not instantaneous. Mitigate
attempts to restore safety and progress properties that may be reduced or
violated due to failures. Target is the flocking algorithm, which roughly
averages the positions of the closest left and right non-suspected neighbors
of an agent. Quant is the quantization step which prevents targets ui com-
puted in the Target subroutine from being applied to real positions xi if the
difference between the two is smaller than the quantization parameter β. Fi-
nally, Move moves agent positions xi towards the quantized targets. Thus, for x →update x′, the post-state x′ is obtained by applying each of these subroutines in order. We will refer to the internal state after Mitigate, Target, Quant, and Move as xM, xT, xQ, and xV, respectively. Specifically, xM ≜ Mitigate(x), xT ≜ Target(xM), etc., and observe that x′ = xV = Move(xQ). For a state specified by a round k, such as xk, we use the notation xk,T to indicate the state of System at round k following the Target subroutine, so xk,T ≜ Target(Mitigate(xk)).
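The per-round pipeline for a single agent can be sketched as follows (hypothetical Python; lane changes are omitted, and the bounded step is capped at the target, an assumption beyond the model):

```python
import random

def target(x_i, x_left, x_right, rf, gsf=False, delta=0.1):
    # Figure 5.4, Lines 31-35: head, tail, and middle target rules.
    if x_left is None:                       # head agent
        return x_i - min(x_i, delta / 2) if gsf else x_i
    if x_right is None:                      # tail agent
        return (x_left + x_i + rf) / 2
    return (x_left + x_right) / 2            # middle agent

def quant(u, x_i, beta):
    # Line 36: suppress moves smaller than the quantization parameter.
    return x_i if abs(u - x_i) < beta else u

def move(x_i, u, vmin, vmax):
    # Line 39: step toward the target with a nondeterministic bounded speed.
    if u == x_i:
        return x_i
    step = min(random.uniform(vmin, vmax), abs(u - x_i))
    return x_i + step if u > x_i else x_i - step

def update(x_i, x_left, x_right, rf, beta, vmin, vmax):
    # x' = Move(Quant(Target(x))) for one agent (Mitigate omitted).
    return move(x_i, quant(target(x_i, x_left, x_right, rf), x_i, beta),
                vmin, vmax)
```

A middle agent already at the average of its neighbors, or within β of its target, does not move; otherwise it takes a step of magnitude in [vmin, vmax] toward the target.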
5.2.3 Model as a Discrete-Time Switched Linear System
The following presents a view of the system as a discrete-time switched system and shows that failures can be modeled as a combination of an additive affine term and a switch to another system matrix.
Discrete-time switched systems can be described as x[k + 1] = fp (x[k]) in
general where x ∈ RN , p ∈ P for some index set P, such as P = {1, 2, . . . , m},
or x[k + 1] = Ap x[k] for linear discrete-time switched systems [100]. For
the following, assume that Figure 5.4, Line 39 is deleted and replaced with
x := u. This deletion removes the nondeterministic choice of velocity with
which to set position x, and instead sets it to be the computed control
value u. This nondeterministic choice can be modeled through the use of
a time-varying system matrix A as in [90], but we omit it for simplicity of
presentation.
The effect of an update transition on the position variables of all agents in
System can be represented by the difference equation x[k + 1] = Ap x[k] + bp
where for a state xk at round k,
x[k] = ( xk.xH(xk), xk.xxk.RH(xk), . . . , xk.xT(xk) )ᵀ,
Ap is the N × N tridiagonal matrix whose first row has a1,1 on the diagonal and zeros elsewhere, whose row i for 1 < i < N has entries ai,i−1, ai,i, and ai,i+1 on the sub-, main, and super-diagonal, and whose last row has entries aN,N−1 and aN,N, with all other entries zero, and
bp = ( b1, . . . , bi, . . . , bN )ᵀ.
The following are the family of matrices Ap and vectors bp that are switched among based on the state of System; refer to Figure 5.4 for the referenced line numbers. From Line 32, for H(xk), if FlockS(xk), then either (a) if xk.xH(xk) ≥ δ, then a1,1 = 1 and b1 = −δ/2, or otherwise (b) a1,1 = 0 and b1 = 0. From Line 33, if ¬FlockS(xk), then a1,1 = 1 and b1 = 0. From Line 35, for i ∈ Mids(xk), ai,i = 0, ai,i−1 = 1/2, ai,i+1 = 1/2, and bi = 0. Finally, from Line 34, for T(xk), aN,N−1 = 1/2, aN,N = 1/2, and bN = rf/2.
Next, all coefficients in the matrix can change due to the quantization law in Line 36. If the conditional on Line 36 is satisfied for agent i ∈ Mids(xk), then ai,i = 1, ai,xk.Li = 0, ai,xk.Ri = 0, and bi = 0; for agent i = H(xk), a1,1 = 1 and b1 = 0; and for agent i = T(xk), ai,xk.Li = 0, ai,i = 1, and bi = 0.
Failures also cause a switch of system matrices. The actuator stuck-at failures being modeled are representative of an additive error term in the bp vector [44]. From Line 38, for i ∈ Mids(xk), ai,i = 1, ai,xk.Li = 0, ai,xk.Ri = 0, and bi = xk.vfi; for i = H(xk), a1,1 = 1 and b1 = xk.vfH(xk); and for i = T(xk), aN,N−1 = 0, aN,N = 1, and bN = xk.vfT(xk).
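As a concrete sketch (hypothetical code; N = 4, the head holding position per Line 33, and no quantization), the nominal mode and a stuck-actuator mode can be built and stepped as follows:

```python
def nominal(N, rf):
    # Head holds position (Line 33); mids average their neighbors (Line 35);
    # tail averages its left neighbor and itself, offset by rf/2 (Line 34).
    A = [[0.0] * N for _ in range(N)]
    b = [0.0] * N
    A[0][0] = 1.0
    for i in range(1, N - 1):
        A[i][i - 1] = A[i][i + 1] = 0.5
    A[N - 1][N - 2] = A[N - 1][N - 1] = 0.5
    b[N - 1] = rf / 2
    return A, b

def with_failure(N, rf, i, vf):
    # Stuck-at actuator failure of agent i (Line 38): x_i[k+1] = x_i[k] + vf.
    A, b = nominal(N, rf)
    A[i] = [1.0 if j == i else 0.0 for j in range(N)]
    b[i] = vf
    return A, b

def step(A, b, x):
    # One round: x[k+1] = A_p x[k] + b_p.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]
```

An equally spaced chain x = [0, 1, 2, 3] with rf = 1 is a fixed point of the nominal mode; with agent 2 stuck at vf = 0.05, that agent drifts away each round.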
The requirements of minimum spacing are formalized through the predicates Safety and SafetyR:

Safety(x) ≜ ∀i, j ∈ ID, i ≠ j : x.lanei = x.lanej ⇒ |x.xi − x.xj| ≥ rs,
SafetyR(x) ≜ ∀i, j ∈ ID, i ≠ j : x.lanei = x.lanej ⇒ |x.xi − x.xj| ≥ rr.
It will be shown that without failures, Safety is maintained for all reach-
able states, but upon failures occurring, when it is possible to be main-
tained, reachable states satisfy the weaker SafetyR (x) for some time, prior
to Safety(x) being restored.
Without a notion of liveness or progress, however, safety can be trivially
maintained by agents not moving. In this case study, there are two progress
properties. The first is called the flocking property, which states that agents
reach states where their positions are in a flock or an equally spaced for-
mation. Specifically, it is when the differences of positions between adjacent agents are near the flocking distance rf, within a tolerance parameter ε. States which satisfy such a spacing of agent positions are defined by the predicate

Flock(x, ε) ≜ ∀i ∈ NS(x), LS(x, i) ≠ ⊥ : |x.xi − x.xLS(x,i) − rf| ≤ ε.
Flock is then instantiated as a weak flock by FlockW and strong flock by FlockS ,
which respectively specify a larger and smaller error from agent positions
being exactly spaced by r f . Given the flocking tolerance parameter δ > 0,
define respectively states where agent positions satisfy a weak flock and a
strong flock as
FlockW(x) ≜ Flock(x, δ), and
FlockS(x) ≜ Flock(x, δ/2).
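A sketch of these predicates over a head-to-tail sorted list of non-suspected positions (hypothetical Python, not the thesis's notation):

```python
def flock(positions, rf, eps):
    # Every adjacent gap is within eps of the flocking distance rf.
    return all(abs((b - a) - rf) <= eps
               for a, b in zip(positions, positions[1:]))

def flock_w(positions, rf, delta):
    return flock(positions, rf, delta)        # weak flock: tolerance delta

def flock_s(positions, rf, delta):
    return flock(positions, rf, delta / 2)    # strong flock: tolerance delta/2
```

For positions [0.0, 1.02, 2.0] with rf = 1 and δ = 0.1, both predicates hold; widening a gap to 1.08 keeps the weak flock but breaks the strong flock.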
goal and are defined by the predicate

Goal(x) ≜ x.xH(x) ∈ [0, β].
The NBM definition defines states from which middle and tail agents can no longer make progress due to quantization. No big moves (NBM) states are those in which no middle or tail agent can move by more than the quantization parameter β > 0; they are defined by the predicates

NBM(x) ≜ ∀i ∈ NF(x), LS(x, i) ≠ ⊥ : |xT.ui − x.xi| ≤ β, and
Terminal(x) ≜ Goal(x) ∧ NBM(x).
5.4 Analysis
Having described the system and failure models formally, in this section the behavior of System is analyzed, first establishing some basic behaviors.
[Figure: Relationships among the sets of states Q0, Safe, SafeR, FlockW, FlockS, NBM, and Goal, with transitions along failure-free execution fragments αff, fragments with failures αf, and fragments with failures and detection αfd.]
5.4.1 Assumptions
The following assumptions are required on the constant parameters used
throughout the chapter:
(v) the graph of neighbors is strongly connected and the graph of non-
faulty agents may never be partitioned.
Assumption (i) indicates that the reduced safety margin rr seen under
failures is strictly less than the safety margin rs when no failures are present.
It then states the desired inter-agent spacing r f is strictly greater than
these safety margins and strictly less than the communications radius rc .
Assumption (ii) prevents the agent nearest to the goal from moving beyond
the communications radius of any right agent it is adjacent to, that is, it
prevents disconnection of the graph of neighbors. Assumption (iii) bounds
the minimum and maximum velocities, although they may be equal. It
then upper bounds the maximum velocity to be less than or equal to the
quantization parameter β. This is necessary to prevent a violation of safety
due to overshooting computed targets. Finally, β is upper bounded in
such a way that it is possible to establish that NBM ⊆ FlockS . Assumption
(iv) allows the safety and progress properties to be maintained in spite of
failures (under further restrictions to be introduced) by allowing agents
to move among a set of NL lanes, preventing collisions of failed and non-
failed agents and allowing non-failed agents to pass failed agents which
are not moving in the direction of the goal. Assumption (v) is a natural
assumption indicating there is a single network of agents. It further states
that failures do not cause the graph of non-faulty neighbors to partition.
For the remainder of the chapter, we make the following assumption.

Assumption 5.3. For any initial state x, Safety(x) ∧ x.xH(x) ≥ 0.
The following assumption states that for an agent i, a snapEndi transition
occurs within O(N) rounds from the occurrence of any snapStarti transi-
tion. Essentially it ensures termination of the global snapshot algorithm so
that any agent which relies on this algorithm for target computation may
calculate targets infinitely often. Thus it is used to ensure progress of the
algorithm.
Assumption 5.4. For any execution α, let x be a state in α such that x →snapStarti x′. Then there exists a state x′′ in α such that x′′ →snapEndi x′′′, where x′′ is a state reachable from x′. Furthermore, x′′ is at most O(N) rounds from x in the sequence α.
Lemma 5.5. Consider any reachable state x such that, for all i ∈ NF(x), Nbrs(x, i) = x.Nbrsi. For any agent i ∈ ID and any state x′ such that x →a x′ for a ∈ A \ {update}, Nbrs(x′, i) = x′.Nbrsi and Nbrs(x, i) = x.Nbrsi.
The next lemma states that if neighbors change, then they do so sym-
metrically. This is used to establish safety upon agents no longer relying
on suspected agents for target computation.
Lemma 5.6. For any reachable state x such that x →a x′ for any a ∈ A, ∀i, j ∈ ID, if x.Li ≠ j and x′.Li = j, then x′.Rj = i.
Proof: Fix i and j and observe that only the suspect or update action changes LS(x, i) or RS(x, j), by changing either the positions of agents xi or the sets of suspected agents. By Lemma 5.5, we consider L and R. There are two cases when x.Li ≠ x′.Li = j. The first is when agents that were not neighbors at x become neighbors at x′, that is, j ∉ x.Nbrsi and j ∈ x′.Nbrsi. This is only possible due to the update action, since no other action changes xi. By the definition of neighbor, also i ∉ x.Nbrsj and i ∈ x′.Nbrsj. By the symmetric definitions of LS(x′, i) and RS(x′, j), we have x′.Rj = i.

The second case is when agents i and j were neighbors at x, so j ∈ x.Nbrsi and i ∈ x.Nbrsj, but now have at least one suspected agent f with i > f > j between them, where f ∈ x.Nbrsi ∩ x.Nbrsj. This is possible due to the suspect or update transitions. Prior to suspecting that f has failed, no change of LS(x, i) or RS(x, j) occurs, by definition, implying that for the hypothesis of the lemma to be satisfied, x′ must be a state where f ∈ x′.Suspectedi ∩ x′.Suspectedj, since i and j both use the same suspect action at Figure 5.4, Line 6. In this case, the symmetric switch occurs by the definitions of LS(x, i) and RS(x, j), and we have x′.Rj = i. Otherwise, f ∉ x′.Suspectedi ∩ x′.Suspectedj, contradicting x.Li ≠ x′.Li.
Lemma 5.7. Let x be a state along any execution of System and assume that F(x) = ∅. Consider the execution fragment α′ = x, fail1(v), x′, fail2(v), x′′, . . . , failN(v), xf; that is, ∀i ∈ ID, let faili(v) occur, where v is the same for each of these faili transitions. Then, for any state xs appearing after xf in α′, Safety(xs) holds.
states of agents with identifiers in the set of non-suspected agents NS(x), and not the set of non-failed agents NF(x) or all agents ID. Observe that if Flock(x) were defined with ID, then by Lemma 5.8, at no future point in time could Flock(x) be attained. Furthermore, if Flock(x) relied on NF(x) instead of NS(x), then potentially the failure detection algorithm could rely upon the head agent's detection of this predicate on the global snapshot for detection of failures.
We end this section by presenting the motivation for sharing sets of sus-
pected agents among agents in Figure 5.4, Line 23, so this lemma assumes
this line of code is deleted. The following gives a failure condition under
which no moves are possible and hence no progress can be made.
Lemma 5.9. Assume that agents do not share sets of suspected agents, so Figure 5.4, Line 23 is deleted. Consider any execution α such that, for a state x ∈ α, F(x) ≠ ∅ and, ∀i ∈ ID \ H(x),

x.xi − x.xLi = x.xRi − x.xi = · · · = x.xT(x) − x.xLT(x) > rf + δ/2,

such that ¬FlockS(x). Let there be a single non-faulty agent p located farther than rc from agent T(x), so that p ∉ x.NbrsT(x). Let α′ be an execution fragment starting from x such that, for every state x′ ∈ α′, ID = F(x′) ∪ {p} and x′.vfj = 0 for all j ∈ F(x′). Then, for all states x′′ reachable from x′, x′′.Suspectedp = ∅ and, ∀i ∈ ID, x′′.xi = x′.xi.
First, a bound is established on the maximum distance, vmax, that any failed or non-failed agent moves in any round. This then implies that any two agents move towards or away from one another by at most 2vmax in any round. Then, if non-failed agents change neighbors,
it is shown that they do not violate safety. Next, a condition on when a
single agent can fail for maintenance of reduced safety is given. Finally,
the safety property is shown to be invariant without failures, and with the
aforementioned condition, in the face of one failure, the reduced safety
property is proven.
Lemma 5.10 shows that any agent moves by at most a positive constant vmax in any round. Otherwise, an agent is not allowed to move due to quantization constraints, so it moves by 0 in that round, which is also at most vmax. The proof follows since update is the only action to change any xi; then, from Figure 5.4, Line 39, by the assumption that ∀i ∈ F(x), |x.vfi| ≤ vmax, and since failures are permanent, for any state x′ reachable from x, x′.vfi = x.vfi.
Lemma 5.10. For any execution α, for states x, x′ ∈ α such that x →a x′ for any a ∈ A, ∀i ∈ ID, |x′.xi − x.xi| ≤ vmax.
The following corollary states that any two agents move towards or
away from one another by at most 2vmax from one round to another and
follows from Lemma 5.10.
Corollary 5.11. For any execution α, for states x, x′ ∈ α such that x →a x′ for any a ∈ A, ∀i, j ∈ ID such that i ≠ j, |(x′.xi − x.xi) − (x′.xj − x.xj)| ≤ 2vmax.
The next lemma establishes that upon agents switching neighbors used
in Target by changes of either neighbors Nbrs(x, i) or LS (x, i) or RS (x, i) from
x to x′, safety is maintained.
Lemma 5.12. For any execution α, for states x, x′ ∈ α such that x →a x′ for any a ∈ A, ∀i, j ∈ ID, if LS(x, i) ≠ j and RS(x, j) ≠ i and LS(x′, i) = j and RS(x′, j) = i and x.xRS(x,j) − x.xLS(x,i) ≥ rs, then x′.xRS(x′,j) − x′.xLS(x′,i) ≥ rs.
Proof: Only suspect and update modify LS(x, i), RS(x, i), or xi for any i. By Lemma 5.5, we discuss L and R. By Lemma 5.6, which states that neighbor switching occurs symmetrically, if x.Li ≠ j and x′.Li = j, then x′.Rj = i. It remains to be established that x′.xx′.Rj − x′.xx′.Li ≥ rs. For convenient notation, observe that x′.xx′.Rj = x′.xi and x′.xx′.Li = x′.xj. Now,

x′.xj = (x.xx.Lj + x.xi)/2, and
x′.xi = (x.xj + x.xx.Ri)/2,

and thus

x′.xi − x′.xj = (x.xj + x.xx.Ri)/2 − (x.xx.Lj + x.xi)/2
             = ((x.xj − x.xx.Lj) + (x.xx.Ri − x.xi))/2.

Since each of the two differences on the right is at least rs,

x′.xi − x′.xj ≥ (rs + rs)/2 = rs.

The cases for i = N and j = 1 follow by similar analysis, as does the case when x′.xm is quantized so that x.xm = x′.xm for any m ∈ ID.
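A quick numeric check of the averaging step in the proof (hypothetical values; xLj, xj, xi, and xRi stand for x.xx.Lj, x.xj, x.xi, and x.xx.Ri):

```python
rs = 0.5
# Pre-state positions; the gaps around the removed agent are each >= rs.
xLj, xj, xi, xRi = 0.0, 0.6, 1.8, 2.4

# Post-state positions per the averaging rule in the proof.
xj_new = (xLj + xi) / 2
xi_new = (xj + xRi) / 2

gap = xi_new - xj_new
# gap = ((xj - xLj) + (xRi - xi)) / 2 >= (rs + rs) / 2 = rs
```

Here gap = 0.6 ≥ rs = 0.5, matching the bound in the proof.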
Invariant 5.13 shows that targets ui and positions xi are always safe in the absence of failures. When failures can occur, under the following assumption about detection and mitigation of such failures, a weaker reduced safety property is invariant. In particular, all analysis in the face of failures relies on detection of any failed agents within kd rounds of any faili(v) transition.
Proof: The proof is by induction over the length of any execution of System. The base case follows from Assumption 5.3. For the inductive case, for each transition a ∈ A, we show that if x →a x′ and x ∈ Safety, then x′ ∈ Safety.
(a) update: The only times x′.ui ≠ x.ui are on an update transition. The inductive hypothesis provides Assumption 5.3 for the pre-state x. By Lemma 5.10, it is sufficient to show that, ∀i ∈ ID,
All of the following follows from Figure 5.4, Lines 31–35. For all
i ∈ NF(x) ∩ Mids(x),
For i = T(x),
This implies at least SafetyR(xm) for any state xm in the execution between x and xd. Since x.xf − x.xi ≥ rs, we have xd.xf − xd.xi ≥ rr, and SafetyR(xd) is established. It remains to be established that SafetyR holds for states reachable from xd. Any agent i such that f ∈ xd.Nbrsi will have f ∈ xd.Suspectedi, which changes LS and RS, but applying Lemma 5.12 still yields Safety(xd). Finally, by Figure 5.4, Line 28, xd.lanei ≠ xd.lanef since NL ≥ 2, and hence SafetyR(xd).

(b) faili(v), snapStarti, snapEndi, and suspecti: these transitions do not modify any xi or ui, so Safety(x′).
5.4.5 Progress
In this section it is established that along executions of System in which fail actions cease to occur, System reaches a terminal state, that is, one satisfying
Terminal. To show this, it is first established that a state x satisfying NBM
is reached. It is further argued that NBM ⊆ FlockS so that x also satisfies
FlockS . That is, System reaches states from which no non-head agent
may move and such states satisfy the strong flocking condition, in that
they are roughly equally spaced with a tight tolerance parameter. Upon
FlockS being satisfied, it is shown that progress is made towards a state
x satisfying Goal. Upon such progress being made, only FlockW remains
invariant, but by reapplication of the previous arguments for reachability
of states satisfying NBM, another state x is reached which again satisfies
NBM and hence FlockS . Finally, by repeated application of these arguments,
it is established that a state x satisfying Terminal is reached. The order
in which NBM and Goal are satisfied depends on the initial conditions. If
System starts in a state satisfying Goal and ¬NBM, then obviously Goal is
satisfied first. However, if System starts in a state satisfying ¬Goal and
¬NBM, then it will always be the case that NBM is satisfied first.
The following descriptions of error dynamics are useful for later analysis:

e(x, i) ≜ |x.xi − x.xj − rf| where j = x.Li, if i ∈ Mids(x) ∪ T(x), and e(x, i) ≜ 0 otherwise;

eu(x, i) ≜ |x.ui − x.uj − rf| where j = x.Li, if i ∈ Mids(x) ∪ T(x), and eu(x, i) ≜ 0 otherwise.

Here e(x, i) gives the error, with respect to rf, between Agenti and its non-suspected left neighbor. The quantity eu(x, i) gives the same notion of error, but with respect to target positions x.ui rather than physical positions x.xi.
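To make the definition concrete, the spacing error can be sketched as follows; the flat list encoding (positions ordered with the head agent first, so that each agent's left neighbor sits at index i − 1) and the name r_f are illustrative assumptions, not the thesis data structures:

```python
# Illustrative sketch of the spacing error e(x, i). The flat list encoding
# (head agent at index 0, each agent's left neighbor at index i - 1) and
# the parameter name r_f are assumptions made for this example only.

def spacing_error(positions: list[float], i: int, r_f: float) -> float:
    """e(x, i): |x_i - x_{L_i} - r_f| for non-head agents, 0 for the head."""
    if i == 0:  # the head agent carries no left-neighbor spacing error
        return 0.0
    return abs(positions[i] - positions[i - 1] - r_f)
```

The same function applied to a list of target positions rather than physical positions plays the role of eu(x, i).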
The next lemma shows that if an agent is allowed to move in spite of
quantization, then it moves by at least a strictly positive constant vmin in
any round. This follows from Figure 5.4, Line 39.
Lemma 5.14. For any failure-free execution fragment α and for states xk and xk+1 of two adjacent rounds in α, for any i ∈ NF(xk) ∩ NF(xk+1), if |xk,T.ui − xk.xi| > β, then |xk+1.xi − xk.xi| ≥ vmin > 0.
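A minimal sketch of this movement rule, assuming a simple clamped-step policy with parameter names beta, v_min, and v_max standing in for β, vmin, and vmax (this is an illustration, not the Figure 5.4 pseudocode):

```python
# Illustrative sketch (not the thesis pseudocode) of one round of quantized
# movement per Lemma 5.14. The clamping policy and the parameter names
# beta, v_min, v_max are assumptions made for this example.

def quantized_step(x_i: float, u_i: float, beta: float,
                   v_min: float, v_max: float) -> float:
    """Return the agent's next position for one synchronous round."""
    error = u_i - x_i
    if abs(error) <= beta:
        return x_i  # quantization: errors within beta produce no motion
    # Otherwise move toward the target by at least v_min and at most v_max.
    step = min(max(abs(error), v_min), v_max)
    return x_i + (step if error > 0 else -step)
```

In any round in which the agent moves at all, it moves by at least v_min, which is the content of Lemma 5.14.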
Lemma 5.15 shows that when System is outside of states satisfying NBM,
the maximum error for all non-failed agents’ target positions ui and their
position in a state satisfying NBM is non-increasing. It also shows that
the maximum error for all non-failed agents’ positions xi and the goal is
non-increasing. Finally it shows that the maximum error for all non-failed
agents’ positions in adjacent rounds is non-increasing.
Lemma 5.15. For any failure-free execution fragment α, for any state x ∈ α, if x ∉ NBM, then

max_{i∈NF(xQ)} eu(xQ, i) ≤ max_{i∈NF(x)} eu(x, i).

Finally, if x and x′ are in α such that x →a x′, ∀a ∈ A, then

max_{i∈NF(x′)} e(x′, i) ≤ max_{i∈NF(x)} e(x, i).
Proof: Target and Quant are the only subroutines of updatei to modify ui. Now max_{i∈NF(xT)} eu(xT, i) ≤ max_{i∈NF(x)} eu(x, i), which follows from eu(xT, i) being computed as convex combinations of positions from x:

i = H(xT) ⇒ eu(xT, i) = 0,
i = xT.RH(xT) ⇒ eu(xT, i) = eu(x, x.Ri)/2,
i ∈ RMids(xT) ⇒ eu(xT, i) = (eu(x, x.Li) + eu(x, x.Ri))/2,
i = T(xT) ⇒ eu(xT, i) = (eu(x, x.Li) + eu(x, i))/2.

Finally, Quant sets xQ.ui = xT.ui or xQ.ui = xT.xi. In the first case, the result follows by the above reasoning. In the other case, if ui and uLi are each quantized, then eu(x, i) does not change for any i and the result follows. If, however, ui is quantized and uLi is not quantized, then eu(xT, i) is computed as

i = H(xT) ⇒ eu(xT, i) = 0,
i = xT.RH(xT) ⇒ eu(xT, i) = eu(x, i),
i ∈ RMids(xT) ⇒ eu(xT, i) = (eu(x, x.Ri) + eu(x, i))/2,
i = T(xT) ⇒ eu(xT, i) = (eu(x, i) + eu(x, x.Li))/2,

and if ui is not quantized and uLi is quantized, then

i = H(xT) ⇒ eu(xT, i) = 0,
i = xT.RH(xT) ⇒ eu(xT, i) = (eu(x, i) + eu(x, x.Ri))/2,
i ∈ RMids(xT) ⇒ eu(xT, i) = (eu(x, x.Li) + eu(x, i))/2,
i = T(xT) ⇒ eu(xT, i) = eu(x, i)/2.

Finally, applying Lemma 5.10 indicates that the error between actual positions, not target positions, is non-increasing.
Define a candidate function over the errors of all non-failed agents as

V(x) ≜ Σ_{i∈NF(x)} e(x, i).

Note the similarity of this candidate with the one found in [101]. In particular, it is not quadratic, being a sum of absolute values of the agents' spacing errors. Thus for a state x, if for some i, e(x, i) > 0, then V(x) > 0. Define the maximum value of the candidate function obtained, for any execution α, over any state x ∈ α satisfying NBM as

γ ≜ sup_{x∈α : x∈NBM} V(x).
The next lemma shows that sets of states satisfying NBM are invariant,
that a state satisfying NBM is reached, and gives a bound on the number
of rounds required to reach a state satisfying NBM.
and we show ΔV(xk, xk+1) ≤ ψ for some ψ < 0. Observe that −vmin ≥ ΔV(xk, xk+1) ≥ −vmax, and since vmax ≥ vmin > 0, let ψ = −vmin. Therefore an update transition xk → xk+1 causes V to decrease from V(xk) by at least the positive constant vmin. By repeated application of this reasoning, ∃c, k < c ≤ k + (V(xk) − γ)/vmin, such that xc ∈ NBM and V(xc) ≤ γ.
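The decrease argument can be checked on a toy chain. This sketch of V and of the round bound reuses a flat list of positions ordered head first and treats every agent as non-failed; both are simplifying assumptions made for illustration:

```python
import math

# Illustrative sketch: the candidate V(x) as the sum of spacing errors, and
# the round bound from the proof above. The flat list encoding (head agent
# first) and treating every agent as non-failed are assumptions.

def candidate_v(positions: list[float], r_f: float) -> float:
    """V(x): sum of |x_i - x_{L_i} - r_f| over the non-head agents."""
    return sum(abs(positions[i] - positions[i - 1] - r_f)
               for i in range(1, len(positions)))

def rounds_to_nbm(v_now: float, gamma: float, v_min: float) -> int:
    """Upper bound on additional rounds before a state in NBM is reached,
    from the fact that each update decreases V by at least v_min."""
    return max(0, math.ceil((v_now - gamma) / v_min))
```

Since each round outside NBM shaves at least v_min off the candidate, the bound is the remaining excess V(x) − γ divided by the per-round decrease.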
Lemma 5.16 states a bound on the time it takes for System to reach the set of states satisfying NBM. However, to satisfy FlockS(x), all x ∈ NBM must be inside the set of states that satisfy FlockS. If FlockS(x), then V(x) = Σ_{i∈NF(x)} e(x, i) ≤ δ(N − 1)/2. From any state x that does not satisfy FlockS(x), there exists an agent that will compute a control satisfying the quantization constraint and hence make a move towards NBM. Thus, to satisfy FlockS, it is required that γ ≤ δ(N − 1)/2, in which case every x ∈ NBM is such that FlockS(x) is satisfied, or equivalently, NBM ⊆ FlockS. This allows a derivation of a bound on the quantization parameter β.
The following corollary follows from Lemma 5.16, as the only time at which FlockS(x) is not satisfied after becoming satisfied is when the head agent moves, in which case x′.xH(x′) < x.xH(x), which causes V(x′) ≥ V(x).

Corollary 5.17. For any execution α and any x ∈ α, if FlockS(x), x →a x′ for any a ∈ A, and x.xH(x) = x′.xH(x′), then FlockS(x′).
Lemma 5.18 shows that once a weak flock is formed, it is invariant. This establishes that for any reachable state x′, if V(x′) > V(x), then V(x′) < δ(N − 1).

Case 1. The system satisfies FlockW(x) ∧ ¬FlockS(x); then FlockW(x′) holds by application of Lemma 5.16, since x′.xH(x′) = x.xH(x) by Figure 5.4, Line 33.

Case 2. The system satisfies FlockW(x) ∧ FlockS(x), so upon termination of the global snapshot algorithm by Assumption 5.4, if x.xH(x) ≠ x.xg, then H(x) computes x′.uH(x′) < x.uH(x) and applies this target by Figure 5.4, Line 32, and we show FlockS(x) ⇒ FlockW(x′). If x.xH(x) ∈ [0, β] such that the predicate on Line 36 is satisfied, then x′.xH(x′) = x.xH(x) and the proof is complete. If not, then by the definition of x′.uH(x′) in
Figure 5.4, Line 32, H(x) will compute a target no more than δ/2 to the left, so x.uH(x) − x′.uH(x′) ≤ δ/2. Now, for Agenti to have moved, the error between the distance of H(x) and x.RH(x) and the flocking distance must have been at most δ/2, by the definition of FlockS. AgentRH(x) will have moved to the center of H(x) and RRH(x), so x′.uRH(x′) may be less than, equal to, or greater than its previous position x.xRH(x), requiring a case analysis of these three possibilities. In the first two cases x′.uRH(x′) ≤ x.xRH(x) and the proof is complete. The other case follows by applying Lemma 5.10 to H(x) and x.RH(x) and observing that the most they would ever move apart by is 2β ≤ δ/2, and they are now separated by at most δ; hence FlockW(x′) is satisfied.
(i) aj+1 ≤ aj

Proof: First, note that by assumption, aj+1 is bounded from above by aj (i.e., by a1). Now assume, for the purpose of contradiction, that there exists a pair (ap, bp), where ap > c5 and bp > c4, such that ∀f ≥ p, (af, bf) = (ap, bp). Then we show there exists a q > p such that (ap, bp) ≠ (aq, bq), where aq < ap and bq < bp.

Without loss of generality, assume that bp > c2 initially. Now, starting from (ap, bp), the next step in the sequence is such that bp+1 ≤ bp − c1, since it must be the case that ap = ap+1, as we assumed bp > c2. This process of bj decreasing continues in the form bn ≤ bp − nc1, where n is the step at which bn ≤ c2; thus bn ≤ bp − nc1 ≤ c2 and n ≥ (bp − c2)/c1. At the next step from n, that is, n + 1, it must be the case that an+1 ≤ max{0, an − c3}, since bn ≤ c2 and an = ap > c5. Since an+1 < an, it is the case that bn+1 ≤ c6 − c2 = c6 − nc1. Note that it would seem to remain to be established that bn > c4 so that the decrease of bn+1 could occur, but if it is in fact the case that bn ≤ c4, then bp ∈ B as desired. Therefore, q = n + 1 > p and, since (ap, bp) becomes smaller at a larger step in the sequence, we reach the contradiction. By repeatedly applying the previous arguments, existence of such a t is established.
(iv) V(xt) ≤ (N − 1)δ/2

Proof: The proof follows from Lemma 5.19, by the analysis above, instantiating

(i) c1 = vmin,
(ii) c2 = (N − 1)δ/2,
(iii) c3 = δ/2,
(iv) c4 = γ,
(v) c5 = β, and
(vi) c6 = (N − 1)δ.
The following theorem states that System achieves the desired prop-
erties in forming a flock at the goal (the origin) within a specified time,
and follows by Theorem 5.20, Assumption 5.4, and Lemma 5.16. The convergence time would be exact were it not for the O(N) rounds required for the snapshot algorithm to terminate (Assumption 5.4).
rounds from x0 in α where x0 is the first state in α, then Terminal(xt ) and FlockS (xt ).
Lemma 5.22. For any execution which may reach a terminal state, consider a terminal state x ∈ Terminal, and assume F(x) = ∅. Now consider two infinite execution fragments α and α′ starting from x, and assume α′ = faili(0).α, for any i ∈ ID. For any state x ∈ α and any state x′ ∈ α′, for all i ∈ ID, x.xi = x′.xi and x.ui = x′.ui.
show a lower bound on the detection time for all faili(v) actions that could cause safety or progress violations. The following lower bound applies for executions beginning from states that do not a priori satisfy NBM. Informally, it says that a failed agent mimicked the actions of its
correct non-faulty behavior in such a way that despite the failure, System
still progressed to NBM as was intended.
More specifically, it assumes that the head agent is not at the goal—so
Goal is not satisfied—and that the head agent has failed with zero velocity.
It takes O(N) rounds to reach states which satisfy NBM, and these states also
satisfy FlockS . The head agent detects the strong flocking stable predicate
through the global snapshot algorithm in O(N) rounds and computes a
target towards the goal. However, since the head agent has failed with
zero velocity, it cannot make this movement, so a neighbor of the head
agent detects that the head agent has failed. Thereby the fact that faili (v)
occurred was undetected until O(N) rounds had passed.
Proof: Given that ∃i such that x.Suspectedi ≠ ∅, the predicate at Figure 5.4, Line 7 was satisfied at some round ks in the past. That is, at ks, some j was added to x.Suspectedi. Fix such a j. Let xs correspond to the state at round ks and xs′ be the subsequent state in the execution. At the round prior to ks, there are two cases based on the computation of uj in Figure 5.4, Line 36, for some j ∉ xks−1.Suspectedi. In the first case, the predicate was not satisfied in Figure 5.4, Line 36, so Agentj applies a velocity in the direction of sgn(uj − xj). Thus sgn(xs′.xj − xs.xj) = sgn(xs.uj − xs.xj) should have been observed, but instead it was observed that Agentj performed a move such that sgn(xs′.xj − xs.xj) ≠ sgn(xs.uj − xs.xj) with xs′.xj − xs.xj ≠ 0. This implies that xs′.failedj = true, since the only way such a move with xs′.xj − xs.xj ≠ 0 can occur is if, for xs.vfj ≠ 0, xs′.xj = xs.xj + xs.vfj.
The next lemma describes a partial completeness property [3]: after a failure has occurred, some agent eventually suspects that a failure has occurred. This is partial completeness because it was already demonstrated in Lemma 5.22 that there exists a class of failures that can never be detected.
Lemma 5.25. For any failure-free execution fragment α, suppose that x is a state in α such that ∃j ∈ F(x) and ∃i ∈ ID such that j ∈ x.Nbrsi \ Suspectedi. Let i's state knowledge for j satisfy either (|x.xoi,j − x.uoi,j| ≤ β ∧ x.xi,j − x.uoi,j ≠ 0) or (|x.xoi,j − x.uoi,j| > β ∧ sgn(x.xi,j − x.xoi,j) ≠ sgn(x.uoi,j − x.xoi,j)). Then x →suspecti(j) x′.
Proof: Fix a failure-free execution fragment α. Note that there always exists an i ∈ ID that is a neighbor of the failed agent j, by the strong connectivity assumption. For the transition suspecti to be taken, the precondition at Figure 5.4, Line 7 must satisfy that j ∉ x.Suspectedi, and that either

(|x.xoi,j − x.uoi,j| ≤ β ∧ x.xi,j − x.uoi,j ≠ 0), or
(|x.xoi,j − x.uoi,j| > β ∧ sgn(x.xi,j − x.xoi,j) ≠ sgn(x.uoi,j − x.xoi,j)).

These are the two hypotheses of the lemma, and thus the result follows that the suspecti transition is enabled.
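The two disjuncts of the lemma's hypothesis amount to a simple local check by Agenti on its knowledge of Agentj. A sketch, where the argument names xo, uo, and x_obs stand in for x.xoi,j, x.uoi,j, and x.xi,j (an illustration only, not the Figure 5.4 precondition verbatim):

```python
# Illustrative sketch of the suspicion check from Lemma 5.25: agent i
# compares its observation of j's movement against j's last known target.
# Names xo (j's last known position), uo (j's last known target), x_obs
# (j's newly observed position), and beta are assumptions for this sketch.

def sign(v: float) -> int:
    return (v > 0) - (v < 0)

def should_suspect(xo: float, uo: float, x_obs: float, beta: float) -> bool:
    """True when i's knowledge of j satisfies either disjunct of the lemma."""
    if abs(xo - uo) <= beta:
        # Target within the quantization band: j should not have moved.
        return x_obs - uo != 0
    # Otherwise j should move toward its target; a wrong sign is suspect.
    return sign(x_obs - xo) != sign(uo - xo)
```

The first branch catches an agent that moved when quantization forbade it; the second catches an agent moving opposite to its computed target.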
Corollary 5.26. For all non-terminal states, the detection time is O(N). That is,
the occurrence of any faili (v) transition is suspected within O(N) rounds.
The following corollary states that eventually all non-faulty agents know
the set of all failed agents, and follows from Lemma 5.24 and Corollary 5.26,
and that agents share suspected sets in Figure 5.4, Line 23.
Corollary 5.27. For all executions α of System, for any state x ∈ α, there exists
an element xs in α such that ∀i ∈ NF(xs ), xs .Suspectedi = F(x).
Upon detecting that nearby agents have failed, Agenti may need to move to an adjacent lane to maintain the safety and eventual progress properties. For instance, if Agentj has failed, x.xi > x.xj, and x.vj = 0, then to make progress toward the goal 0 < xj, Agenti must somehow get past Agentj. This motivates a mitigation action of moving to a different lane until either i has passed j (if x.vj = 0) or j has passed i (if x.vj > 0). For this passing to occur, the mitigating agent must also change its belief about which neighbor to compute its target from in the Target subroutine of the update transition, motivating the need for Li and Ri to also change.
Roughly, if at state x, s is a failed agent and s is suspected by i = R(x, s),
then L(x, i) must yield Agents's left neighbor, L(x, s). This is always possible given the assumption that failures do not cause a partitioning of the
communications graph.
With no further assumptions on when agents fail and in which directions,
up to f ≤ NL − 1 failures may occur, with at most one in each lane. This
ensures there is a failure-free lane that can always be used to mitigate
failures. However, up to f ≤ N − 1 failures may occur so long as no failure
occurs within O(N) time in a single lane and there are NL ≥ 2 lanes, which
follows by Lemma 5.25 and is formalized in the next lemma, which states
that if i is failed, then within O(N) rounds, no other agent believes i is its
left or right neighbor. This lemma is sufficient to prove convergence to
terminal states.
Lemma 5.28. If at a reachable state x, x.failedi, then for a state x′ reachable from x after O(N) rounds, ∀j ∈ ID, x′.Lj ≠ i ∧ x′.Rj ≠ i.
The previous lemma ensures progress with at most one failure in each of f ≤ NL − 1 lanes. By Lemma 5.28, within O(N²) time no agent i ∈ NF(x) believes any j ∈ F(x) is its Li or Ri, and thereby any failed agents diverge safely along their individual lanes if x.vj > 0 by Lemma 5.8, and i converges to states that satisfy NBM by Theorem 5.20. This shows that System is self-stabilizing when combined with a failure detector.
Alternatively, a topological requirement can be made to allow more
frequent occurrence of failures. In particular, restrict the set of executions
to those containing configurations in which there is always sufficiently
large free spacing for mitigation in some lane which is formalized below
in Invariant 5.29.
After Agentj has been suspected by its neighbors, that is, j ∈ x.Suspectedi for all i where j ∈ Nbrs(x, i), the Mitigation subroutine of the update transition shows that they will move to some free lane at the next round. This shows that mitigation takes at most one additional round after detection, since we have assumed there is always free space on some lane, which is thus safe to move onto. This implies that, so long as a failed agent is detected prior to safety being violated, only one additional round is required to mitigate, so the time of mitigation is a constant factor added to the time to suspect, resulting in an O(N) time to both suspect and mitigate. Note that since there is a single collection of agents (the communications graph is strongly connected), the only time when an agent needs to change its left or right neighbor is upon determining that its left or right neighbor has in fact failed.
5.5 Conclusion
This case study demonstrated a DCPS which, when combined with a failure detector, satisfies a self-stabilization property. In particular, it demonstrated safety without failures, a reduced form of safety when a single failure occurs, and eventual arrival at a destination as a strong flock along failure-free executions. Without the failure detector, the system would not be able to maintain safety, as agents could collide, nor make progress to states satisfying flocking or the destination, since failed agents may diverge, causing their neighbors to follow and diverge as well. Thus it presented the development of a fault-tolerant DCPS from a fault-intolerant one.
CHAPTER 6
CONCLUSION
these cases.
An investigation of a constant-time algorithm for failure detection is also
interesting, where agents occasionally apply a special motion and agents
which do not follow this special coordinated motion are deemed faulty.
This is conceptually similar to the motion probes used in [61].
Failure Classes. This thesis investigated through case studies two types
of failures, a cyber failure of a computer crash and an actuator stuck-at
failure. There are many other failures which could be considered in these
case studies or other DCPS, such as those enumerated in Chapter 3.
Modeling DCPS. As mentioned, the dynamics for each of the case stud-
ies were simple, and thus the modeling formalism was able to rely only on
discrete transitions. To model more complicated dynamics, as well as mes-
sage passing in partially synchronous timing models, a more expressive
formalism is necessary, and we would consider the use of timed input/output automata (TIOA) [14] or hybrid input/output automata (HIOA) [17].
The work in [102] may provide a route for converting some of the results
presented here to a partially synchronous timing model with message
passing.
While this thesis investigated fault-tolerance of DCPS—in the form of the
systems satisfying an invariant safety property and an eventual progress
property—it would be interesting to investigate a provably optimal bound, or a lower bound, on the time required to return to states which may make
progress. Interesting impossibility results regarding when a system may
not tolerate faults might arise in the partially synchronous or asynchronous
timing models.
gorithm. Similarly in the safe flocking problem, a self-stabilizing DCPS im-
plemented in a simulation over a network may rely on the composition of
(a) a self-stabilizing clock synchronization algorithm, (b) a self-stabilizing
leader election algorithm to decide on the head agent, (c) a self-stabilizing
distributed snapshot algorithm for strong flock detection, and (d) a self-
stabilizing failure detector.
Finally, an interesting case study would be how self-stabilizing algo-
rithms could be combined with supervisory controllers, like in the inverted
pendulum of [103].
6.2 Conclusions
Overall, this thesis took a first step in constructing a theory of fault-
tolerance for DCPS. A general model of DCPS and a definition for such
systems to be fault-tolerant were introduced. Furthermore, it introduced
a general method for establishing whether a given DCPS is fault-tolerant
through the use of self-stabilization. If the DCPS was found not to be fault-
tolerant, it was shown that through the construction of a failure detector,
the DCPS could be converted into a fault-tolerant system.
The general model was then instantiated for two specific DCPS and
their fault-tolerant properties were investigated. In the distributed traffic
control problem in Chapter 4 (and [62]), the system was shown to be
fault-tolerant. Specifically it presented a DCPS in which it is possible for
physical safety to be maintained in spite of arbitrary crash failures of the
software controlling the agents. However, progress cannot be preserved, but
due to the self-stabilizing nature of the algorithm, the DCPS was shown to
automatically return to states from which progress can be made.
In the distributed flocking problem in Chapter 5, the system was shown
to require a failure detector to satisfy fault-tolerance. Specifically a failure
detector was constructed which eventually suspects agents which have
failed with actuator stuck-at faults, when it is possible to suspect such
faults.
In each case study, an invariant safety property and an eventual progress property were established: the invariant safety properties specified that certain bad states were never reached, and the eventual progress properties ensured that the problem specifications were eventually satisfied, all in spite of failures. The importance of the work
will be realized as the proliferation of sensors, actuators, networking, and
computing results in the creation of DCPS like mobile robot swarms, the
future electric grid, the automated highway system, and other systems
which make strong use of combining distributed computation with phys-
ical processes.
REFERENCES
[11] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simula-
tions, and Advanced Topics. John Wiley and Sons, Inc., 2004.
[12] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ:
Prentice Hall, 2002.
[23] C. N. Hadjicostis, “Non-concurrent error detection and correction in fault-tolerant discrete-time LTI dynamic systems,” IEEE Trans. Autom. Control, vol. 48, pp. 2133–2140, 2002.
[36] T. Henzinger, B. Horowitz, and C. Kirsch, “Giotto: a time-triggered
language for embedded programming,” Proceedings of the IEEE,
vol. 91, no. 1, pp. 84–99, Jan. 2003.
[40] F. Bonnet and M. Raynal, “Looking for the weakest failure detector
for k-set agreement in message-passing systems: Is Πk the end of
the road?” in SSS ’09: Proceedings of the 11th International Symposium
on Stabilization, Safety, and Security of Distributed Systems. Berlin,
Heidelberg: Springer-Verlag, 2009, pp. 149–164.
[46] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and
D. Teneketzis, “Failure diagnosis using discrete-event models,” IEEE
Trans. Control Syst. Technol., vol. 4, no. 2, pp. 105–124, Mar. 1996.
[58] Y. Afek, S. Kutten, and M. Yung, “The local detection paradigm and
its applications to self-stabilization,” Theor. Comput. Sci., vol. 186, no.
1-2, pp. 199–229, Oct. 1997.
[64] D. Helbing and M. Treiber, “Jams, waves, and clusters,” Science, vol.
282, pp. 2001–2003, Dec. 1998.
[70] N. Leveson, M. de Villepin, J. Srinivasan, M. Daouk, N. Neogi,
E. Bachelder, J. Bellingham, N. Pilon, and G. Flynn, “A safety and
human-centered approach to developing new air traffic management
tools,” in Proceedings Fourth USA/Europe Air Traffic Management R&D
Seminar, Dec. 2001, pp. 1–14.
[71] C. Livadas, J. Lygeros, and N. A. Lynch, “High-level modeling and
analysis of TCAS,” in Proceedings of the 20th IEEE Real-Time Systems
Symposium (RTSS’99), Dec. 1999, pp. 115–125.
[72] J. Misener, R. Sengupta, and H. Krishnan, “Cooperative collision
warning: Enabling crash avoidance with wireless technology,” in
12th World Congress on Intelligent Transportation Systems, 2005, pp.
1–11.
[73] A. Girard, J. de Sousa, J. Misener, and J. Hedrick, “A control architec-
ture for integrated cooperative cruise control and collision warning
systems,” in Decision and Control, 2001. Proceedings of the 40th IEEE
Conference on, vol. 2, 2001, pp. 1491–1496.
[74] C. Tomlin, G. Pappas, and S. Sastry, “Conflict resolution of air traffic
management: A study in multi-agent hybrid systems,” IEEE Trans.
Autom. Control, vol. 43, pp. 509–521, 1998.
[75] C. Muñoz, V. Carreño, and G. Dowek, “Formal analysis of the op-
erational concept for the Small Aircraft Transportation System,” in
Rigorous Engineering of Fault-Tolerant Systems, ser. LNCS, vol. 4157,
2006, pp. 306–325.
[76] D. Swaroop and J. K. Hedrick, “Constant spacing strategies for pla-
tooning in automated highway systems,” Journal of Dynamic Systems,
Measurement, and Control, vol. 121, pp. 462–470, 1999.
[77] E. Dolginova and N. Lynch, “Safety verification for automated pla-
toon maneuvers: A case study,” in HART’97 (International Workshop
on Hybrid and Real-Time Systems), ser. LNCS, vol. 1201. Springer
Verlag, March 1997.
[78] P. Varaiya, “Smart cars on smart roads: Problems of control,” IEEE
Trans. Autom. Control, vol. 38, pp. 195–207, 1993.
[79] H. Kowshik, D. Caveney, and P. R. Kumar, “Safety and liveness in
intelligent intersections,” in Hybrid Systems: Computation and Control
(HSCC), 11th International Workshop, ser. LNCS, vol. 4981, Apr. 2008,
pp. 301–315.
[80] P. Weiss, “Stop-and-go science,” Science News, vol. 156, no. 1, pp.
8–10, July 1999.
[81] Kornylak, “Omniwheel brochure,” Hamilton, Ohio, 2008. [Online].
Available: http://www.kornylak.com/images/pdf/omni-wheel.pdf.
[82] S. Gilbert, N. Lynch, S. Mitra, and T. Nolte, “Self-stabilizing robot for-
mations over unreliable networks,” ACM Trans. Auton. Adapt. Syst.,
vol. 4, no. 3, pp. 1–29, July 2009.
[83] S. Dolev, L. Lahiani, S. Gilbert, N. Lynch, and T. Nolte, “Virtual
stationary automata for mobile networks,” in PODC ’05: Proceedings
of the twenty-fourth annual ACM symposium on Principles of distributed
computing. New York, NY, USA: ACM, 2005, pp. 323–323.
[84] T. Nolte and N. Lynch, “A virtual node-based tracking algorithm
for mobile networks,” in Distributed Computing Systems, International
Conference on (ICDCS). Los Alamitos, CA, USA: IEEE Computer
Society, 2007, pp. 1–9.
[85] S. Owre, S. Rajan, J. Rushby, N. Shankar, and M. Srivas, “PVS:
Combining specification, proof checking, and model checking,” in
Computer-Aided Verification, CAV ’96, ser. LNCS, R. Alur and T. A.
Henzinger, Eds., no. 1102. New Brunswick, NJ: Springer-Verlag,
July/August 1996, pp. 411–414.
[86] R. Olfati-Saber, “Flocking for multi-agent dynamic systems: algo-
rithms and theory,” IEEE Trans. Autom. Control, vol. 51, no. 3, pp.
401–420, Mar. 2006.
[87] J. Fax and R. Murray, “Information flow and cooperative control of
vehicle formations,” IEEE Trans. Autom. Control, vol. 49, no. 9, pp.
1465–1476, Sep. 2004.
[88] A. Jadbabaie, J. Lin, and A. Morse, “Coordination of groups of mo-
bile autonomous agents using nearest neighbor rules,” IEEE Trans.
Autom. Control, vol. 48, no. 6, pp. 988–1001, June 2003.
[89] H. Tanner, A. Jadbabaie, and G. Pappas, “Stable flocking of mobile
agents, Part I: Fixed topology,” in 42nd IEEE Conference on Decision
and Control, 2003. Proceedings, vol. 2, 2003.
[90] V. Gazi and K. M. Passino, “Stability of a one-dimensional discrete-
time asynchronous swarm,” IEEE Trans. Syst., Man, Cybern. B, vol. 35,
no. 4, pp. 834–841, Aug. 2005.
[91] C. Dwork, N. Lynch, and L. Stockmeyer, “Consensus in the presence
of partial synchrony,” J. ACM, vol. 35, no. 2, pp. 288–323, 1988.
[92] M. J. Fischer, N. A. Lynch, and M. S. Paterson, “Impossibility of
distributed consensus with one faulty process,” J. ACM, vol. 32,
no. 2, pp. 374–382, 1985.
[93] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous
deterministic and stochastic gradient optimization algorithms,”
IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.
[103] D. Seto and L. Sha, “A case study on analytical analysis of the in-
verted pendulum real-time control system,” Carnegie Mellon Uni-
versity, Pittsburgh, PA, CMU/SEI Tech. Rep. 99-TR-023, Nov. 1999.