
FAULT-TOLERANT DISTRIBUTED CYBER-PHYSICAL SYSTEMS:

TWO CASE STUDIES

BY

TAYLOR T. JOHNSON

THESIS

Submitted in partial fulfillment of the requirements


for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2010

Urbana, Illinois

Adviser:

Assistant Professor Sayan Mitra


ABSTRACT

Fault-tolerance in distributed computing systems has been investigated


extensively in the literature and has a rich history and detailed theory.
This thesis studies fault-tolerance for distributed cyber-physical systems
(DCPS), where distributed computation is combined with dynamics of
physical processes. Due to their interaction with the physical world, DCPS
may suffer from failures that are qualitatively different from the types of
failures studied in distributed computing. Failures of the components of
DCPS which interact with the physical processes—such as actuators and
sensors—must be considered. Failures in the cyber domain may interact
with failures of sensors and actuators in adverse ways.
This thesis takes a first step in analyzing fault-tolerance in DCPS through
the presentation of two case studies. In each case study, the DCPS are mod-
eled as distributed algorithms executed by a set of agents, where each agent
acts independently based on information obtained from its communica-
tion neighbors and agents may suffer from various failures. The first case
study is a distributed traffic control problem, where agents control regions
of roadway to move vehicles toward a destination, in spite of some agents’
computers crashing permanently. The second case study is a distributed
flocking problem, where agents form a flock, or a roughly equally spaced
distribution in one dimension, and move towards a destination, in spite of
some agents’ actuators becoming stuck at some value.
Each algorithm incorporates self-stabilization in order to solve the prob-
lem in spite of failures. The traffic algorithm uses a local signaling mecha-
nism to guarantee safety and a self-stabilizing routing protocol to guarantee
progress. The flocking algorithm uses a failure detector combined with an
additional control strategy to ensure safety and progress.

ACKNOWLEDGMENTS

First and foremost, I thank my adviser, Sayan Mitra, for countless hours
spent solving problems with me, teaching me what research is and is not,
helping me to improve my research skills, and being a superb and always
supportive mentor. Without his advice and help this thesis would not have
been realized.
With equal importance, I thank my family—especially Mom, Dad, and
Brock—for without them I would not be here. I also especially thank my
cousin Tommy Hoherd for his support, which has helped to make graduate
school a reality for me.
I would like to thank all teachers everywhere, but specifically the ones
who have taught me, particularly Mark Capps, Paul Hester, Yih-Chun
Hu, Viraj Kumar, Daniel Liberzon, Pat Nelson, Lui Sha, James Taylor,
Nitin Vaidya, Jan Bigbee Weesner, and Geoff Winningham. I give special
thanks to undergraduate advisers who encouraged me to pursue graduate
studies, including Brent Houchens, Fathi Ghorbel, Dung Nguyen, and
Karthik Mohanram, as well as my advisers from Schlumberger, Albert
Hoefel and Peter Swinburne.
Without friends to let loose and relax with on occasion, life would
be boring, so I acknowledge Alan Gostin, Brian Proulx, Daniel Rein-
hardt, Emily Williams, Frank Havlak, John Stafford, Josh Langsfeld, Navid
Aghasadeghi, Paul Rancuret, Rakesh Kumar, Sarah Lohman, Stanley Bak,
among many others, especially my friends and fellow lab mates, Bere
Carrasco, Karthik Manamcheri, and Sridhar Duggirala. I acknowledge a
recently acquired friend, Ellen Prince, for providing both motivation and
distraction in the final phase of the thesis.
Lastly, I acknowledge anyone else who I interact with that I may have
made the unfortunate mistake of forgetting to mention.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . 1
1.1 Failures in DCPS . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Modeling Techniques . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Approach for Achieving Fault-Tolerance in DCPS . . . . . . 4
1.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Key Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 8

CHAPTER 2 MODELING FAILURES AND FAULT-TOLERANCE


IN DCPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Modeling DCPS as Composed Discrete Transition System . . 11
2.2.1 Executions and Properties of SSDCPS . . . . . . . . . 13
2.3 Failure Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Failure-Free Executions . . . . . . . . . . . . . . . . . 15
2.3.2 Self-Stabilizing DCPS . . . . . . . . . . . . . . . . . . 16
2.3.3 Failure Detector Model . . . . . . . . . . . . . . . . . . 16
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

CHAPTER 3 RELATED WORK . . . . . . . . . . . . . . . . . . . . . 20


3.1 Failure Classes and Models . . . . . . . . . . . . . . . . . . . 20
3.1.1 Failure Occurrence . . . . . . . . . . . . . . . . . . . . 21
3.2 Methods for Ensuring Fault-Tolerance . . . . . . . . . . . . . 22
3.2.1 Failure Detectors . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Self-Stabilization . . . . . . . . . . . . . . . . . . . . . 25
3.2.3 Stabilizers . . . . . . . . . . . . . . . . . . . . . . . . . 26

CHAPTER 4 DISTRIBUTED CELLULAR FLOWS . . . . . . . . . . 28


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Overview of Distributed Cellular Traffic Control . . . 31
4.2.2 Formal System Model . . . . . . . . . . . . . . . . . . 32
4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Safety Analysis . . . . . . . . . . . . . . . . . . . . . . 39
4.3.2 Stabilization of Routing . . . . . . . . . . . . . . . . . 42

4.3.3 Progress of Entities Towards the Target . . . . . . . . 44
4.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

CHAPTER 5 SAFE FLOCKING ON LANES . . . . . . . . . . . . . . 51


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.1 Overview of the Problem . . . . . . . . . . . . . . . . 51
5.1.2 Literature on Flocking and Consensus in Dis-
tributed Computing and Controls . . . . . . . . . . . 53
5.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.1 Overview of Distributed Flocking . . . . . . . . . . . 54
5.2.2 Formal System Model . . . . . . . . . . . . . . . . . . 56
5.2.3 Model as a Discrete-Time Switched Linear System . . 64
5.3 Safety and Progress Properties . . . . . . . . . . . . . . . . . 65
5.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.2 Basic Analysis . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.3 Basic Failure Analysis . . . . . . . . . . . . . . . . . . 71
5.4.4 Safety in Spite of a Single Failure . . . . . . . . . . . . 72
5.4.5 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.6 Failure Detection . . . . . . . . . . . . . . . . . . . . . 83
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . 90
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

CHAPTER 1

INTRODUCTION

One of the principal benefits of distributed computing systems is their ability
to tolerate failures of some of the components that compose the system,
since there is no single point of failure. However, it is challenging to design
and analyze distributed systems which operate correctly in spite of failures.
A variety of methods exist to ensure fault-tolerance in distributed comput-
ing systems, such as rollback-recovery [1], replicated state-machines [2],
failure detectors [3], and self-stabilization [4].
Distributed cyber-physical systems (DCPS) are distributed computing sys-
tems which interact with their physical environment through sensors and
actuators. Specifically a DCPS is a system in which a collection of individ-
ual computers interact with one another through some form of commu-
nication, and where each of these computers interacts with the physical
world.
Examples of DCPS include: (a) mobile robots or unmanned aerial vehi-
cles (UAVs) performing search and rescue tasks [5], (b) the automated high-
way systems (AHS) based on vehicular networks [6], (c) the future electric
grid or Smart Grid with distributed generation and decision units through-
out the grid [7], (d) power and thermal management of computers in data
centers through dynamic voltage scaling (DVS) [8], dynamic frequency
scaling, or on/off policies, and (e) wireless sensor networks (WSN) [9] or
wireless sensor and actor networks (WSAN) [10].
The combination of the individual computers, sensors, and actuators in
a DCPS are referred to as agents. See Figure 1.1 for a typical architecture of
a DCPS. The individual agents of a DCPS interact with one another to solve
some task. Coordination is no longer limited to communication; it may
also rely on the agents' physical state.

1.1 Failures in DCPS
A fundamental issue in the design of distributed computing systems is to
ensure reliable operation in spite of being composed of unreliable compo-
nents. Similarly, designs for reliable DCPS must take into account failures
of all their components, which include the computers, software, and com-
munication channels as in distributed computing systems, and addition-
ally, sensors and actuators.
Even when considering only distributed computing systems, there are
broad classes of failures. A computer can fail, as when it crashes and never
makes another transition. Additionally, failures could occur somewhere
in the communication channel between computers, as a result of which
messages are lost, delivered out of order, or corrupted. When considering
DCPS, failures may also occur in sensors or actuators, such as an actuator
becoming stuck at some value forever. All of the previous distributed com-
puting failures may be applicable, as are failures of the agents’ components
which interact with the physical environment.
Broadly, as a result of this thesis, we believe that there are three classes
of failures based on the location of the failure:

(a) cyber failures: failures in the hardware or software of the agents’ com-
puters,

(b) physical failures: failures in the agents’ interfaces to the physical world,
such as sensors and actuators, and

(c) communication failures: failures in the channels through which the


agents communicate.

Failures from one of these classes may manifest as behavior in another
domain; for example, a cyber failure of a mobile robot may result in a
collision between the robot and an adjacent robot in the physical world.

1.2 Modeling Techniques


Several mathematical formalisms are used to analyze distributed comput-
ing systems under various models of communication. The computers are

Figure 1.1: Typical architecture of a DCPS composed of four agents, each


of which has components of a computer with some software processes
representing the interaction of cyber processes of the DCPS, and sensors
and actuators representing the interaction with physical processes of the
DCPS. These components are the ones which act on the cyber and
physical state.

modeled by some formalism, such as a finite-state machine, discrete tran-


sition system, or Turing machine [11]. The communication channels can be
modeled in numerous ways as well, such as by a first-in first-out (FIFO)
queue. In this thesis, a shared-memory model is used and its justification
for the DCPS considered is presented in Chapter 2.
A dynamical system is a mathematical formalism which describes the
evolution of state of the system over time by a fixed rule, such as a dif-
ferential or difference equation [12]. When computers are placed in an
environment in which they interact with the physical world and its contin-
uous quantities, such as in the DCPS considered in this thesis, issues
beyond modeling discrete transitions must be addressed.

Either

(a) the expressiveness of the modeling formalism must be expanded to


capture the interaction with the physical world and its continuous
quantities, or

(b) assumptions about the behavior of the agents within the physical en-
vironment must be made.

For instance, the first avenue of expanding the expressiveness of the model
may be accomplished through the use of timed automata [13], timed in-
put/output automata (TIOA) [14], hybrid automata [15, 16], or hybrid in-
put/output automata (HIOA) [17].
In this thesis however, the second route is more frequently traversed
and the continuous dynamics are abstracted in such a way that they may
be discussed as discrete transitions. The use of shared variables and syn-
chrony simplifies the analysis of the distributed computation, and discrete
abstractions of continuous behavior simplify the analysis of the dynamical
systems.

1.3 Approach for Achieving Fault-Tolerance in


DCPS
The theory of distributed systems provides not only impossibility results,
which establish theoretical limits on what problems can be solved under
what assumptions—for instance, what types of failures—but also algo-
rithms that describe how to solve a problem when it is possible. DCPS
would benefit from a similar theory, and this thesis takes a first step in this
direction by investigating different assumptions on failures.
Distributed computing provides a theory showing that both algorithms
and impossibility results can be viewed through failure detection, for
instance by showing when it is impossible for a failure detector to exist.
When failure detection is possible, a correction of the failure through a
reset of system state may be necessary to ensure that the problem specifi-
cation is eventually satisfied. Recovery from such failures is inherently
more complicated when physical state, and not only cyber

state are involved. For instance, with only software state, a standard recov-
ery procedure is to reset, but this has no reasonable analogy for physical
state.
The approach taken by this thesis is the following. A definition of
fault-tolerance for DCPS is given as stabilization in Chapter 2. Roughly,
without failures, a DCPS remains in a set of legal states that satisfy some
desired system property. Note that the set of legal states is the only set
from which progress can be made towards satisfying the desired system
property. However, upon failure events occurring, the DCPS may leave
the set of legal states and go into a set of illegal states.
Synchrony is assumed so that the actions of all the agents are composed
into a single discrete transition system with a synchronous update that
modifies the state of every agent in the system based on that agent's
local state and the states of adjacent agents. Failures are represented as
events which may modify the state of some agents. When failure events
stop occurring, and without any other event occurring aside from the
synchronous system update for all agents, the DCPS may or may not be
guaranteed to eventually return to the set of legal states. If it can be
guaranteed that the DCPS automatically returns to the set of legal states
without any event other than the synchronous update, then the DCPS is
said to be self-stabilizing. However, it may be necessary for the DCPS
to rely on a failure detector for the occurrence of a failure to be realized,
where the failure detector is modeled as an event. Upon detecting such
a failure, the DCPS may then perform some mitigation routine in the
synchronous update which allows it to return to the set of legal states. This
second case of converting a non-self-stabilizing DCPS to a self-stabilizing
DCPS is analogous to the use of a stabilizer—a failure detector and a
method for state reset to ensure eventual progress—for converting a non-
self-stabilizing algorithm to a self-stabilizing one [4]. Both of these cases
allow the DCPS to make progress towards satisfying the desired system
property. See Figure 1.2 for a graphical depiction.
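As a toy illustration of the round trip through the Legal and Illegal sets described above, consider the following Python sketch. The one-dimensional state, the legal set [0, 1], and all names are invented for illustration and are not taken from the thesis.

```python
# Hypothetical one-dimensional DCPS state with legal set [0, 1]; a failure
# event perturbs the state, and the synchronous update alone restores it.

LEGAL_LO, LEGAL_HI = 0.0, 1.0

def legal(s):
    return LEGAL_LO <= s <= LEGAL_HI

def update(s):
    # Synchronous update: move one unit toward the legal set each round,
    # so the system is self-stabilizing with respect to [0, 1].
    if s > LEGAL_HI:
        return max(LEGAL_HI, s - 1.0)
    if s < LEGAL_LO:
        return min(LEGAL_LO, s + 1.0)
    return s

s = 5.0  # a failure event has driven the state into the Illegal set
rounds = 0
while not legal(s):
    s = update(s)
    rounds += 1
print(rounds, s)  # → 4 1.0
```

Because recovery here requires no event other than update, this sketch corresponds to the self-stabilizing case; in the failure-detector case, the corrective branch of update would instead be enabled only after a detection event.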


Figure 1.2: Under normal operation the DCPS state remains in the set of
Legal states, but upon failures, the state of the DCPS may enter the set of
Illegal states. The SS labels indicate that self-stabilization can be achieved
from these states. The SS arrow on the left indicates that upon failure
events not occurring, the system automatically returns to Legal states.
The arrow on the right labeled DM indicates that a transition of the
failure detector has been taken and must occur prior to the DCPS
recovering to the set of Legal states from the set of Illegal states.

1.4 Case Studies


Upon establishing the general definitions and methods for fault-tolerance
in DCPS in Chapter 2, the thesis studies specific instances of fault-tolerance
of DCPS through two case studies in Chapters 4 and 5.
The first case study is the distributed traffic control problem and is an
example of fault-tolerance through self-stabilization without a failure de-
tector, that is, where the DCPS automatically returns to the set of legal
states. Advances in wireless vehicular networks present opportunities for
developing new distributed traffic control algorithms that avoid phenom-
ena such as abrupt phase transitions. The physical model is a partitioned

plane where the movements of all entities (vehicles) within each partition
(cell) are tightly coupled. Each of these cells is controlled by a computer.
A self-stabilizing algorithm, called a distributed traffic control protocol, is
presented which guarantees

(a) minimum separation between vehicles at all times, even when some
cells’ control software may fail permanently by crashing, and

(b) once failures cease, a route to the target cell stabilizes and the vehicles
with feasible paths to the target cell make progress towards it.

The algorithm relies on two general principles: temporary blocking for


maintenance of safety and a self-stabilizing geographical routing protocol
for guaranteeing progress.
The second case study is the distributed flocking problem and is an
example of fault-tolerance where self-stabilization is achieved through the
use of a failure detector, that is, where the DCPS relies on a failure detector
to return to the set of legal states. The physical model is a set of one-
dimensional real lines called lanes, which are representative of lanes of
roads, along which a group of mobile agents move. A distributed flocking
algorithm is presented which guarantees

(a) maintenance of safe separation of the agents’ physical positions,

(b) formation of a roughly equally spaced distribution of agents’ physical


positions, known as a flock, and

(c) traversal of the flock towards a destination.

However, some agents’ actuators may fail permanently and become stuck-
at a value, causing the failed agents to move forever according to this
value. Without the use of failure detection and mitigation, the algorithm
is fault-intolerant and critical system properties like avoiding collisions or
ensuring progress to the flock or goal may be violated. Thus, the algorithm
incorporates failure detection, when it is possible, for which the detection
time is on the same order as the number of rounds it takes for the agents to
reach the set of states which satisfy flocking. Then upon detecting failed
agents, non-failed agents migrate to adjacent lanes to avoid collisions and
to make progress towards the flock and destination.

1.5 Key Insights
The main contribution of this thesis is the general method of using self-
stabilization to ensure fault-tolerance of DCPS, and the general method
for converting non-self-stabilizing DCPS to self-stabilizing ones by use
of a failure detector. The thesis relies on two case studies which utilize
these techniques to ensure correct operation in spite of failures of agents’
components in the cyber and physical domains.
As a discussion point, in the distributed traffic control problem, a failure
detector is implicitly provided by agents no longer reporting their distances
to the target. While this problem does not require a failure detector to
ensure self-stabilization, it inherently has a method for detecting failures
by virtue of the synchronous update transition. Because the algorithm
used to locate the destination is self-stabilizing, returning to a state from
which vehicles can make progress to the destination occurs automatically.
In the distributed flocking problem, failure detection is explicitly pro-
vided by a failure detector. Then an additional mechanism is incorporated
by the synchronous update transition of the system, which allows all non-
faulty agents to (a) avoid collisions, (b) avoid falsely following agents
which are not moving towards states which satisfy the flocking condition,
and (c) avoid falsely following agents moving away from the destination.
These case studies show that it is possible to utilize stabilization-based
methods for achieving fault-tolerance in DCPS which are analogous to
those used for ensuring fault-tolerance in distributed computing systems.
Specifically, the distributed traffic control algorithm shows that it is possi-
ble to develop a self-stabilizing DCPS which automatically recovers from
failures. The distributed flocking case study shows that it is possible to
develop a self-stabilizing DCPS by combining a non-self-stabilizing DCPS
with a failure detector.

1.6 Thesis Organization


Chapter 2 introduces mathematical and modeling formalisms and termi-
nology. It also states assumptions on the systems being modeled and the
environments in which the systems reside. This includes general notions

of what it formally means for a system to exhibit fault-tolerance. Chap-
ter 3 presents a literature review primarily from the field of fault-tolerance
for distributed systems, but also briefly mentions fault tolerance from re-
lated fields. Chapter 4 presents the first case study, the distributed cellular
flows problem, in which a graph is given representing a network of roads
or waypoints, along which some physical entities such as vehicles travel.
Chapter 5 presents the second case study, the safe flocking problem on
lanes, in which a group of mobile agents form a roughly equally spaced
distribution and travel towards a destination without collision. Each of
the case studies utilizes fault-tolerance to ensure correct operation of the
DCPS in spite of failures. Chapter 6 presents future directions for work and
concludes the thesis.

CHAPTER 2

MODELING FAILURES AND


FAULT-TOLERANCE IN DCPS

This chapter presents mathematical preliminaries and a generic framework


for modeling a distributed cyber-physical system (DCPS) as a shared-state
distributed cyber-physical system (SSDCPS), which is a discrete transition
system (DTS) with some additional structure based on an assumption of
synchronous communications. Then it introduces a generic model of fail-
ures and fault-tolerance and provides an abstract method for achieving
fault-tolerance in DCPS by the use of self-stabilization. This chapter is
the result of reviewing fault-tolerance results from the literature in Chap-
ter 3 and generalizing and extending the results of the two case studies
presented in Chapters 4 and 5.

2.1 Preliminaries
The sets of natural, real, positive real, and nonnegative real numbers are
denoted by N, R, R+ , and R≥0 , respectively. For K ∈ N, [K] denotes the set
{0, . . . , K}. For a set K, let K⊥ ≜ K ∪ {⊥} and K∞ ≜ K ∪ {∞}.
A variable is a name with an associated type. For a variable x, its type is
denoted by type(x) and it is the set of values that it can take. A valuation
for a set of variables X, denoted by x, is a function that maps each x ∈ X
to a point in type(x). Given a valuation x for X, the valuation for a variable
v ∈ X, denoted by x.v, is the restriction of x to {v}. The set of all possible
valuations of X is denoted by val(X).

Example 2.1. For example, consider a DCPS of three mobile robots positioned on
the Euclidean plane. Each robot has an identifier i in the set {0, 1, 2}. A variable
for each robot could be the position pi of that robot, each of which has a type of R2 .
The set of variables X is {p0 , p1 , p2 }. The valuation x for X is a function that maps

each of p0 , p1 , and p2 to a point in R2 , that is, a function which maps the position
of each robot to a point in the plane.
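The definitions above can be made concrete in a short Python sketch. Representing valuations as dictionaries, and the helper name restrict, are illustrative choices for this sketch and are not part of the thesis's formalism.

```python
# Variables are names; a valuation maps each variable in X to a value of
# its type. Here type(p_i) = R^2 is represented as a pair of floats.
X = {"p0", "p1", "p2"}

# A valuation x for X, e.g. the three robot positions of Example 2.1.
x = {"p0": (0.0, 0.0), "p1": (1.0, 0.0), "p2": (0.5, 0.5)}

def restrict(x, v):
    # x.v: the restriction of the valuation x to the single variable v.
    return {v: x[v]}

print(restrict(x, "p1"))  # → {'p1': (1.0, 0.0)}
```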

2.2 Modeling DCPS as Composed Discrete


Transition System
The DCPS under consideration are composed of some finite number of
agents—such as the robots in the example in Section 2.1—each of which has
variables which correspond to some software and some physical state. The
agents have unique identifiers drawn from a set ID. Modeling all the agents
of the DCPS as a single discrete transition system requires an assumption
on the communication between agents. Computations are instantaneous
and communications are synchronous, so messages are delivered within
bounded time.
The DCPS is a finite collection of agents which execute and communi-
cate simultaneously, so the entire DCPS operates in synchronous rounds.
At each round, every agent exchanges messages with its communication
neighbors. Then, based on these messages, the agents update their soft-
ware state and decide on a rate-of-change for any continuous variable to
evolve over for the next round. That is, until the beginning of the next
round, all of the agents’ continuous variables continue to evolve according
to this rate-of-change, such as position evolving according to a velocity.
The DCPS can be represented as a single discrete transition system with a
transition that updates some physical and cyber states for all the agents.
These variables may represent cyber or physical state. See Figure 2.1 for
a graphical representation of the cyber and physical variables agents have
and how they are shared.
This has the following interpretation for a message-passing implemen-
tation. At the beginning of each round, each agent broadcasts messages.
Next each agent receives the messages sent by its neighbors, and finally
it computes its local variables based on its local state and the messages
collected from its neighbors.
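The round structure just described can be sketched as follows. The averaging rule and all names here are invented for illustration; they are not the algorithms studied later in the thesis.

```python
# One synchronous round on one-dimensional positions: each agent reads its
# neighbors' pre-round state, computes a rate-of-change, and then all
# continuous variables evolve simultaneously for the round's duration.

DT = 1.0  # round duration

def round_update(positions, neighbors):
    velocities = {}
    for i, pos in positions.items():
        # Read phase: collect neighbors' (pre-round) positions.
        nbr_pos = [positions[j] for j in neighbors[i]]
        # Compute phase: head toward the average neighbor position.
        if nbr_pos:
            target = sum(nbr_pos) / len(nbr_pos)
            velocities[i] = 0.5 * (target - pos)
        else:
            velocities[i] = 0.0
    # Evolve phase: all continuous variables advance together.
    return {i: positions[i] + DT * velocities[i] for i in positions}

pos = {0: 0.0, 1: 4.0, 2: 8.0}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
print(round_update(pos, nbrs))  # → {0: 2.0, 1: 4.0, 2: 6.0}
```

Note that every agent reads only pre-round state, which is what makes the update a single synchronous transition rather than a sequence of interleaved steps.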

Definition 2.2. A shared-state distributed cyber-physical system (SSDCPS)


System is a tuple ⟨X, Q0 , A, →⟩, where:

Figure 2.1: The interaction between a pair of agents in a DCPS is modeled


with read-only shared cyber variables, read/write shared physical
variables, private cyber variables, and private physical variables.

(i) X is a set of variables partitioned into disjoint sets Xi , i ∈ ID, so that
Xi ∩ X j = ∅ for all i ≠ j ∈ ID; val(Xi ) is called the set of states for agent i, and val(X)
is called the set of states of the DCPS. For each i ∈ ID, Xi includes a special
Boolean variable called failedi .

(ii) Q0 ⊆ val(X) is the set of start states.

(iii) A is a set of transition names. A includes a transition called update and a


transition faili for each i ∈ ID.

(iv) →⊆ val(X) × A × val(X) is a set of discrete transitions.

The notation Agenti is used to refer to the model of an individual agent


in System. A state of System is a valuation of all the variables for all of
the agents. States of System are referred to with bold letters x, x′ , etc. The
valuation of variables of agent i at state x is referred to as x.xi where xi ∈ Xi .
For Example 2.1, x.p1 would refer to the valuation of the position variable
of the robot with identifier 1 at state x.
The update transition models the evolution of all the agents in System
over a round. The faili transition models the failure of an agent and may

occur between update transitions. There may be other transitions in A
which update other states—potentially of individual agents—of System.
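A minimal Python rendering of this structure, with invented names and a trivial stand-in computation in place of a real round, might look like:

```python
# Sketch of an SSDCPS ⟨X, Q0, A, →⟩ with the required update and fail_i
# transitions; the per-agent computation (decrement toward 0) is a
# placeholder, not an algorithm from the thesis.

class SSDCPS:
    def __init__(self, ids, start_state):
        self.ids = ids
        # Each agent i owns its variables, including the Boolean failed_i.
        self.state = dict(start_state)
        for i in ids:
            self.state[("failed", i)] = False

    def update(self):
        # The update transition modifies the state of every non-failed
        # agent in one synchronous step.
        for i in self.ids:
            if not self.state[("failed", i)]:
                v = self.state[("x", i)]
                self.state[("x", i)] = v - 1 if v > 0 else 0

    def fail(self, i):
        # The fail_i transition may occur between update transitions.
        self.state[("failed", i)] = True

sys = SSDCPS(ids=[0, 1], start_state={("x", 0): 3, ("x", 1): 1})
sys.fail(1)   # agent 1 fails; its state no longer changes
sys.update()
print(sys.state[("x", 0)], sys.state[("x", 1)])  # → 2 1
```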

Example 2.3. Continuing with the example of three mobile robots in the Euclidean
plane, the set of variables X is {p0 , p1 , p2 }. If initially the robots are specified to start
with position variables in the closed unit circle in the Euclidean plane, then Q0 is
{(xi , yi ) : xi² + yi² ≤ 1, i ∈ {0, 1, 2}}, where pi = (xi , yi ). Then, presume the variables
pi are used to coordinate the robots to form some shape in the plane. Specifically,
the coordination among the robots could allow them to form an equilateral triangle
in which the distance between any two of the three robots is equal to some constant
s. For simplicity, assume this triangle has corners in the set {(0, 0), (s, 0), (s/2, √3s/2)}.
Then, update could specify that each of the variables pi is set to an element in this
set of corners.
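A Python sketch of Example 2.3's update transition, with invented names, assigns each pi to a corner of the equilateral triangle with side length s, corners (0, 0), (s, 0), and (s/2, √3s/2):

```python
import math

s = 2.0  # side length of the equilateral triangle
CORNERS = [(0.0, 0.0), (s, 0.0), (s / 2, math.sqrt(3) * s / 2)]

def update(x):
    # Assign robot i to corner i of the triangle, regardless of the
    # pre-state; a real algorithm would compute the assignment.
    return {f"p{i}": CORNERS[i] for i in range(3)}

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

x = update({"p0": (0.3, 0.1), "p1": (-0.5, 0.2), "p2": (0.0, 0.9)})
# After update, every pairwise distance equals s, as the example requires.
for i in range(3):
    for j in range(i + 1, 3):
        assert abs(dist(x[f"p{i}"], x[f"p{j}"]) - s) < 1e-9
```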

2.2.1 Executions and Properties of SSDCPS


An execution fragment of System is a (possibly infinite) alternating se-
quence of states and transition names, α = x0 , a1 , x1 , . . ., such that for
each index k ≥ 0 appearing in α, (xk , ak+1 , xk+1 ) ∈ → for some ak+1 ∈ A.
The notation x →a x′ means (x, a, x′ ) ∈ →. When the term round k is used,
it refers to the pre-state xk prior to the (k + 1)th update transition in some
execution fragment α. Observe that rounds refer only to occurrences
of the update transition, whereas the term step refers to any other transition
a ∈ A \ {update}.
An execution is an execution fragment which begins with a start state
x0 ∈ Q0 . The set of all executions of System is denoted ExecsSystem . A state
x is said to be reachable if there exists a finite execution that ends in x. The
set of all reachable states of System is denoted ReachSystem .
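For intuition, ReachSystem can be computed for a small finite transition system by searching the transition relation; the toy relation below is purely illustrative, not part of the model.

```python
from collections import deque

# Toy transition relation ->: a set of (pre-state, action, post-state) triples.
transitions = {
    ('x0', 'update', 'x1'),
    ('x1', 'update', 'x2'),
    ('x1', 'fail1', 'x3'),
    ('x3', 'update', 'x3'),
}
start_states = {'x0'}  # Q0

def reach(start, trans):
    """States reachable by some finite execution from a start state."""
    seen, frontier = set(start), deque(start)
    while frontier:
        x = frontier.popleft()
        for (pre, _, post) in trans:
            if pre == x and post not in seen:
                seen.add(post)
                frontier.append(post)
    return seen
```

Here every state is reachable from Q0, since x1 has both an update and a fail1 successor.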
The concatenation of a finite execution fragment α of System and any
execution fragment α′ of System such that the first element of α′ is equal
to the last element of α is also an execution fragment of System, denoted
α · α′, where the duplicate last state of α is deleted from the concatenated
sequence.
For two executions α and α′ of System, α is said to be indistinguishable
from α′ if α and α′ have the same sequence of states. Specifically, if α =
x0, a1, x1, . . . and α′ = x′0, a′1, x′1, . . . such that x0 = x′0, x1 = x′1, . . ., then
the executions are indistinguishable. Indistinguishability of executions is
frequently used in lower-bound proofs on the amount of time required for
the states of System to satisfy some property.
In distributed systems, two kinds of properties are of paramount impor-
tance. A safety property captures the notion that some “bad” property is
never satisfied, such as processors agreeing on incorrect values in consen-
sus. Equivalently, it means that some “good” property is always satisfied.
Safety properties are generally established by use of a potentially simpler
invariant, or several invariants each of which successively refines the state
space. A liveness property captures the notion that some “good” property
will eventually be satisfied, such as processors eventually agreeing on a
common value in consensus. However, it is not known, given a state, how
far in the future the good property will be satisfied. For correct algorithms, one
would like to have termination, but this is not always possible. A progress
property is a stronger notion than liveness and is defined as the a priori
knowledge that, given any state, there is a constant k amount of time in the
future where the good property is satisfied.
System is stable with respect to a set S ⊆ val(X) if for each x -a-> x′, x ∈ S
implies that x′ ∈ S. System is invariant with respect to a set S ⊆ val(X) if
all reachable states are contained in S, that is, ReachSystem ⊆ S. System is
said to stabilize to S if S is stable and every execution fragment has a suffix
ending with a state in S.
A predicate P defines a set of states SP ⊆ val(X). If the set S defined
by a predicate P is respectively stable or invariant, then the predicate is
respectively called stable or invariant.
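Checking stability of a predicate-defined set is straightforward on a finite toy system; the sketch below is illustrative (the states, actions, and predicates are invented for this example).

```python
# Toy transition relation over integer states.
transitions = {
    (0, 'update', 1), (1, 'update', 2), (2, 'update', 2),
    (2, 'fail0', 5),
}

def is_stable(P, trans):
    """S_P is stable iff x in S_P and x -a-> x' implies x' in S_P."""
    return all(P(post) for (pre, _, post) in trans if P(pre))

P_nonneg = lambda x: x >= 0   # stable: every transition stays nonnegative
P_small = lambda x: x <= 1    # not stable: 1 -update-> 2 leaves S_P
```

The predicate x ≥ 0 defines a stable set here, while x ≤ 1 does not, since an update transition exits it.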
The standard method to show that some predicate P is an invariant is
by induction on the number of completed rounds k in some execution α,
beginning with a base case of k = 0. Such assertional reasoning is used to
establish properties of the DCPS. Similarly, compositional reasoning about
one, or few, of the agents in a composite System is employed to simplify
establishing properties of the entire System. Finally, hierarchical proofs
involving a successive refinement of invariants are also used.
2.3 Failure Model
A faili transition represents the failure due to some exogenous event of the
ith agent in System, where i ∈ ID. This transition sets the variable failedi
to true permanently—it may never be set to false once being set to true—
and may have other effects depending on the failure model considered.
For instance if an actuator failure occurs, faili may set velocity of i to be a
constant.
For a state x, let

F(x) ≜ {i ∈ ID : x.failedi = true}

be the set of failed identifiers, and let

NF(x) ≜ ID \ F(x)

be the set of non-faulty identifiers. The terminology failed agent and non-
faulty agent refers to the agents with identifiers in the failed identifiers or
non-faulty identifiers, respectively.
Let

ANF ≜ A \ {faili : i ∈ ID}

be the set of non-faulty actions.
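These sets translate almost directly into code; the four-agent instance below is a hypothetical example for illustration.

```python
ID = {0, 1, 2, 3}
FAIL_ACTIONS = {('fail', i) for i in ID}
A = {'update'} | FAIL_ACTIONS            # the full action set of this toy System

def F(x):
    """Failed identifiers in state x: agents whose failed flag is true."""
    return {i for i in ID if x['failed'][i]}

def NF(x):
    """Non-faulty identifiers."""
    return ID - F(x)

A_NF = A - FAIL_ACTIONS                  # the non-faulty actions

x = {'failed': {0: False, 1: True, 2: False, 3: True}}
```

For this state, F(x) = {1, 3}, NF(x) = {0, 2}, and the only non-faulty action is update.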
2.3.1 Failure-Free Executions


A failure-free execution fragment is any execution fragment in which all tran-
sitions are non-faulty ones, so any transition is from the set ANF. Along
such execution fragments αff, no non-failed agents fail, nor do any failed
agents recover from being failed, so such executions satisfy the property
that failures have ceased to occur. Suppose α is an arbitrary infinite ex-
ecution of System with a finite number of failures. Let xf be the state of
System at the round after the last failure, and αff be the infinite failure-free
execution fragment xf, xf+1, . . . of α starting from xf. Then, F(xf) = F(xt)
for all t ≥ f, that is, the set of failed agents remains constant.
2.3.2 Self-Stabilizing DCPS
Define self-stabilization as follows [18].

Definition 2.4. If S is a stable set of states for System, called the legal states,
then System is self-stabilizing for S if and only if there exists a set of states T
for System, called the illegal states, such that

(i) S ⊆ T,

(ii) T is invariant,

(iii) S is stable for any failure-free execution, that is, for any transition in ANF ,

(iv) There exists a reachable state in S along any failure-free execution fragment
α which begins with any state in T.

Thus, self-stabilization is stability without failures and convergence
upon failure transitions no longer occurring. The set T being invariant
captures the notion of a safety property, and the existence of a reachable
state in any failure-free execution fragment α captures the notion of a
progress property. This definition is used to show safety and progress
properties of DCPS in spite of failures.
See Figure 2.2 for a graphical depiction of these properties.
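For a finite, deterministic toy system, the closure and convergence conditions of Definition 2.4 can be checked directly; the state names and the update map below are illustrative, not part of the thesis model.

```python
# Illustrative sets: T (invariant), S (legal, S ⊆ T), and a deterministic
# non-faulty 'update' successor map. fail transitions may move the state
# anywhere within T, which is why T must be invariant.
T = {'s0', 's1', 'bad1', 'bad2'}
S = {'s0', 's1'}
update = {'s0': 's1', 's1': 's0',        # closure: updates keep S inside S
          'bad1': 'bad2', 'bad2': 's0'}  # convergence: T leads back to S

def closure(S, update):
    """Condition (iii): S is stable under non-faulty transitions."""
    return all(update[x] in S for x in S)

def convergence(T, S, update):
    """Condition (iv): every failure-free execution from T reaches S."""
    for x in T:
        cur, visited = x, set()
        while cur not in S:
            if cur in visited:           # a cycle outside S: never converges
                return False
            visited.add(cur)
            cur = update[cur]
    return True
```

Here S ⊆ T, updates are closed on S, and every failure-free path from an illegal state in T converges back to S, so this toy system is self-stabilizing for S.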

2.3.3 Failure Detector Model


Based upon what effect on variables the faili transitions have, it may be
necessary for non-faulty agents to detect which other agents are failed. In
such cases, a failure detector is used by the agents [3, 19]. A failure detector
satisfies two properties, completeness and accuracy. Completeness requires
that all failed agents are detected. Accuracy requires that only failed agents
are detected. There are varying degrees of these properties, such as strong
completeness, which is that eventually every agent that fails is permanently
suspected by every correct agent, and eventual weak accuracy, which is that
there is a time after which some correct agent is never suspected by the
correct agents [20].
The properties of the failure detectors desired in this thesis are:
[Figure 2.2 (diagram omitted): Fault-tolerance where T is an invariant set,
S is a stable set, Q0 is the set of start states, αf are execution fragments
with faili transitions, and αff are failure-free executions (executions with
transitions only from ANF).]

(a) Accuracy specifies that if an agent is detected to have failed, then it
has in fact failed; that is, there are never any correct agents which are
detected. So a failure detector suspects agent i only if agent i is failed.

(b) Completeness specifies that every failure is detected within bounded
time. So a failure detector suspects agent i within bounded time of the
faili transition. That is, there are eventually no failed agents which are
undetected.

Define the detection time of a failure detector to be the number of rounds—
recall this is defined as the number of update transitions which have
occurred between two states—until a failed agent has been suspected by
some non-faulty agent. The failure detection algorithm must rely only on
the received messages from the agents. For the shared memory model
under consideration, this means that the failure detector may only rely on
the shared variables.
The failure detector algorithm could be implemented by some other
external oracle. Such a failure detector still must rely solely on the in-
formation communicated by the agents. However, this does not prevent
such a failure detector from keeping a history of all messages sent in an
execution.
Along these lines, the failure detector for System, if one is necessary, is
modeled through a special transition called suspecti ∈ A for each agent in
System, and agent i has a state variable called Suspectedi , which is the set
of other agent identifiers which agent i believes to be failed.
There is a precondition on the suspecti transition being executed, which
may be a predicate on the states of System. Additionally, the precondition
quantifies an agent j being checked for failure. It is assumed that upon
this precondition being satisfied, the transition is taken. Upon execution
of suspecti , agent i adds agent j to the set Suspectedi . See Chapter 5 for an
example of this.
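A minimal sketch of this modeling style follows; the stuck precondition is a hypothetical example, loosely in the spirit of the Chapter 5 detector, and all names are illustrative.

```python
def suspect(i, j, state, suspected, precondition):
    """Models the suspect_i transition: when the precondition holds for
    agent j, agent i adds j to its Suspected_i set."""
    if precondition(state, j):
        suspected[i].add(j)

# Hypothetical precondition: suspect j if its shared position did not change
# over the last round even though its waypoint says it should have moved.
def stuck(state, j):
    return (state['pos'][j] == state['prev_pos'][j]
            and state['pos'][j] != state['waypoint'][j])

suspected = {0: set(), 1: set()}
state = {
    'pos':      {0: (1, 1), 1: (5, 5)},
    'prev_pos': {0: (0, 1), 1: (5, 5)},
    'waypoint': {0: (2, 1), 1: (6, 5)},
}
suspect(0, 1, state, suspected, stuck)   # agent 1 looks stuck: suspected
suspect(1, 0, state, suspected, stuck)   # agent 0 moved: not suspected
```

Note that this precondition reads only shared variables, consistent with the shared memory model above.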

2.4 Conclusion
This chapter introduced a model for DCPS, a definition of stabilization, and
failure detectors. Through the use of stabilization, fault-tolerant DCPS can
be constructed which allow discussion of invariant sets of states describing
safety properties and stable sets of states under normal system operation
describing progress properties. This is the traditional use of the term
stabilization [18]. The key difference in the use of stabilization and failure
detectors in DCPS is that physical state may now provide a means of
identifying when the system is behaving badly.
The case study in Chapter 5 exemplifies this point through the creation of
a fault-tolerant DCPS by combining a fault-intolerant DCPS with a failure
detector, which allows for the DCPS to satisfy stabilization. Specifically
the failure detector relies on physical state—the position of an agent—and
cyber state—a computed position where an agent would like to move—
to detect that an agent's actuators have failed. Upon all failures being
detected, an invariant safety property (the set T in Definition 2.4) in the
physical domain is satisfied which is that collisions do not occur, and that
eventually states are reached from which progress can be made (the set S
in Definition 2.4). As this case study exemplifies, we believe it is possible
to develop general methods for designing fault-tolerant DCPS.
CHAPTER 3

RELATED WORK

The related work summarized in this chapter addresses failure classes and
models, as well as methods for ensuring fault-tolerant operation of sys-
tems. Fault-tolerance has been widely studied in a variety of engineering
and computer science disciplines related to the work of this thesis, such
as control theory [21–24], reliability [25], artificial intelligence [26], dis-
tributed computing systems [1–4, 18, 20, 27, 28], embedded and real-time
systems [29], and combinations of these [30].

3.1 Failure Classes and Models


Since distributed cyber-physical systems (DCPS) are composed of com-
puters interacting with the physical world, many classes of faults exist.
From the cyber domain, there are timing failures of real-time programs
and operating systems, in addition to crash failures, simple software bugs,
and processor hardware faults. From the physical domain, there are ac-
tuator, control surface, and sensor failures, aside from of course necessary
robustness given the potential operating environments of a system. Be-
tween these two worlds is the potential for communication failures, such
as message drops and omissions, or worse, adversarial man-in-the-middle
attacks perhaps culminating in Byzantine failures [29, 31].
The literature contains numerous failure models and definitions for fault-
tolerance. Cyber and physical failures in agents now have physical con-
sequences which may influence the safety and progress of the DCPS. On
the one hand, there is a large class of failures to explore, which includes
traditional failures such as message losses, process crashes, and Byzantine
faults, and also new types of failures that affect sensors and actuators. On
the other hand, since failures manifest in behaviors that are constrained
by physical laws, there exists the possibility of developing smarter failure
detection algorithms.
A crash failure is modeled as an agent ceasing to take transitions, and
if the crash is not clean, then at the agent’s final step, it might succeed
in sending only a subset of the messages it was supposed to send. A
Byzantine failure is modeled as agents changing software state arbitrarily
and sending messages with arbitrary content; note that continuous state
is not included, as arbitrary changes of continuous state could require
infinite amounts of energy to complete in finite time. A classical result
from distributed computing is that for many problems, such as consensus,
f crash failures can be tolerated with at least f + 1 agents in f + 1 rounds,
and that t Byzantine failures can be tolerated in t + 1 rounds with at least 3t + 1
agents. Furthermore, in a combined failure model where both crash and
Byzantine failures can occur, where f are crash failures and t are Byzantine
failures, in f +t+1 rounds, at least 3t+ f +1 agents suffice to solve consensus
in a synchronous setting [11].
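To make the crash-failure bound concrete, here is a sketch of the classic (f + 1)-round flooding algorithm for synchronous consensus. It is not taken from the thesis and uses a simplified clean-crash model: a crashed process sends nothing from its crash round onward.

```python
def flood_consensus(initial, f, crash_round=None):
    """Synchronous consensus tolerating up to f clean crash failures.

    initial: dict pid -> input value; crash_round: dict pid -> round in
    which that pid crashes (it sends nothing from that round on).
    """
    crash_round = crash_round or {}
    seen = {p: {v} for p, v in initial.items()}   # values each pid has seen
    alive = set(initial)
    for r in range(1, f + 2):                     # f + 1 rounds
        alive -= {p for p in alive if crash_round.get(p) == r}
        msgs = [seen[p] for p in alive]           # every live pid broadcasts
        for p in alive:
            for m in msgs:
                seen[p] |= m
    return {p: min(seen[p]) for p in alive}       # all decide the minimum

decisions = flood_consensus({0: 3, 1: 1, 2: 2}, f=1, crash_round={1: 1})
```

With process 1 crashing before sending in round 1, the survivors both decide 2 after f + 1 = 2 rounds; the extra round guarantees agreement even when a crash delivers messages to only some processes.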
Physical processes may have failures with regard to sensors, actuators,
and control surfaces [21] which may affect both physical state and software
state. Actuators and sensors may become stuck at a certain value, although
it should be noted that one can utilize physical constraints such as satu-
ration to limit the effect such a fault has on a system. That is to say that
the actuator and sensors’ behaviors are constrained due to physical limita-
tions, which may prove useful in detecting and mitigating faults: they do
not have the ability to behave arbitrarily bad like Byzantine failures in the
cyber domain.

3.1.1 Failure Occurrence


Furthermore, the time of occurrence of such faults is of interest. There
are permanent faults, such as a processor crashing forever, but there are
also intermittent and transient faults [32, 33] where faults come and go.
Permanent failures cause one or several agents of the system to stop run-
ning forever; for processing, this means that a program is stopped from
executing for all time, whereas for a communications link, this is a rupture
of service [32, 34]. Transient failures put one or several agents of the system
in an arbitrary state, but stop occurring after some period of time [35]. This
can represent a computer crashing and subsequently recovering, or a com-
puter losing power and then being restarted [32]. Intermittent failures make
one or several agents of System behave erratically for some time and may
occur at any time, but are generally rare. This can represent a processor
temporarily having Byzantine behavior, or cause a communication service
to lose, duplicate, reorder, or modify messages in transit [32]. Incessant
failures behave like intermittent failures except that they may occur with
regularity rather than rarity [33].

3.2 Methods for Ensuring Fault-Tolerance


Consider the following general problem: Given a system model, fault
model, and problem specification, when is it possible (or impossible) to
detect that faults have occurred? Then, if it is possible to detect faults
have occurred, when is it possible to mitigate a fault such that the problem
specification is still satisfied? In particular, when is it possible to prevent a
degradation of safety or progress properties?
In all communities, methods for handling failures can broadly be broken
into two categories: active and passive. In active mitigation (also non-
masking), the existence of a fault is identified by some process or set of
processes (either in the system or an outside observer such as an oracle),
and then a mitigation response is initiated, such as the rest of the correct
set of the system ignoring all outputs of the identified faulty subset. This
requires the system to deviate from normal operation and then be corrected,
for instance in rollback and recovery methods of restarting a distributed
computation [1].
In passive mitigation (also masking), the existence of a fault is hidden
from the perspective of other agents in the composite system, ideally in
an automated manner, often accomplished by redundancy or replication.
Thus with these differences, it is clear that it is sometimes necessary to
detect failures, whereas in other cases it is not necessary. It shall soon be
established that the notion of passive mitigation is in some sense equiv-
alent to the existence of a self-stabilizing algorithm which solves a given
problem.
Given that effectively all DCPS must maintain some notion of the current
state of the system with regards to time to be able to interact with the phys-
ical world, the real-time systems community has analyzed faults. When
implemented as real-time systems, there is a possibility for timing failures,
where a process misses some deadlines specified by worst-case execution
time (WCET) analysis [29]. Giotto [36] and its extensions allow for analysis
of programs to ensure that no timing failures (missing deadlines) can oc-
cur in the virtual machine these programs are executed on. Etherware [37]
utilizes a middleware layer and shows the ability of a distributed real-time
control system to maintain safety and liveness in spite of communications
link failures.
The Simplex architecture for supervisory control allows for the automatic
mitigation of certain faults by concurrently executing several controllers,
one of which is thoroughly tested, and then choosing the control output
from the safe controllers if the other controllers issue commands that would
take the system to an unsafe set of states [38]. While this slows progress,
it guarantees a notion of safety, and eventually upon returning far enough
within a good set of states (far from the bad states), a faster response can
be utilized. In some systems, a degradation of a safety property, such as
moving from very safe states to less safe states, could potentially be used
to detect faults—this is similar to how the Simplex architecture switches
between controllers, and this idea is employed in the failure detection in
the distributed flocking problem in Chapter 5. More recent work on this
utilizing a field-programmable gate array (FPGA) based safety controller
in the system Simplex architecture allows the avoidance of even further
faults that may have occurred due to the operating system [39].
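A toy sketch of the Simplex decision rule follows; the one-dimensional plant and the two controllers are illustrative stand-ins, not taken from [38].

```python
def simplex_step(x, complex_ctrl, safe_ctrl, next_state, is_safe):
    """Use the high-performance controller only if its command keeps the
    next state inside the safe set; otherwise fall back to the verified
    safety controller."""
    u = complex_ctrl(x)
    if is_safe(next_state(x, u)):
        return u
    return safe_ctrl(x)

# 1-D plant x' = x + u with safe set |x| <= 10.
next_state = lambda x, u: x + u
is_safe = lambda x: abs(x) <= 10.0
aggressive = lambda x: 5.0           # untrusted controller: large fixed step
cautious = lambda x: -0.5 * x        # verified controller: decay toward 0

u1 = simplex_step(2.0, aggressive, cautious, next_state, is_safe)  # safe: keep 5.0
u2 = simplex_step(8.0, aggressive, cautious, next_state, is_safe)  # unsafe: fall back
```

Near the boundary the verified controller overrides the aggressive one, trading progress for a guarantee of safety, exactly the trade-off described above.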

3.2.1 Failure Detectors


Since their introduction by Chandra and Toueg [3, 19], failure detectors
have played a central role in the development of distributed computing
systems [11]. A failure detector is a device or a program that provides
each process with information about failure of other processes in a dis-
tributed system. They provide algorithms for solving canonical problems
such as consensus, leader election, and clock synchronization in the pres-
ence of certain types of failures, and also establish lower bounds about
impossibility of solving those problems with certain resource constraints.
What requirements a failure detector must satisfy to be able to solve a
problem is theoretically interesting and frequently studied [40]. Specifi-
cally, several classes of failure detectors have been defined according to
the nature and the quality of the information that they provide [20]. Al-
gorithms for implementing these failure detectors have been incorporated
in practical fault-tolerant systems [41, 42]. On the theoretical side, fail-
ure detectors of different quality are used to characterize the hardness of
different distributed computing problems [43], and more directly, failure
detectors of certain quality are used to solve other problems, such as dis-
tributed consensus. There exist failure detectors for classes of transient
failures [35].
The general model is that the failure detector is acting as an oracle or
outside service and suspects agents to have failed. Implementation can
be done in several ways, such as agents occasionally sending an alive
message to the failure detector, which then removes that agent from the
list of suspects if it was there, or otherwise resets a timeout; such a method
is a push. Other methods revolve around whether the scheme is a pull
method, where the failure detector occasionally asks agents if they have
failed. Thus, one desired property is completeness of detecting failures,
which means that if an agent has failed, then eventually it is suspected by
the failure detector. A competing metric is accuracy, in that, if an agent is
suspected of having failed, then it has in fact failed.
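A round-based sketch of the push scheme is below; the class and method names are illustrative, and rounds stand in for real-time timeouts.

```python
class HeartbeatDetector:
    """Push-style failure detector: agents report 'alive'; any agent whose
    last report is older than `timeout` rounds is suspected."""

    def __init__(self, ids, timeout):
        self.last_seen = {i: 0 for i in ids}
        self.timeout = timeout
        self.round = 0

    def report_alive(self, i):       # the push from agent i
        self.last_seen[i] = self.round

    def tick(self):                  # one synchronous round elapses
        self.round += 1

    def suspects(self):
        return {i for i, last in self.last_seen.items()
                if self.round - last > self.timeout}

fd = HeartbeatDetector({0, 1}, timeout=2)
for _ in range(3):
    fd.tick()
    fd.report_alive(0)               # agent 0 keeps reporting; agent 1 is silent
```

This detector is complete (a crashed agent stops reporting and is suspected within timeout + 1 rounds) and accurate only insofar as live agents always report before their timeout expires.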
Similar to the notion of failure detectors is the fault diagnosis (or fault
detection and identification) problem from controls, which is composed of
three steps.
Real-time systems often utilize failure detectors through watchdog timers.
If a response is not received from one processor by another, a flag is raised
that the processor may have reached an illegal state, and the other processor
may have an ability to reset it [29].
The control-theoretic literature deals with detecting faults in the context
of a given plant dynamics. Typically faults are modeled as additive or
multiplicative dynamics that cause perturbations in the evolution of the
plant [44], and failure detectors rely on techniques such as signature gener-
ation, residual generation, observer designs [23], and statistical testing [21].
For instance, it is shown in Chapter 5 that it is possible to model actuator
stuck-at failures as additive dynamics for a switched system. First,
fault detection results in a binary decision of whether something is wrong
in the system. Second, fault isolation locates which component is faulty.
Third, fault identification determines the magnitude of the fault and/or the
time the fault occurred. Fault detection and isolation together are called
fault diagnosis [44]. Practical implementations usually only rely on fault
detection and isolation, and are together called fault detection and isolation
(FDI). Other notions of failure detection in the controls community can be
applied through observers [23], or more frequently in a more probabilistic
way, such as using Kalman filters to diagnose faults [21].
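To make the residual-generation idea concrete, here is a sketch of one-step residual checking for a known scalar plant x[k+1] = a·x[k] + b·u[k]; the numbers and the input-stuck-at-zero fault are illustrative.

```python
def first_fault(xs, us, a, b, threshold):
    """Return the first index where |measured - predicted| > threshold."""
    for k in range(len(xs) - 1):
        residual = abs(xs[k + 1] - (a * xs[k] + b * us[k]))
        if residual > threshold:
            return k + 1
    return None

a, b = 0.9, 1.0
us = [1.0] * 5
xs = [0.0]
for k in range(5):
    nominal = a * xs[-1] + b * us[k]
    # Input stuck at zero from step 3: an additive fault of -b*u[k].
    xs.append(nominal if k < 3 else a * xs[-1])
```

The residual is zero while the model matches the plant, then jumps to b·u[k] = 1.0 once the actuator input is stuck, so a small threshold flags the fault at the first post-fault sample.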
Diagnosis techniques have also been specifically developed for discrete
event dynamical systems (DEDS) [45, 46]. These methods include central-
ized detection approaches as well as distributed ones [24, 47]. Here faults
can be modeled as uncontrollable transitions, specifically that the transi-
tions are caused by some exogenous actor and not the system [48]. Likewise
faults can be modeled as unobservable transitions, and the occurrence of
the transition must be deduced [45, 49].
Safe diagnosability [50] implies that for some systems, mitigation must
occur before some bounded time, as otherwise the system can reach states
that violate safety. Safe diagnosability applies to the flocking case study
in Chapter 5 because if failures are not detected and corrected quickly,
the system may reach states which violate safety and progress. These
techniques are applicable to dynamical systems without any notions of
communication.

3.2.2 Self-Stabilization
The concept of self-stabilization was introduced by Dijkstra [51]. Self-
stabilizing algorithms are those that from an arbitrary starting state even-
tually converge to a legal state and remain in the set of legal states [4];
see Figure 1.2. The two necessary properties of self-stabilizing algorithms
are closure and convergence [18]. From any state (legal or not) the system
must converge in a finite number of steps to a legal state. The set of legal
states must then be closed, in that only failures may take the system to a
set of illegal states. The design of self-stabilizing failure detectors has been
investigated [52].
As defined above, self-stabilizing algorithms implement a form of non-
masking fault tolerance, in that the fault may be observable as the system
is no longer in a legal state, but automatically the system eventually, in a
finite number of steps, returns to a set of legal states. Such protocols rely
on the assumption that the programs do not fail, and that only state and
data may become corrupted due to failures. It should also be noted that
due to the closure property, a composition of self-stabilizing algorithms
can be utilized to solve a complex task. For instance, if from an arbitrary state
x an algorithm A takes the system in TA steps to a legal state xLA, then some
algorithm B can operate that takes the system in TB steps to another legal
state xLB, and so on.

3.2.3 Stabilizers
The use of a stabilizer provides a general method to convert a fault-
intolerant algorithm to a fault-tolerant one through composition of other
algorithms. One mechanism monitors system consistency—such as com-
bining a self-stabilizing distributed snapshot algorithm [53] with a self-
stabilizing failure detector [3]. The other mechanism repairs the system to a
consistent state upon inconsistency being detected—such as self-stabilizing
distributed reset [54].
The first stabilizer collected distributed global snapshots [53] of the com-
posite system and checked whether the snapshots were legal, where the
distributed snapshot did not interfere with the activity of the algorithm,
so the composed algorithm trivially satisfied closure [55]. Thus, such
stabilizers rely on utilizing a composition or hierarchy of self-stabilizing
algorithms. The detectors and correctors of [56] are analogous to stabilizers
and also the detection and mitigation of Chapter 5. The paradigm is that a
fault-tolerant system is constructed out of a fault-intolerant system and a set
of components for fault-tolerance (detectors and correctors).
Rather than relying on predicates on global system state to detect incon-
sistency, it is possible to detect inconsistent global state by checking if local
state is inconsistent. In local detection [57, 58], if a global property is
violated (such as the global system not being in a legal state), then some
local property must also be violated. Local checking and correction were in-
troduced in the design of a self-stabilizing communications protocol with
a self-stabilizing network reset [59] where global inconsistency is detected
by analyzing local state. Local detection and checking are analogous to the
detection method used in Chapter 5 and local correction is analogous to the
mitigation method. The local stabilizer of [60] takes a distributed algorithm
and transforms it into a self-stabilizing synchronous algorithm which tol-
erates transient faults through local detection in O(1) time and local repair
of the inconsistent system state, resulting in an algorithm which tolerates
f faults in O( f ) time.
Similar to the notion of a stabilizer in distributed systems and the case
study in Chapter 5 is the control theoretic paper [61], where a motion probe,
or a specific control applied for some time, is used to detect failures of
individual agents solving a consensus problem. Upon detection of failures
through the use of motion probes, the non-faulty agents stop utilizing the
values of faulty agents to ensure progress.
CHAPTER 4

DISTRIBUTED CELLULAR FLOWS

4.1 Introduction
This chapter is based upon previous work [62].
Highway and air traffic flows are nonlinear dynamical systems that give
rise to complex phenomena such as abrupt phase transitions from fast
to sluggish flow [63–65]. The ability to monitor, predict, and avoid such
phenomena can have a significant impact on the reliability and the capacity
of traffic networks. Traditional traffic protocols, such as those implemented
for air-traffic control, are centralized [66]—a coordinator periodically collects
information from the vehicles, decides and disseminates the waypoints,
and subsequently the vehicles try to blindly follow a path to the waypoint.
The advent of wireless vehicular networks [67] presents a new opportunity
for distributed traffic monitoring [68] and control. Distributed protocols
should scale and be less vulnerable to failures compared to their centralized
counterparts. In this case study, such a distributed traffic control protocol
is presented, as is an analysis of its behavior.
A traffic control protocol is a set of rules that determines the routing and
movement of certain physical entities, such as cars and packages, over an
underlying graph, such as a road network, air-traffic network, or a ware-
house conveyor system. Any traffic control protocol should guarantee:
(a) (safety) that the entities maintain some minimum physical separation,
and (b) (progress) that the entities arrive at a given destination (or target)
vertex. In a distributed traffic control protocol each entity determines its
own next-waypoint, or each vertex in the underlying graph determines the
next-waypoints for the entities in an appropriately defined neighborhood.
The idea of distributed traffic control has been around for some time but
most of the work has focused on human-factors issues [69, 70], collision
avoidance [71–75], and platooning [76–78]. A notable exception is [79],
which presents a distributed algorithm (executed by entities, vehicles in
this case) for controlling a highway intersection without any stop signs.
The distributed traffic control problem is studied in a partitioned plane
where the motions of entities within a partition are coupled. The problem
can be described as follows (refer to Figure 4.1). The geographical space of
interest is partitioned into regions or cells. There is a designated target cell
which consumes entities and some source cells that produce new entities.
The entities within a cell are coupled, in the sense that they all either
move identically or they remain static (the motivation for this is discussed
below). If a cell moves such that some entities within it touch the boundary
of a neighboring cell, those entities are transferred to the neighboring cell. Thus,
the role of the distributed traffic control protocol is to control the motion
of the cells so that the entities (a) always have the required separation, and
(b) they reach the target, when feasible.
The coupling mentioned above which requires entities within a cell to
move identically may appear surprising at first sight. After all, under
low traffic conditions, individual drivers control the movement of their
cars within a particular region of the highway, somewhat independently
of the other drivers in that region. However, on highways under high-
traffic, high-velocity conditions, it is known that coupling may emerge
spontaneously, whereby the vehicles form a fixed lattice structure and
move with zero relative speed [64, 80]. In other scenarios coupling arises
because passive entities are moved around by active cells, for example,
packages being routed on a grid of multi-directional conveyors [81], and
molecules moving on a medium according to some controlled chemical
gradient. Finally, even where the entities are active and cells are not,
the entities can cooperate to emulate a virtual active cell expressly for
the purposes of distributed coordination. This idea has been explored for
mobile robot coordination in [82] using a cooperation strategy called virtual
stationary automata [83, 84].
The distributed traffic control protocol guarantees safety at all times, even
when some cells fail permanently by crashing. The protocol also guar-
antees eventual progress of entities towards the target, provided that there
exists a path through non-faulty cells to the target. Specifically, the protocol
is self-stabilizing [4], in that after failures stop occurring, the composed sys-
[Figure 4.1 (diagram omitted): Example System with 4 × 4 unit-length square
cells where tid = ⟨2, 2⟩ (in very light gray), SID = {⟨1, 0⟩} (in light gray), and
failed⟨2,1⟩ = true (in black). The gray arrows represent next variables. The
smaller squares are entities with safety region specified by rs represented
by the gray border and length region specified by l represented by the
white interior.]

tem automatically returns to a state from which progress can be made. The
algorithm relies on two mechanisms: (a) a rule to maintain local routing
tables at each non-faulty cell, and (b) a (more interesting) rule for signaling
among neighbors which guarantees safety while preventing deadlocks.
Roughly speaking, the signaling mechanism at some cell fairly chooses
among its neighboring cells which contain entities, indicating if it is safe
for one of these cells to apply a movement in the direction of the signaling
cell. This permission-to-move policy turns out to be necessary, because
movement of neighboring cells may otherwise result in a violation of safety
in the signaling cell, if entity transfers occur.
These safety and progress properties are established through systematic
assertional reasoning. These proofs may serve as a template for the analysis
of other distributed traffic control protocols and also can be mechanized
using automated theorem proving tools, for example [85].
The throughput analysis of this algorithm, and in fact any distributed
traffic control algorithm, remains a challenge. Simulation results are pre-
sented that illustrate the influence (or the lack thereof) of several factors
on throughput:

(a) path length,

(b) path complexity measured in number of turns along a path,

(c) required safety separation and cell velocity, and

(d) failure-recovery rates, under a model where crash failures are not per-
manent and cells may recover from crashing.

4.2 System Model


In this section, a formal model of the distributed cellular flows algorithm
is presented as a shared-state distributed cyber-physical system (SSDCPS)
as introduced in Chapter 2. Refer to Chapter 2 for preliminaries.

4.2.1 Overview of Distributed Cellular Traffic Control


The system consists of N² cells arranged in an N × N grid. Each cell
physically occupies a unit square region in the plane and may contain a
number of entities, each of which occupies a smaller square region. All
the entities on a given cell move identically: either they remain static or
they move with some constant velocity either horizontally or vertically.
This movement is determined by the software controlling each cell. The
software relies on communication among adjacent cells. When a moving
entity touches an edge of a cell, it is instantaneously transferred to the next
neighboring cell.
The software of a cell implements the distributed traffic control protocol.
At each round, every cell exchanges messages bearing state information
with its neighbors. Based on this, the cells update their software state
and decide their (possibly zero) velocities. Until the beginning of the next
round, the cells continue to operate according to this velocity—this may
lead to entity transfers.
Recall from Chapter 2 the modeling assumptions that messages are de-
livered within bounded time and computations are instantaneous. Under
these assumptions, the system can be modeled as a SSDCPS. Further as-
sume, for simplicity of presentation only, that all the entities have the same
size, and if moving, any cell does so with the same constant velocity.
Now follows the SSDCPS model.

4.2.2 Formal System Model


Let ID ≜ [N − 1] × [N − 1] be the set of identifiers for all cells in the system.
Each cell has a unique identifier ⟨i, j⟩ ∈ ID. Cell ⟨i, j⟩ occupies a unit
square whose bottom-left corner is the point (i, j) in the Euclidean plane.
The ensemble of N² cells covers an N × N square in the first quadrant of
the plane. Cells are ordered increasingly by identifiers along the real plane
from the origin, with Cell0,0's southwest corner located at the origin and
CellN−1,N−1's northeast corner located at the point (N, N). Cell ⟨m, n⟩ is said
to be a neighbor of cell ⟨i, j⟩ if |i − m| + |j − n| = 1. The set of identifiers of
all neighbors of ⟨i, j⟩ is denoted by Nbrsi,j. For this case study, consider a
system with a unique target cell with identifier tid and a set of source cells
with identifiers SID ⊂ ID. All other cells are ordinary cells. Every entity
that may ever be in the system has a unique identifier drawn from a set
P. For any entity p ∈ P that is actually present in the system, denote the
coordinates of its center by (px, py) ∈ R². Entity p occupies an l × l square
area, with its center at (px, py).
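As a concrete illustration, the identifier set and neighbor relation above can be transcribed directly. The following Python sketch is illustrative only; the names cell_ids and nbrs are not part of the thesis model.

```python
# Illustrative transcription of ID = [N-1] x [N-1] and the neighbor
# relation |i - m| + |j - n| = 1; function names are assumptions of this
# sketch, not part of the formal model.

def cell_ids(N):
    """All cell identifiers <i, j> in the N x N grid."""
    return {(i, j) for i in range(N) for j in range(N)}

def nbrs(cell, N):
    """Identifiers <m, n> with |i - m| + |j - n| = 1, clipped to the grid."""
    i, j = cell
    candidates = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return {(m, n) for (m, n) in candidates if 0 <= m < N and 0 <= n < N}
```

Corner cells have two neighbors, edge cells three, and interior cells four.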
The specification of the system uses the following three parameters:
(i) l: length of an entity,

(ii) rs : minimum required inter-entity gap along each axis, and

(iii) v: cell velocity, or distance by which an entity may move over one
round.

It is required that

(i) v < l < 1, and

(ii) rs + l < 1.

The former is required to ensure cells do not violate the gap requirement
from one round to the next when new entities enter a cell. The latter is
required so that entities cover at most the same area of the Euclidean plane
as the cell in which they are contained, since cells are squares of unit length.
Define the total center spacing requirement as

    d ≜ rs + l.
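These constraints and the definition of d can be checked mechanically. In the sketch below, l = 0.25 matches the simulations of Section 4.4, while the concrete values of rs and v are assumptions chosen only for illustration.

```python
# A small check of the parameter constraints v < l < 1 and rs + l < 1,
# returning the total center spacing requirement d = rs + l.
# The concrete values below are illustrative assumptions.

def check_parameters(l, rs, v):
    """Validate v < l < 1 and rs + l < 1; return d = rs + l."""
    assert 0 < v < l < 1, "cell velocity must be below entity length, below 1"
    assert rs + l < 1, "an entity plus its required gap must fit in a unit cell"
    return rs + l

d = check_parameters(l=0.25, rs=0.1, v=0.05)  # d == rs + l
```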

Next is a description of the behavior of an individual agent, referred to
as Celli,j. The variables associated with each Celli,j are specified below,
where initial values of the variables are shown in Figure 4.2 using the ':='
notation:
(i) Membersi,j: set of entities located in cell ⟨i, j⟩,

(ii) nexti,j: neighbor towards which ⟨i, j⟩ attempts to move,

(iii) NEPrevi,j: nonempty neighbors for which ⟨i, j⟩ is equal to next,

(iv) disti,j: estimated Manhattan distance to tid,

(v) tokeni,j: a token used for mutual exclusion, to indicate which neighbor
may move,

(vi) signali,j: indicates whether a physical region in Celli,j is empty, and

(vii) failedi,j: indicates whether or not ⟨i, j⟩ has failed.

When clear from context, the subscripts in the names of the variables are
dropped. A state of Celli, j refers to a valuation of all these variables, that
is, a function that maps each variable to a value of the corresponding type.
The complete system is an automaton, called System as in Chapter 2,
consisting of the ensemble of all the cells. A state of System is a valuation

variables
    Membersi,j : Set[P] := {}
    NEPrevi,j : Set[ID⊥] := {}
    nexti,j, signali,j, tokeni,j : ID⊥ := ⊥
    disti,j : N∞ := ∞
    failedi,j : B := false

transitions
    faili,j
        eff failedi,j := true; disti,j := ∞; nexti,j := ⊥

    updatei,j
        eff Route; Signal; Move

Figure 4.2: Specification of Celli,j.

of all the variables for all the cells. Recall from Chapter 2 that states of
System are referred to by bold letters x, x′, etc.
Variables tokeni,j , failedi,j , and NEPrevi, j are private to Celli, j , while disti, j ,
nexti, j , and signali, j can be read by neighboring cells of Celli, j , and Membersi,j
can be both read from and written to by neighboring cells of Celli, j . See
Figure 4.3. Recall from Chapter 2 that this has the following interpretation
for an actual message-passing implementation. At the beginning of each
round, Celli, j broadcasts messages containing the values of these variables
and receives similar values from its neighbors. Then, the computation of
this round updates the local variables for each cell based on the values
collected from its neighbors. Variable Membersi, j is a special variable, in
that it can also be written to by the neighbors of Celli, j . This is how
transferal of entities among cells is modeled. An entity p is quantified to
be in x.Membersi,j for a state x and ⟨i, j⟩ ∈ ID; denote by p′ the entity with
p′ = p such that p′ ∈ x′.Membersm,n, where x′ is obtained from x by some
a ∈ A and ⟨m, n⟩ ∈ ID. If a transfer does not occur, then ⟨m, n⟩ = ⟨i, j⟩, but
if a transfer occurs, then ⟨m, n⟩ ∈ Nbrsi,j.
System has two types of state transitions: fails and updates. A faili,j
transition models the crash failure of the ⟨i, j⟩th cell and sets failedi,j to true,
disti,j to ∞, and nexti,j to ⊥. A cell ⟨i, j⟩ is called failed if failedi,j is true;
otherwise it is called non-faulty. The sets of identifiers of all failed and non-
faulty cells at a state x are denoted by F(x) and NF(x), respectively. A failed
cell does nothing; it never moves and it never communicates.¹

¹disti,j = ∞ can be interpreted as its neighbors not receiving a timely
response from ⟨i, j⟩.

Figure 4.3: The interaction between a pair of neighboring cells is modeled
with shared variables Members, dist, next, and signal.

An update transition models the evolution of all non-faulty cells over
one synchronous round. For readability, the state change owing to an
update transition is written as a sequence of three functions (subroutines),
which, for each non-faulty ⟨i, j⟩,

(i) Route computes the variables disti, j and nexti,j ,

(ii) Signal computes (primarily) the variable signali, j , and

(iii) Move computes the new positions of entities.

Note that the entire update transition is atomic, so there is no possibility of
interleaving fail transitions between the subroutines of update. To reiterate,
in this discrete automaton model, all the changes in the state of System are
captured by a single atomic transition brought about by update. Thus, the
state of System at (the beginning of) round k + 1 is obtained by applying
these three functions to the state at round k. Now follows a description
of the distributed traffic control algorithm which is implemented through
these functions.
The Route function in Figure 4.4 is responsible for constructing stable
routes in the face of failures. Specifically, it constructs a distance-based
routing table for each cell that relies only on neighbors' estimates of dis-
tance to the target. Recall that failed cells have dist set to ∞. From a state x,
for each ⟨i, j⟩ ∈ NF(x), the variable disti,j is updated as 1 plus the minimum
value of dist among the neighbors of ⟨i, j⟩. If this results in disti,j being
infinity, then nexti,j is set to ⊥; otherwise, it is set to the identifier of the
neighbor with minimum dist, with ties broken by neighbor identifiers.
if ¬failedi,j ∧ ⟨i, j⟩ ≠ tid then                                            1
    disti,j := ( min_{⟨m,n⟩ ∈ Nbrsi,j} distm,n ) + 1                          2
    if disti,j = ∞ then nexti,j := ⊥                                          3
    else nexti,j := argmin_{⟨m,n⟩ ∈ Nbrsi,j} ⟨distm,n , ⟨m, n⟩⟩               4

Figure 4.4: Route function.
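The Route rule can be sketched in runnable form as follows. Storing dist and next in dictionaries keyed by cell identifier, and the function names themselves, are assumptions of this sketch, not part of the thesis model.

```python
# A sketch of the Route function of Figure 4.4; INF models dist = infinity
# on failed or disconnected cells.

INF = float("inf")

def route(cell, tid, failed, dist, nxt, nbrs):
    """One application of Route at a single cell."""
    if failed[cell] or cell == tid:
        return
    # 1 plus the minimum dist among neighbors
    dist[cell] = min(dist[n] for n in nbrs(cell)) + 1
    if dist[cell] == INF:
        nxt[cell] = None  # models next := bottom
    else:
        # argmin over pairs (dist, identifier): ties broken by identifier
        nxt[cell] = min(nbrs(cell), key=lambda n: (dist[n], n))
```

Since INF + 1 == INF for floats, a cell whose neighbors are all failed or disconnected keeps dist = ∞ and next = ⊥, matching the effect of the fail transition.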

The Signal function in Figure 4.5 executes after Route and is the key part
of the protocol for both maintaining safe entity separations and ensuring
progress of entities to the target. Roughly, each cell implements this by
following two policies: (a) accept new entities from a neighbor only when
this is safe, and (b) provide opportunities infinitely often for each nonempty
 
neighbor to make progress. First i, j sets NEPrevi, j to be the subset of
 
Nbrsi, j for which next has been set to i, j and Members is nonempty. If
tokeni, j is ⊥, then it is set to some arbitrary value in NEPrevi, j ; it continues to
be ⊥ if NEPrevi, j is empty. Otherwise, tokeni, j = m, n , which is a neighbor
 
of i, j with nonempty Members. It is checked if there is a gap of length
d on Celli, j in the direction of m, n . This is accomplished through the
conditional in Lines 4–7 as a step in guarantying fairness. If there is not
enough gap, then signali, j is set to ⊥, which blocks m, n from moving its
 
entities in the direction of i, j , thus preventing entity transfers. On the
other hand, if there is sufficient gap, then signali, j is set to tokeni, j which
 
enables m, n to move its entities towards i, j . Finally, tokeni,j is updated
to a value in NEPrevi,j that is different from its previous value, if that is
possible according to the rules just described (Lines 10–12).
Finally, the Move function in Figure 4.6 models the physical movement
of entities over a given round. For cell ⟨i, j⟩, let ⟨m, n⟩ be nexti,j. The entities
in Membersi,j move in the direction of ⟨m, n⟩ if and only if signalm,n is set to
⟨i, j⟩. In that case, all the entities in Membersi,j are shifted in the direction of
cell ⟨m, n⟩. This may lead to some entities crossing the boundary of Celli,j
if ¬failedi,j then                                                            1
    NEPrevi,j := {⟨m, n⟩ ∈ Nbrsi,j : nextm,n = ⟨i, j⟩ ∧ Membersm,n ≠ ∅}        2
    if tokeni,j = ⊥ then tokeni,j := choose from NEPrevi,j                     3
    if ((tokeni,j = ⟨i + 1, j⟩ ∧ ∀p ∈ Membersi,j : px + l/2 ≤ i + 1 − d)       4
      ∨ (tokeni,j = ⟨i − 1, j⟩ ∧ ∀p ∈ Membersi,j : px − l/2 ≥ i + d)           5
      ∨ (tokeni,j = ⟨i, j + 1⟩ ∧ ∀p ∈ Membersi,j : py + l/2 ≤ j + 1 − d)       6
      ∨ (tokeni,j = ⟨i, j − 1⟩ ∧ ∀p ∈ Membersi,j : py − l/2 ≥ j + d))          7
    then                                                                      8
        signali,j := tokeni,j                                                 9
        if |NEPrevi,j| > 1 then                                              10
            tokeni,j := choose from NEPrevi,j \ {tokeni,j}                    11
        elseif |NEPrevi,j| = 1 then tokeni,j ∈ NEPrevi,j                      12
        else tokeni,j := ⊥                                                   13
    else signali,j := ⊥; tokeni,j := tokeni,j                                14

Figure 4.5: Signal function.
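The signaling rule can be sketched in runnable form as follows, holding the shared variables in plain dictionaries. The helper name gap_clear and the state layout are assumptions of this sketch, not of the thesis model.

```python
# A sketch of the Signal function of Figure 4.5 for one cell; gap_clear
# encodes the four-way conditional of Lines 4-7 (is there a free region
# of length d on this cell next to the edge shared with the token holder?).

def gap_clear(cell, holder, members, l, d):
    i, j = cell
    if holder == (i + 1, j):
        return all(px + l / 2 <= i + 1 - d for (px, py) in members)
    if holder == (i - 1, j):
        return all(px - l / 2 >= i + d for (px, py) in members)
    if holder == (i, j + 1):
        return all(py + l / 2 <= j + 1 - d for (px, py) in members)
    if holder == (i, j - 1):
        return all(py - l / 2 >= j + d for (px, py) in members)
    return False

def signal_step(cell, st, l, d):
    """st maps names ('members', 'next', 'token', 'signal', 'failed') to
    per-cell dictionaries; st['nbrs'] is the neighbor function."""
    if st["failed"][cell]:
        return
    ne_prev = [n for n in st["nbrs"](cell)
               if st["next"][n] == cell and st["members"][n]]
    token = st["token"][cell]
    if token is None and ne_prev:
        token = ne_prev[0]  # arbitrary choice from NEPrev
    if token is not None and gap_clear(cell, token, st["members"][cell], l, d):
        st["signal"][cell] = token  # grant permission to the token holder
        others = [n for n in ne_prev if n != token]
        token = others[0] if others else (token if ne_prev else None)
    else:
        st["signal"][cell] = None  # block movement towards this cell
    st["token"][cell] = token
```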

into Cellm,n, in which case such entities are removed from Membersi,j. If
⟨m, n⟩ is not the target, then the removed entities are added to Membersm,n.
In this case (Lines 13–20), the transferred entities are placed at the edge of
Cellm,n. However, if ⟨m, n⟩ is the target, then the removed entities are not
added to any cell and thus no longer exist in System.

 
if ¬failedi,j ∧ signalnexti,j = ⟨i, j⟩ then                                    1
    let ⟨m, n⟩ = nexti,j                                                      2
    for each p ∈ Membersi,j                                                   3
        px := px + v(m − i)                                                   4
        py := py + v(n − j)                                                   5
                                                                              6
        if (m = i + 1 ∧ px + l/2 > i + 1) ∨ (m = i − 1 ∧ px − l/2 < i)        7
          ∨ (n = j + 1 ∧ py + l/2 > j + 1) ∨ (n = j − 1 ∧ py − l/2 < j)       8
        then                                                                  9
            Membersi,j := Membersi,j \ {p}                                   10
            if ⟨m, n⟩ ≠ tid                                                  11
            then Membersm,n := Membersm,n ∪ {p}                              12
                if m = i + 1 ∧ px + l/2 > i + 1                              13
                then px := m + l/2                                           14
                elseif m = i − 1 ∧ px − l/2 < i                              15
                then px := m + 1 − l/2                                       16
                elseif n = j + 1 ∧ py + l/2 > j + 1                          17
                then py := n + l/2                                           18
                elseif n = j − 1 ∧ py − l/2 < j                              19
                then py := n + 1 − l/2                                       20

Figure 4.6: Move function.
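The movement-and-transfer step can be sketched similarly: shift every entity by v towards next, and transfer any entity whose edge crosses the shared boundary, placing it just inside the receiving cell's near edge. The helper names and dictionary layout below are illustrative assumptions.

```python
# A sketch of the Move function of Figure 4.6 for one cell; entities are
# (px, py) center coordinates, and members maps each cell to its list of
# entities.

def move_step(cell, nxt, signal, members, v, l, tid):
    i, j = cell
    if nxt.get(cell) is None:
        return
    m, n = nxt[cell]
    if signal.get((m, n)) != cell:
        return  # no permission from the signaling cell this round
    for p in list(members[cell]):
        px, py = p
        px, py = px + v * (m - i), py + v * (n - j)  # shift towards <m, n>
        crossed = ((m == i + 1 and px + l / 2 > i + 1)
                   or (m == i - 1 and px - l / 2 < i)
                   or (n == j + 1 and py + l / 2 > j + 1)
                   or (n == j - 1 and py - l / 2 < j))
        members[cell].remove(p)
        if not crossed:
            members[cell].append((px, py))
        elif (m, n) != tid:
            # place the transferred entity just inside the near edge
            if m == i + 1:
                px = m + l / 2
            elif m == i - 1:
                px = m + 1 - l / 2
            elif n == j + 1:
                py = n + l / 2
            else:
                py = n + 1 - l / 2
            members[(m, n)].append((px, py))
        # entities that cross into the target are simply consumed
```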

 
The source cells ⟨i, j⟩ ∈ SID, in addition to the above, add at most one
entity in each round to Membersi,j, such that the addition of an entity does
not violate the minimum gap between entities at Celli,j.

4.3 Analysis
In this section we present an analysis of System with regard to safety and
progress properties. Roughly, the safety property is an invariant that for all
reachable states there is a minimum gap between entities, and the progress
property requires that all entities which reside on cells with feasible paths
to the target, eventually reach the target.
See Figure 4.7 for a graphical outline of the properties.

Figure 4.7: Set view of desired properties of System. Start states Q0 at
least satisfy Safe. Failure-free executions are represented by lines with
arrows labeled αff. Safe is shown to be invariant along any execution. It is
shown that eventually stable routes to the target cell are formed, which
allows entities to make progress towards the target. Along executions
with failures, represented by the red line with an arrow labeled αf, neither
stable routes nor progress are necessarily upheld, but Safe is invariant.
However, any failure-free execution is then guaranteed to reach states
satisfying stable routes and progress for cells which have paths to the
destination, and thus any entity on a cell with a feasible path to the
destination eventually reaches the destination.
4.3.1 Safety Analysis
A state is safe if for every cell, the distance between the centers of any two
entities along either coordinate is at least d. Thus, in a safe state, the edges
of all entities in a cell are separated by a distance of rs . However, the entities
in two adjacent cells may have edges spaced apart by less, although their
centers will be spaced by at least l.
For any state x of System, define:

    Safei,j(x) ≜ ∀p, q ∈ x.Membersi,j with p ≠ q, (|px − qx| ≥ d) ∨ (|py − qy| ≥ d), and
    Safe(x) ≜ ∀⟨i, j⟩ ∈ ID, Safei,j(x).
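The two predicates can be transcribed directly as checks usable in a simulation or test harness. The function names below are illustrative, not from the thesis.

```python
# A direct transcription of the Safe predicates: within each cell, any two
# distinct entity centers must differ by at least d along some axis.

def safe_cell(members, d):
    """Safe_{i,j}: every pair of distinct entities is separated by >= d
    along the x-axis or the y-axis (entities given as (px, py) centers)."""
    ps = list(members)
    return all(abs(p[0] - q[0]) >= d or abs(p[1] - q[1]) >= d
               for a, p in enumerate(ps) for q in ps[a + 1:])

def safe(members_by_cell, d):
    """Safe: the per-cell predicate holds for every cell."""
    return all(safe_cell(ms, d) for ms in members_by_cell.values())
```

In a safe state the edges of entities within a cell are then separated by at least rs, since d = rs + l.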

The safety property is that Safe(x) is an invariant and thus satisfied for all
reachable states. We proceed by proving some preliminary properties of
System which will be used for establishing the desired safety property. The
following invariant asserts that no entity straddles the boundary between
cells. This is a consequence of transferring entities upon an entity's edge
touching an edge of a cell, and then resetting the entity's position to be
within the new cell.
 
Invariant 4.1. In any reachable state x, ∀ i, j ∈ ID, ∀p ∈ x.Membersi, j

l l
i+ ≤ px ≤ i + 1 − , and
2 2
l l
j+ ≤ py ≤ j + 1 − .
2 2

The next invariant states that cells’ Members are disjoint. This is imme-
diate from the Move function since entities are only added to one cell’s
Members upon being removed from a different cell’s Members.
 
Invariant 4.2. In any reachable state x, for any distinct i, j , m, n ∈ ID,
x.Membersi, j ∩ x.Membersm,n = ∅.

Next, we define a predicate which states that if signali,j is set to some
⟨m, n⟩ ∈ Nbrsi,j, then there is a large enough gap from the common edge
in which no entities exist in Celli,j. For a state x,

    H(x) ≜ ∀⟨i, j⟩ ∈ ID, ∀⟨m, n⟩ ∈ Nbrsi,j,
           if x.signali,j = ⟨m, n⟩ then exactly one of the following holds:

             m = i + 1 ∧ ∀p ∈ x.Membersi,j, px + l/2 ≤ i + 1 − d, or
             m = i − 1 ∧ ∀p ∈ x.Membersi,j, px − l/2 ≥ i + d, or
             n = j + 1 ∧ ∀p ∈ x.Membersi,j, py + l/2 ≤ j + 1 − d, or
             n = j − 1 ∧ ∀p ∈ x.Membersi,j, py − l/2 ≥ j + d.

H(x) is not an invariant property because once entities move the property
may be violated. However, for proving safety all that needs to be estab-
lished is that at the point of computation of the signal variable this property
holds. The next key lemma states this.

Lemma 4.3. For all reachable states x, H(x) ⇒ H(xS), where xS is the state
obtained by applying the Route and Signal functions to x.

Proof: Fix a reachable state x, an ⟨i, j⟩ ∈ ID, and an ⟨m, n⟩ ∈ Nbrsi,j such
that x.signali,j = ⟨m, n⟩. Let xR be the state obtained by applying the Route
function of Figure 4.4 to x and xS be the state obtained by applying the
Signal function of Figure 4.5 to xR.
Without loss of generality, assume ⟨m, n⟩ = ⟨i − 1, j⟩, so if x.signali,j =
⟨i − 1, j⟩, then ∀p ∈ x.Membersi,j, px − l/2 ≥ i + d. First, observe that H(xR)
holds, because the Route function does not change any of the variables
involved in the definition of H(·). Next, we show that H(xR) implies H(xS).
There are two possible cases. First, if xS.signali,j ≠ ⟨m, n⟩, then the statement
holds vacuously. Second, when xS.signali,j = ⟨i − 1, j⟩, the second condition
in H(xR) and Figure 4.5, Line 5 imply H(xS). The cases where ⟨m, n⟩ takes
the other values in Nbrsi,j follow by symmetry.

The following lemma asserts that if there is a cycle of length two formed
by the signal variables, then entity transfers cannot occur between the
involved cells in that round.

Lemma 4.4. Let x be any reachable state and x′ be a state that is reached from x
after a single update transition (round). If x.signali,j = ⟨m, n⟩ and x.signalm,n =
⟨i, j⟩, then x.Membersi,j = x′.Membersi,j and x.Membersm,n = x′.Membersm,n.

Proof: No entities enter either x′.Membersi,j or x′.Membersm,n from any other
⟨a, b⟩ ∈ Nbrsi,j or ⟨c, d⟩ ∈ Nbrsm,n, since x.signali,j = ⟨m, n⟩ and x.signalm,n =
⟨i, j⟩. Assume without loss of generality that ⟨m, n⟩ = ⟨i − 1, j⟩. It remains
to be established that there is no p ∈ x.Membersi,j such that p′ ∈ x′.Membersi−1,j
with p′ = p, or vice-versa. For such a transfer to occur, px must be such that
p′x − l/2 = px − v − l/2 < i by Figure 4.6, Lines 4 and 7. But for x.signali,j =
⟨i − 1, j⟩ to be satisfied, it must have been the case that px − l/2 ≥ i + d by
Figure 4.5, Line 5, and since v < l ≤ d, a contradiction is reached; the case
of a transfer from ⟨i − 1, j⟩ into ⟨i, j⟩ is symmetric, using the grant condition
for x.signali−1,j = ⟨i, j⟩.

Now we state and prove the safety property of System.

Theorem 4.5. For any reachable state x, Safe(x).

Proof: The proof is by standard induction over the length of any execution
of System. The base case is satisfied by the initialization assumption. For
the inductive step, consider reachable states x, x′ and an action a ∈ A such
that x′ is obtained from x by applying a. Fix ⟨i, j⟩ ∈ ID and, assuming
Safei,j(x), show that Safei,j(x′).
If a = faili,j, then clearly Safe(x′), as none of the entities move.
For a = update, there are two cases to consider. First, x′.Membersi,j ⊆
x.Membersi,j. There are two sub-cases: if x′.Membersi,j = x.Membersi,j, then
all entities in x.Membersi,j move identically and the spacing between two
distinct entities p′, q′ ∈ x′.Membersi,j is unchanged. That is, ∀p, q ∈
x.Membersi,j, ∀p′, q′ ∈ x′.Membersi,j such that p′ = p, q′ = q, and p ≠ q,
|p′x − q′x| = |(px + vc) − (qx + vc)| = |px − qx|, where c is a constant. It follows
that |p′x − q′x| ≥ d. By similar reasoning it follows that |p′y − q′y| is also at
least d. The second sub-case arises if x′.Membersi,j ⊊ x.Membersi,j, in which
case Safei,j(x′) is either vacuously satisfied or it is satisfied by the same
argument as above.
The second case is when x′.Membersi,j ⊄ x.Membersi,j, that is, there exists
some entity p′ ∈ x′.Membersi,j that was not in x.Membersi,j. There are two
sub-cases. The first sub-case is when p′ was added to x′.Membersi,j because
⟨i, j⟩ ∈ SID. In this case, the specification of the source cells states that
the entity p′ was added to x′.Membersi,j without violating Safei,j(x′), and the
proof is complete. Otherwise, p′ was added to x′.Membersi,j by a neighbor,
so p ∈ x.Membersi′,j′ for some ⟨i′, j′⟩ ∈ Nbrsi,j. Without loss of generality,
assume that i′ = i − 1 and j′ = j. That is, p was transferred to Celli,j from
its left neighbor. From Line 14 of Figure 4.6 it follows that p′x = i + l/2.
The fact that p transferred from Celli′,j′ in x to Celli,j in x′ implies that
x.nexti′,j′ = ⟨i, j⟩ and x.signali,j = ⟨i′, j′⟩; these are necessary conditions for
the transfer. Thus, applying at state x the second inequality from H(x)
and Lemma 4.3, it follows that for every q ∈ x.Membersi,j, qx ≥ i + d + l/2.
It must be established that every q′ ∈ x′.Membersi,j with q′ ≠ p′ satisfies
q′x ≥ i + d + l/2, which means that q did not move towards p. This follows
by application of Lemma 4.4, which states that if entities on adjacent cells
move towards one another simultaneously, then a transfer of entities cannot
occur. This implies that all entities q′ in x′.Membersi,j have edges separated
by more than rs from the edges of any such entity p′, implying Safei,j(x′),
since p′x = i + l/2 and q′x ≥ i + d + l/2, so q′x − p′x ≥ d. Finally, since ⟨i, j⟩
was chosen arbitrarily, Safe(x′).

Theorem 4.5 shows that System is safe in spite of failures.

4.3.2 Stabilization of Routing


We show that under mild assumptions, once new failures cease to occur,
System recovers to a state where each entity on a non-faulty cell with a
feasible path to the target makes progress towards it.
 
For a state x, inductively define the path distance ρ of any cell ⟨i, j⟩ ∈ ID
as the distance to the target through non-faulty cells. Let

    ρ(x, ⟨i, j⟩) ≜  ∞                                               if failedi,j,
                   0                                               if ⟨i, j⟩ = tid,
                   1 + min_{⟨m,n⟩ ∈ Nbrsi,j ∩ NF(x)} ρ(x, ⟨m, n⟩)    otherwise.

A cell is said to be target connected if its path distance is finite. We define

    TC(x) ≜ {⟨i, j⟩ : ρ(x, ⟨i, j⟩) < ∞}

as the set of cell identifiers that are connected to the target through non-
faulty cells.
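Since every hop between neighboring cells adds exactly 1, the inductive definition of ρ can be computed by breadth-first search from tid over non-faulty cells. This is an illustrative sketch of the definition, not the distributed Route rule itself; the function names are assumptions.

```python
# Compute the path distance rho and the target-connected set TC by BFS
# from tid over non-faulty cells (equivalent to the inductive definition,
# since each hop contributes exactly 1).
from collections import deque

def path_distances(N, tid, failed):
    INF = float("inf")
    rho = {(i, j): INF for i in range(N) for j in range(N)}
    if failed(tid):
        return rho
    rho[tid] = 0
    queue = deque([tid])
    while queue:
        i, j = queue.popleft()
        for m, n in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if (0 <= m < N and 0 <= n < N and not failed((m, n))
                    and rho[(m, n)] == INF):
                rho[(m, n)] = rho[(i, j)] + 1
                queue.append((m, n))
    return rho

def target_connected(rho):
    """TC: cells whose path distance is finite."""
    return {c for c, h in rho.items() if h < float("inf")}
```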
The analysis relies on the following assumptions on the environment of
System which controls the occurrence of fail transitions and the insertion
of entities by the source.

(a) The target cell does not fail.

(b) Source cells ⟨s, t⟩ ∈ SID place entities in Memberss,t without blocking
    any of their nonempty non-faulty neighbors perpetually. That is, for
    any execution α of System, if there exists an ⟨i, j⟩ ∈ Nbrss,t such that
    for every state x in α after a certain round, ⟨i, j⟩ ∈ x.NEPrevs,t, then
    eventually signals,t becomes equal to ⟨i, j⟩ in some round of α.

Recall from Chapter 2 that a fault-free execution fragment α is a sequence
of states starting from x along which there are no faili,j transitions for any
⟨i, j⟩ ∈ NF(x). Intuitively, a fault-free execution fragment is an execution
fragment with no new failures, although for the first state x of α, F(x) need
not be empty.
 
Lemma 4.6. Consider any reachable state x of System and any ⟨i, j⟩ ∈ TC(x) \
{tid}. Let h = ρ(x, ⟨i, j⟩). Any fault-free execution fragment α starting from x
stabilizes in h rounds to a set of states S with all elements satisfying:

    disti,j = h, and
    nexti,j = ⟨in, jn⟩, where ρ(x, ⟨in, jn⟩) = h − 1.

Proof: Fix an arbitrary state x, a fault-free execution fragment α starting
from x, and ⟨i, j⟩ ∈ TC(x) \ {tid}. We have to show that the set of states S
defined by the above equations is closed under update transitions and that
after h rounds, the execution fragment α enters S.
First, by induction on h we show that S is stable. Consider any state
y ∈ S and a state y′ that is obtained by applying an update transition to
y. We have to show that y′ ∈ S. For the base case, h = 1, so y.disti,j = 1
and y.nexti,j = tid. From Lines 2 and 4 of the Route function in Figure 4.4,
and the fact that there is a unique tid, it follows that y′.disti,j remains 1 and
y′.nexti,j remains tid. For the inductive step, the inductive hypothesis is that
for any given h, if for any ⟨i′, j′⟩ ∈ NF(x), y.disti′,j′ = h and y.nexti′,j′ = ⟨m, n⟩,
for some ⟨m, n⟩ ∈ ID with ρ(x, ⟨m, n⟩) = h − 1, then

    y′.disti′,j′ = h and y′.nexti′,j′ = ⟨m, n⟩.

Now consider ⟨i, j⟩ such that ρ(y, ⟨i, j⟩) = ρ(y′, ⟨i, j⟩) = h + 1. In order to
show that S is closed, we have to assume that y.disti,j = h + 1 and y.nexti,j =
⟨m, n⟩, and show that the same holds for y′. Since ρ(y′, ⟨i, j⟩) = h + 1, ⟨i, j⟩
does not have a neighbor with path distance smaller than h. The required
result follows from applying the inductive hypothesis to ⟨m, n⟩ and from
Lines 2 and 4 of Figure 4.4.
Next, we have to show that starting from x, α enters S within h rounds.
Once again, this is established by inducting on h, which is ρ(x, ⟨i, j⟩). The
base case only includes the paths satisfying h = ρ(x, ⟨i, j⟩) = 1 and follows
by instantiating ⟨in, jn⟩ = tid. For the inductive case, assume that at round
h, disti′,j′ = h and nexti′,j′ = ⟨in, jn⟩ such that ρ(x, ⟨in, jn⟩) = h − 1 and ⟨in, jn⟩
is the minimum identifier among all such cells. Observe that one such
⟨i′, j′⟩ ∈ Nbrsi,j exists by the definition of TC. Then at round h + 1, by Lines
2 and 4 of Figure 4.4, disti,j = disti′,j′ + 1 = h + 1.

The following corollary of Lemma 4.6 states that after new failures cease
occurring, all target connected cells get their next variables set correctly
within at most O(N²) rounds. It follows since the value of h in Lemma 4.6
for any target connected cell is at most N².

Corollary 4.7. Consider any execution of System with an arbitrary but finite
sequence of fail transitions. Within O(N²) rounds of the last fail transition, every
target connected cell ⟨i, j⟩ in System has nexti,j fixed permanently to the identifier
of the next cell along such a path.

4.3.3 Progress of Entities Towards the Target


Using the results from the previous sections, we show that once new fail-
ures cease occurring, every entity on a target connected cell eventually gets
to the target. The result is Theorem 4.10 and uses two lemmas which es-
tablish that, along every infinite execution with a finite number of failures,
every nonempty target connected cell gets permission to move infinitely
often (Lemma 4.9), and a permission to move allows the entities on a cell
to make progress towards the target (Lemma 4.8). The latter is simpler and
comes first.
For the remainder of this section, fix an arbitrary infinite execution α of
System with a finite number of failures. Let xf be the state of System at
the round after the last failure, and α′ be the infinite failure-free execution
fragment xf, xf+1, . . . of α starting from xf. Observe that TC(xf) = TC(xf+1) =
· · ·, so define TC to be TC(xf).
   
Lemma 4.8. For any ⟨i, j⟩ ∈ TC and k > f, if xk.signalm,n = ⟨i, j⟩ and xk.nexti,j =
⟨m, n⟩, then ∀p ∈ xk.Membersi,j: if p′ ∈ xk+1.Membersi,j with p′ = p, then

    |p′x − m| < |px − m|, or |p′y − n| < |py − n|;

otherwise, if p′ ∈ xk+1.Membersm,n with p′ = p, then

    m ≤ p′x ≤ m + 1, or n ≤ p′y ≤ n + 1.

Proof: The first case is when no entity transfers from ⟨i, j⟩ to ⟨m, n⟩ in
the (k + 1)th round. In this case, the result follows since velocity is applied
towards ⟨m, n⟩ by Move in Figure 4.6, Lines 4–5. The second case is when
some entity p transfers from ⟨i, j⟩ to ⟨m, n⟩, in which case p′x ∈ [m, m + 1] or
p′y ∈ [n, n + 1] by Figure 4.6, Lines 13–20.

 
Lemma 4.9. Consider any ⟨i, j⟩ ∈ TC \ {tid}. For all k > f, if xk.Membersi,j ≠ ∅,
then there exists k′ > k such that xk′.signalnexti,j = ⟨i, j⟩.

Proof: Since ⟨i, j⟩ ∈ TC, there exists h < ∞ such that for all k > f,
ρ(xk, ⟨i, j⟩) = h. We prove the lemma by induction on h. The base case is
h = 1. Fix ⟨i, j⟩ and instantiate k′ = k + 4. By Lemma 4.6, for all non-faulty
⟨i, j⟩ ∈ Nbrstid, xf.nexti,j = tid since k > f. For all k > f, if xk.Membersi,j ≠ ∅,
then signaltid changes to a different neighbor with entities every round: it is
the case that |xk.NEPrevtid| ≤ 4, and since Memberstid = ∅ always, exactly one
of Figure 4.5, Lines 4–7 is satisfied in any round; thus within 4 rounds,
signaltid = ⟨i, j⟩.
For the inductive case, let ks = k + h be the step in α after which all
non-faulty ⟨a, b⟩ ∈ Nbrsi,j have xks.nexta,b fixed, by Lemma 4.6. Also by
Lemma 4.6, ∃⟨m, n⟩ ∈ Nbrsi,j such that xks.distm,n < xks.disti,j, implying that
after ks, |xks.NEPrevi,j| ≤ 3, since xks.nexti,j = ⟨m, n⟩ and xks.nextm,n ≠ ⟨i, j⟩.
By the inductive hypothesis, xks.signalnexti,j = ⟨i, j⟩ infinitely often. If ⟨i, j⟩ ∈
SID, then entity initialization does not prevent xk.signali,j = ⟨a, b⟩ from
being satisfied infinitely often, by the second assumption introduced in
Subsection 4.3.2. It remains to be established that signali,j = ⟨a, b⟩ infinitely
often, where ⟨a, b⟩ ∈ xks.NEPrevi,j with ρ(xks, ⟨a, b⟩) = h + 1.
If |xks.NEPrevi,j| = 1, then because the inductive hypothesis satisfies
signalnexti,j = ⟨i, j⟩ infinitely often, Lemma 4.8 applies infinitely often, and
thus Membersi,j = ∅ infinitely often, finally implying that signali,j = ⟨a, b⟩
infinitely often.
If |xks.NEPrevi,j| > 1, there are two sub-cases. The first sub-case is when
no entity enters ⟨i, j⟩ from some ⟨c, d⟩ ≠ ⟨a, b⟩ in xks.NEPrevi,j, which follows
by the same reasoning used in the |xks.NEPrevi,j| = 1 case. The second
sub-case is when an entity enters ⟨i, j⟩ from ⟨c, d⟩, in which case it must
be established that signali,j = ⟨a, b⟩ infinitely often. This follows since if
xk′.tokeni,j = ⟨a, b⟩, where k′ > kt > ks and kt is the round at which an
entity entered ⟨i, j⟩ from ⟨c, d⟩, and the appropriate case of Lemma 4.3 is
not satisfied, then xk′+1.signali,j = ⊥ and xk′+1.tokeni,j = ⟨a, b⟩ by Figure 4.5,
Line 14. This implies that no more entities enter ⟨i, j⟩ from any cell ⟨c, d⟩
satisfying ⟨c, d⟩ ≠ ⟨a, b⟩. Thus tokeni,j = ⟨a, b⟩ infinitely often follows by the
same reasoning as the |xks.NEPrevi,j| = 1 case.

The final theorem establishes that entities on any cell in TC eventually
reach the target in α′.
 
Theorem 4.10. Consider any ⟨i, j⟩ ∈ TC. For all k > f and all p ∈ xk.Membersi,j,
there exists k′ > k such that p ∈ xk′.Membersnexti,j.
 
Proof: Fix ⟨i, j⟩ ∈ TC, a round k > f, and p ∈ xk.Membersi,j. Let h =
max_{⟨i,j⟩ ∈ TC} ρ(xf, ⟨i, j⟩), which is finite. By Lemma 4.6, at every round after
ks = k + h, for any ⟨i, j⟩ ∈ TC, the sequence of identifiers β = ⟨i, j⟩, xks.nexti,j,
xks.next(xks.nexti,j), . . . forms a fixed path to tid. Applying Lemma 4.9 to ⟨i, j⟩ ∈
TC shows that there exists km ≥ ks such that xkm.signalnexti,j = ⟨i, j⟩. Now
applying Lemma 4.8 to xkm establishes movement of p towards xks.nexti,j,
which is also xkm.nexti,j. Lemma 4.9 further establishes that this occurs
infinitely often; thus there is a round k′ > km such that p gets transferred to
xk′.Membersnexti,j.

By a simple induction on the sequence of identifiers in the path β, it follows
that entities on any cell in TC eventually get consumed by the target.

4.4 Simulation
We have performed several simulation studies of the algorithm to evaluate
its throughput performance. In this section, we discuss the main
findings with illustrative examples taken from the simulation results. Let
the K-round throughput of System be the total number of entities arriving
at the target over K rounds, divided by K. We define the average throughput
(henceforth throughput) as the limit of K-round throughput for large K.
All simulations start at a state where all cells are empty, and subsequently
entities are added to the source cells.
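The throughput metric above is straightforward to compute from a per-round log of arrivals at the target; the following Python sketch illustrates it with an invented arrival sequence (the simulator itself is not reproduced here):

```python
def k_round_throughput(arrivals, K):
    """K-round throughput: total entities reaching the target
    over the first K rounds, divided by K."""
    return sum(arrivals[:K]) / K

# Hypothetical per-round arrival log: one entity arrives every 4th round.
arrivals = [1 if k % 4 == 0 else 0 for k in range(2500)]

# For large K, the K-round throughput approximates the average throughput.
print(k_round_throughput(arrivals, 2500))  # 0.25
```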

Figure 4.8: Throughput versus safety spacing rs for several values of v, for
K = 2500, l = 0.25 for System with 8 × 8 cells.

Throughput without failures as a function of rs, l, v. Rough calculations
show that throughput should be proportional to cell velocity v, and
inversely proportional to safety distance rs and entity length l. Figure 4.8
shows throughput versus rs for several choices of v for an 8 × 8 instance

of System. The parameters are set to l = 0.25, SID = {⟨1, 0⟩}, tid = ⟨1, 7⟩,
and K = 2500. The entities move along the path β ≜ ⟨1, 0⟩, ⟨1, 1⟩, ⟨1, 2⟩,
⟨1, 3⟩, ⟨1, 4⟩, ⟨1, 5⟩, ⟨1, 6⟩, ⟨1, 7⟩ with length 8. For the most part, the
proportional relationship with v holds as expected: all other factors remaining
the same, a lower velocity makes each entity take longer to move away
from the boundary, which causes the predecessor cell to be blocked more
frequently, and thus fewer entities reach tid from any element of SID in the
same number of rounds. In cases with low velocity (for example v = 0.1)
and for very small rs , however, the throughput can actually be greater than
that at a slightly higher velocity. We conjecture that this somewhat sur-
prising effect appears because at very small safety spacing, the potential
for safety violation is higher with faster speeds, and therefore there are
many more blocked cells per round. We also observe that the throughput
saturates at a certain value of rs (≈ 0.55). This situation arises when there
is roughly only one entity in each cell.

Throughput without failures as a function of the path. For a sufficiently
large K, throughput is independent of the length of the path. This of course
varies based on the particular path and instance of System considered, but,
all other variables fixed, this relationship is observed. More interesting,
however, is the relationship between throughput and path complexity,
measured in the number of turns along a path. Figure 4.9 shows throughput
versus the number of turns along paths of length 8. This illustrates
that throughput decreases as the number of turns increases, up to a point
at which the decrease in throughput saturates. This saturation is due to
signaling and indicates that there exists only one entity per cell.
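The turn count used as the path-complexity measure can be computed directly from the sequence of cell identifiers; a small Python sketch (the bent example path is hypothetical, not one from the experiments):

```python
def num_turns(path):
    """Count direction changes along a sequence of grid cells (i, j)."""
    turns = 0
    for a, b, c in zip(path, path[1:], path[2:]):
        d1 = (b[0] - a[0], b[1] - a[1])  # direction entering cell b
        d2 = (c[0] - b[0], c[1] - b[1])  # direction leaving cell b
        if d1 != d2:
            turns += 1
    return turns

straight = [(1, j) for j in range(8)]            # the straight path of length 8
bent = [(1, 0), (1, 1), (2, 1), (2, 2), (1, 2)]  # hypothetical path with turns
print(num_turns(straight), num_turns(bent))      # 0 3
```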

Throughput under failure and recovery of cells. Finally, we considered
a random failure and recovery model in which, at each round, each non-faulty
cell fails with some probability pf and each faulty cell recovers with
some probability pr [33]. A recovery sets failed⟨i,j⟩ = false and in the case of tid
also resets disttid = 0, so that eventually Route will correct next⟨m,n⟩ and dist⟨m,n⟩
for any ⟨m, n⟩ ∈ TC. Intuitively, we expect that throughput will decrease
as pf increases and increase as pr increases. Figure 4.10 demonstrates this
result for 0.01 ≤ pf ≤ 0.05 and 0.05 ≤ pr ≤ 0.2. Interestingly, there is
roughly a marginal return on increasing pr for a fixed pf, in that for a fixed

Figure 4.9: Throughput versus number of turns along a path, for a path of
length 8, where K = 2500, rs = 0.05, and each of l and v are varied for
System with 8 × 8 cells.

pf, increasing pr results in smaller throughput gains.
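Each cell in this model is an independent two-state Markov chain, so in the long run a cell is non-faulty for a fraction pr/(pf + pr) of rounds; a Monte Carlo sketch of a single cell illustrates this (the parameters are from the ranges above, but the implementation details are our own):

```python
import random

def availability(p_f, p_r, rounds, seed=0):
    """Fraction of rounds a single cell is non-faulty under the per-round
    model: non-faulty cells fail w.p. p_f, faulty cells recover w.p. p_r."""
    rng = random.Random(seed)
    faulty, up = False, 0
    for _ in range(rounds):
        if faulty:
            if rng.random() < p_r:
                faulty = False
        elif rng.random() < p_f:
            faulty = True
        if not faulty:
            up += 1
    return up / rounds

est = availability(p_f=0.05, p_r=0.2, rounds=200_000)
print(est)  # close to the stationary value 0.2 / (0.05 + 0.2) = 0.8
```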

4.5 Conclusion
This case study presented a self-stabilizing distributed traffic control
protocol for the partitioned plane, where each partition controls the motion
of all entities within that partition. The algorithm guarantees separation
between entities in the face of crash failures of the software controlling a
partition. Once new failures cease to occur, it guarantees progress of all
entities that are not isolated by failed partitions to the target. Through
simulations, throughput was estimated as a function of velocity, minimum
separation, path complexity, and failure-recovery rates. The algorithm is
presented for a two-dimensional square-grid partition; however, an
extension to three-dimensional cube partitions follows in an obvious way.

Figure 4.10: Throughput versus failure rate pf for several recovery rates pr
with an initial path of length 8, where K = 20000, rs = 0.05, l = 0.2, and
v = 0.2 for System with 8 × 8 cells.

CHAPTER 5

SAFE FLOCKING ON LANES

5.1 Introduction
The goal of the safe flocking problem is to ensure that a collection of agents:

(a) always maintain a minimum safe separation (that is, the agents avoid
collisions),

(b) form a roughly equally spaced formation or a flock, and

(c) reach a specified destination.

The flocking problem has a rich body of literature (see, for example [86–89],
and the references therein) and has several applications in robotics and
automation, such as robotic swarms and the automated highway system.
This case study considers flocking in one dimension where some agents
may fail.
In order to allow non-faulty agents to avoid colliding with faulty ones,
there must be a way for the non-faulty agents to go around them. In this
thesis, this is addressed by allowing different agents to reside in different
lanes; see Figure 5.1 on page 55. A lane is a real line and there are finitely
many lanes. Informally, a non-faulty agent can then avoid collisions by mi-
grating to a different lane appropriately. Several agents can, and normally
do, reside in a single lane.

5.1.1 Overview of the Problem

The algorithm is obtained by modifying and fixing a bug in an algorithm
in [90], and combining that with Chandy-Lamport's global snapshot
algorithm [53]. The key idea of the algorithm is as follows: each agent

periodically computes its target based on the messages received from its
neighbors and moves towards the target with some arbitrary but bounded
velocity. The targets are computed such that the agents preserve safe sep-
aration and they eventually form a weak flock configuration. Once a weak
flock is formed it remains invariant, and progress is ensured to a tighter
strong flock. Once a strong flock is attained by the set of agents, this property
can be detected through the use of a distributed snapshot algorithm [53].
Once the snapshot algorithm detects that the global state of the system sat-
isfies the strong flock predicate, the detecting agent makes a move towards
the destination, sacrificing the strong flock, but still preserving the weak
flock.
Actuator failures are modeled as exogenous events that set the velocity of
a non-failed agent to an arbitrary but constant value. This could correspond
to a robot’s motors being stuck at an input voltage, causing the robot to
forever move in a given direction with a constant speed. Likewise it could
correspond to a control surface being stuck in a given position, resulting
in movement forever in a given direction. After failure, the failed agent
continues to compute targets, send and receive messages, but its actuators
simply ignore all this and continue to move with the failure velocity.
Certain failures lead to immediate violation of safety, while others, such
as failing with zero velocity at the destination, are undetectable. The
algorithm determines only the direction in which an agent should move,
based on neighbor information. The speed with which it moves is left
as a non-deterministic choice. Thus, the only way of detecting failures
is to observe that an agent has moved in the wrong direction. Under
some assumptions about the system parameters, a simple lower-bound
is established, indicating that no detection algorithm can detect failures
in less than O(N) rounds. A failure detector is presented that utilizes
this idea in detecting certain classes of failures in O(N) rounds. Finally,
it is shown that the failure detector can be combined with the flocking
algorithm to guarantee the required safety and progress properties in the
face of a restricted class of actuator failures.

5.1.2 Literature on Flocking and Consensus in Distributed
Computing and Controls

The distributed computing consensus problem, that of a set of processors
agreeing upon some common value based on some inputs, under a variety
of communication constraints (synchronous, partially synchronous, or
totally asynchronous) and failure situations, has been studied extensively
by the distributed systems community [11, 34, 91]. In the consensus problem
in distributed systems, every agent has an input from a well-ordered
set and must satisfy the following conditions:

(a) a termination condition: eventually every non-faulty agent must
decide on a value,

(b) an agreement condition: all decisions by non-faulty agents must be
the same, and

(c) a validity condition: if all inputs to all agents are the same, then the
value decided by all non-faulty agents must be the common input.

Different assumptions on the timing model and on the types of failures
agents may suffer can make the problem impossible to solve [92], or
very difficult, for instance under Byzantine faults.
The controls community has also studied a consensus problem, but with
a different formulation [93]. In the controls community, variants of this
problem are known as multi-agent coordination, consensus, flocking,
rendezvous, or the averaging problem [86, 90, 94–96]. The controls problem
can be thought of as having sensors (observability) and actuators
(controllability) [97], in which case a failure of sensors or actuators can lead to a loss
of observability or controllability, respectively, leading to the inability to
solve the problem.
There is, however, a difference between the controls formulation of the
problem and the distributed computing formulation. The controls
problem does not have any termination requirement, as in general
the error (from flocking or the average) asymptotically approaches a fixed
point, whereas in the distributed computing formulation, the output of
the algorithm must be decided at a single point in time. A stronger requirement
can be imposed, however, by allowing the algorithm to terminate upon
reaching a neighborhood of the fixed point, that is, by allowing the error
to approach a set about the equilibrium instead of the equilibrium itself,
giving finite-time termination. Such constraints have been imposed on this
problem by the controls community, normally through quantization of
sensor or actuator values [98, 99].
Some attention has been given to the problem of failure detection in the
flocking problem. Most closely related to this case study is [61], which
works with a similar model of actuator failures. That work discusses
using the developed motion probes in failure detection scenarios, but
states no bounds on detection time, as more effort was spent ensuring
convergence to the failure-free centroid assuming that failure detection has
occurred within some time. To the best of the author's knowledge, there
has been no work on provable avoidance of collisions with such a failure
model, only detection of such failures and mitigation to ensure progress
(convergence).

5.2 System Model

In this section, we present a formal model of the distributed flocking
problem, modeled as a shared-state distributed cyber-physical system (SSDCPS)
as introduced in Chapter 2. Refer to Chapter 2 for relevant preliminaries.

5.2.1 Overview of Distributed Flocking


The distributed system consists of a set of at most N mobile agents. Each
of these agents is physically positioned on one of NL infinite, parallel lanes.
These lanes are real lines and can be thought of as the lanes on a highway.
Refer to Figure 5.1 for clarity. The software of each agent implements
a distributed flocking algorithm. Specifically, the algorithm coordinates the
agents so that they form a flock, or a roughly equally spaced formation, and
migrate as a flock towards a goal, all without collision.
The algorithm operates in synchronous rounds. At each round, each
agent exchanges messages bearing state information with its neighbors.
The neighbors of an agent are the other agents that are sufficiently close to
the agent, independent of the lane upon which the agents are positioned.

Based on these messages, the agents update their software state and decide
their (possibly zero) velocities. Until the beginning of the next round, the
agents continue to operate according to this velocity. However, an agent
may fail, that is, it may get stuck with a (possibly zero) velocity, in spite
of different computed velocities. The key novelty of this case study is
that the algorithm incorporates failure detection and collision prevention
mechanisms.
Assume that the messages are delivered within bounded time and
computations are instantaneous. Recall from Chapter 2 that under these
assumptions, the system can be modeled as an SSDCPS as defined in that
chapter. Refer to an individual agent as Agenti. The SSDCPS model
follows.

Figure 5.1: Example system at state x for N = 8, NF(x) = {2, 3, 5, 6, 8},
F(x) = {1, 4, 7}. Agent identifiers and communications radius rc are shown
to display connectivity of the graph.

5.2.2 Formal System Model

Let ID ≜ [N − 1] be the set of identifiers for all possible agents that may
be present in the system. Each agent has a unique identifier i ∈ ID. Each
agent is positioned on a lane with an identifier in the set IDL ≜ [NL − 1].
The following constant parameters are used throughout this chapter:
(i) rs: minimum required inter-agent gap or safety distance when there
are no faulty agents in the system,

(ii) rr: reduced safety distance when there are faulty agents in the system,

(iii) rc: communications distance,

(iv) rf: desired maximum inter-agent gap, which defines a flock,

(v) δ: flocking tolerance parameter,

(vi) β: quantization parameter, and

(vii) vmin, vmax: minimum and maximum velocity, or minimum and maximum
distance by which an agent may move over one round.

State Variables. Each Agenti has the following state variables, where
initial values of the variables are shown in Figure 5.2 using the ‘:=’ notation.
(a) x, xo: position and old position (from the previous round) of agent i on
the real line,

(b) u, uo: target position and old target position (from the previous round)
of agent i on the real line,

(c) lane: the parallel real line upon which agent i is physically located,

(d) failed: indicates whether or not agent i has failed permanently,

(e) vf : velocity with which agent i moves upon failing, and

(f) Suspected: set of neighbors that agent i believes to have failed.


Recall from Chapter 2 that a state of Agenti refers to a valuation of all the
above variables. A state of the SSDCPS modeling the complete ensemble
of agents, called System, is a valuation of all the variables for all the agents.
We refer to states of System with bold letters x, x , etc., and individual state
components of Agenti by x.xi , x.ui , etc.

variables
  x, xo : R
  u, uo : R := x
  lane : IDL := 1
  snaprun : B := false
  gsf : B := false
  failed : B := false
  vf : R⊥ := ⊥
  Suspected : Set[ID⊥] := {}
  Nbrs : Set[ID] := Nbrs(x, i)
  L : Nbrs := LS(x, i)
  R : Nbrs := RS(x, i)

Figure 5.2: Variables of Agenti.

Failure Model. Any agent is susceptible to incorrect operation or failure.


In the SSDCPS model, failure of agent i is modeled by the occurrence
of a faili transition. This action is always enabled, and as a result of its
occurrence, the Boolean (indicator) variable failedi is set to true.
Failures cause agents to move with a failure velocity, which is the distance
traveled by a failed agent over one round. Failures are permanent, so upon
an agent failing, it moves forever with the failure velocity. No agent i knows
directly if another agent j has failed (i.e., i cannot read x.failedj).
At state x, Agenti with x.failedi = true is a failed agent; otherwise it is
a non-failed agent. At state x, Agenti such that ∃j with i ∈ x.Suspectedj is
called a suspected agent; otherwise it is a non-suspected agent. The Suspected
variable represents the output of the failure detector.
At state x, if ∃i ∈ ID such that i ∈ x.Suspectedj, then i is called an agent
suspected by j. At state x, denote by F(x) the set of failed agent identifiers,
that is, F(x) ≜ {i ∈ ID : x.failedi}, and NF(x) ≜ ID \ F(x) as the set of non-failed
agent identifiers. The set of agents suspected by Agenti is SUi(x) ≜
x.Suspectedi. The set of agents not suspected by any agent is NS(x) ≜
ID \ ∪i∈ID SUi(x).
Maintaining the desired safety and progress properties, or reduced
versions of them, depends upon first detecting failures and then
mitigating the effect of such failures on these system properties. Details
of each of these phases are given below.

Neighbors. Agenti is said to be a neighbor of a different Agentj at state
x if and only if |x.xi − x.xj| ≤ rc, where rc > 0. The set of identifiers of all
neighbors of Agenti at state x is denoted by

Nbrs(x, i) ≜ { j ∈ ID : i ≠ j ∧ |x.xi − x.xj| ≤ rc }.

Let L(x, i) (and symmetrically R(x, i)) be the nearest non-failed neighbor of
Agenti at state x such that x.xL(x,i) ≤ x.xi (symmetrically x.xR(x,i) ≥ x.xi), or ⊥ if
no such neighbor exists (ties are broken by the unique agent identifiers).
So L(x, i) and R(x, i) take values from {⊥} ∪ Nbrs(x, i) \ F(x). Let LS(x, i) (and
symmetrically RS(x, i)) be the nearest neighbor not suspected by Agenti at
state x such that x.xLS(x,i) ≤ x.xi (symmetrically x.xRS(x,i) ≥ x.xi), or ⊥ if no such
neighbor exists. So LS(x, i) and RS(x, i) take values from {⊥} ∪ Nbrs(x, i) \
x.Suspectedi, and thus LS(x, i) (and RS(x, i)) is the identifier of the nearest
non-suspected agent positioned to the left (right) of i on the real line. Only
upon failures occurring, and subsequently these failed agents becoming
suspected, will LS(x, i) or RS(x, i) change for any i. We denote by NR(x, i)
and NL(x, i) the number of non-failed agents located to the right, and
respectively to the left, of Agenti at state x.
If Agenti has both left and right neighbors, it is said to be a middle agent.
If Agenti does not have a right neighbor, it is said to be a tail agent. If Agenti
does not have a left neighbor, it is said to be a head agent. For a state x, let

Heads(x) ≜ {i ∈ NF(x) : LS(x, i) = ⊥},
Tails(x) ≜ {i ∈ NF(x) : LS(x, i) ≠ ⊥ ∧ RS(x, i) = ⊥},
Mids(x) ≜ NF(x) \ (Heads(x) ∪ Tails(x)), and
RMids(x) ≜ Mids(x) \ {R(x, H(x))}.

The identifier of the non-suspected agent closest to the goal (the origin) is
denoted by H(x) ≜ min NS(x). The identifier of the non-suspected agent
farthest from the goal is denoted by T(x) ≜ max NS(x).
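These definitions map directly to code; the following Python sketch classifies agents as head, middle, or tail from their positions, using made-up positions and rc, and ignoring failures and suspicion (so LS/RS reduce to plain left/right neighbors, and identifier tie-breaking is omitted):

```python
def nbrs(pos, i, r_c):
    """Identifiers within communication distance r_c of agent i."""
    return {j for j in pos if j != i and abs(pos[i] - pos[j]) <= r_c}

def roles(pos, r_c):
    """Classify each agent: 'head' (no left neighbor),
    'tail' (a left but no right neighbor), otherwise 'mid'."""
    out = {}
    for i in pos:
        left = [j for j in nbrs(pos, i, r_c) if pos[j] <= pos[i]]
        right = [j for j in nbrs(pos, i, r_c) if pos[j] >= pos[i]]
        if not left:
            out[i] = "head"
        elif not right:
            out[i] = "tail"
        else:
            out[i] = "mid"
    return out

pos = {0: 0.0, 1: 1.0, 2: 2.1, 3: 3.0}     # hypothetical positions on one lane
print(roles(pos, r_c=1.5))  # {0: 'head', 1: 'mid', 2: 'mid', 3: 'tail'}
```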

Neighbor Variables. Each agent i has the following variables,

(a) Nbrs: this variable is the set of identifiers of agents which are neighbors
of agent i at the pre-state x of any transition, so it is Nbrs(x, i), and

(b) L and R: these variables are the identifiers of the neighbor with the
nearest left and right position, respectively, at the pre-state x of any
transition, so the agent j with x.xj nearest to x.xi from the left and right,
respectively.

The existence of these variables is guaranteed by Lemma 5.5 in Section 5.4.

Shared Variables. In addition to the local variables introduced earlier for
each agent, all agents also rely on the following variables for sharing state
among their neighbors. A shared variable is agent i's state knowledge of a
neighbor j. Each agent i has the following shared variables (see Figure 5.3)
for each neighbor j: (a) xj, (b) xoj, (c) uj, (d) uoj, (e) lanej, and (f) Suspectedj.
When necessary to distinguish i's knowledge of j's state variables for a
state xk, the notation xk.xi,j will be used to indicate that this is j's position xk−1.xj
from the perspective of i at round k.

Figure 5.3: Interaction between a pair of neighboring agents is modeled
with shared variables x, xo, u, uo, lane, and Suspected.

Failure Detection. The first stage of making this algorithm fault-tolerant
is the detection of failures, described below. Recall from Chapter 2 that the
detection time of a failure detector is the number of rounds until each failed
agent has been suspected by each of its non-faulty neighbors.

Definition 5.1. For any execution α, let xf ∈ α be a state with failed agents F(xf).
Assuming no further failures occur, let xd be a state in α reachable from xf such
that ∀i ∈ NF(xd), F(xf) ∩ Nbrs(xd, i) ⊆ xd.Suspectedi. Then, the detection time kd
is d − f rounds.
The failure detector is implemented within each Agenti through the
suspecti transition. The suspecti transition models a detection of the failure of
some agent j ∈ F(x) by Agenti if j is a neighbor of i. The suspecti transition
must occur within kd rounds of a failj(v) action occurring.

Assumption 5.2. Assume there exists a constant kd that satisfies the above
statement for all executions and for all xf.

The results in Section 5.4 will rely on this assumption. Then Subsec-
tion 5.4.6 will introduce the conditions under which it is possible for a
failure detector to match this number of rounds and hence guarantee all
properties previously proven under this assumption in Section 5.4.
Such a detection is based only on the messages received by i from j,
and hence on the shared variables described above. This is modeled by i
having access to some of j's state, respectively the current and old positions
xj and xoj and the current and old targets uj and uoj. Assume that any failure
detection service has access only to these shared variables. Alternatively,
an agent could report itself as being suspected. However, it is ideal for
other agents to detect failures, as in the case of adversarial failures where
an agent could falsely (or not) report itself as having failed. While the
model we are utilizing relies on messages from an agent that may be
failed, the quantities used could be estimated from physical state by the
agents performing failure detection. Hence, in essence, i's knowledge of j
is for clarity of presentation only. When the conditional in Figure 5.4, Line 7,
is satisfied for some neighbor j, then j is added to the Suspectedi set. This
conditional roughly states that, at a state x, for some agent j ∈ Nbrsi(x),
agent i suspects j when i learns through the shared memory that j wanted
to move in one direction, as specified by its target xT.uj, but in fact moved in
the other direction, as specified by its new position x′.xj, which is in the
opposite direction of x.uj.
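The detection conditional can be sketched as a standalone predicate; this Python rendering of the wrong-direction test follows the description above, with argument names mirroring the shared variables and β the quantization parameter (the numeric examples are invented):

```python
from math import copysign

def sgn(v):
    """Sign of v, with sgn(0) = 0."""
    return 0.0 if v == 0 else copysign(1.0, v)

def suspects(xo_j, uo_j, x_j, beta):
    """Agent i suspects neighbor j when j's observed motion contradicts
    its previously announced target uo_j:
      - the target was within beta of xo_j (so j should have held still,
        by quantization) but j's position no longer matches it, or
      - the target was farther than beta away, but j moved opposite to it."""
    if abs(xo_j - uo_j) <= beta:
        return abs(x_j - uo_j) != 0
    return sgn(x_j - xo_j) != sgn(uo_j - xo_j)

print(suspects(xo_j=0.0, uo_j=1.0, x_j=-0.2, beta=0.1))  # True: moved away
print(suspects(xo_j=0.0, uo_j=1.0, x_j=0.3, beta=0.1))   # False: moved toward
```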

transitions                                                              1
faili(v)                                                                 2
  eff failed := true;                                                    3
      vf := v;                                                           4
                                                                         5
suspecti                                                                 6
  pre ∃ j ∈ Nbrs : j ∉ Suspected ∧ ((|xoj − uoj| ≤ β ∧ |xj − uoj| ≠ 0) ∨   7
      (|xoj − uoj| > β ∧ sgn(xj − xoj) ≠ sgn(uoj − xoj)))                  8
  eff Suspected := Suspected ∪ {j}                                       9
                                                                        10
snapStarti                                                              11
  pre L = ⊥ ∧ ¬snaprun                                                  12
  eff snaprun := true // global snapshot invoked                        13
                                                                        14
snapEndi(GS)                                                            15
  eff gsf := GS // snapshot terminated; GS indicates strong flock       16
      snaprun := false                                                  17
                                                                        18
updatei                                                                 19
  eff uo := u;                                                          20
      xo := x;                                                          21
      for each j ∈ Nbrs                                                 22
        Suspected := Suspected ∪ Suspectedj // share suspected sets     23
      end                                                               24
      Mitigate:                                                         25
      for each {s ∈ Suspected : lanes = lane}                           26
        if (∃ L ∈ IDL : ∀ j ∈ Nbrs, lanej = L ⇒ xj ∉ [x − 2kd vmax, x + 2kd vmax]) then  27
          lane := L; fi                                                 28
      end                                                               29
      Target:                                                           30
      if L = ⊥ then                                                     31
        if gsf then u := x − min{x, δ/2}; gsf := false;                 32
        else u := x fi                                                  33
      elseif R = ⊥ then u := (xL + x + rf)/2                            34
      else u := (xL + xR)/2 fi                                          35
      Quant: if |u − x| < β then u := x; fi                             36
      Move:                                                             37
      if failed then x := x + vf                                        38
      else x := x + sgn(u − x) choose [vmin, vmax]; fi                  39

Figure 5.4: Transitions of Agenti.

Failure Mitigation. Agents are aligned on lanes, which are parallel real
lines. Agents cannot collide or violate safety unless they reside in the same
lane. To mitigate failures to ensure safety and progress properties, non-
failed agents will pass failed agents that are moving incorrectly by entering
a different lane. This is accomplished by the Mitigate subroutine of the
update transition.

State Transitions. The state transitions are fails, snapStarts, snapEnds,
suspects, and updates. A faili(v) transition, where i ∈ NF(x) for a state x,
models the permanent failure of Agenti. As a result of this transition, failedi
is set to true and vfi is set to v. This causes Agenti to move forever with
velocity v. Assume that |v| ≤ vmax, which is reasonable due to actuation
constraints.
The snapStart and snapEnd transitions model the periodic initialization
and termination of a distributed global state snapshot protocol, such as
Chandy and Lamport's snapshot algorithm [53]. This global state snapshot
is used in the update transition to detect a stable global predicate as
described below. We model the initialization of this algorithm by snapStart
and the termination as snapEnd. Termination is guaranteed since the
running time of the algorithm is O(N) rounds. This is ensured by
Assumption 5.4, which states that within O(N) rounds of a snapStart transition,
a snapEnd input transition occurs with a Boolean parameter GS which
specifies whether the global state satisfied the specified stable predicate.
We note that the assumptions required to apply Chandy-Lamport's
algorithm are satisfied here since

(a) we are detecting a stable predicate,

(b) the communications graph is strongly connected by Assumption 5.3,
and

(c) the stable predicates are reachable.
A suspect transition models a failure detector service. It determines
which neighboring agents, if any, have failed. To accomplish this, the suspect
action is always enabled, with the detection conditional placed in its
effect; in Subsection 5.4.6 this conditional will instead become the
precondition of the suspect action.
An update transition models the evolution of all the agents over one
synchronous round. For an execution fragment α, the term round is used
to indicate that an update transition x → x′ has occurred, where x, x′ ∈ α.
It is composed of the subroutines, in order of execution: Mitigate, Target,
Quant, and Move. The whole update action is atomic and it is only
separated into subroutines for clarity.
The computations of Mitigate, Target, Quant, and Move are all assumed
to be instantaneous. There is a slight separation from physical state
evolution here, as Move abstractly captures the duration of time required to
move agents by their specified velocities and is not instantaneous. Mitigate
attempts to restore safety and progress properties that may be reduced or
violated due to failures. Target is the flocking algorithm, which roughly
averages the positions of the closest left and right non-suspected neighbors
of an agent. Quant is the quantization step, which prevents targets ui
computed in the Target subroutine from being applied to real positions xi if the
difference between the two is smaller than the quantization parameter β.
Finally, Move moves agent positions xi towards the quantized targets. Thus,
for an update transition x → x′, the state x′ is obtained by applying each of
these subroutines. We will refer to the internal state after Mitigate, Target,
Quant, and Move as xM, xT, xQ, and xV, respectively. Specifically, xM ≜
Mitigate(x), xT ≜ Target(xM), etc., and observe that x′ = xV = Move(xQ).
For a state specified by a round k, such as xk, the notation xk,T is used to
indicate the state of System at round k following the Target subroutine, so
xk,T ≜ Target(Mitigate(xk)).
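A minimal Python sketch of one round for a single non-failed agent illustrates the Target, Quant, and Move pipeline (Mitigate and the head/snapshot logic are omitted; the step length v is fixed here, whereas the algorithm draws it nondeterministically from [vmin, vmax], and all numeric values are illustrative):

```python
def target(x, x_left, x_right, r_f):
    """Target subroutine for mid/tail agents (head logic omitted):
    mids average their neighbors; the tail holds r_f behind its left neighbor."""
    if x_right is None:                 # tail agent: no right neighbor
        return (x_left + x + r_f) / 2
    return (x_left + x_right) / 2       # middle agent

def quant(x, u, beta):
    """Quantization: suppress moves smaller than beta."""
    return x if abs(u - x) < beta else u

def move(x, u, v):
    """Move distance v towards the quantized target u."""
    if u == x:
        return x
    return x + (1 if u > x else -1) * v

# One round for a middle agent between neighbors at 0.0 and 3.0:
x, beta, v = 1.0, 0.05, 0.25
u = quant(x, target(x, 0.0, 3.0, r_f=1.0), beta)
print(move(x, u, v))  # 1.25: one step of length v towards the target u = 1.5
```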

Target. There are three different target computations, based on an agent's
belief of its position within the set as a head, middle, or tail agent. Middle
and tail agents rely only on local information from immediate neighbors,
whereas head agents rely on information from all agents in the
communication graph to which they belong. Specifically, for a state x, each agent
i ∈ Mids(x) attempts to maintain the average of the positions of its nearest
left and right neighbors (Figure 5.4, Line 35). For a state x, T(x) attempts
to maintain distance rf from its nearest left neighbor (Figure 5.4, Line 34).
For a state x, H(x) attempts to detect a certain stable global predicate
FlockS (defined below) by periodically invoking the global snapshot
algorithm, described above through the snapStart and snapEnd transitions.
The key property required to apply the distributed snapshot algorithm is
that, if FlockS holds at the state where the snapshot is invoked, then the
global state that is eventually recorded in gsf also satisfies FlockS. H(x) can
detect whether the system state satisfies FlockS by periodically taking a
distributed snapshot. Until this predicate is detected, H(x) does not change its
target u from its current position x. When the predicate is detected, the head
agent computes a new target towards the goal (Figure 5.4, Line 32).

5.2.3 Model as a Discrete-Time Switched Linear System
The following views the system as a discrete-time switched system and
shows that failures can be modeled as a combination of an additive affine
control and a switch to another system matrix.
Discrete-time switched systems can be described in general as x[k + 1] =
fp(x[k]), where x ∈ RN and p ∈ P for some index set P, such as P =
{1, 2, . . . , m}, or as x[k + 1] = Ap x[k] for linear discrete-time switched
systems [100]. For the following, assume that Figure 5.4, Line 39 is deleted and
replaced with x := u. This deletion removes the nondeterministic choice of
velocity with which to set position x, and instead sets it to the computed
control value u. This nondeterministic choice can be modeled through the
use of a time-varying system matrix A as in [90], but we omit it for
simplicity of presentation.
The effect of an update transition on the position variables of all agents in
System can be represented by the difference equation x[k + 1] = Ap x[k] + bp,
where for a state xk at round k the state vector lists the positions in path
order from the head to the tail,

x[k] = ( xk.xH(xk), xk.xxk.RH(xk), . . . , xk.xT(xk) )T,

Ap is the N × N tridiagonal matrix

     [ a1,1    0       0       0      ...     0      ]
     [ a2,1    a2,2    a2,3    0      ...     0      ]
     [ 0       a3,2    a3,3    a3,4   ...     0      ]
Ap = [ ...     ...     ...     ...    ...     ...    ] ,
     [ 0      ...     ai,i−1   ai,i   ai,i+1  ...    ]
     [ 0      ...      0       0     aN,N−1   aN,N   ]

and bp = ( b1, . . . , bi, . . . , bN )T.
The following are the family of matrices Ap and vectors bp that are
switched among based on the state of System; refer to Figure 5.4 for the
following referenced line numbers. From Line 32, for H(xk ), if FlockS (xk ),
then either (a) if xk .xH(xk ) ≥ δ, then a1,1 = 1 and b1 = − 2δ , otherwise (b) a1,1 = 0
and b1 = 0. From Line 33, if ¬FlockS (xk ), then a1,1 = 1 and b1 = 0. From
Line 35, for i ∈ Mids(xk ), ai,i = 0, ai,i−1 = 12 , ai,i+1 = 12 , and bi = 0. Finally, from
r
Line 34, for T(xk ), aN,N−1 = 12 , aN,N = 12 , and bN = 2f .
Next, all coefficients in the matrix can change due to the quantization law in Line 36. If the conditional on Line 36 is satisfied for agent i ∈ Mids(x_k), then a_{i,i} = 1, a_{i,x_k.L_i} = 0, a_{i,x_k.R_i} = 0, and b_i = 0; for agent i = H(x_k), a_{i,i} = 1 and b_i = 0; and for agent i = T(x_k), a_{i,x_k.L_i} = 0, a_{i,i} = 1, and b_i = 0.
Failures also cause a switch of system matrices. The actuator stuck-at failures being modeled are representative of an additive error term in the b_p vector [44]. From Line 38: for i ∈ Mids(x_k), a_{i,i} = 1, a_{i,x_k.L_i} = 0, a_{i,x_k.R_i} = 0, and b_i = x_k.vf_i; for i = H(x_k), a_{i,i} = 1 and b_i = x_k.vf_{H(x_k)}; and for i = T(x_k), a_{N,N−1} = 0, a_{N,N} = 1, and b_N = x_k.vf_{T(x_k)}.
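For concreteness, the nominal (failure-free, unquantized, not-yet-flocked) rows of A_p and b_p described above can be sketched as a simple iteration of x[k + 1] = A_p x[k] + b_p. The initial positions, N, and r_f below are hypothetical values chosen only for illustration; quantization and failure modes are omitted.

```python
def build_update(N, r_f):
    """One failure-free update round as x[k+1] = Ap x[k] + bp:
    head (index 0) holds its position, middle agents average their two
    neighbors, and the tail tracks its left neighbor plus r_f."""
    Ap = [[0.0] * N for _ in range(N)]
    bp = [0.0] * N
    Ap[0][0] = 1.0                      # head: a_{1,1} = 1, b_1 = 0
    for i in range(1, N - 1):           # middles: a_{i,i-1} = a_{i,i+1} = 1/2
        Ap[i][i - 1] = 0.5
        Ap[i][i + 1] = 0.5
    Ap[N - 1][N - 2] = 0.5              # tail: a_{N,N-1} = a_{N,N} = 1/2
    Ap[N - 1][N - 1] = 0.5
    bp[N - 1] = r_f / 2.0               # ... and b_N = r_f / 2
    return Ap, bp

def step(Ap, bp, x):
    """Apply one round: x[k+1] = Ap x[k] + bp."""
    return [sum(Ap[i][j] * x[j] for j in range(len(x))) + bp[i]
            for i in range(len(x))]

N, r_f = 5, 2.0
Ap, bp = build_update(N, r_f)
x = [0.0, 1.0, 3.0, 6.0, 10.0]          # hypothetical initial positions
for _ in range(300):
    x = step(Ap, bp, x)
gaps = [x[i + 1] - x[i] for i in range(N - 1)]
print([round(g, 3) for g in gaps])      # gaps approach r_f
```

Iterating the nominal map, the inter-agent gaps contract toward r_f while the head stays put, which is the flocking behavior the subsequent analysis establishes formally.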

5.3 Safety and Progress Properties


Agents are meant to model physical entities, such as vehicles or robots
spaced in adjacent lanes. Hence, a key safety property is that agents do not
collide for all time. It specifies that an inter-agent gap of at least the safety
radius rs is maintained between all agents in the same lane in System and is
maintained without failures for all time. However, upon failures occurring,
it may no longer be possible to maintain a minimum desired spacing, or
even to avoid collisions in finite time. The reduced safety property specifies
a weaker version of safety with spacing rr < rs . States which satisfy such a

minimum spacing are formalized through the predicates Safety and SafetyR:

\[
\mathit{Safety}(x) \triangleq \forall i, j \in ID,\ i \ne j \wedge x.lane_i = x.lane_j,\ |x.x_i - x.x_j| \ge r_s,
\]
\[
\mathit{Safety}_R(x) \triangleq \forall i, j \in ID,\ i \ne j \wedge x.lane_i = x.lane_j,\ |x.x_i - x.x_j| \ge r_r.
\]

It will be shown that without failures, Safety is maintained for all reachable states; upon failures occurring, when maintaining separation is possible at all, reachable states satisfy the weaker SafetyR for some time, prior to Safety being restored.
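A direct transcription of the shape of these predicates, using hypothetical positions, lanes, and radii, illustrates how a state can violate Safety while still satisfying the weaker SafetyR:

```python
def separated(pos, lane, r):
    """True iff every pair of distinct agents sharing a lane is
    separated by at least r (the Safety / SafetyR predicate shape)."""
    ids = range(len(pos))
    return all(abs(pos[i] - pos[j]) >= r
               for i in ids for j in ids
               if i != j and lane[i] == lane[j])

r_s, r_r = 2.0, 1.0                 # hypothetical safety radii, r_r < r_s
pos  = [0.0, 1.5, 4.0]              # hypothetical positions
lane = [0, 0, 0]                    # all agents in one lane
print(separated(pos, lane, r_s))    # False: agents 0 and 1 are only 1.5 apart
print(separated(pos, lane, r_r))    # True: all same-lane gaps are at least 1.0
```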
Without a notion of liveness or progress, however, safety can be trivially
maintained by agents not moving. In this case study, there are two progress
properties. The first is called the flocking property, which states that agents reach states where their positions form a flock, an approximately equally spaced formation. Specifically, this is when the differences of positions between adjacent agents are near the flocking distance r_f, within a tolerance parameter ε_f. States which satisfy such a spacing of agent positions are defined by the predicate

\[
\mathit{Flock}(x, \epsilon_f) \triangleq \forall i \in NS(x),\ LS(x, i) \ne \bot,\ |x.x_i - x.x_{LS(x,i)} - r_f| \le \epsilon_f.
\]

Flock is then instantiated as a weak flock by FlockW and strong flock by FlockS ,
which respectively specify a larger and smaller error from agent positions
being exactly spaced by r f . Given the flocking tolerance parameter δ > 0,
define respectively states where agent positions satisfy a weak flock and a
strong flock as

\[
\mathit{Flock}_W(x) \triangleq \mathit{Flock}(x, \delta), \quad \text{and} \quad \mathit{Flock}_S(x) \triangleq \mathit{Flock}(x, \delta/2).
\]

The second progress property is a termination property, which states that


agents reach a neighborhood of a global goal as a strong flock. The Goal predicate defines the neighborhood of the global goal (assumed to be the origin without loss of generality). Goal states are those where the non-failed agent closest to the goal has come as close to it as possible, defined by the predicate

\[
\mathit{Goal}(x) \triangleq x.x_{H(x)} \in [0, 0 + \beta].
\]

The NBM predicate defines states from which middle and tail agents can no longer make progress due to quantization. No big moves (NBM) states are those in which no middle or tail agent is able to move by more than the quantization parameter β > 0, defined by the predicate

\[
\mathit{NBM}(x) \triangleq \forall i \in NF(x),\ LS(x, i) \ne \bot,\ |x_T.u_i - x.x_i| \le \beta,
\]

where x_T is the state following the Target subroutine.


Terminal states are those corresponding to desired final configurations of System, from which no further movements are possible, and are captured by the predicate

\[
\mathit{Terminal}(x) \triangleq \mathit{Goal}(x) \wedge \mathit{NBM}(x).
\]

A relationship between the flock and termination properties will be


established in analysis which ensures that if agents reach states satisfying
the termination property, then these states also satisfy the flock property.
An outline of these properties along executions is presented in Figure 5.5.
To summarize, the safety property states that Safety is maintained for all
time. The reduced safety property states that, when it is possible to do
so, SafetyR is maintained upon failures occurring. There are two progress
properties. The flocking property is that eventually all reachable states will
satisfy FlockS . The termination property is that eventually all reachable
states will satisfy Terminal.
The remainder of the chapter will analyze System with regards to these
safety and progress properties.

5.4 Analysis
Having described the system and failure models formally, in this section the behavior of System is analyzed. Upon establishing some basic behavior, the operation of System in response to various failures is analyzed. Under the assumption that no failures or only a single failure occurs, and assuming that, if necessary, failure detection occurs fast enough, the safety property and the reduced safety property are established. It is then shown that the failure detector is sufficient to detect failures in a bounded number of rounds, if this is possible at all; upon new failures ceasing to occur and all failures having been detected, progress is established.

Figure 5.5: Set view of desired properties of System. Start states Q0 at least satisfy Safety. Failure-free executions are represented by lines with arrows labeled αff. Safety is shown to be invariant along failure-free executions. It is shown that eventually NBM (and thus also FlockW and FlockS) is satisfied along any failure-free execution, upon which the head agent may move towards states satisfying Goal, causing FlockS to no longer be satisfied while FlockW remains stable. However, along executions with failures, represented by the red line with an arrow labeled αf, Safety is not necessarily upheld, but SafetyR is invariant when combined with a failure detector whose action is represented by the green line with an arrow labeled αfd. Upon this detection, any failure-free execution is then guaranteed to reach states satisfying NBM and also eventually Goal.

5.4.1 Assumptions
The following assumptions are required on the constant parameters used
throughout the chapter:

(i) r_r < r_s < r_f < r_c,

(ii) δ/2 < r_c,

(iii) v_min ≤ v_max ≤ β ≤ δ/(4N),

(iv) NL ≥ 2, that is, there are at least 2 lanes, and

(v) the graph of neighbors is strongly connected and the graph of non-
faulty agents may never be partitioned.

Assumption (i) indicates that the reduced safety margin rr seen under
failures is strictly less than the safety margin rs when no failures are present.
It then states the desired inter-agent spacing r f is strictly greater than
these safety margins and strictly less than the communications radius rc .
Assumption (ii) prevents the agent nearest to the goal from moving beyond
the communications radius of any right agent it is adjacent to, that is, it
prevents disconnection of the graph of neighbors. Assumption (iii) bounds
the minimum and maximum velocities, although they may be equal. It
then upper bounds the maximum velocity to be less than or equal to the
quantization parameter β. This is necessary to prevent a violation of safety
due to overshooting computed targets. Finally, β is upper bounded in
such a way that it is possible to establish that NBM ⊆ FlockS . Assumption
(iv) allows the safety and progress properties to be maintained in spite of
failures (under further restrictions to be introduced) by allowing agents
to move among a set of NL lanes, preventing collisions of failed and non-
failed agents and allowing non-failed agents to pass failed agents which
are not moving in the direction of the goal. Assumption (v) is a natural
assumption indicating there is a single network of agents. It further states
that failures do not cause the graph of non-faulty neighbors to partition.
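The numeric constraints (i) through (iv) are easy to encode as a parameter sanity check. The sample values below are hypothetical, chosen only so that they satisfy the assumptions:

```python
def check_parameters(r_r, r_s, r_f, r_c, delta, v_min, v_max, beta, N, NL):
    """Check assumptions (i)-(iv) on the constant parameters."""
    assert r_r < r_s < r_f < r_c, "(i) ordering of radii"
    assert delta / 2 < r_c, "(ii) flock tolerance vs. comms radius"
    assert v_min <= v_max <= beta <= delta / (4 * N), \
        "(iii) velocity and quantization bounds"
    assert NL >= 2, "(iv) at least two lanes"

# hypothetical parameter values satisfying (i)-(iv):
# delta/(4N) = 2.4/40 = 0.06, so v_min <= v_max <= beta <= 0.06 holds
check_parameters(r_r=0.5, r_s=1.0, r_f=2.0, r_c=3.0,
                 delta=2.4, v_min=0.01, v_max=0.05, beta=0.06, N=10, NL=2)
print("assumptions (i)-(iv) hold")
```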
For the remainder of the chapter we make the following assumptions.

Assumption 5.3. In all start states x ∈ Q0 of System,

Safety(x) ∧ x.x_{H(x)} ≥ 0.

The following assumption states that for an agent i, a snapEndi transition
occurs within O(N) rounds from the occurrence of any snapStarti transi-
tion. Essentially it ensures termination of the global snapshot algorithm so
that any agent which relies on this algorithm for target computation may
calculate targets infinitely often. Thus it is used to ensure progress of the
algorithm.
Assumption 5.4. For any execution α, let x be a state in α such that x −snapStart_i→ x′. Then there exists a state x′′ in α such that x′′ −snapEnd_i→ x′′′, where x′′ is reachable from x′. Furthermore, x′′ is at most O(N) rounds from x in the sequence α.

5.4.2 Basic Analysis


The following lemma ensures that the set of neighbors of an agent is well
defined and matches the definition of Nbrs(x, i) for agent i at the pre-state
x of any transition. It follows by observing that only update transitions
modify Nbrs(x, i).

Lemma 5.5. For any reachable state x, for all i ∈ NF(x), Nbrs(x, i) = x.Nbrs_i. For any agent i ∈ ID and state x′ such that x −a→ x′ for a ∈ A \ {update}, Nbrs(x′, i) = x′.Nbrs_i and Nbrs(x, i) = x.Nbrs_i.

The next lemma states that if neighbors change, then they do so sym-
metrically. This is used to establish safety upon agents no longer relying
on suspected agents for target computation.
Lemma 5.6. For any reachable state x such that x −a→ x′ for any a ∈ A, ∀i, j ∈ ID, if x.L_i ≠ j and x′.L_i = j, then x′.R_j = i.

Proof: Fix i and j and observe that only the suspect or update action changes LS(x, i) or RS(x, j), by changing either the positions of agents x_i or the sets of suspected agents. By Lemma 5.5, we consider L and R. There are two cases when x.L_i ≠ x′.L_i = j. The first is upon agents that were not neighbors at x becoming neighbors at x′, that is, j ∉ x.Nbrs_i and j ∈ x′.Nbrs_i. This is only possible due to the update action, since no other action changes x_i. By the definition of neighbor, also i ∉ x.Nbrs_j and i ∈ x′.Nbrs_j. By the symmetric definitions of LS(x′, i) and RS(x′, j), we have x′.R_j = i.

The second case is when agents i and j were neighbors at x, so j ∈ x.Nbrs_i and i ∈ x.Nbrs_j, but now have at least one suspected agent f with i > f > j between them, where f ∈ x.Nbrs_i ∩ x.Nbrs_j. This is possible due to the suspect or update transitions. Prior to suspecting that f has failed, no change of LS(x, i) or RS(x, j) occurs by definition, implying that for the hypothesis of the lemma to be satisfied, x′ must be a state where f ∈ x′.Suspected_i ∩ x′.Suspected_j, since i and j both use the same suspect action at Figure 5.4, Line 6. In this case the symmetric switch occurs, and by the definitions of LS(x, i) and RS(x, j) we have x′.R_j = i. Otherwise, f ∉ x′.Suspected_i ∩ x′.Suspected_j, contradicting that x.L_i ≠ x′.L_i.

5.4.3 Basic Failure Analysis


A class of safe failures. Prior to introducing failure mitigation, it is es-
tablished that there exists a class of failures which do not violate safety.
This lemma relies on strong assumptions, but shows that some failures
may not cause a violation of safety. The lemma follows by observing that
along such an execution, no agents ever come closer together by Figure 5.4,
Line 38.

Lemma 5.7. Let x be a state along any execution of System and assume that F(x) = ∅. Consider the execution fragment α = x·fail_1(v)·x′·fail_2(v)·x′′ · · · fail_N(v)·x_f. That is, ∀i ∈ ID, let fail_i(v) occur, where v is the same for each of these fail_i transitions. Then, for any state x_s appearing after x_f in α, Safety(x_s).

The previous lemma exhibits a class of failures under which safety is preserved but progress to states satisfying NBM is violated. Likewise, progress is violated by the following lemma, which says that any failed agent with nonzero velocity diverges. This follows from the definition of velocities in Figure 5.4, Line 38.

Lemma 5.8. For any execution α and state x ∈ α, if i ∈ F(x) ∧ x.vf_i ≠ 0 ∧ x.vf_i ≠ ⊥, then for the states x_k ∈ α appearing after x, lim_{k→∞} |x_k.x_i| = ∞.

The previous lemma highlights an important part of the definition of the Flock(x) property for a state x: the property relies on the states of agents with identifiers in the set of non-suspected agents NS(x), and not on the set of non-failed agents NF(x) or on all agents ID. Observe that if Flock(x) were defined over ID, then by Lemma 5.8, at no future point in time could Flock(x) be attained. Furthermore, if Flock(x) relied on NF(x) instead of NS(x), then the failure detection algorithm could potentially rely upon the head agent's evaluation of this predicate on the global snapshot for the detection of failures.
We end this section by presenting the motivation for sharing sets of suspected agents among agents in Figure 5.4, Line 23, so the following lemma assumes this line of code is deleted. It gives a failure condition under which no moves are possible and hence no progress can be made.
We end this section by presenting the motivation for sharing sets of sus-
pected agents among agents in Figure 5.4, Line 23, so this lemma assumes
this line of code is deleted. The following gives a failure condition under
which no moves are possible and hence no progress can be made.

Lemma 5.9. Assume that agents do not share sets of suspected agents, so Figure 5.4, Line 23 is deleted. Consider any execution α with a state x ∈ α where F(x) = ∅ and, for all i ∈ ID \ H(x),

\[
x.x_i - x.x_{L_i} = x.x_{R_i} - x.x_i = \cdots = x.x_{T(x)} - x.x_{L_{T(x)}} > r_f + \frac{\delta}{2},
\]

so that ¬FlockS(x). Let there be a single non-faulty agent p located farther than r_c from agent T(x), so that p ∉ x.Nbrs_{T(x)}. Let α′ be an execution fragment starting from x such that for every state x′ ∈ α′, ID = F(x′) ∪ {p} and x′.vf_j = 0 for all j ∈ F(x′). Then, for all states x′′ reachable from x′, x′′.Suspected_p = ∅ and, for all i ∈ ID, x′′.x_i = x′.x_i.

Proof: Each agent i except T(x′) computes u_i to move to the center of its neighbors; however, since each agent is already there, each computes u_i = 0. For T(x′), u_{T(x′)} > 0, but it does not move since x′.vf_{T(x′)} = 0. Each agent j ∈ x′.Nbrs_{T(x′)} suspects T(x′), so that x′.Suspected_j = {T(x′)}. However, since T(x′) ∉ Nbrs_p(x′), p does not suspect T(x′), so T(x′) ∉ x′.Suspected_p. Since p is the only non-failed agent, it is the only agent that could have made progress, but it does not: having not detected the failure, it cannot move to another lane to mitigate it.

5.4.4 Safety in Spite of a Single Failure


This section analyzes the safety properties of System when a single i ∈ ID
can make a faili (v) transition. First it is shown that there is a maximum

distance, vmax , any failed or non-failed agent moves in any round. This
then implies that any two agents move towards or away from one another
by at most 2vmax in any round. Then, if non-failed agents change neighbors,
it is shown that they do not violate safety. Next, a condition on when a
single agent can fail for maintenance of reduced safety is given. Finally,
the safety property is shown to be invariant without failures, and with the
aforementioned condition, in the face of one failure, the reduced safety
property is proven.
Lemma 5.10 shows that any agent moves by at most a positive constant v_max in any round; agents not allowed to move due to quantization constraints move by 0 in that round, which is also less than v_max. The proof follows since update is the only action to change any x_i, then from Figure 5.4, Line 39, by the assumption that ∀i ∈ F(x), v_min ≤ x.vf_i ≤ v_max, and since failures are permanent, so that for any state x′ reachable from x, x′.vf_i = x.vf_i.

Lemma 5.10. For any execution α, for states x, x′ ∈ α such that x −a→ x′ for any a ∈ A, ∀i ∈ ID, |x′.x_i − x.x_i| ≤ v_max.

The following corollary states that any two agents move towards or
away from one another by at most 2vmax from one round to another and
follows from Lemma 5.10.
Corollary 5.11. For any execution α, for states x, x′ ∈ α such that x −a→ x′ for any a ∈ A, and ∀i, j ∈ ID with i ≠ j, |(x′.x_i − x.x_i) − (x′.x_j − x.x_j)| ≤ 2v_max.

The next lemma establishes that safety is maintained when agents switch the neighbors used in Target, through changes of Nbrs(x, i), LS(x, i), or RS(x, i) from x to x′.

Lemma 5.12. For any execution α, for states x, x′ ∈ α such that x −a→ x′ for any a ∈ A, ∀i, j ∈ ID, if LS(x, i) ≠ j, RS(x, j) ≠ i, LS(x′, i) = j, RS(x′, j) = i, and x.x_{RS(x,j)} − x.x_{LS(x,i)} ≥ r_s, then x′.x_{RS(x′,j)} − x′.x_{LS(x′,i)} ≥ r_s.

Proof: Only suspect and update modify LS(x, i), RS(x, i), or x_i for any i. By Lemma 5.5, we discuss L and R. By Lemma 5.6, which states that neighbor switching occurs symmetrically, if x.L_i ≠ j and x′.L_i = j, then x′.R_j = i.

It remains to be established that x′.x_{x′.R_j} − x′.x_{x′.L_i} ≥ r_s. For convenient notation, observe that x′.x_{x′.R_j} = x′.x_i and x′.x_{x′.L_i} = x′.x_j. Now,

\[
x'.x_j = \frac{x.x_{x.L_j} + x.x_i}{2}, \quad \text{and} \quad
x'.x_i = \frac{x.x_j + x.x_{x.R_i}}{2},
\]

and thus

\[
x'.x_i - x'.x_j = \frac{x.x_j + x.x_{x.R_i}}{2} - \frac{x.x_{x.L_j} + x.x_i}{2}
= \frac{x.x_j - x.x_{x.L_j} + x.x_{x.R_i} - x.x_i}{2}.
\]

Finally, by the hypothesis and Assumption 5.3,

\[
x'.x_i - x'.x_j \ge \frac{r_s + r_s}{2} = r_s.
\]

The cases for i = N and j = 1 follow by similar analysis, as does the case when x′.x_m is quantized so that x.x_m = x′.x_m for any m ∈ ID.
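The midpoint arithmetic at the heart of this proof can be checked numerically. The positions below are hypothetical, chosen so that every adjacent gap is at least r_s:

```python
r_s = 2.0
# hypothetical positions x.L_j < x_j < x_i < x.R_i, adjacent gaps >= r_s
x_Lj, x_j, x_i, x_Ri = 0.0, 2.5, 5.0, 7.5

# the midpoint updates from the proof of Lemma 5.12:
x_j_new = (x_Lj + x_i) / 2          # j moves to midpoint of L_j and i
x_i_new = (x_j + x_Ri) / 2          # i moves to midpoint of j and R_i

# new gap equals (x_j - x_Lj + x_Ri - x_i) / 2 >= (r_s + r_s) / 2 = r_s
gap = x_i_new - x_j_new
print(gap, gap >= r_s)              # 2.5 True
```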

Invariant 5.13 shows that targets u_i and positions x_i are always safe in the absence of failures. When failures can occur, under the following assumption about detection and mitigation of such failures, the weaker reduced safety property is invariant. In particular, all analysis in the face of failures relies on detection of any failed agent within k_d rounds of any fail_i(v) transition.

Invariant 5.13. For any reachable state x, if F(x) = ∅, then Safety(x).

If F(x) = {f} for some f ∈ ID, let x_f be the state following any fail_f(v) transition, and consider the execution α_f containing x_f, so the sequence of states in α_f is x_0 . . . x_f . . .. Let x_d be the first state in the failure-free execution fragment α_ff starting from x_f; thus x_d is k_d elements from x_f in the sequence of states in α_f, that is, d − f = k_d and d ≥ f. If v_max ≤ (r_s − r_r)/(2k_d), then SafetyR(x_d).

Proof: The proof is by induction over the length of any execution of System. The base case follows from Assumption 5.3. For the inductive case, for each transition a ∈ A, we show that if x −a→ x′ and Safety(x), then Safety(x′).

(a) update: The only times x′.u_i ≠ x.u_i are on an update transition. The inductive hypothesis provides Assumption 5.3 for the pre-state x. By Lemma 5.10, it is sufficient to show that, ∀i ∈ ID,

\[
x.u_i - x.u_{x.L_i} \ge r_s \;\wedge\; x.x_i - x.x_{x.L_i} \ge r_s \implies x'.u_i - x'.u_{x'.L_i} \ge r_s.
\]

All of the following follows from Figure 5.4, Lines 31–35. For all i ∈ NF(x) ∩ Mids(x),

\[
x'.u_i - x'.u_{x'.L_i} = \frac{x.x_{x.L_i} + x.x_{x.R_i} - x.x_{x.L_{x.L_i}} - x.x_i}{2}
= \frac{x.x_{x.L_i} - x.x_{x.L_{x.L_i}} + x.x_{x.R_i} - x.x_i}{2} \ge r_s.
\]

For i = T(x),

\[
x'.u_{T(x')} - x'.u_{x'.L_{T(x')}} = \frac{x.x_{x.L_{T(x)}} + x.x_{T(x)} + r_f - x.x_{x.L_{x.L_{T(x)}}} - x.x_{T(x)}}{2}
= \frac{r_f + x.x_{x.L_{T(x)}} - x.x_{x.L_{x.L_{T(x)}}}}{2} \ge r_s.
\]

Since x.x_g ≤ x.x_{H(x)}, by Assumption 5.3, x′.u_{H(x′)} ≤ x.u_{H(x)}. Cases when quantization changes any x′.u_i in Line 36 follow by similar analysis and are omitted for space.
Next is the proof of the second claim, that is, the cases where some f ∈ ID has x.failed_f = true, so that F(x) ≠ ∅; in particular, |F(x)| = 1. In these cases, x′.x_f = x.x_f + x.vf_f by Line 38. Since the pre-state x only ensures separation by r_s, Safety(x′) can no longer be shown. However, given the assumption that v_max ≤ (r_s − r_r)/(2k_d), observe that at round k_d, x_d.x_f ≤ x.x_f + k_d v_max = x.x_f + (r_s − r_r)/2, where we consider the case x.vf_f > 0; the negative case follows symmetrically. By the assumption that any failure is detected by round k_d, and by Lemma 5.10, any failed agent f and any non-failed agent i have moved towards one another by at most 2k_d v_max, and thus

\[
\left| (x_d.x_f - x_d.x_i) - (x.x_f - x.x_i) \right| \le 2 k_d v_{\max} = r_s - r_r.
\]

This implies at least SafetyR(x_m) for any state x_m in the execution between x and x_d. Since x.x_f − x.x_i ≥ r_s, we have x_d.x_f − x_d.x_i ≥ r_r, and SafetyR(x_d) is established. It remains to be established that SafetyR holds for states reachable from x_d. Any agent i with f ∈ x_d.Nbrs_i will have f ∈ x_d.Suspected_i, which changes LS and RS, but applying Lemma 5.12 still yields Safety among the non-failed agents. Finally, by Figure 5.4, Line 28, x_d.lane_i ≠ x_d.lane_f since N_L ≥ 2, and hence SafetyR(x_d).

(b) fail_i(v), snapStart_i, snapTerm_i, and suspect_i: these transitions do not modify any x_i or u_i, so Safety(x′).

5.4.5 Progress
In this section it is established that along executions of System in which fail
actions are fixed, then System reaches a terminal state, that is one satisfying
Terminal. To show this, it is first established that a state x satisfying NBM
is reached. It is further argued that NBM ⊆ FlockS so that x also satisfies
FlockS . That is, System reaches states from which no non-head agent
may move and such states satisfy the strong flocking condition, in that
they are roughly equally spaced with a tight tolerance parameter. Upon
FlockS being satisfied, it is shown that progress is made towards a state
x satisfying Goal. Upon such progress being made, only FlockW remains
invariant, but by reapplication of the previous arguments for reachability
of states satisfying NBM, another state x is reached which again satisfies
NBM and hence FlockS . Finally, by repeated application of these arguments,
it is established that a state x satisfying Terminal is reached. The order
in which NBM and Goal are satisfied depends on the initial conditions. If
System starts in a state satisfying Goal and ¬NBM, then obviously Goal is
satisfied first. However, if System starts in a state satisfying ¬Goal and
¬NBM, then it will always be the case that NBM is satisfied first.

The following descriptions of error dynamics are useful for later analysis:

\[
e(x, i) \triangleq \begin{cases} |x.x_i - x.x_{x.L_i} - r_f| & \text{if } i \in Mids(x) \cup T(x) \\ 0 & \text{otherwise,} \end{cases}
\]
\[
eu(x, i) \triangleq \begin{cases} |x.u_i - x.u_{x.L_i} - r_f| & \text{if } i \in Mids(x) \cup T(x) \\ 0 & \text{otherwise.} \end{cases}
\]

Here e(x, i) gives the error, with respect to r_f, between Agent_i and its non-suspected left neighbor. The quantity eu(x, i) gives the same notion of error, but with respect to target positions x.u_i rather than physical positions x.x_i.
The next lemma shows that if an agent is allowed to move in spite of
quantization, then it moves by at least a strictly positive constant vmin in
any round. This follows from Figure 5.4, Line 39.

Lemma 5.14. For any failure-free execution fragment α and any two adjacent rounds x_k and x_{k+1} in α, for any i ∈ NF(x_k) ∩ NF(x_{k+1}), if |x_{k,T}.u_i − x_k.x_i| > β, then |x_{k+1}.x_i − x_k.x_i| ≥ v_min > 0.

Lemma 5.15 shows that when System is outside of states satisfying NBM,
the maximum error for all non-failed agents’ target positions ui and their
position in a state satisfying NBM is non-increasing. It also displays that
the maximum error for all non-failed agents’ positions xi and the goal is
non-increasing. Finally it shows that the maximum error for all non-failed
agents’ positions in adjacent rounds is non-increasing.

Lemma 5.15. For any failure-free execution fragment α, for any state x ∈ α, if x ∉ NBM, then

\[
\max_{i \in NF(x_Q)} eu(x_Q, i) \le \max_{i \in NF(x)} eu(x, i).
\]

Furthermore, if x ∉ NBM, then

\[
\max_{i \in NF(x_M)} e(x_M, i) \le \max_{i \in NF(x)} e(x, i).
\]

Finally, if x and x′ are in α such that x −a→ x′ for any a ∈ A, then

\[
\max_{i \in NF(x')} e(x', i) \le \max_{i \in NF(x)} e(x, i).
\]

Proof: Target and Quant are the only subroutines of update_i to modify u_i. First, max_{i∈NF(x_T)} eu(x_T, i) ≤ max_{i∈NF(x)} eu(x, i), which follows from eu(x_T, i) being computed as convex combinations of positions from x:

\[
\begin{aligned}
i = H(x_T) &\Rightarrow eu(x_T, i) = 0 \\
i = x_T.R_{H(x_T)} &\Rightarrow eu(x_T, i) = \frac{eu(x, x.R_i)}{2} \\
i \in RMids(x_T) &\Rightarrow eu(x_T, i) = \frac{eu(x, x.L_i) + eu(x, x.R_i)}{2} \\
i = T(x_T) &\Rightarrow eu(x_T, i) = \frac{eu(x, x.L_i) + eu(x, i)}{2}.
\end{aligned}
\]

Finally, Quant sets x_Q.u_i = x_T.u_i or x_Q.u_i = x_T.x_i. In the first case, the result follows by the above reasoning. In the other case, if u_i and u_{L_i} are each quantized, then e_i does not change for any i and the result follows. If, however, u_i is quantized and u_{L_i} is not quantized, then e_i is computed as

\[
\begin{aligned}
i = H(x_T) &\Rightarrow eu(x_T, i) = 0 \\
i = x_T.R_{H(x_T)} &\Rightarrow eu(x_T, i) = eu(x, i) \\
i \in RMids(x_T) &\Rightarrow eu(x_T, i) = \frac{eu(x, x.R_i) + eu(x, i)}{2} \\
i = T(x_T) &\Rightarrow eu(x_T, i) = \frac{eu(x, i) + eu(x, x.L_i)}{2}.
\end{aligned}
\]

Likewise, if u_{L_i} is quantized and u_i is not quantized, then e_i is computed as

\[
\begin{aligned}
i = H(x_T) &\Rightarrow eu(x_T, i) = 0 \\
i = x_T.R_{H(x_T)} &\Rightarrow eu(x_T, i) = \frac{eu(x, i) + eu(x, x.R_i)}{2} \\
i \in RMids(x_T) &\Rightarrow eu(x_T, i) = \frac{eu(x, x.L_i) + eu(x, i)}{2} \\
i = T(x_T) &\Rightarrow eu(x_T, i) = \frac{eu(x, i)}{2}.
\end{aligned}
\]

Finally, applying Lemma 5.10 indicates that the error between actual positions, and not only target positions, is non-increasing.

The following analysis demonstrates progress towards states satisfying NBM from any states not in NBM. Define the candidate Lyapunov function as

\[
V(x) \triangleq \sum_{i \in NF(x)} e(x, i).
\]

Note the similarity of this candidate with the one found in [101]. In particular, it is not quadratic: it is the sum of the absolute values of the inter-agent spacing errors. Thus for a state x, if e(x, i) > 0 for some i, then V(x) > 0. Define the maximum value of the candidate function obtained for any execution α over any state x ∈ α satisfying NBM as

\[
\gamma \triangleq \sup_{\{x \in \alpha \,:\, x \in NBM\}} V(x).
\]
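The non-increase of this candidate function under the failure-free averaging round can be observed numerically. The positions, r_f, and round count below are hypothetical; quantization is omitted, so this is only a sketch of the nominal dynamics:

```python
def V(x, r_f):
    """Candidate Lyapunov function: sum over non-head agents of the
    absolute error between the gap to the left neighbor and r_f."""
    return sum(abs((x[i] - x[i - 1]) - r_f) for i in range(1, len(x)))

def update(x, r_f):
    """One failure-free round: head holds, middles average their
    neighbors, tail tracks its left neighbor plus r_f."""
    new = list(x)
    for i in range(1, len(x) - 1):
        new[i] = (x[i - 1] + x[i + 1]) / 2
    new[-1] = (x[-2] + x[-1]) / 2 + r_f / 2
    return new

x, r_f = [0.0, 1.0, 3.0, 6.0, 10.0], 2.0   # hypothetical initial positions
vals = []
for _ in range(300):
    vals.append(V(x, r_f))
    x = update(x, r_f)

# V never increases along the failure-free execution
assert all(b <= a + 1e-9 for a, b in zip(vals, vals[1:]))
print(round(vals[0], 3), round(vals[-1], 6))
```

Each new gap error is a convex combination of old gap errors, so by the triangle inequality the sum of absolute errors cannot grow, mirroring the convexity argument in Lemma 5.15.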

The next lemma shows that sets of states satisfying NBM are invariant,
that a state satisfying NBM is reached, and gives a bound on the number
of rounds required to reach a state satisfying NBM.

Lemma 5.16. Consider any failure-free execution fragment α beginning with state x_k along which x_i.x_{H(x_i)} = x_k.x_{H(x_k)} for any state x_i ∈ α with i ≥ k. If V(x_k) > γ, then any update transition decreases V(x_k) by at least a positive constant ψ. Furthermore, there exists a finite round c such that V(x_c) ≤ γ, where x_c ∈ NBM and k < c ≤ ⌈(V(x_k) − γ)/v_min⌉.

Proof: Assume that System is in a state x_k ∉ NBM, as otherwise there is nothing to prove. First note that the only transition to modify any position variable x_i is update. If x_k ∉ NBM, then targets are computed as convex combinations by Lemma 5.15, and hence V(x_{k+1}) ≤ V(x_k). By the definition of NBM, since x_k ∉ NBM, there exists j ∈ NF(x_k) such that |x_{k,T}.u_j − x_k.x_j| > β, where x_{k,T} is the state obtained by applying the subroutines of the update transition through Target. Let j = argmax_{i∈NF(x_k)} e(x_k, i). Thus Figure 5.4, Line 36 is not satisfied for j, and

\[
v_{\max} \ge |x_{k+1}.x_j - x_k.x_j| \ge v_{\min}
\]

by Figure 5.4, Line 39.

Let ΔV(x_k, x_{k+1}) ≜ V(x_{k+1}) − V(x_k); we show ΔV(x_k, x_{k+1}) ≤ −ψ for some ψ > 0. Observe that −v_min ≥ ΔV(x_k, x_{k+1}) ≥ −v_max, and since v_max ≥ v_min > 0, let ψ = v_min. Therefore an update transition x_k → x_{k+1} causes V to decrease by at least the positive constant v_min. By repeated application of this reasoning, there exists c with k < c ≤ ⌈(V(x_k) − γ)/v_min⌉ such that x_c ∈ NBM and V(x_c) ≤ γ.

Lemma 5.16 states a bound on the time it takes for System to reach the set of states satisfying NBM. However, to satisfy FlockS, all x ∈ NBM must lie inside the set of states that satisfy FlockS. If FlockS(x), then

\[
V(x) = \sum_{i \in NF(x)} e(x, i) \le \frac{\delta (N-1)}{2}.
\]

From any state x that does not satisfy FlockS(x), there exists an agent that will compute a control satisfying the quantization constraint and hence make a move towards NBM. Thus, to satisfy FlockS, it is required that γ ≤ δ(N − 1)/2, in which case every x ∈ NBM satisfies FlockS(x), or equivalently, NBM ⊆ FlockS. This allows a derivation of the bound on the quantization parameter β.
The following corollary follows from Lemma 5.16, as the only way FlockS(x) ceases to be satisfied after becoming satisfied is for the head agent to move, in which case x′.x_{H(x′)} < x.x_{H(x)}, which may cause V(x′) ≥ V(x).

Corollary 5.17. For any execution α and x, x′ ∈ α such that x −a→ x′ for any a ∈ A, if FlockS(x) ∧ x.x_{H(x)} = x′.x_{H(x′)}, then FlockS(x′).

Lemma 5.18 shows that once a weak flock is formed, it is invariant. This establishes that for any reachable state x′, if V(x′) > V(x), then V(x′) < δ(N − 1).

Lemma 5.18. FlockW is a stable predicate.


Proof: We show that for any execution α and ∀x, x′ ∈ α such that x −a→ x′ for any a ∈ A, if FlockW(x), then FlockW(x′). If FlockW(x), there are two cases to consider.

Case 1. The system satisfies FlockW(x) ∧ ¬FlockS(x); then FlockW(x′) holds by application of Lemma 5.16, since x′.x_{H(x′)} = x.x_{H(x)} by Figure 5.4, Line 33.

Case 2. The system satisfies FlockW(x) ∧ FlockS(x), so upon termination of the global snapshot algorithm by Assumption 5.4, if x.x_{H(x)} ≠ x.x_g, then H(x) computes x′.u_{H(x′)} < x.u_{H(x)} and applies this target by Figure 5.4, Line 32; we show FlockS(x) ⇒ FlockW(x′). If x.x_{H(x)} ∈ [0, β] such that the predicate on Line 36 is satisfied, then x′.x_{H(x′)} = x.x_{H(x)} and the proof is complete. If not, then by the definition of x′.u_{H(x′)} in Figure 5.4, Line 32, H(x) will compute a target no more than δ/2 to the left, so |x′.u_{H(x′)} − x.u_{H(x)}| ≤ δ/2. Now, for H(x) to have moved, the error between its distance to x.R_{H(x)} and the flocking distance must have been at most δ/2, by the definition of FlockS. Agent_{R_{H(x)}} will have moved to the center of H(x) and R_{R_{H(x)}}, so x′.u_{R_{H(x′)}} may be less than, equal to, or greater than its previous position x.x_{R_{H(x)}}, requiring a case analysis of these three possibilities. In the first two cases x′.u_{R_{H(x′)}} ≤ x.x_{R_{H(x)}} and the proof is complete. The last case follows by applying Lemma 5.10 to H(x) and x.R_{H(x)} and observing that the most they ever move apart is 2β ≤ δ/2, so they are now separated by at most δ; hence FlockW(x′) is satisfied.

Lemma 5.19. Consider any infinite sequence of lexicographically ordered pairs (a_1, b_1), (a_2, b_2), . . . , (a_j, b_j), . . . where a_j, b_j ∈ ℝ≥0. Suppose there exist constants c_1 > 0, c_2 > 0, c_3 > 0, c_4 ≥ 0, c_5 ≥ 0, and c_6 ≥ 0 such that, for all j:

(i) a_{j+1} ≤ a_j,

(ii) if a_{j+1} = a_j ∧ b_j > c_4, then b_{j+1} ≤ b_j − c_1,

(iii) if a_{j+1} < a_j, then b_{j+1} ≤ c_6,

(iv) if b_j ≤ c_2 ∧ a_j > c_5, then a_{j+1} ≤ max{0, a_j − c_3}.

Then there exists t such that (a_t, b_t) = (a_{t+1}, b_{t+1}) = · · ·, where a_t ∈ A = [0, c_5] and b_t ∈ B = [0, c_4].

Proof: First, note that by assumption, a_{j+1} is bounded from above by a_j (i.e., by a_1). Now assume, for the purpose of contradiction, that there exists a pair (a_p, b_p) with a_p > c_5 and b_p > c_4 such that ∀f ≥ p, (a_f, b_f) = (a_p, b_p). Then we show there exists a q > p such that (a_q, b_q) ≠ (a_p, b_p), where a_q < a_p and b_q < b_p.

Without loss of generality, assume that b_p > c_2 initially. Now, starting from (a_p, b_p), the next step in the sequence is such that b_{p+1} ≤ b_p − c_1, since it must be the case that a_p = a_{p+1}, as we assumed b_p > c_2. This process of b_j decreasing continues in the form b_n ≤ b_p − n·c_1, where n is the step at which b_n ≤ c_2; thus b_n ≤ b_p − n·c_1 ≤ c_2 and n ≥ (b_p − c_2)/c_1. At the next step from n, that is n + 1, it must be the case that a_{n+1} ≤ max{0, a_n − c_3}, since b_n ≤ c_2 and a_n = a_p > c_5. Since a_{n+1} < a_n, it is the case that b_{n+1} ≤ c_6. Note that it would seem to remain to be established that b_n > c_4 so that the decrease of b_{n+1} could occur; but if it is in fact the case that b_n ≤ c_4, then b_n ∈ B as desired. Therefore, q = n + 1 > p, and since (a_p, b_p) becomes smaller at a larger step in the sequence, we reach the contradiction. By repeatedly applying the previous arguments, the existence of such a t is established.
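The descent argument can be exercised on a toy sequence that obeys conditions (i)–(iv). The constants and the initial pair below are hypothetical; here a_j models the head's distance to the goal and b_j the Lyapunov value, with c_2 = c_4 for simplicity:

```python
# hypothetical constants for Lemma 5.19's conditions (i)-(iv)
c1, c2, c3, c4, c5, c6 = 0.1, 0.5, 0.2, 0.5, 0.1, 1.0

a, b = 3.0, 2.0          # hypothetical initial pair (a_1, b_1)
trace = [(a, b)]
for _ in range(1000):
    if b > c2:           # (ii): a unchanged, b decreases by at least c1
        a, b = a, b - c1
    elif a > c5:         # (iv): a decreases by c3; (iii): b may jump up to c6
        a, b = max(0.0, a - c3), c6
    else:
        break            # pair has entered A x B = [0, c5] x [0, c4]
    trace.append((a, b))

assert a <= c5 and b <= c4      # terminal pair lies in A x B
print(len(trace), (round(a, 3), round(b, 3)))
```

The b-component ratchets down while a is fixed; each drop of a resets b at most to c6, and since a strictly decreases across phases, the pair eventually enters and stays in A x B, as the lemma asserts.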

The following theorem shows that System reaches a neighborhood of


the goal as a strong flock, or equivalently, that there exists a round t such
that Terminal(xt ) and FlockS (xt ).

Theorem 5.20. Consider any infinite failure-free execution α = x_0, x_1, . . ., and consider the infinite sequence of pairs

\[
\left(x_0.x_{H(x_0)}, V(x_0)\right), \left(x_1.x_{H(x_1)}, V(x_1)\right), \ldots, \left(x_t.x_{H(x_t)}, V(x_t)\right), \ldots.
\]

If there exists t such that

(i) x_t.x_{H(x_t)} = x_{t+1}.x_{H(x_{t+1})},

(ii) V(x_t) = V(x_{t+1}),

(iii) x_t.x_{H(x_t)} ∈ [0, β], and

(iv) V(x_t) ≤ (N − 1)δ/2,

then Terminal(x_t) and FlockS(x_t).

Proof: The proof follows from Lemma 5.19 by the analysis above, instantiating

(i) c_1 = v_min,

(ii) c_2 = (N − 1)δ/2,

(iii) c_3 = δ/2,

(iv) c_4 = γ,

(v) c_5 = β, and

(vi) c_6 = (N − 1)δ.

The following theorem states that System achieves the desired properties
in forming a flock at the goal (the origin) within a specified time,
and follows by Theorem 5.20, Assumption 5.4, and Lemma 5.16. The convergence
time would be exact were it not for the O(N) rounds required for the
snapshot algorithm to terminate (Assumption 5.4).
Theorem 5.21. Consider any infinite failure-free execution fragment α. Let xt be a state at least

(V(x0) − (N − 1)δ/2)/vmin + max{1, O(N)} · ((N − 1)δ/2 + x0.xH(x0))/vmin

rounds from x0 in α, where x0 is the first state in α; then Terminal(xt) and FlockS(xt).
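Under the instantiation of Theorem 5.20, the bound can be evaluated numerically. The sketch below is one plausible reading of the bound, with the O(N) snapshot term modeled by an explicit snapshot_rounds parameter; both the grouping of terms and the parameter names are assumptions, not the thesis's exact statement:

```python
def convergence_round_bound(V0, x_head0, N, delta, vmin, snapshot_rounds):
    """One plausible reading of the round bound of Theorem 5.21 under the
    instantiation of Theorem 5.20: rounds for the velocity spread V to
    contract below (N - 1) * delta / 2, plus snapshot-gated rounds for
    the head agent to close the remaining distance to the goal."""
    flock_rounds = (V0 - (N - 1) * delta / 2) / vmin
    goal_rounds = ((N - 1) * delta / 2 + x_head0) / vmin
    return flock_rounds + max(1, snapshot_rounds) * goal_rounds
```

For example, with V0 = 10, x_head0 = 4, N = 5, delta = 2, vmin = 1, and a 5-round snapshot, the bound evaluates to 6 + 5 · 8 = 46 rounds.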
5.4.6 Failure Detection
We now work toward conditions under which the assumed detection time
kd in Assumption 5.2 can be matched. Unfortunately, this is not always
possible: there exists a class of failures which cannot be detected in any
amount of time, as the following lemma shows.
Lemma 5.22. For any execution which may reach a terminal state, consider a
terminal state x ∈ Terminal, and assume F(x) = ∅. Now consider two infinite
execution fragments α and α′ starting from x, and assume α′ = faili(0).α, for any
i ∈ ID. For any state x ∈ α and any state x′ ∈ α′, for all i ∈ ID, x.xi = x′.xi and
x.ui = x′.ui.
These two execution fragments appear indistinguishable to any
failure detector which relies on comparing positions xi and target positions
ui, and therefore the failure of Agenti can never be detected. While such
failures are undetectable in any amount of time, so kd → ∞, observe that
these failures do not violate Safety(x) or Terminal(x) for any reachable state
x in either execution fragment. It turns out that only failures which cause
a violation of safety or progress may be detected.
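Lemma 5.22 can be illustrated with a toy simulation; the update rule below is an illustrative stand-in for the real System dynamics, not the pseudocode of Figure 5.4. From a terminal state, an execution with a stuck-at-zero actuator failure is indistinguishable from a failure-free one:

```python
def step(positions, targets, stuck):
    """One synchronous round: each agent moves toward its target unless
    its actuator is stuck at velocity zero."""
    new_positions = []
    for i, (x, u) in enumerate(zip(positions, targets)):
        v = 0.0 if i in stuck else (u - x)
        new_positions.append(x + v)
    return new_positions

# In a terminal state every target equals the current position.
terminal = [0.0, 1.0, 2.0]
targets = list(terminal)

run_ok, run_fail = terminal, terminal
for _ in range(5):
    run_ok = step(run_ok, targets, stuck=set())
    run_fail = step(run_fail, targets, stuck={1})  # agent 1 failed at v = 0

assert run_ok == run_fail  # the two executions are indistinguishable
```

No observer comparing positions and targets round by round can tell the two runs apart, since the failed agent's commanded velocity is already zero.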
Lower-Bound on Detection Time. Having illustrated that there exist executions
under which the occurrence of faili(v) may never be detected, we
show a lower-bound on the detection time for all faili(v) actions that could
cause safety or progress violations. The following lower-bound applies
for executions beginning from states that do not a priori satisfy NBM.
Informally, it says that a failed agent can mimic the actions of its
correct non-faulty behavior in such a way that, despite the failure, System
still progresses to NBM as intended.
More specifically, it assumes that the head agent is not at the goal—so
Goal is not satisfied—and that the head agent has failed with zero velocity.
It takes O(N) rounds to reach states which satisfy NBM, and these states also
satisfy FlockS . The head agent detects the strong flocking stable predicate
through the global snapshot algorithm in O(N) rounds and computes a
target towards the goal. However, since the head agent has failed with
zero velocity, it cannot make this movement, so a neighbor of the head
agent detects that the head agent has failed. Thereby the fact that faili (v)
occurred was undetected until O(N) rounds had passed.
Lemma 5.23. The lower-bound on detection time of actuator stuck-at failures is O(N).
Proof : Consider both an execution in which a failure has occurred, αf, and
a failure-free execution, αn. Let the initial state x of both these executions
satisfy x ∉ Goal and x ∉ NBM. In both executions, let all agents always
choose to apply velocity magnitude vmin at Figure 5.4, Line 39.
Let the head agent be failed with velocity zero, so x.failedH(x) = true and
x.vfH(x) = 0. Only for a state x′ ∈ FlockS will uH(x′) ≠ 0 by Figure 5.4, Line 32.
Lemma 5.16 implies that x′ is O(N) rounds away from x in each of αf and
αn, and only once x′ ∈ NBM can it be guaranteed that x′ ∈ FlockS. Once
x′ ∈ FlockS, at some state x′′ which is O(N) rounds from x′ in each of αf
and αn, uH(x′′) ≠ 0 by Assumption 5.4 and Figure 5.4, Line 32. Thus, αf
and αn are indistinguishable up to state x′′ and, by Lemma 5.16, x′′ is at least
O(N) rounds from x.
Accuracy and Completeness. Lemma 5.24 characterizes the accuracy
property of the above failure detector.
Lemma 5.24. In any reachable state x and for any i ∈ ID, j ∈ x.Suspectedi ⇒ x.failedj.
Proof : Given that ∃i such that x.Suspectedi ≠ ∅, the predicate at Figure 5.4, Line 7 was satisfied at some round ks in the past. That is, at
ks, some j was added to x.Suspectedi. Fix such a j. Let xs correspond to
the state at round ks and xs′ be the subsequent state in the execution. At
the round prior to ks, there are two cases based on the computation of uj in
Figure 5.4, Line 36 for some j ∉ xks−1.Suspectedi.
Case 1: Quantization allows move. The quantization constraint

|xs.xj − xs.uj| ≤ β

was not satisfied in Figure 5.4, Line 36, so Agentj applies a velocity in the
direction of sgn(uj − xj). If

sgn(xs′.xj − xs.xj) ≠ sgn(xs.uj − xs.xj),

then Agentj moved in the wrong direction, since it computed a move xs.uj
but in actuality applied a velocity that caused it to move away from xs.uj
instead of towards it. This is possible only if

sgn(xs.uj − xs′.xj) ≠ sgn(xs.uj − xs.xj),

implying that xs.vfj ≠ 0, and thus xs.failedj = true.
Case 2: Quantization prevents move. The quantization constraint

|xs.xj − xs.uj| ≤ β

was satisfied in Figure 5.4, Line 36, so

|xs′.xj − xs.uj| = 0

should have been observed, but instead it was observed that Agentj performed a move, such that

|xs′.xj − xs.xj| ≠ 0.

This implies that xs.failedj = true, since the only way |xs′.xj − xs.xj| ≠ 0
is if, for xs.vfj ≠ 0,

xs′.xj = xs.xj + xs.vfj.
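The two cases above can be sketched as a small predicate. This is a hedged reconstruction, not the pseudocode of Figure 5.4; the names x_old, u_old, and x_new stand for agent i's last recorded position and target for neighbor j and j's currently observed position:

```python
from math import copysign

def sign(z):
    # Sign with sign(0) = 0, so "no move" differs from "move in either direction".
    return 0.0 if z == 0 else copysign(1.0, z)

def suspect(x_old, u_old, x_new, beta):
    """Sketch of the detector's two-case check: x_old and u_old are
    neighbor j's last recorded position and target, x_new its currently
    observed position, and beta the quantization constant."""
    if abs(x_old - u_old) <= beta:
        # Case 2: quantization prevents a move, so j should sit at its target.
        return abs(x_new - u_old) != 0
    # Case 1: quantization allows a move, so j should head toward its target.
    return sign(x_new - x_old) != sign(u_old - x_old)
```

For instance, a neighbor within β of its target that nonetheless moves is suspected, as is one that moves in the direction opposite its computed target.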
The next lemma describes a partial completeness property [3], in that after
a failure has occurred, some agent eventually suspects that a failure has
occurred. This is only partial completeness, as Lemma 5.22 demonstrated
that there exists a class of failures that can never be detected.
Lemma 5.25. For any failure-free execution fragment α, suppose that x is a state
in α such that ∃j ∈ F(x) and ∃i ∈ ID such that j ∈ x.Nbrsi \ Suspectedi. Let i's
state knowledge for j satisfy either (|x.xoi,j − x.uoi,j| ≤ β ∧ |x.xi,j − x.uoi,j| ≠ 0)
or (|x.xoi,j − x.uoi,j| > β ∧ sgn(x.xi,j − x.xoi,j) ≠ sgn(x.uoi,j − x.xoi,j)). Then
x → x′ via the suspecti(j) transition.
Proof : Fix a failure-free execution fragment α. Note that there always exists
an i ∈ ID that is a neighbor of the failed agent j, by the strong connectivity
assumption. For the transition suspecti to be taken, the precondition at
Figure 5.4, Line 7 must satisfy that j ∉ x.Suspectedi, and that either

(|x.xoi,j − x.uoi,j| ≤ β ∧ |x.xi,j − x.uoi,j| ≠ 0), or
(|x.xoi,j − x.uoi,j| > β ∧ sgn(x.xi,j − x.xoi,j) ≠ sgn(x.uoi,j − x.xoi,j)).

These are the two hypotheses of the lemma, and thus the suspecti transition is enabled.
The following corollary gives a bound on the number of rounds to detect
any failure which may be detected, and follows by applying Lemma 5.23
with Lemma 5.25.
Corollary 5.26. For all non-terminal states, the detection time is O(N). That is,
the occurrence of any faili (v) transition is suspected within O(N) rounds.
The following corollary states that eventually all non-faulty agents know
the set of all failed agents, and follows from Lemma 5.24 and Corollary 5.26,
and that agents share suspected sets in Figure 5.4, Line 23.
Corollary 5.27. For all executions α of System, for any state x ∈ α, there exists
an element xs in α such that ∀i ∈ NF(xs ), xs .Suspectedi = F(x).
Upon detecting that nearby agents have failed, Agenti may need to move to
an adjacent lane to maintain the safety and eventual progress properties. For
instance, if Agentj has failed, x.xi > x.xj, and x.vj = 0, then to make progress
toward the goal 0 < xj, Agenti must somehow get past Agentj. This motivates
the mitigation action: generally, move to a different lane until either
i has passed j if x.vj = 0, or j has passed i if x.vj > 0. For this passing to
occur, the mitigating agent must also change its belief about which neighbor it
should base its target on in the Target subroutine of the update transition,
motivating the need for Li and Ri to change as well.
Roughly, if at state x, s is a failed agent and s is suspected by i = R(x, s),
then L(x, i) must yield Agents’s left neighbor, L(x, s). This is always possible
given the assumption that failures do not cause a partitioning of the
communications graph.
With no further assumptions on when agents fail and in which directions,
up to f ≤ NL − 1 failures may occur, with at most one in each lane. This
ensures there is a failure-free lane which can always be used to mitigate
failures. However, up to f ≤ N − 1 failures may occur so long as no failure
occurs within O(N) time in a single lane and there are NL ≥ 2 lanes. This
follows by Lemma 5.25 and is formalized in the next lemma, which states
that if i is failed, then within O(N) rounds no other agent believes i is its
left or right neighbor. This lemma is sufficient to prove convergence to
terminal states.
Lemma 5.28. If at a reachable state x, x.failedi, then for a state x′ reachable from
x after O(N) rounds, ∀j ∈ ID, x′.Lj ≠ i ∧ x′.Rj ≠ i.
The previous lemma ensures progress with at most one failure in each
of f ≤ NL − 1 lanes. By Lemma 5.28, within O(N²) time no agent i ∈ NF(x)
believes any j ∈ F(x) is its Li or Ri, and thereby any failed agents diverge
safely along their individual lanes if |x.vj| > 0 by Lemma 5.8, and i converges
to states that satisfy NBM by Theorem 5.20. This shows that System is self-stabilizing when combined with a failure detector.
Alternatively, a topological requirement can be made to allow more
frequent occurrences of failures. In particular, restrict the set of executions
to those containing configurations in which there is always sufficiently
large free spacing for mitigation in some lane, which is formalized below
in Invariant 5.29.
Invariant 5.29. Safety(x) holds in spite of f ≤ N − 1 failures, assuming along any
execution, ∀x, ∃L ∈ IDL such that ∀i ∈ NF(x), ∀j ∈ F(x), x.lanej ≠ L and

[x.xi − rs − 2vmax, x.xi + rs + 2vmax] ∩ [x.xj − rs − 2vmax, x.xj + rs + 2vmax] = ∅.
After Agentj has been suspected by its neighbors, that is, j ∈ x.Suspectedi
for all i with j ∈ Nbrs(x, i), the Mitigation subroutine of the update transition
shows that they will move to some free lane at the next round. Mitigation
thus takes at most one additional round after detection, since we have
assumed there is always free space on some lane which is therefore safe to
move onto. This implies that, so long as a failed agent is detected prior to
safety being violated, only one additional round is required to mitigate;
the time of mitigation is a constant factor added to the time to suspect,
resulting in O(N) time to both suspect and mitigate. Note that since there
is a single collection of agents (the communications graph is strongly
connected), the only time an agent needs to change its left or right neighbor
is upon determining that that neighbor has in fact failed.
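The lane-change mitigation described above can be sketched as a simple selection rule. The function and argument names are hypothetical; the real Mitigation subroutine of Figure 5.4 additionally enforces the free-spacing condition of Invariant 5.29 before moving:

```python
def mitigation_lane(my_lane, suspected_lanes, num_lanes):
    """Pick some lane, different from the current one, containing no
    suspected agent.  Assumes (as in Invariant 5.29) that at least one
    failure-free lane with sufficient free spacing always exists."""
    for lane in range(num_lanes):
        if lane != my_lane and lane not in suspected_lanes:
            return lane
    raise RuntimeError("no failure-free lane available")
```

With NL = 3 lanes and failures suspected in lanes 0 and 2, an agent in lane 0 would move to lane 1, pass the failed agent, and later resume following its recomputed left or right neighbor.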
5.5 Conclusion
This case study demonstrated a DCPS which, when combined with a failure
detector, satisfies a self-stabilization property. In particular, it demonstrated
safety without failures, a reduced form of safety when a single failure
occurs, and eventually reaching a destination as a strong flock along failure-
free executions. Without the failure detector, the system would not be able
to maintain safety as agents could collide, nor make progress to states
satisfying flocking or the destination, since failed agents may diverge,
causing their neighbors to follow and diverge as well. Thus it presented
the development of a fault-tolerant DCPS from a fault-intolerant one.
CHAPTER 6

CONCLUSION
6.1 Future Work
There are many directions to investigate further. The most interesting
direction to pursue stems from an initial pass in Chapter 2 of developing a
model for distributed cyber-physical systems (DCPS) which satisfy some
notion of fault-tolerance. The case studies also have generalizations to
pursue.
Distributed Traffic Control Problem. The case study presented in Chap-
ter 4 has several future directions. The case for arbitrary tessellations of
the plane as opposed to the partition of squares seems interesting as well
as challenging, particularly if the algorithms are to have asymptotically
optimal throughput. A further generalization would be to develop algo-
rithms for flow control of multiple types of entities with arbitrary flow
patterns (not necessarily source-to-destination flows) specified for each
type. For practical applications, algorithms are needed which tolerate a
relaxed coupling between entities and allow them some degree of inde-
pendent movement while preserving safety and progress. Finally, given
the assertional structure of the proofs, an interesting avenue would be to
mechanize the proofs through the use of automated theorem proving tools
such as [85].
Distributed Flocking Problem. The case study presented in Chapter 5
could benefit from more realistic dynamics, such as the double-integrator
system considered in [79]. Flocking in higher dimensions, specifically two
and three dimensions as in [86], has many practical applications in robotic
swarms or UAVs, and fault-tolerant algorithms should be developed for
these cases.
An investigation of a constant-time algorithm for failure detection is also
interesting, where agents occasionally apply a special motion and agents
which do not follow this special coordinated motion are deemed faulty.
This is conceptually similar to the motion probes used in [61].
Failure Classes. This thesis investigated through case studies two types
of failures, a cyber failure of a computer crash and an actuator stuck-at
failure. There are many other failures which could be considered in these
case studies or other DCPS, such as those enumerated in Chapter 3.
Modeling DCPS. As mentioned, the dynamics for each of the case stud-
ies were simple, and thus the modeling formalism was able to rely only on
discrete transitions. To model more complicated dynamics, as well as message
passing in partially synchronous timing models, a more expressive
formalism is necessary, and we would consider the use of timed input/output
automata (TIOA) [14] or hybrid input/output automata (HIOA) [17].
The work in [102] may provide a route for converting some of the results
presented here to a partially synchronous timing model with message
passing.
While this thesis investigated fault-tolerance of DCPS—in the form of the
systems satisfying an invariant safety property and an eventual progress
property—it would be interesting to investigate a provably optimal bound,
or a lower-bound, on the time required to return to states which may make
progress. Interesting impossibility results regarding when a system may
not tolerate faults might arise in the partially synchronous or asynchronous
timing models.
Practical Realization. Simulating the case studies in this thesis over a
wireless network has interesting practical and theoretical directions. Anal-
ysis in the partially synchronous setting could be done in a formalism
like TIOA mentioned above and then simulated over a set of computers
on a real network. This may then rely on expanding the self-stabilizing
hierarchy of algorithms for each case study. For instance in the traffic
control problem, the self-stabilizing algorithm could be composed of self-
stabilizing clock-synchronization and then the self-stabilizing routing algorithm.
Similarly, in the safe flocking problem, a self-stabilizing DCPS implemented
in a simulation over a network may rely on the composition of
(a) a self-stabilizing clock synchronization algorithm, (b) a self-stabilizing
leader election algorithm to decide on the head agent, (c) a self-stabilizing
distributed snapshot algorithm for strong flock detection, and (d) a self-
stabilizing failure detector.
Finally, an interesting case study would be how self-stabilizing algo-
rithms could be combined with supervisory controllers, like in the inverted
pendulum of [103].
6.2 Conclusions
Overall, this thesis took a first step in constructing a theory of fault-
tolerance for DCPS. A general model of DCPS and a definition for such
systems to be fault-tolerant were introduced. Furthermore, it introduced
a general method for establishing whether a given DCPS is fault-tolerant
through the use of self-stabilization. If the DCPS was found not to be fault-
tolerant, it was shown that through the construction of a failure detector,
the DCPS could be converted into a fault-tolerant system.
The general model was then instantiated for two specific DCPS and
their fault-tolerant properties were investigated. In the distributed traffic
control problem in Chapter 4 (and [62]), the system was shown to be
fault-tolerant. Specifically, it presented a DCPS in which it is possible for
physical safety to be maintained in spite of arbitrary crash failures of the
software controlling agents. However, while progress cannot be preserved
under such failures, due to the self-stabilizing nature of the algorithm the
DCPS was shown to automatically return to states from which progress
can be made.
In the distributed flocking problem in Chapter 5, the system was shown
to require a failure detector to satisfy fault-tolerance. Specifically, a failure
detector was constructed which eventually suspects agents that have
failed with actuator stuck-at faults, whenever it is possible to suspect such
faults.
In each case study, an invariant safety property was established as well
as an eventual progress property: the invariant safety properties each
specified that certain bad states are never reached, and the eventual progress
properties ensured that the problem specifications are eventually satisfied,
all in spite of failures. The importance of this work will be realized as the
proliferation of sensors, actuators, networking, and computing results in
the creation of DCPS such as mobile robot swarms, the future electric grid,
the automated highway system, and other systems which make strong use
of combining distributed computation with physical processes.
REFERENCES
[1] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, “A survey of rollback-recovery protocols in message-passing systems,” ACM Comput. Surv., vol. 34, no. 3, pp. 375–408, 2002.

[2] F. B. Schneider, “Implementing fault-tolerant services using the state machine approach: a tutorial,” ACM Comput. Surv., vol. 22, no. 4, pp. 299–319, 1990.

[3] T. D. Chandra and S. Toueg, “Unreliable failure detectors for reliable distributed systems,” J. ACM, vol. 43, no. 2, pp. 225–267, 1996.

[4] S. Dolev, Self-stabilization. Cambridge, MA: MIT Press, 2000.

[5] M. Baum and K. Passino, “A search-theoretic approach to cooperative control for uninhabited air vehicles,” in AIAA Guidance, Navigation, and Control Conference and Exhibit, no. AIAA-2002-4589, Aug. 2002, pp. 1–8.

[6] P. A. Ioannou, Automated Highway Systems. New York, NY, USA: Plenum Press, 1997.

[7] L. S. Communications, “The smart grid: An introduction,” United States Department of Energy’s Office of Electricity Delivery and Energy Reliability, Tech. Rep., 2008. [Online]. Available: http://www.oe.energy.gov/SmartGridIntroduction.htm

[8] T. Horvath, T. Abdelzaher, K. Skadron, and X. Liu, “Dynamic voltage scaling in multitier web servers with end-to-end delay control,” IEEE Trans. Comput., vol. 56, no. 4, pp. 444–458, 2007.

[9] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: a survey,” Computer Networks, vol. 38, no. 4, pp. 393–422, 2002.

[10] I. F. Akyildiz and I. H. Kasimoglu, “Wireless sensor and actor networks: research challenges,” Ad Hoc Networks, vol. 2, no. 4, pp. 351–367, 2004.
[11] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simulations, and Advanced Topics. John Wiley and Sons, Inc., 2004.

[12] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2002.

[13] R. Alur and D. L. Dill, “A theory of timed automata,” Theoretical Computer Science, vol. 126, pp. 183–235, 1994.

[14] D. K. Kaynar, N. Lynch, R. Segala, and F. Vaandrager, The Theory of Timed I/O Automata, ser. Synthesis Lectures in Computer Science. Morgan & Claypool Publishers, 2006.

[15] R. Alur, C. Courcoubetis, T. A. Henzinger, and P.-H. Ho, “Hybrid automata: An algorithmic approach to the specification and verification of hybrid systems,” in Hybrid Systems, R. L. Grossman, A. Nerode, A. P. Ravn, and H. Rischel, Eds. London, UK: Springer-Verlag, 1993, pp. 209–229.

[16] T. A. Henzinger, “The theory of hybrid automata,” in LICS ’96: Proceedings of the 11th Annual IEEE Symposium on Logic in Computer Science. Washington, DC, USA: IEEE Computer Society, 1996, p. 278.

[17] N. Lynch, R. Segala, and F. Vaandrager, “Hybrid I/O automata,” Inf. Comput., vol. 185, no. 1, pp. 105–157, 2003.

[18] A. Arora and M. Gouda, “Closure and convergence: A foundation of fault-tolerant computing,” IEEE Trans. Softw. Eng., vol. 19, pp. 1015–1027, 1993.

[19] T. D. Chandra and S. Toueg, “Unreliable failure detectors for asynchronous systems (preliminary version),” in PODC ’91: Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing. New York, NY, USA: ACM, 1991, pp. 325–340.

[20] M. Raynal, “A short introduction to failure detectors for asynchronous distributed systems,” SIGACT News, vol. 36, no. 1, pp. 53–70, 2005.

[21] P. M. Frank, “Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy—a survey and some new results,” Automatica, vol. 26, no. 3, pp. 459–474, 1990.

[22] J. Gertler, “Survey of model-based failure detection and isolation in complex plants,” Control Systems Magazine, IEEE, vol. 8, no. 6, pp. 3–11, Dec. 1988.
[23] C. N. Hadjicostis, “Non-concurrent error detection and correction in fault-tolerant discrete-time LTI dynamic systems,” IEEE Trans. Autom. Control, vol. 48, pp. 2133–2140, 2002.

[24] R. Su and W. Wonham, “Global and local consistencies in distributed fault diagnosis for discrete-event systems,” IEEE Trans. Autom. Control, vol. 50, no. 12, pp. 1923–1935, Dec. 2005.

[25] F. P. Preparata, G. Metze, and R. T. Chien, “On the connection assignment problem of diagnosable systems,” IEEE Trans. Electron. Comput., vol. EC-16, pp. 848–854, Dec. 1967.

[26] R. Reiter, “A theory of diagnosis from first principles,” Artificial Intelligence, vol. 32, no. 1, pp. 57–95, 1987.

[27] F. Cristian, “Understanding fault-tolerant distributed systems,” Commun. ACM, vol. 34, no. 2, pp. 56–78, 1991.

[28] M. Schneider, “Self-stabilization,” ACM Comput. Surv., vol. 25, no. 1, pp. 45–67, 1993.

[29] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications. Norwell, MA, USA: Kluwer Academic Publishers, 1997.

[30] J. Wensley, L. Lamport, J. Goldberg, M. Green, K. Levitt, P. Melliar-Smith, R. Shostak, and C. Weinstock, “SIFT: Design and analysis of a fault-tolerant computer for aircraft control,” Proceedings of the IEEE, vol. 66, no. 10, pp. 1240–1255, Oct. 1978.

[31] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Trans. Program. Lang. Syst., vol. 4, no. 3, pp. 382–401, 1982.

[32] S. Delaët and S. Tixeuil, “Tolerating transient and intermittent failures,” Journal of Parallel and Distributed Computing, vol. 62, no. 5, pp. 961–981, 2002.

[33] R. E. L. DeVille and S. Mitra, “Stability of distributed algorithms in the face of incessant faults,” in Proceedings of the 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), Nov. 2009, pp. 224–237.

[34] N. A. Lynch, Distributed Algorithms. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1996.

[35] J. Beauquier, S. Delaët, S. Dolev, and S. Tixeuil, “Transient fault detectors,” Distributed Computing, no. 1, pp. 39–51, July.
[36] T. Henzinger, B. Horowitz, and C. Kirsch, “Giotto: a time-triggered language for embedded programming,” Proceedings of the IEEE, vol. 91, no. 1, pp. 84–99, Jan. 2003.

[37] G. Baliga, S. Graham, L. Sha, and P. Kumar, “Service continuity in networked control using Etherware,” Distributed Systems Online, IEEE, vol. 5, no. 9, pp. 2–2, Sep. 2004.

[38] D. Seto, B. Krogh, L. Sha, and A. Chutinan, “The Simplex architecture for safe on-line control system upgrades,” in Proc. American Control Conference, Philadelphia, PA, June 1998, pp. 3504–3508.

[39] S. Bak, D. K. Chivukula, O. Adekunle, M. Sun, M. Caccamo, and L. Sha, “The system-level Simplex architecture for improved real-time embedded system safety,” in Real-Time and Embedded Technology and Applications Symposium, IEEE. Los Alamitos, CA, USA: IEEE Computer Society, 2009, pp. 99–107.

[40] F. Bonnet and M. Raynal, “Looking for the weakest failure detector for k-set agreement in message-passing systems: Is Πk the end of the road?” in SSS ’09: Proceedings of the 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 149–164.

[41] K. Birman and R. van Renesse, Reliable Distributed Computing with the Isis Toolkit. Los Alamitos, CA, USA: IEEE Computer Society Press, 1994.

[42] H. A. Zia, N. Sridhar, and S. Sastry, “Failure detectors for wireless sensor-actuator systems,” Ad Hoc Netw., vol. 7, no. 5, pp. 1001–1013, 2009.

[43] C. Delporte-Gallet, H. Fauconnier, R. Guerraoui, V. Hadzilacos, P. Kouznetsov, and S. Toueg, “The weakest failure detectors to solve certain fundamental problems in distributed computing,” in PODC ’04: Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing. New York, NY, USA: ACM, 2004, pp. 338–346.

[44] J. Gertler, Fault Detection and Diagnosis in Engineering Systems. New York, NY, USA: Marcel Dekker, 1998.

[45] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis, “Diagnosability of discrete-event systems,” IEEE Trans. Autom. Control, vol. 40, no. 9, pp. 1555–1575, Sep. 1995.
[46] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis, “Failure diagnosis using discrete-event models,” IEEE Trans. Control Syst. Technol., vol. 4, no. 2, pp. 105–124, Mar. 1996.

[47] I. Roychoudhury, G. Biswas, and X. Koutsoukos, “Designing distributed diagnosers for complex continuous systems,” IEEE Transactions on Automation Science and Engineering, vol. 6, no. 2, pp. 277–290, Apr. 2009.

[48] P. J. Ramadge and W. M. Wonham, “Modular feedback logic for discrete event systems,” SIAM J. Control Optim., vol. 25, no. 5, pp. 1202–1218, 1987.

[49] Y. Ru and C. N. Hadjicostis, “Fault diagnosis in discrete event systems modeled by partially observed Petri nets,” Discrete Event Dynamic Systems, vol. 19, no. 4, pp. 551–575, 2009.

[50] A. Paoli and S. Lafortune, “Safe diagnosability for fault-tolerant supervision of discrete-event systems,” Automatica, vol. 41, no. 8, pp. 1335–1347, 2005.

[51] E. W. Dijkstra, “Self-stabilizing systems in spite of distributed control,” Commun. ACM, vol. 17, no. 11, pp. 643–644, 1974.

[52] M. Hutle and J. Widder, “Self-stabilizing failure detector algorithms,” in IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria, Feb. 2005, pp. 485–490.

[53] K. M. Chandy and L. Lamport, “Distributed snapshots: determining global states of distributed systems,” ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63–75, 1985.

[54] A. Arora and M. Gouda, “Distributed reset,” IEEE Trans. Comput., vol. 43, no. 9, pp. 1026–1038, 1994.

[55] S. Katz and K. J. Perry, “Self-stabilizing extensions for message-passing systems,” Distrib. Comput., vol. 7, no. 1, pp. 17–26, 1993.

[56] A. Arora and S. Kulkarni, “Detectors and correctors: a theory of fault-tolerance components,” in Distributed Computing Systems, 1998. Proceedings. 18th International Conference on, May 1998, pp. 436–443.

[57] Y. Afek, S. Kutten, and M. Yung, “Memory-efficient self stabilizing protocols for general networks,” in WDAG ’90: Proceedings of the 4th International Workshop on Distributed Algorithms. London, UK: Springer-Verlag, 1991, pp. 15–28.
[58] Y. Afek, S. Kutten, and M. Yung, “The local detection paradigm and its applications to self-stabilization,” Theor. Comput. Sci., vol. 186, no. 1-2, pp. 199–229, Oct. 1997.

[59] B. Awerbuch, B. Patt-Shamir, and G. Varghese, “Self-stabilization by local checking and correction,” in Foundations of Computer Science, 1991. Proceedings, 32nd Annual Symposium on, Oct. 1991, pp. 268–277.

[60] Y. Afek and S. Dolev, “Local stabilizer,” J. Parallel Distrib. Comput., vol. 62, no. 5, pp. 745–765, 2002.

[61] M. Franceschelli, M. Egerstedt, and A. Giua, “Motion probes for fault detection and recovery in networked control systems,” in American Control Conference, 2008, June 2008, pp. 4358–4363.

[62] T. Johnson, S. Mitra, and K. Manamcheri, “Safe and stabilizing distributed cellular flows,” in Distributed Computing Systems, 2010. ICDCS 2010. Proceedings. 30th IEEE International Conference on, Genoa, Italy, June 2010.

[63] C. Daganzo, M. Cassidy, and R. Bertini, “Possible explanations of phase transitions in highway traffic,” Transportation Research A, vol. 33, pp. 365–379, May 1999.

[64] D. Helbing and M. Treiber, “Jams, waves, and clusters,” Science, vol. 282, pp. 2001–2003, Dec. 1998.

[65] B. S. Kerner, “Experimental features of self-organization in traffic flow,” Phys. Rev. Lett., vol. 81, no. 17, pp. 3797–3800, Oct. 1998.

[66] M. Nolan, Fundamentals of Air Traffic Control. Wadsworth Publishing Company, 1994.

[67] F. Borgonovo, L. Campelli, M. Cesana, and L. Coletti, “MAC for ad hoc inter-vehicle network: services and performance,” in IEEE Vehicular Technology Conf., vol. 5, 2003, pp. 2789–2793.

[68] B. Hoh, M. Gruteser, R. Herring, J. Ban, D. Work, J.-C. Herrera, A. M. Bayen, M. Annavaram, and Q. Jacobson, “Virtual trip lines for distributed privacy-preserving traffic monitoring,” in MobiSys ’08: Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services. New York, NY, USA: ACM, 2008, pp. 15–28.

[69] T. Prevot, “Exploring the many perspectives of distributed air traffic management: The Multi Aircraft Control System MACS,” in Proceedings of the HCI-Aero, 2002, pp. 149–154.
[70] N. Leveson, M. de Villepin, J. Srinivasan, M. Daouk, N. Neogi,
E. Bachelder, J. Bellingham, N. Pilon, and G. Flynn, “A safety and
human-centered approach to developing new air traffic management
tools,” in Proceedings Fourth USA/Europe Air Traffic Management R&D
Seminar, Dec. 2001, pp. 1–14.
[71] C. Livadas, J. Lygeros, and N. A. Lynch, “High-level modeling and
analysis of TCAS,” in Proceedings of the 20th IEEE Real-Time Systems
Symposium (RTSS’99), Dec. 1999, pp. 115–125.
[72] J. Misener, R. Sengupta, and H. Krishnan, “Cooperative collision
warning: Enabling crash avoidance with wireless technology,” in
12th World Congress on Intelligent Transportation Systems, 2005, pp.
1–11.
[73] A. Girard, J. de Sousa, J. Misener, and J. Hedrick, “A control architec-
ture for integrated cooperative cruise control and collision warning
systems,” in Decision and Control, 2001. Proceedings of the 40th IEEE
Conference on, vol. 2, 2001, pp. 1491–1496.
[74] C. Tomlin, G. Pappas, and S. Sastry, “Conflict resolution of air traffic
management: A study in multi-agent hybrid systems,” IEEE Trans.
Autom. Control, vol. 43, pp. 509–521, 1998.
[75] C. Muñoz, V. Carreño, and G. Dowek, “Formal analysis of the op-
erational concept for the Small Aircraft Transportation System,” in
Rigorous Engineering of Fault-Tolerant Systems, ser. LNCS, vol. 4157,
2006, pp. 306–325.
[76] D. Swaroop and J. K. Hedrick, “Constant spacing strategies for pla-
tooning in automated highway systems,” Journal of Dynamic Systems,
Measurement, and Control, vol. 121, pp. 462–470, 1999.
[77] E. Dolginova and N. Lynch, “Safety verification for automated platoon
maneuvers: A case study,” in International Workshop on Hybrid and
Real-Time Systems (HART ’97), ser. LNCS, vol. 1201. Springer-Verlag,
Mar. 1997.
[78] P. Varaiya, “Smart cars on smart roads: Problems of control,” IEEE
Trans. Autom. Control, vol. 38, pp. 195–207, 1993.
[79] H. Kowshik, D. Caveney, and P. R. Kumar, “Safety and liveness in
intelligent intersections,” in Hybrid Systems: Computation and Control
(HSCC), 11th International Workshop, ser. LNCS, vol. 4981, Apr. 2008,
pp. 301–315.
[80] P. Weiss, “Stop-and-go science,” Science News, vol. 156, no. 1, pp.
8–10, July 1999.

[81] Kornylak, “Omniwheel brochure,” Hamilton, Ohio, 2008. [Online].
Available: http://www.kornylak.com/images/pdf/omni-wheel.pdf.
[82] S. Gilbert, N. Lynch, S. Mitra, and T. Nolte, “Self-stabilizing robot for-
mations over unreliable networks,” ACM Trans. Auton. Adapt. Syst.,
vol. 4, no. 3, pp. 1–29, July 2009.
[83] S. Dolev, L. Lahiani, S. Gilbert, N. Lynch, and T. Nolte, “Virtual
stationary automata for mobile networks,” in PODC ’05: Proceedings
of the twenty-fourth annual ACM symposium on Principles of distributed
computing. New York, NY, USA: ACM, 2005, p. 323.
[84] T. Nolte and N. Lynch, “A virtual node-based tracking algorithm
for mobile networks,” in International Conference on Distributed
Computing Systems (ICDCS). Los Alamitos, CA, USA: IEEE Computer
Society, 2007, pp. 1–9.
[85] S. Owre, S. Rajan, J. Rushby, N. Shankar, and M. Srivas, “PVS:
Combining specification, proof checking, and model checking,” in
Computer-Aided Verification, CAV ’96, ser. LNCS, R. Alur and T. A.
Henzinger, Eds., no. 1102. New Brunswick, NJ: Springer-Verlag,
July/August 1996, pp. 411–414.
[86] R. Olfati-Saber, “Flocking for multi-agent dynamic systems: algo-
rithms and theory,” IEEE Trans. Autom. Control, vol. 51, no. 3, pp.
401–420, Mar. 2006.
[87] J. Fax and R. Murray, “Information flow and cooperative control of
vehicle formations,” IEEE Trans. Autom. Control, vol. 49, no. 9, pp.
1465–1476, Sep. 2004.
[88] A. Jadbabaie, J. Lin, and A. Morse, “Coordination of groups of mo-
bile autonomous agents using nearest neighbor rules,” IEEE Trans.
Autom. Control, vol. 48, no. 6, pp. 988–1001, June 2003.
[89] H. Tanner, A. Jadbabaie, and G. Pappas, “Stable flocking of mobile
agents, Part I: Fixed topology,” in Proceedings of the 42nd IEEE
Conference on Decision and Control, vol. 2, 2003.
[90] V. Gazi and K. M. Passino, “Stability of a one-dimensional discrete-
time asynchronous swarm,” IEEE Trans. Syst., Man, Cybern. B, vol. 35,
no. 4, pp. 834–841, Aug. 2005.
[91] C. Dwork, N. Lynch, and L. Stockmeyer, “Consensus in the presence
of partial synchrony,” J. ACM, vol. 35, no. 2, pp. 288–323, 1988.
[92] M. J. Fischer, N. A. Lynch, and M. S. Paterson, “Impossibility of
distributed consensus with one faulty process,” J. ACM, vol. 32,
no. 2, pp. 374–382, 1985.

[93] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous
deterministic and stochastic gradient optimization algorithms,”
IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.

[94] V. Blondel, J. Hendrickx, A. Olshevsky, and J. Tsitsiklis, “Convergence
in multiagent coordination, consensus, and flocking,” in Proceedings
of the 44th IEEE Conference on Decision and Control and 2005 European
Control Conference (CDC-ECC ’05), Dec. 2005, pp. 2996–3000.

[95] H. G. Tanner, A. Jadbabaie, and G. J. Pappas, “Flocking in fixed
and switching networks,” IEEE Trans. Autom. Control, vol. 52, pp.
863–868, May 2007.

[96] W. Ren, R. Beard, and E. Atkins, “Information consensus in multivehicle
cooperative control,” IEEE Control Systems Magazine, vol. 27,
no. 2, pp. 71–82, Apr. 2007.

[97] R. Lozano, M. Spong, J. Guerrero, and N. Chopra, “Controllability
and observability of leader-based multi-agent systems,” in Proceedings
of the 47th IEEE Conference on Decision and Control (CDC 2008),
Dec. 2008, pp. 3713–3718.

[98] A. Kashyap, T. Başar, and R. Srikant, “Quantized consensus,”
Automatica, vol. 43, no. 7, pp. 1192–1203, May 2007.

[99] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. Tsitsiklis, “On distributed
averaging algorithms and quantization effects,” in Proceedings of the
47th IEEE Conference on Decision and Control (CDC 2008), Dec. 2008,
pp. 4825–4830.

[100] D. Liberzon, Switching in Systems and Control. Boston, MA, USA:
Birkhäuser, 2003.

[101] J. Yu, S. LaValle, and D. Liberzon, “Rendezvous without coordinates,”
in Proceedings of the 47th IEEE Conference on Decision and Control
(CDC 2008), Dec. 2008, pp. 1803–1808.

[102] K. M. Chandy, S. Mitra, and C. Pilotto, “Convergence verification:
From shared memory to partially synchronous systems,” in Formal
Modeling and Analysis of Timed Systems (FORMATS ’08), ser. LNCS,
vol. 5215. Springer-Verlag, 2008, pp. 217–231.

[103] D. Seto and L. Sha, “A case study on analytical analysis of the in-
verted pendulum real-time control system,” Carnegie Mellon Uni-
versity, Pittsburgh, PA, CMU/SEI Tech. Rep. 99-TR-023, Nov. 1999.
