CIS 763: Notes On Faults and Fault-Tolerances
Recall that, in the absence of faults, a program satisfies its safety and liveness specification. We prove this satisfaction by exhibiting an invariant predicate such that, in the absence of faults, the program is always at a state where the invariant predicate is true.
Faults.
The faults that a program is subject to may be classified in a variety of ways:
Type: e.g., the faults are stuck-at, fail-stop, crash, omission, timing, performance, or Byzantine.
Duration: e.g., the faults are permanent, intermittent, or transient.
Observability: e.g., the faults are detectable or not.
Repair: e.g., the faults are correctable or not.

To reason about faults in a simple and uniform manner, we adopt the following thesis: Faults are systematically represented by actions whose execution perturbs the program state.

Definition (Fault-class). A fault-class for a program p is a set of actions over the variables of p.
Consider, for example, a fault that corrupts the state of a wire. The wire itself is represented by the following program action over two bit variables in and out:

  $out \neq in \rightarrow out := in$

The fault that corrupts the state of the wire is represented by the fault action:

  $out = in \rightarrow out :=\ ?$

(Here ? denotes an arbitrary value.)
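The wire example can be sketched in Python by treating each guarded action as a function from a state to the set of its possible successor states. This is only an illustrative encoding: the names `wire_action`, `corrupt_action`, and `invariant` are hypothetical, the program guard is read as "out differs from in", and ? is read as "any bit".

```python
from itertools import product

# A state of the wire is a pair (inp, out) of bits.
STATES = list(product([0, 1], repeat=2))

def wire_action(state):
    """Program action: out != in -> out := in (empty set if the guard is false)."""
    inp, out = state
    return {(inp, inp)} if out != inp else set()

def corrupt_action(state):
    """Fault action: out = in -> out := ?  where ? is any bit (assumed reading)."""
    inp, out = state
    return {(inp, bit) for bit in (0, 1)} if out == inp else set()

def invariant(state):
    """The wire is in a legitimate state when out carries the value of in."""
    inp, out = state
    return out == inp

# States the fault can produce from legitimate states.
perturbed = set().union(*(corrupt_action(s) for s in STATES if invariant(s)))
# States the program produces from the perturbed states that violate the invariant.
repaired = set().union(*(wire_action(s) for s in perturbed if not invariant(s)))

print(sorted(perturbed))
print(sorted(repaired))
```

In this encoding the fault can drive the wire to any of the four states, and one execution of the program action returns every illegitimate state to the invariant.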
For this representation to capture all of the categories mentioned above sometimes requires the use of auxiliary state. For example, consider the fault by which the wire is stuck-at-low-voltage. In this case, the correct behavior of the wire is represented by using an auxiliary boolean variable broken and the program action:

  $out \neq in \wedge \neg broken \rightarrow out := in$

If a fault occurs, the incorrect behavior of the wire is represented by the program action that sets out to 0 provided that the state of the wire is broken:

  $broken \rightarrow out := 0$

The stuck-at-low-voltage fault is represented by the fault action:

  $\neg broken \rightarrow broken := true$

Continuing along these lines, consider process crashes. The crash of a process is represented by introducing an auxiliary variable up for that process, as follows. Each action of that process is to be executed only if up is true. The crash itself is modeled as the occurrence of a fault that corrupts up by setting it to false.

Similarly, the Byzantine behavior of a process can be captured by introducing an auxiliary variable good, as follows. If the variable good is true, then the process executes its normal actions. When a fault action corrupts good to false, the process executes actions whose behavior is nondeterministic.
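A minimal sketch of the stuck-at-low-voltage encoding follows. It assumes a state made of the two wire bits plus the auxiliary variable broken; the class `Wire` and the function names are hypothetical, and each action returns `None` when its guard is false.

```python
from typing import NamedTuple, Optional

class Wire(NamedTuple):
    inp: int
    out: int
    broken: bool  # auxiliary variable, introduced only to model the fault

def copy_input(s: Wire) -> Optional[Wire]:
    """Program action: out != in and not broken -> out := in."""
    if s.out != s.inp and not s.broken:
        return s._replace(out=s.inp)
    return None

def drive_low(s: Wire) -> Optional[Wire]:
    """Program action: broken -> out := 0."""
    if s.broken:
        return s._replace(out=0)
    return None

def stuck_at_low(s: Wire) -> Optional[Wire]:
    """Fault action: not broken -> broken := true."""
    if not s.broken:
        return s._replace(broken=True)
    return None

s0 = Wire(inp=1, out=1, broken=False)
s1 = stuck_at_low(s0)  # the fault changes only the auxiliary variable
s2 = drive_low(s1)     # the program itself then exhibits the faulty behavior
print(s1, s2, copy_input(s2))
```

The point of the encoding is that the fault action touches only broken. The visible misbehavior (out forced to 0, and the copy action disabled) comes from ordinary program actions whose guards mention broken. A crash could be modeled the same way with an auxiliary variable up guarding every action of the process.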
We are now ready to define what it means for a program p with an invariant S to tolerate a fault-class F.
Tolerances.
Definition (Fault-span). Let S be an invariant of a program p and F be a fault-class. T is an F-span of p from S iff $S \Rightarrow T$, T is closed in p, and each action of F preserves T.

Definition (F-tolerant for SPEC from S). p is F-tolerant for SPEC from S iff there exists a state predicate T that satisfies the following three conditions:

1. At any state where S is true, T is also true. (In other words, $S \Rightarrow T$.)
2. Starting from any state where T is true, if any action in p or F is executed, the resulting state is also one where T is true. (In other words, T is closed in p and T is closed in F.)
3. Starting from any state where T is true, every computation of p alone eventually reaches a state where S is true. (In other words, T leads to S in p.)

This definition may be understood as follows. The state predicate T is an F-span of p from S, that is, a boundary in the state space of p up to which (but not beyond which) the state of p may be perturbed by the occurrence of faults in F. If faults in F continue to occur, the state of p remains within this boundary. When faults in F stop occurring, p converges from this boundary to the stricter boundary in the state space where the invariant S is true.

It is important to note that there may be multiple state predicates T for which p meets the above three conditions. Each of these predicates captures a (potentially different) type of fault-tolerance of p.
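For a finite state space, the three conditions can be checked mechanically. The sketch below is one possible encoding, not a definitive one: predicates are sets of states, actions are functions returning successor sets, and the helper names (`successors`, `closed`, `leads_to`, `is_tolerant`) are hypothetical. It assumes no fairness, so any cycle or deadlock of p outside S counts as a failure of "T leads to S". The example system is the corruptible wire, with the program guard read as "out differs from in".

```python
from itertools import product

def successors(actions, state):
    """States reachable from `state` by executing one enabled action."""
    return set().union(*(act(state) for act in actions))

def closed(pred, actions):
    """pred is closed in `actions` if no action leads out of pred."""
    return all(successors(actions, s) <= pred for s in pred)

def leads_to(T, S, program):
    """Every maximal computation of `program` starting in T reaches S."""
    reach = set(S)
    changed = True
    while changed:
        changed = False
        for s in T - reach:
            nxt = successors(program, s)
            if nxt and nxt <= reach:  # some step exists and every step gets closer
                reach.add(s)
                changed = True
    return T <= reach

def is_tolerant(T, S, program, faults):
    """Conditions 1-3 of F-tolerance for a candidate fault-span T."""
    return (S <= T
            and closed(T, program) and closed(T, faults)
            and leads_to(T, S, program))

# Example: the wire, with states (inp, out).
def wire(state):
    inp, out = state
    return {(inp, inp)} if out != inp else set()

def corrupt(state):
    inp, out = state
    return {(inp, b) for b in (0, 1)} if out == inp else set()

STATES = set(product([0, 1], repeat=2))
S = {s for s in STATES if s[0] == s[1]}  # invariant: out = in
T = STATES                               # candidate fault-span: every state

print(is_tolerant(T, S, [wire], [corrupt]))  # True
print(is_tolerant(S, S, [wire], [corrupt]))  # False: S is not closed in F
```

Choosing T to be the whole state space always satisfies closure, so the real content of the check in that case is condition 3.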
Types of Tolerances.
We now proceed to classify three types of fault-tolerances that a program can exhibit, namely masking, nonmasking, and fail-safe tolerance.
1. In the presence of faults, a masking tolerant program always satisfies its safety specification, and the execution of p after the execution of actions in F yields a computation that is in both the safety and the liveness specification of p; i.e., the computation is in the problem specification of p.

Definition (masking tolerant). p is masking tolerant to F for SPEC from S iff p is F-tolerant for SPEC from S, and S is closed in F. (In other words, if a fault in F occurs in a state where S is true, p continues to be in a state where S is true.)

We prove this tolerance by exhibiting an invariant predicate such that, even in the presence of faults, the program is always at a state where the invariant predicate is true.

2. Nonmasking tolerance is less strict than masking tolerance: in the presence of faults, the program need not satisfy its safety specification but, when faults stop occurring, the program eventually satisfies both its safety and liveness specification; i.e., the computation has a suffix that is in the problem specification.

Definition (nonmasking tolerant). p is nonmasking tolerant to F for SPEC from S iff p is F-tolerant for SPEC from S, and S is not closed in F. (In other words, if a fault in F occurs in a state where S is true, p may be perturbed to a state where S is violated. However, p then recovers to a state where S is true.)

We prove this tolerance by exhibiting an invariant predicate such that, when faults stop occurring, the computation eventually reaches (recovers to) a state where the invariant predicate is true. More specifically, this involves calculating a fault-span predicate T and showing that T leads-to S in p.

We distinguish a special case of nonmasking tolerance: p is stabilizing tolerant to F iff p is nonmasking tolerant to F, and true converges to S in p. (In other words, stabilizing tolerant programs recover from any state in the program state space to S.)

3. Fail-safe tolerance is also less strict than masking tolerance: in the presence of faults, the program satisfies its safety specification but, when faults stop occurring, the program need not satisfy its liveness specification; i.e., the computation is in the safety specification but not necessarily in the liveness specification.

Definition (fail-safe tolerant). Let SSPEC be the minimal safety specification that contains SPEC. p is fail-safe tolerant to F for SPEC from S iff there exists a state predicate R such that p is F-tolerant for SSPEC from $S \lor R$, and $S \lor R$ is closed in p and in F. (In other words, if a fault in F occurs in a state where S is true, p may be perturbed to a state where S or R is true. In the latter case, the subsequent execution of p yields a computation that is in SSPEC but not necessarily in SPEC.)

We prove this satisfaction by exhibiting an invariant predicate and a safe predicate such that, when faults occur, the program is always at a state where the invariant predicate is true or at least the safe predicate is true.
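The distinction between masking, nonmasking, and stabilizing tolerance can be illustrated on a small explicit transition system. The sketch below is an assumption-laden toy: transitions are dictionaries from a state to its successor set, the state names ("ok", "bad", "lost") and the function names are hypothetical, no fairness is assumed, and "true converges to S" is checked only as "every state leads to S". Fail-safe tolerance is left out because it needs an explicit safety specification SSPEC, which this encoding does not model.

```python
def closed(pred, rel):
    """pred is closed in rel if no transition of rel leaves pred."""
    return all(rel.get(s, set()) <= pred for s in pred)

def converges(T, S, prog):
    """Every computation of prog from T reaches S (no fairness assumed)."""
    done = set(S)
    while True:
        new = {s for s in T - done if prog.get(s) and prog[s] <= done}
        if not new:
            return T <= done
        done |= new

def classify(states, S, T, prog, faults):
    """Classify the tolerance witnessed by the candidate fault-span T."""
    both = {s: prog.get(s, set()) | faults.get(s, set()) for s in states}
    if not (S <= T and closed(T, both) and converges(T, S, prog)):
        return "not F-tolerant for this T"
    if closed(S, faults):
        return "masking"
    if converges(states, S, prog):
        return "stabilizing (nonmasking)"
    return "nonmasking"

prog = {"bad": {"ok"}, "lost": {"lost"}}  # p repairs "bad" but not "lost"
faults = {"ok": {"bad"}}                  # F perturbs "ok" to "bad"
S = {"ok"}

print(classify({"ok", "bad", "lost"}, S, {"ok", "bad"}, prog, faults))
print(classify({"ok", "bad"}, S, {"ok", "bad"}, prog, faults))
print(classify({"ok", "bad"}, S, S, prog, {}))
```

With the unrecoverable state "lost" in the state space the program is only nonmasking tolerant; removing it makes the program stabilizing, and removing the fault makes S itself a fault-span, which is the masking case.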
Examples of Types of Tolerances.
Consider the critical section problem. Its safety specification is mutual exclusion (multiple processes cannot simultaneously be in the critical section), and its liveness specification is freedom from deadlock (if some process requests critical section access, then eventually some process accesses its critical section).
For the critical section problem, a masking fault-tolerant solution would preserve mutual exclusion in the presence of faults and would satisfy freedom from deadlock if only finitely many faults occurred. A nonmasking fault-tolerant solution would eventually satisfy both mutual exclusion and freedom from deadlock if only finitely many faults occurred. Observe that this is equivalent to saying that the solution would satisfy freedom from deadlock and eventually satisfy mutual exclusion if only finitely many faults occurred. A fail-safe fault-tolerant solution would satisfy mutual exclusion in the presence of faults, but not necessarily freedom from deadlock.

Next, we give an example based on double and triple modular redundancy. The problem is to assign the value of an input variable to the variable out. Sensors named x, y, and z contain the value of the input variable. Faults may corrupt the value of at most one of the sensors.

Fault-intolerant program IR. Program IR consists of a single action that copies the value of x into out. The value $\bot$ of out denotes that out has not been assigned. Thus, the action of IR is as follows:

  IR :: $out = \bot \rightarrow out := x$
IR satisfies the specification in the absence of sensor corruption but not in its presence.
Fail-safe fault-tolerant program SR. To preserve safety in the presence of one corrupted sensor, we use another sensor y, thus obtaining double modular redundancy:

  SR :: $out = \bot \wedge x = y \rightarrow out := x$
SR does not satisfy its liveness specification in the presence of one sensor corruption.

Nonmasking fault-tolerant program NR. To restore safety in the presence of one corrupted sensor, while preserving liveness, we use yet another sensor z, thus obtaining triple modular redundancy:

  NR1 :: $out = \bot \rightarrow out := x$
  NR2 :: $out = x \wedge (x \neq y \wedge x \neq z) \rightarrow out := y$ or $out := z$
NR satisfies the liveness specification and eventually satisfies the safety specification in the presence of one sensor corruption.

Masking fault-tolerant program MR. In fact, triple modular redundancy suffices to preserve both safety and liveness in the presence of a sensor corruption:

  MR1 :: $out = \bot \wedge (x = y \vee x = z) \rightarrow out := x$
  MR2 :: $out = \bot \wedge (y = x \vee y = z) \rightarrow out := y$
  MR3 :: $out = \bot \wedge (z = y \vee z = x) \rightarrow out := z$
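The behavior of the four programs under a single corrupted sensor can be explored exhaustively with a short script. This is a sketch under several assumptions, not the notes' own construction: the unassigned value of out is modeled as `None`; the corruption is assumed to happen before the program starts; the guard of the second NR action is read as "out equals x and x differs from both y and z", which is an interpretation of the garbled source; and `explore` and `verdict` are hypothetical helpers. "safe" means out never holds a wrong value, and "converges" means every execution ends with out equal to the input value.

```python
BOT = None  # stands for the "not yet assigned" value of out

def IR(s):
    x, y, z, out = s
    return [(x, y, z, x)] if out is BOT else []

def SR(s):
    x, y, z, out = s
    return [(x, y, z, x)] if out is BOT and x == y else []

def NR(s):
    x, y, z, out = s
    nxt = []
    if out is BOT:                          # first action: copy x
        nxt.append((x, y, z, x))
    if out == x and x != y and x != z:      # second action (assumed guard)
        nxt += [(x, y, z, y), (x, y, z, z)]
    return nxt

def MR(s):
    x, y, z, out = s
    nxt = []
    if out is BOT and (x == y or x == z):
        nxt.append((x, y, z, x))
    if out is BOT and (y == x or y == z):
        nxt.append((x, y, z, y))
    if out is BOT and (z == y or z == x):
        nxt.append((x, y, z, z))
    return nxt

def explore(program, start):
    """All states reachable from start, and those with no enabled action."""
    seen, terminal, stack = set(), set(), [start]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        nxt = program(s)
        if not nxt:
            terminal.add(s)
        stack.extend(nxt)
    return seen, terminal

def verdict(program, value=1):
    """(safe, converges) over every choice of one corrupted sensor."""
    safe = converges = True
    for i in range(3):
        sensors = [value] * 3
        sensors[i] = 1 - value  # corrupt exactly one sensor
        seen, terminal = explore(program, (*sensors, BOT))
        safe = safe and all(s[3] in (BOT, value) for s in seen)
        converges = converges and all(s[3] == value for s in terminal)
    return safe, converges

for name, prog in [("IR", IR), ("SR", SR), ("NR", NR), ("MR", MR)]:
    print(name, verdict(prog))
```

Under these assumptions the four verdicts line up with the four labels: IR is neither safe nor convergent, SR is safe but may leave out unassigned, NR may pass through a wrong value but always ends correct, and MR is both safe and convergent.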
Remarks.
"In the absence of faults" means that each computation consists of program actions only.
"In the presence of faults" means that each computation is an interleaving of program and fault actions.
"When faults stop occurring" means that the computation has only finitely many occurrences of fault actions.
"A computation eventually satisfies a property" means that the computation has a suffix that satisfies the property.

For design and engineering purposes, it is important to characterize the classes of faults that the program is subject to. This characterization involves analyzing the environment of the program (the environment includes other programs with which this program interacts). In some cases, exhaustively characterizing the fault classes is difficult. In such cases, one should choose some fault-class that is large enough to accommodate all possible faults. It is often for this reason that designers choose weak fault-models such as transient state failures (where the state may be perturbed arbitrarily) or Byzantine failures (where the program may behave arbitrarily).

We have made an assumption in this discussion: execution of any fault action in F always maintains the problem specification; i.e., if a prefix $\sigma$ maintains the problem specification and $\sigma s$ is the extended prefix obtained by execution of a fault action in F (where s is a state and $\sigma s$ is the concatenation of $\sigma$ and s), then $\sigma s$ also maintains the problem specification.