
ON TOLERATING FAULTS IN NATURALLY REDUNDANT ALGORITHMS*

Luiz A. Laranjeira†    Miroslaw Malek‡    Roy Jenevein§

Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, Texas 78712, USA

Abstract

A class of algorithms suitable for fault-tolerant execution in multiprocessor systems by exploiting the existing embedded redundancy in the problem variables is characterized in this paper. Because of this unique property, no extra computations need be superimposed on the algorithm in order to provide redundancy for fault recovery, as well as fault detection in some cases. A forward recovery scheme is thus employed with very low time overhead. The method is applied to the implementation of two iterative algorithms: solution of Laplace equations by Jacobi's method and the calculation of the invariant distribution of a Markov chain. Experiments show less than 15% performance degradation for significant problem instances in fault-free situations, and as low as 2.43% in some cases. The extra computation time needed for locating and recovering from a detected fault does not exceed the time necessary to execute a single iteration. The fault detection procedures provide fault coverage close to 100% for faults causing errors that affect the correctness of the computations.

1 Introduction

It is a well known fact that redundancy is a necessity in the design of fault-tolerant systems. Therefore, a greatly important goal in this research area is how to provide the necessary redundancy for a system to be able to execute reliably with minimum overhead in terms of space (including hardware and software) and execution time.

In this paper we characterize a class of algorithms that need no hardware replication and require very small execution time overhead in order to provide redundancy for fault recovery, and sometimes for fault detection, too. This is possible due to a unique property these algorithms present: natural redundancy in the problem variables. Furthermore, the type of recovery this natural redundancy enables, forward recovery [1], can cause lower performance degradation than other recovery techniques such as checkpointing and rollback [2], besides not requiring massive hardware replication as fault masking techniques such as NMR (N-Modular Redundancy) do [3]. In order to achieve fault tolerance with this class of algorithms it is still necessary to add explicit procedures that use the available redundancy for fault recovery and detection, and to implement schemes for fault location. Redundancy for fault recovery (and sometimes for fault detection), however, comes for free, and this is the most salient feature of the method we are proposing. Other advantages of our approach include the simplicity of the software-implemented schemes necessary for detecting faults and recovering the missing correct computational values.

Other methods that can tolerate hardware faults through software-based schemes include self-stabilizing algorithms [4], inherently fault-tolerant algorithms [5] and algorithm-based fault tolerance [6]. The approach exploited in this research can tolerate a broader range of single processor faults (both permanent and temporary) with less performance degradation than self-stabilizing and inherently fault-tolerant algorithms, and with no need of algorithm redesign for providing redundancy for fault recovery, as in algorithm-based fault tolerance.

Although the proposed technique could be applied to both single and multiple processor architectures, we focus on multiprocessors, where fault-tolerant algorithms are very necessary due to the increased probability of faults. The extra hardware existing in a multiprocessor system increases the probability of occurrences of failures, and consequently of faults, as well as provides the potential for tolerating them. In the body of this paper we will define Naturally Redundant Algorithms and show that they are well suited to exploit this potential.

*This research was supported in part by ONR under Grant N00014-88-K-0543 and NASA under Grant NAG9-426.
†Phone: (512) 471-1658, Email: luiz@emx.utexas.edu
‡On leave at the Office of Naval Research in London. Phone: (44)(71)4044478, Email: malek@emx.utexas.edu
§Phone: (512) 471-9722, Email: jenevein@cs.utexas.edu

We illustrate the application of our method with two iterative synchronous algorithms: the solution of Laplace equations by Jacobi's method and the computation of the invariant distribution of a Markov chain. The experimental results show a performance degradation of less than 15%, and as low as 2.43% in some cases, in the fault-free execution of the algorithms for significant problem sizes. When a fault occurs it is located and recovered from with a performance penalty no larger than the execution time of a single iteration. The target architecture for our experiments was a Sequent Symmetry multiprocessor (see Section 3), in which we considered that the bus is fault-free and that memory faults are taken care of by error detecting/correcting codes. Our method would also be amenable to implementation in a distributed environment with some inexpensive modifications. In such a case the fault coverage would be larger and the performance degradation higher when recovering from faults (see Section 3).

A clear disadvantage of our method is that it is application-dependent rather than general. We also assume that the software is correct; that is, the proposed technique aims to tolerate processor faults. Even though single faults are the ones that can always be tolerated with our method, in some cases multiple faults could also be covered (see Section 4.3).

In the next section we state some definitions. Section 3 describes the target architecture and the utilized fault model. Sections 4 and 5 detail how natural redundancy was exploited to provide fault tolerance in our examples. Section 6 presents the results of our experiments and Section 7 states our conclusions.

2 Naturally Redundant and Fault-Tolerant Algorithms

In this section we give definitions that clarify the concepts we will work with throughout the rest of the paper.

Definition 1: If a given algorithm A maps an input vector X = (x_1 x_2 ... x_n) to an output vector Y = (y_1 y_2 ... y_m) and the redundancy relation

{∀ y_i ∈ Y, ∃ F_i | y_i = F_i(Y − {y_i})}

holds, then A is called a Naturally Redundant Algorithm. Each x_i (y_i) may be either a single component of the input (output) or a subvector of components.

From this definition we can see that a naturally redundant algorithm running on a processor architecture P has at least the potential to restore the correct value of any single erroneous component y_i in its output vector. This will be the case when each F_i is a function of every y_j, j ≠ i. If each F_i is a function of only a subset of the components of Y − {y_i}, then the algorithm would potentially be able to recover more than one erroneous y_i.

In the parallel execution of many applications, processors communicate their intermediate calculation values to other processors as the computation proceeds. In such cases, the erroneous intermediate calculations of a faulty processor can corrupt subsequent computations of other processors. It is thus desirable that the correct intermediate calculations can be recovered before they are communicated to other processors. This motivates the definition of algorithms which can be divided in phases which are themselves naturally redundant.

Definition 2: An algorithm A is called a phase-wise naturally redundant algorithm if (a) Algorithm A can be divided in phases such that the output vector of one phase is the input vector for the following phase; (b) The output vector of each phase satisfies the redundancy relation.

In this paper we focus our attention on phase-wise naturally redundant algorithms. In order to use natural redundancy for achieving fault tolerance we utilize mappings to a multiprocessor architecture such that in each phase the components of the phase output vector are computed independently (by different processors).

According to Mili in [1], a correct intermediate state of a computation of an algorithm can be strictly correct, loosely correct or specification-wise correct. Correspondingly, an algorithm can be naturally redundant in a strict, loose or specification-wise sense depending on whether the value of a component of a phase output vector, as calculated by the redundancy relation, is strictly, loosely or specification-wise correct. The value of a component of a phase output vector calculated by the redundancy relation is strictly correct if it is exactly equal to the value (correctly) calculated by the algorithm. It is loosely correct if it is not equal to the value calculated by the algorithm but its utilization in subsequent calculations will still lead to the expected results (those that would be achieved if only strictly correct values were utilized). Finally, it is specification-wise correct if it is not equal to the value computed by the algorithm and its further utilization does not lead to the expected results, but to results that also satisfy system specifications.

Of the two examples presented in this paper, the algorithm for Laplace equations is naturally redundant in a loose sense and the algorithm for the Markov chain calculation is naturally redundant in a strict sense.

Natural redundancy allows for a forward recovery approach, since there is no need of backtracking the computation in order to restore a correct value of an erroneous output vector component.

A naturally redundant algorithm can be made fault-tolerant by adding to it specific functionality to detect, locate and recover from faults utilizing its natural redundancy.
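As a concrete illustration of Definition 1, drawn from the Markov chain example developed in Section 5: there the components of each phase output vector sum to one, so every recovery function has the closed form

F_i(Y − {y_i}) = 1 − Σ_{j≠i} y_j ,

which is a function of every y_j, j ≠ i, and can therefore restore exactly one erroneous component per phase.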

3 Target Architecture and Fault Model

We will consider as our target architecture an asynchronous shared memory MIMD machine such as the Sequent Symmetry, where 12 processors are linked by a common bus. The parallelism in such an architecture is considered to be coarse-grained. We consider that processors operate by responding to triggering events such as the acquisition of a lock or the reaching of a synchronization point.

The approach described in this paper aims to tolerate single processor faults, either permanent or temporary (transient or intermittent). We view the applications as fault-free; that is, software design faults are not considered. Since we are studying algorithms that can be divided in phases, we allow one processor per phase to produce erroneous results due to temporary (transient or intermittent) faults, or one processor under a permanent faulty condition for the whole duration of the computation.

We modify a fault classification scheme in [7], where several nested fault classes are considered, by adding a layer of incorrect computation faults (see Fig. 1). A crash fault occurs when a processor systematically stops responding to its triggering events. An omission fault occurs when a processor either systematically or occasionally does not respond to a triggering event. A timing fault occurs when, in response to a triggering event, a processor gives the right output too early, too late or never. An incorrect computation fault may be a timing fault (with correct computational values), a computation which is delivered on time but contains wrong or corrupted results, or a computation with incorrect results which is also delivered out of the expected time interval. Another aspect of our fault model is that faults may be permanent or temporary. Crash faults are always permanent, whereas omission, timing and incorrect computation faults can be either permanent or temporary.

[Figure 1: A nested fault classification scheme.]

We assume that the bus is fault-free and memory faults are tolerated with the use of error detection/correction codes, such as Hamming codes. We also consider that the address generation logic of processors and the address decoding circuits of the memory system are reliable. This can be achieved by hardware redundancy schemes such as self-checking logic or replicated logic with voting (see [3]).

Our approach would also be amenable to implementation in a distributed environment utilizing message passing instead of shared memory. In that case, memory and addressing faults could be recovered without redundant logic. They would simply be seen as processor faults and be reflected in the correctness of the computations. The overall method would not change, although some more execution time overhead should be expected when faults occur, due to the extra number of messages that would be exchanged during fault diagnosis and recovery.

4 Solution of Laplace Equations

4.1 Iterative Techniques for Laplace Equations

The solution of Laplace equations is required in the study of important problems such as seismic modeling, weather forecasting, heat transfer and electrical potential distributions. Equation 1 is a two-dimensional Laplace equation:

∂²φ/∂x² + ∂²φ/∂y² = 0    (1)

The usual approach to an iterative solution consists in discretizing the problem domain with an n x n uniformly-spaced square grid of points, so that all n² points, except those on boundaries, have four equidistant nearest neighbors. Equation 1 is then approximated by a first order difference equation such as Equation 2, where x and y are the row and column indices over all grid points, and the value of the function φ at each grid point is calculated in an iterative fashion until convergence is achieved:

φ_{x,y} = (1/4)(φ_{x+1,y} + φ_{x-1,y} + φ_{x,y+1} + φ_{x,y-1})    (2)

4.2 Jacobi's Iterative Method and Natural Redundancy

One of the most common iterative techniques used to solve Laplace equations is Jacobi's method. Equation 3 shows the update procedure for Jacobi's method. In order to calculate the next iteration value of a point one needs the values of its nearest neighbors calculated at the previous iteration:

φ^{k+1}_{x,y} = (1/4)(φ^k_{x+1,y} + φ^k_{x-1,y} + φ^k_{x,y+1} + φ^k_{x,y-1})    (3)
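To make the update concrete, a minimal serial sketch of one Jacobi sweep follows. This is our illustration, not the paper's code (the original implementation, per Section 6.1, used C with the PPL parallel extensions on the Sequent Symmetry); the names `phi` and `phi_new` and the returned convergence measure are our own choices:

```c
/* One Jacobi iteration over an n x n grid (boundary points fixed).
 * phi holds iteration k, phi_new receives iteration k+1.
 * Returns the largest change, which the caller compares against
 * the convergence factor epsilon. */
double jacobi_sweep(int n, double phi[n][n], double phi_new[n][n])
{
    double max_delta = 0.0;
    for (int x = 1; x < n - 1; x++) {
        for (int y = 1; y < n - 1; y++) {
            /* Equation 3: average of the four nearest neighbors
             * from the previous iteration. */
            phi_new[x][y] = 0.25 * (phi[x + 1][y] + phi[x - 1][y] +
                                    phi[x][y + 1] + phi[x][y - 1]);
            double d = phi_new[x][y] - phi[x][y];
            if (d < 0.0) d = -d;
            if (d > max_delta) max_delta = d;
        }
    }
    return max_delta;
}
```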

Let us consider Φ^k the vector composed of the values of all grid points φ^k_{x,y} after the kth iteration. In general, the sequence defined by {Φ^k}, k ≥ 0, will converge to a solution Φ* = (φ*_{1,1}, φ*_{1,2}, ..., φ*_{n,n}). In practice, however, one cannot obtain the exact final solution Φ* due to computer finite wordlength limitations. A convergence criterion is then established, defined by an approximation factor ε. The execution of the algorithm should stop after the kth iteration if |φ^k_{x,y} − φ*_{x,y}| ≤ ε, 1 ≤ x, y ≤ n. As the values of the φ*_{x,y} are not known in most cases, convergence is considered to be achieved if |φ^k_{x,y} − φ^{k−1}_{x,y}| ≤ ε, 1 ≤ x, y ≤ n.

[Figure 2: Domain decomposition of grid points with 8 partitions.]

Theorem 1: An algorithm implementing Jacobi's method for solving Laplace equations is a phase-wise naturally redundant algorithm in a loose sense.

The proof of this theorem is omitted here because of space limitations but is available in [9].

4.3 Fault Tolerance and Mapping to the Target Architecture

Once we have a redundant algorithm, the next step in achieving fault tolerance is to map the algorithm onto the target architecture in such a way that the existing redundancy can be exploited for tolerating faults.

An ordinary implementation of Jacobi's method for solving Laplace equations utilizing P processors would assign to each processor a subsquare, or subrectangle, of the grid points. This scheme is shown in Fig. 2 for P = 8, where odd points (those whose grid coordinates sum to an odd value) and even grid points (those whose grid coordinates sum to an even value) are differentiated. In order to calculate the next iteration of points, processors in charge of contiguous portions of the grid need the values of their neighboring points. Algorithm synchronization is achieved by introducing a synchronization point after which each processor can start the next iteration.

We present an alternative mapping that exploits the problem's natural redundancy. This mapping is based on the fact that in order to calculate the next iteration of odd points one needs only even points, and vice-versa. For an architecture with P processors, we divide the grid in P/2 portions and assign two processors to each partition. One processor calculates the even points, while the other is in charge of the odd points of the partition in a given iteration. This is the same as having each partition divided in two subpartitions, the even one and the odd one, with each subpartition assigned to one processor. The P processors are then divided into two clusters of P/2 processors each, cluster A and cluster B. Generally, in the kth iteration (k = 1, 2, 3, ...) the processors in cluster A will be calculating new even points if k is odd, or new odd points if k is even. Conversely, processors in cluster B will be calculating new odd points if k is odd, or new even points if k is even.

Fig. 3 shows this domain partition for the case P = 8, and Fig. 4 depicts the communication graph for the execution of the algorithm with two clusters of processors (this graph captures only the calculation of the iteration values, not the convergence checking nor other intracluster interactions necessary during fault location, recovery or reconfiguration).

It is easy to see that this scheme can be used to achieve a fault-tolerant execution of Jacobi's method. If a faulty processor, say in cluster A, produces erroneous results while calculating odd point values, error-free values corresponding to the same points can be recovered by the processor in cluster B which has the newly calculated even points of that partition.

Although our basic goal is to tolerate single processor faults, due to the special redundant characteristics of this algorithm, multiple faults affecting only processors in one cluster (leaving the processors in the other cluster fault-free), or affecting noncorresponding processors in different clusters, could also be tolerated.

4.4 Fault Detection and Fault Diagnosis

The special cases of crash, timing and omission faults can be detected by the use of watchdog timers. Fault location in this case can be easily accomplished. The processor that did not reach the synchronization point due to a fault can be identified by the other processors if they read the values of the corresponding synchronization variables.

Faults producing timely executed computations with erroneous values can be detected by letting each processor in a cluster repeat part of the computation (one row or column) of one of its neighboring processors in each iteration, and then compare the common results. The neighboring processor tested by a processor is chosen so that the testing graph for each cluster is a ring. The system testing graph is shown in Fig. 6, where a processor pointed to by an arrow is tested by the processor at the tail of the arrow.
In more detail, a processor testing a neighboring processor reads from the data space of the tested processor two rows (columns) during the calculation of an iteration, instead of just one as in the non-fault-tolerant scheme. This can be seen in Fig. 5, where a processor with even points reads two rows from the data space of the processor in the same cluster which is in charge of the neighboring partition (4A in Fig. 4); we call this processor its south neighbor for simplicity. The tester processor then calculates the next iteration of odd points corresponding to its partition, plus the uppermost row of odd points belonging to the partition of its south neighbor. Then it reads from the data space of the tested processor the row of odd points corresponding to the extra one it computed and does the comparison. If the comparison succeeds (equal results), a fault-free situation is assumed. If the comparison fails (different results), the tester assumes the tested processor is faulty and a fault diagnosis algorithm is triggered. The fault diagnosis algorithm is necessary to locate the actually faulty processor, because a faulty processor testing a fault-free one may erroneously find it faulty.

[Figure 3: Odd-even domain decomposition with 4 main partitions generating 2 clusters of 4 subpartitions each.]

[Figure 4: A processor communication graph for a modified algorithm.]

[Figure 5: How a processor can calculate an extra row of elements for testing purposes.]

[Figure 6: System testing graph.]

We used a distributed algorithm for fault diagnosis (fault location) similar to the one in [10]. Due to the Sequent Symmetry architecture (shared memory) and our assumption of reliable memory accesses, each processor can correctly access the test results of every other processor. This way all the fault-free processors can correctly diagnose a faulty processor without needing several communication rounds, as would be necessary in a message passing architecture. In our implementation, the tests already available from the fault detection procedure were utilized in the diagnosis algorithm.
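A sketch of the comparison test in C follows (our reconstruction; the function and parameter names are hypothetical). Exact equality is appropriate here because, as Section 6.1 notes, tester and tested processor follow identical computation paths and therefore incur identical roundoff:

```c
/* Tester recomputes the first row of its south neighbor's
 * subpartition from the shared previous-iteration grid phi and
 * compares it with the row the neighbor published (neighbor_row).
 * Returns 1 if the neighbor is suspected faulty, 0 otherwise. */
int test_south_neighbor(int n, int row, int parity,
                        const double phi[n][n],
                        const double neighbor_row[n])
{
    for (int y = 1; y < n - 1; y++) {
        /* Visit only grid points whose coordinate sum matches the
         * parity (odd/even) being computed in this iteration. */
        if ((row + y) % 2 != parity)
            continue;
        double expected = 0.25 * (phi[row + 1][y] + phi[row - 1][y] +
                                  phi[row][y + 1] + phi[row][y - 1]);
        /* Identical computation paths give identical roundoff,
         * so exact comparison is safe for this algorithm. */
        if (neighbor_row[y] != expected)
            return 1;  /* mismatch: trigger fault diagnosis */
    }
    return 0;
}
```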

The distinction between transient and permanent faults can be made by allowing a processor to recover from a maximum number of transient faults during the course of one computation. If the same processor is found faulty more than this maximum number of times in the same computation, it is then considered permanently faulty.
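A minimal sketch of this bookkeeping (ours; the names and the threshold value are illustrative, not taken from the paper):

```c
/* Transient/permanent distinction (Section 4.4): a processor found
 * faulty more than MAX_TRANSIENT times in one computation is
 * declared permanently faulty. The constants are assumptions:
 * 12 processors matches our Sequent configuration, and the
 * threshold is an arbitrary illustrative choice. */
#define NPROCS        12
#define MAX_TRANSIENT  3

static int fault_count[NPROCS];

int permanently_faulty(int proc)
{
    return ++fault_count[proc] > MAX_TRANSIENT;
}
```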
4.5 Reconfiguration

System reconfiguration addresses the problem of assuring a continuous execution of all processes involved in the computation after a permanent hardware fault is detected. In our work we assume that a spare processor is available in such a way as to be able to replace a permanently faulty processor. This assumption appears to be reasonable as compared to the massive hardware replication often utilized in order to provide reliability for critical applications. We should notice that we do not need an extra processor in fault-free situations nor when temporary faults occur. It is just assumed for reconfiguration purposes after the occurrence of permanent faults.

Reconfiguration takes place when a permanently faulty processor is detected. In this case, the process running in that processor is killed and, after fault recovery, a substitute process is created in the spare processor (a cold spare, in this case).

4.6 Fault Recovery

Fault recovery is the strongest point of our approach. In case a processor calculating even (odd) points in the kth iteration is found faulty, the processor in the other cluster which calculated the corresponding odd (even) points will recover the erroneous even (odd) points using Jacobi's update procedure as the forward recovery function. After that, if the fault is permanent, a process for the spare processor is created. Then the computation of the algorithm can continue.

The computation time required by the recovery function is equal to the computation time for calculating a new iteration of points. Taking into account that in a normal iteration, besides calculating the new point values, a processor checks for faults and tests the convergence of the algorithm, we can observe that the performance degradation due to the execution of the fault diagnosis algorithm plus the execution of the fault recovery function is less than the computation time of a complete iteration in fault-free conditions. This is actually a very small price to pay in terms of execution time overhead if we consider that significant problems of this nature often need thousands, if not millions, of iterations before convergence is reached.
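The forward recovery function is simply one more application of Equation 3, restricted to the lost parity. A sketch in C (ours; the names and partition bounds are illustrative):

```c
/* Forward recovery (Section 4.6): a processor in the fault-free
 * cluster rebuilds the points of a partition whose owner was
 * diagnosed faulty, applying Equation 3 to the error-free points
 * of the opposite parity that it just computed. */
void recover_partition(int n, int row_lo, int row_hi, int parity,
                       double phi[n][n])
{
    for (int x = row_lo; x <= row_hi; x++)
        for (int y = 1; y < n - 1; y++)
            if ((x + y) % 2 == parity)   /* only the lost parity */
                phi[x][y] = 0.25 * (phi[x + 1][y] + phi[x - 1][y] +
                                    phi[x][y + 1] + phi[x][y - 1]);
}
```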
5 Computation of the Invariant Distribution of Markov Chains

5.1 The Markov Model and Markov Chains

The Markov process model is a powerful tool for analyzing complex probabilistic systems such as those used in queueing theory and in computer systems reliability and availability modeling. The main concepts of this model are state and state transition. As time passes, the system goes from state to state under the basic assumption that the probability of a given state transition depends only on the current state. We are particularly interested here in the discrete-time, time-invariant Markov model, which requires all state transitions to occur at fixed time intervals and transition probabilities not to change over time.

[Figure 7: Graphic representation of a three-state Markov model (a) and the corresponding transition probability matrix P (b).]

Figure 7(a) shows a graphic representation of a three-state Markov model. The nodes represent the states of the modeled system, the directed arcs represent the possible state transitions and the arc weights represent the transition probabilities. The information conveyed in the graphic model can be summarized in a square matrix P (Fig. 7(b)), whose elements p_{ij} are the probabilities of a transition from a state i to a state j in a given time step. Such an n x n square matrix P is called the transition probability matrix of an n-state Markov model. P is a stochastic matrix, since it meets the following properties: p_{ij} ≥ 0 for 1 ≤ i, j ≤ n, and Σ_{j=1}^{n} p_{ij} = 1 for 1 ≤ i ≤ n.

A discrete-time, finite state Markov chain is a sequence {X^k | k = 0, 1, 2, ...} of random variables that take values in a finite set (state space) {1, 2, ..., n} and such that Pr(X^{k+1} = j | X^k = i) = q_{ij}, k ≥ 0, where Pr means probability. A Markov chain can be interpreted as the sequence of states of a system modeled by a Markov model with the probabilities q_{ij} given by the entries p_{ij} of the transition probability matrix P.

Let π^0 be an n-dimensional nonnegative row vector whose entries sum to 1. Such a vector defines a probability distribution for the initial state X^0 by means of the formula Pr(X^0 = i) = π^0_i.
Given an initial probability distribution π^0, the probability distribution π^k corresponding to the kth state X^k would be given by Equation 4, where P^k means P to the kth power:

π^k = π^0 P^k,  k ≥ 0    (4)

Equivalently,

π^{k+1} = π^k P,  k ≥ 0    (5)

It is often desired to compute the steady-state (invariant) probability distribution π^{ss} for a Markov chain. The vector π^{ss} is a nonnegative row vector whose components sum to 1, and has the property π^{ss} = π^{ss} P.

The following definitions and a theorem complete the theoretical background needed for our purposes. We omit the corresponding proofs, but they can be found by the interested reader in [8].

Definition 3: If P is a stochastic matrix, then: (a) the spectral radius ρ(P) of P is equal to 1; (b) if π^k is a row vector whose entries sum to 1, then the row vector π^k P has the same property.

Definition 4: A stochastic matrix P is called primitive if there exists a positive integer t such that, given the matrix P^t, for all entries p_{ij} of P^t it is true that p_{ij} > 0.

Theorem 2: If P is a primitive stochastic matrix, then: (a) there exists a unique row vector π^{ss} such that π^{ss} = π^{ss} P and Σ_{i=1}^{n} π^{ss}_i = 1; (b) the limit of P^t, as t tends to infinity, exists and is the matrix with all rows equal to π^{ss}; (c) if Σ_{i=1}^{n} π^0_i = 1, then the iteration given by π^{k+1} = π^k P, k ≥ 0, converges to π^{ss}.
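A minimal serial sketch of this iteration in C (ours, for illustration of Equation 5 and Theorem 2(c); the paper's parallel version instead assigns one component of π to each processor, as described below):

```c
#include <math.h>

/* Serial sketch of the iteration pi^{k+1} = pi^k P (Equation 5),
 * stopping when successive distributions differ by at most eps.
 * pi is overwritten with the (approximate) invariant distribution. */
void invariant_distribution(int n, const double P[n][n],
                            double pi[n], double eps)
{
    double next[n];   /* holds pi^{k+1} */
    for (;;) {
        double max_delta = 0.0;
        for (int j = 0; j < n; j++) {
            next[j] = 0.0;
            for (int i = 0; i < n; i++)
                next[j] += pi[i] * P[i][j];   /* row vector times P */
            double d = fabs(next[j] - pi[j]);
            if (d > max_delta) max_delta = d;
        }
        for (int j = 0; j < n; j++) pi[j] = next[j];
        if (max_delta <= eps) return;
    }
}
```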
5.2 A Naturally Redundant Algorithm

Considering that all matrices P we deal with in this paper are primitive, it is possible to prove the following theorem. The proof is omitted because of space limitations but is available in [9].

Theorem 3: An algorithm implementing the iteration given by π^{k+1} = π^k P, k ≥ 0 (see Equation 5), for the calculation of the invariant distribution of a Markov chain is a phase-wise naturally redundant algorithm in a strict sense.

5.3 Obtaining a Fault-Tolerant Algorithm

The necessary steps in order to have a fault-tolerant algorithm are: mapping to the target architecture, and deriving schemes for fault detection, fault location, fault recovery and reconfiguration.

The mapping of the algorithm to the parallel architecture is straightforward. In the kth iteration, each processor calculates one element of the probability distribution vector π^k. In this case we have p = n processors, and the ith processor calculates the value of π^k_i in the kth iteration. For that, the ith processor keeps locally a copy of the ith column of matrix P plus the initial value of the probability distribution vector π^0 (as we will see below, it also keeps a copy of the (i+1)th column of P for the sake of fault location). After the calculation of a new iteration, the new value of each π^k_i calculated by each processor can be accessed by the other processors.

Again, crash, timing and omission faults can be detected by the use of watchdog timers. The fault location procedure is identical to the one described in Section 4.4. We describe below how fault detection and diagnosis are accomplished in the case of faults resulting from computations executed on time but containing erroneous results.

Before initiating the next iteration we apply a reasonableness check. Each processor checks the calculated values against errors using the relation Σ_{i=1}^{n} π_i = 1. If the relation holds, a new iteration takes place. If the relation does not hold, then a distributed fault diagnosis algorithm is triggered. We see that, in this example, the natural redundancy in the algorithm allows for fault detection. This will be the case for algorithms that are naturally redundant in the strict sense.

It is worth noticing here that this fault detection procedure incurs no additional communication overhead. In order for a processor i to calculate its new iteration value π^{k+1}_i, it reads from the shared memory the values of every π^k_j, j ≠ i, calculated by the other processors. Since those are also the values it needs to do the checking, no extra memory accesses are necessary exclusively for the fault checking procedure.

The fault diagnosis algorithm is analogous to the one used in the first example we presented. In this case, however, we do not have individual processors testing each other in the fault detection procedure, because that is not necessary; rather, fault detection is accomplished by checking whether the redundancy relation holds. So, we need to provide processor-to-processor tests for fault location in another way. In order to do that we use an approach similar to RESO [11] (Recomputing with Shifted Operands). The difference is that we do a shifted recomputation at the processor level, not at the register level. If a fault was detected by the fault detection procedure in the kth iteration, the fault diagnosis procedure starts and processor i tests processor (i+1) by recalculating the value of π^{k+1}_{i+1} and comparing it to the value previously calculated by processor i+1 (if there are p processors, processor p tests processor 1). The faulty processor is then located using the diagnosis algorithm described in Section 4.4. If the fault is permanent, reconfiguration takes place. The differentiation between permanent and temporary faults is done in the same way as in the Laplace example, as also is the reconfiguration procedure.

Finally, if the ith processor is detected faulty, fault recovery is accomplished through the relation

π_i = 1 − Σ_{j=1, j≠i}^{n} π_j.
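The detection check and the recovery relation above can be sketched in C as follows (our illustration; the function names are ours, and `delta` is the experimentally calibrated roundoff tolerance discussed in Section 6):

```c
#include <math.h>

/* Reasonableness check (Section 5.3): the new distribution must
 * satisfy sum(pi_i) = 1, within a tolerance delta that absorbs
 * roundoff. Returns 1 if a fault is suspected, 0 otherwise. */
int detect_fault(int n, const double pi[n], double delta)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += pi[i];
    return fabs(sum - 1.0) > delta;
}

/* Forward recovery for a diagnosed processor i:
 * pi_i = 1 - sum over j != i of pi_j. */
void recover_component(int n, double pi[n], int i)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        if (j != i)
            sum += pi[j];
    pi[i] = 1.0 - sum;
}
```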
Again, the total time required for fault diagnosis and recovery is less than the time of a complete iteration.

It is worth noticing here that our solution utilizes shifted recomputation only for the sake of fault location, after a fault has been detected.

Error checks for fault detection, which represent the real overhead in fault-free situations, are accomplished using the natural redundancy of the problem. Because of that, the execution time overhead of this algorithm due to the addition of fault tolerance is less than in the previous example, as shown by the results of the next section.

6 Experimental Results

6.1 Testbed Description and Discussion of Main Issues

We ran our experiments on a Sequent Symmetry bus-based MIMD architecture with a configuration of 12 processors. We utilized the programming language C with a library of parallel extensions called PPL (Parallel Programming Language) and single floating point precision in the implementations.

Fault insertion was simulated by C statements introducing erroneous data at specific points in the computation paths. A bit error was introduced by flipping a bit, and a word error was introduced by flipping all the bits in a data word. We considered a fault to be permanent if it lasts more than three iterations.
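A sketch of such fault insertion (ours; the actual injection statements used in the experiments are not given in the paper):

```c
#include <stdint.h>
#include <string.h>

/* Fault insertion: corrupt a single-precision value by flipping
 * one bit (bit error) or all 32 bits (word error). memcpy is used
 * to reinterpret the representation without aliasing problems. */
float flip_bit(float v, int bit)
{
    uint32_t u;
    memcpy(&u, &v, sizeof u);
    u ^= (uint32_t)1 << bit;   /* bit error */
    memcpy(&v, &u, sizeof v);
    return v;
}

float flip_word(float v)
{
    uint32_t u;
    memcpy(&u, &v, sizeof u);
    u = ~u;                    /* word error: flip every bit */
    memcpy(&v, &u, sizeof v);
    return v;
}
```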
The effects of finite precision arithmetic are relevant when fault detection is accomplished by checking whether the computational results meet a certain property, as in the case of the Markov algorithm. This is because the quantities to be compared are obtained through different computational paths with different roundoff errors. When comparisons are used for fault detection, as with the Laplace algorithm, roundoffs are the same, because the different sets of data that are compared are obtained through equal computational paths.

6.2 Results of the Laplace Algorithm

We implemented the Laplace algorithm in two versions: a normal implementation and a fault-tolerant implementation.

The performance degradation introduced by the fault-tolerant schemes was measured and the results are summarized in Figure 8. Experiments were run for different grid sizes and different numbers of processors. In this figure, a problem size of value n represents a square grid with n² points. The performance degradation depicted in the figure was obtained by comparing a fault-tolerant version of the algorithm, with a data partition as in Fig. 3, with the normal version, which used a data partition as in Fig. 2. It is clear from Fig. 8 that, as the grid size increases, the ratio between the execution time needed for implementing fault tolerance and that needed for the normal execution of the algorithm decreases. For grid sizes greater than 96 the time overhead will be less than 15%. This is in fact an attractive result, because grid sizes for most significant problems are much larger than that. The decrease in the overall time overhead for larger grid sizes is due to the fact that for larger grid sizes the time overhead due to process synchronization increases and supersedes the time overhead due to cache coherence traffic (which is larger for a chessboard partition than for the normal grid partition). The amount of process communication in fault-free situations in the fault-tolerant implementation of the algorithm will be less than or equal to the amount of communication in the normal implementation, because in the fault-tolerant version processes communicate only with processes running in processors of the same cluster.

The overhead added by the diagnosis and recovery procedures in order to recover from a detected fault was found to be less than the time for a complete iteration. This is a very affordable price if we consider that actual iterative problems execute thousands or even millions of iterations before convergence is achieved.

Due to the fact that the error checks here are done by comparing equal quantities computed by homogeneous processors with equal computation paths, finite arithmetic errors do not interfere with the checking process, and 100% fault coverage was obtained. This is true for bit or word errors, transient or permanent.

6.3 Results of the Markov Algorithm

We implemented two versions of the Markov algorithm: the normal version and the fault-tolerant one. We conducted experiments with two different classes of problems. In the first class, the number of elements of vector π is equal to the number of processors; consequently, each processor updates one element of the output vector. For this problem class we measured the fault coverage and the performance degradation of the fault-tolerant schemes. In the second problem class, the number of elements of vector π is much larger than the number of processors; consequently, each processor updates many elements of the output vector. For this second problem class we implemented the fault detection procedure and measured the performance degradation of the fault-tolerant schemes in fault-free conditions.

The fault coverage measurements for the one vector element per processor case are summarized in Table 1. We worked with four different data sets for four different sets of processors. Since the fault detection procedure of this algorithm compares two quantities that are calculated through different data paths, roundoff errors are important.

The error checking is done by subtracting the two quantities and comparing the absolute value of the result of this subtraction to a certain δ which accounts for roundoff errors. If δ is too big, false alarms will happen; that is, correct computations will be thought of as erroneous. If this delta is too small, many errors will not be detected. The solution we employed was to experimentally determine, for each data set, the minimum value of δ that causes no errors in a fault-free execution of the algorithm. This δ represents the maximum value of the finite arithmetic error for that fixed data set and problem size. Using this value for δ, we measured the fault coverage of the fault-tolerant schemes.
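A sketch of this calibration in C (our reconstruction of the procedure just described; the names are ours): run the algorithm fault-free and record the largest deviation of the check quantity from its ideal value.

```c
#include <math.h>

/* Experimental calibration of delta: iterate pi <- pi * P with no
 * injected faults and return the largest observed deviation of
 * sum(pi) from 1, which bounds the finite-arithmetic error for
 * this data set and problem size. */
double calibrate_delta(int n, const double P[n][n],
                       double pi[n], int iterations)
{
    double delta = 0.0;
    for (int k = 0; k < iterations; k++) {
        double next[n], sum = 0.0;
        for (int j = 0; j < n; j++) {
            next[j] = 0.0;
            for (int i = 0; i < n; i++)
                next[j] += pi[i] * P[i][j];
            sum += next[j];
        }
        double dev = fabs(sum - 1.0);
        if (dev > delta) delta = dev;
        for (int j = 0; j < n; j++) pi[j] = next[j];
    }
    return delta;
}
```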
[Table 1: Percent of fault (error) coverage for the fault-tolerant Markov algorithm (one vector element per processor case).]

We basically found three types of situations: faults that were detected and, therefore, recoverable; faults that were not detected but did not cause the algorithm to deliver erroneous results; and nondetectable faults that caused the algorithm to deliver erroneous results. The percent of detected faults is given in Table 1. It is noticeable that all faults causing word errors were covered. Most of the faults that were not detected were simulated by errors in the lower order bits of the floating point representation and caused no harm to the final outcome of the algorithm. A few nondetectable faults caused the algorithm to be in an oscillatory mode, not achieving convergence. These were faults in the multiplier simulated by permanent bit errors in data sets 1 and 2. The total percentage of these nondetectable faults for each of the mentioned data sets was 0.39%. It is clear from these results that the fault coverage of the fault-tolerant schemes is, for practical considerations, very effective.
The next round of experiments aimed at investigating the performance degradation caused by the fault-tolerant schemes. The problem size was given by the number of elements of the output vector π, which is also the order of the square matrix P, the transition probability matrix. Figure 9 shows the results for the case of one vector element per processor. We can observe that as the problem size increases the overhead decreases. In this case the ratio between fault-tolerant computations and normal computations is constant. The reason for this behavior is that the time for synchronization here is considerable as compared to the computation time, so as the number of processors increases the synchronization time grows. Since the time for synchronization is equal for the normal and fault-tolerant versions of the algorithm, the ratio between the computation time for fault tolerance and the total execution time of the normal version of the algorithm decreases as the number of processors increases. As can be seen in Figure 9, the performance degradation was as low as 2.43% in the case of 12 processors.

[Figure 9: Plot of performance degradation as a function of the problem size for the fault-tolerant Markov algorithm.]

Again, the overhead added by the diagnosis and recovery procedures in order to recover from a detected fault was found to be less than the time for a complete iteration.

The results for the case of larger order problems are summarized in Figure 10. Here again the overhead due to fault tolerance decreases as the problem size increases. Synchronization time does affect how the overhead varies. This is because, in this case, computation time is much larger than synchronization time. The amount of computation time due to fault detection is constant for a given problem size. However, as the number of processors increases, the amount of time for normal computations decreases. This explains why, for a fixed problem size, the overhead is larger as the number of processors is larger. For different problem sizes and an equal number of processors it is easy to see that the ratio between computation time for fault tolerance and normal computation time decreases as the problem size increases.

In summary, the fault-tolerant execution of the Markov algorithm was shown to cause low performance degradation. For small problem sizes the overhead was less than 7%, and less than 13% for larger problem sizes.

7 Conclusions

We have presented a new approach to algorithmic fault tolerance that relies on natural problem redundancy and allows the utilization of low-cost recovery schemes, such as forward recovery, that can tolerate single processor faults. In order for our method to be applicable to a particular algorithm, this algorithm must be naturally redundant and the components of its output vector must be computed independently.

We have applied our method to iterative numerical algorithms which satisfy those conditions of applicability. The implementation was realized on a shared memory architecture (a Sequent Symmetry 12-processor computer). Our experimental results demonstrated that the utilization of the proposed technique causes very low performance degradation in fault-free situations. For significant problem instances the time overhead relative to the non-fault-tolerant version of the algorithm was shown to be less than 15%, and as low as 2.43% in some cases. When a fault does occur, the additional computational time needed for tolerating it is no more than the execution time of a single iteration. Furthermore, the fault coverage provided by the fault detection scheme used with the Laplace algorithm was shown to be 100% for the considered class of faults. In the case of the Markov algorithm the fault coverage was close to 100% for faults producing measurable errors in the final output. Our method could also be implemented in a distributed environment with some simple modifications; the fault coverage would be larger than in the shared memory implementation, and the performance degradation incurred by the fault diagnosis and fault recovery procedures would be higher (see Section 3).

The outstanding advantages of our fault-tolerant approach are that it requires no hardware replication and causes very small execution time overhead. A clear disadvantage is that it is application dependent. In view of the consensus being reached by the fault tolerance research community that we need fault-tolerant techniques at every level of the system hierarchy, our approach can be considered complementary to other methods. It seems to be a very cost-effective application-level fault-tolerant scheme, especially for iterative algorithms.

References

[1] A. Mili, "Towards a Theory of Forward Recovery," IEEE Transactions on Software Engineering, vol. SE-11, no. 8, pp. 735-748, August 1985.

[2] R. Koo, S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23-31, January 1987.

[3] D. P. Siewiorek, R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.

[4] E. W. Dijkstra, "Self-Stabilizing Systems in Spite of Distributed Control," Communications of the ACM, vol. 17, no. 11, pp. 643-644, November 1974.

[5] F. B. Bastani, I. Yen, I. Chen, "A Class of Inherently Fault-Tolerant Distributed Programs," IEEE Transactions on Software Engineering, vol. 14, no. 10, pp. 1432-1442, October 1988.

[6] K. H. Huang, J. A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518-528, June 1984.

[7] F. Cristian, H. Aghili, R. Strong, D. Dolev, "Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement," 15th Int. Conference on Fault-Tolerant Computing, 1985.

[8] D. P. Bertsekas, J. N. Tsitsiklis, Parallel and Distributed Computation, Prentice-Hall, Englewood Cliffs, 1989.

[9] L. A. Laranjeira, M. Malek, R. Jenevein, "Naturally Redundant Algorithms," Technical Report, Department of Electrical and Computer Engineering, The University of Texas at Austin, February 1991.

[10] J. Kuhl, S. Reddy, "Distributed Fault-Tolerance for Large Multiprocessor Systems," Proc. Seventh Annual Symposium on Computer Architecture, pp. 23-30, 1980.

[11] J. H. Patel, L. Y. Fung, "Concurrent Error Detection in ALUs by Recomputing with Shifted Operands," IEEE Transactions on Computers, vol. C-31, no. 7, pp. 589-595, July 1982.

