currences of failures, and consequently of faults, as well as provides the potential for tolerating them. In the body of this paper we will define Naturally Redundant Algorithms and show that they are well suited to exploit this potential.

We illustrate the application of our method with two iterative synchronous algorithms: the solution of Laplace equations by Jacobi's method and the computation of the invariant distribution of a Markov chain. The experimental results show a performance degradation of less than 15%, and as low as 2.43% in some cases, in the fault-free execution of the algorithms for significant problem sizes. When a fault occurs it is located and recovered from with a performance penalty no larger than the execution time of a single iteration. The target architecture for our experiments was a Sequent Symmetry multiprocessor (see Section 3) in which we considered that the bus is fault-free and that memory faults are taken care of by error detecting/correcting codes. Our method would also be amenable to implementation in a distributed environment with some inexpensive modifications. In such a case the fault coverage would be larger and the performance degradation higher when recovering from faults (see Section 3).

A clear disadvantage of our method is that it is application-dependent rather than general. We also assume that the software is correct; that is, the proposed technique aims to tolerate processor faults. Even though single faults are the ones that can always be tolerated with our method, in some cases multiple faults could also be covered (see Section 4.3).

In the next section we state some definitions, and Section 3 describes the target architecture and the utilized fault model. Sections 4 and 5 detail how natural redundancy was exploited to provide fault tolerance in our examples. Section 6 presents the results of our experiments and Section 7 states our conclusions.

2 Naturally Redundant and Fault-Tolerant Algorithms

In this section we would like to give definitions that will clarify some concepts we will work with throughout the rest of the paper.

Definition 1: If a given algorithm A maps an input vector X = (x_1, x_2, ..., x_n) to an output vector Y = (y_1, y_2, ..., y_m), and the redundancy relation {∀ y_i ∈ Y, ∃ F_i such that y_i = F_i(Y − {y_i})} holds, then A is called a Naturally Redundant Algorithm. Each x_i (y_i) may be either a single component of the input (output) or a subvector of components.

From this definition we can see that a naturally redundant algorithm running on a processor architecture P has at least the potential to restore the correct value of any single erroneous component y_i in its output vector. This will be the case when each F_i is a function of every y_j, j ≠ i. If each F_i is a function of only a subset of the components of Y − {y_i}, then the algorithm would potentially be able to recover more than one erroneous y_i.

In the parallel execution of many applications, processors communicate their intermediate calculation values to other processors as the computation proceeds. In such cases, the erroneous intermediate calculations of a faulty processor can corrupt subsequent computations of other processors. It is thus desirable that the correct intermediate calculations can be recovered before they are communicated to other processors. This motivates the definition of algorithms which can be divided into phases which are themselves naturally redundant.

Definition 2: An algorithm A is called a phase-wise naturally redundant algorithm if (a) Algorithm A can be divided into phases such that the output vector of one phase is the input vector for the following phase; (b) The output vector of each phase satisfies the redundancy relation.

In this paper we focus our attention on phase-wise naturally redundant algorithms. In order to use natural redundancy for achieving fault tolerance we will utilize mappings to a multiprocessor architecture such that, in each phase, the components of the phase output vector will be computed independently (by different processors).

According to Mili in [1], a correct intermediate state of a computation of an algorithm can be strictly correct, loosely correct or specification-wise correct. Correspondingly, an algorithm can be naturally redundant in a strict, loose or specification-wise sense depending on whether the value of a component of a phase output vector, as calculated by the redundancy relation, is strictly, loosely or specification-wise correct. The value of a component of a phase output vector calculated by the redundancy relation is strictly correct if it is exactly equal to the value (correctly) calculated by the algorithm. It is loosely correct if it is not equal to the value calculated by the algorithm but its utilization in subsequent calculations will still lead to the expected results (those that would be achieved if only strictly correct values were utilized). Finally, it is specification-wise correct if it is not equal to the value computed by the algorithm and its further utilization does not lead to the expected results, but to results that also satisfy system specifications.

Of the two examples presented in this paper, the algorithm for Laplace equations is naturally redundant in a loose sense and the algorithm for the Markov chain calculation is naturally redundant in a strict sense. Natural redundancy allows for a forward recovery approach, since there is no need of backtracking the computation in order to restore a correct value of an erroneous output vector component.

A naturally redundant algorithm can be made fault-tolerant by adding to it specific functionality to detect, locate and recover from faults utilizing its natural redundancy.
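As a concrete illustration of Definition 1 (not necessarily the exact relation used in our examples), consider an algorithm whose output vector is a probability distribution, so that its components sum to 1, as does the invariant distribution computed in the Markov example. Then F_i(Y − {y_i}) = 1 − Σ_{j≠i} y_j is a redundancy relation, and a single erroneous component can be restored from the remaining ones. The C sketch below shows this recovery step; the identifiers are ours and purely illustrative.

#include <stdio.h>

/* Restore component y[bad] of a vector whose components are known to
 * sum to 1, using the redundancy relation
 *   F_i(Y - {y_i}) = 1 - (sum of the remaining components).            */
static void recover_component(double y[], int m, int bad)
{
    double sum = 0.0;
    for (int j = 0; j < m; j++)
        if (j != bad)
            sum += y[j];
    y[bad] = 1.0 - sum;              /* forward recovery: no rollback needed */
}

int main(void)
{
    double y[4] = { 0.10, 0.25, 0.45, 0.20 };  /* correct output vector    */
    y[2] = 9.99;                               /* corrupt one component    */
    recover_component(y, 4, 2);                /* rebuild it from the rest */
    printf("y[2] = %g\n", y[2]);               /* prints y[2] = 0.45       */
    return 0;
}

Because every F_i in this example depends on all of the other components, only one erroneous component per phase can be restored this way, which matches the observation above.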
3 Target Architecture and Fault Model
Let us consider Φ^k the vector composed of the values of all grid points φ^k_{x,y} after the kth iteration. In general, the sequence defined by {Φ^k}, k ≥ 0, will converge to a solution Φ* = (φ*_{1,1}, φ*_{1,2}, ..., φ*_{n,n}). In practice, however, one cannot obtain the exact final solution Φ* due to computer finite wordlength limitations. A convergence criterion is then established which is defined by an approximation factor ε. The execution of the algorithm should stop after the kth iteration if |φ^k_{x,y} − φ^{k−1}_{x,y}| ≤ ε, 1 ≤ x, y ≤ n. As the val-
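A minimal C sketch of one Jacobi sweep with such a convergence test is given below. It assumes a square interior grid with fixed boundary values and two arrays, one holding the previous iterate and one receiving the new one; the array names and the sequential layout are illustrative and do not reflect the data partition used in our parallel implementation.

#include <math.h>

#define N 8   /* interior grid is N x N; rows/columns 0 and N+1 hold boundary values */

/* One Jacobi sweep: compute phi_new from phi_old and return 1 if the
 * convergence criterion |phi_new - phi_old| <= eps holds at every point. */
static int jacobi_sweep(double phi_old[N+2][N+2], double phi_new[N+2][N+2],
                        double eps)
{
    int converged = 1;
    for (int x = 1; x <= N; x++) {
        for (int y = 1; y <= N; y++) {
            phi_new[x][y] = 0.25 * (phi_old[x-1][y] + phi_old[x+1][y] +
                                    phi_old[x][y-1] + phi_old[x][y+1]);
            if (fabs(phi_new[x][y] - phi_old[x][y]) > eps)
                converged = 0;
        }
    }
    return converged;
}

In the parallel versions each processor applies this update only to its own partition of the grid, exchanging border rows with the neighboring processors between iterations.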
Figure 5: How a processor can calculate an extra row of elements for testing purposes. (Labels in the figure: "Two rows obtained from neighboring partition"; "Extra row calculated for testing".)
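The caption of Figure 5 summarizes the detection idea used for the Laplace algorithm: with the two rows it already receives from the neighboring partition, a processor can recompute one row that the neighbor also computes and compare the two results, so that a disagreement exposes a fault. A rough C sketch of such a check follows; the row layout, the names, and the way the neighbor's freshly computed row is obtained are assumptions made only for illustration.

#define NCOLS 8   /* illustrative row length */

/* Recompute the neighboring partition's border row from the two old rows
 * received from that partition (above2, above) and our own old top row
 * (below), then compare it with the row the neighbor actually produced.
 * Returns 1 if a mismatch, i.e. a suspected fault, is found.             */
static int check_neighbor_row(const double above2[NCOLS],
                              const double above[NCOLS],
                              const double below[NCOLS],
                              const double neighbor_row[NCOLS])
{
    for (int y = 1; y < NCOLS - 1; y++) {
        double expected = 0.25 * (above2[y] + below[y] +
                                  above[y-1] + above[y+1]);
        /* Both values come from identical computation paths, so an exact
         * comparison is meaningful (see Section 6).                      */
        if (expected != neighbor_row[y])
            return 1;
    }
    return 0;
}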
(Figure: processors grouped into two clusters, labeled Cluster A and Cluster B.)
faults can be made by allowing a processor t o recover
from a maximum number of transient faults during
the course of one computation. If the same processor
is found faulty more than this maximum number of
times in the same computation, it is then considered
permanently faulty.
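A minimal sketch of this bookkeeping is shown below; the constant MAX_TRANSIENT and the counter array are ours, chosen only to illustrate the policy.

#define NPROCS        12   /* processors in the configuration (Section 6)      */
#define MAX_TRANSIENT  3   /* illustrative limit on recoveries per computation  */

static int fault_count[NPROCS];   /* faults charged to each processor so far    */

/* Record a detected fault on processor p; returns 1 when p has exceeded the
 * allowed number of transient faults and must be treated as permanently
 * faulty for the rest of the computation.                                      */
static int record_fault(int p)
{
    return ++fault_count[p] > MAX_TRANSIENT;
}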
fault detection, which represent the real overhead in
fault-free situations, are accomplished using the nat-
ural redundancy of the problem. Because of that, the
execution time overhead of this algorithm due to the
addition of fault tolerance is less than in the previous
example as shown by the results of the next section.
6 Experimental Results
6.1 Testbed Description and Discussion
of Main Issues
We ran our experiments on a Sequent Symmetry
bus-based MIMD architecture with a configuration of
12 processors. We utilized the programming language
C with a library of parallel extensions called PPL (Par-
allel Programming Language) and single-precision floating-point arithmetic in the implementations.
Fault insertion was simulated by C statements introducing erroneous data at specific points in the computation paths. A bit error was introduced by flipping a single bit, and a word error was introduced by flipping all the bits in a data word. We considered a fault to be permanent if it lasts more than three iterations.
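A sketch of how such injections can be expressed in C is shown below. This is our own illustrative helper, not the code used in the experiments; it flips either one bit or every bit of the word underlying a single-precision value, corresponding to the bit-error and word-error cases described above.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Flip bit `pos` of a single-precision value (bit error), or all of its
 * bits when pos < 0 (word error), by manipulating the underlying word.   */
static float inject_fault(float v, int pos)
{
    uint32_t w;
    memcpy(&w, &v, sizeof w);        /* reinterpret the float as a 32-bit word */
    if (pos < 0)
        w = ~w;                      /* word error: flip every bit             */
    else
        w ^= (uint32_t)1 << pos;     /* bit error: flip one bit                */
    memcpy(&v, &w, sizeof v);
    return v;
}

int main(void)
{
    printf("%g\n", inject_fault(1.0f, 23));  /* bit error in the representation of 1.0 */
    printf("%g\n", inject_fault(1.0f, -1));  /* word error                             */
    return 0;
}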
The effects of finite precision arithmetic are relevant when fault detection is accomplished by checking if the computational results meet a certain property, as in the case of the Markov algorithm. This is because the quantities to be compared are obtained through different computational paths with different roundoff errors. When comparisons are used for fault detection, as with the Laplace algorithm, roundoffs are the same because the different sets of data that are compared are obtained through equal computational paths.

6.2 Results of the Laplace Algorithm

We implemented the Laplace algorithm in two versions: a normal implementation and a fault-tolerant implementation.

The performance degradation introduced by the fault-tolerant schemes was measured and the results are summarized in Figure 8. Experiments were run for different grid sizes and different numbers of processors. In this figure, a problem size of value n represents a square grid with n^2 points. The performance degradation depicted in the figure was obtained by comparing a fault-tolerant version of the algorithm, with a data partition as in Fig. 3, with the normal version, which used a data partition as in Fig. 2. It is clear from Fig. 8 that, as the grid size increases, the ratio between the execution time needed for implementing fault tolerance and that needed for the normal execution of the algorithm decreases. For grid sizes greater than 96 the time overhead will be less than 15%. This is in fact an attractive result because grid sizes for most significant problems are much larger than that. The decrease in the overall time overhead for larger grid sizes is due to the fact that for larger grid sizes the time overhead due to process synchronization increases and supersedes the time overhead due to cache coherence traffic (which is larger for a chessboard partition than for the normal grid partition). The amount of process communication in fault-free situations in the fault-tolerant implementation of the algorithm will be less than or equal to the amount of communication in the normal implementation, because in the fault-tolerant version processes communicate only with processes running in processors of the same cluster.

The overhead added by the diagnosis and recovery procedures in order to recover from a detected fault was found to be less than the time for a complete iteration. This is a very affordable price if we consider that actual iterative problems execute thousands or even millions of iterations before convergence is achieved.

Due to the fact that the error checks here are done by comparing equal quantities computed by homogeneous processors with equal computation paths, finite arithmetic errors do not interfere with the checking process and a 100% fault coverage was obtained. This is true for bit or word errors, transient or permanent.

6.3 Results of the Markov Algorithm

We implemented two versions of the Markov algorithm: the normal version and the fault-tolerant one. We conducted experiments with two different classes of problems. In the first class, the number of elements of the output vector π is equal to the number of processors. Consequently, each processor updates one element of the output vector. For this problem class we measured the fault coverage and the performance degradation of the fault-tolerant schemes. In the second problem class, the number of elements of vector π is much larger than the number of processors. Consequently, each processor updates many elements of the output vector. For this second problem class we implemented the fault detection procedure and measured the performance degradation of the fault-tolerant schemes in fault-free conditions.

The fault coverage measurements for the one vector element per processor case are summarized in Table 1. We worked with four different data sets for four different sets of processors. Since the fault detection procedure of this algorithm compares two quantities that are calculated through different data paths, roundoff errors are important. The error checking is
done by subtracting the two quantities and comparing the absolute value of the result of this subtraction to a certain δ which accounts for roundoff errors. If δ is too small, false alarms will happen; that is, correct computations will be thought of as erroneous. If δ is too big, many errors will not be detected. The solution we employed was to experimentally determine, for each data set, the minimum value of δ that causes no false alarms in a fault-free execution of the algorithm. This δ represents the maximum value of the finite arithmetic error for that fixed data set and problem size. Using this value for δ, we measured the fault coverage of the fault-tolerant schemes.
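A compact C sketch of this check and of the calibration of δ is given below; the function and variable names are ours. The value recorded by observe_fault_free over fault-free runs is the largest roundoff difference seen, and the smallest usable δ lies just above it, exactly as described above.

#include <math.h>

/* Error check used when the two compared quantities come from different
 * computational paths: flag a fault only if they differ by more than a
 * tolerance delta that accounts for roundoff.                            */
static int markov_check(double a, double b, double delta)
{
    return fabs(a - b) > delta;
}

/* Calibration: during fault-free runs, remember the largest observed
 * difference between the two quantities.                                 */
static double max_fault_free_diff = 0.0;

static void observe_fault_free(double a, double b)
{
    double d = fabs(a - b);
    if (d > max_fault_free_diff)
        max_fault_free_diff = d;
}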
We basically found three types of situations: faults that were detected and, therefore, recoverable; faults that were not detected but did not cause the algorithm to deliver erroneous results; and nondetectable faults that caused the algorithm to deliver erroneous results. The percent of detected faults is given in Table 1. It is noticeable that all faults causing word errors were covered. Most of the faults that were not detected were simulated by errors in the lower order bits of the floating point representation and caused no harm to the final outcome of the algorithm. A few nondetectable faults caused the algorithm to enter an oscillatory mode, not achieving convergence. These were faults in the multiplier simulated by permanent bit errors in data sets 1 and 2. The total percentage of these nondetectable faults for each of the mentioned data sets was 0.39%. It is clear from these results that the fault coverage of the fault-tolerant schemes is, for practical purposes, very effective.

Table 1: Percent of fault (error) coverage for the fault-tolerant Markov algorithm (one vector element per processor case).
The next round of experiments aimed at investigating the performance degradation caused by the fault-tolerant schemes. The problem size was given by the number of elements of the output vector π, which is also the order of the square matrix P, the transition probability matrix. Figure 9 shows the results for the case of one vector element per processor. We can observe that as the problem size increases the overhead decreases. In this case the ratio between fault-tolerant computations and normal computations is constant. The reason for this behavior is that the time for synchronization here is considerable compared to the computation time, so, as the number of processors increases, the synchronization time grows. Since the time for synchronization is equal for the normal and fault-tolerant versions of the algorithm, the ratio between the computation time for fault tolerance and the total execution time for the normal version of the algorithm decreases as the number of processors increases. As can be seen in Figure 9, the performance degradation was as low as 2.43% in the case of 12 processors.

Figure 9: Plot of performance degradation as a function of the problem size for the fault-tolerant Markov algorithm.

Again, the overhead added by the diagnosis and recovery procedures in order to recover from a detected fault was found to be less than the time for a complete iteration.

The results for the case of larger order problems are summarized in Figure 10. Here again the overhead due to fault tolerance decreases as the problem size increases. Synchronization time does not affect how the overhead varies, because in this case the computation time is much larger than the synchronization time. The amount of computation time due to fault detection is constant for a given problem size. However, as the number of processors increases, the amount of time for normal computations decreases. This explains why, for a fixed problem size, the overhead is larger as the number of processors is larger. For different problem sizes and an equal number of processors it is easy to see that the ratio between computation time for fault tolerance and normal computation time decreases as the problem size increases.

In summary, the fault-tolerant execution of the Markov algorithm was shown to cause low performance degradation. For small problem sizes the overhead was less than 7%, and less than 13% for larger problem sizes.
to other methods. It seems to be a very cost-effective, application-level fault-tolerant scheme, especially for iterative algorithms.
References
[1] A. Mili, “Towards a Theory of Forward Recovery,” IEEE Transactions on Software Engineering, vol. SE-11, no. 8, pp. 735-748, August 1985.

[2] R. Koo, S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems,” IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23-31, January 1987.

[3] D. P. Siewiorek, R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.