NASA/TM-2008-215108

A Primer on Architectural Level Fault Tolerance
Ricky W. Butler
Langley Research Center, Hampton, Virginia
February 2008
The use of trademarks or names of manufacturers in this report is for accurate reporting and does not
constitute an official endorsement, either expressed or implied, of such products or manufacturers by the
National Aeronautics and Space Administration.
Available from:

NASA Center for AeroSpace Information (CASI)
7115 Standard Drive
Hanover, MD 21076-1320
(301) 621-0390

National Technical Information Service (NTIS)
5285 Port Royal Road
Springfield, VA 22161-2171
(703) 605-6000
Table of Contents

1  Introduction
2  Basics
3  Faults and Failures
   3.1  Faults
   3.2  Errors
   3.3  Fault Tolerance Mechanisms
   3.4  Fault Containment Regions
   3.5  Design for Minimum Risk
4  Watch-dog Timers
5  Voting
6  Interactive Consistency
7  Clock Synchronization
   7.1  Impact of Asymmetric Failures
   7.2  Fault-tolerant Clock Synchronization Algorithms
   7.3  Application-Level Reference Time
8  Diagnosing Failed Components
   8.1  Detection Using Exact-Match Voters
   8.2  Detection Using Thresholds
   8.3  Detection Using Built-in-Test (BIT)
9  Real Time Operating Systems and Fault Tolerance
10 Reconfiguration
11 Transient Faults
   11.1  Distinguishing Transient Faults from Permanent Faults
   11.2  Transient Fault Recovery
12 Self-Checking Pairs
13 Bus Guardian Units
14 Integrated Modular Avionics (IMA)
   14.1  ARINC 653
15 Protecting a System from Common Cause Faults
   15.1  Types of Common Cause Faults
   15.2  Software Common Cause
   15.3  Design Errors in Hardware
   15.4  Radiation Induced Common Cause Failure
   15.5  Other Common Cause Faults
   15.6  Functional-level Dissimilar Backup System
   15.7  Common Cause Failure and Integrated Modular Avionics
16 Re-Usable Fault Tolerance and System Layering
   16.1  Asynchronous Flight Control Systems
   16.2  Synchronous Fault-Tolerant Systems
   16.3  Maintaining Independence between the Applications and the Fault-Tolerant Mechanisms
   16.4  Technology Obsolescence and Refresh
17 Reliability Analysis
   17.1  Markov Models
   17.2  Solution of a Markov Model
   17.3  The Impact of Long Mission Times
   17.4  Beware of the Ambiguous Term "coverage"
18 The Synergism Between Formal Verification and Reliability Analysis
19 Function Migration
20 Vehicle Health Management
   20.1  Basic Concepts
   20.2  Failure Modes and Effects Analysis (FMEA)
   20.3  Sensor Fault Tolerance
21 Concluding Remarks
22 Glossary
23 References
1 Introduction
Fault Tolerance is a deep subject with hundreds of sub-topics. It is often
difficult to know where to begin the study of this vast subject. The purpose of this
paper is to illustrate the key issues in architectural-level fault tolerance by way of
example. The main objective is to explain the rationale and identify the trade-offs
between the variety of techniques that are used to achieve fault tolerance. The
primer focuses on high-level fault tolerance concepts (i.e. architectural) rather
than low-level mechanisms such as Hamming codes or protocols used for
communication. For information about the latter, the reader is referred to [Pradhan86].
2 Basics
Fault Tolerance is founded on redundancy. If we have two or more
identical components we can ignore the faulty component or switch to a spare if
the primary fails. Of course this assumes that we know when the failure occurs.
Some failures are easy to detect, e.g. the device just stops working. Other
failures are not, e.g., the device continues to work but produces incorrect results.
So immediately we are confronted with one of the reasons that the fault tolerance
field is broad—systems are designed to handle different kinds of failures. Some
systems are designed to handle fail-stop faults. Others are designed to handle
any kind of fault. Still others are designed to handle faults that can be detected
via a diagnostic program. Sometimes the faults are always assumed to be the
manifestation of a physical disruption, while other system designs seek to survive
logical errors as well. There are many possibilities. Many designs seek to
survive the class of faults that are assumed to be common and provide little or no
capability against what are assumed to be less common failures. Ideally, the set
of faults handled by a system is delineated in a well-specified fault model.
3 Faults and Failures

3.1 Faults

A fault model consists of assumptions about how components of the system can fail. The following types of faults are often considered:
• Fail stop (or fail silent) – the component stops producing outputs when it fails
• Fail benign – the component's failure is recognized by all non-faulty components
• Fail symmetric – the fault results in the same erroneous value being sent to all other replicates
• Fail asymmetric (Byzantine) – the fault results in different erroneous values being sent to some of the other replicates
See [Thambidurai88] for more details about this classification.
Faults are also classified by duration: a permanent fault persists until the faulty component is repaired or removed, whereas a transient fault appears for a short time and then disappears. However, it should be noted that even though a fault is transient, it can still produce a permanent error in the system state if the system is not designed to handle transients.
Some defects or events can trigger multiple simultaneous errors. This class of faults is referred to as common cause faults (CCF); if not mitigated, they may overcome all of the available redundancy and hence cause system failure. Sources of common cause faults are many and varied and require special consideration in the design of a fault-tolerant system. The following is a partial list of common cause faults:
• Design flaw (i.e. bug) in software
• Hardware design errors (e.g., logical error in a processor)
• Compiler error
• Manufacturing defects
• Environmentally induced effects (e.g. MMOD, lightning, EMI, launch shock/vibrations)
3.2 Errors
The following definition is taken from the fault-tolerance literature:
Since a service is a sequence of the system’s external states, a service
failure means that at least one (or more) external state of the system
deviates from the correct service state. The deviation is called an error…
In most cases, a fault first causes an error in the service state of a
component that is a part of the internal state of the system and the
external state is not immediately affected. For this reason, the definition of
an error is the part of the total state of the system that may lead to its
subsequent service failure. It is important to note that many errors do not
reach the system’s external state and cause a failure. A fault is active
when it causes an error, otherwise it is dormant.
An error is detected if its presence is noted by the system via an error message
or error signal. Errors that are present but not detected are latent errors.
Note: a fault is usually defined fairly broadly as a defect within the system. This
definition includes “software bugs” in addition to physical failures such as a
memory “stuck-at-1” fault. It also includes the bit flips induced by noise in
communications systems. Because the techniques used to detect and remove
software bugs (e.g. logical mistakes) are very different from the techniques used
to handle physical faults, it is very important to distinguish situations where one is
talking about physical faults from those where one is talking about “design errors”.
3.3 Fault Tolerance Mechanisms

The techniques used to achieve fault tolerance are a major focus of this paper.
It is very important to recognize that redundancy alone does not provide fault-
tolerance. There must be mechanisms that coordinate and control the
redundancy and make decisions and selections concerning the redundant
information. These mechanisms may be centralized or distributed, they may be
implemented in hardware or software, they may be simple or complex, but it is
absolutely essential that these mechanisms be designed correctly. If there is a
logical defect in the design of the redundancy management logic, system failure
can occur even when there is no physical failure in the system. For example,
most of the problems which were discovered in the AFTI F-16 flight tests at
Dryden were due to defects in the redundancy management logic [Mackall88].
3.4 Fault Containment Regions

Fault tolerant systems are often built around the concept of fault containment
regions (FCRs). The primary goal of a FCR is to limit the effects of a fault and
prevent the propagation of errors from one region of the system to another. A
FCR is a subsystem that will operate correctly regardless of any arbitrary fault
outside the region. FCRs must be physically separated, electrically isolated, and
have independent power supplies. Physical dispersion limits the effect of physical
phenomena such as the impact of a micrometeoroid. Electrical isolation protects
against fault propagation from lightning or other forms of static discharge. Power
supply isolation prevents a power failure from affecting the entire system. The number
of FCRs in a system is a primary factor in determining how many faults a system
can tolerate without failure.
3.5 Design for Minimum Risk

In some cases, hazards are controlled through specific design features (e.g. margin) rather than through fault tolerance; NASA refers to this approach as Design for Minimum Risk (DFMR). For NASA policy on DFMR, see:

http://nodis3.gsfc.nasa.gov/displayDir.cfm?Internal_ID=N_PR_8705_002A_&page_name=AppendixB
http://mmptdpublic.jsc.nasa.gov/mswg/Policies.htm#DFMR
4 Watch-dog Timers
A well-known type of system failure is the non-responsive or locked-up
system, which is sometimes referred to as the “blue screen of death” in desktop
computers. This is usually the result of a software bug that throws the system
into a non-operable state (e.g. an erroneous jump into the data space). When
this occurs in a desktop computer, a simple reboot usually suffices to recover the
system. But in some safety-critical systems, operator restart is not available or
appropriate (e.g. requires a response time that is beyond normal human
capability). In these systems, a watch-dog timer can be helpful.
The simplest approach is to have some watch-dog subsystem observe “I’m alive”
messages which are periodically produced by the primary. If these messages
disappear then the watch-dog system initiates some recovery action such as
reboot or rollback to a previous safe state. It is also possible to have the
watch-dog system turn over control to a backup system. There are several
issues here, but the most important is how to protect the system from failures in
the watch-dog subsystem.
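To make the mechanism concrete, the following sketch shows the core loop of a watch-dog monitor in C. It is an illustration only: now_ms, heartbeat_received, and initiate_recovery are hypothetical names for the platform's time source, the "I'm alive" channel, and the chosen recovery action.

    #include <stdint.h>
    #include <stdbool.h>

    #define TIMEOUT_MS 50  /* must exceed the heartbeat period plus jitter */

    extern uint32_t now_ms(void);          /* monotonic time source */
    extern bool heartbeat_received(void);  /* polls the "I'm alive" channel */
    extern void initiate_recovery(void);   /* reboot, rollback, or switch to backup */

    void watchdog_task(void)
    {
        uint32_t last_alive = now_ms();
        for (;;) {
            if (heartbeat_received())
                last_alive = now_ms();
            else if (now_ms() - last_alive > TIMEOUT_MS) {
                initiate_recovery();
                last_alive = now_ms();  /* restart the timeout after recovery */
            }
        }
    }

Note that the watch-dog itself must reside in a separate fault containment region and be far simpler than the primary; otherwise it merely relocates the problem.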
5 Voting
Fault tolerant systems are usually designed to handle more than just fail-
stop faults, so inevitably some form of voting must be employed. There are
many varieties, but there are roughly three basic categories of voting: exact
match, fault-tolerant average, and mechanical. In the exact match case all of the
replicates including their internal data are assumed to be identical so any bit in
an output that differs may be an indication of failure. One can simply select the
majority value and use that value for the system output. In averaging, the good
replicates are assumed to be “close” to each other but not identical. So the
medial values either come from non-faulty components or they are bounded by other good values¹. Either way, the selection of a median or an average of some middle values is enough to mask a faulty component.
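The two software voting styles can be sketched in a few lines of C; this is an illustration only, not any particular system's voter.

    #include <stdint.h>

    /* Exact-match majority vote on three bit-identical replicate outputs. */
    uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
    {
        if (a == b || a == c) return a;
        return b;  /* either b == c, or all three disagree (a detectable fault) */
    }

    /* Mid-value select on three approximately-equal values. */
    double midvalue3(double a, double b, double c)
    {
        if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
        if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
        return c;
    }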
Voting can also be used to detect or diagnose which component of the system is
faulty, but careful engineering is required to determine when a disagreement in a
vote may be used for the diagnosis of faults. This is discussed in Section 8
(Diagnosing Failed Components).
¹ By assuming that there is only one faulty component, we have two cases: (1) the faulty component is the middle value, in which case it is between two good values, so the faulty component is still producing an acceptable output, or (2) the faulty component is one of the extreme values, in which case the middle value came from a good component.
6 Interactive Consistency
A fundamental issue in a fault-tolerant system is how to distribute input data to
the replicates in a manner that preserves consistency. Because inputs begin as
single source data, we must be sure that the replicate consistency is not
destroyed by a failure in the distribution system.
Even if there are multiple sensors, the individual sensor values are distributed to each of the computers:

[Figure: each individual sensor value is distributed to every computer.]
The following example illustrates the problem with an asymmetric failure: a failing source sends the value 177 to one channel and the value 176 to another.
So after the failure one good channel will be operating on a value of 177 and
another on a value of 176. If exact-match voting is used on the outputs, then the
wrong processor will be identified as failed. If mid-value select voting is used on
the outputs, then there will be no immediate problem. However, if the
asymmetric fault persists over many iterations, the deviation between the two
good channels can continue to get larger and larger and eventually exceed the
threshold set for fault diagnosis. Once again the wrong processor can be
reconfigured out of the system. Recent results at NASA Langley (Paul Miner)
have shown that a two-stage mid-value select can be designed to achieve
Byzantine Agreement. However most asynchronous systems today have not
been designed using this particular strategy.
As shown above, some types of faults are more insidious and require more FCRs
to properly mask them. For example, it has been shown [Driscoll03] that
tolerating f Byzantine faults requires:
• 3f+1 Fault Containment Regions (FCRs)
• FCRs must be interconnected through 2f+1 disjoint paths
• Inputs must be exchanged f+1 times between FCRs
• FCRs must be synchronized to a bounded skew
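To illustrate the exchange requirement, the sketch below shows the classic f = 1 oral-messages exchange for four FCRs (one source and three receivers), with the message transport abstracted into arrays; the function name and data layout are invented for this example.

    #include <stdint.h>

    /* recv[i]     = value receiver i obtained from the source (round 1)
       relay[i][j] = value receiver i obtained from receiver j (round 2) */
    uint32_t om1_agree(int i, const uint32_t recv[3], const uint32_t relay[3][3])
    {
        uint32_t v[3];
        int k = 0;
        v[k++] = recv[i];                /* the directly received value */
        for (int j = 0; j < 3; j++)
            if (j != i)
                v[k++] = relay[i][j];    /* the values relayed by the others */
        /* majority of the three values held by receiver i */
        if (v[0] == v[1] || v[0] == v[2]) return v[0];
        return v[1];  /* v[1] == v[2], or no majority (a fixed default) */
    }

With at most one Byzantine-faulty FCR, every good receiver computes the same result, and if the source is good that result is the source's value.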
The beauty of modern fault-tolerance is that you don't have to replicate entire "strings" to increase your ability to withstand more types of faults. To withstand a single asymmetric fault, one needs four fault isolation regions, but they need not all be full channels. A very simple circuit, called an interstage, can provide the necessary functionality to implement critical Byzantine Agreement algorithms. Since the interstage is electrically isolated, it can serve as one of the fault isolation regions. It merely relays messages and does not participate in the voting steps that provide fault-tolerance in the FTP. This idea was first exploited in the Draper FTP computer system [Lala]:

[Figure: a triplex FTP consisting of three processors interconnected through three interstages.]
A triplex FTP contains six components, but only three of them are full processors.
The three interstages are simple circuits requiring approximately 50 gates. Thus,
the overall hardware complexity, and hence the fault-arrival rate, of a triplex FTP
is less than a quadruplex. So its cost and reliability are correspondingly better
than using six complete processors. Furthermore, a quadruplex FTP provides
fault tolerance comparable to a 5-plex, but with only four full processors.
Full protection from a Byzantine fault is thus achieved with only three processors and three interstages. The Draper FTPP built on this foundation and developed a parallel processing version [Harper91].
7 Clock Synchronization
It is usually advantageous to provide many layers of voting in a fault-tolerant
system and not just rely on a final force-sum voting at the actuators. If the system
is designed to perform a vote within the processors themselves, it is necessary that
there be some mechanism to synchronize the clocks of the system so that this
vote can be scheduled and accomplished. Fault tolerant systems that
synchronize the clocks on the redundant processors are called synchronous
fault-tolerant systems. Systems that do not synchronize the clocks are called
asynchronous fault-tolerant systems.
7.1 Impact of Asymmetric Failures

[Figure: clock time versus real time for three clocks. Clock 1 (fast) and Clock 2 (slow) are both non-faulty; the failed Clock 3 is seen differently by each of them. Re-sync times are marked on the real-time axis.]
Here both clocks 1 and 2 are non-faulty, but clock 1 is faster than real time
and clock 2 is slower than real time. When clock 3 fails it sends clock 1 a
value bigger than clock 1 and sends clock 2 a value smaller than clock 2.
Therefore both clock 1 and clock 2 select themselves as the mid value and
continue to drift apart. If clock 3 continues to do this, clocks 1 and 2 can drift
arbitrarily far apart even though they continue to execute the synchronization
algorithm.
7.2 Fault-tolerant Clock Synchronization Algorithms

A fault-tolerant clock synchronization algorithm maintains a bounded skew between all of the good clocks, i.e., there is a small bound δ such that |Ci(t) - Cj(t)| < δ for all non-faulty processors i and j², where Cj(t) denotes the clock time of processor j at real time t. Once you have this implemented in your system, the voting system can be reliably built on top of this as follows:
1. A vote is scheduled at a predetermined time (usually in a static schedule
table)
2. Let D = the maximum transport delay in sending a clock value from one
processor to another over the communication system³.
3. The vote is delayed until D + ε time units after the scheduled vote time. In
this way all of the good processors are guaranteed to have good values
from all of the good clocks.
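A sketch of this scheduling discipline in C follows; D_TICKS, EPSILON_TICKS, and the three extern functions are stand-ins for the platform's actual communication bound, skew bound, and services.

    #include <stdint.h>

    #define D_TICKS       40  /* maximum transport delay (assumed) */
    #define EPSILON_TICKS  5  /* guaranteed bound on clock skew (assumed) */

    extern uint64_t local_clock(void);         /* synchronized local clock */
    extern void broadcast_value(uint32_t v);
    extern uint32_t vote_collected_values(void);

    uint32_t scheduled_vote(uint64_t vote_time, uint32_t my_value)
    {
        while (local_clock() < vote_time)
            ;                                  /* wait for the scheduled slot */
        broadcast_value(my_value);

        /* every good value is present by vote_time + D + epsilon */
        while (local_clock() < vote_time + D_TICKS + EPSILON_TICKS)
            ;
        return vote_collected_values();
    }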
² The bound on the clock skew is determined by the variance of the communication times and not the mean. The mean can be subtracted out in the clock sync algorithms.
³ Note that this means that the communication system must have a maximum delay. If a databus is used it must have predictable behavior and bounded communication delays. Therefore fault-tolerant systems are developed using TDMA rather than CSMA/CD communication protocols.
7.3 Application-Level Reference Time
The NIST Internet Time Service can be used to synchronize the clock of any
computer connected to the Internet. However, due to the unreliability of the
internet, this is not a suitable candidate for a safety-critical system. For a safety-
critical system, the Global Positioning System (GPS), which is used for
navigation throughout the world, is more suitable. GPS signals are derived from
atomic clocks on the satellites so they can be used as a reference for time
synchronization and frequency calibration.
8 Diagnosing Failed Components

There are three basic methods for diagnosing when a component has failed:
• Using the discrepancies in the exact-match votes
• Thresholds on the mid-value select voting
• Using built-in test (BIT)
[Figure: redundant sensors feed redundant computers whose outputs pass through a voter on the way to the actuators.]
8.1 Detection Using Exact-Match Voters

If exact-match voting is used, any deviation of a replicate's output from the majority value is an indication of failure. But it should be noted that this strategy
depends critically upon the use of interactive consistency algorithms to properly
distribute single source data to the replicates. See Section 6 (Interactive
Consistency).
8.2 Detection Using Thresholds

If the lanes (i.e. FCRs) are asynchronous, then the inputs will be sampled at
slightly different times and so the memory states of the different lanes will not be
exactly the same. In this case the system must use a mid-value select or
averaging algorithm. Unfortunately, mid-value selection algorithms alone do not
provide a way to detect and isolate a faulty lane. To accomplish this, thresholds
must also be used. A threshold is the maximum amount of deviation from the
average value that is tolerated before a lane is declared to be faulty. The system
designer must set thresholds so that most failures are detected, but must be careful not to make them so tight as to cause excessive false alarms.
Consequently it can be very difficult to determine appropriate levels for the
thresholds and it requires extensive testing [Mackall88].
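The following sketch shows one plausible shape for such a detector, combining a threshold with a persistence count to suppress false alarms; the constants are illustrative and would have to be tuned by the testing described above.

    #define THRESHOLD   0.5  /* maximum tolerated deviation (illustrative) */
    #define PERSISTENCE 8    /* consecutive exceedances before declaring a fault */

    static int strikes[3];   /* per-lane exceedance counters */

    void check_lanes(const double out[3], double mid, int faulty[3])
    {
        for (int i = 0; i < 3; i++) {
            double dev = out[i] > mid ? out[i] - mid : mid - out[i];
            if (dev > THRESHOLD)
                strikes[i]++;
            else
                strikes[i] = 0;  /* reset on a good frame */
            faulty[i] = (strikes[i] >= PERSISTENCE);
        }
    }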
8.3 Detection Using Built-in-Test (BIT)

The ideal BIT would be one that could detect every possible fault, but 100%
coverage is rarely achievable in practice. It is not uncommon to see requirements
on the order of 98% coverage, that is, the BIT can detect 98% of all possible
faults. Most BIT mechanisms also produce some level of false alarms.
Because of the imperfection of BIT, it is not usually used as the first line of
defense in most safety-critical systems. Rather, it is used to augment the
functionality of the redundancy management system and aid in identifying a
faulty component. BIT is also extremely important for off-line maintenance and
repair.
9 Real time Operating Systems and Fault Tolerance
Traditional real time operating systems (RTOS) do not provide direct system
services that manage redundant processes. Nevertheless, custom software can
be developed that accomplishes this task. A primary goal of this custom
software is to hide the details of the process management and voting. The
application software should be designed as if there were only simplex versions of
the software. All of the details about replication and voting should be hidden
from the applications. Nevertheless, the applications will have visibility into
whether the simplex abstractions are preemptible or non-preemptible. If the
tasks are preemptible, then an additional challenge must be met, namely, how can one be assured that all of the critical tasks meet their hard real-time deadlines.
While this is the classic problem that RTOS’s solve, it should be noted that in this
situation the RTOS will be scheduling redundant tasks with a strict need to vote
their outputs at task completion. In the non-preemptible case, this problem is
usually solved by using a preplanned static table. This is what was done in
SAFEbus and TTP/C [Rushby01].
10 Reconfiguration
A system that can survive a single fault is sometimes referred to as one fault-
tolerant. A system that can survive two faults is referred to as two fault-tolerant⁴. A two fault-tolerant system is usually constructed using reconfiguration; however, a five-plex with five-way voting can mask two simultaneous faults⁵. A four-plex that reconfigures to a three-plex is an example of a two fault-tolerant system.
⁴ The N fault-tolerant characterization is very crude. The use of actual reliability numbers is much to be preferred. A one fault-tolerant system can be more reliable than a two fault-tolerant system if the latter has a higher processor failure rate.
⁵ This assumes that there are at least two additional fault containment regions for solving the interactive consistency problem in order to be sure that we are guaranteed to have replicate consistency in the 5-plex.
Because the reconfiguration process relies on fault-detection, it is usually
desirable to augment the detection by voting with built-in test. If the system
reliability requirements do not warrant reconfiguration, then built-in test may not
be needed, since the voters can provide the needed fault masking capability.
However, in a reconfigurable system it is desirable to minimize fault-latency (i.e.
the time period from fault arrival to its manifestation as an error) because while
the first fault is latent, a second fault may arrive in another component. The
simultaneous presence of two faulty components will defeat the typical 3-way
voter. The time that it takes for the system to detect, isolate and reconfigure a
faulty component directly impacts its reliability. Interestingly, a two-fault tolerant
system can actually have a lower reliability than a one-fault tolerant system, if its
reconfiguration time is poor or if its components have much higher failure rates.
Therefore it is desirable to specify the system fault tolerance requirements in
terms of a probability of system failure rather than as simply one or two-fault
tolerant.
It is important to note that the view of the working set must be consistent
throughout the system. This is sometimes referred to as the distributed
diagnosis problem. It is essential that one verifies that all good components
agree on who is in the working set. In system initialization or in recovery from a
massive transient upset, establishing this working set is the fundamental
challenge. If the system employs smart actuators, it is necessary that the
computer systems in these smart actuators know what is in the “working set” so
that the physical force-sum voters can know which inputs to ignore.
A system is N fault-tolerant if, after the occurrence of any N faults, it is still a functioning system. This definition implicitly assumes that there is
adequate time for the system to reconfigure from the first fault before the arrival
of the second fault. Without reconfiguration in a quad system, the arrival of the
second fault can create a 2-2 dilemma case that can lead directly to system
failure. For example, if the two good processors report that the Boolean variable
release_parachute is true, while the two faulty processors report that the
value is false, the voting logic could easily pick the wrong value.
11 Transient Faults
The reliability of a fault-tolerant system depends upon a reasonably fast detection
of faults to ensure that two faults are not active at the same time. The detection
and removal of faults is complicated by the presence of transient faults. Although
a transient fault can corrupt the internal state of the system and must be handled,
it does not permanently damage the hardware. Therefore the recovery
mechanism is different. You want to restore the state of the memory but you do
not want to remove hardware. So the redundancy management system must
make a judgment about whether a fault is permanent or transient.
11.1 Distinguishing Transient Faults from Permanent Faults

A common technique is to observe whether errors from a component persist over time. When using this technique, the designer has to decide on an appropriate
time period (or an error count level) before declaring a fault to be permanent
rather than transient. Surprisingly the reliability of a typical N-modular redundant
(NMR) system is not very sensitive to this parameter. Even a delay of a few
minutes does not degrade the system reliability [DiVito90].
11.2 Transient Fault Recovery

Even though a fault is transient, it can still corrupt memory. The process of
correcting memory errors is sometimes called scrubbing.
One approach is to use error-correcting codes (ECC), which take advantage of extra bits in the memory.⁶
Another approach is to rely on the reading of new inputs to replace corrupted
memory. Of course, this does not give 100% coverage over the space of
potential memory upsets, but it is much more effective than one might expect at
first glance. Since control-law implementations produce outputs as a function of
periodic inputs and a relatively small internal state, a large fraction of the memory
upsets can be recovered in this manner. This accounts for the fact that although
many systems in service are not designed to accommodate transient faults, they
do actually exhibit some ability to tolerate such faults.
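A background scrubber can be sketched as below; real implementations lean on ECC hardware assistance, and this loop merely illustrates the idea that touching each word corrects a single-bit upset before a second upset can accumulate in the same word.

    #include <stddef.h>
    #include <stdint.h>

    void scrub(volatile uint32_t *base, size_t words)
    {
        for (size_t i = 0; i < words; i++) {
            uint32_t w = base[i];  /* the read triggers ECC correction */
            base[i] = w;           /* write back the corrected value */
        }
    }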
12 Self-Checking Pairs
A fault-tolerant system can be built on the foundation of self-checking pairs. The
use of self-checking pairs brings several key advantages, but this is not without
some additional cost. The key benefit is that self-checking pairs can greatly
simplify the design of certain aspects of the system by providing a high
probability that faults will manifest themselves as “fail-stop”. However, this
comes at the price of a more inefficient reconfiguration process which wastes
more good hardware than the more traditional NMR approach.
⁶ Single Error Correction, Double Error Detection (SECDED) coding is commonly used for spacecraft systems. However, the processor's internal caches often do not have this level of correction. A periodic cache flush is sometimes used, or a flush on error detect is employed.
The basic idea of a self-checking pair (SCP) is to combine two identical
processing elements which are given identical inputs and their results are
compared. If there is a mis-compare of their outputs, then the comparator circuit
shuts-down the SCP and prevents the output from leaving the SCP. This can be
a temporary shutdown (e.g. the current output only) or a permanent shutdown.
Because it is usually not desirable to permanently remove a SCP due to a
transient fault, a persistence counter can be used which delays permanent
shutdown until a number of consecutive faults have occurred.
[Figure: three self-checking pairs attached to a common bus. Each pair consists of two processors receiving identical inputs; a comparator checks their outputs and asserts error/shutdown on a miscompare, blocking the pair's output from reaching the bus.]
This type of architecture is often referred to as Dual-Dual. The key advantage of
this strategy is that it provides low level, autonomous processor fault detection
and isolation. There is no need to design voting mechanisms and high-level
strategies for redundancy management.
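The comparator logic described above, including the persistence counter, can be sketched in C as follows; the constant and function names are invented for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_MISCOMPARES 3  /* consecutive miscompares before permanent shutdown */

    static int miscompares;
    static bool permanently_down;

    /* Returns true and drives the bus only when the two processors agree. */
    bool scp_output(uint32_t out1, uint32_t out2, uint32_t *bus_out)
    {
        if (permanently_down)
            return false;                 /* the pair has fail-stopped */
        if (out1 != out2) {
            if (++miscompares >= MAX_MISCOMPARES)
                permanently_down = true;  /* declare the pair failed */
            return false;                 /* suppress this output only */
        }
        miscompares = 0;
        *bus_out = out1;                  /* outputs agree: drive the bus */
        return true;
    }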
Of course there must be some mechanism to handle the transition from one self-
checking pair to another. Under the assumption that a self-checking pair fail-
stops, the bus can just accept the first broadcast from either one of the pairs or
search for a valid output in some pre-determined order. Alternatively one of the
pairs can be active while the other one “shadows” the active one. Either way you
are employing four-fold redundancy to avoid the problem of loading the data
state after the primary fails. But of course this only takes care of faults in the
processing element. What about failures in the connection from the processor to
the bus? To deal with these failures we need something more sophisticated –
bus guardian units (see next section).
13 Bus Guardian Units
The bus guardian unit must protect the bus from any transmission outside of its
allocated time slot, while not keeping the self-checking pair from transmitting at
correct times. Because the fault-tolerant clock signal cannot be assumed to be
perfect, the bus guardian must open the window slightly earlier than the
start of the time slot, and close it slightly later than the expected end of
its time slot. The system designer must be careful to make sure that these small
excesses do not overlap with the excesses of other units. So there has to be
some wasted bandwidth of the bus to achieve fault-tolerance this way. The
selection of these time intervals must be analyzed carefully to ensure that the
system functions properly.
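The window check itself is tiny, as the sketch below suggests; GUARD is an assumed allowance for the clock-skew bound, and the surrounding schedule must guarantee that the enlarged windows of different units never overlap.

    #include <stdint.h>
    #include <stdbool.h>

    #define GUARD 2  /* ticks of allowance for clock skew (assumed) */

    bool transmit_enabled(uint64_t now, uint64_t slot_start, uint64_t slot_end)
    {
        /* open GUARD ticks early, close GUARD ticks late */
        return (now + GUARD >= slot_start) && (now <= slot_end + GUARD);
    }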
14 Integrated Modular Avionics (IMA)

In Integrated Modular Avionics, multiple applications share a common, partitioned computing platform rather than each running on its own federated computer.

14.1 ARINC 653

The ARINC 653 standard defines an APplication EXecutive (APEX) API for space and time partitioning that is gaining support in the commercial avionics market.
Several vendors currently offer ARINC 653 compliant real-time operating
systems including the LynxOS®-178 RTOS, Green Hills Integrity-178,
VxWorks653, and the BAE Systems CsLEOS, the first two of which have been
used and certified under the FAA’s DO-178B guidelines. Each partition in an
ARINC 653 system represents a separate application and makes use of memory
space that is dedicated to it. Similarly, the APEX allots a dedicated time slice to
each, thus creating time partitioning. Each ARINC 653 partition supports
multitasking. The advantages are:
• Uses a simple approach to memory and time partitioning and provides
support for common I/O through a well-defined application program interface
(API).
• Simplified maintenance through the ability to modify or add hardware without
impacting application software.
• Enhanced safety assurance through software fault isolation. For example,
software faults in one partition cannot corrupt the memory space in another
partition or impact the timing of another partition.
• Better use of computing resources (as compared to a federated system), resulting in reduced mass and power
• Scalable: the centralized processing function can be distributed in multiple
computers (cabinets)
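Time partitioning is typically driven by a fixed major-frame schedule. The sketch below is a generic illustration of such a table, not the actual APEX services defined by ARINC 653.

    #include <stdint.h>

    typedef struct {
        int      partition_id;
        uint32_t start_ms;     /* offset within the major frame */
        uint32_t duration_ms;
    } window_t;

    #define MAJOR_FRAME_MS 100

    static const window_t schedule[] = {
        { 1,  0, 40 },  /* e.g. flight control partition */
        { 2, 40, 30 },  /* e.g. navigation partition */
        { 3, 70, 30 },  /* e.g. maintenance/BIT partition */
    };

    int active_partition(uint32_t t_ms)
    {
        uint32_t off = t_ms % MAJOR_FRAME_MS;
        for (unsigned i = 0; i < sizeof schedule / sizeof schedule[0]; i++)
            if (off >= schedule[i].start_ms &&
                off < schedule[i].start_ms + schedule[i].duration_ms)
                return schedule[i].partition_id;
        return -1;  /* unreachable when the schedule fills the frame */
    }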
It should also be noted that a fault-tolerant operating system can provide different
levels of fault tolerance on the same platform. The tasks can be dispatched at
different levels of redundancy. Some tasks can run as simplex tasks while others
may be triplex or higher.
15 Protecting a System from Common Cause Faults

15.1 Types of Common Cause Faults

Sources of common cause faults are many and varied and require different
mitigation strategies. Examples that can lead to common cause failure are:
• Compiler/loader programming error
• Manufacturing defects
• Environmentally induced effects (e.g. MMOD, lightning, EMI, total dose radiation)
• A Byzantine asymmetric fault in a system not designed to handle such
faults
A valuable resource for learning more about CCF is the SAE ARP 4761, which
outlines guidelines for conducting safety assessment for civil airborne systems.
This standard enumerates a total of eight common mode types and twenty-two common mode subtypes, with between two and nine example sources for each subtype. Due to the breadth of this topic, it cannot possibly be covered
completely in this report, but some brief mitigation strategies are given as
examples.
15.2 Software Common Cause

Software common cause can be addressed through rigorous verification and test
methods which seek to discover and remove potential errors before the system is
placed in operation. Software dissimilarity is another technique that has been
advocated to address software CCF. The use of dissimilarity at the code level
has been discredited (Knight-Leveson)7. However, at the requirements level
where a completely different function or approach can be specified, this is
deemed by most experts generally to be a good idea. Another method that can
provide some mitigation of software common cause errors is the use of restarts
and retries though the latter can sometimes be complicated by the need for some
rollback mechanism. The hope is that after the restart, the software will not
traverse through exactly the same set of inputs that triggered the software bug,
because the system will be processing new sensor values. Software errors can
also result from compiler design errors. However, these errors are less common
due to the heritage of the compilers and a large user base which tends to weed
out the bugs in the field. Probably the most promising of all approaches is formal
⁷ It has been demonstrated that even independently-developed software versions can fail on the same input. In fact, the probability that this occurs was shown to be much greater than one would expect (i.e. the independence assumption is false) [Leveson]. It is believed that low-level design diversity (often called software fault tolerance) is more vulnerable to this problem than high-level design diversity. Because design diversity does not provide a strong guarantee against common mode failure, a combination of all the approaches is highly recommended. In the most critical systems and subsystems the use of formal verification is prudent, even though it may impose higher early life cycle costs.

Both N-version programming techniques and recovery block schemes have been proposed as code-level mechanisms. For an excellent tutorial on software fault tolerance see [Pomales00]. Recovery blocks are based around the idea of an acceptance test. If the output of a module fails the acceptance test, a backup software routine is used to produce the output. N-version programming relies on multiple versions of a program created by different programming teams. All versions are created from the same specification so that the outputs can be voted.
methods. Formal methods are able to mathematically establish the absence of
hazardous behavior over all possible inputs using formal proof (using theorem
provers) or exhaustive search (using model checkers).
15.3 Design Errors in Hardware

Design errors in hardware (i.e. processors, ASICs, FPGAs) are also possible.
These types of errors generally have lower failure rates than software errors
because of the lower complexity and more rigorous verification culture present in
the hardware industry. Adding hardware dissimilarity is a method that mitigates
this class of error, but this can easily be “straining at a gnat”. The author knows
of no loss of a safety-critical system due to a design error in the hardware, whereas losses of safety-critical systems due to errors in the application software are quite numerous. It should also be noted that adding dissimilarity
to a redundant computing system can actually increase the probability that it fails
due to CCF. Whenever redundancy is added to a system, additional logic (in
hardware/software) must be created to manage that redundancy. Mistakes in
that logic can be catastrophic, creating a new single point of failure. So
dissimilarity is no panacea. It should be used with the utmost of care and
probably only where there is a commitment to formally prove the managing logic.
15.4 Radiation Induced Common Cause Failure

Radiation-induced upsets have caused common cause anomalies on a number of spacecraft, including entry into safe mode due to a processor reset caused by a latch upset in DRAM and, on GRACE, resets, reboots, double-bit errors in MMU-A, and some GPS errors.
15.5 Other Common Cause Faults

Common cause power related problems are worthy of special mention. Because
there are often significant constraints on mass and cable length, fault
containment regions share the redundant power lanes. Therefore if a power fault
occurs, multiple FCRs (from a defect or data transfer perspective) can be
affected. Also, NASA has a history of power related faults, making extra scrutiny appropriate. Parts screening, derating, testing, and careful fault containment
analysis and design help mitigate this class of fault.
A final example that can lead to common cause failure is the vibration and
shock loads of a launch or reentry. The harshness of these environments needs
to be carefully analyzed and designed for with ample margin for uncertainty.
Fault containment through physical separation and isolation are mitigation
strategies. Also, environmental testing can identify problems that may not be
identified by system modeling.
15.6 Functional-level Dissimilar Backup System

The process that governs the switchover to the backup system is of critical
importance. You cannot have the backup unilaterally take over, because it may
do so improperly when it has failed. And the failure rate of the simplex backup
system is much higher than that of the primary fault-tolerant system. Of course it is
reasonable to allow the primary system to give control to the backup, but this
cannot be relied upon for all situations. Therefore inevitably, the switchover
process must involve the human operator. But it should be remembered that the
effectiveness of this strategy depends upon the ability of the astronauts to
recognize the need for and accomplish the switchover to the backup in adequate
time.
Before a backup system is given control, the backup has to be initialized with all
of the relevant program state. Sometimes the backup system is run as a hot
spare and performs all of the same calculations as the primary system, only its
outputs are not sent to the system actuators. If the backup is “cold” it has to be
loaded with appropriate code and data prior to being given control.
The cost of a backup system can be considerable given that different software
has to be written for it. Also the backup system must be connected to the
input/output communications systems or have its own set of sensors and
actuators.
16 Re-Usable Fault Tolerance and System Layering

Many fault-tolerant computer systems have been built with voting strategies that
utilize information about the applications. Although at first this may seem like a
good thing to do, i.e. we might as well use the maximum information available to
detect faults, it turns out that this is really very counter-productive. The problem
is that one cannot verify and validate the fault-tolerant aspects of the system
without the applications. There is no divide-and-conquer approach that can
simplify the verification and validation. The aviation industry has frequently used
application-level fault-tolerance in the design of their flight control systems.
Interestingly they have largely avoided this approach in the design of fault-
tolerant flight management systems. Here architectures based upon self-
checking pairs are frequently employed.
16.1 Asynchronous Flight Control Systems
Voting in an asynchronous architecture is built around the idea that the rate of
change of output value is bounded and that the sensor data can be separated in
time by no more than the period of the sample rate. This is illustrated below:
[Figure: two asynchronous channels repeatedly read sensors, compute, and output; their read times are offset by at most one sample period Tf.]

    Error_x <= Tf * max|df/dt|
where Tf = sensor sample period, and max (df/dt) = the maximum rate of change
of the output function. If the control laws are stable then the output differences
will be bounded if the input differences are bounded. Using these bounds,
thresholds can be set at the voters. The mid-value from the channels is used to
drive the actuators and if the difference between an output and the mid-value
exceeds the threshold then the channel is declared as failed.
But eventually the designers have to deal with the fact that the control laws have
state variables associated with them (e.g. integrator variables). So although the
input variables are re-aligned once a channel drifts a full period apart, the
integrator variables are one iteration step apart! So inevitably in these
architectures the designers end up using cross-channel strapping and data
synchronization techniques. So this is typically handled by performing data
synchronization at the application level. For example Y.C. Yeh of the Boeing
Commercial Aircraft Company describes how this is done for the Boeing 777:
“The potential for functional asymmetry, which can lead to disagreement
among digital computing systems, is dealt with via frame and data
synchronization and median value selection of the PFC’s output
commands” []
So although the original design was built around the concept of no clock
synchronization between the channels, they end up synchronizing anyway. But,
instead of solving this problem at a lower level of the system in an application-
independent way, the problem is solved at the application level while dealing with
a lot of other issues. In other words, there is no separation of concerns. It is
possible to defer the issue of synchronizing the channels, but eventually this
issue must be solved at some higher level of the system.
The whole strategy for diagnosing failed channels is based upon bounded rates
of change of the output values. But what happens when there is a discrete
change or a mode change? Well this could lead to an erroneous diagnosis of a
channel failure. So another ad-hoc solution is patched onto the architecture--
the channel outputs are passed through an output filter which ramps up and
ramps down the actuator outputs in a way that bounds the rate of change. These
ramps also serve to smooth out any rapid change of an output to an actuator.
So the designers are essentially dealing with the lack of clock synchronization
using a suite of ad-hoc patches. But there is no separation of concerns, no
divide and conquer and thus one ends up with an extremely complex architecture
that is difficult to validate and test. Dale Mackall wrote about this problem after
Dryden had flight tested the asynchronous F-16 DFCS, “The fault-tolerant design
should also be transparent to the control law functions. The control laws should
not have to be tailored to the system’s redundancy level.” [Mackall88]
There are some additional concerns associated with mid-value select algorithms.
Because the channel outputs are not identical even in fault-free conditions,
exact-match voting cannot be used. A mid value select algorithm must be used,
but mid value select algorithms cannot be used for diagnosis because they do
not decide which of the other two channels are faulty. Therefore, thresholds
based upon dynamics must also be employed for fault diagnosis. But this is not
as good a fault detector as an exact match voter. A permanently faulty processor
may remain undiagnosed for a long time, e.g. if it “flat lines” between two good
channels. So fault latency increases and this impacts the reliability analysis. Also
where to set the thresholds is fundamentally a trial-and-error process. These
thresholds change as the control laws change. It is not unusual for the control
laws to be modified at integration testing when the vehicle dynamics are better
understood. System designers have to wait until they have a full-scale simulation
(e.g. iron-bird) to debug the fault-tolerance of the system. But this means that the
basic fault-tolerance design is being modified at integration time as well. What
should be an early-life cycle, low level activity is deferred until late in the life-cycle
when errors are notoriously expensive to repair.
16.2 Synchronous Fault-Tolerant Systems

An alternative approach is to synchronize the redundant channels and place all of the fault-tolerance functions in a redundancy management layer that sits between the applications and the replicated hardware:

[Figure: Redundancy Management layer separating the applications from the replicated hardware.]
Many cost-saving and safety benefits accrue from this approach. First, the
application software can be designed by a different vendor than the one building
the fault-tolerant computing platform. This prevents the government from getting
locked into a single large vendor. Second, the software can be developed as if it
were running on a single ultra-reliable operating system. It can be tested and
verified independently from the fault-tolerant system. Third, the fault-tolerant
computing platform can be reused over many different applications. The fault-
tolerant computing platform can be highly configurable and scalable supporting
different safety and reliability goals. Fourth, the redundancy management
algorithms can be designed in a processor-independent manner which enables
the use of COTS processors that won’t lock you into an antiquated hardware
technology. The following characteristics of the redundancy management layer
are highly desirable:
• Fault-masking, fault detection, and reconfiguration are independent of the
applications software
• The redundancy management should be handled in a processor-
independent way (either via software or via small custom hardware that
interfaces to COTS processors)
• The redundancy management should be highly configurable allowing
some processes to run in triplex, some as simplex, some as dual-dual etc.
• The redundancy management should support low power modes, allowing
systems to be turned on and off without interruption of critical applications
and processes.
• The redundancy management should provide a standard interface so that
multiple IMA operating systems could be supported.
[Figure: a static schedule on five processors. Each processor runs a sequence of replicated application tasks (A-F) followed by a vote (V) and an I/O slot, e.g., P1: A B D V I/O; P2: C A F V I/O; P3: D E B V I/O; P4: B A V I/O; P5: E D C V I/O.]
16.3 Maintaining Independence between the Applications and the Fault-Tolerant Mechanisms
Many fault-tolerant systems that are deployed today are not reusable because
the fault-tolerance mechanisms used in these systems are intimately connected
to the specific applications that run on them. If the applications are changed,
then the fault-tolerance mechanisms must be changed as well.
The following observations argue for exact-match voting and application-independent fault-tolerance mechanisms:
• Threshold voting based on application software characteristics leads to
false alarms because of uncertainty in the cause of a threshold being
exceeded (e.g. did vehicle dynamics change or was it a processor fault).
• Exact match voting detects errors without regard to the nature of the
applications.
• Exact match voting can be validated early in the life cycle.
• Exact match voting can be used in conjunction with BIT to provide 100%
fault masking and high probability of a timely reconfiguration.
• You do not have to artificially ramp-up/ramp-down outputs to make sure
that voting thresholds won’t be exceeded during non-linear actions such
as mode switching.
• You do not have to introduce cross-channel synchronization of integrator
values because the interactive consistency algorithms used will maintain a
globally consistent state on all good processors.
• Diagnosis of faulty processors, memories, and I/O resources is not
confounded by uncertainty over whether the fault is due to computational
resource failure or due to failure in an external vehicle system or actuator
that is affecting the thresholds.
16.4 Technology Obsolescence and Refresh

The fault-tolerance used in most avionics systems today is not easily updated by
new technology. Because much of the timing and voting in the system is
handled by the processors themselves and depends upon specific aspects of
these processors, they cannot be easily updated with more capable processors.
The fault-tolerant system must be designed with this goal in mind and carefully
configured to avoid these dependencies. The SPIDER is a modern architecture
that has been designed with this as a primary goal [Geser02]. See
http://shemesh.larc.nasa.gov/fm/spider/ for details about this fault-tolerant
system.
17 Reliability Analysis
Since fault tolerance seeks to increase the reliability of a system through
redundancy it is important to be able to compute the reliability of a fault-tolerant
system as a function of measurable parameters such as processor failure rates
and system recovery rates.
17.1 Markov Models
[Figure 1: Markov model of a reconfigurable quadruplex. Failure transitions: 40 --4λ--> 41 --3λ--> 42; 30 --3λ--> 31 --2λ--> 32; 10 --λ--> 11. Recovery transitions, labeled with the distribution Fr(t): 41 --> 30 and 31 --> 10.]
The states of the system are labeled with two digits. The first digit denotes the
number of processors that are currently being voted. The second digit denotes
the number of faulty processors in the system. The system starts in state 40,
where all four active processors participate in the vote and none of them are
faulty. Anyone of these processors can fail, which takes the system to state 41
at rate 4λ, where λ represents the single processor failure rate. Because the
processors are identical, the failure of each processor is not represented with a
separate transition. For example, at state (41), the system has one failed
processor but there is no delineation as to which processor has failed and so the
total rate of reaching this state is 4λ. Here, the system analyzes the errors from
the voter and diagnoses the problem. The transition from state (41) to state (30)
represents the removal (reconfiguration) of the faulty processor. The
reconfiguration transitions are labeled with a distribution function (Fr(t)) rather
than a rate. The reason for this labeling is that experimental measurement of the
reconfiguration process has revealed that the distribution of recovery time is
usually not exponential. Consequently, the transition is not described by a
constant rate. This label indicates that the probability that the transition time from
state (41) to state (30) will be less than t is Fr (t). The presence of a non-
exponential transition generalizes the mathematical model to the class of semi-
Markov models. At state (30), the system is operational with three good
processors. The recovery transition from state (41) to state (30) occurs as long
as a second processor does not fail before the diagnosis is complete. Otherwise,
the voter could not distinguish the good results from the bad. Thus, a second
transition exists from state (41) to state (42), which represents the situation
where two out of the four processors have failed and are still participating in the
vote. The rate of this transition is 3 λ, because any of the remaining three
processors could fail. State (42) is a death state (an absorbing state) that
represents failure of the system due to coincident faults. It is labeled in red to
indicate system failure. Of course, this is a conservative assumption. Although
two out of the four processors have failed, the failed processors may not produce
errors at the same time nor in the same place in memory. In this case, the voting
mechanism may effectively mask both faults and the reliability of the system
would be better than the model predicts.
At state (30), the system is operational with three good processors and no faulty
processors in the active configuration. Either one of these processors may fail
and take the system to state (31). At state (31), once again, a race occurs
between the reconfiguration process that ends in state (10) and the
failure of a second processor that ends in state (32). The recovery distribution
from state (31) could easily be different from the recovery distribution from state
(41) to state (30); however, for simplicity it is assumed to be the same. State (32)
is thus another death state, and state (10) is the operational state in which one
good processor remains. In this example the system was not designed to operate
as a dual, so it degrades directly to a simplex. The transition from state (10)
to state (11) represents the failure of the last processor. At state (11) no good
processors remain; reaching this death state is often referred to as failure by
exhaustion of parts.
8 The SURE and WinSURE programs solve semi-Markov models. A semi-Markov model is more general
than a pure Markov model in that it allows non-exponential transitions. The SURE program requires the
mean and standard deviation of each non-exponential distribution. The PAWS and STEM programs
accept the same input language as SURE, but they assume that all transitions are exponentially
distributed; the exponential rate is derived from the specified mean.
The SURE program [Butler88] solves this class of semi-Markov systems. The
PAWS and STEM programs accept the model input in exactly the same format
as the SURE program; because they are pure Markov solvers, they assume that
all of the recovery transitions are exponential. The ASSIST program is a tool
that generates large models for SURE, STEM, or PAWS [Johnson95].
The model shown in figure 1 can be described in the SURE input language as
follows:
40,41 = 4*LAMBDA;
41,42 = 3*LAMBDA;
41,30 = <REC_TIME, REC_TIME>;
30,31 = 3*LAMBDA;
31,32 = 2*LAMBDA;
31,10 = <REC_TIME, REC_TIME>;
10,11 = LAMBDA;
TIME = 1; (* Mission Time of 1 hour *)
Running SURE on this model yields the following bounds on the probability of
system failure:

LOWERBOUND      UPPERBOUND
--------------  --------------
2.865386E-0014  3.005409E-0014
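For intuition, the pure-Markov approximation of this model can also be
integrated numerically. The following sketch is illustrative only: LAMBDA and
the recovery rate DELTA are assumed values (the constant definitions behind the
SURE run above are not shown), so its result will not reproduce the bounds
printed above, and its naive fixed-step integration stands in for SURE's
bounding mathematics.

```python
# Illustrative only: integrate the pure-Markov approximation of the
# quadraplex model (figure 1) via the Kolmogorov forward equations
# dp/dt = p Q.  LAM and DELTA are assumed values, and the recovery
# transitions are approximated as exponential with rate DELTA.
import numpy as np

LAM = 1e-4      # assumed per-processor failure rate (per hour)
DELTA = 3.6e4   # assumed recovery rate = 1 / mean recovery time
T = 1.0         # mission time (hours)

states = ["40", "41", "42", "30", "31", "32", "10", "11"]
idx = {s: i for i, s in enumerate(states)}
Q = np.zeros((len(states), len(states)))

def rate(src, dst, r):
    """Add a transition and keep the generator's row sums at zero."""
    Q[idx[src], idx[dst]] += r
    Q[idx[src], idx[src]] -= r

rate("40", "41", 4 * LAM)
rate("41", "42", 3 * LAM)   # second fault wins the race: death state
rate("41", "30", DELTA)     # reconfiguration succeeds
rate("30", "31", 3 * LAM)
rate("31", "32", 2 * LAM)   # death state
rate("31", "10", DELTA)
rate("10", "11", LAM)       # exhaustion of parts

p = np.zeros(len(states))
p[idx["40"]] = 1.0          # system starts in state 40
steps = 100_000             # naive fixed-step (Euler) integration
dt = T / steps
for _ in range(steps):
    p = p + dt * (p @ Q)

pf = p[idx["42"]] + p[idx["32"]] + p[idx["11"]]
print(f"P(system failure at T = {T} h) ~ {pf:.3e}")
```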
The effect of mission time can be examined by solving a simple model with no
reconfiguration over a range of mission times:

LAMBDA = 1E-5; (* processor failure rate *)
TIME = 10 TO+ 4000; (* mission time *)
1,2 = 4*LAMBDA;
2,3 = 3*LAMBDA;
The WinSTEM output, plotted on logarithmic scales, shows the dramatic impact
of a large mission time:

[Figure: probability of failure (1e-8 to 1e-2, log scale) versus mission time
(1 to 10,000 hours, log scale); the 14-day, 60-day, and 100-day mission times
are annotated.]
From this graph it is clear why the fault tolerance used in deep space missions is
very different from that used in commercial aircraft. In long mission scenarios,
the designers focus on reducing the processor failure rate λ and sometimes use
cold spares.
18 Coverage
If you hang around reliability people for a while, you will inevitably encounter the
term "coverage". Unfortunately, the term is terribly ambiguous: it is used in many
different ways in different contexts. In this section, four of the most common
uses are explained. It is the author's opinion that the term should be used with
caution or avoided altogether, because it is easily misinterpreted.
First fault coverage: Some systems are not capable of surviving all first
faults. The percentage of first faults from which such a system can recover is
often called its coverage. For example, suppose you have a dual system that runs
a built-in test diagnostic with coverage C. The following model describes this system:
[Model: 20 →2Cλ→ 10; 10 →λ→ 11; 20 →2(1−C)λ→ 21]
The system starts in state 20 with two good processors. State 21 represents the
system after the arrival of a fault that the system cannot detect, and hence
cannot mask, so it is a death state. State 10 represents the system after the
arrival of a fault that is covered: the system successfully removes the faulty
processor and continues operation with one good processor. The following table
shows the dramatic impact of first fault coverage with a fixed λ = 1.0 × 10^-5/hour:
C        Pf
0.9999   2.1 × 10^-9
0.999    2.0 × 10^-8
0.99     2.0 × 10^-7
0.9      2.0 × 10^-6
0        2.0 × 10^-5
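These values can be sanity-checked by hand if a one-hour mission time T is
assumed (the mission time behind the table is not restated here). To first order,
the probability of failure is the chance of an uncovered first fault plus the
chance of exhausting both processors:

Pf ≈ 2(1 − C)λT + C(λT)^2

For C = 0.9999, λ = 1.0 × 10^-5/hour, and T = 1 hour, this gives
2.0 × 10^-9 + 1.0 × 10^-10 ≈ 2.1 × 10^-9, matching the first row of the table.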
Detection coverage: Some systems are fully capable of masking all first faults but
can detect only a fraction of them. Suppose we have an asynchronous
quadraplex system that relies on threshold voting for detection. Some faults can
remain latent because they do not propagate errors that violate the thresholds.
Suppose C represents the fraction of faults that are detectable by the threshold
voters and the built-in test (BIT); that is, a fraction C of all faults will be
detected and reconfigured out. The following Markov model describes this
quadraplex:
[Model: 40 →4Cλ→ 41 →3λ→ 42; 40 →4(1−C)λ→ 41u →3λ→ 42u; 41 →r→ 30;
30 →3Cλ→ 31 →2λ→ 32; 30 →3(1−C)λ→ 31u →2λ→ 32u; 31 →r→ 10;
10 →λ→ 11. States labeled "u" contain an undetected (latent) fault; r denotes
the reconfiguration transition.]
Notice that here the "uncovered" faults do not lead directly to system failure,
because they are outvoted (i.e., masked); however, they take the system to a
state (e.g., 41u) from which no reconfiguration transition exists.
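To make the threshold idea concrete, here is a toy sketch (my own illustration,
not from this primer) of mid-value voting with threshold-based detection; the
readings and the threshold are assumed values:

```python
# Toy illustration of threshold voting: the mid-value masks a bad channel,
# while detection only flags channels whose disagreement with the voted
# value exceeds a threshold.  A faulty channel that stays inside the
# threshold is masked but remains latent (undetected).
def vote_and_detect(readings, threshold):
    voted = sorted(readings)[len(readings) // 2]   # mid-value select
    suspects = [i for i, r in enumerate(readings)
                if abs(r - voted) > threshold]
    return voted, suspects

# Channel 3 is faulty but within the threshold: masked, yet undetected.
print(vote_and_detect([100.1, 100.0, 99.9, 100.4], threshold=1.0))
# -> (100.1, [])   latent fault
# Channel 3 is faulty and violates the threshold: masked and detected.
print(vote_and_detect([100.1, 100.0, 99.9, 150.0], threshold=1.0))
# -> (100.1, [3])
```

A faulty channel whose errors stay inside the threshold is outvoted but never
flagged, which is exactly the latent-fault behavior captured by the "u" states
above.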
A third use of the term lumps the entire recovery process into a single number.
In this approach, the subtree of figure 1

[Model: 30 →3λ→ 31 →2λ→ 32; 31 →Fr(t)→ 10; 10 →λ→ 11]

is replaced by

[Model: over states 30, 31, 32, and 11, with the race between recovery and a
second fault folded into coverage-weighted transitions 3λ, 2Cλ, and 2(1−C)λ]

A coverage parameter of this kind thus rolls several distinct
concepts into one number. It is the author’s opinion that this type of coverage
should be avoided.
BIT coverage: the fraction of the faults that can be detected by a Built-In-Test
technique. (See Section 8.3 “Detection Using Built-in-Test (BIT)”).
Together, these notions of coverage provide part of the basis for assuring the
safety of the system.
19 Function Migration
Some fault-tolerant systems respond to processor failures by migrating functions
from failed processors to the surviving ones. This requires that the software for
each migratable function be executable on all of the processors in the system.
This can be accomplished by providing access to a mass memory where all of
the software codes are stored or by making copies of the codes in all of the
local stores. Often fault-tolerant systems rely on time-division multiplexed
buses that are driven by static schedule tables. In these systems, function
migration requires some mechanism for updating these tables in a fault-tolerant
and safe manner.
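The following sketch is purely illustrative (the data structures and the
frame-boundary switch are my assumptions, not a mechanism described in this
primer): each node stages an agreed-upon new table and switches to it only at a
pre-agreed frame boundary, so that all nodes change schedules simultaneously.

```python
# Illustrative sketch only: a static TDM schedule table, updated in a
# conservative two-step fashion so that every node switches tables at
# the same agreed frame boundary.
from dataclasses import dataclass

@dataclass
class Schedule:
    version: int
    slots: list          # slots[i] = processor that owns TDM slot i

@dataclass
class Node:
    active: Schedule
    pending: Schedule | None = None
    switch_frame: int | None = None

    def stage_update(self, new_table: Schedule, at_frame: int) -> None:
        # Stage the agreed-upon table; do not use it yet.
        self.pending, self.switch_frame = new_table, at_frame

    def start_of_frame(self, frame: int) -> Schedule:
        # Atomically switch at the agreed frame boundary.
        if self.pending is not None and frame >= self.switch_frame:
            self.active, self.pending = self.pending, None
        return self.active

node = Node(active=Schedule(version=1, slots=["P1", "P2", "P3", "P4"]))
node.stage_update(Schedule(version=2, slots=["P1", "P2", "P4", "P4"]),
                  at_frame=100)
print(node.start_of_frame(99).version)   # 1: still on the old table
print(node.start_of_frame(100).version)  # 2: migrated P3's slot to P4
```

A real system would first run an agreement protocol on the new table so that
every fault-free node stages identical contents before any node switches.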
20 Vehicle Health Management
Beyond the computing platform itself, failures in the rest of the vehicle must be
detected and handled; this function is commonly called vehicle health
management (VHM). VHM is a huge topic in itself, but a few observations about
it are offered here:
1. The mechanisms that are used for vehicle-level FDIR (fault detection,
isolation, and recovery) are very different from those used within the
computing resources and are often based upon concepts from artificial
intelligence.
2. The VHM system is usually implemented as a software application that
runs on the fault-tolerant computing system.
3. The VHM system seeks to diagnose which external components have
failed on the basis of observables.
4. The scope of a VHM system can be huge, including sensors, actuators,
power systems, displays, thermal systems, landing gear, hydraulics, etc.
5. The use of a single, unified approach to diagnose many different kinds of
subsystem failures is usually referred to as Integrated Vehicle Health
Management (IVHM).
The IVHM system should remain independent of lower-level redundancy
management:

[Figure: layered architecture with the IVHM application running above the
Operating System, which in turn runs above the Redundancy Management layer.]
A number of tools and techniques have been developed to aid the system
designer in the identification of hazards in safety-critical systems. Some
examples are (1) Failure Modes and Effects Analysis (FMEA), (2) Hazard and
Operability Studies (HAZOP), and (3) Deviation Analysis. While performing an
FMEA, the analyst creates a list of component failure modes and tries to deduce
the effects of those failure modes on the system. An assessment is then made
of the severity and likelihood of these failures and of the ability of the system to
detect them. Sometimes these factors are rolled up into a single risk priority
number that is assigned to each identified failure mode. This enables the
designer to focus his attention on the most critical failure modes. While this type
of analysis can be applied to the low-level design of the computing platform
components, it is most useful when applied to the vast array of components of
the subsystems external to the computing resources. This type of analysis aids
the designer of an IVHM system, or of application software that performs local
diagnosis and recovery from subsystem component failures, in identifying the
critical component failures and in developing mechanisms to handle them. For
more information about FMEA, the reader is referred to [Dailey04].
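As an illustration of the roll-up, the conventional risk priority number
multiplies severity, occurrence, and detection rankings (commonly on 1-10
scales); the failure modes and rankings below are hypothetical:

```python
# Hypothetical FMEA worksheet: RPN = severity x occurrence x detection,
# each ranked on a 1-10 scale (10 = worst).  Sorting by RPN highlights
# the failure modes that deserve the designer's attention first.
failure_modes = [
    # (component, failure mode, severity, occurrence, detection)
    ("pitot probe",    "blockage (icing)", 9, 4, 6),
    ("hydraulic pump", "loss of pressure", 8, 3, 2),
    ("rate gyro",      "null-shift drift", 7, 5, 7),
]

ranked = sorted(failure_modes,
                key=lambda m: m[2] * m[3] * m[4], reverse=True)
for comp, mode, s, o, d in ranked:
    print(f"RPN {s * o * d:4d}  {comp}: {mode}")
```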
20.3 Sensor Fault Tolerance
Due to the criticality of the sensor selection software, it too must run as a
redundant task on the fault-tolerant computing platform:
[Figure: the Sensor Selection application alongside App 2, App 3, and App 4,
all running above the Operating System and Redundancy Management layers.]
Sometimes there are dynamic relationships between different kinds of sensors,
e.g., speed and acceleration. These dynamic relationships can be leveraged to
synthesize more accurate values for the required measurements, which can be
especially useful when critical sensors have failed. This approach is called
analytic redundancy. These techniques tend to be very application-specific
because they depend upon the dynamic characteristics of a particular vehicle.
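As a toy sketch of the idea (entirely illustrative: the signals, sample rate,
and threshold are assumptions), the relationship "acceleration is the derivative
of speed" lets an accelerometer cross-check a speed sensor and stand in for it
after a failure:

```python
# Toy analytic redundancy: propagate speed from the accelerometer and
# compare against the speed sensor.  A persistent large residual
# suggests one of the two sensors has failed; the propagated estimate
# can substitute for a failed speed sensor.
DT = 0.1           # sample period (s)
THRESHOLD = 5.0    # residual that triggers a fault flag (assumed)

def check_speed_channel(speed_meas, accel_meas, v0):
    v_est, flags = v0, []
    for v_meas, a in zip(speed_meas, accel_meas):
        v_est += a * DT                 # predict speed from acceleration
        flags.append(abs(v_meas - v_est) > THRESHOLD)
        if not flags[-1]:
            v_est = v_meas              # re-anchor on the trusted sensor
    return flags

# The speed sensor sticks at a bogus value halfway through the run.
speeds = [100.0 + 0.2 * i for i in range(10)] + [42.0] * 10
accels = [2.0] * 20                     # constant 2 m/s^2 throughout
print(check_speed_channel(speeds, accels, v0=100.0))
# -> ten False flags, then ten True flags once the sensor sticks
```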
21 Concluding Remarks
The design of a fault-tolerant computing system is an extremely challenging
endeavor. The algorithms and techniques that are at the center of fault-tolerance
are among the most subtle and difficult to prove in Computer Science.
Fortunately these algorithms have been vigorously studied and analyzed by the
academic world. Many of these have been formally verified and mechanically
checked using theorem prover technology [Miner04]. It would be foolish to
design and implement a fault-tolerant computer today without taking advantage
of this storehouse of results.
The study of fault tolerance cannot be divorced from reliability analysis. A basic
understanding of Markov modeling and analysis is essential to understanding the
tradeoffs that must be made in the design of a fault-tolerant computer.
Fortunately, this background is not difficult to obtain [Butler95], and the solution
of these models is straightforward using freely available programs such as SURE
or STEM [Butler88].
22 Glossary
Application Programming Interface (API): A language processed by an operating
system (or lower layer in the system), which is used to provide services to the
applications.
Application software: the software that implements the primary functions of the
system. This software executes in an environment provided by system software
(e.g., the operating system), which is distinct from the application software.
Built-In-Test (BIT): diagnostics which run automatically and seek to isolate faulty
components.
Common Cause Fault (CCF): A fault that can trigger multiple simultaneous errors
in different fault containment regions.
Design Error: A design error is a difference between the system requirement and
the specified design. The failure mechanism is in the human mind. Design errors
range from syntax errors in the program code to fundamental mistakes, including
the use of wrong algorithms, inconsistent interfaces, and flawed software
architectures.
Error recovery: the process of restoring the system state to an error-free state
after the occurrence of a fault (usually transient).
Failure: The result of a system deviating from the specified function of that
system because of an error.
Failure Modes and Effects Analysis (FMEA): A methodology that helps the
system designer to identify and handle hazards in safety-critical systems.
Fault: A defect in the hardware, software or system component that can lead to
an incorrect state (i.e. error).
Interactive Consistency: A property of a system which states that the input values
to the system are distributed in a manner that guarantees that all redundant tasks
get exactly the same value even in the presence of faults.
Intermittent fault: A fault that appears, disappears and then reappears.
Mean Time Between Failure (MTBF): the average time between failures of a
system. When the failure rate of the system is constant, the MTBF is the
reciprocal of the failure rate.
N fault tolerant: a system that remains operational after N sequential (not
simultaneous) faults. For example, a two fault tolerant (2FT) system is
"fail operational, fail operational"; i.e., after two sequential faults, the system is
still functioning.
Self-Checking Pair (SCP): a unit built from two identical processing elements
whose outputs are compared; if there is a miscompare, the output is inhibited,
making the SCP fail-stop.
Transient fault: a fault that appears for a short time and then disappears.
Voting: the process of selecting a final result from the outputs of redundant
channels or redundant computations.
23 References
[Avizienis04] Avizienis, Algirdas; Laprie, Jean-Claude; Randell, Brian; and
Landwehr, Carl: Basic Concepts and Taxonomy of Dependable and Secure
Computing. IEEE Transactions on Dependable and Secure Computing, Vol. 1,
No. 1, January-March 2004.
[Butler88] Butler, Ricky W.; and White, Allan L.: SURE Reliability Analysis:
Program and Mathematics. NASA Technical Paper 2764, Mar. 1988.
[Butler92] Butler, Ricky W.: The SURE Approach to Reliability Analysis. IEEE
Transactions on Reliability, vol. 41, no. 2, June 1992, pp. 210--218.
[Butler95] Butler, Ricky W.; and Johnson, Sally C.: Techniques for Modeling the
Reliability of Fault-Tolerant Systems With the Markov State-Space Approach.
NASA RP-1348, September 1995, 130 pp.
[DiVito90] Di Vito, Ben L.; Butler, Ricky W.; and Caldwell, James L.: Formal
Design and Verification of a Reliable Computing Platform For Real-Time Control
(Phase 1 Results). NASA Technical Memorandum 102716, NASA Langley
Research Center, Hampton, Virginia, October 1990.
[Geser02] Geser, Alfons; and Miner, Paul: A Formal Correctness Proof of the
SPIDER Diagnosis Protocol. Theorem Proving in Higher Order Logics (TPHOLs),
Track B, 2002.
[Johnson95] Johnson, Sally C.; and Boerschlein, David P.: ASSIST User Manual.
NASA technical memorandum 4592, August 1995.
[Lala86] Lala, Jaynarayan H.: A Byzantine Resilient Fault Tolerant Computer for
Nuclear Power Application. IEEE Fault Tolerant Computing Symposium 16
(FTCS-16), Vienna, Austria, July 1986, pp. 338-343.
[MA2-00-057] http://mmptdpublic.jsc.nasa.gov/mswg/Documents/MA2-00-057.pdf
[Miner04] Miner, Paul; Geser, Alfons; Pike, Lee; and Maddalon, Jeffrey: A Unified
Fault-Tolerance Protocol. Formal Modelling and Analysis of Timed Systems -
Formal Techniques in Real-Time and Fault Tolerant Systems
(FORMATS-FTRTFT 2004), Grenoble, France, September 22-24, 2004.
[Ramanathan90] Ramanathan, P.; Shin, K. G.; and Butler, Ricky W.: Fault-
Tolerant Clock Synchronization in Distributed Systems. IEEE Computer, Vol. 23,
No. 10, October 1990, pp. 33-42.