Lecture 7 - FAULT-TOLERANT COMPUTING
• INTRODUCTION:
• What is fault tolerance?
• Fault tolerance is the property that enables a system to continue
operating properly in the event of the failure of some of its
components.
• Fault tolerance is particularly sought in high-availability or life-critical
systems. It is the art and science of building computing systems that
continue to operate satisfactorily in the presence of faults.
• Fault Tolerance Requirements: The basic characteristics of fault
tolerance are:
• No single point of failure
• No single point of repair
• Fault isolation to the failing component
• Fault containment to prevent propagation of the failure
• A system fails if it behaves in a way that is not consistent with its
specification. Such a failure is the result of a fault in a system
component.
• Systems are fault-tolerant if they behave in a predictable manner,
according to their specification, in the presence of faults.
• ⇒ There are no failures in a fault-tolerant system.
• Several application areas need systems to maintain a correct
(predictable) functionality in the presence of faults
• What is correct functionality in the presence of faults?
• The answer depends on the particular application (on the
specification of the system):
• The system stops and does not produce any erroneous (dangerous)
result/behaviour.
• The system stops and restarts after a while without loss of
information.
• The system keeps functioning without any interruption and
(possibly) with unchanged performance.
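The second behaviour above (stop, then restart without loss of information) can be illustrated with a small checkpointing sketch. This is only an illustration under stated assumptions: the function process_all, the fail_at parameter, and the use of a plain dict as stand-in for stable storage are all hypothetical and not part of the lecture material.

```python
def process_all(items, checkpoint, fail_at=None):
    """Sum items, saving progress in `checkpoint` after every item.

    `checkpoint` is a dict standing in for stable storage (an assumption);
    `fail_at` injects a crash at that index to imitate a fault.
    """
    index = checkpoint.get("index", 0)
    total = checkpoint.get("total", 0)
    while index < len(items):
        if index == fail_at:
            raise RuntimeError("injected fault: system stops")
        total += items[index]
        index += 1
        # Record progress so a restart can resume from here.
        checkpoint["index"] = index
        checkpoint["total"] = total
    return total


checkpoint = {}
items = [1, 2, 3, 4]
try:
    process_all(items, checkpoint, fail_at=2)  # stops after two items
except RuntimeError:
    pass
result = process_all(items, checkpoint)  # restart resumes at the third item
print(result)
```

Because progress is saved after every item, the restarted run continues from the saved state instead of starting over, so no completed work is lost.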
• A fault can be:
1. Hardware fault: malfunction of a hardware component (processor,
communication line, switch, etc.).
2. Software fault: malfunction due to a software bug.
• A fault can be the result of:
1. Mistakes in specification or design: such mistakes are at the origin of
all software faults and of some of the hardware faults.
2. Defects in components: hardware faults can be produced by
manufacturing defects or by defects that arise as a result of deterioration
over time.
3. Operating environment: hardware faults can be the result of stress
produced by an adverse environment: temperature, radiation, vibration, etc.
• Fault types according to their temporal behavior:
• 1. Permanent fault: the fault remains until it is repaired or the
affected unit is replaced.
• 2. Intermittent fault: the fault vanishes and reappears (e.g. caused by
a loose wire).
• 3. Transient fault: the fault dies away after some time (caused by
environmental effects).
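The three temporal behaviours can be made concrete with a toy timeline model. This is a sketch only: the function is_faulty, the start step, and the fixed period used for the intermittent case are hypothetical simplifications (a real intermittent fault recurs irregularly, not on a fixed schedule).

```python
def is_faulty(kind, t, start=2):
    """Return True if a fault of the given kind is active at time step t.

    Hypothetical model: every fault first appears at step `start`.
    """
    if t < start:
        return False
    if kind == "permanent":
        return True                  # stays until repair or replacement
    if kind == "intermittent":
        return (t - start) % 3 == 0  # vanishes and reappears
    if kind == "transient":
        return t < start + 2         # dies away after some time
    raise ValueError(f"unknown fault kind: {kind}")


# Print one timeline per fault kind: "X" marks a step where the fault is active.
for kind in ("permanent", "intermittent", "transient"):
    timeline = "".join("X" if is_faulty(kind, t) else "." for t in range(10))
    print(f"{kind:12} {timeline}")
```

Reading the printed timelines side by side shows the distinction: the permanent fault never clears, the intermittent fault keeps coming back, and the transient fault appears once and is gone.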
• Fault tolerance
• A system or a component fails due to a fault
• Fault tolerance means that the system continues to provide its services
in the presence of faults
• A system may experience partial failures and should also recover
from them
• Fault categories in time:
• Transient: occurs once and disappears
• Intermittent: occurs many times in an irregular way
• Permanent: remains until it is repaired or the affected unit is replaced
• Fault-tolerant computing is the art and science of building computing
systems that continue to operate satisfactorily in the presence of
faults.
• A fault-tolerant system may be able to tolerate one or more fault
types, including:
• Transient, intermittent or permanent hardware faults,
• Software and hardware design errors,
• Operator errors, or
• Externally induced upsets or physical damage.
Techniques of Fault Tolerant Systems
• There are four main techniques:
• The first technique is HW redundancy, which is done by using more than
one unit of the component whose faults are to be tolerated.
• The second technique is SW redundancy, which is done by using
additional programs, subprograms, and program
• The third technique is time redundancy, which is useful in the
case of soft errors.
• The last technique is information redundancy, which relies on
coding theory.
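Three of these techniques can be sketched in a few lines each. The code below is an illustrative sketch, not a reference implementation: the function names (majority_vote, run_with_retries, add_parity, parity_ok) are hypothetical. Majority voting over replicated units is one common form of HW redundancy, retrying is one form of time redundancy, and a single even-parity bit is about the simplest code used for information redundancy. The same voter could also compare independently written program versions, which is one way to realise SW redundancy.

```python
from collections import Counter


def majority_vote(outputs):
    """HW redundancy: return the value that most replicated units agree on."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def run_with_retries(operation, attempts=3):
    """Time redundancy: repeat the operation so a transient fault can pass."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except RuntimeError as error:
            last_error = error
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error


def add_parity(bits):
    """Information redundancy: append an even-parity bit to a list of 0/1 bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0


# Three replicas, one of which is faulty: the voter masks the bad output.
print(majority_vote([42, 42, 17]))

# A single flipped bit changes the parity and is therefore detected.
word = add_parity([1, 0, 1])
corrupted = [word[0], 1 - word[1]] + word[2:]
print(parity_ok(word), parity_ok(corrupted))
```

Note the different guarantees: voting masks a faulty replica, retrying only helps when the fault goes away between attempts, and a single parity bit detects an odd number of flipped bits but cannot correct them.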