CH 4
Attributes
• Availability
• Reliability
• Safety
• Maintainability
Consequences
• Fault
• Error
• Failure
Strategies
• Fault prevention
• Fault tolerance
• Fault recovery
• Fault forecasting
Faults, Errors and Failures
Fault → Error → Failure
• Timing failure: a server's response lies outside the specified time interval.
• Byzantine faults
• If the components exhibit Byzantine faults, then a minimum of
2k+1 components are needed to achieve k-fault tolerance.
– E.g. if 3 processes may report wrong values for the same variable, we
need at least (3 + 1) correct processes reporting the same value to
outvote them, so 3-fault tolerance is possible with 7 components.
• WE NEED MORE GENERAL AGREEMENT………………..
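The 2k+1 rule above can be checked with a few lines of Python. This is only a sketch: the function name `min_replicas` is made up for illustration, and the fail-silent count of k+1 is the usual companion result rather than something stated in the bullets above.

```python
def min_replicas(k, byzantine):
    """Smallest number of components that still tolerates k faulty ones."""
    if byzantine:
        # k + 1 correct components must outvote k arbitrarily wrong ones.
        return 2 * k + 1
    # Fail-silent components simply stop, so one survivor is enough.
    return k + 1


print(min_replicas(3, byzantine=True))   # 7 components for 3 Byzantine faults
print(min_replicas(2, byzantine=False))  # 3 components for 2 fail-silent faults
```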
Active replication
• Active replication is a technique for achieving fault tolerance through
physical redundancy.
• A common instantiation of this is triple modular redundancy (TMR). This
design handles 2-fault tolerance with fail-silent faults or 1-fault tolerance
with Byzantine faults.
– E.g. consider a system where the output of A goes to the input of B
and the output of B goes to the input of C. Without replication, any
single component failure will cause the entire system to fail.
Figure: Triple modular redundancy.
• Each voter has three inputs and one output.
• If two or three of the inputs are the same, the output is equal to that
input.
• If all three inputs are different, the output is UNDEFINED.
E.g. suppose that element A2 fails. Each of the voters V1, V2, and V3 gets
two identical inputs and one different input, and each of them outputs the
correct value to the second stage.
In essence, the effect of A2 failing is completely masked, so that the
inputs to B1, B2, and B3 are exactly the same as they would have been had
no fault occurred.
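The voter rule and the A2 example can be sketched in Python as follows. The computation done by stage A (`stage_a`, here just "add one") and the helper `run_stage` are hypothetical stand-ins chosen for illustration; only the voting rule itself comes from the slide.

```python
from collections import Counter

UNDEFINED = None  # stands in for the voter's "undefined" output


def vote(a, b, c):
    """Majority voter: three inputs, one output."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else UNDEFINED


def stage_a(x):
    """Hypothetical computation performed by each replica A1, A2, A3."""
    return x + 1


def run_stage(x, faulty_output=None):
    """Run A1..A3 on x; if faulty_output is given, A2 emits it instead."""
    outputs = [stage_a(x), stage_a(x), stage_a(x)]
    if faulty_output is not None:
        outputs[1] = faulty_output  # A2 fails and produces a wrong value
    # V1, V2 and V3 each see all three outputs and vote independently.
    return [vote(*outputs) for _ in range(3)]


print(run_stage(5))                    # [6, 6, 6]  no fault
print(run_stage(5, faulty_output=99))  # [6, 6, 6]  A2's fault is masked
print(vote(1, 2, 3))                   # None       all three inputs differ
```

The voters are replicated as well, so a single faulty voter in one stage is treated by the next stage like a single faulty component.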
Agreement in Faulty Distributed Systems
Two-army problem
• Two divisions of an army, A and B, coordinate an attack on enemy army, C.
• A and B are physically separated and use a messenger to communicate.
• A sends a messenger to B with a message of "let's attack at dawn".
• B receives the message and agrees, sending back the messenger with an
"OK" message.
• The messenger arrives at A, but A realizes that B does not know whether the
messenger made it back safely.
• If B is not convinced that A received the acknowledgement, then it will not
be confident that the attack should take place since the army will not win on
its own.
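• SOLUTION: A may choose to send the messenger back to B with a message
of "A received the OK", but A will then be unsure as to whether B received
this message, and so on for every further acknowledgement.
• OPTIMAL SOLUTION: No finite exchange of messages over an unreliable link
can make both sides certain, so the best we can do is hope that it usually works.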
Byzantine Generals Problem
• The other case to consider is that of reliable communication lines but
faulty processors. This is known as the Byzantine Generals Problem.
• In this problem, there are n army generals who head different
divisions, but m of the generals are traitors (faulty) and are trying to
prevent the others from reaching agreement by feeding them incorrect
information. The question is how the loyal generals can still reach
agreement on the sizes of their divisions.
• SOLUTION : Agreement algorithm
• The goal of the algorithm: the loyal generals can still reach
agreement on the sizes of their divisions.
• Overcoming m traitors requires a minimum of 3m+1 participants
(2m+1 loyal generals). This means that more than 2/3 of the
generals must be loyal.
• E.g. if there are 5 traitors, how many loyal generals are needed? And
what is the total number of generals?
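The 3m+1 bound can be worked through mechanically. The helper names below (`generals_needed`, `can_agree`) are invented for this sketch; the arithmetic is just the rule stated above.

```python
def generals_needed(m):
    """Return (total, loyal): the minimum counts that overcome m traitors."""
    total = 3 * m + 1
    loyal = total - m  # equals 2m + 1
    return total, loyal


def can_agree(n, m):
    """True if n generals, m of them traitors, meet the 3m + 1 bound."""
    return n >= 3 * m + 1


print(generals_needed(5))  # (16, 11): 16 generals in total, 11 of them loyal
print(can_agree(4, 1))     # True
print(can_agree(3, 1))     # False
```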
Agreement in Faulty Systems
How does a process group deal with a faulty member?
• The same algorithm as on the previous slide, except now with 2 loyal generals
and 1 traitor. Note: it is no longer possible to determine the majority value in
each column, and the algorithm has failed to produce agreement.
• It has been shown that for the algorithm to work properly, more than two-thirds
of the processes have to be working correctly. That is: if there are m faulty
processes, we need 2m + 1 functioning processes (3m + 1 in total) to reach
agreement.
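The agreement algorithm itself is not reproduced in these notes, so the sketch below assumes a common textbook formulation: every general announces its troop strength, every general then forwards the vector it collected, and each loyal general takes the majority value per column of the vectors it received. The traitor's behaviour (a fresh, distinct lie in every message) and the names `exchange` and `majority` are assumptions made for illustration.

```python
from collections import Counter
from itertools import count

UNKNOWN = None


def majority(values):
    """Return the value held by more than half of `values`, else UNKNOWN."""
    value, votes = Counter(values).most_common(1)[0]
    return value if votes > len(values) / 2 else UNKNOWN


def exchange(strengths, traitors):
    """Run both message rounds; return {loyal general: result vector}."""
    n = len(strengths)
    lie = count(1000)  # hypothetical traitor: every lie is a new number

    # Round 1: heard[j][i] is what general j heard from general i.
    heard = [[next(lie) if i in traitors else strengths[i] for i in range(n)]
             for j in range(n)]

    def relayed(sender):
        # Round 2: loyal generals forward their vector, traitors send lies.
        if sender in traitors:
            return [next(lie) for _ in range(n)]
        return heard[sender]

    results = {}
    for r in range(n):
        if r in traitors:
            continue
        received = [relayed(s) for s in range(n) if s != r]
        # Column i holds every received claim about general i's strength.
        results[r] = [majority([vec[i] for vec in received]) for i in range(n)]
    return results


print(exchange([1, 2, 3, 4], traitors={2}))  # loyal generals all get [1, 2, None, 4]
print(exchange([1, 2, 3], traitors={2}))     # no column has a majority
```

With 4 generals and 1 traitor, every loyal general ends up with the same vector, with the traitor's entry marked unknown. With 3 generals and 1 traitor, each loyal general receives only two vectors that disagree in every column, so nothing can be determined.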
Real-time Distributed Systems
• What is a real-time system?
– Definition 1:
– A real-time system is one that must process information and produce
a response within a specified time. That is, in a system with a real-
time constraint it is no good to have the correct action or the correct
answer after a certain deadline: it is either by the deadline or it is
useless!
– Definition 2:
– Any system in which the time at which output is produced is
significant. This is usually because the input corresponds to some
event in the physical world, and the output has to relate to that same
event.
• Examples: ticket reservation system at an airport, over-temperature monitor
in a nuclear power station, mobile phone
• It becomes more and more difficult for a single machine or component
to meet all the deadlines, especially when multiple events occur at
the same time.
• This problem gave rise to distributed real-time systems.
E.g. a distributed security control system
• The response time requirements of hard real-time
systems are on the order of milliseconds or less,
and missing them can result in a catastrophe.
Example: patient diagnosis system
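The "by the deadline or useless" idea from Definition 1 can be illustrated with a small Python sketch. The function name `respond_by_deadline` and the injectable `clock` parameter are assumptions made for this example. It only detects a missed deadline after the fact, whereas a real hard real-time system needs scheduling guarantees so the deadline is never missed in the first place.

```python
import time


def respond_by_deadline(compute, deadline_s, clock=time.monotonic):
    """Run compute(); return its result only if it finished within deadline_s."""
    start = clock()
    result = compute()
    elapsed = clock() - start
    if elapsed > deadline_s:
        # A correct answer that arrives late counts as no answer at all.
        return None
    return result


# Usage with a fake clock, so the outcome does not depend on machine speed.
on_time = iter([0.0, 0.0005]).__next__  # finishes after 0.5 ms
too_late = iter([0.0, 0.002]).__next__  # finishes after 2 ms

print(respond_by_deadline(lambda: 42, 0.001, clock=on_time))   # 42
print(respond_by_deadline(lambda: 42, 0.001, clock=too_late))  # None
```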