CH 4

Uploaded by

Natanem Yimer
Chapter Four

Fault Tolerance and Replication in Distributed Systems
4.1 Faults, Errors and Failures
4.2 Fault Classification
4.3 Failure Models
4.4 Replication (Redundancy)
4.5 Agreement in Faulty Distributed Systems
4.6 Real-time Distributed Systems
Introduction
• Hardware, software and networks cannot be totally free from failures.
• Fault tolerance is a non-functional requirement: the system must continue to operate even in the presence of faults.
• Distributed systems can be more fault tolerant than centralized systems (where a failure is often total), but with more hosts, individual faults are likely to occur more frequently.
• This gives rise to the notion of partial failure in a distributed system: some components fail while others keep working.
• In distributed systems, replication and redundancy can be hidden from users (by the provision of transparency).
Faults
► Faults: attributes, consequences and strategies

Attributes
• Availability
• Reliability
• Safety
• Maintainability

Consequences
• Fault
• Error
• Failure

Strategies
• Fault prevention
• Fault tolerance
• Fault recovery
• Fault forecasting
Faults, Errors and Failures
Fault → Error → Failure

► A fault is a defect within the system.
► A faulty component may be fail-silent (it stops functioning) or Byzantine (it keeps running but produces incorrect results).
► An error is observed as a deviation from the expected behavior of the system.
► A failure occurs when the system can no longer perform as required (it does not meet its specification).
► Fault tolerance is the ability of a system to provide its service even in the presence of errors.
Example
• In a software system, an incorrectly written instruction in a program may decrement an internal variable instead of incrementing it. If this statement is executed, an incorrect value is written. If other program statements then use this value, the whole system deviates from its desired behavior.
• In this case,
– the erroneous statement is the fault,
– the invalid value is the error, and
– the failure is the behavior that results from the error.
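The chain above can be made concrete with a small sketch. The function names and the capacity check are hypothetical and only illustrate where the fault, the error and the failure sit in a program.

```python
def count_events_buggy(events):
    """Count events, using a deliberately wrong instruction (the fault)."""
    counter = 0
    for _ in events:
        counter -= 1  # FAULT: the instruction should be `counter += 1`
    return counter    # ERROR: the value handed back is invalid


def can_admit(events, capacity):
    """Another statement that uses the counter to make a decision."""
    return count_events_buggy(events) < capacity


# FAILURE: 5 events against a capacity of 3 should be refused,
# but the system admits them because it relies on the invalid count.
admitted = can_admit(["e"] * 5, capacity=3)
print(admitted)  # prints True
```

If `count_events_buggy` were never called, the fault would stay dormant and no error or failure would be observed.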
Basic Concepts
• Systems attributes
– Availability – the system is ready to be used immediately.
– Reliability – the system can run continuously without failure.
– Safety – if the system fails, nothing catastrophic will happen.
– Maintainability – when the system fails, it can be repaired easily and quickly (and, sometimes, without its users noticing the failure).
Types of Fault
• There are three main types of fault:
1. Transient fault – appears once, then disappears.
– E.g. a network message does not reach its destination, but does when it is retransmitted.
2. Intermittent fault – occurs, vanishes, then reappears, following no real pattern (the worst kind).
– E.g. a loose connection.
3. Permanent fault – once it occurs, only the replacement or repair of the faulty component will allow the distributed system to function normally.
– E.g. software bugs.
Classification of Failure Models
Different types of failures, with brief descriptions.

Type of failure                  Description
Crash failure                    A server halts, but is working correctly until it halts.
Omission failure                 A server fails to respond to incoming requests.
  Receive omission               A server fails to receive incoming messages.
  Send omission                  A server fails to send outgoing messages.
Timing failure                   A server's response lies outside the specified time interval.
Response failure                 The server's response is incorrect.
  Value failure                  The value of the response is wrong.
  State transition failure       The server deviates from the correct flow of control.
Arbitrary (Byzantine) failure    A server may produce arbitrary responses at arbitrary times.
Redundancy (Replication)

• The general approach to building fault-tolerant systems is redundancy, which may be applied at several levels.
• Information redundancy provides fault tolerance by replicating or coding the data.
– E.g. adding extra bits to the data so that a certain proportion of corrupted bits can be recovered.
• Time redundancy achieves fault tolerance by performing an operation several times.
– This form of redundancy is useful for transient or intermittent faults; it is of no use with permanent faults.
– E.g. retransmission of packets.
• Physical redundancy adds extra equipment so that the system can tolerate the loss of some failed components.
– E.g. redundant disks and backup name servers.
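Time redundancy by retransmission can be sketched as a simple retry loop. This is a minimal illustration, not a real network API: `unreliable_send` is a hypothetical channel that loses messages with a given probability, and a loss probability of 1.0 stands in for a permanent fault.

```python
import random


def unreliable_send(message, loss_probability, rng):
    """Hypothetical channel: returns True if the message was delivered."""
    return rng.random() >= loss_probability


def send_with_retries(message, max_attempts, loss_probability, rng):
    """Repeat the send until it succeeds or the attempts run out.

    Returns the attempt number that succeeded, or None if every
    attempt was lost.
    """
    for attempt in range(1, max_attempts + 1):
        if unreliable_send(message, loss_probability, rng):
            return attempt
    return None


rng = random.Random(42)
# Transient losses: retrying is likely to get the message through eventually.
print(send_with_retries("hello", max_attempts=10, loss_probability=0.5, rng=rng))
# Permanent fault (every send is lost): repeating the operation does not help.
print(send_with_retries("hello", max_attempts=10, loss_probability=1.0, rng=rng))
```

The second call always returns `None`, which matches the point above that time redundancy is of no use with permanent faults.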
How much redundancy?
• Fail-silent faults
• A system is said to be k-fault tolerant if it can withstand k faults. If the components fail silently, k+1 components are sufficient to achieve k-fault tolerance.
– E.g. 3 power supplies are 2-fault tolerant: 2 power supplies can fail and the system will still function.

• Byzantine faults
• If the components exhibit Byzantine faults, a minimum of 2k+1 components is needed to achieve k-fault tolerance, so that the k+1 correct components can outvote the k faulty ones.
– E.g. if 3 processes may report wrong values for the same variable, at least 3 + 1 = 4 correct processes must report the same value to form a majority, so 3-fault tolerance requires 2(3) + 1 = 7 components.
• WE NEED MORE GENERAL AGREEMENT………………..
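The two rules above (k+1 for fail-silent faults, 2k+1 for Byzantine faults) can be written as a small calculation. The function names are illustrative only.

```python
def components_needed(k, byzantine):
    """Minimum number of components for k-fault tolerance."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Byzantine: k+1 correct components must outvote k faulty ones.
    # Fail-silent: one surviving component is enough.
    return 2 * k + 1 if byzantine else k + 1


def faults_tolerated(n, byzantine):
    """Largest k that n components can tolerate under the same rules."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 2 if byzantine else n - 1


print(components_needed(2, byzantine=False))  # 3 power supplies
print(components_needed(3, byzantine=True))   # 7 components
print(faults_tolerated(3, byzantine=True))    # 1
```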
Active replication
• Active replication is a technique for achieving fault tolerance through physical redundancy.
• A common instantiation is triple modular redundancy (TMR). With three replicas, this design provides 2-fault tolerance for fail-silent faults or 1-fault tolerance for Byzantine faults.
– E.g. consider a system where the output of A goes to the input of B and the output of B goes to the input of C. Without redundancy, any single component failure causes the entire system to fail.
Figure: Triple modular redundancy.
• Each voter has three inputs and one output.
• If two or three of the inputs are the same, the output is equal to that input.
• If all three inputs are different, the output is UNDEFINED.
• E.g. suppose that element A2 fails. Each of the voters V1, V2 and V3 gets two identical inputs and one different input, and each of them outputs the correct value to the second stage. In essence, the effect of A2 failing is completely masked, so that the inputs to B1, B2 and B3 are exactly the same as they would have been had no fault occurred.
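The voter rule and the A2 example can be sketched as follows. This is a minimal software model of one TMR stage, with hypothetical replica functions standing in for the hardware elements A1, A2 and A3.

```python
from collections import Counter

UNDEFINED = None  # stands for the voter's UNDEFINED output


def vote(a, b, c):
    """Majority voter: three inputs, one output."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else UNDEFINED


def tmr_stage(replicas, inputs):
    """Run three replicas, then let three voters (V1..V3) vote on all outputs."""
    outputs = [replica(x) for replica, x in zip(replicas, inputs)]
    return [vote(*outputs) for _ in range(3)]


def correct_a(x):
    return x + 1


def faulty_a2(x):
    return -999  # A2 has failed and produces a wrong value


# A2 fails, but every voter still sees two identical correct inputs.
to_b_stage = tmr_stage([correct_a, faulty_a2, correct_a], [1, 1, 1])
print(to_b_stage)  # prints [2, 2, 2]
```

The values passed on to the B stage are the same as in the fault-free case, which is what masking means here.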
Agreement in Faulty Distributed Systems

• Distributed processes often have to agree on something:
– to elect a coordinator,
– to commit a transaction,
– to divide tasks,
– to enter a critical section, etc.
• What happens when the processes and/or the communication lines are imperfect?
• Faulty communication lines: the two-army problem
• Faulty processes: the Byzantine Generals Problem

Two-army problem
• Two divisions of an army, A and B, coordinate an attack on an enemy army, C.
• A and B are physically separated and use a messenger to communicate.
• A sends a messenger to B with the message "let's attack at dawn".
• B receives the message and agrees, sending the messenger back with an "OK" message.
• The messenger arrives at A, but A realizes that B does not know whether the messenger made it back safely.
• If B is not convinced that A received the acknowledgement, it will not be confident that the attack should take place, since B's division will not win on its own.
• SOLUTION: A may choose to send the messenger back to B with the message "A received the OK", but A will then be unsure whether B received this message.
• OPTIMAL SOLUTION: no finite exchange of acknowledgements removes the uncertainty; the best we can do is hope that it usually works.
Byzantine Generals Problem
• The other case to consider is that of reliable communication lines but faulty processors. This is known as the Byzantine Generals Problem.
• In this problem, there are n army generals who head different divisions, but m of the generals are traitors (faulty) and try to prevent the others from reaching agreement by feeding them incorrect information. The question is how the loyal generals can still reach agreement on the sizes of their divisions.
• SOLUTION: an agreement algorithm
• The goal of the algorithm: the loyal generals can still reach agreement on the sizes of their divisions.
• Overcoming m traitors requires a minimum of 3m+1 participants (2m+1 loyal generals). This means that more than 2/3 of the generals must be loyal.
• E.g. if there are 5 traitors, how many loyal generals are needed? And what is the total number of generals?
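The 3m+1 bound can be checked with a short calculation, here applied to the case of 5 traitors. The function names are illustrative only.

```python
def min_total_generals(m):
    """Minimum number of participants needed to overcome m traitors."""
    return 3 * m + 1


def min_loyal_generals(m):
    """Minimum number of loyal generals among those participants."""
    return 2 * m + 1


def agreement_possible(n, m):
    """True if n generals, m of them traitors, satisfy the 3m+1 bound."""
    return n >= min_total_generals(m)


m = 5
print(min_loyal_generals(m))      # prints 11
print(min_total_generals(m))      # prints 16
print(agreement_possible(15, 5))  # prints False: one general short
```

With 5 traitors and 11 loyal generals, 11/16 of the generals are loyal, which is more than two-thirds.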
Agreement in Faulty Systems (1)
How does a process group deal with a faulty member?

• The Byzantine Generals Problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldier) to the other members of the group by sending a message.
b) Each general assembles a vector from the values received in (a); each general knows their own strength. They then send their vectors to all the other generals.
c) Each general examines the vectors received in step (b). It is clear to all that General 3 is the traitor. In each 'column', the majority value is assumed to be correct.
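Steps (a) to (c) can be simulated in a few lines. This is a sketch under stated assumptions: the troop strengths, the traitor's lies and the junk vectors are made-up inputs, and the traitor's behavior is supplied by the caller rather than modeled.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def majority(values):
    """Return the strict-majority value of a column, or UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN


def run_agreement(strengths, traitor, lies, junk_vectors):
    """Simulate steps (a)-(c) and return each loyal general's decision.

    strengths:    general id -> true troop strength
    traitor:      id of the faulty general
    lies:         receiver id -> value the traitor announces in step (a)
    junk_vectors: receiver id -> vector the traitor sends in step (b)
    """
    ids = sorted(strengths)
    loyal = [g for g in ids if g != traitor]

    # Steps (a)/(b): each loyal general assembles a vector of announced values.
    vectors = {
        g: tuple(lies[g] if s == traitor else strengths[s] for s in ids)
        for g in loyal
    }

    # Step (c): each loyal general takes the majority of every column
    # over the vectors received from all the other generals.
    decisions = {}
    for g in loyal:
        received = [
            junk_vectors[g] if s == traitor else vectors[s]
            for s in ids if s != g
        ]
        decisions[g] = tuple(majority(column) for column in zip(*received))
    return decisions


# 3 loyal generals (1, 2, 4) and 1 traitor (3).
four = run_agreement(
    strengths={1: 1, 2: 2, 3: 3, 4: 4},
    traitor=3,
    lies={1: "x", 2: "y", 4: "z"},
    junk_vectors={1: ("a", "b", "c", "d"),
                  2: ("e", "f", "g", "h"),
                  4: ("i", "j", "k", "l")},
)
print(four)

# 2 loyal generals (1, 2) and 1 traitor (3): no column reaches a majority.
three = run_agreement(
    strengths={1: 1, 2: 2, 3: 3},
    traitor=3,
    lies={1: "x", 2: "y"},
    junk_vectors={1: ("a", "b", "c"), 2: ("d", "e", "f")},
)
print(three)
```

In the four-general run every loyal general ends up with the same vector, with the traitor's entry marked UNKNOWN. In the three-general run every entry is UNKNOWN, so the loyal generals learn nothing about each other.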
Agreement in Faulty Systems (2)

• The same algorithm as on the previous slide, except now with 2 loyal generals and 1 traitor. Note: it is no longer possible to determine the majority value in each column, and the algorithm fails to produce agreement.
• It has been shown that for the algorithm to work properly, more than two-thirds of the processes have to be working correctly. That is, if there are m faulty processes, we need 2m + 1 functioning processes to reach agreement.
Real-time Distributed Systems
• What is a real-time system?
– Definition 1:
– A real-time system is one that must process information and produce a response within a specified time. That is, in a system with a real-time constraint, a correct action or answer delivered after the deadline is of no use: it arrives by the deadline or it is useless!

– Definition 2:
– Any system in which the time at which output is produced is significant. This is usually because the input corresponds to some event in the physical world, and the output has to relate to that same event.
• Examples: a ticket reservation system at an airport, an over-temperature monitor in a nuclear power station, a mobile phone.
• It becomes more and more difficult for a single machine or component to meet all the deadlines, especially when multiple events occur at the same time.
• This problem gave rise to distributed real-time systems. E.g. a distributed security control system.
• The response time requirements of hard real-time systems are in the order of milliseconds or less, and missing them can result in a catastrophe. Example: patient diagnosis system.
• In contrast, the response time requirements of soft real-time systems are longer and not very strict. Example: online reservation systems.
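The difference between hard and soft deadlines can be illustrated with a small sketch. It only measures elapsed time after the task has run, which a real hard real-time system would not rely on; `run_with_deadline` and `slow_task` are hypothetical names used for illustration.

```python
import time


def run_with_deadline(task, deadline_s, hard):
    """Run task() and compare its elapsed time with the deadline.

    Returns (result, met_deadline). For a hard deadline, a late
    result is treated as useless and TimeoutError is raised instead.
    """
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    if elapsed <= deadline_s:
        return result, True
    if hard:
        raise TimeoutError(f"missed hard deadline: {elapsed:.3f}s > {deadline_s}s")
    # Soft deadline: the late result is still delivered, flagged as late.
    return result, False


def slow_task():
    time.sleep(0.05)  # simulate work that takes about 50 ms
    return "done"


print(run_with_deadline(slow_task, deadline_s=1.0, hard=True))
print(run_with_deadline(slow_task, deadline_s=0.001, hard=False))
try:
    run_with_deadline(slow_task, deadline_s=0.001, hard=True)
except TimeoutError as exc:
    print("hard deadline missed:", exc)
```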
• Clock synchronization
• Example for both cases: elevator controller
