CH 4

Uploaded by

Natanem Yimer

Chapter Four

Fault Tolerance and Replication in Distributed Systems
4.1 Faults, Errors and Failures
4.2 Fault Classification
4.3 Failure models
4.4 Replication (Redundancy)
4.5 Agreement in Faulty Distributed Systems
4.6 Real-time Distributed Systems
Introduction
• Hardware, software and networks cannot be totally free from failures.
• Fault tolerance is a non-functional requirement that requires a system to continue to operate, even in the presence of faults.
• Distributed systems can be more fault tolerant than centralized systems (where a failure is often total), but with more processor hosts individual faults are likely to occur more frequently.
• A distributed system introduces the notion of a partial failure: some components fail while the others keep running.
• In distributed systems, replication and redundancy can be hidden (by the provision of transparency).
Faults
► Faults: attributes, consequences and strategies

Attributes
• Availability
• Reliability
• Safety
• Maintainability
Consequences
• Fault
• Error
• Failure
Strategies
• Fault prevention
• Fault tolerance
• Fault recovery
• Fault forecasting
Faults, Errors and Failures
Fault → Error → Failure

► A fault is a defect within the system.
– A faulty component may be fail-silent (it stops functioning) or Byzantine (it keeps running and produces incorrect results).
► An error is observed as a deviation from the expected behavior of the system.
► A failure occurs when the system can no longer perform as required (it does not meet its specification).
► Fault tolerance is the ability of a system to provide a service even in the presence of faults.
Example
• For example, in a software system, an incorrectly written
instruction in a program may decrement an internal variable
instead of incrementing it. Clearly, if this statement is
executed, it will result in the incorrect value being written. If
other program statements then use this value, the whole
system will deviate from its desired behavior.
• In this case,
– the erroneous statement is the fault,
– the invalid value is the error, and
– the failure is the behavior that results from the error.
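The decrement example above can be made concrete with a minimal sketch. The function names `count_events` and `report` are hypothetical and exist only for illustration; the comments mark where the fault, the error and the failure appear.

```python
def count_events(events):
    """Count events; contains a deliberate fault for illustration."""
    counter = 0
    for _ in events:
        counter -= 1  # FAULT: the statement should be `counter += 1`
    return counter


def report(events):
    """Build user-visible output from the (possibly invalid) counter."""
    total = count_events(events)        # ERROR: an invalid value in the state
    return f"{total} events processed"  # FAILURE: output deviates from the spec


if __name__ == "__main__":
    # The specification expects "3 events processed" for three events.
    print(report(["a", "b", "c"]))
```

Note that the fault only turns into an error when the faulty statement is executed, and the error only turns into a failure when another part of the system uses the invalid value.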
Basic Concepts
• System attributes
– Availability – the system is ready to be used immediately.
– Reliability – the system can run continuously without failure.
– Safety – if the system fails, nothing catastrophic will happen.
– Maintainability – when the system fails, it can be repaired easily and quickly (and, sometimes, without its users noticing the failure).
Types of Fault
• There are three main types of fault:
1. Transient fault – appears once, then disappears.
– E.g. a network message does not reach its destination, but does when it is retransmitted.
2. Intermittent fault – occurs, vanishes, then reappears, following no real pattern (the worst kind).
– E.g. a loose connection.
3. Permanent fault – once it occurs, only the replacement or repair of the faulty component will allow the distributed system to function normally.
– E.g. software bugs.
Classification of Failure Models
Different types of failures, with brief descriptions.

Type of failure                Description
Crash failure                  A server halts, but is working correctly until it halts.
Omission failure               A server fails to respond to incoming requests.
  Receive omission             A server fails to receive incoming messages.
  Send omission                A server fails to send outgoing messages.
Timing failure                 A server's response lies outside the specified time interval.
Response failure               The server's response is incorrect.
  Value failure                The value of the response is wrong.
  State transition failure     The server deviates from the correct flow of control.
Arbitrary (Byzantine) failure  A server may produce arbitrary responses at arbitrary times.
Redundancy (Replication)

• The general approach to building fault tolerant systems is redundancy. Redundancy may be applied at several levels.
• Information redundancy seeks to provide fault tolerance through replicating or coding the data.
– E.g. using extra bits in the data to recover a certain ratio of failed bits.
• Time redundancy achieves fault tolerance by performing an operation several times.
– This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
– E.g. retransmission of packets.
• Physical redundancy adds extra equipment to enable the system to tolerate the loss of some failed components.
– E.g. disks and backup name servers.
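Time redundancy by retransmission can be sketched as a simple retry loop. This is only an illustration under stated assumptions: `unreliable_send` is a hypothetical stand-in for a lossy channel, and `loss_probability` and `max_attempts` are made-up parameters, not part of any real networking API.

```python
import random


class TransientFault(Exception):
    """Raised when a single transmission attempt is lost."""


def unreliable_send(message, loss_probability, rng):
    """Hypothetical channel: drops the message with the given probability."""
    if rng.random() < loss_probability:
        raise TransientFault(f"lost: {message!r}")
    return "ACK"


def send_with_retry(message, max_attempts=5, loss_probability=0.5, rng=None):
    """Time redundancy: repeat the operation until it succeeds or we give up.

    Returns (number of attempts used, reply).
    """
    rng = rng or random.Random()
    for attempt in range(1, max_attempts + 1):
        try:
            reply = unreliable_send(message, loss_probability, rng)
            return attempt, reply
        except TransientFault:
            continue  # a transient fault may clear, so simply try again
    # A permanent fault never clears, so repetition alone cannot mask it.
    raise TransientFault(f"gave up after {max_attempts} attempts")
```

Setting `loss_probability=1.0` models a permanent fault: every attempt fails and the retry loop gives up, which matches the remark that time redundancy is of no use with permanent faults.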
How much redundancy?
• Fail-silent faults
• A system is said to be k-fault tolerant if it can withstand k faults. If the components fail silently, then it is sufficient to have k+1 components to achieve k-fault tolerance.
– E.g. 3 power supplies are 2-fault tolerant: 2 power supplies can fail and the system will still function.

• Byzantine faults
• If the components exhibit Byzantine faults, then a minimum of 2k+1 components is needed to achieve k-fault tolerance, so that the k+1 correct components can outvote the k faulty ones.
– E.g. if up to 3 processes may report wrong values for the same variable, at least 3 + 1 = 4 correct processes must report the same value to form a majority, so 3-fault tolerance requires 2(3) + 1 = 7 components.
• When the processes themselves have to agree with each other, we need a more general notion of agreement.
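The two sizing rules above (k+1 for fail-silent faults, 2k+1 for Byzantine faults) can be written as a small helper. The function name `min_components` and the model labels are assumptions chosen for this sketch.

```python
def min_components(k, fault_model):
    """Minimum number of components needed for k-fault tolerance.

    fault_model is "fail-silent" (k + 1) or "byzantine" (2k + 1),
    following the two rules stated above.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if fault_model == "fail-silent":
        return k + 1
    if fault_model == "byzantine":
        return 2 * k + 1
    raise ValueError(f"unknown fault model: {fault_model!r}")


if __name__ == "__main__":
    print(min_components(2, "fail-silent"))  # the power-supply example
    print(min_components(3, "byzantine"))    # the 7-component example
```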
Active replication
• Active replication is a technique for achieving fault tolerance through physical redundancy.
• A common instantiation of this is triple modular redundancy (TMR). With three replicas, this design provides 2-fault tolerance with fail-silent faults or 1-fault tolerance with Byzantine faults.
– E.g. consider a system in which the output of A goes to the input of B and the output of B goes to the input of C. Without replication, any single component failure will cause the entire system to fail.
Figure: Triple modular redundancy.
• Each voter has three inputs and one output.
• If two or three of the inputs are the same, the output is equal to that input.
• If all three inputs are different, the output is UNDEFINED.
• E.g. suppose that element A2 fails. Each of the voters V1, V2 and V3 gets two identical inputs and one different input, and each of them outputs the correct value to the second stage. In essence, the effect of A2 failing is completely masked, so that the inputs to B1, B2 and B3 are exactly the same as they would have been had no fault occurred.
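The voter rule described above can be sketched in a few lines. The names `vote`, `tmr_stage` and the use of `None` for UNDEFINED are choices made for this illustration only.

```python
UNDEFINED = None  # stands for the voter's UNDEFINED output


def vote(a, b, c):
    """Majority voter: three inputs, one output.

    Returns the value shared by at least two inputs, or UNDEFINED
    when all three inputs differ.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return UNDEFINED


def tmr_stage(replica_outputs):
    """Feed the outputs of three replicas (e.g. A1, A2, A3) to three voters.

    Returns the outputs of V1, V2 and V3, which become the inputs of the
    next stage (B1, B2, B3).
    """
    a1, a2, a3 = replica_outputs
    return [vote(a1, a2, a3) for _ in range(3)]


if __name__ == "__main__":
    # A2 has failed and produces a wrong value; the fault is masked.
    print(tmr_stage([7, 0, 7]))
```

Using three voters rather than one reflects the design in the figure: a voter is itself a component that can fail, so it is replicated as well.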
Agreement in Faulty Distributed Systems

• Distributed processes often have to agree on something:
– to elect a coordinator,
– to commit a transaction,
– to divide tasks,
– to enter a critical section, etc.
• What happens when the processes and/or the communication lines are imperfect?
• Faulty communication lines: the two-army problem
• Faulty processes: the Byzantine Generals Problem

Two-army problem
• Two divisions of an army, A and B, coordinate an attack on an enemy army, C.
• A and B are physically separated and use a messenger to communicate.
• A sends a messenger to B with the message "let's attack at dawn".
• B receives the message and agrees, sending the messenger back with an "OK" message.
• The messenger arrives at A, but A realizes that B does not know whether the messenger made it back safely.
• If B is not convinced that A received the acknowledgement, then it will not be confident that the attack should take place, since neither division can win on its own.
• ATTEMPTED SOLUTION: A may choose to send the messenger back to B with the message "A received the OK", but A will then be unsure whether B received this message, and so on without end.
• CONCLUSION: No protocol can guarantee agreement over an unreliable channel. The best we can do is hope that it usually works.
Byzantine Generals Problem
• The other case to consider is that of reliable communication lines but faulty processors. This is known as the Byzantine Generals Problem.
• In this problem, there are n army generals who head different divisions, but m of the generals are traitors (faulty) and are trying to prevent the others from reaching agreement by feeding them incorrect information. The question is how the loyal generals can still reach agreement on the size of each division.
• SOLUTION: an agreement algorithm.
• The goal of the algorithm: the loyal generals can still reach agreement on the size of each division.
• Overcoming m traitors requires a minimum of 3m+1 participants (that is, 2m+1 loyal generals). This means that more than 2/3 of the generals must be loyal.
• E.g. if there are 5 traitors, how many loyal generals are needed? And what is the total number of generals?
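The 3m+1 rule can be checked with a short calculation. The helper name `byzantine_minimums` is hypothetical; it simply applies the formulas stated above.

```python
from fractions import Fraction


def byzantine_minimums(m):
    """Return (minimum loyal generals, minimum total generals) for m traitors."""
    if m < 0:
        raise ValueError("m must be non-negative")
    loyal = 2 * m + 1
    total = 3 * m + 1
    return loyal, total


if __name__ == "__main__":
    loyal, total = byzantine_minimums(5)
    print(loyal, total)
    # The loyal fraction must exceed two thirds.
    print(Fraction(loyal, total) > Fraction(2, 3))
```

For m = 5 this gives 2(5) + 1 = 11 loyal generals out of 3(5) + 1 = 16 generals in total, and 11/16 is indeed greater than 2/3.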
Agreement in Faulty Systems (1)
How does a process group deal with a faulty member?

• The "Byzantine Generals Problem" for 3 loyal generals and 1 traitor.

a) The generals announce their troop strengths (in units of 1 kilosoldier) to the other members of the group by sending a message.
b) Each general assembles a vector from the values received in (a); each general knows their own strength. The generals then send their vectors to all the other generals.
c) The vectors that each general receives from the others after (b). In each 'column', the majority value is assumed to be correct. It is clear to all that General 3 is the traitor.
Agreement in Faulty Systems (2)

• The same algorithm as in the previous slide, except now with 2 loyal generals and 1 traitor. Note: it is no longer possible to determine the majority value in each column, and the algorithm has failed to produce agreement.
• It has been shown that for the algorithm to work properly, more than two-thirds of the processes have to be working correctly. That is: if there are M faulty processes, we need 2M + 1 functioning processes to reach agreement.
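The vector exchange in steps (a) to (c) can be simulated with a short sketch. It makes several assumptions that are not in the slides: generals are numbered from 0 (so the slides' General 3 is index 2), a traitor sends a fresh distinct bogus value every time it sends anything, and a column is decided only on a strict majority of the received vectors. The names `byzantine_exchange` and `UNKNOWN` are illustrative.

```python
from collections import Counter
from itertools import count

UNKNOWN = "UNKNOWN"


def byzantine_exchange(strengths, traitors):
    """Run the vector-exchange steps for generals 0..n-1.

    strengths[i] is the true troop strength of general i and `traitors`
    is the set of faulty generals. Returns {loyal general: result vector}.
    """
    n = len(strengths)
    fresh = count()  # source of distinct bogus values

    def lie():
        return f"bogus{next(fresh)}"

    # Steps (a) and (b): every general announces a strength and each
    # general assembles a vector. vector[i][j] is what general i was told
    # about general j; a traitor tells every receiver something different.
    vector = [
        [strengths[j] if (j == i or j not in traitors) else lie()
         for j in range(n)]
        for i in range(n)
    ]

    def sent_vector(sender):
        """Vector a sender transmits; a traitor sends a new bogus one each time."""
        if sender in traitors:
            return [lie() for _ in range(n)]
        return vector[sender]

    # Step (c): each loyal general takes the per-column majority of the
    # vectors received from the other generals.
    results = {}
    for i in range(n):
        if i in traitors:
            continue
        received = [sent_vector(s) for s in range(n) if s != i]
        decided = []
        for j in range(n):
            value, votes = Counter(v[j] for v in received).most_common(1)[0]
            decided.append(value if votes > len(received) / 2 else UNKNOWN)
        results[i] = decided
    return results


if __name__ == "__main__":
    print(byzantine_exchange([1, 2, 3, 4], traitors={2}))  # 3 loyal, 1 traitor
    print(byzantine_exchange([1, 2, 3], traitors={2}))     # 2 loyal, 1 traitor
```

With four generals every loyal general ends up with the same vector, in which only the traitor's entry is UNKNOWN. With three generals no column reaches a majority, which is the failure described on the second slide.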
Real time Distributed Systems
• What is a real-time system?
– Definition 1: A real-time system is one that must process information and produce a response within a specified time. In a system with a real-time constraint, a correct action or answer that arrives after the deadline is of no use: it is either delivered by the deadline or it is useless.
– Definition 2: Any system in which the time at which output is produced is significant. This is usually because the input corresponds to some event in the physical world, and the output has to relate to that same event.
• Examples: ticket reservation system at an airport, over-temperature monitor in a nuclear power station, mobile phone.
• It becomes more and more difficult for a single machine or component to meet all the deadlines, especially when multiple events occur at the same time.
• This problem gave rise to distributed real-time systems. E.g. a distributed security control system.
• The response time requirements of hard real-time systems are in the order of milliseconds or less, and missing them can result in a catastrophe. Example: patient diagnosis system.
• In contrast, the response time requirements of soft real-time systems are longer and not very strict. Example: online reservation systems.
• Example for both cases: elevator controller.
• Clock synchronization