MODULE III
Recovery in
Distributed Systems
Recovery
• Failure of a site/node in a distributed system causes
inconsistencies in the state of the system.
• Recovery: bringing back the failed node in step with
other nodes in the system.
• Classification of Failures:
• Process failure:
• Deadlocks, protection violation, erroneous user input,
etc.
• System failure:
• Failure of processor/system. System failure can have
full/partial amnesia.
• It can be a pause failure (system restarts at the same
state it was in before the crash) or a complete halt.
• Secondary storage failure: data inaccessible.
• Communication failure: network inaccessible.
An error is the manifestation of a fault and can lead to failure.
[Figure: a fault produces an erroneous system state, which manifests as a process/system failure.]
Backward & Forward
Recovery
• Forward Recovery:
• Assess damages that could be caused by faults, remove those
damages (errors), and help processes continue.
• Forward assessment is difficult to do and generally tough.
• Backward Recovery:
• When forward assessment not possible. Restore processes to
previous error-free state.
• Expensive to rollback states
• Does not eliminate same fault occurring again (i.e. loop
on a fault + recovery)
• Unrecoverable actions: printouts, cash dispensed at ATMs.
Problems with Backward Error
Recovery Approach
• The major problems associated with the backward
error recovery approach are:
• Performance Penalty : The overhead to restore a
process state to a prior state can be quite high.
• There is no guarantee that faults will not occur again
when processing begins from a prior state.
• Some components of the system state may be unrecoverable.
• Forward error recovery techniques incur less overhead, because only those parts of the state that deviate from the intended value need to be corrected.
Recovery System Model
• The system is a single machine consisting of stable storage and secondary storage.
• Storage that does not lose information in the event of a system failure is called stable storage.
• Stable storage is used to store the logs and recovery
points.
• It is assumed that data on the secondary storage is
archived periodically.
[Figure: a CPU connected to main memory, secondary storage, and stable storage.]
Recovery System Model
• For Backward Recovery:
• Backward recovery is simpler than forward recovery, as it is independent of the fault and the error caused by the fault.
• A single system with secondary and stable storage
• Stable storage does not lose information on failures
• Stable storage used for logs and recovery points
• Stable storage assumed to be more secure than
secondary storage.
• Data on secondary storage assumed to be archived
periodically.
Approaches
• Operation-based Approach
– Maintaining logs: all modifications to the state of a process are
recorded in sufficient detail so that a previous state can be
restored by reversing all changes made to the state.
– (e.g.,) Commit in database transactions: if a transaction commits at all nodes, its changes are permanent; if it does not commit, the effects of the transaction are undone.
– Updating-in-place: Every write (update) results in a log of (1)
object name (2) old object state (3) new state. Operations:
• A do operation updates & writes the log
• An undo operation uses the log to remove the effect of a do
• A redo operation uses the log to repeat a do
– Write-ahead-log: To avoid the problem of a crash after update
and before logging.
• Write (undo & redo) logs before update
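A minimal sketch, assuming an in-memory store and log stand in for secondary and stable storage, of the do/undo/redo operations above (all names are illustrative):

```python
store = {"x": 0}   # object store (secondary-storage stand-in)
log = []           # each record: (object name, old state, new state)

def do_update(name, new_value):
    log.append((name, store[name], new_value))  # write-ahead: log first
    store[name] = new_value                     # then update in place

def undo_all():
    # remove the effects of the do's, newest first, using the old states
    for name, old, _ in reversed(log):
        store[name] = old

def redo_all():
    # repeat the do's, oldest first, using the new states
    for name, _, new in log:
        store[name] = new
```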
Approaches
• State-based Approach
• Establish a recovery point where the process state is saved.
• Recovery is done by restoring the process to its state at the recovery point, called a checkpoint. This recovery process is called rollback.
• The process of saving the state is called checkpointing or taking a checkpoint.
• Rollback normally done to the most recent checkpoint, hence
many checkpoints are done over the execution of a process.
• Shadow pages technique can be used for checkpointing. Page
containing the object to be updated is duplicated and
maintained as a checkpoint in stable storage.
• Actual update done on page in secondary storage. Copy
in stable storage used for rollback.
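A small sketch of checkpointing and rollback in the spirit of shadow pages; a deep copy stands in for the duplicated page on stable storage, and all names are assumptions:

```python
import copy

state = {"balance": 100}   # the "page" updated on secondary storage
checkpoint = None          # shadow copy, kept on stable storage

def take_checkpoint():
    global checkpoint
    checkpoint = copy.deepcopy(state)    # duplicate the page

def rollback():
    global state
    state = copy.deepcopy(checkpoint)    # restore from the stable copy

take_checkpoint()
state["balance"] -= 40     # actual update on the secondary-storage page
rollback()                 # on failure: roll back to the checkpoint
assert state["balance"] == 100
```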
Recovery in Concurrent Systems
• Distributed system state involves message
exchanges.
Recovery in Concurrent Systems
• X has a record of m at x3 but Y has no record: m is an orphan message.
• If Y rolls back to y2, X should go back to x2.
• If Z rolls back, X and Y have to go to x1 and y1: the domino effect, where the rollback of one process causes one or more other processes to roll back.
[Figure: processes X, Y, Z with checkpoints x1, x2, x3; y1, y2; z1, z2; Y sends message m to X after y2.]
Lost Messages
• If Y fails after receiving m, it will roll back to y1.
• X will roll back to x1.
• m becomes a lost message, as X has recorded it as sent and Y has no record of receiving it.
[Figure: X sends m after checkpoint x1; Y fails after receiving m and rolls back to y1.]
Livelocks
[Figure: Y fails after exchanging messages m1/n1 with X, and both roll back (X to x1, Y to y1); on re-execution, the corresponding messages m2/n2 force a second rollback. Such rollbacks can repeat indefinitely: a livelock.]
Synchronous Approach
• Checkpointing:
• First phase:
• An initiating process, Pi, takes a tentative checkpoint.
• Pi requests all other processes to take tentative checkpoints.
• Every process informs whether it was able to take
checkpoint.
• A process can fail to take a checkpoint due to the nature of
application (e.g.,) lack of log space, unrecoverable
transactions.
• Second phase:
• If all processes took checkpoints, Pi decides to make the
checkpoint permanent.
• Otherwise, checkpoints are to be discarded.
• Pi conveys this decision to all the processes as to whether the checkpoints are to be made permanent or to be discarded.
Asynchronous Approach
• Disadvantages of Synchronous Approach:
• Additional message exchanges for taking checkpoints
• Delays normal executions as messages cannot be exchanged
during checkpointing.
• Unnecessary overhead if no failures occur between
checkpoints.
• Asynchronous approach: independent checkpoints at each processor; a consistent set of checkpoints is identified, if needed, for rollback.
• E.g., {x3, y3, z2} is not consistent; {x2, y2, z2} is consistent and can be used for rollback.
[Figure: processes X, Y, Z with checkpoints x1, x2, x3; y1, y2, y3; z1, z2.]
Asynchronous Approach...
• Assumption: 2 types of logging.
• Volatile logging: takes less time but contents lost on failure.
Periodically flushed to stable logs.
• Stable log: may take more time but contents not lost.
• Logging: tuple {s, m, msgs_sent}, where s is the process state, m is the message received, and msgs_sent is the set of messages sent during the event.
• Event logging initiated on message receipt.
• Notations & data structures:
• RCVDi<-j (CkPti): Number of messages received by
processor Pi from Pj as per checkpoint CkPti.
• SENTi->j(CkPti): Number of messages sent by processor Pi
to Pj as per checkpoint CkPti.
• Basic Idea:
• Each processor keeps track of the number of messages sent to/received from every other processor.
Asynchronous Approach...
• Basic Idea ....
• Existence of orphan messages identified by comparing the
number of messages sent and received.
• If the number of messages received exceeds the number sent, orphans are present and the receiving process needs to roll back.
• Algorithm:
• A recovering processor broadcasts a message to all
processors.
• if Pi is the recovering processor, CkPti := latest stable log.
• else CkPti := latest event that took place in i.
• for k := 1 to N do (N the total number of processors in the
system)
• for each neighboring processor j do send ROLLBACK
(i,SENTi->j(CkPti)) message.
• Wait for a ROLLBACK message from every neighbor.
Asynchronous Approach...
• Algorithm ...
• for every ROLLBACK(j,c) message received from a neighbor
j, i does the following:
• if RCVDi<-j(CkPti) > c then /* orphans present */
• find the latest event e such that RCVDi<-j(e) = c;
• CkPti := e.
• end for k.
• The algorithm has N iterations (one iteration is sketched below).
• During the kth (k != 1) iteration, Pi, based on the CkPti determined in the (k-1)th iteration, computes SENTi->j(CkPti) for each neighbor j.
• This value is sent in a ROLLBACK message (in the kth iteration).
• At the end of each iteration, at least one processor will roll back to its final recovery point.
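A hedged sketch of one iteration of this rollback loop at processor Pi; events and ROLLBACK messages are modeled as plain dictionaries, and every name is an assumption:

```python
# Each logged event e carries e["rcvd"][j]: messages received from j up to e.
# rollback_msgs maps a neighbor j to the count c in its ROLLBACK(j, c) message.

def recovery_point(events, rollback_msgs):
    """events: Pi's logged events, oldest first. Returns the event
    (checkpoint) to roll back to after one iteration."""
    ckpt = events[-1]                       # latest event (or stable log)
    for j, c in rollback_msgs.items():
        if ckpt["rcvd"][j] > c:             # orphan messages from j exist
            # roll back to the latest event e with RCVD_i<-j(e) = c
            for e in reversed(events):
                if e["rcvd"][j] <= c:
                    ckpt = e
                    break
    return ckpt
```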
Asynch. Approach Example
[Figure: example execution at processor X with events ex1, ex2, ex3 after checkpoint x1.]
An Algorithm for Site Recovery
• Data Structures
• Each site k maintains the following two data structures:
• ASk: the actual session number of site k, maintained in a variable. ASk is set to zero when site k is non-operational.
• PSk: a vector of size n, where n is the number of sites in the system. PSk[i] is the session number of site i as known to site k. Since sites go up and down dynamically, a site's knowledge of the system is not always correct. Thus, PSk gives the state of the system as perceived by k. PSk[i] is set to zero whenever k learns that site i is down, or some other site informs k that site i is down.
• We next describe how the system functions under normal conditions, during failures, and during recovery.
An Algorithm for Site Recovery
• User Transactions -
• Each request that originates at a site i for reading or writing a data item at site k carries PSi[k]. If PSi[k] ≠ ASk or ASk = 0, then the request is rejected by site k. Otherwise, there are three possible cases: (1) The data item is readable: the request is processed at site k. (2) The data item is marked unreadable and the operation is a write operation: the data item is modified and will be marked readable when the transaction commits. (3) The data item is marked unreadable and the operation is a read operation: a copier transaction is initiated by site k.
• The copier transaction uses the perceived session vector to locate a readable copy. A copy at site j is readable for a copier transaction from site k if PSk[j] = ASj. The copier transaction uses the contents of the readable copy to renovate the local copy and removes the unreadable mark on the local copy.
An Algorithm for Site Recovery
• Copier Transactions -
• Copier transactions may be initiated for all the data items marked unreadable when a site starts recovering; they may also be initiated on a demand basis. Copier transactions also follow the concurrency control protocol.
• Control Transactions -
• Control transactions are special transactions that update AS and PS at all sites. When a recovering site, say k, decides that it is ready to change its state from recovering to operational, it initiates a type-1 control transaction, which performs the following operations (see the sketch after this list):
• It reads PSi from some reachable site i and refreshes PSk.
• It chooses a new session number, sets PSk[k] to this new session number, and writes PSk[k] to all PSi[k] where PSk[i] ≠ 0 (i.e., at all sites that are perceived up by site k).
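A rough sketch of the type-1 control transaction, under the simplifying assumption that the PS vectors of all sites are directly accessible in memory (in reality these would be remote writes); all names are illustrative:

```python
def type1_control(k, PS, new_session, reachable_site):
    """PS[x] is site x's perceived-session vector (a dict site -> session).
    Returns the new session number, to be copied to ASk once k is
    ready to turn operational."""
    PS[k] = dict(PS[reachable_site])    # read PSi from a reachable site i
    s = new_session()                   # choose a new session number
    PS[k][k] = s
    for i, session in PS[k].items():
        if i != k and session != 0:     # sites perceived up by k
            PS[i][k] = s                # write the new session number
    return s
```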
An Algorithm for Site Recovery
• Control Transactions -
• When a site k discovers that one or more sites (say m and n) are down, it initiates a type-2 control transaction, which performs the following operations:
• It sets PSk[m] and PSk[n] to zero.
• For all i such that PSk[i] ≠ 0, it sets PSi[m] and PSi[n] to zero.
• Control transactions also follow the concurrency control and commit protocols used by the RDDBS to control access to the PS vectors.
THE SITE RECOVERY PROCEDURE:
When a site k restarts after a failure, the recovery procedure at site k performs the following steps:
1. It sets ASk to zero. That is, site k is recovering and is not ready to accept user transactions.
2. It marks all the copies of data items unreadable.
3. It initiates a type-1 control transaction.
4. If the control transaction of step 3 successfully terminates, the site copies the new session number from PSk[k] to ASk. (Note that a new session number is set in PSk[k] by the type-1 control transaction.) Once ASk ≠ 0, the site is ready to accept user transactions.
5. If step 3 fails because another site is discovered to have failed, site k initiates a type-2 control transaction to exclude the newly failed site and then restarts from step 3.
• In step 2, a recovering site marks all the data items unreadable; however, only those data items that missed updates while the site was non-operational actually need to be marked unreadable.
Fault Tolerance
Basic Concepts
• Fault Tolerance is closely related to the notion of "Dependability". In Distributed Systems, this is characterized under a number of headings: availability, reliability, safety, and maintainability.
But, What Is “Failure”?
• Definition: a system is said to "fail" when it cannot meet its promises, i.e., when it does not provide the services it was designed for.
Types of Fault
• There are three main types of 'fault': transient, intermittent, and permanent.
Classification of Failure Models
• Different types of failures, with brief descriptions:

Type of failure                  Description
Crash failure                    A server halts, but works correctly until it halts.
Omission failure                 A server fails to respond to incoming requests (fails to receive or to send).
Timing failure                   A server's response lies outside the specified time interval.
Response failure                 A server's response is incorrect (wrong value or wrong state transition).
Arbitrary (Byzantine) failure    A server may produce arbitrary responses at arbitrary times.
Failure Masking by Redundancy
• Strategy: hide the occurrence of failure from other
processes using redundancy.
System reliability: Fault-Intolerance vs.
Fault-Tolerance
• The fault-intolerance (or fault-avoidance) approach improves system reliability by removing the source of failures (i.e., hardware and software faults) before normal operation begins.
• The fault-tolerance approach, in contrast, accepts that faults will occur and uses redundancy so that the system continues to operate correctly despite them.
Issues: Since a fault-tolerant system must behave in a specified manner in the event of failure, the following implications arise -
• Process Deaths:
• All resources allocated to a process must be recovered when
a process dies
• Kernel and remaining processes can notify other
cooperating processes
• Client-server systems: client (server) process needs to be
informed that the corresponding server (client) process died
• Machine failure:
• All processes running on that machine will die
• Client-server systems: difficult to distinguish between a
process and machine failure
• Issue: detection by processes of other machines
• Network Failure:
• Network may be partitioned into subnets
• Machines from different subnets cannot communicate
• Difficult for a process to distinguish between a machine failure and a communication link failure
Fault Tolerance
• Recovery: bringing back the failed node in step with other
nodes in the system.
• Fault Tolerance: Increase the availability of a service or the
system in the event of failures. Two ways of achieving it:
Committing
• A group of actions is grouped as a transaction and the group
is treated as an atomic action.
• The transaction, during the course of its execution, decides
to commit or abort.
2-phase Commit Protocol
• Distributed transaction carried out by a coordinator + a set
of cohorts executing at different sites.
• Phase 1:
• At the coordinator:
• Coordinator sends a COMMIT-REQUEST message to
every cohort requesting them to commit.
• Coordinator waits for reply from all others.
• At the cohorts:
• On receiving the request: if the transaction execution is
successful, the cohort writes UNDO and REDO log on
stable storage. Sends AGREED message to coordinator.
• Otherwise, sends an ABORT message.
• Phase 2:
• At the coordinator:
• All cohorts agreed? : write a COMMIT record on log, send
COMMIT request to all cohorts.
2-phase Commit Protocol ...
• Phase 2...:
• At the coordinator...:
• Otherwise, send an ABORT message
• Coordinator waits for acknowledgement from each
cohort.
• No acknowledgement within a timeout period? : resend
the commit/abort message to that cohort.
• All acknowledgements received? : write a COMPLETE
record to the log.
• At the cohorts:
• On COMMIT message: resources & locks for the transaction
released. Send Acknowledgement to the coordinator.
• On ABORT message: undo the transaction using the UNDO log, release resources & locks held by the transaction, and send an Acknowledgement.
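The two phases at the coordinator, condensed into a sketch; send, recv, cohorts, and log are illustrative stand-ins for real messaging and stable storage, and timeouts/resends are omitted:

```python
def coordinator(cohorts, send, recv, log):
    # Phase 1: request commit and collect votes
    for c in cohorts:
        send(c, "COMMIT-REQUEST")
    votes = [recv(c) for c in cohorts]           # AGREED or ABORT

    # Phase 2: decide, log, propagate, and await acks
    decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
    if decision == "COMMIT":
        log.append("COMMIT")                     # COMMIT record before sending
    for c in cohorts:
        send(c, decision)
    for c in cohorts:
        recv(c)                                  # wait for acknowledgements
    log.append("COMPLETE")                       # all acks received
```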
Handling failures
• 2-phase commit protocol handles failures as below:
• If coordinator crashes before writing the COMMIT record:
• on recovery, it will send ABORT message to all others.
• Cohorts who agreed to commit, will simply undo the
transaction using the UNDO log and abort.
• Other cohorts will simply abort.
• All cohorts are blocked till coordinator recovers.
• Coordinator crashes after COMMIT before writing
COMPLETE
• On recovery, broadcast a COMMIT and wait for ack
• Cohort crashes in phase 1? : coordinator aborts the
transaction.
• Cohort crashes in phase 2? : on recovery, it will check with
the coordinator whether to abort or commit.
• Drawback: blocking protocol. Cohorts blocked if coordinator
fails.
• Resources and locks held unnecessarily.
2-phase commit: State
Machine
• Synchronous protocol: all sites proceed in rounds, i.e., a site
never leads another by more than 1 state transition.
• A state transition occurs in a process participating in the 2-
phase commit protocol whenever it receives/sends messages.
• States: q (idle or querying state), w (wait), a (abort), c
(commit).
• When coordinator is in state q, cohorts are in q or a.
• Coordinator in w -> cohort can be in q, w, or a.
• Coordinator in a/c -> cohort is in w or a/c.
• A cohort in a/c: other cohorts may be in a/c or w.
• A site is never in c when another site is in q as the protocol
is synchronous.
2-phase commit: State
Machine...
[State diagram:
Coordinator: q1 -> w1 (Commit_Request message sent to all cohorts); w1 -> c1 (all agreed, Commit message sent to all); w1 -> a1 (one or more abort replies, Abort message sent to all cohorts).
Cohort i: qi -> wi (C_R received, Agreed message sent); qi -> ai (C_R received, Abort message sent); wi -> ci (Commit received from coordinator); wi -> ai (Abort received from coordinator).]
Drawback
• Drawback: blocking protocol. Cohorts blocked if
coordinator fails.
• Resources and locks held unnecessarily.
• Conditions that cause blocking:
• Assume that only one site is operational. This site cannot
decide to abort a transaction as some other site may be in
commit state.
• It cannot commit as some other site can be in abort state.
• Hence, the site is blocked until all failed sites recover.
Non-blocking Commit
• Non-blocking commit? :
• Sites should agree on the outcome by examining their local
states.
• A failed site, upon recovery, should reach the same conclusion regarding the outcome, consistent with the other working sites.
• Independent recovery: if a recovering site can decide on the
final outcome based solely on its local state.
• A non-blocking commit protocol can support independent
recovery.
• Notations:
• Concurrency set: Let si denote the state of site i. The set of all states that may be concurrent with it is its concurrency set, C(si).
• (e.g.,) Consider a system having 2 sites. If site 2's state is w2, then C(w2) = {c1, a1, w1} and C(q2) = {q1, w1}. a1 and c1 are not in C(q2), as the 2-phase commit protocol is synchronous within one state transition.
• Sender set: Let s be any state and M be the set of all messages received in s. The sender set is S(s) = {i | site i sends m and m in M}.
3-phase Commit
• Lemma: If a protocol contains a local state of a site with both
abort and commit states in its concurrency set, then under
independent recovery conditions it is not resilient to an
arbitrary single failure.
• In the previous figure, C(w2) contains both abort and commit states.
• To make the protocol non-blocking, a buffer state is introduced at both the coordinator (p1) and the cohorts (pi).
• Now, C(w1) = {q2, w2, a2} and C(w2) = {a1, p1, w1}.
3-phase commit: State
Machine
[State diagram:
Coordinator: q1 -> w1 (Commit_Request message sent to all cohorts); w1 -> a1 (one or more abort replies, Abort message sent to all); w1 -> p1 (all agreed, Prepare message sent to all); p1 -> c1 (all cohorts acknowledged, Commit message sent to all).
Cohort i: qi -> wi (C_R received, Agreed message sent); qi -> ai (C_R received, Abort message sent); wi -> ai (Abort received from coordinator); wi -> pi (Prepare received, Ack sent); pi -> ci (Commit received from coordinator).]
Failure, Timeout Transitions
• A failure transition occurs at a failed site at the instant it fails or immediately after it recovers from the failure. A timeout transition occurs at an operational site when it times out waiting for an expected message.
• Reason:
• A failed site makes a transition to a commit (abort) state using the failure transition rule.
• So, the operational sites must make the same transition (via timeout transitions) to ensure that the final outcome is the same at all sites.
3-phase commit + Failure
Trans.
[State diagram, as above, annotated with failure (F) and timeout (T) transitions:
Coordinator: q1 -F,T-> a1; w1 -F,T-> a1; p1 -T-> a1 (Abort sent to all cohorts); p1 -F-> c1.
Cohort i: qi -F,T-> ai; wi -F,T-> ai; pi -F,T-> ci.
F: Failure Transition; T: Timeout Transition.]
Nonblocking Commit
Protocol
• Phase 1:
• First phase identical to that of 2-phase commit, except for
failures.
• Here, coordinator is in w1 and each cohort is in a or w or q,
depending on whether it has received the commit_request
message or not.
• Phase 2:
• Coordinator sends a Prepare message to all the cohorts (if all
of them sent Agreed message in phase 1).
• Otherwise, it will send an Abort message to them.
• On receiving a Prepare message, a cohort sends an
acknowledgement to the coordinator.
• If the coordinator fails before sending a Prepare message,
it aborts the transaction on recovery.
• Cohorts, on timing out waiting for a Prepare message, also abort the transaction.
Nonblocking Commit
Protocol
• Phase 3:
• On receiving acknowledgements to Prepare messages, the coordinator
sends a Commit message to all cohorts.
• Cohort commits on receiving this message.
• Coordinator fails before sending commit? : commits upon recovery.
• So cohorts, on timing out waiting for a Commit message, commit the transaction.
• Cohort failed before sending an acknowledgement? : coordinator
times out and sends an abort message to all others.
• Failed cohort aborts the transaction upon recovery.
• Use of buffer state:
• (e.g.,) Suppose state pi (in the cohort) were not present. Let the coordinator wait in state p1 for acks. Let cohort 2 (in w2) acknowledge and commit.
• Suppose cohort 3 fails in w3. The coordinator times out and aborts. Cohort 3 aborts on recovery: inconsistent with cohort 2.
Commit Protocols
Disadvantages
• No protocol using the above independent recovery technique can handle the simultaneous failure of more than one site.
• The above protocol is also not resilient to network partitioning.
• Alternative: Use voting protocols.
• Basic idea of voting protocol:
• Each replica assigned some number of votes
• A majority of votes need to be collected before accessing a
replica.
• Voting mechanism: more fault tolerant to site failures,
network partitions, and message losses.
• Types of voting schemes:
• Static
• Dynamic
Static Voting Scheme
• System Model:
• File replicas at different sites. File lock rule: either one
writer + no reader or multiple readers + no writer.
• Every file is associated with a version number that gives
the number of times a file has been updated.
• Version numbers are stored on stable storage. Every
successful write updates version number.
• Basic Idea:
• Every replica assigned a certain number of votes. This
number stored on stable storage.
• A read or write operation permitted if a certain number of
votes, called read quorum or write quorum, are collected by
the requesting process.
• Voting Algorithm:
• Let a site i issue a read or write request for a file.
• Site i issues a Lock_Request to its local lock manager.
Static Voting ...
• Voting Algorithm...:
• When the lock request is granted, i sends a Vote_Request message to all the sites.
• When a site j receives a Vote_Request message, it issues a Lock_Request to its lock manager. If the lock request is granted, then j returns to site i the version number of its replica (VNj) and the number of votes assigned to its replica (Vj).
• Site i decides whether it has the quorum or not, based on
replies received within a timeout period as follows.
• For read requests, Vr = Sum of Vk, k in P, where P is
the set of sites from which replies were received.
• For write requests, Vw = Sum of Vk, k in Q, where:
• M = max{VNj : j is in P}
• Q = {j in P : VNj = M}
• Only the votes of the current (version M) replicas are counted in deciding the write quorum.
Static Voting ...
• Voting Algorithm...:
• If i is not successful in getting the quorum, it issues a Release_Lock to its lock manager & to all sites that gave their votes.
• If i is successful in collecting the quorum, it checks whether
its copy of file is current (VNi = M). If not, it obtains the
current copy.
• If the request is read, i reads the local copy. If write, i
updates the local copy and VN.
• i sends all updates and VNi to all sites in Q, i.e., update only
current replicas. i sends a Release_Lock request to its lock
manager as well as those in P.
• All sites, on receiving the updates, perform them; on receiving Release_Lock, they release the lock.
• Vote Assignment:
• Let v be the total number of votes assigned to all copies. The read & write quorums, r & w, are selected such that: r + w > v and w > v/2 (see the sketch below).
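A sketch of the quorum computations above; replies maps each responding site to its (VNk, Vk) pair, and the function names are assumptions:

```python
def read_quorum_met(replies, r):
    # for reads, every responding replica's votes count
    Vr = sum(v for _, v in replies.values())
    return Vr >= r

def write_quorum_met(replies, w):
    # for writes, only current (version M) replicas contribute votes
    M = max(vn for vn, _ in replies.values())        # latest version seen
    Q = [s for s, (vn, _) in replies.items() if vn == M]
    Vw = sum(replies[s][1] for s in Q)
    return Vw >= w, M, Q                             # Q: sites to update
```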
Static Voting ...
• Vote Assignment ...:
• The above values ensure a non-null intersection between every read quorum and write quorum, i.e., any read quorum gathered contains at least one current copy.
• Write quorum is high enough to disallow simultaneous
writes on 2 distinct subset of replicas.
• The scheme can be modified to collect write quorums from
non-current replicas. Another modification: obsolete replicas
updated.
• (e.g.,) A system with 4 replicas at 4 sites. Votes assigned: V1 = 1, V2 = 1, V3 = 2, & V4 = 1.
• Let the disk latency at S1 = 75 msec, S2 = 750 msec, S3 = 750 msec, & S4 = 100 msec.
• If r = 1 & w = 5, the read access time is 75 msec (S1 alone forms a read quorum) and the write access time is 750 msec (all replicas are needed).
Dynamic Voting
• Used to cope with site failures and partitions.
• Vote values can also change.
• Dynamic voting: adapt the number of votes, or the set of sites that can form a quorum, to the changing state of the system due to site & communication failures.
• Approaches:
Majority-based Approach
• [Figure: a series of partitions and one merger. ABCDE splits into ABD and CE; ABD splits into AB and D; AB splits into A and B; A then merges with CE to form ACE.]
• Assume one vote per copy.
• Static voting scheme: only partitions ABCDE, ABD, & ACE are allowed access.
• Majority-based approach: at each partitioning, one partition can collect quorums and the other cannot. Partitions ABCDE, ABD, AB, A, and ACE can collect quorums; the others cannot.
Majority Approach ...
• Notations used:
• Version Number, VNi: of a replica at a site i is an integer
that counts the number of successful updates to the replica
at i. Initially set to 0.
• Number of replicas updated, RUi: Number of replicas
participating in the most recent update. Initially set to the
total number of replicas.
• Distinguished sites list, DSi: a variable at i that stores the IDs of one or more sites. DSi depends on RUi:
• RUi is even: DSi identifies the replica that is greater (as per the linear ordering) than all the other replicas that participated in the most recent update at i.
• RUi is odd and != 3: DSi is nil.
• RUi = 3: DSi lists the 3 replicas that participated in the most recent update, from which a majority is needed to allow access to the data.
Majority Approach: Example
• Example:
• 5 replicas of a file stored at sites A,B,C,D, and E. State of the
system is shown in table. Each replica has been updated 3
times, RUi is 5 for all sites. DSi is nil (as RUi is odd and !=
3).
     A    B    C    D    E
VN   3    3    3    3    3
RU   5    5    5    5    5
DS   -    -    -    -    -
• B receives an update request and finds it can communicate only with A & C. B finds that RU is 5 for the last update. Since partition ABC has 3 of the 5 copies, B decides that it belongs to a distinguished partition. State of the system:
     A    B    C    D    E
VN   4    4    4    3    3
RU   3    3    3    5    5
DS  ABC  ABC  ABC   -    -
Majority Approach:
Example...
• Example...:
• Now, C needs to do an update and finds it can communicate only with B. Since RUc is 3, it chooses the static voting protocol, and so DS & RU are not updated. System state:
     A    B    C    D    E
VN   4    5    5    3    3
RU   3    3    3    5    5
DS  ABC  ABC  ABC   -    -
• Next, D makes an update and finds it can communicate with B, C, & E. The latest version in BCDE is 5, with RU = 3. A majority from DS = ABC is sought and is available (i.e., BC). RU is now set to 4. Since RU is even, DS is set to B (per the linear ordering). System state:
     A    B    C    D    E
VN   4    6    6    6    6
RU   3    4    4    4    4
DS  ABC   B    B    B    B
Majority Approach:
Example...
• Example...:
• C receives an update and finds it can communicate only with B. BC has half the sites of the previous partition and contains the distinguished site B (DS is used to break the tie in the case of even numbers). The update can be carried out in the partition. Resulting state:
     A    B    C    D    E
VN   4    7    7    6    6
RU   3    2    2    4    4
DS  ABC   B    B    B    B
Majority-based: Protocol
• Site i receives an update and executes following protocol:
• i issues a Lock_Request to its local lock manager
• Lock granted? : i issues a Vote_Request to all the sites.
• Site j receives the request: j issues a Lock_Request to its
local lock manager. Lock granted? : j sends the values of
VNj, RUj, and DSj to i.
• Based on the responses received, i decides whether it belongs to the distinguished partition (see the procedure below).
• i does not belong to the distinguished partition? : issues Release_Lock to the local lock manager and Abort to the other sites (which then issue Release_Lock to their local lock managers).
• i belongs to the distinguished partition? : performs the update on the local copy (a current copy is obtained before the update if the local copy is not current). i sends a commit message, with any missing updates and the values of VN, RU, and DS, to the participating sites, and issues a Release_Lock request to its local lock manager.
Majority-based: Protocol
• Site i receives an update and executes following protocol...:
• Site j receives commit message: updates its replica, RU, VN,
& DS, and sends Release_Lock request to local lock
manager.
• Distinguished Partition: Let P denote the set of responding
sites.
• i calculates M (the most recent version in partition), Q (set of
sites containing version M), and N (the number of sites that
participated in the latest update):
• M = max{VNj : j in P}
• Q = {j in P : VNj = M}
• N = RUj, j in Q
• |Q| > N/2 ? : i is a member of the distinguished partition.
• |Q| = N/2 ? : the tie needs to be broken. Select a j in Q. If DSj is in Q, i belongs to the distinguished partition. (If RUj is even, DSj contains the highest-ordered site.) That is, i is in the partition containing the distinguished site.
Majority-based: Protocol
• Distinguished Partition...:
• If N = 3 and if P contains 2 or all 3 sites indicated by DS
variable of the site in Q, i belongs to the distinguished
partition.
• Otherwise, i does not belong to a distinguished partition.
• Update: invoked when a site is ready to commit. Variables
are updated as follows:
• VNi = M + 1
• RUi = cardinality of P (i.e., |P|)
• DSi is updated when N != 3 (since static voting protocol is
used when N = 3).
• DSi = K if RUi is even, where K is the highest ordered site
• DSi = P if RUi = 3
• Majority-based protocol can have deadlocks as it uses locks.
One needs a deadlock detection & resolution mechanism
along with this.
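An illustrative encoding of the distinguished-partition test and update rules above; P maps the responding sites to (VN, RU, DS) tuples, and everything else is an assumption:

```python
def in_distinguished_partition(P):
    M = max(vn for vn, _, _ in P.values())            # most recent version
    Q = {s for s, (vn, _, _) in P.items() if vn == M}
    j = next(iter(Q))
    N = P[j][1]                                       # RU at a current site
    if N == 3:                                        # static-voting case
        ds = P[j][2]                                  # DS lists 3 sites
        return len(set(ds) & set(P)) >= 2             # majority of DS in P
    if len(Q) > N / 2:
        return True                                   # clear majority
    if len(Q) == N / 2:                               # tie: use DS
        return P[j][2] in Q                           # distinguished site here?
    return False

def update_vars(M, P):
    VN = M + 1                        # one more successful update
    RU = len(P)                       # cardinality of P
    if RU % 2 == 0:
        DS = max(P)                   # highest-ordered site ID
    elif RU == 3:
        DS = sorted(P)                # the three participating sites
    else:
        DS = None                     # nil
    return VN, RU, DS
```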
Dynamic Vote Reassignment Protocol
• Here, the number of votes assigned to a site changes
dynamically.
• The idea of DVR is realized in two ways:
• Group Consensus
• The sites in the active group agree upon the new vote assignment, using either a distributed algorithm or by electing a coordinator.
• Autonomous Reassignment
• Each site uses its view of the system to make decisions about changing its votes and picking a new vote value.
• This is carried out through:
• Vote Increasing Protocol
• Vote Decreasing Protocol
• Vote Collecting Protocol
Load Balancing
Motivations
• In a locally distributed system, there is a good possibility that several computers are heavily loaded while others are idle or lightly loaded.
• If we can move jobs around (in other words, distribute the load more evenly), the overall performance of the system can be maximized.
Issues In Load Distributing
• Load estimation
Resource queue lengths
CPU utilization
Issues In Load Distributing (cont)
• Classification of load distributing algorithms: static, dynamic, and adaptive
Components Of A Load Distributing
Algorithm
Typically, a load distributing algorithm has four components: a transfer policy, a selection policy, a location policy, and an information policy.
• Transfer Policy
Determines when a node needs to send tasks to other nodes or can receive tasks from other nodes. When a new task originates at a node and the load at that node exceeds a threshold T, the transfer policy decides that the node is a sender; if the load falls below T, the node can act as a receiver for remote tasks.
Threshold policies of this kind are sketched below.
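A tiny sketch of such a threshold transfer policy; the value of T and the use of queue length as the load measure are assumptions:

```python
T = 5  # illustrative threshold on queue length

def on_new_task(queue_length):
    # the node becomes a sender if accepting the task pushes it past T
    return "sender" if queue_length + 1 > T else "process locally"

def can_receive(queue_length):
    # below T, the node is willing to accept remote tasks
    return queue_length < T
```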
Components Of A Load Distributing
Algorithm (cont)
• Selection Policy
Determines which task(s) to transfer
Newly originated tasks that have caused the node to
become a sender by increasing the load at the node >
threshold
Estimated average execution time for task > threshold
Response time will be improved upon transfer
Overhead incurred by the transfer should be minimal
The number of location-dependent system calls made
by the selected task should be minimal
Components Of A Load Distributing
Algorithm (cont)
• Location Policy
Find suitable nodes for load sharing
• Information policy
Deciding when information about the states of other
nodes in the system should be collected, where it should
be collected from, and what information should be
collected.
Demand-driven: a node collects the state of other
nodes only when it becomes either a sender or
receiver.
Periodic: nodes exchange load information
periodically.
State-change-driven: nodes disseminate state
information whenever their state changes by a certain
degree.
Stability
• The queuing-theoretic perspective
A load distributing algorithm is effective under a given set
of conditions if it improves the performance relative to
that of a system not using load distribution
• Algorithmic perspective
An algorithm is unstable if it can perform fruitless actions
indefinitely with finite probability
Load Distributing Algorithms
• Sender-Initiated Algorithms
• Eager, Lazowska, and Zahorjan
• Receiver-Initiated Algorithms
• Symmetrically Initiated Algorithms
• Adaptive Algorithms
Sender-Initiated Algorithms
Sender-Initiated Algorithms (cont.)
• Random:
• A task is simply transferred to a node selected
at random, with no information exchange
between the nodes to aid in decision making.
• Problem: useless task transfers can occur when a
task is transferred to a node that is already
heavily loaded.
Sender-Initiated Algorithms (cont.)
• Threshold:
• Overcomes the problem with Random
• The sender selects a receiver with the help of polling
• If a suitable node is found, the job is transferred to it
• The number of polls is limited by a parameter called PollLimit
• If no receiver is found, the node (sender) has to execute the job itself (see the sketch below)
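A sketch of this polling loop; POLL_LIMIT, poll, and transfer are illustrative stand-ins:

```python
import random

POLL_LIMIT = 5   # assumed limit on the number of polls

def find_receiver(task, nodes, poll, transfer):
    """poll(n) asks n whether accepting the task keeps it below its
    threshold T; transfer(task, n) ships the task to n."""
    candidates = random.sample(nodes, min(POLL_LIMIT, len(nodes)))
    for node in candidates:
        if poll(node):              # node would remain below threshold
            transfer(task, node)
            return node
    return None                     # no receiver: sender runs the task itself
```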
Sender-Initiated Algorithms (cont.)
• Shortest:
• Choose the best receiver for a task.
• A number of nodes (= PollLimit) are selected at
random and are polled to determine their queue
length -> choose the node with the shortest
queue length as the destination for transfer
unless its queue length >= T
• The destination node will execute the task
regardless of its queue length at the time of
arrival of the transferred task.
Sender-initiated load sharing with
threshold location policy
Sender-Initiated Algorithms (cont.)
• Information policy: demand-driven type.
Receiver-Initiated Algorithms
• In a receiver-initiated algorithm, the load distributing activity is initiated by an underloaded node (receiver) that tries to obtain a task from an overloaded node (sender).
• Location policy:
Comparison of Sender-Initiated and
Receiver-Initiated Algorithms
Symmetrically Initiated
Algorithms
Both senders and receivers search for receivers and senders, respectively.
Combines the advantages and disadvantages of both sender-initiated and receiver-initiated algorithms.
Example: the above-average algorithm.
The above average algorithm
Transfer Policy
Uses two adaptive thresholds, equidistant from the node's estimate of the average load across all nodes (e.g., if the average load is 2, the lower threshold is 1 and the upper threshold is 3).
A node whose load is greater than the upper threshold is a sender; a node whose load is less than the lower threshold is a receiver.
Nodes with loads between these thresholds lie within the acceptable range, so they are neither senders nor receivers (see the sketch below).
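A sketch of this classification; delta (the distance of each threshold from the average estimate) and the names are assumptions:

```python
def classify(load, avg_estimate, delta=1):
    """Two equidistant adaptive thresholds around the average estimate."""
    if load > avg_estimate + delta:
        return "sender"
    if load < avg_estimate - delta:
        return "receiver"
    return "acceptable"    # neither sender nor receiver
```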
Location Policy
Sender-initiated component
A sender (a node whose load is greater than the acceptable range) broadcasts a TooHigh message, sets a TooHigh timeout alarm, and listens for an Accept message until the timeout expires.
A receiver (a node whose load is less than the acceptable range) that receives a TooHigh message cancels its TooLow timeout, sends an Accept message to the source of the TooHigh message, increases its load value, and sets an AwaitingTask timeout. Increasing the load value prevents the receiver from over-committing itself to accepting remote tasks. If the AwaitingTask timeout expires without the arrival of a task, the load value at the receiver is decreased.
On receiving an Accept message, if the node is still a sender, it chooses the best task to transfer and transfers it to the node that responded.
Sender-initiated component
When a sender that is waiting for a response to its TooHigh message receives a TooLow message, it sends a TooHigh message to the node that sent the TooLow message. This TooHigh message is handled by the receiver as described above.
On expiration of the TooHigh timeout, if no Accept message has been received, the sender infers that its estimate of the average system load is too low (since no node has a load much lower). To correct this, the sender broadcasts a ChangeAverage message to increase the average load estimate at the other nodes.
Sender-initiated component
[Figure: message exchange in the sender-initiated component. The sender broadcasts TooHigh; a receiver replies Accept, increments its load value, and sets an AwaitingTask timeout; the sender then transfers the task. If no Accept arrives before the TooHigh timeout, the sender broadcasts ChangeAverage; a TooLow message received while waiting draws a directed TooHigh reply.]
Receiver-initiated component
A node, on becoming a receiver, broadcasts a TooLow message, sets a TooLow timeout, and starts listening for a TooHigh message.
Selection and Information
Policy
Selection policy: this algorithm can
make use of any of the approaches
discussed earlier.
Information policy: demand-driven.
Adaptive Algorithms
A stable symmetrically initiated
algorithm
The instability in the previous algorithms stems from indiscriminate polling by the sender's negotiation component.
Sender initiated component
[Figure: the sender polls the node at the head of its receivers list, asking "is receiver?". If the polled node is still a receiver, the task is transferred; each node then moves the other's ID to the appropriate list (senders, receivers, or OK) based on the exchange.]
Receiver initiated component
[Figure: the receiver polls the node at the head of its senders list, asking "is sender?"; a polled node that is still a sender transfers a task to the receiver.]
Selection and Information
Policy
Selection policy:
The sender initiated component
considers only newly arrived tasks for
transfer.
The receiver initiated component can
make use of any of the approaches
discussed earlier.
Information policy: demand-driven.
A stable sender initiated
algorithm
Two desirable properties:
It does not cause instability.
Load sharing is due only to non-preemptive transfers (which are cheaper).
This algorithm uses the sender initiated
load sharing component of the stable
symmetrically initiated algorithm as is, but
has a modified receiver initiated component
to attract the future non-preemptive task
transfers from sender nodes.
A stable sender initiated
algorithm
The data structure (at each node) of the stable symmetrically initiated algorithm is augmented by an array called the statevector.
The statevector is used by each node to keep track of which list (senders, receivers, or OK) it belongs to at each of the other nodes in the system.
When a sender polls a selected node, the sender's statevector is updated to reflect that the sender now belongs to the senders list at the selected node; the polled node updates its statevector, based on the reply it sent, to reflect which list it will belong to at the sender.
A stable sender initiated
algorithm
The receiver-initiated component is replaced by the following protocol (see the sketch below):
When a node becomes a receiver, it informs all the nodes that are misinformed about its current state. The misinformed nodes are those whose receivers lists do not contain the receiver's ID.
The statevector at the receiver is then updated to reflect that it now belongs to the receivers list at all those nodes that were informed of its current state.
By this technique, the algorithm avoids having receivers send broadcast messages to inform other nodes that they are receivers.
There are no preemptive transfers of partly executed tasks here.
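A sketch of this targeted-notification step; statevector and notify are illustrative assumptions:

```python
def on_become_receiver(my_id, statevector, notify):
    """statevector[n] is the list we believe we are on at node n.
    notify(n, my_id) tells node n that my_id is now a receiver."""
    misinformed = [n for n, lst in statevector.items() if lst != "receivers"]
    for n in misinformed:
        notify(n, my_id)                # targeted messages, no broadcast
        statevector[n] = "receivers"    # we now sit on n's receivers list
```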
Performance comparison