
|| Shri Hari ||

MODULE III
Recovery in
Distributed Systems
Recovery
• Failure of a site/node in a distributed system causes
inconsistencies in the state of the system.
• Recovery: bringing back the failed node in step with
other nodes in the system.
• Classification of Failures:
• Process failure:
• Deadlocks, protection violation, erroneous user input,
etc.
• System failure:
• Failure of processor/system. System failure can have
full/partial amnesia.
• It can be a pause failure (system restarts at the same
state it was in before the crash) or a complete halt.
• Secondary storage failure: data inaccessible.
• Communication failure: network inaccessible. 3
An error is a manifestation of a fault and can lead to a failure.

[Figure: faults (manufacturing, design, external, fatigue) produce an erroneous system state, which in turn can lead to a process/system failure.]
Backward & Forward
Recovery
• Forward Recovery:
• Assess damages that could be caused by faults, remove those
damages (errors), and help processes continue.
• Difficult to do forward assessment. Generally tough.
• Backward Recovery:
• When forward assessment not possible. Restore processes to
previous error-free state.
• Expensive to rollback states
• Does not eliminate same fault occurring again (i.e. loop
on a fault + recovery)
• Unrecoverable actions: print outs, cash dispensed at
ATMs.

5
Problems with Backward Error
Recovery Approach
• The major problems associated with the backward
error recovery approach are:
• Performance Penalty : The overhead to restore a
process state to a prior state can be quite high.
• There is no guarantee that faults will not occur again
when processing begins from a prior state.
• Some components of the system state may be
unrecoverable (e.g., cash already dispensed at an ATM).
• The forward error recovery technique incurs less
overhead because only those parts of the state that
deviate from the intended value need to be corrected.
6
Recovery System Model
• The system is a single machine consisting of stable storage
and secondary storage.
• Storage that does not lose information in event of
system failure is stable storage.
• Stable storage is used to store the logs and recovery
points.
• It is assumed that data on the secondary storage is
archived periodically.

[Figure: CPU connected to main memory, with secondary storage and stable storage attached.]
7
Recovery System Model
• For Backward Recovery
• Backward recovery is simpler than forward recovery since it is
independent of the fault and of the error caused by the fault.
• A single system with secondary and stable storage
• Stable storage does not lose information on failures
• Stable storage used for logs and recovery points
• Stable storage assumed to be more secure than
secondary storage.
• Data on secondary storage assumed to be archived
periodically.

8
Approaches
• Operation-based Approach
– Maintaining logs: all modifications to the state of a process are
recorded in sufficient detail so that a previous state can be
restored by reversing all changes made to the state.
– (e.g.,) Commit in database transactions: if a transaction is
committed to by all nodes, then its changes are permanent. If
it does not commit, the effects of the transaction are undone.
– Updating-in-place: Every write (update) results in a log of (1)
object name (2) old object state (3) new state. Operations:
• A do operation updates & writes the log
• An undo operation uses the log to remove the effect of a do
• A redo operation uses the log to repeat a do
– Write-ahead-log: To avoid the problem of a crash after update
and before logging.
• Write (undo & redo) logs before update
9
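A minimal sketch of the updating-in-place log described above, assuming an in-memory dictionary stands in for the objects on secondary storage and a list stands in for the stable log (the names are illustrative, not from the slides): each do operation writes the (object name, old state, new state) record before the in-place update, and undo/redo replay it.

```python
log = []          # stand-in for the stable-storage log
store = {}        # stand-in for the objects being updated in place

def do_update(name, new_value):
    old_value = store.get(name)
    log.append((name, old_value, new_value))   # write-ahead: log before updating
    store[name] = new_value                    # then update in place

def undo(entry):
    # Remove the effect of a do operation using its log record.
    name, old_value, _ = entry
    if old_value is None:
        store.pop(name, None)
    else:
        store[name] = old_value

def redo(entry):
    # Repeat a do operation using its log record.
    name, _, new_value = entry
    store[name] = new_value
```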
Approaches
• State-based Approach
• Establish a recovery point where the process state is saved.
• Recovery is done by restoring the process state saved at the
recovery point, called a checkpoint; this restoration is called
rollback.
• The process of saving the state is called checkpointing or
taking a checkpoint.
• Rollback normally done to the most recent checkpoint, hence
many checkpoints are done over the execution of a process.
• Shadow pages technique can be used for checkpointing. Page
containing the object to be updated is duplicated and
maintained as a checkpoint in stable storage.
• Actual update done on page in secondary storage. Copy
in stable storage used for rollback.

10
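A rough illustration of this state-based approach, assuming the process state is a dictionary of pages: a checkpoint duplicates the pages into a stand-in for stable storage (in the spirit of shadow pages), and rollback restores that copy. All names here are illustrative.

```python
import copy

stable_storage = {}                    # stand-in for stable storage (checkpoints)
process_state = {"page0": [0, 0, 0]}   # pages being updated in secondary storage

def take_checkpoint(cp_id):
    # Duplicate the current pages; the copy kept in "stable storage"
    # is the recovery point used later for rollback.
    stable_storage[cp_id] = copy.deepcopy(process_state)

def rollback(cp_id):
    # Restore the process state saved at the recovery point.
    global process_state
    process_state = copy.deepcopy(stable_storage[cp_id])
```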
Recovery in Concurrent Systems
• Distributed system state involves message exchanges.
• In distributed systems, rolling back one process can cause the
roll back of other processes.
• Orphan messages & the Domino effect: Assume Y fails after
sending m.
11
Recovery in Concurrent Systems
• X has a record of m at x3 but Y has no record: m is an orphan
message.
• Y rolls back to y2 -> X should go back to x2.
• If Z rolls back, X and Y have to go back to x1 and y1 -> Domino
effect: the roll back of one process causes one or more other
processes to roll back.

[Figure: processes X, Y, Z with checkpoints x1, x2, x3; y1, y2; z1, z2, and a message m sent from Y (after y2) to X (before x3).]
12
Lost Messages
• If Y fails after receiving m, it will rollback to y1.
• X will rollback to x1
• m will be a lost message as X has recorded it as sent
and Y has no record of receiving it.

[Figure: X (checkpoint x1) sends message m to Y (checkpoint y1); Y fails after receiving m.]
13
Livelocks

[Figure: first rollback: X (checkpoint x1) sends n1 after receiving m1 from Y (checkpoint y1); Y fails before receiving n1. Second rollback: after recovery Y sends m2 and X sends n2, forcing both to roll back again.]

• Y crashes before receiving n1. Y rolls back to y1 -> X rolls back to x1.
• Y recovers, receives n1 and sends m2.
• X recovers, sends n2 but has no record of sending n1.
• Hence, Y is forced to roll back a second time. X also rolls back as it has
received m2 but Y has no record of m2.
• The above sequence can repeat indefinitely, causing a livelock.
14
Consistent Checkpoints
• Overcoming the domino effect and livelocks: checkpoints
should not have messages in transit.
• A strongly consistent set of checkpoints: no message exchange
between any pair of processes in the set, or with processes
outside the set, during the interval spanned by the checkpoints.
• A consistent set of checkpoints only requires that there be no
orphan messages.
• {x1,y1,z1} is a strongly consistent set of checkpoints.
[Figure: processes X, Y, Z over time with checkpoints x1, x2; y1, y2; z1, z2 and a message m exchanged between X and Y.]

15
Synchronous Approach
• Checkpointing:
• First phase:
• An initiating process, Pi, takes a tentative checkpoint.
• Pi requests all other processes to take tentative checkpoints.
• Every process informs whether it was able to take
checkpoint.
• A process can fail to take a checkpoint due to the nature of
application (e.g.,) lack of log space, unrecoverable
transactions.
• Second phase:
• If all processes took checkpoints, Pi decides to make the
checkpoint permanent.
• Otherwise, checkpoints are to be discarded.
• Pi conveys this decision to all the processes, which then make
their checkpoints permanent or discard them accordingly.
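A compact sketch of the two phases above, with take_tentative_checkpoint, make_checkpoint_permanent, and discard_tentative_checkpoint as assumed per-process operations (these names are not from the slides; the messaging is collapsed into method calls):

```python
def coordinate_checkpoint(processes):
    # Phase 1: every process (including the initiator Pi) attempts a
    # tentative checkpoint and reports whether it succeeded.
    results = [p.take_tentative_checkpoint() for p in processes]
    ok = all(results)
    # Phase 2: Pi's decision is conveyed to all processes: make the
    # checkpoints permanent only if everyone succeeded, else discard.
    for p in processes:
        if ok:
            p.make_checkpoint_permanent()
        else:
            p.discard_tentative_checkpoint()
    return ok
```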
Asynchronous Approach
• Disadvantages of Synchronous Approach:
• Additional message exchanges for taking checkpoints
• Delays normal executions as messages cannot be exchanged
during checkpointing.
• Unnecessary overhead if no failures occur between
checkpoints.
• Asynchronous approach: independent checkpoints at
each processor. Identify a consistent set of
checkpoints if needed, for roll backs.
• E.g., {x3,y3,z2} is not consistent; {x2,y2,z2} is consistent and is
used for rollback.

[Figure: processes X, Y, Z with checkpoints x1, x2, x3; y1, y2, y3; z1, z2.]
Asynchronous Approach...
• Assumption: 2 types of logging.
• Volatile logging: takes less time but contents lost on failure.
Periodically flushed to stable logs.
• Stable log: may take more time but contents not lost.
• Logging records the tuple {s, m, msgs_sent}: s is the process
state, m the message received, and msgs_sent the set of
messages sent during the event.
• Event logging initiated on message receipt.
• Notations & data structures:
• RCVDi<-j (CkPti): Number of messages received by
processor Pi from Pj as per checkpoint CkPti.
• SENTi->j(CkPti): Number of messages sent by processor Pi
to Pj as per checkpoint CkPti.
• Basic Idea:
• Each processor keeps track of the number of messages sent/
received to/ from other processors. 18
Asynchronous Approach...
• Basic Idea ....
• Existence of orphan messages identified by comparing the
number of messages sent and received.
• If number of received messages > sent messages -> presence
of orphans -> receiving process needs to rollback.
• Algorithm:
• A recovering processor broadcasts a message to all
processors.
• if Pi is the recovering processor, CkPti := latest stable log.
• else CkPti := latest event that took place in i.
• for k := 1 to N do (N the total number of processors in the
system)
• for each neighboring processor j do send ROLLBACK
(i,SENTi->j(CkPti)) message.
• Wait for ROLLBACK message from every neighbor. 19
Asynchronous Approach...
• Algorithm ...
• for every ROLLBACK(j,c) message received from a neighbor
j, i does the following:
• if RCVDi<-j(CkPti) > c then /* orphans present */
• find the latest event e such that RCVDi<-j(e) = c;
• CkPti := e.
• end for k.
• The algorithm has |N| iterations.
• During the kth (k != 1) iteration, Pi, based on the CkPti
determined in the (k-1)th iteration, computes SENTi->j(CkPti)
for each neighbor.
• This value is sent in a ROLLBACK message (in the kth
iteration).
• At the end of each iteration, at least 1 processor will roll
back to its final recovery point. 20
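A structural sketch of one iteration at processor Pi, under the assumptions that each logged event carries the cumulative SENT and RCVD counters and that send/receive stand in for the messaging layer (none of these names come from the slides):

```python
def rollback_iteration(events, ckpt, neighbours, send, receive):
    # events: Pi's logged events, oldest first; ckpt: index of the
    # current candidate checkpoint CkPti.
    for j in neighbours:
        # Send ROLLBACK(i, SENTi->j(CkPti)) to each neighbour j.
        send(j, events[ckpt]["sent_to"][j])
    for j in neighbours:
        c = receive(j)                             # ROLLBACK(j, c) from j
        # Orphans exist if Pi recorded more receptions from j than j sent:
        # roll back to the latest event e with RCVDi<-j(e) = c.
        while events[ckpt]["recvd_from"][j] > c:
            ckpt -= 1
    return ckpt

# The full algorithm repeats this for k = 1..N; after each iteration at
# least one processor has reached its final recovery point.
```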
Asynch. Approach Example
[Figure: X with checkpoint x1 and events ex1, ex2, ex3; Y with checkpoint y1 and events ey1, ey2, ey3 followed by a failure; Z with checkpoint z1 and events ez1, ez2.]

• Y fails, restarts from y1. CkPtx is ex3 & CkPtz is ez2.


• 1st iteration:
• Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z
• X sends RollBack(X,1) to Y & RollBack(X,0) to Z
• Z send RollBack(Z,0) to X & RollBack(Z,1) to Y.
• Discussion:
• RCVDx<-y(CkPtx) = 3 > 2 (in Y’s RollBack message), so CkPtx is
set to ex2 to satisfy the equality constraint.
• RCVDz<-y(CkPtz) = 2 > 1 (in Y’s message), so CkPtz is set to ez1.
Distributed Databases
• Checkpointing objectives in distributed database
systems (DDBS):
• Normal operations should be minimally interfered with, by
checkpointing.
• A DDBS may update different objects in different sites, local
checkpointing at each site is better.
• For faster recovery, checkpoints be consistent (desirable
property).
• Activity in a DDBS is in terms of transactions. So in a
DDBS, a consistent checkpoint should either include the
updates of a transaction completely or not include them at
all.
• Issues in identifying checkpoints:
• How sites agree on what transactions are to be included
• Taking checkpoints without interference 22
DDBS Check pointing
• Assumptions:
• Basic unit of activity is transactions
• Transactions follow some concurrency control protocol
• Lamport’s logical clocks used for time-stamping transactions.
• Failures detected by network protocols or timeouts
• Network partitioning never occurs
• Basic Idea
• All sites agree on a Global Checkpoint Number (GCPN)
• Transactions with timestamps <= GCPN are included in the
checkpoint. Called BCPTs: Before Checkpoint Transactions.
• Timestamps of After Checkpoint Transactions (ACPTs) >
GCPN.
• Each site keeps multiple versions of the data items being
updated by ACPTs in volatile storage -> no interference during
checkpointing.
23
DDBS Checkpointing ...
• Data Structures
• LC: local clock as per Lamport’s logical clock
• LCPN (local checkpoint number): determined locally for the
current checkpoint.
• Algorithm: initiated by checkpoint coordinator (CC).
CC uses checkpoint subordinates (CS).
• Phase 1 at the CC
• CC broadcasts a Checkpoint_Request message with a
local timestamp LCcc.
• LCPNcc := LCcc
• CONVERTcc := false
• Wait for replies from CSs.
24
DDBS Checkpointing ...
• Phase 1 at CSs
• On receiving a Checkpoint_Request message, a site m,
updates its local clock as LCm := MAX(LCm, LCcc+1)
• LCPNm := LCm
• m informs LCPNm to the CC
• CONVERTm := false
• m marks all the transactions with timestamps <=
LCPNm as BCPTs and the rest as temporary-ACPTs.
• All updates of temporary-ACPTs are stored in the
buffers of the ACPTs
• If a temporary-ACPT commits, updates are not
flushed to the database but maintained as committed
temporary versions (CTVs).
• Other transactions access CTVs for reads. For writes,
another version of CTV is created.
25
DDBS Check pointing ...
• Phase 2 at CC
• All CS’s replies received -> GCPN := Max(LCPN1, ..,
LCPNn)
• Broadcast GCPN
• Phase 2 at the CSs
• On receiving GCPN, m marks all temporary-ACPTs that
satisfy the following conditions as BCPTs:
• LCPNm < transaction time stamp <= GCPN
• Updates of the above converted BCPTs are included in
checkpoints
• CONVERTm := true (i.e., GCPN & BCPTs identified)
• When all BCPTs terminate and CONVERTm = true, m
takes a local checkpoint by saving the state of the data
objects.
• After local checkpointing, the database is updated with the
committed temporary versions (CTVs) and the CTVs are deleted.
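A small sketch of the GCPN agreement, hedged: sites are modelled as dictionaries holding a Lamport clock, and the message exchange is collapsed into plain function calls (none of these names are from the slides).

```python
def phase1_at_cs(site, lc_cc):
    # Checkpoint_Request carries the coordinator's timestamp LCcc.
    site["LC"] = max(site["LC"], lc_cc + 1)   # Lamport clock update
    site["LCPN"] = site["LC"]                 # local checkpoint number
    return site["LCPN"]                       # reported back to the CC

def phase2_at_cc(lcpns):
    return max(lcpns)                         # GCPN := max of all LCPNs

def is_bcpt(txn_timestamp, gcpn):
    # Transactions with timestamps <= GCPN belong to the checkpoint (BCPTs).
    return txn_timestamp <= gcpn
```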
Recovery in Replicated DDS
• To enhance performance and availability, a distributed
database system is replicated where copies of data are stored
at different sites.
• Such a system is known as a replicated distributed database
system (RDDBS). In RDDBS, transactions are allowed to
continue despite one or more site failures as long as one copy
of the database is available.
• The availability and performance of the database system are
enhanced as transactions are not blocked even when one
or more sites fail.
• However, in the above scheme, copies of the database at the
failed sites may miss some updates while the sites are not
operational. These copies will be inconsistent with the copies
at the operational sites.
• The goal of recovery algorithms in an RDDBS is to hide such
inconsistencies from user transactions, bring the copies at
recovering sites up to date with respect to the rest of the copies,
and enable the recovering sites to start processing transactions
as soon as possible.
Recovery in Replicated DDS
• Two approaches have been proposed to recover failed sites.
• In one approach, message spoolers are used to save all the
updates directed towards failed sites. On recovery, the failed
site processes all the missed updates before resuming normal
operations.
• The other approach employs special transactions known as
copier transactions. Copier transactions read the up-to-date
copies at the operational sites and update the copies at the
recovering sites.
• Copier transactions run concurrently with user transactions.
The recovery scheme should guarantee that:
• (1) the out-of-date replicas are not accessible to user
transactions, and
• (2) once the out-of-date replicas are made up to date by
copier transactions, they are also updated along with the
other copies by the user transactions.
28
An Algorithm for Site Recovery
• This algorithm, proposed by Bhargava and Ruan, is based on
copier transactions. A limitation of the scheme is that it does not
handle network partitions, where the sites of the database system
are partitioned into different groups and sites in different
partitions cannot communicate with each other.

• SYSTEM MODEL. The database is assumed to be
manipulated through transactions whose access to the
database is controlled by a concurrency control algorithm.
Transactions either run to completion or have no effect on the
database. The semantics of read and write operations are such
that a read operation reads from any available copy and a write
operation updates all the available copies. All the out-of-date
copies in the database are assumed to be marked "unreadable".
We also assume that the database is fully replicated (i.e., every
site has a copy of the database). A site may be in any one of the
following states:
An Algorithm for Site Recovery
• Operational/ Up. The site is operating normally and user
transactions are accepted.
• Recovering. Recovery is still in progress at the site
and the site is not ready to accept user transactions.
• Down. No RDDBS activity can be performed at the site.
• Non-operational. The site’s state is either recovering or
down.
• An operational session of a site is a time period in which the
site is up. Each operational session of a site is designated
with a session number (an integer) which is unique in the
site’s history, but not necessarily unique system-wide. The
session numbers are stored on nonvolatile storage so that a
recovering site can use an appropriate new session number.

30
An Algorithm for Site Recovery
• Data Structures
• Each site k maintains the following two data structures:
• ASk: the session number of site k, maintained in a variable. ASk
is set to zero when site k is non-operational.
• PSk: a vector of size n, where n is the number of sites in the
system. PSk[i] is the session number of site i as known to
site k. Since sites go up and down dynamically, a site’s
knowledge of the system is not always correct; thus PSk gives
the state of the system as perceived by k. PSk[i] is set to zero
whenever k learns that site i is down or some other site
informs k that site i is down.
• We next describe how the system functions under normal
conditions, failures , and during recovery.

31
An Algorithm for Site Recovery
• User Transactions -
• Each request that originates at a site i for reading or
writing a data item at site k carries PSi[k]. If PSi[k] ≠ ASk
or ASk = 0, then the request is rejected by site k. Otherwise,
there are three possible cases: (1) The data item is readable:
the request is processed at site k. (2) The data item is
marked unreadable and the operation is a write operation:
the data item is modified and will be marked readable when
the transaction commits. (3) The data item is marked
unreadable and the operation is a read operation: a copier
transaction is initiated by site k.
• The copier transaction uses the perceived session vector to
locate a readable copy. A copy at site j is readable for a
copier transaction from site k if PSk[j] = ASj. The copier
transaction uses the contents of the readable copy to
renovate the local copy and removes the unreadable mark on
the local copy.
32
An Algorithm for Site Recovery
• Copier Transactions -
• Copier transactions may be initiated for all the data items
marked unreadable when a site starts recovering. Copier
transactions may also be initiated on a demand basis. They
follow the concurrency control protocol as well.

• Control Transaction -
• Control transactions are special transactions that update AS
and PS at all sites. When a recovering site, say k, decides that
it is ready to change its state from recovering to operational,
it initiates a type 1 control transaction. A type 1 control
transaction performs the following operations:
• It reads PSi from some reachable site i and refreshes PSk.
• It chooses a new session number, sets PSk[k] to this new
session number, and writes PSk[k] to all PSi[k] where
PSk[i] ≠ 0 (i.e., at all sites that are perceived as up by site k).
33
An Algorithm for Site Recovery
• Control Transaction -
• When a site k discovers that one or more sites (say m and n) are
down, it initiates a type 2 control transaction, which performs
the following operations:
• It sets PSk[m] and PSk[n] to zero.
• For all i such that PSk[i] ≠ 0, it sets PSi[m] and PSi[n] to
zero.
• Control transaction also follows concurrency control and
commit protocols used by RDDBS to control access to PS
vectors.

34
THE SITE RECOVERY PROCEDURE:
When a site k restarts after failure, the recovery procedure at site
k performs the following steps:-
1. It sets ASk to zero . That is, site k is recovering and is not
ready to accept user transactions.
2. It marks all the copies of data items unreadable.
3. It initiates a type 1 control transaction.
4. If the control transaction of step 3 successfully terminates,
then the site copies the new session number from PSk[k] to
ASk. (Note that a new session number is set in PSk[k] by the
type 1 control transaction.) Once ASk ≠ 0, the site is ready to
accept user transactions.
5. If step 3 fails because another site is discovered to have
failed, site k initiates a type 2 control transaction to exclude
the newly failed site and then restarts from step 3.
• In step 2, a recovering site marks all its data items
unreadable; however, only those data items that missed
updates while the site was non-operational actually need to be
marked unreadable.
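A step-by-step sketch of this recovery procedure, assuming a site object whose control-transaction operations are stand-ins for the real protocol (the method names are illustrative, not from the text):

```python
def recover_site(k):
    k.AS = 0                                    # step 1: no user transactions yet
    k.mark_all_copies_unreadable()              # step 2
    while True:
        if k.run_type1_control_transaction():   # step 3: refresh PS, pick session no.
            k.AS = k.PS[k.id]                   # step 4: adopt the new session number
            return                              # site is operational again
        # step 5: another site was found down; exclude it, then retry step 3
        k.run_type2_control_transaction()
```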
Fault Tolerance

36
Basic Concepts
• Fault Tolerance is closely related to the notion of
“Dependability”. In Distributed Systems, this is
characterized under a number of headings:

• Availability – the system is ready to be used immediately.


• Reliability – the system can run continuously without failure.
• Safety – if a system fails, nothing catastrophic will happen.
• Maintainability – when a system fails, it can be repaired
easily and quickly (and, sometimes, without its users
noticing the failure).

37
But, What Is “Failure”?
• Definition:

• A system is said to “fail” when it cannot


meet its promises.
• A failure is brought about by the
existence of “errors” in the system.
• The cause of an error is called a “fault”.

38
Types of Fault
• There are three main types of ‘fault’:

• Transient Fault – appears once, then


disappears.
• Intermittent Fault – occurs, vanishes,
reappears; but: follows no real pattern
(worst kind).
• Permanent Fault – once it occurs, only the
replacement/repair of a faulty component
will allow the DS to function normally.

39
Classification of Failure Models
• Different types of failures, with brief
descriptions.
Type of failure Description

Crash failure - A server halts, but is working correctly until it halts.
Omission failure - A server fails to respond to incoming requests.
   Receive omission - A server fails to receive incoming messages.
   Send omission - A server fails to send outgoing messages.
Timing failure - A server's response lies outside the specified time interval.
Response failure - The server's response is incorrect.
   Value failure - The value of the response is wrong.
   State transition failure - The server deviates from the correct flow of control.
Arbitrary failure - A server may produce arbitrary responses at arbitrary times.

40
Failure Masking by Redundancy
• Strategy: hide the occurrence of failure from other
processes using redundancy.

• Three main types:

• Information Redundancy – add extra bits to allow


for error detection/recovery (e.g., Hamming codes
and the like).

• Time Redundancy – perform operation and, if needs


be, perform it again. Think about how transactions
work (BEGIN/END/COMMIT/ABORT).

• Physical Redundancy – add extra (duplicate)


hardware and/or software to the system. 41
Failure Masking by Redundancy

• Triple modular redundancy. (Physical


Redundancy)

42
System reliability: Fault-Intolerance vs.
Fault-Tolerance
• The fault intolerance (or fault-avoidance)
approach improves system reliability by
removing the source of failures (i.e., hardware
and software faults) before normal operation
begins

• The fault-tolerance approach expects faults to be present
during system operation, but employs design techniques that
ensure the continued correct execution of the computing
process.

43
Issues: Since a fault-tolerant system must behave in a specified
manner in the event of failure, the following implications arise -

• Process Deaths:
• All resources allocated to a process must be recovered when
a process dies
• Kernel and remaining processes can notify other
cooperating processes
• Client-server systems: client (server) process needs to be
informed that the corresponding server (client) process died
• Machine failure:
• All processes running on that machine will die
• Client-server systems: difficult to distinguish between a
process and machine failure
• Issue: detection by processes of other machines
• Network Failure:
• Network may be partitioned into subnets
• Machines from different subnets cannot communicate
• Difficult for a process to distinguish between a machine
failure and a communication link failure
Fault Tolerance
• Recovery: bringing back the failed node in step with other
nodes in the system.
• Fault Tolerance: Increase the availability of a service or the
system in the event of failures. Two ways of achieving it:

• Masking failures: Continue to perform its specified function


in the event of failure.
• Well defined failure behavior: System may or may not
function in the event of failure, but can facilitate actions
suitable for recovery.
• (e.g.,) effect of database transactions visible only if
committed to by all sites. Otherwise, transaction is
undone without any effect to other transactions.

• Key approach to fault tolerance: redundancy. e.g., multiple


copies of data, multiple processes providing same service.
45
45
Atomic Actions
• Example: Processes P1 & P2 share a data named X.
• P1: ... lock(X); X:= X + Z; unlock(X); ...
• P2: ... lock(X); X := X + Y; unlock(X); ...
• Updating of X by P1 or P2 should be done atomically i.e.,
without any interruption.
• Atomic operation if:
• the process performing it is not aware of existence of any
others.
• the process doing it does not communicate with others
during the operation time.
• No other state change in the process except the operation.
• Effects on the system gives an impression of indivisible and
perhaps instantaneous operation.

46
46
Committing
• A group of actions is grouped as a transaction and the group
is treated as an atomic action.
• The transaction, during the course of its execution, decides
to commit or abort.

• Commit: guarantee that the transaction will be completed.

• Abort: guarantee not to do the transaction and erase any


part of the transaction done so far.

• Global atomicity: (e.g.,) A distributed database transaction


that must be processed at every or none of the sites.

• Commit protocols: are ones that enforce global atomicity.

47
47
2-phase Commit Protocol
• Distributed transaction carried out by a coordinator + a set
of cohorts executing at different sites.
• Phase 1:
• At the coordinator:
• Coordinator sends a COMMIT-REQUEST message to
every cohort requesting them to commit.
• Coordinator waits for reply from all others.
• At the cohorts:
• On receiving the request: if the transaction execution is
successful, the cohort writes UNDO and REDO log on
stable storage. Sends AGREED message to coordinator.
• Otherwise, sends an ABORT message.
• Phase 2:
• At the coordinator:
• All cohorts agreed? : write a COMMIT record on log, send
COMMIT request to all cohorts.
48
48
2-phase Commit Protocol ...
• Phase 2...:
• At the coordinator...:
• Otherwise, send an ABORT message
• Coordinator waits for acknowledgement from each
cohort.
• No acknowledgement within a timeout period? : resend
the commit/abort message to that cohort.
• All acknowledgements received? : write a COMPLETE
record to the log.
• At the cohorts:
• On COMMIT message: resources & locks for the transaction
released. Send Acknowledgement to the coordinator.
• On ABORT message: undo the transaction using the UNDO log,
release resources & locks held by the transaction, send
Acknowledgement.

49
49
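A coordinator-side sketch of the two-phase exchange just described, with send, recv, and the stable log modelled as simple callables/lists and the timeout/resend handling omitted for brevity (all names are illustrative):

```python
def two_phase_commit(cohorts, send, recv, log):
    # Phase 1: request commit and collect the votes.
    for c in cohorts:
        send(c, "COMMIT-REQUEST")
    votes = [recv(c) for c in cohorts]          # "AGREED" or "ABORT"

    # Phase 2: commit only if every cohort agreed.
    decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
    log.append(decision)                        # COMMIT/ABORT record on stable log
    for c in cohorts:
        send(c, decision)
    _acks = [recv(c) for c in cohorts]          # real protocol resends on timeout
    log.append("COMPLETE")                      # COMPLETE record once all acks arrive
    return decision
```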
Handling failures
• 2-phase commit protocol handles failures as below:
• If coordinator crashes before writing the COMMIT record:
• on recovery, it will send ABORT message to all others.
• Cohorts who agreed to commit, will simply undo the
transaction using the UNDO log and abort.
• Other cohorts will simply abort.
• All cohorts are blocked till coordinator recovers.
• Coordinator crashes after COMMIT before writing
COMPLETE
• On recovery, broadcast a COMMIT and wait for ack
• Cohort crashes in phase 1? : coordinator aborts the
transaction.
• Cohort crashes in phase 2? : on recovery, it will check with
the coordinator whether to abort or commit.
• Drawback: blocking protocol. Cohorts blocked if coordinator
fails.
• Resources and locks held unnecessarily.
50
50
2-phase commit: State
Machine
• Synchronous protocol: all sites proceed in rounds, i.e., a site
never leads another by more than 1 state transition.
• A state transition occurs in a process participating in the 2-
phase commit protocol whenever it receives/sends messages.
• States: q (idle or querying state), w (wait), a (abort), c
(commit).
• When coordinator is in state q, cohorts are in q or a.
• Coordinator in w -> cohort can be in q, w, or a.
• Coordinator in a/c -> cohort is in w or a/c.
• A cohort in a/c: other cohorts may be in a/c or w.
• A site is never in c when another site is in q as the protocol
is synchronous.

51
51
2-phase commit: State
Machine...
[Figure: 2-phase commit state machines.
Coordinator: q1 -> w1 on sending the Commit_Request message to all
cohorts; w1 -> c1 when all agreed (Commit message sent to all);
w1 -> a1 on one or more abort replies (Abort message sent to all).
Cohort i: qi -> wi when C_R received and Agreed message sent;
qi -> ai when C_R received and Abort message sent; wi -> ci on Commit
from coordinator; wi -> ai on Abort from coordinator.]

52
52
Drawback
• Drawback: blocking protocol. Cohorts blocked if
coordinator fails.
• Resources and locks held unnecessarily.
• Conditions that cause blocking:
• Assume that only one site is operational. This site cannot
decide to abort a transaction as some other site may be in
commit state.
• It cannot commit as some other site can be in abort state.
• Hence, the site is blocked until all failed sites recover.

53
53
Non-blocking Commit
• Non-blocking commit? :
• Sites should agree on the outcome by examining their local
states.
• A failed site, upon recovery, should reach the same
conclusion regarding the outcome. Consistent with other
working sites.
• Independent recovery: if a recovering site can decide on the
final outcome based solely on its local state.
• A non-blocking commit protocol can support independent
recovery.
• Notations:
• Concurrency set: Let Si denote the state of the site i. The set
of all the states that may be concurrent with it is
concurrency set (C(si)).
• (e.g.,) Consider a system having 2 sites. If site 2’s state is w2,
then C(w2) = {c1, a1, w1}. C(q2) = {q1, w1}. a1 and c1 are not in
C(q2) as the 2-phase commit protocol is synchronous within 1
state transition.
• Sender set: Let s be any state, M be the set of all messages
received in s. Sender set, S(s) = {i | site i sends m and m in
M} 54
54
3-phase Commit
• Lemma: If a protocol contains a local state of a site with both
abort and commit states in its concurrency set, then under
independent recovery conditions it is not resilient to an
arbitrary single failure.
• In the previous figure, C(w2) can have both abort and commit
states in its concurrency set.
• To make it a non-blocking protocol: introduce a buffer state
at both the coordinator and the cohorts.
• Now, C(w1) = {q2, w2, a2} and C(w2) = {a1, p1, w1}.

55
55
3-phase commit: State
Machine
[Figure: 3-phase commit state machines with a buffer (prepare) state.
Coordinator: q1 -> w1 on sending the Commit_Request message to all
cohorts; w1 -> a1 on one or more abort replies (Abort message sent to
all); w1 -> p1 when all agreed (Prepare message sent to all); p1 -> c1
when Acks are received from all cohorts (Commit message sent to all).
Cohort i: qi -> wi when C_R received and Agreed message sent; qi -> ai
when C_R received and Abort message sent; wi -> pi when the Prepare
message is received (Ack sent); wi -> ai on Abort from coordinator;
pi -> ci on Commit from coordinator.]

56
56
Failure, Timeout Transitions
• A failure transition occurs at a failed site at the instant it
fails or immediately after it recovers from the failure.

• Rule for failure transition: For every non-final state s (i.e.,


qi, wi, pi) in the protocol, if C(s) contains a commit, then
assign a failure transition from s to a commit state in its
FSA. Otherwise, assign a failure transition from s to an
abort state.
• Reason: pi is the only state with a commit state in its
concurrency set. If a site fails at pi, then it can commit on
recovery. Any other state failure, safer to abort.

• If site i is waiting on a message from j, i can time out. i can


determine the state of j based on the expected message.

• Based on j’s state, the final state of j can be determined


using failure transition at j.
57
57
Failure, Timeout Transitions
• This can be used for incorporating Timeout transitions at i.

• Rule for timeout transition: For each nonfinal state s, if site j


in S(s),and site j has a failure transition from s to a commit
(abort) state, then assign a timeout transition from s to a
commit (abort) state.

• Reason:
• Failed site makes a transition to a commit (abort) state using
failure transition rule.
• So, the operational site must make the same transition to
ensure that the final outcome is the same at all sites.

58
58
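A tiny sketch of the failure-transition rule just stated: given the concurrency set of a non-final state, the recovering site commits only if a commit state could be concurrent with it (in 3-phase commit this holds only for the prepared state). The example concurrency sets in the comments are illustrative.

```python
def failure_transition(concurrency_set):
    # Rule: if C(s) contains a commit state, fail over to commit;
    # otherwise fail over to abort.
    return "commit" if "commit" in concurrency_set else "abort"

# Illustrative use for a cohort in 3-phase commit:
print(failure_transition({"w1", "p1", "commit"}))   # prepared state pi -> "commit"
print(failure_transition({"q1", "w1", "abort"}))    # wait state wi     -> "abort"
```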
3-phase commit + Failure
Trans.
[Figure: 3-phase commit state machines with failure (F) and timeout (T)
transitions added. Coordinator: failure or timeout in q1 or w1 leads to
a1; in p1, a timeout leads to a1 (Abort sent to all cohorts) while a
failure leads to c1. Cohort i: failure or timeout in qi or wi leads to
ai; failure or timeout in pi leads to ci; a cohort in wi or pi moves to
ai on Abort from the coordinator, and a cohort in pi moves to ci on
Commit from the coordinator.
F: Failure Transition, T: Timeout Transition]
59
59
Nonblocking Commit
Protocol
• Phase 1:
• First phase identical to that of 2-phase commit, except for
failures.
• Here, coordinator is in w1 and each cohort is in a or w or q,
depending on whether it has received the commit_request
message or not.
• Phase 2:
• Coordinator sends a Prepare message to all the cohorts (if all
of them sent Agreed message in phase 1).
• Otherwise, it will send an Abort message to them.
• On receiving a Prepare message, a cohort sends an
acknowledgement to the coordinator.
• If the coordinator fails before sending a Prepare message,
it aborts the transaction on recovery.
• Cohorts, on timing out on a Prepare message, also abort
the transaction.
60
60
Nonblocking Commit
Protocol
• Phase 3:
• On receiving acknowledgements to Prepare messages, the coordinator
sends a Commit message to all cohorts.
• Cohort commits on receiving this message.
• Coordinator fails before sending commit? : commits upon recovery.
• So cohorts on Commit message timeout, commit to the transaction.
• Cohort failed before sending an acknowledgement? : coordinator
times out and sends an abort message to all others.
• Failed cohort aborts the transaction upon recovery.
• Use of buffer state:
• (e.g.,) Suppose state pi (in cohort) is not present. Let coordinator wait
in state p1 waiting for ack. Let cohort 2 (in w2) acknowledge and
commit.
• Suppose cohort 3 fails in w3. Coordinator will time out and abort.
Cohort 3 will abort on recovery. Inconsistent with cohort 2.

61
61
Commit Protocols
Disadvantages
• There is no protocol based on the above independent recovery
technique that can handle the simultaneous failure of more than
one site.
• The above protocol is also not resilient to network
partitioning.
• Alternative: Use voting protocols.
• Basic idea of voting protocol:
• Each replica assigned some number of votes
• A majority of votes need to be collected before accessing a
replica.
• Voting mechanism: more fault tolerant to site failures,
network partitions, and message losses.
• Types of voting schemes:
• Static
• Dynamic

62
62
Static Voting Scheme
• System Model:
• File replicas at different sites. File lock rule: either one
writer + no reader or multiple readers + no writer.
• Every file is associated with a version number that gives
the number of times a file has been updated.
• Version numbers are stored on stable storage. Every
successful write updates version number.
• Basic Idea:
• Every replica assigned a certain number of votes. This
number stored on stable storage.
• A read or write operation permitted if a certain number of
votes, called read quorum or write quorum, are collected by
the requesting process.
• Voting Algorithm:
• Let a site i issue a read or write request for a file.
• Site i issues a Lock_Request to its local lock manager.
63
63
Static Voting ...
• Voting Algorithm...:
• When lock request is granted, i sends a Vote_Request
message to all the sites.
• When a site j receives a Vote_Request message, it issues a
Lock_Request to its lock manager. If the lock request is
granted, then it returns the version number of the replica
(VNj) and the number of votes assigned to the replica (Vj)
to site i.
• Site i decides whether it has the quorum or not, based on
replies received within a timeout period as follows.
• For read requests, Vr = Sum of Vk, k in P, where P is
the set of sites from which replies were received.
• For write requests, Vw = Sum of Vk, k in Q such that:
• M = max{VN j: j is in P}
• Q = {j in P : VNj = M}
• Only the votes of the current (version) replicas are
counted in deciding the write quorum.
64
64
Static Voting ...
• Voting Algorithm...:
• If i is not successful in getting the quorum, it issues a
Release _Lock to the lock manager & to all sites that gave
their votes.
• If i is successful in collecting the quorum, it checks whether
its copy of file is current (VNi = M). If not, it obtains the
current copy.
• If the request is read, i reads the local copy. If write, i
updates the local copy and VN.
• i sends all updates and VNi to all sites in Q, i.e., update only
current replicas. i sends a Release_Lock request to its lock
manager as well as those in P.
• All sites on receiving updates, perform updates. On receiving
Release_Lock, releases lock.
• Vote Assignment:
• Let v be the total number of votes assigned to all copies.
The read and write quorums, r and w, are selected such that:
r + w > v and w > v/2.
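A sketch of the quorum checks under these rules; replies maps each responding site (the set P) to its (version number, votes) pair, and all names are illustrative rather than the authors' code.

```python
def read_quorum_met(replies, r):
    # Vr = sum of the votes of all sites that replied (the set P).
    return sum(votes for _vn, votes in replies.values()) >= r

def write_quorum_met(replies, w):
    # Only replicas holding the latest version M contribute to Vw.
    m = max(vn for vn, _votes in replies.values())
    vw = sum(votes for vn, votes in replies.values() if vn == m)
    return vw >= w

# Vote assignment: with v total votes, choosing r + w > v and w > v/2
# guarantees that every read quorum intersects every write quorum.
```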
Static Voting ...
• Vote Assignment ...:
• Above values are determined so that there is a non-null
intersection between every read and write quorum, i.e., at
least 1 current copy in any reading quorum gathered.
• Write quorum is high enough to disallow simultaneous
writes on 2 distinct subset of replicas.
• The scheme can be modified to collect write quorums from
non-current replicas. Another modification: obsolete replicas
updated.
• (e.g.,) A system with 4 replicas at 4 sites. Votes assigned: V1 =
1, V2 = 1, V3 = 2, & V4 = 1 (so v = 5).
• Let the disk latency at S1 = 75 msec, S2 = 750 msec, S3 = 750 msec,
& S4 = 100 msec.
• If r = 1 & w = 5, then the read access time is 75 msec (any single
replica suffices) and the write access time is 750 msec (all
replicas must be collected).

66
66
Dynamic Voting
• Used to cope with failures.
• The vote values themselves can also change.
• Dynamic voting: adapt the number of votes or the set of sites
that can form a quorum to the changing state of the system
due to site & communication failures.

• Approaches:

• Majority based approach: set of sites change with system state.


This set can form a majority to allow access to replicated data.

• Dynamic vote reassignment: number of votes assigned to a site


changes dynamically.

67
67
Majority-based Approach
[Figure: ABCDE is partitioned into ABD and CE; ABD is partitioned into AB and D; AB is partitioned into A and B; A and CE later merge into ACE.]
• The figure indicates the partitions and the (one) merger that take place.
• Assume one vote per copy.
• Static voting scheme: only partitions ABCDE, ABD, & ACE are allowed
access.
• Majority-based approach: one partition can collect quorums and the
other cannot.
• Partitions ABCDE, ABD, AB, A, and ACE can collect quorums; the
others cannot.
68
68
Majority Approach ...
• Notations used:
• Version Number, VNi: of a replica at a site i is an integer
that counts the number of successful updates to the replica
at i. Initially set to 0.
• Number of replicas updated, RUi: Number of replicas
participating in the most recent update. Initially set to the
total number of replicas.
• Distinguished sites list, DSi,: at i is a variable that stores
IDs of one or more sites. DSi depends on RUi.
• RUi is even: DSi identifies the replica that is greater (as
per the linear ordering) than all the other replicas that
participated in the most recent update at i.
• RUi is odd: DSi is nil.
• RUi = 3: DSi lists the 3 replicas that participated in the
most recent update from which a majority is needed to
allow access to data.
69
69
Majority Approach: Example
• Example:
• 5 replicas of a file stored at sites A,B,C,D, and E. State of the
system is shown in table. Each replica has been updated 3
times, RUi is 5 for all sites. DSi is nil (as RUi is odd and !=
3).
• A B C D E
• VN 3 3 3 3 3
• RU 5 5 5 5 5
• DS - - - - -
• B receives an update request, finds it can communicate only
to A & C. B finds that RU is 5 for the last update. Since
partition ABC has 3 of the 5 copies, B decides that it belongs
to a distinguished partition. State of the system:
• A B C D E
• VN 4 4 4 3 3
• RU 3 3 3 5 5
• DS ABC ABC ABC - -
70
70
Majority Approach:
Example...
• Example...:
• Now, C needs to do an update and finds it can communicate
only to B. Since RUc is 3, it chooses the static voting protocol
and so DS & RU are not updated. System state:
• A B C D E
• VN 4 5 5 3 3
• RU 3 3 3 5 5
• DS ABC ABC ABC - -
• Next, D makes an update, finds it can communicate with
B,C, & E. Latest version in BCDE is 5 with RU = 3. A
majority from DS = ABC is sought and is available (i.e., BC).
RU is now set to 4. RU is even, DS set to B (highest
lexicographical order). System state:
• A B C D E
• VN 4 6 6 6 6
• RU 3 4 4 4 4
• DS ABC B B B B

71
71
Majority Approach:
Example...
• Example...:
• C receives an update, finds it can communicate only with B.
BC has half the sites in the previous partition and has the
distinguished site B (DS is used to break the tie in the case
of even numbers). Update can be carried out in the partition.
Resulting state:
• A B C D E
• VN 4 7 7 6 6
• RU 3 2 2 4 4
• DS ABC B B B B

72
72
Majority-based: Protocol
• Site i receives an update and executes following protocol:
• i issues a Lock_Request to its local lock manager
• Lock granted? : i issues a Vote_Request to all the sites.
• Site j receives the request: j issues a Lock_Request to its
local lock manager. Lock granted? : j sends the values of
VNj, RUj, and DSj to i.
• Based on responses received, i decides whether it belongs to
the distinguished partition procedure.
• i does not belong to distinguished partition? : issues
Release_Lock to local lock manager and Abort to other sites
(which will issue Release_Lock to their local lock manager).
• i belongs to the distinguished partition? : performs the update on
the local copy (a current copy is obtained before the update if the
local copy is not current). i sends a commit message to the
participating sites with the missing updates and the values of VN,
RU, and DS. Issues a Release_Lock request to its local lock manager.
73
73
Majority-based: Protocol
• Site i receives an update and executes following protocol...:
• Site j receives commit message: updates its replica, RU, VN,
& DS, and sends Release_Lock request to local lock
manager.
• Distinguished Partition: Let P denote the set of responding
sites.
• i calculates M (the most recent version in partition), Q (set of
sites containing version M), and N (the number of sites that
participated in the latest update):
• M = max{VNj : j in P}
• Q = {j in P : VNj = M}
• N = RUj, j in Q
• |Q| > N/2 ? : i is a member of distinguished partition.
• |Q| = N/2 ? : tie needs to be broken. Select a j in Q. If DSj in
Q, i belongs to the distinguished partition. (If RUj is even,
DSj contains the highest ordered site). i.e., i is in the
partition containing the distinguished site. 74
74
Majority-based: Protocol
• Distinguished Partition...:
• If N = 3 and if P contains 2 or all 3 sites indicated by DS
variable of the site in Q, i belongs to the distinguished
partition.
• Otherwise, i does not belong to a distinguished partition.
• Update: invoked when a site is ready to commit. Variables
are updated as follows:
• VNi = M + 1
• RUi = cardinality of P (i.e., |P|)
• DSi is updated when N != 3 (since static voting protocol is
used when N = 3).
• DSi = K if RUi is even, where K is the highest ordered site
• DSi = P if RUi = 3
• Majority-based protocol can have deadlocks as it uses locks.
One needs a deadlock detection & resolution mechanism
along with this.
75
75
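A sketch of the distinguished-partition test described above; partition maps each reachable site (the set P) to its (VN, RU, DS) triple, with DS modelled as a set of site IDs (empty when nil). This is illustrative, not the authors' code.

```python
def in_distinguished_partition(partition):
    m = max(vn for vn, _ru, _ds in partition.values())             # latest version M
    q = {s for s, (vn, _ru, _ds) in partition.items() if vn == m}  # sites holding M
    _vn, ru, ds = partition[next(iter(q))]
    n = ru                                       # replicas in the most recent update
    if len(q) > n / 2:
        return True                              # clear majority of the last update
    if len(q) == n / 2:
        return bool(ds & q)                      # tie broken by the distinguished site
    if n == 3:
        # DS lists the 3 participants of the last update; 2 of them must be in P
        return len(ds & set(partition)) >= 2
    return False
```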
Dynamic Vote Reassignment Protocol
• Here, the number of votes assigned to a site changes
dynamically.
• The idea of DVR is categorized in two ways:
• Group Consensus
• The sites in the active group agree upon the new vote
assignment using either a distributed algorithm or by
electing a coordinator.
• Autonomous Reassignment
• Each site uses a view of the system to make decisions
about changing its votes and picking a new vote
value.
• This is carried out through:
• Vote Increasing Protocol
• Vote Decreasing Protocol
• Vote Collecting Protocol
Load Balancing

77
Motivations
• In a locally distributed system, there is a good possibility that
several computers are heavily loaded while others are idle or
lightly loaded
 If we can move jobs around (in other words, distribute the
load more evenly), the overall performance of the system
can be maximized

78
Motivations (cont)

• A distributed scheduler is a resource


management component of a distributed
operating system that focuses on
judiciously and transparently
redistributing the load of the system among
the computers to maximize the overall
performance

79
Issues In Load Distributing

• Load estimation
 Resource queue lengths
 CPU utilization

80
Issues In Load Distributing (cont)
• Classification of Load distributing
algorithms

• Basic function: transfer load (tasks) from


heavily loaded computers to idle or lightly
loaded computers.

• Can be characterized as:


 Static: decisions are hard-wired in the algorithm using
a priori knowledge of the system.

 Dynamic: use system state information (the loads at


nodes), at least in part.

 Adaptive: dynamically changing parameters of the


algorithm to suit the changing system state.
81
Issues In Load Distributing (cont)

• Load balancing vs. load sharing


 Strive to reduce the likelihood of an unshared
state (a state in which one computer lies idle
while at the same time tasks contend for service
at another computer) by transferring tasks to
lightly loaded nodes.

 Load balancing algorithms go a step further by


attempting to equalize loads at all computers

82
Issues In Load Distributing (cont)

• Preemptive vs. Non-preemptive Transfers

 Preemptive task transfers involve the transfer


of a task that is partially executed.
 Non-preemptive task transfers involve the
transfer of a task that has not begun
execution.

83
Components Of A Load Distributing
Algorithm
Typically a Load distributing algorithm has four components
- transfer policy, selection policy, location policy and
Information policy.

• Transfer Policy
 Determines when a node needs to send tasks to other
nodes or can receive tasks from other nodes. When a new
task originates at a node and the load at that node
exceeds a threshold T, the transfer policy decides that the
node is a sender; if the load falls below T, the node can act
as a receiver for remote tasks. See the sketch below.

 Threshold policy

84
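A minimal sketch of such a threshold transfer policy; the threshold value and function name are arbitrary illustrations, not from the slides.

```python
T = 2   # illustrative threshold on the node's CPU queue length

def transfer_role(queue_length):
    # Above the threshold the node acts as a sender of tasks; below it,
    # the node can accept tasks transferred from remote senders.
    if queue_length > T:
        return "sender"
    if queue_length < T:
        return "receiver"
    return "neither"
```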
Components Of A Load Distributing
Algorithm (cont)

• Selection Policy
 Determines which task(s) to transfer
 Newly originated tasks that have caused the node to
become a sender by increasing the load at the node >
threshold
 Estimated average execution time for task > threshold
 Response time will be improved upon transfer
 Overhead incurred by the transfer should be minimal
 The number of location-dependent system calls made
by the selected task should be minimal

85
Components Of A Load Distributing
Algorithm (cont)
• Location Policy
 Find suitable nodes for load sharing
• Information policy
 Deciding when information about the states of other
nodes in the system should be collected, where it should
be collected from, and what information should be
collected.
 Demand-driven: a node collects the state of other
nodes only when it becomes either a sender or
receiver.
 Periodic: nodes exchange load information
periodically.
 State-change-driven: nodes disseminate state
information whenever their state changes by a certain
degree.
86
Stability
• The queuing-theoretic perspective
 A load distributing algorithm is effective under a given set
of conditions if it improves the performance relative to
that of a system not using load distribution

• Algorithmic perspective
 An algorithm is unstable if it can perform fruitless actions
indefinitely with finite probability

87
Load Distributing Algorithms

• Sender-Initiated Algorithms
• Eager, Lazowska, and Zahorjan
• Receiver-Initiated Algorithms
• Symmetrically Initiated Algorithms
• Adaptive Algorithms

88
Sender-Initiated Algorithms

• Load distributing activity is initiated by an


overloaded node (sender) that attempts to send a
task to an under loaded node (receiver).
• Transfer policy: uses a threshold policy based on
CPU queue length.
• Selection policy: consider only newly arrived tasks
for transfer.
• Location policy:
 Random
 Threshold
 Shortest

89
Sender-Initiated Algorithms (cont.)

• Random:
• A task is simply transferred to a node selected
at random, with no information exchange
between the nodes to aid in decision making.
• Problem: useless task transfers can occur when a
task is transferred to a node that is already
heavily loaded.

90
Sender-Initiated Algorithms (cont.)

• Threshold:
• The problem of random is overcome
• The sender selects a receiver with the help of polling
• If a suitable node is found, the job is transferred to it
• Number of polls limited by a parameter called
PollLimit
• If no receiver is found then the node (sender)
has to execute the job itself
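A sketch of this polling loop, where poll(node) stands in for asking a node whether it is below threshold (both names and the PollLimit value are assumptions):

```python
import random

POLL_LIMIT = 5   # illustrative bound on the number of polls

def find_receiver(nodes, poll, poll_limit=POLL_LIMIT):
    # Poll up to PollLimit randomly chosen nodes; transfer the task to the
    # first node that reports it is a receiver, else execute locally (None).
    for node in random.sample(nodes, min(poll_limit, len(nodes))):
        if poll(node):
            return node
    return None
```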
Sender-Initiated Algorithms (cont.)
• Shortest:
• Choose the best receiver for a task.
• A number of nodes (= PollLimit) are selected at
random and are polled to determine their queue
length -> choose the node with the shortest
queue length as the destination for transfer
unless its queue length >= T
• The destination node will execute the task
regardless of its queue length at the time of
arrival of the transferred task.

92
Sender-initiated load sharing with
threshold location policy
Sender-Initiated Algorithms (cont.)
• Information policy: demand-driven type.

• Stability: sender-initiated algorithms can cause system
instability at high system loads, where no node is likely to
be lightly loaded, so the probability that a sender will
succeed in finding a receiver node is very low.

94
Receiver-Initiated Algorithms
• In receiver initiated algorithm, the load distributing
activity is initiated from an under loaded node
(receiver) that is trying to obtain a task from an
overloaded node (sender).

• Transfer policy: a threshold policy based on CPU queue
length. The transfer policy is triggered when a task departs.
If the local queue length falls below the threshold T, the
node is identified as a receiver for obtaining a task from a
node (sender) to be determined by the location policy.

• Selection policy: any of the approaches discussed


earlier.

• Location policy:
95
Receiver-Initiated Algorithms (cont.)

96
Receiver-Initiated Algorithms (cont.)

• Information policy: demand-driven type.


• Stability: do not cause system instability.

• A drawback: most transfers are preemptive and therefore
expensive. Sender-initiated algorithms, by contrast, can use
non-preemptive transfers because they can initiate load
distributing activity as soon as a new task arrives.

97
Comparison of Sender-Initiated and
Receiver-Initiated Algorithms

98
Comparison of Sender-Initiated and
Receiver-Initiated Algorithms

99
Symmetrically Initiated
Algorithms
 Both senders and receivers search for
receivers and senders.
 Advantages and disadvantages of both
sender and receiver initiated
algorithms.
 Above average algorithm.

100
The above average algorithm

 Proposed by Krueger and Finkel


 Try to maintain the load at each node
within an acceptable range of system
average

101
Transfer Policy
 Use two adaptive thresholds, equidistant from the
node’s estimate of the average load across all nodes
(e.g., if the average load is 2, the lower threshold is 1
and the upper threshold is 3).
 A node whose load is greater than the upper threshold is
a sender, and a node whose load is less than the lower
threshold is a receiver.
 Nodes that have loads between these thresholds lie
within the acceptable range, so they are neither
senders nor receivers.

102
Location Policy

 The location policy has the following


two components:
 Sender-initiated component
 Receiver-initiated component

103
Sender-initiated component
 A sender (a node whose load is greater than the acceptable
range) broadcasts a TooHigh message, sets a TooHigh timeout
alarm, and listens for an Accept message until the timeout
expires.
 A receiver (a node whose load is less than the acceptable
range) that receives a TooHigh message cancels its TooLow
timeout, sends an Accept message to the source of the TooHigh
message, increases its load value, and sets an AwaitingTask
timeout. Increasing the load value prevents the receiver from
over-committing itself to accepting remote tasks. If the
AwaitingTask timeout expires without a task arriving, the
load value at the receiver is decreased.
 On receiving an Accept message, if the node is still a sender, it
chooses the best task to transfer and transfers it to the node
that responded.

104
Sender-initiated component
 When a sender that is waiting for a response to its
TooHigh message receives a TooLow message, it sends a
TooHigh message to the node that sent the TooLow
message. This TooHigh message is handled by the receiver
as described under the receiver-initiated component.

 On expiration of the TooHigh timeout, if no Accept
message has been received, the sender infers that its
estimate of the average system load is too low (since no
node has a load much lower). To correct this, the sender
broadcasts a ChangeAverage message to increase the
average load estimate at the other nodes.

105
Sender-initiated component

[Figure: sender-receiver exchange. The sender broadcasts TooHigh and waits for an Accept; if no Accept arrives before the timeout, it broadcasts ChangeAverage. A receiver that gets the TooHigh message (having sent TooLow) replies Accept, increases its load value, and sets an AwaitingTask timeout; on Accept the sender transfers a task.]

106
Receiver-initiated component
 A node, on becoming a receiver, broadcasts a
TooLow message, sets a TooLow timeout, and
starts listening for a TooHigh message.

 If a TooHigh message is received, the receiver


performs the same actions that it does under
sender-initiated negotiation

 If the TooLow timeout expires before receiving


any TooHigh message, the receiver broadcasts a
ChangeAverage message to decrease the average
load estimate at the other nodes

107
Selection and Information
Policy
 Selection policy: this algorithm can
make use of any of the approaches
discussed earlier.
 Information policy: demand-driven.

108
Adaptive Algorithms

 A stable symmetrically initiated


algorithm
 A stable sender initiate algorithm

109
A stable symmetrically initiated
algorithm
 Instability in the previous algorithms is caused by
indiscriminate polling by the sender’s negotiation component.

 Utilize the information gathered during polling to


classify the nodes in the system as either
Sender/overloaded, Receiver/underloaded, or OK.

 The knowledge concerning the state of node is


maintained by a data structure at each node: a
sender list, a receiver list, and an OK list.

 Initially, each node assumes that every other node


is a receiver.
110
Transfer policy
 A threshold policy where decisions are
based on CPU queue length.
 Trigger when a new task originates or
when a task departs.
 Two threshold values: a lower threshold
(LT), an upper threshold (UT).
 A node is said to be a sender if its queue
length > UT, a receiver if its queue length <
LT, and OK if LT ≤ node’s queue length ≤
UT.
111
Location policy

 Sender initiated component


 Receiver initiated component

112
Sender initiated component
pollis
is receiver?
receiver?

transfer task
ID_R ID_R ID_C
ID_R ID_S
ID_C
Inform its status
Remove ID_S status
Sender
Receiver
or OK

OK SEND RECV OK SEND RECV

Sender Receiver
113
Receiver initiated component
[Figure: receiver-initiated component of the stable symmetrically initiated algorithm. The receiver polls a node from its senders list and asks whether it is still a sender; if so, a task is transferred and the polled node reports its status after the transfer, otherwise it informs the receiver of its current status. Both nodes then move the other's ID into the appropriate senders / receivers / OK list.]
114
Selection and Information
Policy
 Selection policy:
 The sender initiated component
considers only newly arrived tasks for
transfer.
 The receiver initiated component can
make use of any of the approaches
discussed earlier.
 Information policy: demand-driven.
115
A stable sender initiated
algorithm
 Two desirable properties:
 It does not cause instability
 Load sharing is due to non-preemptive transfer
(which are cheaper) only.
 This algorithm uses the sender initiated
load sharing component of the stable
symmetrically initiated algorithm as is, but
has a modified receiver initiated component
to attract the future non-preemptive task
transfers from sender nodes.
116
A stable sender initiated
algorithm
 The data structure (at each node) of the stable
symmetrically initiated algorithm is augmented by
an array called the statevector.
 The statevector is used by each node to keep track
of which list (senders, receivers, or OK) it belongs to
at all the other nodes in the system.
 When a sender polls a selected node, the sender’s
statevector is updated to reflect that the sender
now belongs to the senders list at the selected node;
the polled node updates its statevector, based on the
reply it sent to the sender, to reflect which list
it will belong to at the sender.
117
A stable sender initiated
algorithm
 The receiver initiated component is replaced by the
following protocol:
 When a node becomes a receiver, it informs all the nodes
that are misinformed about its current state. The
misinformed nodes are those whose receivers lists do
not contain the receiver’s ID.
 The statevector at the receiver is then updated to reflect
that it now belongs to the receivers list at all those nodes
that were informed of its current state.
 By this technique, the algorithm avoids the receivers
sending broadcast messages to inform other nodes that
they are receivers.
 No preemptive transfers of partly executed tasks
here.
118
Performance comparision

 Symmetrically initiated load sharing


 Stable load sharing algorithms
 Performance under heterogeneous
workloads

119
