Unit 2DC
Logical Time: Physical Clock Synchronization: NTP – A Framework for a System of Logical
Clocks – Scalar Time – Vector Time; Message Ordering and Group Communication:
Message Ordering Paradigms – Asynchronous Execution with Synchronous Communication –
Synchronous Program Order on Asynchronous System – Group Communication – Causal
Order – Total Order; Global State and Snapshot Recording Algorithms: Introduction –
System Model and Definitions – Snapshot Algorithms for FIFO Channels.
3. Offset Clock offset is the difference between the time reported by a clock and the real
time. The offset of the clock Ca is given by Ca(t)−t. The offset of clock Ca relative to Cb at
time t ≥ 0 is given by Ca(t)−Cb(t).
4. Skew The skew of a clock is the difference in the frequencies of the clock and the perfect
clock. The skew of a clock Ca relative to clock Cb at time t is Ca'(t)−Cb'(t).
If the skew is bounded by ρ, then as per Eq.(3.1), clock values are allowed to diverge at a rate
in the range of 1−ρ to 1+ρ.
5. Drift (rate) The drift of clock Ca is the second derivative of the clock value with respect to
time, namely, Ca''(t). The drift of clock Ca relative to clock Cb at time t is Ca''(t)−Cb''(t).
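A small numeric illustration of these definitions (the sample clock functions and helper names below are illustrative, not from the notes): offset is computed directly, and skew is approximated from numerical derivatives.

def offset(clock, t):                         # offset of clock Ca: Ca(t) - t
    return clock(t) - t

def relative_offset(ca, cb, t):               # offset of Ca relative to Cb: Ca(t) - Cb(t)
    return ca(t) - cb(t)

def skew(ca, cb, t, dt=1e-6):                 # Ca'(t) - Cb'(t), approximated numerically
    return (ca(t + dt) - ca(t)) / dt - (cb(t + dt) - cb(t)) / dt

fast = lambda t: 1.001 * t + 2.0              # a clock running 0.1% fast and 2 s ahead
perfect = lambda t: t                         # an ideal clock
print(offset(fast, 100.0))                    # ~2.1
print(relative_offset(fast, perfect, 100.0))  # ~2.1
print(skew(fast, perfect, 100.0))             # ~0.001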
Clock inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC.
However, due to clock inaccuracy, a timer (clock) is said to be working within its
specification if
1 − ρ ≤ dC/dt ≤ 1 + ρ,    (3.1)
where the constant ρ is the maximum skew rate.
Each NTP message includes the latest three timestamps T1, T2, and T3, while T4
is determined upon arrival.
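A hedged sketch of the standard NTP offset and round-trip delay estimates computed from the four timestamps T1–T4 (T1 = client send, T2 = server receive, T3 = server send, T4 = client receive); the function name is illustrative.

def ntp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated offset of the client clock
    delay = (t4 - t1) - (t3 - t2)          # estimated round-trip network delay
    return offset, delay

# Example: the client clock is about 5 units behind the server's.
print(ntp_offset_delay(t1=100, t2=107, t3=108, t4=105))   # (5.0, 4)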
Figure: The behavior of fast, slow, and perfect clocks with respect to UTC.
R2: Each message piggybacks the clock value of its sender at sending time. When a process pi receives
a message with timestamp Cmsg, it executes the following actions:
1. Ci := max(Ci, Cmsg);
2. execute R1;
3. deliver the message.
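A minimal sketch of the scalar (Lamport) clock rules, assuming the usual rule R1 (Ci := Ci + d before each event) with d = 1; class and method names are illustrative.

class ScalarClock:
    def __init__(self):
        self.c = 0                     # local clock C_i

    def tick(self):                    # R1: before any event, C_i := C_i + d (d = 1 here)
        self.c += 1
        return self.c

    def on_send(self):
        # A send is an event: apply R1, then piggyback the clock value on the message.
        return self.tick()

    def on_receive(self, c_msg):
        # R2: C_i := max(C_i, C_msg); execute R1; then deliver the message.
        self.c = max(self.c, c_msg)
        return self.tick()

p1, p2 = ScalarClock(), ScalarClock()
ts = p1.on_send()                      # p1's send event gets timestamp 1
print(p2.on_receive(ts))               # p2's receive event gets max(0, 1) + 1 = 2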
Figure 3.1 shows the evolution of scalar time with d=1.
Event counting
If the increment value d is always 1, then an event e with timestamp h has at least h−1 events
that happened before it; that is, h−1 is the minimum number of events that were produced before e.
In the figure, five events precede event b on the longest causal path ending at b.
Figure: Five events precede event b on the longest causal path ending at b
No strong consistency
The system of scalar clocks is not strongly consistent; that is, for two events ei and ej, C(ei) <
C(ej) ⇏ ei → ej.
In the figure, the third event of process P1 has a smaller scalar timestamp than the third event of process P2.
Linear Extension
A linear extension of a partial order (E, ≺) is a linear ordering of E that is consistent with the
partial order: if two events are ordered in the partial order, they are also ordered in the
linear order. It can be viewed as projecting all the events from the different processes onto a
single time axis.
Dimension
The dimension of a partial order is the minimum number of linear extensions whose
intersection gives exactly the partial order.
Example:
• The timestamp of an event is the value of the vector clock of its process when the event is executed.
• Figure shows an example of the progress of vector clocks with the increment value d = 1.
• Initially, a vector clock is [0, 0, 0, ...., 0].
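A minimal sketch of vector clock maintenance with d = 1, following the usual rules (increment one's own component on an event; take the component-wise maximum on receive); names are illustrative.

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n               # initially [0, 0, ..., 0]

    def tick(self):                    # internal or send event: count it locally
        self.v[self.pid] += 1
        return list(self.v)            # the event's timestamp is a copy of the vector

    def on_receive(self, v_msg):
        # Component-wise maximum with the piggybacked vector, then count the event.
        self.v = [max(a, b) for a, b in zip(self.v, v_msg)]
        return self.tick()

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
ts = p0.tick()                         # p0's send event: [1, 0]
print(p1.on_receive(ts))               # p1's receive event: [1, 1]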
An asynchronous execution (or A-execution) is an execution (E, ≺) for which the causality
relation is a partial order.
Due to the physical properties of the medium, a logical link may be formed as a composite
of physical links, and multiple paths may exist between the two end points of the logical
link.
Figure 6.1: Illustrating FIFO and non-FIFO executions. (a) An A-execution that is not a
FIFO execution. (b) An A-execution that is also a FIFO execution.
FIFO executions
A FIFO logical channel can be created over a non-FIFO channel by using a separate
numbering scheme to sequence the messages on each logical channel.
The sender assigns and appends a <sequence_num, connection_id> tuple to each
message.
The receiver uses a buffer to order the incoming messages as per the sender’s
sequence numbers, and accepts only the “next” message in sequence.
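A minimal sketch of this sequencing scheme (the class names and tuple layout are assumptions): the sender tags each message with a sequence number, and the receiver buffers out-of-order arrivals and accepts only the "next" message in sequence.

class FifoSender:
    def __init__(self, connection_id):
        self.connection_id = connection_id
        self.next_seq = 0

    def send(self, payload):
        tagged = (self.next_seq, self.connection_id, payload)
        self.next_seq += 1
        return tagged                   # handed to the (possibly reordering) channel

class FifoReceiver:
    def __init__(self):
        self.expected = 0
        self.buffer = {}                # out-of-order messages keyed by sequence number

    def receive(self, tagged):
        seq, _conn, payload = tagged
        self.buffer[seq] = payload
        delivered = []
        while self.expected in self.buffer:      # accept only the "next" message
            delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
        return delivered

s, r = FifoSender(connection_id=7), FifoReceiver()
m0, m1 = s.send("a"), s.send("b")
print(r.receive(m1))                    # []  (m1 arrived out of order, buffered)
print(r.receive(m0))                    # ['a', 'b']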
If two send events s and s′ are related by the causality ordering (not physical time ordering),
then a causally ordered execution requires that their corresponding receive events r and
r′ occur in the same order at all common destinations.
If s and s′ are not related by causality, then CO is vacuously satisfied.
Causal order is used in applications that update shared data, distributed shared memory,
or fair resource allocation.
The delayed message m is then given to the application for processing. The event of an
application processing an arrived message is referred to as a delivery event.
No message is overtaken by a chain of messages between the same (sender, receiver)
pair.
Figure: CO executions.
If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2,
deliver_d(m1) ≺ deliver_d(m2) must be satisfied.
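A small sketch that checks this CO condition on a recorded trace; the representation (a list of causally related send pairs and per-destination delivery sequences) and all names are illustrative assumptions.

def satisfies_co(causal_sends, delivery):
    # causal_sends: pairs (m1, m2) with send(m1) ≺ send(m2)
    # delivery[d]: the delivery sequence (list of message ids) at destination d
    for m1, m2 in causal_sends:
        for seq in delivery.values():
            if m1 in seq and m2 in seq and seq.index(m1) > seq.index(m2):
                return False            # m2 delivered before m1 at a common destination
    return True

delivery = {"P3": ["m1", "m2"], "P4": ["m2", "m1"]}
print(satisfies_co([("m1", "m2")], delivery))   # False: P4 violates CO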
2. Empty Interval Execution: An execution (E, ≺) is an empty-interval (EI)
execution if, for each pair of events (s, r) ∈ T, the open interval set
{x ∈ E | s ≺ x ≺ r} in the partial order is empty.
S2: If (s, r) ∈ T, then for all x ∈ E, [(x ≪ s ⟺ x ≪ r) and (s ≪ x ⟺ r ≪ x)].
S3:
Examples: The executions in Figure 6.5(a)–(c), shown using timing diagrams, will deadlock if run with
synchronous primitives.
Figure 6.5 Illustrations of asynchronous executions and of crowns. (a) Crown of size 2. (b)
Another crown of size 2. (c) Crown of size 3.
An execution can be modeled to give a total order that extends the partial order (E,
≺).
In the non-separated linear extension, if the adjacent send event and its corresponding
receive event are viewed atomically, then that pair of events shares a common past and a
common future with each other.
Crown
The crown is ⟨(s1, r1), (s2, r2)⟩, as we have s1 ≺ r2 and s2 ≺ r1. Cyclic dependencies may
exist in a crown. The crown criterion states that an A-computation is RSC, i.e., it can be realized
on a system with synchronous communication, if and only if it contains no crown.
An execution (E, ≺) is RSC if and only if there exists a mapping from E to T (scalar
timestamps) such that
A receive call can receive a message from any sender who has sent a message, if the
expected sender is not specified.
Multiple send and receive calls which are enabled at a process can be executed in
an interchangeable order.
There is no semantic dependency between the send and the immediately following
receive at each of the processes. If the receive call at one of the processes can be
scheduled before the send call, then there is no deadlock.
Rendezvous
For the receive command, the sender must be specified. However, multiple receive
commands can exist. A type check on the data is implicitly performed.
Scheduling involves pairing of matching send and receive commands that are both
enabled. The communication events for the control messages under the covers do not
alter the partial order of the execution.
If multiple interactions are enabled, a process chooses one of them and tries to
synchronize with the partner process. The problem reduces to one of scheduling messages
satisfying the following constraints:
2. A send command, once enabled, remains enabled until it completes, i.e., it is not
possible that a send command gets disabled before the send is executed.
4. Each process attempts to schedule only one send event at any time.
The message types used are M, ack(M), request(M), and permission(M). Execution
events in the synchronous execution are only the send of the message M and the receive of the
message M. The send and receive events of the other message types – ack(M), request(M), and
permission(M) – are control events that occur under the covers. The messages request(M), ack(M), and
permission(M) use M's unique tag; the message M itself is not included in these messages.
(Message types: M, ack(M), request(M), permission(M))
Pi executes send(M) and blocks until it receives ack(M) from Pj. The send event
SEND(M) now completes.
Any message M′ (from a higher priority process) and any request(M′) for
synchronization (from a lower priority process) received during the blocking period
are queued.
(2c) When the permission(M) arrives, Pi knows that its partner Pj is synchronized, and Pi
executes send(M). The SEND(M) now completes.
When Pi is unblocked, it dequeues the next (if any) message from the queue and
processes it as a message arrival (as per rule 3 or 4).
Figure: Messages used to implement synchronous order. Pi has higher priority than Pj. (a) Pi
issues SEND(M). (b) Pj issues SEND(M).
Complexity:
This algorithm takes O(n²).
The Kshemkalyani–Singhal optimal algorithm
An optimal CO algorithm stores in local message logs, and propagates on messages,
information of the form "d is a destination of M" about a message M sent in the causal past, as
long as and only as long as:
Propagation Constraint I: it is not known that the message M is delivered to d, and
Propagation Constraint II: it is not known that a message has been sent to d in the causal
future of Send(M), and hence it is not guaranteed using a reasoning based on transitivity that
the message M will be delivered to d in CO.
The Propagation Constraints also imply that if either (I) or (II) is false, the information
“d ∈ M.Dests” must not be stored or propagated, even to remember that (I) or (II) has been
falsified:
not in the causal future of Deliverd(Mi,a), and
not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no other message sent
causally between Mi,a and Mk,c to the same destination d.
Procedure RCV is executed atomically, except for a possible interruption in line 2a where a non-blocking wait is required to meet the Delivery Condition.
The data structures maintained are sorted row–major and then column–major:
1. Explicit tracking:
Tracking of (source, timestamp, destination) information for messages (i) not known to be
delivered and (ii) not guaranteed to be delivered in CO is done explicitly, using the
l.Dests field of entries in local logs at nodes and the o.Dests field of entries in messages.
Sets li,a.Dests and oi,a.Dests contain explicit information of destinations to which Mi,a is
not guaranteed to be delivered in CO and is not known to be delivered.
The information about d ∈ Mi,a.Dests is propagated up to the earliest events on all causal
paths from (i, a) at which it is known that Mi,a is delivered to d or is guaranteed to be
delivered to d in CO.
2. Implicit tracking:
Tracking of messages that are either (i) already delivered, or (ii) guaranteed to be
delivered in CO, is performed implicitly.
The information about messages (i) already delivered or (ii) guaranteed to be delivered
in CO is deleted and not propagated because it is redundant as far as enforcing CO is
concerned.
It is useful in determining what information being carried in other messages and
stored in logs at other nodes has become redundant and thus can be purged.
The semantics are implicitly stored and propagated. This information about messages
that are (i) already delivered or (ii) guaranteed to be delivered in CO is tracked
without explicitly storing it.
The algorithm derives it from the existing explicit information about messages (i) not
known to be delivered and (ii) not guaranteed to be delivered in CO, by examining
only oi,a.Dests or li,a.Dests, which is a part of the explicit information.
Multicast M4,3
At event (4, 3), the information P6 ∈ M5,1.Dests in Log4 is propagated on multicast M4,3 only
to process P6 to ensure causal delivery using the Delivery Condition. The piggybacked
information on message M4,3 sent to process P3 must not contain this information because of
constraint II. As long as any future message sent to P6 is delivered in causal order w.r.t.
M4,3 sent to P6, it will also be delivered in causal order w.r.t. M5,1. And as M5,1 is already
Processing at P6
When message M5,1 is delivered to P6, only M5,1.Dests = {P4} is added to Log6. Further,
P6 propagates only M5,1.Dests = {P4} on message M6,2, and this conveys the current
implicit information that M5,1 has been delivered to P6 by its very absence in the explicit
information.
When the information P6 ∈ M5,1.Dests arrives on M4,3, piggybacked as M5,1.Dests = {P6},
it is used only to ensure causal delivery of M4,3 using the Delivery
Condition, and is not inserted in Log6 (constraint I) – further, the presence of M5,1.Dests
= {P4} in Log6 implies the implicit information that M5,1 has already been
delivered to P6. Also, the absence of P4 in M5,1.Dests in the explicit
piggybacked information implies the implicit information that M5,1 has been
delivered or is guaranteed to be delivered in causal order to P4, and, therefore,
M5,1.Dests is set to ∅ in Log6.
When the information P6 ∈ M5,1.Dests arrives on M5,2, piggybacked as M5,1.Dests
= {P4, P6}, it is used only to ensure causal delivery of M5,2 using the Delivery
Condition, and is not inserted in Log6 because Log6 contains M5,1.Dests = ∅,
which gives the implicit information that M5,1 has been delivered or is
guaranteed to be delivered in causal order to both P4 and P6.
Processing at P1
When M2,2 arrives carrying piggybacked information M5,1.Dests = {P6}, this
(new) information is inserted in Log1.
When M6,2 arrives with piggybacked information M5,1.Dests = {P4}, P1 learns the
implicit information that M5,1 has been delivered to P6 by the very absence of the explicit
information P6 ∈ M5,1.Dests in the piggybacked information, and hence marks the
information P6 ∈ M5,1.Dests for deletion from Log1. Simultaneously, M5,1.Dests =
{P6} in Log1 implies the implicit information that M5,1 has been delivered or is
guaranteed to be delivered in causal order to P4. Thus, P1 also learns that the explicit
piggybacked information M5,1.Dests = {P4} is outdated. M5,1.Dests in Log1 is set to
∅.
The information “P6 ∈ M5,1.Dests” piggybacked on M2,3, which arrives at P1, is
inferred to be outdated using the implicit knowledge derived from M5,1.Dests = ∅
in Log1.
Total order
For each pair of processes Pi and Pj and for each pair of messages Mx and My that are
delivered to both the processes, Pi is delivered Mx before My if and only if Pj is
delivered Mx before My.
Example
The execution in Figure 6.11(b) does not satisfy total order. Even
if the message m did not exist, total order would not be satisfied. The execution
in Figure 6.11(c) satisfies total order.
Each process sends the message it wants to broadcast to a centralized process, which
relays all the messages it receives to every other process over FIFO channels.
Complexity: Each message transmission takes two message hops and exactly n messages
in a system of n processes.
Drawbacks: A centralized algorithm has a single point of failure and congestion, and is
not an elegant solution.
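A minimal sketch of this centralized scheme (the Sequencer class and its queues are illustrative assumptions): the relay appends each incoming message to every other process's FIFO channel, so all processes see the same relative order of messages.

from collections import deque

class Sequencer:
    def __init__(self, process_ids):
        # One FIFO channel (queue) from the sequencer to every process.
        self.fifo_to = {p: deque() for p in process_ids}

    def broadcast(self, sender, message):
        # Relay the message to every other process in arrival order.
        for pid, channel in self.fifo_to.items():
            if pid != sender:
                channel.append((sender, message))

seq = Sequencer([1, 2, 3])
seq.broadcast(1, "m1")
seq.broadcast(2, "m2")
# Every process sees the same relative order of m1 and m2 on its incoming FIFO channel.
print(list(seq.fifo_to[3]))   # [(1, 'm1'), (2, 'm2')]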
Sender side
Phase 1
In the first phase, a process multicasts the message M with a locally unique tag and
the local timestamp to the group members.
Phase 2
The sender process awaits a reply from all the group members who respond with a
tentative proposal for a revised timestamp for that message M.
The await call is non-blocking.
Phase 3
The process multicasts the final timestamp to the group.
Phase 2
The receiver sends the revised timestamp back to the sender. The receiver
then waits in a non-blocking manner for the final timestamp.
Phase 3
The final timestamp is received from the multicaster. The corresponding
message entry in temp_Q is identified using the tag, and is marked as
deliverable after the revised timestamp is overwritten by the final
timestamp.
The queue is then resorted using the timestamp field of the entries as the
key. As the queue is already sorted except for the modified entry for the
message under consideration, that message entry has to be placed in its
sorted position in the queue.
If the message entry is at the head of the temp_Q, that entry, and all
consecutive subsequent entries that are also marked as deliverable, are
dequeued from temp_Q and enqueued in deliver_Q.
Complexity
This algorithm uses three phases, and, to send a message to n − 1 processes, it uses
3(n − 1) messages and incurs a delay of three message hops.
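A receiver-side sketch of the three-phase scheme described above. It follows the text in that each receiver proposes a revised timestamp no smaller than anything it has seen, and the sender's final timestamp is the maximum of all proposals; the exact proposal rule, the tie-breaking, and all names are assumptions.

class Receiver:
    def __init__(self):
        self.temp_q = []        # entries: [timestamp, tag, message, deliverable?]
        self.deliver_q = []
        self.highest_ts = 0     # highest timestamp proposed or seen so far

    def on_revise_ts(self, tag, message, sender_ts):
        # Phase 2 (receiver): propose a timestamp and buffer the message as undeliverable.
        proposed = max(sender_ts, self.highest_ts + 1)
        self.highest_ts = proposed
        self.temp_q.append([proposed, tag, message, False])
        return proposed                               # returned as PROPOSED_TS

    def on_final_ts(self, tag, final_ts):
        # Phase 3 (receiver): overwrite the tentative timestamp, mark deliverable,
        # re-sort, and move any deliverable prefix to deliver_q.
        for entry in self.temp_q:
            if entry[1] == tag:
                entry[0], entry[3] = final_ts, True
        self.highest_ts = max(self.highest_ts, final_ts)
        self.temp_q.sort(key=lambda e: e[0])
        while self.temp_q and self.temp_q[0][3]:
            self.deliver_q.append(self.temp_q.pop(0)[2])

def final_timestamp(proposals):
    # Phase 3 (sender): the final timestamp is the maximum of all proposals.
    return max(proposals)

C, D = Receiver(), Receiver()
pA = [C.on_revise_ts("A", "mA", 7), D.on_revise_ts("A", "mA", 7)]
pB = [C.on_revise_ts("B", "mB", 9), D.on_revise_ts("B", "mB", 9)]
for r in (C, D):
    r.on_final_ts("B", final_timestamp(pB))
    r.on_final_ts("A", final_timestamp(pA))
print(C.deliver_q, D.deliver_q)   # both receivers deliver mA before mB

Both receivers end up delivering the two concurrent multicasts in the same order, which is the total order property the algorithm is designed to enforce.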
Example An example execution to illustrate the algorithm is given in Figure 6.14.
Here, A and B multicast to a set of destinations and C and D are the common
destinations for both multicasts.
Figure 6.14a. The main sequence of steps is as follows:
1. A sends a REVISE_TS(7) message, having timestamp 7. B sends a
REVISE_TS(9) message, having timestamp 9.
2. C receives A’s REVISE_TS(7), enters the corresponding message in temp_Q,
and marks it as undeliverable; priority = 7. C then sends PROPOSED_TS(7)
message to A.
3. D receives B’s REVISE_TS(9), enters the corresponding message in temp_Q,
and marks it as undeliverable; priority = 9. D then sends PROPOSED_TS(9)
message to B.
4. C receives B’s REVISE_TS(9), enters the corresponding message in
temp_Q, and marks it as undeliverable; priority = 9. C then sends
PROPOSED_TS(9) message to B.
5. D receives A’s REVISE_TS(7), enters the corresponding message in temp_Q,
and marks it as undeliverable; priority = 10. D assigns a tentative timestamp
value of 10, which is greater than all of the timestamps on REVISE_TSs seen
so far, and then sends PROPOSED_TS(10) message to A.
The state of the system is as shown in the figure.
• Figure 6.14(b) The main steps are as follows:
6. When A receives PROPOSED_TS(7) from C and PROPOSED_TS(10)
from D, it computes the final timestamp as max(7, 10) = 10, and sends
FINAL_TS(10) to C and D.
7. When B receives PROPOSED_TS(9) from C and PROPOSED_TS(9) from
D, it computes the final timestamp as max(9, 9)= 9, and sends
FINAL_TS(9) to C and D.
8. C receives FINAL_TS(10) from A, updates the corresponding entry in
temp_Q with the timestamp, resorts the queue, and marks the message as
deliverable. As the message is not at the head of the queue, and some entry
ahead of it is still undeliverable, the message is not moved to delivery_Q.
9. D receives FINAL_TS(9) from B, updates the corresponding entry in
temp_Q by marking the corresponding message as deliverable, and resorts
the queue. As the message is at the head of the queue, it is moved to
delivery_Q.
10. When C receives FINAL_TS(9) from B, it will update the corresponding
entry in temp_Q by marking the corresponding message as deliverable. As
the message is at the head of the queue, it is moved to the delivery_Q, and
the next message (of A), which is also deliverable, is also moved to the
delivery_Q.
11. When D receives FINAL_TS(10) from A, it will update the corresponding entry in temp_Q by
marking the corresponding message as deliverable. As the message is at the head of the queue, it
is moved to the delivery_Q.
2.13 SYSTEM MODEL AND DEFINITIONS
The system consists of a collection of n processes, p1, p2,…, pn, that are connected
by channels.
There is no globally shared memory; processes communicate solely by passing
messages (send and receive) asynchronously, i.e., messages are delivered reliably but
with finite and arbitrary time delay.
There is no physical global clock in the system.
The system can be described as a directed graph in which vertices represent processes
and edges represent unidirectional communication channels.
Let Cij denote the channel from process pi to process pj .
Processes and channels have states associated with them.
Process State: the contents of processor registers, stacks, local memory, etc.;
it depends on the local context of the distributed application.
Channel State of Cij: denoted SCij, it is the set of messages in transit in the channel.
The actions performed by a process are modeled as three types of events,
o internal events – affect the state of the process.
o message send events, and
o message receive events.
For a message mij that is sent by process pi to process pj, let send(mij) and
rec(mij) denote its send and receive events, respectively; these events affect the
state of the channel Cij.
The events at a process are linearly ordered by their order of occurrence.
At any instant, the state of process pi, denoted by LSi, is a result of the sequence of all
the events executed by pi up to that instant.
Events to the left of the cut are referred to as PAST events and those to the right as FUTURE events.
A consistent global state corresponds to a cut in which every message
received in the PAST of the cut has been sent in the PAST of that cut. Such
a cut is known as a consistent cut.
Example: cut C2 in the above figure.
All the messages that cross the cut from the PAST to the FUTURE are
captured in the corresponding channel state.
If a message crosses the cut from the FUTURE to the PAST, the cut is inconsistent. Example: cut C1.
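A small sketch that mechanizes this consistency check; a cut is represented by the number of events already executed at each process, and each message records the event indices of its send and receive (all names are illustrative assumptions).

def is_consistent_cut(cut, messages):
    for m in messages:
        sent_in_past = m["send_index"] <= cut[m["sender"]]
        recv_in_past = m["recv_index"] <= cut[m["receiver"]]
        if recv_in_past and not sent_in_past:
            return False     # a message crosses from FUTURE to PAST: inconsistent
    return True

msgs = [{"sender": "P1", "send_index": 3, "receiver": "P2", "recv_index": 2}]
print(is_consistent_cut({"P1": 2, "P2": 2}, msgs))   # False: received in PAST, sent in FUTURE
print(is_consistent_cut({"P1": 3, "P2": 2}, msgs))   # True: send and receive both in PAST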
o All processes take their local snapshots at that instant in the global
time.
o The snapshot of channel Cij includes all the messages that process
pj receives after taking the snapshot and whose timestamp is smaller
than the time of the snapshot.
However, a global physical clock is not available in a distributed system. Hence
the following two issues need to be addressed to record a consistent global snapshot.
I1: How to distinguish the messages to be recorded in the snapshot
from those not to be recorded?
Any message that is sent by a process before recording its snapshot must be recorded
in the global snapshot (from C1).
Any message that is sent by a process after recording its snapshot must not be
recorded in the global snapshot (from C2).
I2: How to determine the instant when a process takes its snapshot?
A process pj must record its snapshot before processing a message mij that was sent
by process pi after recording its snapshot.
These algorithms use two types of messages: computation messages and control
messages. The former are exchanged by the underlying application and the latter are
exchanged by the snapshot algorithm.
2.14 SNAPSHOT ALGORITHMS FOR FIFO CHANNELS
Each distributed application has a number of processes running on different
physical servers. These processes communicate with each other through messaging
channels.
A snapshot captures the local states of each process along with the state of each communication channel.
After pi has taken its snapshot and sent a marker, process pj, on receiving the marker, must
record its snapshot if it has not done so earlier, and must record the state of the channel on
which the marker was received. This addresses issue I2.
The algorithm
The algorithm is initiated by any process by executing the marker sending rule.
The algorithm terminates after each process has received a marker on all of its
incoming channels.
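A minimal single-threaded simulation sketch of the marker-based recording rules described here: the initiator records its state and sends a marker on each outgoing channel; a process records its state on the first marker it receives; and an incoming channel is recorded until a marker arrives on it. All class, method, and message names are illustrative assumptions.

from collections import deque, defaultdict

class Process:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers              # peers assumed to have channels both to and from self
        self.state = 0                  # toy local state (e.g., a counter)
        self.recorded_state = None
        self.chan_state = {}            # incoming channel -> messages recorded as in transit
        self.recording = {}             # incoming channel -> still recording?

    def record_snapshot(self, net):
        # Marker sending rule: record the local state, then send a marker on every
        # outgoing channel before any further computation message.
        self.recorded_state = self.state
        for q in self.peers:
            self.recording[q] = True
            self.chan_state[q] = []
            net.send(self.pid, q, ("MARKER", None))

    def on_message(self, net, frm, msg):
        kind, payload = msg
        if kind == "MARKER":
            if self.recorded_state is None:
                # First marker: record state; the channel it came on is recorded as empty.
                self.record_snapshot(net)
                self.chan_state[frm] = []
            self.recording[frm] = False          # stop recording the channel from `frm`
        else:
            self.state += payload                # toy computation step
            if self.recorded_state is not None and self.recording.get(frm):
                self.chan_state[frm].append(payload)   # message in transit at snapshot time

class Network:
    def __init__(self):
        self.queues = defaultdict(deque)         # (src, dst) -> FIFO channel
    def send(self, src, dst, msg):
        self.queues[(src, dst)].append(msg)
    def deliver_all(self, procs):
        # Drain channels round-robin, one message per channel per pass (FIFO per channel).
        progress = True
        while progress:
            progress = False
            for (src, dst), q in list(self.queues.items()):
                if q:
                    procs[dst].on_message(self, src, q.popleft())
                    progress = True

net = Network()
procs = {1: Process(1, [2]), 2: Process(2, [1])}
net.send(1, 2, ("APP", 5))                       # a computation message already in flight
procs[1].record_snapshot(net)                    # P1 initiates the snapshot
net.deliver_all(procs)
for p in procs.values():
    print(p.pid, p.recorded_state, p.chan_state)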
Terminating a snapshot
All processes have received a marker.
All processes have received a marker on all the N − 1 incoming channels.
A central server can gather the partial state to build a global snapshot.
Correctness
To prove the correctness of the algorithm, it is shown that a recorded
snapshot satisfies conditions C1 and C2.
Since a process records its snapshot when it receives the first marker on any
incoming channel, no messages that follow markers on the channels
incoming to it are recorded in the process’s snapshot.
Moreover, a process stops recording the state of an incoming channel when a
marker is received on that channel.
Due to FIFO property of channels, it follows that no message sent after the
marker on that channel is recorded in the channel state. Thus, condition C2
is satisfied.
When a process pj receives a message mij that precedes the marker on
channel Cij, it acts as follows:
If process pj has not taken its snapshot yet, then it includes mij in its
recorded snapshot. Otherwise, it records mij in the state of the channel Cij.
Thus, condition C1 is satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e)
messages and O(d) time, where e is the number of edges in the network and
d is the diameter of the network.
Properties of the recorded global state
The recorded global state may not correspond to any of the global states that
occurred during the computation.
This happens because a process can change its state asynchronously before
the markers it sent are received by other sites and the other sites record their states.
But the system could have passed through the recorded global states in some
equivalent executions.
The recorded global state is a valid state in an equivalent execution and if a
stable property (i.e., a property that persists) holds in the system before the snapshot
algorithm begins, it holds in the recorded global snapshot.
Therefore, a recorded global state is useful in detecting stable properties.
(Figures: Snapshots 1–7 show a step-by-step worked example of the algorithm's execution.)