Unit II Notes
UNIT II
MESSAGE ORDERING & SNAPSHOTS
Message ordering and group communication: Message ordering paradigms –
Asynchronous execution with synchronous communication –Synchronous program order on an
asynchronous system –Group communication – Causal order (CO) - Total order. Global state and
snapshot recording algorithms: Introduction –System model and definitions –Snapshot
algorithms for FIFO channels
⚫ FIFO Execution:
⚫ An A-execution in which, for all (s, r) and (s′, r′) ∈ T, (s ∼ s′ and r ∼ r′ and s ≺ s′)
=⇒ r ≺ r′.
⚫ A logical link is inherently non-FIFO.
⚫ A connection-oriented service at the transport layer, e.g., TCP, can be assumed to provide FIFO.
⚫ To implement FIFO over a non-FIFO link:
⚫ use a (seq_num, conn_id) pair per message; the receiver uses a buffer to order messages.
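The buffering scheme in the bullet above can be sketched as follows; the class and method names (`FifoReceiver`, `receive`) are illustrative, not part of the notes.

```python
# Sketch: FIFO delivery over a non-FIFO link using (seq_num, conn_id)
# per message and a receiver-side reordering buffer.

class FifoReceiver:
    """Buffers out-of-order messages per connection, delivers in seq order."""
    def __init__(self):
        self.next_seq = {}   # conn_id -> next expected sequence number
        self.buffer = {}     # conn_id -> {seq_num: message}
        self.delivered = []  # messages handed to the application, in order

    def receive(self, conn_id, seq_num, message):
        self.next_seq.setdefault(conn_id, 0)
        self.buffer.setdefault(conn_id, {})[seq_num] = message
        # Deliver any consecutive run starting at the expected number.
        while self.next_seq[conn_id] in self.buffer[conn_id]:
            seq = self.next_seq[conn_id]
            self.delivered.append(self.buffer[conn_id].pop(seq))
            self.next_seq[conn_id] = seq + 1

r = FifoReceiver()
r.receive("c1", 1, "m1")   # arrives early: buffered
r.receive("c1", 0, "m0")   # fills the gap: both are delivered
print(r.delivered)         # ['m0', 'm1']
```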
⚫ Causally Ordered (CO) Execution:
⚫ An A-execution in which, for all (s, r) and (s′, r′) ∈ T, (r ∼ r′ and s ≺ s′) =⇒ r ≺ r′.
⚫ If send events s and s′ are related by causality ordering (not physical time ordering),
their corresponding receive events r and r′ occur in the same order at all common
destinations.
⚫ If s and s′ are not related by causality, then CO is vacuously satisfied.
⚫ Figure (a) shows an execution that violates CO, since s1 ≺ s3 but we have r3 ≺ r1 at
their common destination.
⚫ Figure (b) shows an execution that satisfies CO. Only s1 and s2 are related by
causality, but the destinations of the corresponding messages are different.
⚫ Figure (c) shows an execution that satisfies CO. No send events are related by
causality.
⚫ Figure (d) shows an execution that satisfies CO. s2 and s1 are related by causality,
but the destinations of the corresponding messages are different. Similarly for s2 and
s3.
⚫ CO alternate definition:
⚫ If send(m1) ≺ send(m2), then for each common destination d of messages m1 and
m2, deliverd(m1) ≺ deliverd(m2) must be satisfied.
⚫ Common past and future:
⚫ An execution (E, ≺) is CO iff for each pair (s, r) ∈ T and each event e ∈ E:
⚫ weak common past: e ≺ r =⇒ ¬(s ≺ e);
⚫ weak common future: s ≺ e =⇒ ¬(e ≺ r).
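The implementation-oriented CO definition can be checked mechanically. Below is a minimal sketch; the encoding of the execution (a set of causally ordered send pairs plus a per-destination delivery sequence) is invented for illustration.

```python
# Sketch: checking "send(m1) ≺ send(m2) implies deliver_d(m1) before
# deliver_d(m2) at every common destination d".

def violates_co(send_order_pairs, delivery_order):
    """send_order_pairs: set of (m1, m2) with send(m1) ≺ send(m2).
    delivery_order: dict dest -> list of messages in delivery order."""
    for m1, m2 in send_order_pairs:
        for dest, seq in delivery_order.items():
            if m1 in seq and m2 in seq and seq.index(m2) < seq.index(m1):
                return True   # a common destination delivered m2 before m1
    return False

# s1 ≺ s3 causally, but destination P delivered m3 before m1 (as in figure (a)):
print(violates_co({("m1", "m3")}, {"P": ["m3", "m1"]}))  # True
# No sends related by causality: CO holds vacuously:
print(violates_co(set(), {"P": ["m3", "m1"]}))           # False
```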
⚫ Causality in a synchronous execution: the synchronous causality relation ≪ on E is the
smallest transitive relation that satisfies:
⚫ S1. If x occurs before y at the same process, then x ≪ y.
⚫ S2. If (s, r) ∈ T, then for all x ∈ E, [(x ≪ s ⇐⇒ x ≪ r) and (s ≪ x ⇐⇒ r ≪ x)].
⚫ S3. If x ≪ y and y ≪ z, then x ≪ z.
⚫ Synchronous execution (or S-execution):
⚫ An execution (E, ≪) for which the causality relation ≪ is a partial order.
⚫ Timestamping a synchronous execution: an execution (E, ≺) is synchronous iff there exists a
mapping from E to T (scalar timestamps) such that for any message M, T(s(M)) = T(r(M)),
and for each process Pi, if ei ≺ e′i then T(ei) < T(e′i).
⚫ Synchronous send and receive primitives of pairs of processes will produce synchronous
order.
⚫ One question here is:
⚫ Will a program written for an asynchronous system (A-execution) run correctly if
run with synchronous primitives?
⚫ Answer is:
⚫ An algorithm that runs on an asynchronous system may deadlock on a synchronous system.
⚫ Non-separated linear extension:
⚫ A linear extension of (E, ≺) such that for each pair (s, r) ∈ T, the interval
{ x ∈ E | s ≺ x ≺ r } is empty.
⚫ RSC (Realizable with Synchronous Communication) execution:
⚫ An A-execution (E, ≺) is RSC iff there exists a non-separated linear extension of (E, ≺).
⚫ s2 r2 s3 r3 s1 r1 is a linear extension that is non-separated.
⚫ s2 s1 r2 s3 r3 r1 is a linear extension that is separated.
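In a linear extension, the interval {x | s ≺ x ≺ r} being empty means every receive immediately follows its matching send, so the property is easy to test. A minimal sketch, using the event names from the example above:

```python
# Sketch: testing whether a linear extension is non-separated, i.e.
# each receive event immediately follows its corresponding send event.

def is_non_separated(linear_ext, pairs):
    pos = {e: i for i, e in enumerate(linear_ext)}
    return all(pos[r] == pos[s] + 1 for s, r in pairs)

pairs = [("s1", "r1"), ("s2", "r2"), ("s3", "r3")]
print(is_non_separated(["s2", "r2", "s3", "r3", "s1", "r1"], pairs))  # True
print(is_non_separated(["s2", "s1", "r2", "s3", "r3", "r1"], pairs))  # False
```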
⚫ if the adjacent send event and its corresponding receive event are
viewed atomically, then that pair of events shares a common past and a
common future with each other.
⚫ Crown:
⚫ The crown characterizes executions in terms of their graph structure; it leads to a
feasible test for an RSC execution.
⚫ A crown of size k in an execution (E, ≺) is a sequence ⟨(si, ri), i ∈ {0, ..., k−1}⟩
of pairs of corresponding send and receive events such that: s0 ≺ r1, s1 ≺ r2, ...,
sk−2 ≺ rk−1, sk−1 ≺ r0.
Example and some observations:
⚫ In a crown, si and ri+1 may or may not be on the same process.
⚫ A non-CO execution must have a crown.
⚫ In an execution that is not CO, it is possible to generalize this to state that a non-CO
execution must have a crown of size at least 2 (examples in figures (a) and (b)).
⚫ CO executions that are not synchronous have a crown of size at least 3 (example in figure (c)).
⚫ To determine whether an execution is RSC, we need to determine whether there exist any
cyclic dependencies among its messages.
⚫ Crown test for RSC executions:
⚫ Define the relation ‹→: T × T on messages in the execution (E, ≺) as follows:
‹→ ([s, r], [s′, r′]) iff s ≺ r′. Observe that the condition s ≺ r′ (which has the form
used in the definition of a crown) is implied by each of the four conditions: (i) s ≺ s′,
(ii) s ≺ r′, (iii) r ≺ s′, or (iv) r ≺ r′.
⚫ Now define a directed graph G‹→ = (T, ‹→), where the vertex set is the set of
messages T and the edge set is defined by ‹→.
⚫ Observe that ‹→: T × T is a partial order iff G‹→ has no cycle, i.e., there must not be
a cycle with respect to ‹→ on the set of corresponding (s, r) events.
⚫ Observe from the definition of a crown that G‹→ has a directed cycle iff (E, ≺) has
a crown.
⚫ Crown criterion
⚫ An A-computation is RSC, i.e., it can be realized on a system with synchronous
communication, iff it contains no crown.
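The crown criterion reduces the RSC test to cycle detection in G‹→. A minimal sketch, assuming the ‹→ edges have already been computed; the adjacency-dict input format is invented for illustration.

```python
# Sketch: the crown test as cycle detection (DFS three-coloring) in the
# graph G whose vertices are messages and whose edges follow ‹→.

def is_rsc(edges, messages):
    """edges: dict message -> set of messages it ‹→-precedes.
    Returns True iff G has no directed cycle, i.e., no crown."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {m: WHITE for m in messages}

    def dfs(m):
        color[m] = GREY
        for n in edges.get(m, ()):
            if color[n] == GREY:                    # back edge: a cycle
                return False
            if color[n] == WHITE and not dfs(n):
                return False
        color[m] = BLACK
        return True

    return all(dfs(m) for m in messages if color[m] == WHITE)

# A crown of size 2 (m0 ‹→ m1 and m1 ‹→ m0): not RSC.
print(is_rsc({"m0": {"m1"}, "m1": {"m0"}}, ["m0", "m1"]))  # False
print(is_rsc({"m0": {"m1"}}, ["m0", "m1"]))                # True
```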
⚫ RSC ⊂ CO ⊂ FIFO ⊂ A.
⚫ An A-execution is RSC iff A is an S-execution.
⚫ There are more restrictions on the possible message orderings in the smaller classes: the
degree of concurrency is highest in A and lowest in SYNC.
⚫ A program using synchronous communication is easiest to develop and verify. A
program using non-FIFO communication, resulting in an A-execution, is hardest to
design and verify.
⚫ Binary rendezvous:
⚫ Implemented by using tokens: a token is associated with each enabled interaction.
⚫ Scheduling must be done online, atomically, and in a distributed manner, and must be
deadlock-free (crown-free scheduling).
⚫ Bagrodia’s Algorithm for Binary Rendezvous:
⚫ Assumption:
⚫ Receive commands are forever enabled from all processes.
⚫ A send command, once enabled, remains enabled until it completes, i.e., it is not
possible that a send command gets disabled (by its guard getting falsified) before
the send is executed.
⚫ To prevent deadlock, process identifiers are used to introduce asymmetry to break
potential crowns that arise.
⚫ Each process attempts to schedule only one send event at any time.
⚫ Message types:
⚫ M, ack(M), request(M), permission(M)
⚫ A process blocks when it knows that it can successfully synchronize the current message
with the partner process.
⚫ The figure shows when the high- and low-priority processes block.
⚫ Each process maintains a queue that is processed in FIFO order only when the process
is unblocked.
⚫ When a process is blocked waiting for a particular message that it is currently
synchronizing, any other message that arrives is queued up.
⚫ Ack(M), request(M), and permission(M)- Control messages.
⚫ Lower Priority Process:
⚫ Messages M and ack(M) are involved, in that order.
⚫ The sender issues send(M) and blocks until ack(M) arrives.
⚫ Thus, when sending to a lower priority process, the sender blocks waiting for
the partner process to synchronize and send an acknowledgement.
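The two message exchanges (M/ack(M) toward a lower-priority partner, request(M)/permission(M)/M toward a higher-priority one) can be sketched as below. This is a sequential simulation for illustration only; the real algorithm runs both roles concurrently, and the class name `Process` is invented.

```python
# Sketch: the control-message sequences of Bagrodia's binary rendezvous,
# replayed sequentially. Higher pid = higher priority (an assumption here).

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.log = []

    def send_to(self, partner, msg):
        if self.pid > partner.pid:
            # To a lower-priority process: send M, then block for ack(M).
            self.log.append(f"send({msg})")
            partner.log.append(f"recv({msg})")
            self.log.append(f"recv(ack({msg}))")
        else:
            # To a higher-priority process: request permission first.
            self.log.append(f"send(request({msg}))")
            self.log.append(f"recv(permission({msg}))")
            self.log.append(f"send({msg})")
            partner.log.append(f"recv({msg})")

low, high = Process(1), Process(2)
high.send_to(low, "M1")   # high -> low: M, then ack(M)
low.send_to(high, "M2")   # low -> high: request, permission, then M
print(high.log[0], low.log[-1])  # send(M1) send(M2)
```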
Bagrodia’s Algorithm
2.1.4. Group Communication
⚫ Processes across a distributed system cooperate to solve a joint task.
⚫ They need to communicate with each other as a group.
⚫ So there is a need for group communication support.
⚫ Message unicast: sending a message to a particular destination process.
⚫ Message broadcast: sending a message to all members in the distributed system.
⚫ Message multicast: a message is sent to a certain subset of the processes in the
system, identified as a group.
⚫ Groups are dynamic:
⚫ They may be created and destroyed.
⚫ A process may join or leave a group.
⚫ A process may belong to multiple groups.
⚫ Groups allow processes to deal with collections of processes as one abstraction.
⚫ A process should only send a message to a group and need not know or care who its
members are.
⚫ Group communication can be implemented in several ways.
⚫ One to many
⚫ Many to one
⚫ Many to many
⚫ A spanning tree network protocol is used for both broadcast and multicast.
⚫ It is an efficient mechanism for distributing information.
⚫ Some features are not provided by hardware-assisted or network-protocol-assisted
multicast.
⚫ Some of them are:
⚫ Application-specific ordering semantics on the order of delivery of messages.
⚫ Adapting groups to dynamically changing membership.
⚫ Sending multicasts to an arbitrary set of processes at each send event.
⚫ Providing various fault-tolerance semantics.
⚫ One to Many
⚫ A message is sent by one sender to multiple receivers.
⚫ One-to-many scheme is also known as multicast communication.
⚫ A special case of multicast communication is broadcast communication.
⚫ Group Management in One-to-Many
⚫ Two types:
⚫ Closed group
⚫ Open group
⚫ Closed group
⚫ Only the members of the group can send a message to the group.
⚫ An outside process cannot send a message to the group as a whole, although
it may send a message to an individual member of the group.
⚫ Open group
⚫ Any process in the system can send a message to the group as a whole.
⚫ Use of an open or closed group depends upon the application.
⚫ Group Addressing in One-to-Many
⚫ A two-level naming scheme is normally used for group addressing:
⚫ High-level group name
⚫ Low-level group name
⚫ The high-level group name is an ASCII string and is location independent.
⚫ The low-level group name depends upon the underlying hardware.
⚫ User applications use high level group names in programs.
⚫ Buffered and Unbuffered Multicast in One-to-Many
⚫ Multicasting is an asynchronous communication mechanism,
⚫ because a multicast send cannot be synchronous due to the following reasons:
⚫ It is unrealistic to expect a sending process to wait until all the receiving
processes that belong to the multicast group are ready to receive the multicast
message.
⚫ The sending process may not be aware of all the receiving processes that
belong to the multicast group.
⚫ Unbuffered Multicast
⚫ The message is not buffered for the receiving process.
⚫ It is lost if the receiving process is not in a state ready to receive it.
⚫ Therefore, the message is received only by those processes of the multicast group
that are ready to receive it.
⚫ Buffered Multicast
⚫ The message is buffered for the receiving process.
⚫ So, each process of the multicast group will eventually receive the message.
⚫ Semantics in One-to-Many
⚫ Two types of semantics
⚫ Send-to-all semantics:
⚫ A copy of the message is sent to each process of the multicast group
and message is buffered until it is accepted by the process.
⚫ Bulletin-board semantics:
⚫ A message to be multicast is addressed to a channel instead of being
sent to every individual process of the multicast group.
⚫ Flexible Reliability in Multicast Communication:
⚫ Different applications require different degrees of reliability.
⚫ In one-to-many communication, the degree of reliability is normally expressed in the
following forms:
⚫ 0-reliability:
⚫ No response is expected by the sender from any of the receivers.
⚫ 1-reliability:
⚫ The sender expects a response from any one of the receivers.
⚫ M-out-of-n-reliability:
⚫ The multicast group consists of n receivers and the sender expects a
response from m (1 < m < n) of the receivers.
⚫ All-reliable:
⚫ The sender expects a response message from all the receivers in
the multicast group.
⚫ Atomic multicast:
⚫ Atomic multicast has an all-or- nothing property.
⚫ When a message is sent to a group by atomic multicast, it is either received by all the
correct processes that are members of the group or else it is not received by any of
them.
⚫ Many-to-One Communication:
⚫ Multiple senders send messages to a single receiver.
⚫ The single receiver may be selective or non-selective.
⚫ Selective receiver:
⚫ Specifies a unique sender.
⚫ A message exchange takes place only if that sender sends a message.
⚫ Non-selective receiver:
⚫ Specifies a set of senders.
⚫ If any one sender in the set sends a message to this receiver, a message
exchange takes place.
⚫ The many-to-one scheme involves non-determinism.
⚫ Many-to-Many Communication:
⚫ Multiple senders send messages to multiple receivers.
⚫ Ordered message delivery ensures that all messages are delivered to all receivers in
an order acceptable to the application.
⚫ For example,
⚫ suppose 2 senders send messages to update the same record of a database to
2 server processes, each having a replica of the database.
⚫ If the messages of the 2 senders are received by the 2 servers in different
orders, then the final value of the updated record of the database may be
different in its 2 replicas.
⚫ Message Ordering in Many-to-Many Communication:
⚫ The figure shows message delivery with no ordering constraints: R1 and R2 receive m1
and m2 in different orders.
⚫ Some message ordering is required :
⚫ Absolute ordering
⚫ Consistent/ total ordering
⚫ Causal ordering
⚫ FIFO ordering.
⚫ Absolute ordering:
⚫ Rule: mi must be delivered before mj if Ti < Tj.
⚫ Implementation:
⚫ A clock synchronized among the machines is required.
⚫ A sliding time window is used to commit the delivery of messages whose
timestamps fall within the window.
⚫ The window size is chosen taking into consideration the maximum
possible time that may be required by a message to travel from one machine to
another machine in the network.
⚫ Example: Distributed Simulation
⚫ Drawbacks:
⚫ Too strict a constraint.
⚫ No absolutely synchronized clocks.
⚫ No guarantee to catch all tardy messages.
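The sliding-window commit described above can be sketched as follows; the window size, timestamps, and function name are illustrative assumptions, not from the notes.

```python
# Sketch: sliding-window commit for absolute ordering. Messages are held
# until their timestamp falls outside the window (older than `window`
# time units relative to the current clock), then delivered in timestamp
# order. Tardy messages arriving after their slot has been committed
# would be missed -- the drawback noted above.
import heapq

def deliver_committed(pending, now, window):
    """pending: heap of (timestamp, message). Pops and returns messages
    whose timestamp is at most now - window, in timestamp order."""
    delivered = []
    while pending and pending[0][0] <= now - window:
        delivered.append(heapq.heappop(pending)[1])
    return delivered

pending = []
for ts, m in [(10.0, "m2"), (9.5, "m1"), (11.8, "m3")]:
    heapq.heappush(pending, (ts, m))
print(deliver_committed(pending, now=12.0, window=1.0))  # ['m1', 'm2']
```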
⚫ Implementation: use of a vector carried on each message.
⚫ Example: distributed file system.
⚫ Drawbacks:
⚫ The vector is an overhead.
⚫ Broadcast is assumed.
2.1.5. Causal order (CO)
a → b iff ta < tb
Events a and b are causally related iff ta < tb or tb < ta; else they are concurrent.
Note that this is still not a total order.
Uses of Vector Clocks in CO of Messages
If send(m1) → send(m2), then every recipient of both messages m1 and m2 must
“deliver” m1 before m2.
o “deliver” – when the message is actually given to the application for processing.
Birman–Schiper–Stephenson Protocol
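The delivery condition at the heart of the Birman–Schiper–Stephenson protocol can be sketched as below: a receiver delivers a broadcast from Pi stamped with vector VT only when it is the next broadcast expected from Pi and all broadcasts that causally precede it have been delivered. The process count and vectors are illustrative.

```python
# Sketch: the BSS causal-delivery test. vt_msg is the vector timestamp on
# the message, vt_local the receiver's delivered-message vector, and
# sender the index of the sending process.

def can_deliver(vt_msg, vt_local, sender):
    if vt_msg[sender] != vt_local[sender] + 1:
        return False          # an earlier broadcast from the sender is missing
    return all(vt_msg[k] <= vt_local[k]
               for k in range(len(vt_msg)) if k != sender)

# A receiver that has delivered nothing yet:
local = [0, 0, 0]
print(can_deliver([1, 0, 0], local, sender=0))  # True: next from P0
print(can_deliver([2, 0, 0], local, sender=0))  # False: gap from P0
print(can_deliver([1, 1, 0], local, sender=1))  # False: depends on P0's m1
```

A message failing the test is buffered and re-examined after each delivery, which is what enforces "deliver m1 before m2" whenever send(m1) → send(m2).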
To totally order the events in a system, the events are ordered according to their times of
occurrence. In case two or more events occur at the same time, an arbitrary total ordering ≺ is
obtained by breaking ties (e.g., using process identifiers).
There is a total ordering because, for any two events in the system, it is clear which happened
first.
The total ordering of events is very useful for distributed system implementation.
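The tie-breaking scheme above can be sketched in a few lines; the event data is invented for illustration.

```python
# Sketch: totally ordering events by (Lamport timestamp, process id);
# process ids break timestamp ties arbitrarily but consistently, so every
# node computes the same total order.

events = [("a", 2, 1), ("b", 1, 3), ("c", 2, 0), ("d", 1, 2)]  # (name, ts, pid)
total_order = sorted(events, key=lambda e: (e[1], e[2]))
print([e[0] for e in total_order])  # ['d', 'b', 'c', 'a']
```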
2.2. Global State and Snapshot Algorithm
2.2.1. Introduction
Distributed Garbage Collection
Deadlock Detection
Termination Detection
Debugging
The state of the collection of processes is much harder to address. The essential problem is
the absence of global time. If all processes had perfectly synchronized clocks, then we could agree
on a time at which each process would record its state – the result would be an actual global state of
the system.
The union of the individual process histories gives the global history:
H = h0 ∪ h1 ∪ … ∪ hN−1
A cut of the system’s execution is a subset of its global history that is a union of prefixes of
process histories:
C = h1^c1 ∪ h2^c2 ∪ … ∪ hN^cN
A consistent cut cannot violate temporal causality by implying that a result occurred before
its cause, as in message m1 being received before the cut and being sent after the cut.
A consistent global state is one that corresponds to a consistent cut. We may characterize the
execution of a distributed system as a series of transitions between global states of the system:
S0 → S1 → S2 → …
A linearization or consistent run is an ordering of the events in a global history that is consistent
with this happened-before relation → on H. Note that a linearization is also a run.
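The consistency condition on cuts can be checked directly: no message may be received inside the cut but sent outside it. A minimal sketch, with an invented encoding (a cut is one event-count prefix per process; a message records the positions of its send and receive events):

```python
# Sketch: testing whether a cut is consistent. A cut is a dict
# pid -> number of events included from that process's history; a
# message is (send_pid, send_idx, recv_pid, recv_idx) with 0-based
# positions in each history.

def is_consistent(cut, messages):
    for sp, si, rp, ri in messages:
        received_in_cut = ri < cut[rp]
        sent_in_cut = si < cut[sp]
        if received_in_cut and not sent_in_cut:
            return False   # effect (receive) inside, cause (send) outside
    return True

msgs = [(0, 1, 1, 0)]  # P0's 2nd event sends m1; P1's 1st event receives it
print(is_consistent({0: 2, 1: 1}, msgs))  # True: send and receive both inside
print(is_consistent({0: 1, 1: 1}, msgs))  # False: receive inside, send outside
```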
Global state predicates, stability, safety and liveness:
Detecting a condition such as deadlock or termination amounts to evaluating a global state
predicate. A global state predicate is a function that maps from the set of global states of processes
in the system to the values True and False.
A predicate is stable if, once the system enters a state in which the predicate is True, it
remains True in all future states reachable from that state. By contrast, when we monitor or debug an
application we are often interested in non-stable predicates, such as that in our example of variables
whose difference is supposed to be bounded. There are two further notions relevant to global state
predicates: safety and liveness.
Safety with respect to α is the assertion that α evaluates to False for all states S reachable
from S0. Liveness with respect to β is the property that, for any linearization L starting in the state
S0 , β evaluates to True for some state SL reachable from S0
2.2.3. Snapshot algorithms for FIFO channels
The ‘snapshot’ algorithm of Chandy and Lamport:
Chandy and Lamport describe a ‘snapshot’ algorithm for determining global states of
distributed systems. The goal of the algorithm is to record a set of process and channel states (a
‘snapshot’) for a set of processes pi ( i = 1, 2, .., N ) such that, even though the combination of
recorded states may never have occurred at the same time, the recorded global state is consistent.
The algorithm records state locally at processes; it does not give a method for gathering the
global state at one site.
The algorithm assumes that:
Neither channels nor processes fail – communication is reliable so that every message sent is
eventually received intact, exactly once.
Channels are unidirectional and provide FIFO-ordered message delivery.
The graph of processes and channels is strongly connected (there is a path between any two
processes).
Any process may initiate a global snapshot at any time.
The processes may continue their execution and send and receive normal messages while
the snapshot takes place.
The algorithm proceeds through the use of special marker messages, which are distinct from any
other messages the processes send, and which the processes may send and receive while they
proceed with their normal execution.
The algorithm is defined through two rules, the marker receiving rule and the marker sending
rule.
The marker sending rule obligates processes to send a marker after they have recorded their
state, but before they send any other messages.
The marker receiving rule obligates a process that has not recorded its state to do so when the
first marker arrives; in that case, this marker is the first one it has received. The process then notes
which messages subsequently arrive on the other incoming channels.
Consider the algorithm for a system of two processes, p1 and p2, connected by two unidirectional
channels, c1 and c2. The two processes trade in ‘widgets’. Process p1 sends orders for widgets over
c2 to p2. Sometime later, process p2 sends widgets along channel c1 to p1.
Process p1 records its state in the actual global state S0, when the state of p1 is <$1000, 0>.
Following the marker sending rule, process p1 then emits a marker message over its outgoing
channel c2 before it sends the next application-level message: (Order 10, $100), over channel c2 .
The system enters actual global state S1. Before p2 receives the marker, it emits an
application message (five widgets) over c1 in response to p1’s previous order, yielding a new actual
global state S2. Now process p1 receives p2’s message (five widgets), and p2 receives the marker.
Following the marker receiving rule, p2 records its state as <$50, 1995> and that of channel c2 as
the empty sequence. Following the marker sending rule, it sends a marker message over c1. When
process p1 receives p2’s marker message, it records the state of channel c1 as the single message
(five widgets) that it received after it first recorded its state. The final actual global state is S3. The
final recorded state is p1 : <$1000, 0>; p2 : <$50, 1995>; c1 : <(five widgets)>; c2 : < >. Note that
this state differs from all the global states through which the system actually passed.
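The widget run above can be replayed step by step, applying the two marker rules by hand; the variable names are illustrative, and the states and channel contents follow the example in the notes.

```python
# Sketch: replaying the two-process widget example and applying the
# Chandy-Lamport marker rules manually.

recorded = {}           # process/channel name -> recorded state
recording = {}          # channel -> messages observed while recording

# p1 records its state, then sends a marker on c2 (marker sending rule);
# it starts recording arrivals on its incoming channel c1.
recorded["p1"] = ("$1000", 0)
recording["c1"] = []
# p1 sends (Order 10, $100) on c2. Before p2 receives the marker, p2
# sends (five widgets) on c1, and p1 receives it while still recording:
recording["c1"].append("five widgets")
# p2 receives the marker on c2: it is p2's first marker, so p2 records
# its state and records c2 as empty (marker receiving rule), then sends
# a marker on c1 (marker sending rule).
recorded["p2"] = ("$50", 1995)
recorded["c2"] = []
# p1 receives p2's marker on c1: recording of c1 stops.
recorded["c1"] = recording["c1"]

print(sorted(recorded.items()))
# [('c1', ['five widgets']), ('c2', []), ('p1', ('$1000', 0)), ('p2', ('$50', 1995))]
```

The result matches the recorded global state in the text, even though the system never passed through it.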
Termination of the snapshot algorithm:
A process that has received a marker message records its state within a finite time and sends
marker messages over each outgoing channel within a finite time. If there is a path of
communication channels and processes from a process pi to a process pj (j ≠ i), then it is clear on
these assumptions that pj will record its state a finite time after pi records its state.