DS Midsems
Tightly coupled systems
The processors share a single system-wide memory.
[Figure: several CPUs connected to a shared memory through interconnection hardware]
Loosely coupled systems
The processors do not share memory, and
each processor has its own local memory.
These are also known as distributed
computing systems or simply distributed
systems.
[Figure: processors with local memories connected by a communication network]
Distributed Computing System
A DCS is a collection of independent computers that
appears to its users as a single coherent system,
or
A collection of processors interconnected by a
communication network in which
Each processor has its own local memory and other
peripherals, and
The communication between any two processors of
the system takes place by message passing.
For a particular processor, its own resources are
local, whereas the other processors and their
resources are remote.
Cont…
Together, a processor and its resources are
usually referred to as a node or site or machine
of the distributed computing system.
Resource sharing
Such as software libraries, databases, and hardware resources.
Factors that led to the emergence of
distributed computing system
Better price-performance ratio
Higher reliability
Reliability refers to the degree of tolerance against errors
and component failures in a system.
Achieved by multiplicity of resources.
Extensibility and incremental growth
Achieved by adding additional resources to the system as and when the need arises.
Such systems are termed open distributed systems.
3. FAULT TOLERANCE
Low in Network OS
High in Distributed OS
User Mobility
No matter which machine a user is logged onto, he
should be able to access a resource with the same
name.
Replication Transparency
Replicas are used for better performance and reliability.
No deadlock
It ensures that a situation will never occur in which
competing processes prevent their mutual progress even
though no single one requests more resources than
available in the system.
Performance transparency
The aim of performance transparency is to allow the
system to be automatically reconfigured to improve
performance, as loads vary dynamically in the system.
The large size of the kernel reduces the overall flexibility and
configurability of the resulting OS.
Microkernel model
Goal is to keep the kernel as small as possible.
Cont…
Being modular in nature, the OS is easy to design, implement, and install.
For adding or changing a service, there is no need to stop
the system and boot a new kernel, as in the case of
monolithic kernel.
Performance penalty:
Each server module has its own address space. Hence some form of message-based IPC is required while performing some job.
Message passing between server processes and the
microkernel requires context switches, resulting in additional
performance overhead.
Cont…
Advantages of microkernel model over monolithic kernel
model:
Flexibility in design, maintenance, and portability.
In practice, the performance penalty is not too high, because of other factors, and the small overhead involved in exchanging messages is usually negligible.
PERFORMANCE
Design principles for better performance are:
Batch if possible
Transfer of data in large chunks is more efficient than in individual pages.
Caching whenever possible
Saves a large amount of time and network bandwidth.
[Figure: IPC via a shared memory area between processes P1 and P2, contrasted with message passing between P1 and P2]
Other primitives:
◦ Send primitive
◦ Receive primitive
Cont…
Blocking send primitive:
◦ After execution of the send statement, the sending
process is blocked until it receives an
acknowledgement.
[Figure: blocking send and receive — the sender executes Send(message) and is suspended (blocked state) until an acknowledgement arrives; the receiver executes Receive(message), is suspended until the message arrives, and then resumes execution (executing state)]
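A minimal sketch (not from the slides) of these blocking semantics, using Python threads and in-memory queues to stand in for the kernel's message channels; the channel names and the acknowledgement message are illustrative assumptions:

```python
import queue
import threading

# Two in-memory channels stand in for the network: one carries the
# message to the receiver, the other carries the acknowledgement back.
msg_channel = queue.Queue()
ack_channel = queue.Queue()

def blocking_send(message):
    msg_channel.put(message)
    ack_channel.get()               # sender stays blocked until the ack arrives

def receiver():
    message = msg_channel.get()     # blocking receive: suspended until a message arrives
    print("received:", message)
    ack_channel.put("ACK")          # acknowledgement unblocks the sender

threading.Thread(target=receiver).start()
blocking_send("hello")              # returns only after the receiver has acknowledged
```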
Issues In Non-blocking
2nd strategy
– The message is simply discarded, and a timeout mechanism is used to resend the message after a timeout period.
Cont..
The null buffer strategy is not suitable for
synchronous communication between two
processes in a distributed system.
[Figure: with a null buffer, the message is transferred directly from the sending process to the receiving process]
Unbounded-capacity buffer
A buffer with unbounded capacity may be used in the asynchronous mode of communication.
It can store all unreceived messages with the
assurance that all messages sent to the receiver
will be delivered.
In practice, this strategy is impossible, since no buffer can be truly unbounded.
Finite-bound buffer
The asynchronous mode of communication uses this strategy of a finite-bound buffer.
Cont…
Methods to deal with the problem of buffer
overflow:
◦ Unsuccessful communication
Message transfers simply fail whenever there is no more
buffer space.
◦ Flow-controlled communication
The sender is blocked until the receiver accepts some
messages, thus creating space in the buffer for new
messages.
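A minimal sketch (not from the slides) of flow-controlled communication over a finite-bound buffer; the buffer size of 2 and the artificial delay are illustrative assumptions:

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=2)      # finite-bound buffer: at most 2 unreceived messages

def sender():
    for i in range(5):
        buffer.put(f"message {i}")   # blocks when the buffer is full (flow control)
        print("sent message", i)

def receiver():
    for _ in range(5):
        time.sleep(0.1)              # slow receiver: forces the buffer to fill up
        print("got", buffer.get())

threading.Thread(target=sender).start()
threading.Thread(target=receiver).start()
```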
Cont…
A create-buffer system call is provided to the user.
Cont…
[Figure: multiple-message buffer (mailbox/port) on the receiver's node, holding messages 1 through n sent by the sending process for the receiving process]
Multidatagram messages
A message whose size is greater than the MTU (maximum transfer unit) of the network is fragmented into multiple packets, each of at most the MTU, and transmitted as multiple datagrams.
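A minimal sketch (not from the slides) of fragmenting a message into MTU-sized datagrams and reassembling it; the header layout (sequence number, total count) is an assumed convention:

```python
MTU = 16  # illustrative maximum transfer unit, in bytes

def fragment(message: bytes, mtu: int = MTU):
    """Split a message into (seq_no, total, payload) datagrams of at most mtu bytes."""
    payloads = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    return [(seq, len(payloads), p) for seq, p in enumerate(payloads)]

def reassemble(datagrams):
    """Rebuild the original message from datagrams received in any order."""
    ordered = sorted(datagrams)                     # sort by sequence number
    return b"".join(payload for _, _, payload in ordered)

msg = b"a multidatagram message longer than one MTU"
datagrams = fragment(msg)[::-1]        # simulate out-of-order arrival
assert reassemble(datagrams) == msg
```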
Process Addressing
Explicit Addressing
The process with which communication is desired
is explicitly named as a parameter in the
communication primitive used.
Implicit Addressing
The process with which communication is desired is not explicitly named; instead, a service identifier is used, and the message is delivered to any process that provides that service.
Cont…
Benefit:
No global coordination required for local-id
Drawback
◦ It does not allow a process to migrate from one machine to another when required; for example, one or more processes of a heavily loaded machine may be migrated to a lightly loaded machine to balance the overall system load.
Cont…
Solution
◦ Processes can be identified by a combination of three fields: machine_id, local_id, machine_id, where
the first machine_id identifies the node on which the process was created,
local_id is the id assigned to the process by its creating node, and
the second machine_id identifies the last known location (node) of the process.
Cont…
Drawbacks of this approach:
◦ The overhead of locating a process may be large if the process has migrated several times during its lifetime.
Solutions are:
– Ensure that the system wide unique identifier of a
process does not contain an embedded machine
identifier.
– Use a two-level naming scheme for processes.
Two Level Naming Scheme
Cont…
When a process wants to send a message to another process, it specifies the high-level name of the receiver process.
Cont…
[Figure: possible failures in client-server communication — the request message is lost on its way to the receiver; the response message is lost on its way back; or the receiver's machine crashes and restarts, so the request goes unserved]
Cont…
To overcome these problems
• A reliable IPC protocol of the message-passing system is designed.
• It is based on the ideas of internal retransmission of messages after a timeout, and
• the return of an ACK to the sending machine's kernel by the receiving machine's kernel.
Cont…
Protocols used for client-server communication
between two processes are:
◦ Four-message reliable IPC protocol
◦ Three-message reliable IPC protocol
◦ Two-message reliable IPC protocol
Four-message reliable IPC protocol
Four-message reliable IPC protocol
[Figure: four-message protocol — the client sends a request and remains blocked; the server's kernel acknowledges the request; after execution the server sends the reply, which resumes the client; finally the client's kernel acknowledges the reply]
Three-message reliable IPC protocol
The result of the processed request is sufficient
acknowledgment that the request message was
received by the server.
Three-message reliable IPC protocol
[Figure: three-message protocol — the client sends a request and remains blocked; after execution the server's reply both answers and acknowledges the request; the client's kernel then acknowledges the reply]
Cont…
The problem lies in the choice of the timeout value.
• If the request message is lost, it will be retransmitted only after the timeout period, which might have been set to too large a value.
Fault-tolerant communication
[Figure: fault-tolerant communication — the client's first request message is lost; after a timeout the client retransmits the request, and the server's response message then arrives]
Idempotency
It means repeatability.
An Idempotent operation produces the same
result without any side effects, no matter how
many times it is performed with the same
argument.
Example: using procedure GetSqrt(64) for
calculating the square root of a given number.
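A minimal sketch (not from the slides) contrasting an idempotent operation with a non-idempotent one; the bank-balance example is an assumed illustration:

```python
import math

balance = 100

def get_sqrt(n):
    # Idempotent: the same argument always yields the same result,
    # with no side effects, however many times the call is repeated.
    return math.sqrt(n)

def debit(amount):
    # Non-idempotent: each repeated execution changes the balance, so
    # re-executing a duplicate request corrupts the state.
    global balance
    balance -= amount
    return balance

assert get_sqrt(64) == get_sqrt(64) == 8.0
debit(10); debit(10)        # a duplicated debit request is applied twice
print(balance)              # 80, not the intended 90
```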
ISSUE : Duplicate Requests.
◦ If the execution of the request is nonidempotent, then
its repeated execution will destroy the consistency of
information.
Handling of Duplicate Requests
Suppose a client makes a request, the server processes it, but the client does not receive the response; after the timeout, the client issues the request again.
What happens?
Cont…
Solution
◦ Use a unique id for every request.
◦ Maintain a reply cache in the kernel's address space on the server machine to cache replies.
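A minimal sketch (not from the slides) of such a reply cache keyed by a unique request id, so a retransmitted duplicate gets the cached reply instead of being re-executed; the handler and id scheme are illustrative assumptions:

```python
reply_cache = {}   # request_id -> cached reply (kept on the server machine)

def handle_request(request_id, operation, *args):
    if request_id in reply_cache:
        return reply_cache[request_id]   # duplicate: resend old reply, don't re-execute
    reply = operation(*args)             # execute the (possibly non-idempotent) request
    reply_cache[request_id] = reply
    return reply

balance = [100]
def debit(amount):
    balance[0] -= amount
    return balance[0]

first = handle_request("req-42", debit, 10)   # executes: balance becomes 90
dup   = handle_request("req-42", debit, 10)   # duplicate: cached reply, balance unchanged
assert first == dup == 90 and balance[0] == 90
```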
Multicasting is an Asynchronous
communication mechanism.
[Figure: two senders S1 and S2 multicast m1 (sent at t1) and m2 (sent at t2), with t1 < t2; both receivers R1 and R2 deliver m1 before m2]
Consistent Ordering
This semantics ensures that all messages are
delivered to all receiver processes in the same
order.
[Figure: consistent ordering — m1 is sent at t1 and m2 at t2 with t1 < t2, yet both receivers deliver the messages in the same order, here m2 before m1]
Causal Ordering
• PCR
CBCAST Protocol
• For causal ordering — used in ISIS, a real commercial distributed system based on process groups.
• The ISIS project has moved from Cornell University to ISIS Distributed Systems, a subsidiary of Stratus Computer Inc.
• Each member process of a group maintains a vector of
“n” components, where “n” is the total number of
members in the group.
• Each member is assigned a sequence number from 1 to n, and the ith component of the vector corresponds to the member with sequence number i.
• In particular, the value of the ith component of a
member’s vector is equal to the number of the last
message received in sequence by this member from
member ‘i’.
Cont…
• To send a message, a process increments the value of
its own component in its own vector and sends the
vector as part of the message.
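A minimal sketch (not the full ISIS implementation) of this send rule together with the standard CBCAST delivery condition, which the slides do not spell out: a message from sender i is delivered once it is the next in sequence from i and everything i had already seen has been delivered locally. The class and group size are illustrative assumptions:

```python
n = 3   # illustrative group size

class Member:
    def __init__(self, my_id):
        self.my_id = my_id
        self.vec = [0] * n       # vec[i] = last message delivered in sequence from member i
        self.pending = []        # received but not yet deliverable messages

    def send(self, payload):
        self.vec[self.my_id] += 1                      # count our own message
        return (self.my_id, list(self.vec), payload)   # the vector travels with the message

    def deliverable(self, msg):
        sender, v, _ = msg
        # Next in sequence from the sender, and nothing causally earlier is missing.
        return v[sender] == self.vec[sender] + 1 and all(
            v[k] <= self.vec[k] for k in range(n) if k != sender)

    def receive(self, msg):
        self.pending.append(msg)
        progress = True
        while progress:                                # deliver everything now eligible
            progress = False
            for m in list(self.pending):
                if self.deliverable(m):
                    self.pending.remove(m)
                    self.vec[m[0]] = m[1][m[0]]        # record the delivery
                    print(f"member {self.my_id} delivers {m[2]}")
                    progress = True

a, b, c = Member(0), Member(1), Member(2)
m1 = a.send("m1")
b.receive(m1)        # b delivers m1 ...
m2 = b.send("m2")    # ... then sends m2, causally after m1
c.receive(m2)        # arrives first: buffered, not yet deliverable
c.receive(m1)        # m1 delivered, then the buffered m2 follows
```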
CPU-bound blocking
IO-bound blocking
Wait for data to return from an IO source, such as a network
or a hard drive.
Non-blocking IO
• When data has returned from IO, the caller will be
notified
• Done with a callback function that has access to the
returned data.
Network IO and Sockets
• At kernel level a socket is used as an abstraction to
communicate with a NIC.
• The socket takes care of reading and writing data to/from the NIC; the NIC sends the data over the UTP cable to the internet.
• For example, if you go to a URL in your browser:
• At low level the data in your HTTP request is written
to a socket using the send(2) system call.
• When a response is returned, response data can be
read from that socket using the recv(2) system call.
• So when data has returned from network IO, it is ready
to be read from the socket.
Non-Blocking IO under the hood
• Use an infinite loop that constantly checks (polls) whether data has returned from IO; this loop is called an event loop.
• It checks whether data is ready to read from a network socket.
• Sockets are implemented as file descriptors (FD) on
UNIX systems.
• All sockets are file descriptors but converse is not true
• So technically, FD is checked for ready data.
• The list of FDs that you want to check for ready data is
generally called the Interest List.
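A minimal sketch (not from the slides) of such an event loop, using Python's standard selectors module, which wraps epoll/kqueue on the corresponding platforms; the echo behaviour and port number are illustrative assumptions:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll on Linux, kqueue on BSD/macOS

listener = socket.socket()
listener.bind(("localhost", 9000))         # illustrative port
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)   # add the listening FD to the interest list

while True:
    # Blocks until at least one registered FD has ready data; each iteration's
    # cost scales with the number of events, not the size of the interest list.
    for key, _ in sel.select():
        if key.fileobj is listener:
            conn, _ = listener.accept()        # new connection is ready
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(4096)      # ready FD: this recv won't block
            if data:
                key.fileobj.send(data)         # echo the data back
            else:
                sel.unregister(key.fileobj)    # peer closed the connection
                key.fileobj.close()
```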
Optimizations to Event Loop
Each (major) OS provides kernel level APIs to help create
an event loop
• Linux - epoll or io_uring,
• BSD - kqueue &
• Windows - IOCP.
Each of these APIs is able to check FDs for ready data with
a computational complexity of around
O(No_of_Events_Occurred).
In other words, you can monitor 100,000s of FDs, but the API's execution speed depends only on the number of events that occur in the current iteration of the event loop.
Conclusion
Applications that need to handle high event rates mostly use
non-blocking IO models implemented with event loops.
For best performance, the event loop is built using kernel
APIs such as kqueue, io_uring, epoll and IOCP.
Hardware interrupts and signals are less suited for non-blocking IO when handling large numbers of events per second.
Server Architectures
Two competitive server architectures based on:
• Threads or
• Events.
Internals of an Event-Driven Architecture
Since the caller and the callee processes have disjoint address spaces, the remote procedure has no access to the data and variables of the caller's environment.
Cont…
Therefore, the RPC facility uses a message-passing scheme for information exchange between the caller and the callee processes.
The RPC runtime component is responsible for
Retransmission,
Acknowledgement,
Routing &
Encryption.
Server Stub
It is responsible for the following two tasks:
◦ Unpacking (unmarshalling) an incoming call request and invoking the corresponding server procedure.
◦ Packing (marshalling) the result of execution into a reply message.
Server management involves:
◦ Server implementation
◦ Server creation
Server Implementation
Types of servers :
Stateful servers
Stateless servers
Stateful Server
A Stateful Server maintains clients state
information from one RPC to the next.
[Figure: a client's read of 100 bytes returns bytes 0 to 99; the stateful server remembers the file position for the next request]
Instance-per-call Servers
Instance-per-session Server
Persistent Servers
Instance-per-call Servers
These exist only for the duration of a single call.
Persistent Servers
These remain in existence indefinitely and are the most commonly used.
Advantages:
Improved performance and reliability.
Can be shared by many clients.
If there are any orphan calls, the caller takes the result of the first response message and ignores the others, whether or not the accepted response is from an orphan.
Exactly-Once Call Semantics
It uses the request/reply/ACK protocol.
[Figure: for each procedure call, the client's request message is answered by a reply message that also serves as the acknowledgement of the request; the client stays blocked during each RPC execution]
Request/Reply/Acknowledge-Reply (RRA) protocol
[Figure: RRA protocol — for each call, the client sends a request and remains blocked; the server executes and sends a reply; the client then sends a reply acknowledgement message]
Complicated RPC’s
Types of complicated RPCs
Issues
Providing the server with the client's handle
Making the client process wait for the callback RPC
Handling callback deadlocks
Cont…
[Figure: callback RPC — the client calls the server with a parameter list, and the server starts procedure execution]
Simple stubs
Drawbacks
• The problem is that it forces the user to wait for the browser to discover and retrieve critical assets until after the HTML document has been downloaded.
• This delays rendering and increases load times.
• Server push lets the server preemptively “push”
website assets to the client without the user
having explicitly asked for them.
• When used with care, the server can send what it knows the user is going to need for the page they're requesting.
Interoperability
• gRPC tools and libraries are designed to work with
multiple platforms and programming languages,
including Java, JavaScript, Ruby, Python, Go, Dart,
Objective-C, C#, and more.
Should I Use gRPC instead of REST?
• Basically, gRPC is another alternative that could be
useful in certain circumstances:
– large-scale microservices connections
– real-time communication
– low-power, low-bandwidth systems
– multi-language environments
• While HTTP supports mediators for edge caching,
gRPC calls use the POST method, which is a
threat to API-security.
• The responses can’t be cached through
intermediaries.
• Moreover, the gRPC specification makes no provisions for caching; it only indicates the wish for cache semantics between server and client.
Distributed Shared Memory
Distributed shared memory
The DSM paradigm provides processes with a shared address space.
Primitives for shared memory:
1. Read(address)
2. Write(address, data)
The shared memory paradigm gives the system an illusion of physically shared memory.
DSM refers to the shared memory paradigm applied to loosely coupled distributed-memory systems.
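A minimal sketch (not from the slides) of the two primitives over a message-passing substrate: a Read or Write on an address whose block is not cached locally fetches it from the owning node. The block size, ownership rule, and direct method calls standing in for messages are all simplifying assumptions:

```python
BLOCK_SIZE = 4   # illustrative block size

class Node:
    def __init__(self, nodes):
        self.nodes = nodes   # all nodes; direct calls stand in for network messages
        self.cache = {}      # locally cached blocks of the shared address space
        self.owned = {}      # blocks whose master copy lives on this node

    def _fetch(self, block):
        if block not in self.cache:                  # block fault: locate the block
            for node in self.nodes:
                if block in node.owned:              # fetch it from the owning node
                    self.cache[block] = node.owned[block]
                    return
            self.cache[block] = [0] * BLOCK_SIZE     # untouched memory reads as zero

    def read(self, address):
        block, offset = divmod(address, BLOCK_SIZE)
        self._fetch(block)
        return self.cache[block][offset]

    def write(self, address, data):
        block, offset = divmod(address, BLOCK_SIZE)
        self._fetch(block)
        self.cache[block][offset] = data
        self.owned[block] = self.cache[block]        # simplification: the writer owns the block

nodes = []
a, b = Node(nodes), Node(nodes)
nodes += [a, b]
a.write(5, 42)       # node a writes into the shared address space
print(b.read(5))     # node b block-faults, fetches the block from a, reads 42
```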
Cont….
Shared memory exists only virtually.
Similar concept to virtual memory.
DSM also known as DSVM.
DSM provides a virtual address space shared among
processes on loosely coupled processors.
DSM is basically an abstraction that integrates the local memory of different machines into a single logical entity shared by cooperating processes.
DSM Architecture
Each node of the system consists of one or more CPUs and a memory unit.
Nodes are connected by high speed communication
network.
Simple message passing system for nodes to exchange
information.
Main memory of individual nodes is used to cache pieces
of shared memory space.
Reduces network latency
Cont….
Memory mapping manager routine maps local memory
to shared virtual memory.
Shared memory space is partitioned into blocks.
Data caching is used in DSM system to reduce network
latency.
The basic unit of caching is a memory block.
Cont….
If the data is not available in local memory, a network block fault is generated.
Directory size
The larger the block size, the smaller the directory, giving better performance.
Implementing Seq. Consistency Model
Strategies:
1. Nonreplicated, Nonmigrating blocks (NRNMB)
2. Nonreplicated, Migrating blocks (NRMB)
3. Replicated, migrating blocks (RMB)
4. Replicated, Nonmigrating blocks (RNMB)
NRNMBs
Simplest strategy.
Drawbacks:
Serializing data access creates a bottleneck.
NRMBs
Migration is allowed.
The owner node of a block changes as soon as the block migrates to a new node.
Drawbacks:
Prone to thrashing problem.
No parallelism.
Data locating in the NRMB strategy
On a fault,
Fault handler of the faulting node broadcasts a
read/write request on the network.
Disadvantage:
Not scalable.
Centralized server algorithm
Cont…
A centralized server maintains a block table that contains the location information for all blocks in the shared memory space.
Drawbacks:
A centralized server serializes location
queries, reducing parallelism.
Write-invalidate
All copies of a piece of data except one are invalidated
before a write can be performed on it.
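A minimal sketch (not from the slides) of the write-invalidate rule, with the copy-set bookkeeping reduced to a single shared block; the class and names are illustrative assumptions:

```python
class Block:
    """One shared block with a single up-to-date value and a set of cached copies."""

    def __init__(self):
        self.value = 0
        self.copy_set = set()    # nodes currently holding a valid copy

    def read(self, node):
        self.copy_set.add(node)  # the node caches the block on read
        return self.value

    def write(self, node, value):
        # Write-invalidate: all copies except the writer's are invalidated
        # before the write is performed.
        self.copy_set = {node}
        self.value = value

blk = Block()
blk.read("N2")           # N2 caches the block
blk.write("N1", 8)       # N2's copy is invalidated by this write
print(blk.copy_set)      # {'N1'}: N2 must re-fetch the block to read again
```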
1. A quartz crystal
Oscillates at a well-defined frequency.
2. A counter register
Keeps track of the oscillations of the quartz crystal: it is decremented by one on each oscillation, and when its value reaches zero an interrupt is generated and the counter is reinitialized to the value of the constant register.
3. A constant register
Stores a constant value, based on the frequency of
oscillation of the quartz crystal.
Cont…
Each interrupt is known as a clock tick.
Ideal case:
dC/dt =1
Cont…
A clock is non-faulty if its drift stays within the maximum allowable drift rate ρ.
Condition:
1 − ρ ≤ dC/dt ≤ 1 + ρ
Slow, perfect, and fast clocks
[Figure: clock time C plotted against real time t — a perfect clock has dC/dt = 1, the fast-clock region has dC/dt > 1, and the slow-clock region has dC/dt < 1]
Cont…
Types of clock synchronization in DS:
1. Synchronization of the computer clocks with real-time (or external) clocks.
2. Mutual (or internal) synchronization of the clocks of the system.
Synchronization algorithms are either centralized or distributed.
Centralized algorithms
Use a time-server node for referencing the correct time.
[Figure: space-time diagram — events e10 to e13, e20 to e24, and e30 to e32 of three processes plotted against time, with message transfers between the processes]
Implementation of logical clocks:
1. Using counters
2. Using physical clocks
Using counters
Each process has a counter like C1 and C2
for process P1 and P2, respectively.
Counters
Act as logical clocks.
Initialized to zero.
Incremented by 1 on each event of the process.
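A minimal sketch (not from the slides) of such counters acting as logical clocks; the correction rule on receive (jump past the sender's timestamp) is the standard Lamport rule, assumed here to complete the example:

```python
class LogicalClock:
    def __init__(self):
        self.counter = 0            # initialized to zero

    def event(self):
        self.counter += 1           # incremented by 1 on each event of the process
        return self.counter

    def send(self):
        return self.event()         # sending is an event; the timestamp travels along

    def receive(self, timestamp):
        # Lamport correction: jump past the sender's timestamp if it is ahead.
        self.counter = max(self.counter, timestamp) + 1
        return self.counter

c1, c2 = LogicalClock(), LogicalClock()
c1.event(); c1.event()      # C1 = 2
ts = c1.send()              # C1 = 3; the message carries timestamp 3
print(c2.receive(ts))       # C2 jumps to 4, preserving the happened-before order
```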
[Figure: counters as logical clocks — process P1's events e01, e02, e03 carry C1 = 1, 2, 3; process P2's events e11, e12 carry C2 = 1, 2]
Using physical clocks
Each process has a physical clock
associated with it.
Each clock runs at a constant rate (the rate may differ from process to process).
Example:
When Process p1 has ticked 10 times, process
p2 has ticked only 8 times.
[Figure: physical clock times of P1 and P2, with and without corrections — P1's clock ticks 10 per interval (events e01 to e06 at times 10 to 90) while P2's ticks 8 per interval (8 to 72); a message timestamped 60, sent at e04, arrives at e13 when P2's uncorrected clock reads 56, so P2's clock is corrected forward to 61, and its subsequent times become 69 and 77]
Total ordering of events
No events can occur at exactly the same
time.
Requirements of a mutual exclusion algorithm:
1. Mutual exclusion
2. No starvation
Cont…
Mutual exclusion:
Given a shared resource accessed by multiple concurrent
processes, at any time only one process should access
the resource.
Can be implemented in single-processor systems, using
semaphores, monitors and similar constructs.
No starvation:
If every process that is granted the resource eventually
releases it, every request must be eventually granted.
Cont…
Approaches:
1. Centralized approach
2. Distributed approach
3. Token passing approach
Centralized approach
One of the processes in the system is elected as
the coordinator.
[Figure: centralized approach — processes P1 and P3 send numbered request, reply, and release messages through the coordinator Pc, which grants the critical section to one process at a time in request-arrival order]
No starvation:
Due to use of first-come, first-served
scheduling policy.
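A minimal sketch (not from the slides) of the coordinator's logic: grant with a reply if the resource is free, otherwise queue the request first-come first-served; direct calls stand in for the request, reply, and release messages:

```python
from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = None
        self.waiting = deque()       # first-come, first-served queue: no starvation

    def request(self, process):
        if self.holder is None:
            self.holder = process
            return "reply"           # grant: the process may enter the critical section
        self.waiting.append(process) # busy: defer the reply
        return "queued"

    def release(self):
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            print("reply sent to", self.holder)   # the next waiter enters

c = Coordinator()
print(c.request("P1"))   # reply -> P1 enters the critical section
print(c.request("P3"))   # queued behind P1
c.release()              # reply sent to P3
```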
Cont…
Advantages of algorithm:
Simple to implement
Requires only three messages- a request, a
reply and a release.
Drawbacks:
Coordinator is the single point of failure.
Distributed approach
The decision making for mutual exclusion
is distributed across the entire system.
All processes that want to enter the same
critical section cooperate with each other
before reaching a decision on which process
will enter the critical section next.
Ricart and Agrawala Algorithm
Use of event-ordering scheme to generate a
unique timestamp for each event in the system.
[Figure: Ricart-Agrawala example — two processes request the critical section with timestamps TS = 4 and TS = 6 while another process is already inside; replies are deferred until it exits, and the request with the smaller timestamp (TS = 4) enters first]
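A minimal sketch (not from the slides) of the reply-or-defer decision each process makes in the Ricart-Agrawala algorithm when a request arrives; the state encoding and the tie-break on process id are the standard rules, assumed here:

```python
def on_request(my_state, my_timestamp, my_id, req_timestamp, req_id):
    """Decide whether to reply immediately to an incoming request or defer it."""
    if my_state == "IN_CS":
        return "defer"                   # already in the critical section
    if my_state == "WANTED":
        # Both want the critical section: the earlier timestamp wins; ids break ties.
        if (my_timestamp, my_id) < (req_timestamp, req_id):
            return "defer"
    return "reply"                       # idle, or the requester is earlier

# Two processes request concurrently with timestamps 4 and 6:
print(on_request("WANTED", 4, "P4", 6, "P3"))   # defer: P4's own request is earlier
print(on_request("WANTED", 6, "P3", 4, "P4"))   # reply: P4 wins and enters first
```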
Problems in the token-passing approach are process failure and loss of the token.
Process failure
A process receiving the token from its neighbour always sends an ACK.
Each node maintains the current ring configuration.
If a process fails, dynamic reconfiguration of the ring is carried out.
When a failed process comes back up, it simply informs the others.
Lost Token
To regenerate a lost token, one process in the ring acts as a monitor process.
The monitor periodically circulates a "who has the token" message.
The process holding the token inserts its id in a special field of this message.
If no id is found when the message returns, the token has been lost, and the monitor regenerates it.
Problems:
The monitor process may itself fail.
The "who has the token" message may itself be lost.
Deadlock
A system consists of a finite number of resources.
Multiple concurrent processes have to
compete to use a resource.
The sequence of events to use a resource:
1. Request
2. Allocate
3. Release
Cont…
Request
The number of units requested may not exceed the total number of available units of the resource.
Allocate
Allocate the resource as soon as possible.
Maintain a table of resources allocated or
available.
Cont…
Release
Release the resources after the process has
finished using the allocated resource.
Allocation strategy:
Allocate the resource to the requester if free.
Non-Preemptable resource:
One that cannot be taken away from a process to which
it was allocated until the process voluntarily releases it.
Resource Types
Two general categories of resources:
Reusable &
Consumable.
Hold-and-wait condition
Processes are allowed to request for new resources
without releasing the resources that are currently held.
Cont…
No-preemption condition
A resource that has been allocated to a process
becomes available for allocation to another process
only after it has been voluntarily released by the
process holding it.
Circular-wait condition
Two or more processes must form a circular chain in
which each process is waiting for a resource that is
held by the next process of the chain.
Directed graph:
A pair (N,E), where N is a nonempty set of
nodes and E is a set of directed edges.
Path :
A sequence of nodes (a,b,c,….i,j) of a directed
graph such that (a,b), (b,c),….. (i,j) are
directed edges.
Cont…
Cycle :
A path whose first and last nodes are the
same.
Reachable set:
The reachable set of a node ‘a’ is the set of all
nodes ‘b’ such that a path exists from ‘a’ to ‘b’.
Knot: A nonempty set ‘K’ of nodes such
that the reachable set of each node in ‘K’
is exactly the set ‘K’.
A knot always contains one or more cycles.
A directed graph
[Figure: a directed graph over the nodes a, b, c, d, e, f]
Cycles:
1. (a,b,c,d,e,f,a)
2. (b,c,d,e,b)
Knot: {a,b,c,d,e,f}
Resource allocation graph
Both the set of nodes and the set of edges are
partitioned into two types, resulting in the
following graph elements.
1. Process nodes
2. Resource nodes
3. Assignment edges
4. Request edges
Cont…
[Figure: notation — a node labelled Pi denotes a process; an example graph shows processes P1, P2 and resources R1, R2, R3 with assignment and request edges]
Cont…
2. A cycle in the graph is a necessary but not a sufficient condition for deadlock if one or more of the resource types involved in the cycle have more than one unit.
[Figure: a resource allocation graph over P1, P2, P3 and R1, R2, R3 that contains a cycle but is not deadlocked]
Wait-for graph
A simplified graph, obtained from the original
resource allocation graph by removing the
resource nodes and collapsing the appropriate
edges.
[Figure: a resource allocation graph over processes P2, P3 and resources R1, R2 (left), and the wait-for graph obtained from it (right)]
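A minimal sketch (not from the slides) of deadlock detection on a wait-for graph, using depth-first search to look for a cycle; the adjacency-list encoding is an assumption of this sketch:

```python
def has_deadlock(wfg):
    """Detect a cycle in a wait-for graph {process: [processes it waits for]}."""
    visited, on_stack = set(), set()

    def dfs(p):
        visited.add(p)
        on_stack.add(p)
        for q in wfg.get(p, []):
            if q in on_stack:                # back edge: a cycle, hence a deadlock
                return True
            if q not in visited and dfs(q):
                return True
        on_stack.discard(p)
        return False

    return any(dfs(p) for p in wfg if p not in visited)

# P1 waits for P2, P2 waits for P3, P3 waits for P1: deadlocked.
print(has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}))   # True
```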
Unsafe state:
A system state in which no safe sequence exists.
Cont…
Some assumptions of the algorithm:
The advance knowledge of the resource
requirements of the various processes is
available.
Problem :
Reordering may become inevitable when new
resources are added to the system.
Preemption
A preemptable resource is one whose
state can be easily saved and restored
later.
Progress property
Deadlock must be detected in a finite amount of
time.
Safety property
If a deadlock is detected, it must indeed exist.
No phantom deadlocks.
Cont…
Steps to construct WFG for a distributed
system:
Construct the resource allocation graph for
each site of the system.
[Figure: (a) the resource allocation graphs for sites S1 and S2; (b) the WFGs corresponding to the graphs in (a); (c) the global WFG obtained by taking the union of the two local WFGs]
Cont…
Local WFGs are not sufficient to
characterize all deadlocks in a distributed
system.
Techniques are:
Centralized
Hierarchical
Distributed
Centralized approach for deadlock detection
Use of local coordinator for each site
Maintains a WFG for its local resources
Continuous transfer
Transfer of message whenever a new edge is added
to or deleted from the local WFG.
Periodic transfer
Transfer-on-request
Cont…
Drawbacks of centralized deadlock
detection approach:
Vulnerable to failures of the central
coordinator.
[Figure: the resource allocation graphs maintained by the local coordinators of sites S1 and S2, and the combined resource allocation graph maintained by the central coordinator]
Hierarchical approach for deadlock detection
[Figure: a hierarchy of controllers — controllers A, B, C, and D hold the WFGs of sites S1 to S4; controllers E and F combine them at the next level; controller G at the root covers the whole system]
Cont…
The deadlock cycle (P1, P3, P2, P1), which spans sites S1 and S2, gets reflected in the WFG of controller E.
Algorithms are:
WFG-based distributed algorithm for deadlock
detection.
[Figure: WFG-based distributed deadlock detection — the local WFGs of sites S1 and S2; the same WFGs after the addition of node Pex; and the local WFG of site S2 updated after receiving the deadlock detection message from site S1]
Cont…
If a local WFG contains a cycle that does not involve node Pex, a deadlock involving only local processes of that site has occurred; it is resolved locally.
Solution:
Assign a unique identifier to each process.
Probe-based distributed algorithm for
deadlock detection
Proposed by Chandy et al. in 1983.
[Figure: probe-based detection across sites S1 and S2 — initiator P1 sends probe (P1,P1,P3); probes (P1,P3,P5), (P1,P3,P2), and (P1,P2,P4) follow the wait-for edges; probe (P1,P2,P1) returns to the initiator, signalling a deadlock]
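A minimal sketch (not from the slides) of the probe mechanism: a blocked process sends a probe (initiator, sender, receiver) along its wait-for edges, and a deadlock is declared if a probe ever comes back to the initiator; the graph encoding is an assumption of this sketch:

```python
def detect_deadlock(initiator, waits_for):
    """Chandy-Misra-Haas style probing over waits_for: {process: [processes it waits for]}."""
    seen = set()
    probes = [(initiator, initiator, p) for p in waits_for.get(initiator, [])]
    while probes:
        init, _, receiver = probes.pop()     # a probe is (initiator, sender, receiver)
        if receiver == init:
            return True                      # the probe returned to the initiator: deadlock
        if receiver in seen:
            continue                         # forward through each blocked process once
        seen.add(receiver)
        probes += [(init, receiver, p) for p in waits_for.get(receiver, [])]
    return False

# P1 -> P3 -> {P5, P2} and P2 -> {P1, P4}: a probe returns to P1.
print(detect_deadlock("P1", {"P1": ["P3"], "P3": ["P5", "P2"], "P2": ["P1", "P4"]}))  # True
```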
Cont…
Features of the CMH algorithm:
Easy to implement.
Each message is of fixed length.
Few computational steps.
Low overhead of the algorithm
No graph construction and information
collection
Doesn’t detect false deadlocks
Does not require any particular structure
among the processes.
Methods for recovery from deadlock
Asking for operator intervention.
Termination of process(es).
Rollback of process(es).
Asking for operator intervention
Inform the operator about the deadlock.
Requirements:
Analyze the resource requirements and
interdependencies of the processes involved in
a deadlock cycle.
Rollback of process
Reclaim the needed resources from the
processes that were selected for being
killed.
Rollback the process to a point where the
resource was not allocated to the process.
Processes are checkpointed periodically.
Approach is less expensive than the
process termination approach.
Extra overhead involved in periodic
checkpointing of all the processes.
Issues in recovery from deadlock
Selection of victim(s)
Use of transaction mechanism