DS Midsems
Tightly coupled systems
The processors share a single system-wide memory.
[Figure: several CPUs connected to a shared memory through interconnection hardware]
Loosely coupled systems
The processors do not share memory, and
each processor has its own local memory.
These are also known as distributed
computing systems or simply distributed
systems.
[Figure: processors with local memories connected by a communication network]
Distributed Computing System
A DCS is a collection of independent computers that
appears to its users as a single coherent system,
or
A collection of processors interconnected by a
communication network in which
Each processor has its own local memory and other
peripherals, and
The communication between any two processors of
the system takes place by message passing.
For a particular processor, its own resources are
local, whereas the other processors and their
resources are remote.
Cont…
Together, a processor and its resources are
usually referred to as a node or site or machine
of the distributed computing system.
Resource sharing
Such as software libraries, databases, and hardware resources.
Factors that led to the emergence of
distributed computing system
Better price-performance ratio
Higher reliability
Reliability refers to the degree of tolerance against errors
and component failures in a system.
Achieved by multiplicity of resources.
Extensibility and incremental growth
Achieved by adding additional resources to the system as and when the need arises.
Such systems are termed open distributed systems.
3. FAULT TOLERANCE
Low in Network OS
High in Distributed OS
User Mobility
No matter which machine a user is logged onto, he
should be able to access a resource with the same
name.
Replication Transparency
Replicas are used for better performance and reliability.
No deadlock
It ensures that a situation will never occur in which
competing processes prevent their mutual progress even
though no single one requests more resources than
available in the system.
Performance transparency
The aim of performance transparency is to allow the
system to be automatically reconfigured to improve
performance, as loads vary dynamically in the system.
The large size of the kernel reduces the overall flexibility and
configurability of the resulting OS.
Microkernel model
Goal is to keep the kernel as small as possible.
Cont…
Being modular in nature, the OS is easy to design, implement, and install.
For adding or changing a service, there is no need to stop
the system and boot a new kernel, as in the case of
monolithic kernel.
Performance penalty:
Each server module has its own address space. Hence some form of message-based IPC is required while performing some job.
Message passing between server processes and the
microkernel requires context switches, resulting in additional
performance overhead.
Cont…
Advantages of microkernel model over monolithic kernel
model:
Flexibility in design, maintenance, and portability.
In practice, the performance penalty is not too high, because of other factors, and the small overhead involved in exchanging messages is usually negligible.
PERFORMANCE
Design principles for better performance are:
Batch if possible
Transfer of data in large chunks is more efficient than in individual pages.
Caching whenever possible
Saves a large amount of time and network bandwidth.
[Figure: IPC via a shared memory area between processes P1 and P2, contrasted with message passing between P1 and P2]
Other primitives:
◦ Send primitive
◦ Receive primitive
Cont…
Blocking send primitive:
◦ After execution of the send statement, the sending
process is blocked until it receives an
acknowledgement.
[Figure: blocking send and receive — the sender executes Send(message) and is suspended (blocked state) until an acknowledgement arrives; the receiver executes Receive(message), is suspended until the message arrives, and then resumes execution (executing state)]
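A minimal sketch (not from the slides) of these blocking semantics, using Python threads and in-memory queues to stand in for the kernel's message channels; the channel names and the acknowledgement message are illustrative assumptions:

```python
import queue
import threading

# Two in-memory channels stand in for the network: one carries the
# message to the receiver, the other carries the acknowledgement back.
msg_channel = queue.Queue()
ack_channel = queue.Queue()

def blocking_send(message):
    msg_channel.put(message)
    ack_channel.get()               # sender stays blocked until the ack arrives

def receiver():
    message = msg_channel.get()     # blocking receive: suspended until a message arrives
    print("received:", message)
    ack_channel.put("ACK")          # acknowledgement unblocks the sender

threading.Thread(target=receiver).start()
blocking_send("hello")              # returns only after the receiver has acknowledged
```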
Issues In Non-blocking
2nd strategy
– The message is simply discarded, and a timeout mechanism is used to resend the message after a timeout period.
Cont..
The null buffer strategy is not suitable for
synchronous communication between two
processes in a distributed system.
[Figure: with a null buffer, the message is transferred directly from the sending process to the receiving process]
Unbounded-capacity buffer
A buffer with unbounded capacity may be used in the asynchronous mode of communication.
It can store all unreceived messages with the
assurance that all messages sent to the receiver
will be delivered.
In practice, this strategy is impossible, since no buffer can be truly unbounded.
Finite-bound buffer
The asynchronous mode of communication uses this strategy of a finite-bound buffer.
Cont…
Methods to deal with the problem of buffer
overflow:
◦ Unsuccessful communication
Message transfers simply fail whenever there is no more
buffer space.
◦ Flow-controlled communication
The sender is blocked until the receiver accepts some
messages, thus creating space in the buffer for new
messages.
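A minimal sketch (not from the slides) of flow-controlled communication over a finite-bound buffer; the buffer size of 2 and the artificial delay are illustrative assumptions:

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=2)      # finite-bound buffer: at most 2 unreceived messages

def sender():
    for i in range(5):
        buffer.put(f"message {i}")   # blocks when the buffer is full (flow control)
        print("sent message", i)

def receiver():
    for _ in range(5):
        time.sleep(0.1)              # slow receiver: forces the buffer to fill up
        print("got", buffer.get())

threading.Thread(target=sender).start()
threading.Thread(target=receiver).start()
```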
Cont…
A create-buffer system call is provided to the user.
Cont…
[Figure: multiple-message buffer (mailbox/port) on the receiver's node, holding messages 1 through n sent by the sending process for the receiving process]
Multidatagram messages
A message whose size is greater than the MTU (maximum transfer unit) of the network is fragmented into multiple packets, each of at most the MTU, and transmitted as multiple datagrams.
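A minimal sketch (not from the slides) of fragmenting a message into MTU-sized datagrams and reassembling it; the header layout (sequence number, total count) is an assumed convention:

```python
MTU = 16  # illustrative maximum transfer unit, in bytes

def fragment(message: bytes, mtu: int = MTU):
    """Split a message into (seq_no, total, payload) datagrams of at most mtu bytes."""
    payloads = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    return [(seq, len(payloads), p) for seq, p in enumerate(payloads)]

def reassemble(datagrams):
    """Rebuild the original message from datagrams received in any order."""
    ordered = sorted(datagrams)                     # sort by sequence number
    return b"".join(payload for _, _, payload in ordered)

msg = b"a multidatagram message longer than one MTU"
datagrams = fragment(msg)[::-1]        # simulate out-of-order arrival
assert reassemble(datagrams) == msg
```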
Process Addressing
Explicit Addressing
The process with which communication is desired
is explicitly named as a parameter in the
communication primitive used.
Implicit Addressing
The process with which communication is desired is not explicitly named; instead, a service identifier is used, and the message is delivered to any process that provides that service.
Cont…
Benefit:
No global coordination required for local-id
Drawback
◦ It does not allow a process to migrate from one machine to another when required; for example, one or more processes of a heavily loaded machine may be migrated to a lightly loaded machine to balance the overall system load.
Cont…
Solution
◦ Processes can be identified by a combination of three fields: machine_id, local_id, machine_id, where
the first machine_id identifies the node on which the process was created,
local_id is the id assigned to the process by its creating node, and
the second machine_id identifies the last known location (node) of the process.
Cont…
Drawbacks of this approach:
◦ The overhead of locating a process may be large if the process has migrated several times during its lifetime.
Solutions are:
– Ensure that the system wide unique identifier of a
process does not contain an embedded machine
identifier.
– Use a two-level naming scheme for processes.
Two Level Naming Scheme
Cont…
When a process wants to send a message to another process, it specifies the high-level name of the receiver process.
Cont…
[Figure: possible failures in client-server communication — the request message is lost on its way to the receiver; the response message is lost on its way back; or the receiver's machine crashes and restarts, so the request goes unserved]
Cont…
To overcome these problems
• A reliable IPC protocol of the message-passing system is designed.
• It is based on the ideas of internal retransmission of messages after a timeout, and
• the return of an ACK to the sending machine's kernel by the receiving machine's kernel.
Cont…
Protocols used for client-server communication
between two processes are:
◦ Four-message reliable IPC protocol
◦ Three-message reliable IPC protocol
◦ Two-message reliable IPC protocol
Four-message reliable IPC protocol
Four-message reliable IPC protocol
[Figure: four-message protocol — the client sends a request and remains blocked; the server's kernel acknowledges the request; after execution the server sends the reply, which resumes the client; finally the client's kernel acknowledges the reply]
Three-message reliable IPC protocol
The result of the processed request is sufficient
acknowledgment that the request message was
received by the server.
Three-message reliable IPC protocol
[Figure: three-message protocol — the client sends a request and remains blocked; after execution the server's reply both answers and acknowledges the request; the client's kernel then acknowledges the reply]
Cont…
The problem lies in the choice of the timeout value.
• If the request message is lost, it will be retransmitted only after the timeout period, which might have been set to too large a value.
Fault-tolerant communication
[Figure: fault-tolerant communication — the client's first request message is lost; after a timeout the client retransmits the request, and the server's response message then arrives]
Idempotency
It means repeatability.
An Idempotent operation produces the same
result without any side effects, no matter how
many times it is performed with the same
argument.
Example: using procedure GetSqrt(64) for
calculating the square root of a given number.
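A minimal sketch (not from the slides) contrasting an idempotent operation with a non-idempotent one; the bank-balance example is an assumed illustration:

```python
import math

balance = 100

def get_sqrt(n):
    # Idempotent: the same argument always yields the same result,
    # with no side effects, however many times the call is repeated.
    return math.sqrt(n)

def debit(amount):
    # Non-idempotent: each repeated execution changes the balance, so
    # re-executing a duplicate request corrupts the state.
    global balance
    balance -= amount
    return balance

assert get_sqrt(64) == get_sqrt(64) == 8.0
debit(10); debit(10)        # a duplicated debit request is applied twice
print(balance)              # 80, not the intended 90
```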
ISSUE : Duplicate Requests.
◦ If the execution of the request is nonidempotent, then
its repeated execution will destroy the consistency of
information.
Handling of Duplicate Requests
Suppose a client makes a request, the server processes it, but the client does not receive the response; after the timeout, the client issues the request again.
What happens?
Cont…
Solution
◦ Use a unique id for every request.
◦ Maintain a reply cache in the kernel's address space on the server machine to cache replies.
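A minimal sketch (not from the slides) of such a reply cache keyed by a unique request id, so a retransmitted duplicate gets the cached reply instead of being re-executed; the handler and id scheme are illustrative assumptions:

```python
reply_cache = {}   # request_id -> cached reply (kept on the server machine)

def handle_request(request_id, operation, *args):
    if request_id in reply_cache:
        return reply_cache[request_id]   # duplicate: resend old reply, don't re-execute
    reply = operation(*args)             # execute the (possibly non-idempotent) request
    reply_cache[request_id] = reply
    return reply

balance = [100]
def debit(amount):
    balance[0] -= amount
    return balance[0]

first = handle_request("req-42", debit, 10)   # executes: balance becomes 90
dup   = handle_request("req-42", debit, 10)   # duplicate: cached reply, balance unchanged
assert first == dup == 90 and balance[0] == 90
```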
Multicasting is an Asynchronous
communication mechanism.
[Figure: two senders S1 and S2 multicast m1 (sent at t1) and m2 (sent at t2), with t1 < t2; both receivers R1 and R2 deliver m1 before m2]
Consistent Ordering
This semantics ensures that all messages are
delivered to all receiver processes in the same
order.
[Figure: consistent ordering — m1 is sent at t1 and m2 at t2 with t1 < t2, yet both receivers deliver the messages in the same order, here m2 before m1]
Causal Ordering
• PCR
CBCAST Protocol
• For causal ordering — used in ISIS, a real commercial distributed system based on process groups.
• The ISIS project has moved from Cornell University to ISIS Distributed Systems, a subsidiary of Stratus Computer Inc.
• Each member process of a group maintains a vector of
“n” components, where “n” is the total number of
members in the group.
• Each member is assigned a sequence number from 1 to n, and the ith component of the vector corresponds to the member with sequence number i.
• In particular, the value of the ith component of a
member’s vector is equal to the number of the last
message received in sequence by this member from
member ‘i’.
Cont…
• To send a message, a process increments the value of
its own component in its own vector and sends the
vector as part of the message.
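A minimal sketch (not the full ISIS implementation) of this send rule together with the standard CBCAST delivery condition, which the slides do not spell out: a message from sender i is delivered once it is the next in sequence from i and everything i had already seen has been delivered locally. The class and group size are illustrative assumptions:

```python
n = 3   # illustrative group size

class Member:
    def __init__(self, my_id):
        self.my_id = my_id
        self.vec = [0] * n       # vec[i] = last message delivered in sequence from member i
        self.pending = []        # received but not yet deliverable messages

    def send(self, payload):
        self.vec[self.my_id] += 1                      # count our own message
        return (self.my_id, list(self.vec), payload)   # the vector travels with the message

    def deliverable(self, msg):
        sender, v, _ = msg
        # Next in sequence from the sender, and nothing causally earlier is missing.
        return v[sender] == self.vec[sender] + 1 and all(
            v[k] <= self.vec[k] for k in range(n) if k != sender)

    def receive(self, msg):
        self.pending.append(msg)
        progress = True
        while progress:                                # deliver everything now eligible
            progress = False
            for m in list(self.pending):
                if self.deliverable(m):
                    self.pending.remove(m)
                    self.vec[m[0]] = m[1][m[0]]        # record the delivery
                    print(f"member {self.my_id} delivers {m[2]}")
                    progress = True

a, b, c = Member(0), Member(1), Member(2)
m1 = a.send("m1")
b.receive(m1)        # b delivers m1 ...
m2 = b.send("m2")    # ... then sends m2, causally after m1
c.receive(m2)        # arrives first: buffered, not yet deliverable
c.receive(m1)        # m1 delivered, then the buffered m2 follows
```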
CPU-bound blocking
IO-bound blocking
Wait for data to return from an IO source, such as a network
or a hard drive.
Non-blocking IO
• When data has returned from IO, the caller will be
notified
• Done with a callback function that has access to the
returned data.
Network IO and Sockets
• At kernel level a socket is used as an abstraction to
communicate with a NIC.
• The socket takes care of reading and writing data to/from the NIC; the NIC sends the data over the UTP cable to the internet.
• For example, if you go to a URL in your browser:
• At low level the data in your HTTP request is written
to a socket using the send(2) system call.
• When a response is returned, response data can be
read from that socket using the recv(2) system call.
• So when data has returned from network IO, it is ready
to be read from the socket.
Non-Blocking IO under the hood
• Use an infinite loop that constantly checks (polls) whether data has returned from IO; this loop is called an event loop.
• It checks whether data is ready to read from a network socket.
• Sockets are implemented as file descriptors (FD) on
UNIX systems.
• All sockets are file descriptors but converse is not true
• So technically, FD is checked for ready data.
• The list of FDs that you want to check for ready data is
generally called the Interest List.
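A minimal sketch (not from the slides) of such an event loop, using Python's standard selectors module, which wraps epoll/kqueue on the corresponding platforms; the echo behaviour and port number are illustrative assumptions:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll on Linux, kqueue on BSD/macOS

listener = socket.socket()
listener.bind(("localhost", 9000))         # illustrative port
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)   # add the listening FD to the interest list

while True:
    # Blocks until at least one registered FD has ready data; each iteration's
    # cost scales with the number of events, not the size of the interest list.
    for key, _ in sel.select():
        if key.fileobj is listener:
            conn, _ = listener.accept()        # new connection is ready
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(4096)      # ready FD: this recv won't block
            if data:
                key.fileobj.send(data)         # echo the data back
            else:
                sel.unregister(key.fileobj)    # peer closed the connection
                key.fileobj.close()
```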
Optimizations to Event Loop
Each (major) OS provides kernel level APIs to help create
an event loop
• Linux - epoll or io_uring,
• BSD - kqueue &
• Windows - IOCP.
Each of these APIs is able to check FDs for ready data with
a computational complexity of around
O(No_of_Events_Occurred).
In other words, you can monitor 100,000s of FDs, but the API's execution speed depends only on the number of events that occur in the current iteration of the event loop.
Conclusion
Applications that need to handle high event rates mostly use
non-blocking IO models implemented with event loops.
For best performance, the event loop is built using kernel
APIs such as kqueue, io_uring, epoll and IOCP.
Hardware interrupts and signals are less suited for non-blocking IO when handling large numbers of events per second.
Server Architectures
Two competitive server architectures based on:
• Threads or
• Events.
Internals of an Event-Driven Architecture
Since the caller and the callee processes have disjoint address spaces, the remote procedure has no access to the data and variables of the caller's environment.
Cont…
Therefore, the RPC facility uses a message-passing scheme for information exchange between the caller and the callee processes.
The RPC runtime component is responsible for
Retransmission,
Acknowledgement,
Routing &
Encryption.
Server Stub
It is responsible for the following two tasks:
◦ Unpacking (unmarshalling) an incoming call request and invoking the corresponding server procedure.
◦ Packing (marshalling) the result of execution into a reply message.
Server management involves:
◦ Server implementation
◦ Server creation
Server Implementation
Types of servers :
Stateful servers
Stateless servers
Stateful Server
A Stateful Server maintains clients state
information from one RPC to the next.
[Figure: a client's read of 100 bytes returns bytes 0 to 99; the stateful server remembers the file position for the next request]
Instance-per-call Servers
Instance-per-session Server
Persistent Servers
Instance-per-call Servers
These exist only for the duration of a single call.
Persistent Servers
These remain in existence indefinitely and are the most commonly used.
Advantages:
Improved performance and reliability.
Can be shared by many clients.
If there are any orphan calls, the caller takes the result of the first response message and ignores the others, whether or not the accepted response is from an orphan.
Exactly-Once Call Semantics
It uses the request/reply/ACK protocol.
[Figure: for each procedure call, the client's request message is answered by a reply message that also serves as the acknowledgement of the request; the client stays blocked during each RPC execution]
Request/Reply/Acknowledge-Reply (RRA) protocol
[Figure: RRA protocol — for each call, the client sends a request and remains blocked; the server executes and sends a reply; the client then sends a reply acknowledgement message]
Complicated RPC’s
Types of complicated RPCs
Issues
Providing the server with the client's handle
Making the client process wait for the callback RPC
Handling callback deadlocks
Cont…
[Figure: callback RPC — the client calls the server with a parameter list, and the server starts procedure execution]
Simple stubs
Drawbacks
• The problem is that it forces the user to wait for the browser to discover and retrieve critical assets until after the HTML document has been downloaded.
• This delays rendering and increases load times.
• Server push lets the server preemptively “push”
website assets to the client without the user
having explicitly asked for them.
• When used with care, the server can send what it knows the user is going to need for the page they're requesting.
Interoperability
• gRPC tools and libraries are designed to work with
multiple platforms and programming languages,
including Java, JavaScript, Ruby, Python, Go, Dart,
Objective-C, C#, and more.
Should I Use gRPC instead of REST?
• Basically, gRPC is another alternative that could be
useful in certain circumstances:
– large-scale microservices connections
– real-time communication
– low-power, low-bandwidth systems
– multi-language environments
• While HTTP supports mediators for edge caching,
gRPC calls use the POST method, which is a
threat to API-security.
• The responses can’t be cached through
intermediaries.
• Moreover, the gRPC specification makes no provisions for caching; it only indicates the wish for cache semantics between server and client.
Distributed Shared Memory
Distributed shared memory
The DSM paradigm provides processes with a shared address space.
Primitives for shared memory:
1. Read(address)
2. Write(address, data)
The shared memory paradigm gives the system an illusion of physically shared memory.
DSM refers to the shared memory paradigm applied to loosely coupled distributed-memory systems.
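A minimal sketch (not from the slides) of the two primitives over a message-passing substrate: a Read or Write on an address whose block is not cached locally fetches it from the owning node. The block size, ownership rule, and direct method calls standing in for messages are all simplifying assumptions:

```python
BLOCK_SIZE = 4   # illustrative block size

class Node:
    def __init__(self, nodes):
        self.nodes = nodes   # all nodes; direct calls stand in for network messages
        self.cache = {}      # locally cached blocks of the shared address space
        self.owned = {}      # blocks whose master copy lives on this node

    def _fetch(self, block):
        if block not in self.cache:                  # block fault: locate the block
            for node in self.nodes:
                if block in node.owned:              # fetch it from the owning node
                    self.cache[block] = node.owned[block]
                    return
            self.cache[block] = [0] * BLOCK_SIZE     # untouched memory reads as zero

    def read(self, address):
        block, offset = divmod(address, BLOCK_SIZE)
        self._fetch(block)
        return self.cache[block][offset]

    def write(self, address, data):
        block, offset = divmod(address, BLOCK_SIZE)
        self._fetch(block)
        self.cache[block][offset] = data
        self.owned[block] = self.cache[block]        # simplification: the writer owns the block

nodes = []
a, b = Node(nodes), Node(nodes)
nodes += [a, b]
a.write(5, 42)       # node a writes into the shared address space
print(b.read(5))     # node b block-faults, fetches the block from a, reads 42
```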
Cont….
Shared memory exists only virtually.
Similar concept to virtual memory.
DSM also known as DSVM.
DSM provides a virtual address space shared among
processes on loosely coupled processors.
DSM is basically an abstraction that integrates the local memory of different machines into a single logical entity shared by cooperating processes.
DSM Architecture
Each node of the system consists of one or more CPUs and a memory unit.
Nodes are connected by high speed communication
network.
Simple message passing system for nodes to exchange
information.
Main memory of individual nodes is used to cache pieces
of shared memory space.
Reduces network latency
Cont….
Memory mapping manager routine maps local memory
to shared virtual memory.
Shared memory space is partitioned into blocks.
Data caching is used in DSM system to reduce network
latency.
The basic unit of caching is a memory block.
Cont….
If the data is not available in local memory, a network block fault is generated.
Directory size
The larger the block size, the smaller the directory, giving better performance.
Implementing Seq. Consistency Model
Strategies:
1. Nonreplicated, Nonmigrating blocks (NRNMB)
2. Nonreplicated, Migrating blocks (NRMB)
3. Replicated, migrating blocks (RMB)
4. Replicated, Nonmigrating blocks (RNMB)
NRNMBs
Simplest strategy.
Drawbacks:
Serializing data access creates a bottleneck.
NRMBs
Migration is allowed.
The owner node of a block changes as soon as the block migrates to a new node.
Drawbacks:
Prone to thrashing problem.
No parallelism.
Data locating in the NRMB strategy
On a fault,
Fault handler of the faulting node broadcasts a
read/write request on the network.
Disadvantage:
Not scalable.
Centralized server algorithm
Cont…
A centralized server maintains a block table that contains the location information for all blocks in the shared memory space.
Drawbacks:
A centralized server serializes location
queries, reducing parallelism.
Write-invalidate
All copies of a piece of data except one are invalidated
before a write can be performed on it.
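A minimal sketch (not from the slides) of the write-invalidate rule, with the copy-set bookkeeping reduced to a single shared block; the class and names are illustrative assumptions:

```python
class Block:
    """One shared block with a single up-to-date value and a set of cached copies."""

    def __init__(self):
        self.value = 0
        self.copy_set = set()    # nodes currently holding a valid copy

    def read(self, node):
        self.copy_set.add(node)  # the node caches the block on read
        return self.value

    def write(self, node, value):
        # Write-invalidate: all copies except the writer's are invalidated
        # before the write is performed.
        self.copy_set = {node}
        self.value = value

blk = Block()
blk.read("N2")           # N2 caches the block
blk.write("N1", 8)       # N2's copy is invalidated by this write
print(blk.copy_set)      # {'N1'}: N2 must re-fetch the block to read again
```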
1. A quartz crystal
Oscillates at a well-defined frequency.
2. A counter register
Keeps track of the oscillations of the quartz crystal: it is decremented by one on each oscillation, and when its value reaches zero an interrupt is generated and the counter is reinitialized to the value of the constant register.
3. A constant register
Stores a constant value, based on the frequency of
oscillation of the quartz crystal.
Cont…
Each interrupt is known as a clock tick.
Ideal case:
dC/dt =1
Cont…
A clock is non-faulty if its drift stays within the maximum allowable drift rate ρ.
Condition:
1 − ρ ≤ dC/dt ≤ 1 + ρ
Slow, perfect, and fast clocks
[Figure: clock time C plotted against real time t — a perfect clock has dC/dt = 1, the fast-clock region has dC/dt > 1, and the slow-clock region has dC/dt < 1]
Cont…
Types of clock synchronization in DS:
1. Synchronization of the computer clocks with real-time (or external) clocks.
2. Mutual (or internal) synchronization of the clocks of the system.
Synchronization algorithms are either centralized or distributed.
Centralized algorithms
Use a time-server node for referencing the correct time.
[Figure: space-time diagram — events e10 to e13, e20 to e24, and e30 to e32 of three processes plotted against time, with message transfers between the processes]
Implementation of logical clocks:
1. Using counters
2. Using physical clocks
Using counters
Each process has a counter like C1 and C2
for process P1 and P2, respectively.
Counters
Act as logical clocks.
Initialized to zero.
Incremented by 1 on each event of the process.
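A minimal sketch (not from the slides) of such counters acting as logical clocks; the correction rule on receive (jump past the sender's timestamp) is the standard Lamport rule, assumed here to complete the example:

```python
class LogicalClock:
    def __init__(self):
        self.counter = 0            # initialized to zero

    def event(self):
        self.counter += 1           # incremented by 1 on each event of the process
        return self.counter

    def send(self):
        return self.event()         # sending is an event; the timestamp travels along

    def receive(self, timestamp):
        # Lamport correction: jump past the sender's timestamp if it is ahead.
        self.counter = max(self.counter, timestamp) + 1
        return self.counter

c1, c2 = LogicalClock(), LogicalClock()
c1.event(); c1.event()      # C1 = 2
ts = c1.send()              # C1 = 3; the message carries timestamp 3
print(c2.receive(ts))       # C2 jumps to 4, preserving the happened-before order
```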
[Figure: counters as logical clocks — process P1's events e01, e02, e03 carry C1 = 1, 2, 3; process P2's events e11, e12 carry C2 = 1, 2]
Using physical clocks
Each process has a physical clock
associated with it.
Each clock runs at a constant rate (the rate may differ from process to process).
Example:
When Process p1 has ticked 10 times, process
p2 has ticked only 8 times.
[Figure: physical clock times of P1 and P2, with and without corrections — P1's clock ticks 10 per interval (events e01 to e06 at times 10 to 90) while P2's ticks 8 per interval (8 to 72); a message timestamped 60, sent at e04, arrives at e13 when P2's uncorrected clock reads 56, so P2's clock is corrected forward to 61, and its subsequent times become 69 and 77]
Total ordering of events
No events can occur at exactly the same
time.
Requirements of a mutual exclusion algorithm:
1. Mutual exclusion
2. No starvation
Cont…
Mutual exclusion:
Given a shared resource accessed by multiple concurrent
processes, at any time only one process should access
the resource.
Can be implemented in single-processor systems, using
semaphores, monitors and similar constructs.
No starvation:
If every process that is granted the resource eventually
releases it, every request must be eventually granted.
Cont…
Approaches:
1. Centralized approach
2. Distributed approach
3. Token passing approach
Centralized approach
One of the processes in the system is elected as
the coordinator.
[Figure: centralized approach — processes P1 and P3 send numbered request, reply, and release messages through the coordinator Pc, which grants the critical section to one process at a time in request-arrival order]
No starvation:
Due to use of first-come, first-served
scheduling policy.
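A minimal sketch (not from the slides) of the coordinator's logic: grant with a reply if the resource is free, otherwise queue the request first-come first-served; direct calls stand in for the request, reply, and release messages:

```python
from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = None
        self.waiting = deque()       # first-come, first-served queue: no starvation

    def request(self, process):
        if self.holder is None:
            self.holder = process
            return "reply"           # grant: the process may enter the critical section
        self.waiting.append(process) # busy: defer the reply
        return "queued"

    def release(self):
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            print("reply sent to", self.holder)   # the next waiter enters

c = Coordinator()
print(c.request("P1"))   # reply -> P1 enters the critical section
print(c.request("P3"))   # queued behind P1
c.release()              # reply sent to P3
```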
Cont…
Advantages of algorithm:
Simple to implement
Requires only three messages- a request, a
reply and a release.
Drawbacks:
Coordinator is the single point of failure.
Distributed approach
The decision making for mutual exclusion
is distributed across the entire system.
All processes that want to enter the same
critical section cooperate with each other
before reaching a decision on which process
will enter the critical section next.
Ricart and Agrawala Algorithm
Use of event-ordering scheme to generate a
unique timestamp for each event in the system.
[Figure: Ricart-Agrawala example — two processes request the critical section with timestamps TS = 4 and TS = 6 while another process is already inside; replies are deferred until it exits, and the request with the smaller timestamp (TS = 4) enters first]
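A minimal sketch (not from the slides) of the reply-or-defer decision each process makes in the Ricart-Agrawala algorithm when a request arrives; the state encoding and the tie-break on process id are the standard rules, assumed here:

```python
def on_request(my_state, my_timestamp, my_id, req_timestamp, req_id):
    """Decide whether to reply immediately to an incoming request or defer it."""
    if my_state == "IN_CS":
        return "defer"                   # already in the critical section
    if my_state == "WANTED":
        # Both want the critical section: the earlier timestamp wins; ids break ties.
        if (my_timestamp, my_id) < (req_timestamp, req_id):
            return "defer"
    return "reply"                       # idle, or the requester is earlier

# Two processes request concurrently with timestamps 4 and 6:
print(on_request("WANTED", 4, "P4", 6, "P3"))   # defer: P4's own request is earlier
print(on_request("WANTED", 6, "P3", 4, "P4"))   # reply: P4 wins and enters first
```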
Problems in the token-passing approach are process failure and loss of the token.
Process failure
A process receiving the token from its neighbour always sends an ACK.
Each node maintains the current ring configuration.
If a process fails, dynamic reconfiguration of the ring is carried out.
When a failed process comes back up, it simply informs the others.
Lost Token
To regenerate a lost token, one process in the ring acts as a monitor process.
The monitor periodically circulates a "who has the token" message.
The process holding the token inserts its id in a special field of this message.
If no id is found when the message returns, the token has been lost, and the monitor regenerates it.
Problems:
The monitor process may itself fail.
The "who has the token" message may itself be lost.
Deadlock
A system consists of a finite number of resources.
Multiple concurrent processes have to
compete to use a resource.
The sequence of events to use a resource:
1. Request
2. Allocate
3. Release
Cont…
Request
The number of units requested may not exceed the total number of available units of the resource.
Allocate
Allocate the resource as soon as possible.
Maintain a table of resources allocated or
available.
Cont…
Release
Release the resources after the process has
finished using the allocated resource.
Allocation strategy:
Allocate the resource to the requester if free.
Non-Preemptable resource:
One that cannot be taken away from a process to which
it was allocated until the process voluntarily releases it.
Resource Types
Two general categories of resources:
Reusable &
Consumable.
Hold-and-wait condition
Processes are allowed to request for new resources
without releasing the resources that are currently held.
Cont…
No-preemption condition
A resource that has been allocated to a process
becomes available for allocation to another process
only after it has been voluntarily released by the
process holding it.
Circular-wait condition
Two or more processes must form a circular chain in
which each process is waiting for a resource that is
held by the next process of the chain.
Directed graph:
A pair (N,E), where N is a nonempty set of
nodes and E is a set of directed edges.
Path :
A sequence of nodes (a,b,c,….i,j) of a directed
graph such that (a,b), (b,c),….. (i,j) are
directed edges.
Cont…
Cycle :
A path whose first and last nodes are the
same.
Reachable set:
The reachable set of a node ‘a’ is the set of all
nodes ‘b’ such that a path exists from ‘a’ to ‘b’.
Knot: A nonempty set ‘K’ of nodes such
that the reachable set of each node in ‘K’
is exactly the set ‘K’.
A knot always contains one or more cycles.
A directed graph
[Figure: a directed graph over the nodes a, b, c, d, e, f]
Cycles:
1. (a,b,c,d,e,f,a)
2. (b,c,d,e,b)
Knot: {a,b,c,d,e,f}
Resource allocation graph
Both the set of nodes and the set of edges are
partitioned into two types, resulting in the
following graph elements.
1. Process nodes
2. Resource nodes
3. Assignment edges
4. Request edges
Cont…
[Figure: notation — a node labelled Pi denotes a process; an example graph shows processes P1, P2 and resources R1, R2, R3 with assignment and request edges]
Cont…
2. A cycle in the graph is a necessary but not a sufficient condition for deadlock if one or more of the resource types involved in the cycle have more than one unit.
[Figure: a resource allocation graph over P1, P2, P3 and R1, R2, R3 that contains a cycle but is not deadlocked]
Wait-for graph
A simplified graph, obtained from the original
resource allocation graph by removing the
resource nodes and collapsing the appropriate
edges.
[Figure: a resource allocation graph over processes P2, P3 and resources R1, R2 (left), and the wait-for graph obtained from it (right)]
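A minimal sketch (not from the slides) of deadlock detection on a wait-for graph, using depth-first search to look for a cycle; the adjacency-list encoding is an assumption of this sketch:

```python
def has_deadlock(wfg):
    """Detect a cycle in a wait-for graph {process: [processes it waits for]}."""
    visited, on_stack = set(), set()

    def dfs(p):
        visited.add(p)
        on_stack.add(p)
        for q in wfg.get(p, []):
            if q in on_stack:                # back edge: a cycle, hence a deadlock
                return True
            if q not in visited and dfs(q):
                return True
        on_stack.discard(p)
        return False

    return any(dfs(p) for p in wfg if p not in visited)

# P1 waits for P2, P2 waits for P3, P3 waits for P1: deadlocked.
print(has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}))   # True
```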
Unsafe state:
A system state in which no safe sequence exists.
Cont…
Some assumptions of the algorithm:
The advance knowledge of the resource
requirements of the various processes is
available.
Problem :
Reordering may become inevitable when new
resources are added to the system.
Preemption
A preemptable resource is one whose
state can be easily saved and restored
later.
Progress property
Deadlock must be detected in a finite amount of
time.
Safety property
If a deadlock is detected, it must indeed exist.
No phantom deadlocks.
Cont…
Steps to construct WFG for a distributed
system:
Construct the resource allocation graph for
each site of the system.
[Figure: (a) the resource allocation graphs for sites S1 and S2; (b) the WFGs corresponding to the graphs in (a); (c) the global WFG obtained by taking the union of the two local WFGs]
Cont…
Local WFGs are not sufficient to
characterize all deadlocks in a distributed
system.
Techniques are:
Centralized
Hierarchical
Distributed
Centralized approach for deadlock detection
Use of local coordinator for each site
Maintains a WFG for its local resources
Continuous transfer
Transfer of message whenever a new edge is added
to or deleted from the local WFG.
Periodic transfer
Transfer-on-request
Cont…
Drawbacks of centralized deadlock
detection approach:
Vulnerable to failures of the central
coordinator.
[Figure: the resource allocation graphs maintained by the local coordinators of sites S1 and S2, and the combined resource allocation graph maintained by the central coordinator]
Hierarchical approach for deadlock detection
[Figure: a hierarchy of controllers — controllers A, B, C, and D hold the WFGs of sites S1 to S4; controllers E and F combine them at the next level; controller G at the root covers the whole system]
Cont…
The deadlock cycle (P1, P3, P2, P1), which spans sites S1 and S2, gets reflected in the WFG of controller E.
Algorithms are:
WFG-based distributed algorithm for deadlock
detection.
[Figure: WFG-based distributed deadlock detection — the local WFGs of sites S1 and S2; the same WFGs after the addition of node Pex; and the local WFG of site S2 updated after receiving the deadlock detection message from site S1]
Cont…
If a local WFG contains a cycle that does not involve node Pex, a deadlock involving only local processes of that site has occurred; it is resolved locally.
Solution:
Assign a unique identifier to each process.
Probe-based distributed algorithm for
deadlock detection
Proposed by Chandy et al. in 1983.
[Figure: probe-based detection across sites S1 and S2 — initiator P1 sends probe (P1,P1,P3); probes (P1,P3,P5), (P1,P3,P2), and (P1,P2,P4) follow the wait-for edges; probe (P1,P2,P1) returns to the initiator, signalling a deadlock]
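A minimal sketch (not from the slides) of the probe mechanism: a blocked process sends a probe (initiator, sender, receiver) along its wait-for edges, and a deadlock is declared if a probe ever comes back to the initiator; the graph encoding is an assumption of this sketch:

```python
def detect_deadlock(initiator, waits_for):
    """Chandy-Misra-Haas style probing over waits_for: {process: [processes it waits for]}."""
    seen = set()
    probes = [(initiator, initiator, p) for p in waits_for.get(initiator, [])]
    while probes:
        init, _, receiver = probes.pop()     # a probe is (initiator, sender, receiver)
        if receiver == init:
            return True                      # the probe returned to the initiator: deadlock
        if receiver in seen:
            continue                         # forward through each blocked process once
        seen.add(receiver)
        probes += [(init, receiver, p) for p in waits_for.get(receiver, [])]
    return False

# P1 -> P3 -> {P5, P2} and P2 -> {P1, P4}: a probe returns to P1.
print(detect_deadlock("P1", {"P1": ["P3"], "P3": ["P5", "P2"], "P2": ["P1", "P4"]}))  # True
```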
Cont…
Features of the CMH algorithm:
Easy to implement.
Each message is of fixed length.
Few computational steps.
Low overhead of the algorithm
No graph construction and information
collection
Doesn’t detect false deadlocks
Does not require any particular structure
among the processes.
Methods for recovery from deadlock
Asking for operator intervention.
Termination of process(es).
Rollback of process(es).
Asking for operator intervention
Inform the operator about the deadlock.
Requirements:
Analyze the resource requirements and
interdependencies of the processes involved in
a deadlock cycle.
Rollback of process
Reclaim the needed resources from the
processes that were selected for being
killed.
Rollback the process to a point where the
resource was not allocated to the process.
Processes are checkpointed periodically.
Approach is less expensive than the
process termination approach.
Extra overhead involved in periodic
checkpointing of all the processes.
Issues in recovery from deadlock
Selection of victim(s)
Use of transaction mechanism