Distributed Systems
Foundations
• Layered Protocol
• Types of Communication
A Client-Server Transaction
• Most distributed systems applications are based on a client-server
model
• A server process and one or more client processes
• Server manages some resource
• Server provides a service by manipulating that resource on behalf of clients
• Server activated by request from client
The problem of communication
Applications: HTTP, Skype, SSH, FTP
Transmission media: coaxial cable, fiber optic, Wi-Fi
• New applications or media only need to implement the intermediate layer's interface
Basic Networking Model
• Physical layer: contains the specification and
implementation of bits, and their
transmission between sender and receiver
• Data link layer: prescribes the transmission of
a series of bits into a frame to allow for
error and flow control
• Network layer: describes how packets in a
network of computers are to be routed and
how congestion is handled
• Transport layer: establishes a reliable
connection between applications running on
two computers
• NOTE: For many distributed systems, the
lowest-level interface is that of the network
layer.
Remote Procedure Calls
Basic RPC operation
Parameter Passing
RPC-based application support
Variations on RPC
Example: DCE RPC
Basic RPC Operation (1)
• Observations:
• Application developers are familiar
with simple procedure model
• Well-engineered procedures
operate in isolation (black box)
• There is no fundamental reason
not to execute procedures on a
separate machine
• Conclusion: Communication
between caller & callee can be
hidden by using procedure-call
mechanism.
Basic RPC Operation (2)
• Client procedure calls client stub.
• Client stub builds message; calls local OS.
• Client OS sends message to remote OS.
• Server (remote) OS gives message to server stub.
• Server stub unpacks message and makes a local procedure call to the
server; server returns result to server stub.
• Server stub builds message; calls server OS.
• Server OS sends message to client OS.
• Client OS gives message to client stub.
• Client stub unpacks result; returns to client procedure.
DistributedSystems\4.Communication\rpc_dblist.py
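The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of rpc_dblist.py: the procedure name `append`, the JSON message format, and the `transport` function (a direct call standing in for the two operating systems and the network) are all assumptions made for the example.

```python
import json

# --- Server side ---------------------------------------------------------
def append(data, dblist):
    """The actual procedure; it runs on the (hypothetical) server."""
    return dblist + [data]

PROCEDURES = {"append": append}

def server_stub(request_bytes):
    """Unpack the request, make the local call, pack the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    result = PROCEDURES[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")

# --- "Network" -----------------------------------------------------------
def transport(request_bytes):
    """Stand-in for client OS -> server OS -> client OS (a direct call here)."""
    return server_stub(request_bytes)

# --- Client side ---------------------------------------------------------
def client_stub(proc, *args):
    """Build the message, hand it to the transport, unpack the reply."""
    request_bytes = json.dumps({"proc": proc, "args": list(args)}).encode("utf-8")
    reply_bytes = transport(request_bytes)
    return json.loads(reply_bytes.decode("utf-8"))["result"]

def remote_append(data, dblist):
    """To the caller this looks like an ordinary local procedure."""
    return client_stub("append", data, dblist)

if __name__ == "__main__":
    print(remote_append(3, [1, 2]))
```

The caller never sees bytes or messages; swapping `transport` for a real socket exchange would leave `remote_append` unchanged, which is the point of the stub pair.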
RPC Parameter Passing (1)
• There's more than just wrapping parameters into a message
• Client and server machines may have different data representations (e.g.
little endian or big endian)
• Wrapping a parameter means transforming a value into a sequence of
bytes
• Client and server have to agree on the same encoding:
• How are basic data values represented (integers, floats,
characters)
• How are complex data values represented (arrays, unions)
• Conclusion: Client and server need to properly interpret messages,
transforming them into machine-dependent representations.
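A small sketch of why the encoding must be agreed on, using Python's standard `struct` module. The value `0x01020304` and the helper names `marshal` and `unmarshal` are chosen for illustration only.

```python
import struct

value = 0x01020304

little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first

# A receiver that assumes big-endian misreads little-endian bytes:
misread = struct.unpack(">I", little)[0]

def marshal(n, x):
    """Encode an int and a double in network byte order ("!")."""
    return struct.pack("!id", n, x)

def unmarshal(message):
    """Decode a message produced by marshal()."""
    return struct.unpack("!id", message)

if __name__ == "__main__":
    print(little.hex(), big.hex(), hex(misread))
    print(unmarshal(marshal(5, 2.5)))
```

When both sides use the same format string, the values survive the round trip regardless of either machine's native byte order.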
Examples of Serializable in Java
• Sharing object through binary file:
DistributedSystems\4.Communication\PersonSerialize.java
• Sending object over network:
DistributedSystems\4.Communication\Message.java
DistributedSystems\4.Communication\MessageServer.java
DistributedSystems\4.Communication\MessageClient.java
Reference code from https://gist.github.com/chatton/14110d2550126b12c0254501dde73616
RPC Parameter Passing (2)
• Some assumptions:
• Copy in/copy out semantics: while procedure is executed, nothing can
be assumed about parameter values.
• All data that is to be operated on is passed by parameters. Excludes
passing references to (global) data.
DistributedSystems\4.Communication\rpc-dblist_param_marshalling.py
Parameter Passing in Object-Based Systems
RPC-based Application Support
• Stub Generation
void someFunction(char x; float y; int z[5])
• Try to get rid of the strict request-reply behavior, but let the client
continue without waiting for an answer from the server. Two cases:
• Client is not expecting any return value
• Client is expecting a return value but is not willing to wait for the result
• Deferred RPC: The client sends requests to the RPC server and continues
upon receiving an ack. The server sends the result when ready. The client's
callback function executes to process the results.
• One-way RPC: After the client sends a request, it does not wait for an ack.
Causes problems over unreliable network connections, because the client
cannot tell whether the request arrived.
Implement asynchronous RPC yourself in the language of your choice
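One possible starting point for this exercise, sketched with Python's `concurrent.futures`. A thread pool stands in for the remote server, `slow_square` is a hypothetical remote procedure, and the ack step of deferred RPC is not modeled; a real solution would send the request over a socket.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    """Hypothetical remote procedure; the sleep stands in for server work."""
    time.sleep(0.1)
    return x * x

class AsyncRpcClient:
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)

    def call_deferred(self, proc, args, callback):
        """Deferred RPC: return at once; callback(result) runs when ready."""
        future = self._pool.submit(proc, *args)
        future.add_done_callback(lambda f: callback(f.result()))
        return future

    def call_one_way(self, proc, args):
        """One-way RPC: fire and forget; no result is awaited."""
        self._pool.submit(proc, *args)

    def close(self):
        """Wait for outstanding calls (and their callbacks) to finish."""
        self._pool.shutdown(wait=True)

results = []
client = AsyncRpcClient()
client.call_deferred(slow_square, (7,), results.append)
print("client continues while the call is in flight")
client.close()
print(results)
```

The same client can issue `call_deferred` to several servers and combine the callback results, which is the idea behind multiple RPC.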
Multiple RPC
• Client sends requests to a group of
servers using deferred RPC or
one-way RPC
• Executes a callback for each
request upon receiving the
result from each server
• Assimilates the results to produce
the final result
Example: Distributed Computing
Environment (DCE) RPC
• Developed by the Open Software Foundation (OSF)
• The basis for Microsoft's distributed computing environment (DCOM)
• Used in Samba – a file server that makes files accessible between Windows
and non-Windows file systems
• Uses RPC protocol suite
Writing client and server in DCE RPC
Client to Server Binding (DCE)
Issues
(1) Client must locate server machine, and (2) locate the server
(i.e. server process).
Message Oriented Communication
Simple transient messaging with sockets (Socket Programming)
Advanced transient messaging (MPI)
Message-oriented persistent communication
Example: IBM’s WebSphere message-queuing system
Example: Advanced Message Queuing Protocol (AMQP)
How does a specific application on the client connect to a specific
application on the server?
• In this example, the server hosts 3 different applications: a
database server, an email (SMTP) server, and a web (HTTP) server.
• How will a client connect through the database UI to the
database server, and from the browser to the email
and web servers?
Image Source: https://aviadezra.blogspot.com/2008/07/code-sample-sending-typed-serialized.html
Hardware and Software Organization of Socket Interface
Internet From a Programmer's View
1. Hosts are mapped to a set of 32-bit IPv4 addresses, e.g. 128.2.203.179
2. The set of IP addresses is mapped to a set of identifiers called Internet domain names.
• 104.238.110.159 is mapped to www.daiict.ac.in
$ ping www.daiict.ac.in
PING www.daiict.ac.in (104.238.110.159) 56(84) bytes of data.
64 bytes from ip-104-238-110-159.ip.secureserver.net (104.238.110.159): icmp_seq=1 ttl=57 time=349 ms
^C
--- www.daiict.ac.in ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 349.388/349.388/349.388/0.000 ms
You can also use https://www.whatismyip.com/dns-lookup/ to get IP address mapped to domain
name.
3. A process on one Internet host can communicate with a process on another
Internet host over an Internet connection
(1) IP Addresses
• 32-bit IP addresses are stored in an IP address struct
• IP addresses are always stored in memory in network byte order (big-endian byte order)
• True in general for any integer transferred in a packet header from one machine to another.
• E.g., the port number used to identify an Internet connection.
/* Internet address structure */
struct in_addr {
uint32_t s_addr; /* network byte order (big-endian) */
};
• By convention, each byte in a 32-bit IP address is represented by its decimal value and
separated by a period
• IP address: 0x8002C2F2 = 128.2.194.242 (dotted decimal format)
Byte layout of 0x8002C2F2 at memory addresses 1000 to 1003:
Address        1000  1001  1002  1003
Big-endian     0x80  0x02  0xC2  0xF2   (MSB first)
Little-endian  0xF2  0xC2  0x02  0x80   (LSB first)
• Use getaddrinfo and getnameinfo functions to convert between IP addresses and
dotted decimal format.
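The same conversion can be checked from Python. This sketch uses `socket.inet_ntoa` and `socket.inet_aton` from the standard library (the Python counterparts of the C-level conversion, not the getaddrinfo/getnameinfo calls named on the slide) together with the slide's example address.

```python
import socket

addr = 0x8002C2F2

network_order = addr.to_bytes(4, "big")      # byte order used on the wire
little_endian = addr.to_bytes(4, "little")   # layout on a little-endian host

dotted = socket.inet_ntoa(network_order)     # packed bytes -> dotted decimal
packed = socket.inet_aton("128.2.194.242")   # dotted decimal -> packed bytes

if __name__ == "__main__":
    print(network_order.hex(), little_endian.hex())
    print(dotted)
```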
(2) Internet Domain Names
• The Internet maintains a mapping
between IP addresses and
domain names in a huge
worldwide distributed database
called DNS
• Conceptually, programmers can
view the DNS database as a
collection of millions of host
entries.
• Each host entry defines the mapping
between a set of domain names and
IP addresses.
• In a mathematical sense, a host entry
is an equivalence class of domain
names and IP addresses.
Figure (domain name hierarchy): top-level domain names (ac, …, in, be);
2nd-level domain names (co, org, …, ac, edu, gov, …);
3rd-level domain names (daiict)
Properties of DNS Mappings
• Can explore properties of DNS mappings using nslookup
• Each host has a locally defined domain name = localhost and IP = 127.0.0.1
• Simple case: one-to-one mapping between domain name and IP address:
$ nslookup www.daiict.ac.in
Address: 20.198.80.43
• Multiple domain names mapped to the same IP address:
nslookup cs.mit.edu and nslookup eecs.mit.edu both return the same IP address
Address: 18.25.0.23
• Same domain name mapped to multiple IP addresses
$ nslookup www.netflix.com
Address: 3.251.50.149
Address: 54.74.73.31
Address: 54.155.178.5
(3) Internet Connections
• Clients and servers communicate by sending streams of bytes
over connections. Each connection is:
• Point-to-point: connects a pair of processes.
• Full-duplex: data can flow in both directions at the same time.
• Reliable: stream of bytes sent by the source is eventually received by the
destination in the same order it was sent.
• A socket is an endpoint of a connection
• Socket address is an IPaddress:port pair
• A port is a 16-bit integer that identifies a process:
• Ephemeral port: Assigned automatically by client kernel when client makes a
connection request.
• Well-known port: Associated with some service provided by a server (e.g., port 80 is
associated with Web servers)
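The notions of socket address and ephemeral port can be observed with a short loopback experiment. This is a sketch under stated assumptions: the function name `run_echo_demo` is made up, the server handles a single connection, and the server port is also kernel-assigned (port 0) rather than a well-known port so that the demo needs no privileges.

```python
import socket
import threading

def run_echo_demo(payload=b"hello"):
    # Server socket: binding to port 0 asks the kernel for any free port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    server_addr = server.getsockname()       # (IP address, port) pair

    def serve_once():
        conn, _peer = server.accept()
        with conn:
            data = conn.recv(1024)           # small payload: one recv suffices here
            conn.sendall(data)               # full-duplex: reply on same connection

    worker = threading.Thread(target=serve_once)
    worker.start()

    with socket.create_connection(server_addr) as client:
        client_addr = client.getsockname()   # ephemeral port chosen by the kernel
        client.sendall(payload)
        reply = client.recv(1024)

    worker.join()
    server.close()
    return reply, server_addr, client_addr

if __name__ == "__main__":
    reply, server_addr, client_addr = run_echo_demo()
    print(reply, server_addr, client_addr)
```

The connection is identified by the pair of socket addresses printed at the end; running the demo twice typically shows a different client port each time.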
Well-known Ports and Service
Names
• Popular services have permanently assigned well-known ports
and corresponding well-known service names:
• echo server: 7/echo
• ssh servers: 22/ssh
• email server: 25/smtp
• Web servers: 80/http
• File Transfer Protocol server: 21/ftp
• Each addrinfo struct returned by getaddrinfo contains arguments that can be passed directly to socket function.
• Also points to a socket address struct that can be passed directly to connect and bind
functions.
Host and Service Conversion: getnameinfo()
• getnameinfo converts a socket address to the corresponding
host (name or IP) and service (service name or port).
• Replaces obsolete gethostbyaddr and getservbyport funcs.
• Reentrant and protocol independent.
int getnameinfo(
    const struct sockaddr *sa, socklen_t salen, /* In: socket addr */
    char *host, socklen_t hostlen,              /* Out: host */
    char *serv, socklen_t servlen,              /* Out: service */
    int flags);                                 /* optional flags */
flags = NI_NUMERICHOST | NI_NUMERICSERV;
/* Display address string instead of domain name and
   port number instead of service name */
getaddrinfo() example
DistributedSystems\4.Communication\hostinfo.c
If we disable the line #define IPv4 1, it will provide IPv4 as well as IPv6 addresses
$ ./hostinfo.out www.google.com
172.217.163.100
2404:6800:4007:811::2004
If we enable the line #define IPv4 1, it will provide IPv4 addresses only
$ ./hostinfo.out www.google.com
172.217.163.100
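Python's `socket` module wraps the same two calls, so the behaviour of the C program can be approximated in a few lines. This is not a translation of hostinfo.c; the helper names `host_addresses` and `numeric_name` are illustrative, and the `ipv4_only` flag plays the role of the `#define IPv4 1` switch.

```python
import socket

def host_addresses(host, ipv4_only=False):
    """Return the distinct addresses getaddrinfo reports for host."""
    family = socket.AF_INET if ipv4_only else socket.AF_UNSPEC
    infos = socket.getaddrinfo(host, None, family=family, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses

def numeric_name(ip, port):
    """getnameinfo with numeric flags: no reverse DNS or service lookup."""
    flags = socket.NI_NUMERICHOST | socket.NI_NUMERICSERV
    return socket.getnameinfo((ip, port), flags)

if __name__ == "__main__":
    print(host_addresses("127.0.0.1", ipv4_only=True))
    print(numeric_name("127.0.0.1", 80))
```

Calling `host_addresses("www.google.com")` on a networked machine would list both IPv4 and IPv6 addresses; the exact values depend on the resolver and change over time.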
Transient Messaging Through Socket
Interface
• Install in Ubuntu:
• sudo apt-get update && sudo apt-get install infiniband-diags ibverbs-utils \
libibverbs-dev libfabric1 libfabric-dev libpsm2-dev -y
• sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev libgtk2.0-dev
• sudo apt-get install librdmacm-dev libpsm2-dev
MPI: Hello World
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
DistributedSystems\4.Communication\mpi_hello.c
How to Compile and Run
• mpicc - compiler for MPI
mpicc mpi_hello.c -o mpi_hello.out
• mpiexec or mpirun for executing the code
mpiexec -np 4 ./mpi_hello.out runs 4 processes
Operation    Description
MPI_Bcast    Broadcasts a message from the process with rank root to all other processes of the group.
             Example: DistributedSystems\4.Communication\mpi_Broadcast.c
MPI_Reduce   Reduces values on all processes within a group.
             Example: DistributedSystems\4.Communication\mpi_reduce.c
MPI_Scatter  Sends data from one task to all tasks in a group.
             Example: DistributedSystems\4.Communication\mpi_scatter.c
MPI_Gather   Gathers values from a group of processes.
             Example: DistributedSystems\4.Communication\mpi_gather.c
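The data movement behind these four collectives can be pictured without MPI. The sketch below is plain Python that simulates the ranks as list positions; it does not call any MPI library, and the function names are illustrative stand-ins, not MPI bindings.

```python
import operator
from functools import reduce

def bcast(value, size):
    """Every rank ends up holding the root's value."""
    return [value] * size

def scatter(chunks, size):
    """Rank i receives chunks[i] from the root."""
    if len(chunks) != size:
        raise ValueError("need exactly one chunk per rank")
    return list(chunks)

def gather(per_rank):
    """The root collects one value per rank, ordered by rank."""
    return list(per_rank)

def reduce_to_root(per_rank, op=operator.add):
    """The root combines one value per rank with op (addition by default)."""
    return reduce(op, per_rank)

size = 4
factor = bcast(10, size)                  # every rank now holds 10
parts = scatter([1, 2, 3, 4], size)       # rank r holds parts[r]
local = [factor[r] * parts[r] for r in range(size)]   # per-rank work

if __name__ == "__main__":
    print(gather(local))
    print(reduce_to_root(local))
```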
Message Queuing (MQ) or Message-Oriented Middleware (MOM)
• Asynchronous persistent communication through the support of
middleware-level queues. Queues correspond to buffers at
communication servers.
Basic Operations
Operation  Description
put        Append a message to a specified queue
get        Block until the specified queue is nonempty, and remove the first message
poll       Check a specified queue for messages, and remove the first. Never block
notify     Install a handler to be called when a message is put into the specified queue
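The four operations map naturally onto a small class. This is an in-memory sketch only: the class name `MessageQueue` is made up, persistence is not modeled, and the choice that a notified message also stays in the queue is an assumption.

```python
import threading
from collections import deque

class MessageQueue:
    def __init__(self):
        self._messages = deque()
        self._cond = threading.Condition()
        self._handlers = []

    def put(self, message):
        """Append a message and call any installed handlers."""
        with self._cond:
            self._messages.append(message)
            handlers = list(self._handlers)
            self._cond.notify()              # wake one blocked get()
        for handler in handlers:             # run handlers outside the lock
            handler(message)

    def get(self):
        """Block until the queue is nonempty, then remove the first message."""
        with self._cond:
            while not self._messages:
                self._cond.wait()
            return self._messages.popleft()

    def poll(self):
        """Remove and return the first message, or None; never blocks."""
        with self._cond:
            return self._messages.popleft() if self._messages else None

    def notify(self, handler):
        """Install a handler that is called on every later put()."""
        with self._cond:
            self._handlers.append(handler)

q = MessageQueue()
seen = []
q.notify(seen.append)
q.put("a")
q.put("b")
first = q.get()
second = q.poll()
third = q.poll()
```

Because sender and receiver only touch the queue, neither has to be running when the other acts, which is what distinguishes this style from transient socket messaging.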
General Model
• Queue managers: Queues are managed by queue managers. An
Attribute          Description
Transport type     Determines the transport protocol to be used
FIFO delivery      Indicates that messages are to be delivered in the order they are sent
Message length     Maximum length of a single message
Setup retry count  Specifies maximum number of retries to start up the remote MCA
Delivery retries   Maximum times MCA will try to put received message into queue
IBM's WebSphere MQ: Routing
• By using logical names, in combination with name resolution to local
queues, it is possible to put a message in a remote queue
Multicast Communication
• Application-level tree-based multicasting
• Flooding-based multicasting
• Gossip-based data dissemination
Application-level Multicasting
• Organize nodes of a distributed system into an overlay network
and use that network to disseminate data:
• Oftentimes a tree, leading to unique paths
• Alternatively, also mesh networks, requiring a form of routing
Application-level Multicasting in Chord
• Initiator generates a multicast identifier mid.
• Lookup succ(mid), the node responsible for mid.
• Request is routed to succ(mid), which will become the root.
• If P wants to join, it sends a join request to the root.
• When request arrives at Q:
• Q has not seen a join request before ⇒ it becomes forwarder; P
becomes child of Q. Join request continues to be forwarded.
• Q knows about tree ⇒ P becomes child of Q. No need to forward join
request anymore.
ALM: Some Costs
• Different metrics:
• Link stress: How often does an ALM message cross the same physical link?
Example: message from A to D needs to cross (Ra, Rb) twice.
• Stretch: Ratio in delay between ALM-level path and network-level path.
Example: messages B to C follow path of length 73 at ALM, but 47 at
network level ⇒ stretch = 73/47 ≈ 1.55.
Flooding
• P simply sends a message m to
each of its neighbors. Each
neighbor will forward that
message, except to P, and
only if
it had not seen m before.
• Performance: The more edges,
the more expensive!
Figure: The size of a random overlay as function of the number of nodes
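The cost claim can be made concrete by counting messages. The sketch below floods a message through a small overlay given as an adjacency list; the function name `flood` and both example graphs are assumptions made for illustration.

```python
from collections import deque

def flood(graph, source):
    """Flood a message from source; return (reached nodes, messages sent)."""
    seen = {source}
    messages = 0
    pending = deque([(source, None)])    # (node, neighbor it received m from)
    while pending:
        node, sender = pending.popleft()
        for neighbor in graph[node]:
            if neighbor == sender:
                continue                 # never send m back to where it came from
            messages += 1
            if neighbor not in seen:     # first time this neighbor sees m
                seen.add(neighbor)
                pending.append((neighbor, node))
    return seen, messages

# A ring of four nodes, and the same ring with one extra edge A-C.
ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
ring_plus_chord = {"A": ["B", "D", "C"], "B": ["A", "C"],
                   "C": ["B", "D", "A"], "D": ["C", "A"]}

if __name__ == "__main__":
    print(flood(ring, "A")[1], flood(ring_plus_chord, "A")[1])
```

Every node is reached in both overlays, but the extra edge raises the message count from 5 to 7, since each node forwards once to all neighbors except the one it heard from.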
Variation
Let Q forward a message with a certain
probability pflood , possibly even dependent on
Epidemic protocols
• Assume there are no write–write conflicts
• Update operations are performed at a single server
• A replica passes updated state to only a few neighbors
• Update propagation is lazy, i.e., not immediate
• Eventually, each update should reach every replica
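Analysis: staying ignorant
• Let $p_i$ be the probability that a node is still ignorant of the update after the ith round.
• With pull, $p_{i+1} = (p_i)^2$: the node was not updated during the ith
round and contacts another ignorant node during round i+1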
Formal Analysis
Rumor Spreading: The effect of
stopping
Note
If we really have to ensure that all servers are eventually updated, rumor
spreading alone is not enough
1/pstop   s          Ns
1         0.203188   2032
2         0.059520   595
3         0.019827   198
4         0.006977   70
5         0.002516   25
6         0.000918   9
7         0.000336   3
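The table values can be reproduced numerically. The sketch assumes the standard result for rumor spreading, $s = e^{-(1/p_{stop} + 1)(1 - s)}$, where s is the fraction of servers that stay ignorant; that formula is not shown in the text above. It also assumes N = 10,000 servers, inferred from the ratio Ns/s in the table. The function name `ignorant_fraction` is illustrative.

```python
import math

def ignorant_fraction(k, iterations=200):
    """Solve s = exp(-(k + 1) * (1 - s)) by fixed-point iteration, k = 1/pstop.

    Starting from s = 0 converges to the non-trivial solution (s = 1 is
    also a fixed point, but it corresponds to no update ever spreading).
    """
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1.0 - s))
    return s

N = 10_000   # assumed number of servers

if __name__ == "__main__":
    for k in range(1, 8):
        s = ignorant_fraction(k)
        print(k, f"{s:.6f}", round(N * s))
```

Even with 1/pstop = 7 a few servers remain ignorant, which is the reason for the note that rumor spreading alone is not enough.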
Deleting Values
• Fundamental problem: We cannot remove an old value from a server
and expect the removal to propagate. Instead, mere removal will
be undone in due time using epidemic algorithms
• Solution: Removal has to be registered as a special update by
inserting a death certificate
Deleting Values
• When to remove a death certificate (it is not allowed to stay forever):
• Run a global algorithm to detect whether the removal is known everywhere,
and then collect the death certificates (looks like garbage collection)
• Assume death certificates propagate in finite time, and associate a maximum
lifetime for a certificate (can be done at risk of not reaching all servers)