Low 201108 TCP Cambridge
Steven Low
netlab.CALTECH.edu
Cambridge 2011
Goal of tutorial
Top-down summary of congestion
control on the Internet
Introduction to mathematical models
of congestion control
Illustration of theory-guided CC
algorithm design
Theory-guided design
Tight integration of theory, design, experiment
Analysis done at design time, not after
Theory-guided design
Integration of theory, design, experiment
can be very powerful
Each needs the other
Combination much more than sum
Agenda
9:00 Congestion control protocols
10:00 break
10:15 Mathematical models
11:15 break
11:30 Advanced topics
12:30 lunch
CONGESTION CONTROL
PROTOCOLS
Congestion collapse
October 1986, the first congestion collapse on the
Internet was detected
Link between UC Berkeley and LBL
400 yards, 3 hops, 32 Kbps
throughput dropped to 40 bps
factor of ~1000 drop!
WHY?
[Figure: throughput vs. offered load]
Network milestones
1969 ARPANet (backbone speed: 50-56 kbps)
1974 TCP
1981 TCP/IP; 1983 cutover to TCP/IP
1988 Tahoe; NSFNet T1
1991 HTTP; NSFNet T3
1996 MCI OC12
1999 vBNS OC48
2003-2006 Abilene OC192
The network is exploding
Application milestones
[Timeline, overlaid on the network milestones: Network Mail (1971), Telnet, File Transfer (1972-73) — simple applications; Internet Talk Radio, Whitehouse online (1993), Internet Phone (1995); Napster music, AT&T VoIP, iTunes video, YouTube (2004-2006) — diverse & demanding applications]
The first network email was sent by Ray Tomlinson in 1971 between two computers at BBN connected by the ARPANet.
Telephony
Music
Friends
Games
Cloud computing
Congestion collapse
[Timeline: the congestion collapse detected at LBL (1986) falls between the cutover to TCP/IP (1983) and TCP Tahoe (1988)]
Flow control: prevent overwhelming the receiver
+ Congestion control: prevent overwhelming the network
Transport milestones
1969 ARPANet
1974 TCP
1983 TCP/IP
1988 Tahoe; DECNet AIMD
1994 Vegas (delay-based)
1996 p formula
1998 NUM
2000 reverse engineer TCP
2006 systematic design of TCPs
Packet networks
Packet-switched as opposed to circuit-switched
No dedicated resources
Simple & robust: state carried in packets
Network mechanisms
Transmit bits across a link
encoding/decoding, modulation/demodulation, synchronization
Medium access
who transmits, when, and for how long
Routing
choose a path from source to destination
Loss recovery
recover packets lost to congestion, errors, interference
Flow/congestion control
efficient use of bandwidth/buffers without
overwhelming the receiver/network
Protocol stack
Network mechanisms implemented as a protocol stack
Each layer designed separately, evolves asynchronously
[Hourglass figure: applications (Search, News, Video, Audio, Friends) over TCP over IP over diverse link technologies (Ethernet, 802.11, 3G/4G, ATM, Optical, Satellite, Bluetooth); stack layers: application, transport, network (routing, IP), link, physical]
IP layer
Routing from source to destination
Distributed computation of routing decisions
Implemented as routing table at each router
Shortest-path (Dijkstra) algorithm within an
autonomous system
BGP across autonomous systems
Datagram service
Best effort
Unreliable: packets may be lost, corrupted, or reordered
TCP layer
End-to-end reliable byte stream
On top of unreliable datagram service
Correct, in-order, without loss or duplication
Congestion control
Source-based distributed control
[Figure: TCP and UDP sit above IP, alongside ICMP and ARP; an Ethernet frame carries an IP header (20 bytes) plus IP data, the IP data is a TCP segment (20-byte TCP header), and the frame ends with a 4-byte trailer]
Early TCP
Pre-1988
Go-back-N ARQ
Detects loss from timeout
Retransmits from lost packet onward
Self-clocking
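As a concrete sketch of this pre-1988 behavior (a hypothetical minimal sender, not the actual BSD code; `send` and `recv_ack` are assumed callbacks):

```python
# Minimal Go-back-N sender sketch. Keep a window of N unacked packets;
# a timeout implies loss, and everything from the oldest unacknowledged
# packet onward is retransmitted.

def go_back_n_send(packets, send, recv_ack, N=4, timeout=1.0):
    base = 0                       # oldest unacknowledged packet
    next_seq = 0                   # next packet to transmit
    while base < len(packets):
        while next_seq < len(packets) and next_seq < base + N:
            send(next_seq, packets[next_seq])   # fill the window
            next_seq += 1
        ack = recv_ack(timeout)    # cumulative ACK, or None on timeout
        if ack is None:
            next_seq = base        # go back: resend from oldest unacked
        else:
            base = max(base, ack + 1)
```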
Key references
TCP/IP spec
RFC 791 Internet Protocol
RFC 793 Transmission Control Protocol
AIMD idea: Chiu, Jain, Ramakrishnan 1988-90
Tahoe/Reno: Jacobson 1988
Vegas: Brakmo and Peterson 1995
FAST: Jin, Wei, Low 2004
CUBIC: Ha, Rhee, Xu 2008
CTCP: Tan et al 2006
RED: Floyd and Jacobson 1993
REM: Athuraliya, Low, Li, Yin 2001
There are many, many other proposals and references
TCP Tahoe (Jacobson 1988) and TCP Reno (Jacobson 1990)
[Figure: window trajectories over time for Tahoe and Reno, alternating slow start (SS) and congestion avoidance (CA)]
TCP CC variants
Differ mainly in Congestion Avoidance
Vegas: delay-based
FAST: delay-based, scalable
CUBIC: time since last congestion
CTCP: uses both loss & delay
[State diagram: slow start → congestion avoidance; dupACKs trigger fast retransmit/fast recovery (FR/FR); timeout triggers retransmit and slow start]
Congestion avoidance
Reno (Jacobson 1988): AIMD — additive increase (AI), multiplicative decrease (MD)
Vegas (Brakmo & Peterson 1995): delay-based
Congestion avoidance
FAST (Jin, Wei, Low 2004)
periodically: $W \leftarrow \frac{\text{baseRTT}}{\text{RTT}}\, W + \alpha$
Feedback control
[Block diagram: sources set rates $x_i(t)$ via TCP (Reno, Vegas, FAST); links feed back congestion measures $p_l(t)$ via AQM (DropTail, RED, REM/PI, AVQ)]
Implicit feedback
Drop-tail
FIFO queue
Drop packet that arrives at a full buffer
Implicit feedback
Queueing process implicitly computes and
feeds back congestion measure
Delay: simple dynamics
Loss: no convenient model
RED
[Figure: marking probability as a function of average queue (RED) and of link congestion measure (REM)]
REM
Clear buffer and match rate:
$p_l(t+1) = \left[\, p_l(t) + \gamma\big(\alpha_l b_l(t) + x_l(t) - c_l\big) \right]^+$
clear buffer: $\alpha_l b_l(t)$;  match rate: $x_l(t) - c_l$
Sum prices: end-to-end marking probability $1 - \phi^{-p^s(t)}$, where $p^s(t)$ is the sum of link prices along the path
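A small sketch of the REM update above (the parameter values for $\gamma$, $\alpha_l$, $\phi$ are illustrative assumptions):

```python
# REM link algorithm sketch: price update and marking probability.
# b = backlog (pkts), x = aggregate input rate, c = link capacity.

def rem_price(p, b, x, c, gamma=0.001, alpha=0.1):
    # price rises to clear backlog (alpha*b) and to match rate (x - c);
    # the [.]^+ projection keeps the price nonnegative
    return max(0.0, p + gamma * (alpha * b + x - c))

def rem_mark(p, phi=1.1):
    # marking probability 1 - phi^(-p); along a path the end-to-end
    # marking probability is 1 - phi^(-sum of link prices)
    return 1.0 - phi ** (-p)
```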
Summary: CC protocols
End-to-end CC implemented in TCP
Basic window mechanism
TCP performs connection setup, error recovery,
and congestion control
CC dynamically computes cwnd, which limits the max
#pkts en route
Agenda
9:00 Congestion control protocols
10:00 break
10:15 Mathematical models
11:15 break
11:30 Advanced topics
12:30 lunch
MATHEMATICAL
MODELS
Mathematical models
Why mathematical models?
Dynamical systems model of CC
Convex optimization primer
Reverse engr: equilibrium properties
Forward engr: FAST TCP
Refines intuition
Guides design
Suggests ideas
Explores boundaries
Understands structural properties
Risk
"All models are wrong, but some are useful"
Validate with simulations & experiments
Structural properties
Equilibrium properties
Throughput, delay, loss, fairness
Dynamic properties
Stability
Robustness
Responsiveness
Scalability properties
Information scaling (decentralization)
Computation scaling
Performance scaling
Limitations of the basic model:
Homogeneous protocols: all flows use the same congestion measure
Fluid approximation: ignores packet-level effects (e.g. burstiness); inaccurate model of the buffering process
Refinements address these issues, and robustness/responsiveness, to various degrees
Mathematical models
Why mathematical models?
Dynamical systems model of CC
Convex optimization primer
Reverse engr: equilibrium properties
Forward engr: FAST TCP
Network model
Network: links $l$ with capacities $c_l$
Sources $i$ with source rates $x_i(t)$
Routing matrix $R$: $R_{li} = 1$ if source $i$ uses link $l$
Example: $R = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$, so $x_1 + x_2 \le c_1$, $x_1 + x_3 \le c_2$
[Figure: two links with prices $p_1(t)$, $p_2(t)$ shared by flows $x_1(t)$, $x_2(t)$, $x_3(t)$]
Network model
[Block diagram: TCP dynamics $F_1,\dots,F_N$ (Reno, Vegas) map path prices $q$ to rates $x$; the network (IP routing, $R$) maps $x$ to link rates; AQM dynamics $G_1,\dots,G_L$ (DropTail, RED) map link rates to prices $p$]
Examples
Derive (Fi, Gl) model for
Reno/RED
Vegas/Droptail
FAST/Droptail
Model: Reno
for every ACK (CA)  { W += 1/W }
for every loss      { W := W/2 }

$\dot w_i(t) = \underbrace{x_i(t)\,(1 - q_i(t))}_{\text{ACK rate}} \cdot \frac{1}{w_i(t)} \;-\; \underbrace{x_i(t)\, q_i(t)}_{\text{loss rate}} \cdot \frac{w_i(t)}{2}$

$w_i(t)$: window size; $x_i(t)$: throughput; $q_i(t)$: round-trip loss probability, from the link loss probabilities
In terms of rates, using $w_i(t) = x_i(t)\, T_i$ and $q_i(t) \approx 0$:
$x_i(t+1) = x_i(t) + \frac{1}{T_i^2} - \frac{x_i^2(t)}{2}\, q_i(t) \;=:\; F_i\big(x_i(t), q_i(t)\big)$
Model: RED
aggregate link rate: $y_l(t) = \sum_i R_{li}\, x_i(t)$
the queue length integrates excess rate; the marking probability $p_l(t)$ is piecewise-linear and increasing in the (average) queue length
Model: Reno/RED
$x_i(t+1) = F_i\big(x_i(t), q_i(t)\big) = x_i(t) + \frac{1}{T_i^2} - \frac{x_i^2(t)}{2}\, q_i(t)$, with $q_i(t) = \sum_l R_{li}\, p_l(t)$ (for small $p_l$)
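A toy Euler-integrated simulation of this Reno/RED fluid model on a single link (all constants below are illustrative assumptions):

```python
# Fluid model of Reno sources sharing one RED link:
#   dx_i/dt = 1/T_i^2 - (x_i^2 / 2) q(t),  db/dt = sum_i x_i - c  (b >= 0)

def red_prob(b, min_th=10, max_th=40, max_p=0.1):
    # piecewise-linear RED marking probability in queue length b
    if b <= min_th:
        return 0.0
    if b >= max_th:
        return max_p
    return max_p * (b - min_th) / (max_th - min_th)

def simulate(T=(0.05, 0.10), c=200.0, dt=0.001, steps=200000):
    x = [1.0 for _ in T]            # source rates (pkts/s)
    b, q = 0.0, 0.0                 # queue (pkts), marking probability
    for _ in range(steps):
        x = [xi + dt * (1.0 / Ti**2 - 0.5 * xi**2 * q)
             for xi, Ti in zip(x, T)]
        b = max(0.0, b + dt * (sum(x) - c))
        q = red_prob(b)
    return x, b, q

print(simulate())
```

The rates settle near $x_i = \frac{1}{T_i}\sqrt{2/q}$: the flow with half the RTT gets roughly twice the rate, previewing the $1/(T_i\sqrt{q_i})$ throughput law derived later.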
Decentralization structure
[Block diagram: each source $i$ reacts only to its own path price $q_i = \sum_l R_{li}\, p_l$ (i.e. $q = R^T p$); each link $l$ reacts only to its own aggregate rate $y_l = \sum_i R_{li}\, x_i$ (i.e. $y = Rx$)]
Validation: Reno/REM
[Figure: queue traces under three AQMs]
REM: p = Lagrange multiplier!
DropTail: queue = 94% full
RED (min_th = 10 pkts, max_th = 40 pkts, max_p = 0.1): p increasing in queue!
Model: Vegas/Droptail
for every RTT
{ if W/RTTmin − W/RTT < α then W++
  if W/RTTmin − W/RTT > α then W−− }
for every loss
  W := W/2

$F_i$:
$x_i(t+1) = x_i(t) + \frac{1}{T_i^2(t)}$  if $w_i(t) - d_i x_i(t) < \alpha_i d_i$
$x_i(t+1) = x_i(t) - \frac{1}{T_i^2(t)}$  if $w_i(t) - d_i x_i(t) > \alpha_i d_i$
$x_i(t+1) = x_i(t)$  else
$G_l$: price = queueing delay, from queue size; $T_i(t) = d_i + q_i(t)$
Model: FAST/Droptail
periodically: $W \leftarrow \frac{\text{baseRTT}}{\text{RTT}}\, W + \alpha$

$F_i$: $x_i(t+1) = x_i(t) + \frac{\gamma_i}{T_i(t)}\big(\alpha_i - x_i(t)\, q_i(t)\big)$
$G_l$: $p_l(t+1) = \left[\, p_l(t) + \frac{y_l(t) - c_l}{c_l} \right]^+$
More accurate queue model (ACK clocking) [Jacobsson et al 2009]: $\sum_i \frac{w_i(t - \tau_i^f)}{d_i + p(t)} + x_0(t) = c$ (shown here for same RTT, no cross traffic $x_0$)
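A discrete-time sketch of this FAST/DropTail loop on a single link ($\gamma$, $\alpha_i$, $d_i$, $c$ below are illustrative assumptions):

```python
# FAST sources over one DropTail link; the price p is queueing delay (s):
#   x_i <- x_i + (gamma/T_i)(alpha_i - x_i p),   p <- [p + (y - c)/c]^+

def simulate(d=(0.01, 0.05), alpha=(50.0, 50.0), c=1000.0,
             gamma=0.05, steps=20000):
    x = [100.0 for _ in d]          # rates (pkts/s)
    p = 0.0                         # queueing delay (s)
    for _ in range(steps):
        T = [di + p for di in d]    # RTT = propagation + queueing delay
        x = [max(0.0, xi + gamma / Ti * (ai - xi * p))
             for xi, Ti, ai in zip(x, T, alpha)]
        p = max(0.0, p + (sum(x) - c) / c)
    return x, p

print(simulate())
```

With these numbers the rates converge near $x_i = \alpha_i / p$ with $p = \sum_i \alpha_i / c = 0.1$ s: the equilibrium is independent of the propagation delays $d_i$.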
Recap
Protocol (Reno, Vegas, FAST, Droptail, RED)
Equilibrium
Performance
Throughput, loss, delay
Fairness
Utility
Dynamics
Local stability
Global stability
Mathematical models
Why mathematical models?
Dynamical systems model of CC
Convex optimization primer
Reverse engr: equilibrium properties
Forward engr: FAST TCP
Background: optimization
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c$
Background: optimization
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c$
Theorem
An optimal solution $x^*$ exists
It is unique if the $U_i$ are strictly concave
[Figure: strictly concave vs. not strictly concave utility functions]
Background: optimization
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c$
Theorem
There exists a price vector $p^* \ge 0$ (Lagrange multipliers) such that
$U_i'(x_i^*) = q_i^* := \sum_l R_{li}\, p_l^*$
$y_l^* := \sum_i R_{li}\, x_i^* \le c_l$, with equality if $p_l^* > 0$
(complementary slackness: all bottlenecks are fully utilized)
Background: optimization
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c$
Theorem
With path prices $q_i^* = \sum_l R_{li}\, p_l^*$, individual optimality aligns with social optimality: the allocation is incentive compatible
Background: optimization
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c$
Theorem
The gradient descent algorithm for the dual problem is decentralized:
$x_i(t) = U_i'^{-1}\big(q_i(t)\big)$
$p_l(t+1) = \left[\, p_l(t) + \gamma\big(y_l(t) - c_l\big) \right]^+$
This is the basis for reverse/forward engineering TCP
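To make the decentralization concrete, here is a sketch of the dual algorithm on the earlier two-link, three-flow example, with $U_i = \log x_i$ so that $x_i = U_i'^{-1}(q_i) = 1/q_i$ (the step size gamma and the capacities are assumptions):

```python
# Decentralized dual gradient algorithm for the NUM problem above.

R = [[1, 1, 0],   # link 1 carries flows 1, 2
     [1, 0, 1]]   # link 2 carries flows 1, 3
c = [1.0, 2.0]    # link capacities
p = [1.0, 1.0]    # link prices (Lagrange multipliers)
gamma = 0.1

for _ in range(2000):
    # source algorithm: react only to own path price q_i = sum_l R_li p_l
    q = [sum(R[l][i] * p[l] for l in range(2)) for i in range(3)]
    x = [1.0 / qi for qi in q]
    # link algorithm: react only to own aggregate rate y_l
    y = [sum(R[l][i] * x[i] for i in range(3)) for l in range(2)]
    p = [max(0.0, p[l] + gamma * (y[l] - c[l])) for l in range(2)]

print(x, p)
```

It converges to the proportionally fair rates, with both links fully utilized ($x_1 + x_2 = c_1$, $x_1 + x_3 = c_2$).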
Mathematical models
Why mathematical models?
Dynamical systems model of CC
Convex optimization primer
Reverse engr: equilibrium properties
Forward engr: FAST TCP
Equilibrium: $x^* = F(x^*, R^T p^*)$, $p^* = G(Rx^*, p^*)$
Uniqueness of equilibrium
$x^*$ is unique when the $U_i$ are strictly concave
$p^*$ is unique when $R$ has full row rank
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c$

$U_i(x_i) = \begin{cases} (1-\alpha)^{-1} x_i^{1-\alpha} & \text{if } \alpha \ne 1 \\ \log x_i & \text{if } \alpha = 1 \end{cases}$

$\alpha = 0$: maximum throughput
$\alpha = 1$: proportional fairness
$\alpha = 2$: min-delay fairness
$\alpha \to \infty$: max-min fairness
[Low 2003]
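A small worked comparison on the earlier two-link example ($c = (1, 2)$ as in the sketch above, an assumption; flow 1 traverses both links) illustrates the fairness spectrum:

```latex
% alpha = 0 (max throughput): shut off the two-link flow entirely
x = (0,\, 1,\, 2), \qquad \textstyle\sum_i x_i = 3
% alpha = 1 (proportional fairness): solves max sum_i log x_i
x \approx (0.42,\, 0.58,\, 1.58), \qquad \textstyle\sum_i x_i \approx 2.58
% alpha -> infinity (max-min): equalize the worst-off flows
x = (0.5,\, 0.5,\, 1.5), \qquad \textstyle\sum_i x_i = 2.5
```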
Some implications
Equilibrium
Always exists; unique if R has full row rank
Bandwidth allocation independent of AQM or
arrival
Can predict macroscopic behavior of large scale
networks
Equilibrium throughput
Reno: $x_i = \dfrac{1}{T_i}\sqrt{\dfrac{1.5}{q_i}} \;\propto\; \dfrac{1}{T_i\, q_i^{0.5}}$
HSTCP: $x_i \propto \dfrac{1}{T_i\, q_i^{0.84}}$
Vegas, FAST: $x_i = \dfrac{\alpha_i}{q_i}$
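A quick numerical reading of these formulas (the loss/delay values are illustrative; note that $q_i$ is a loss probability for Reno/HSTCP but a queueing delay for Vegas/FAST):

```python
from math import sqrt

def x_reno(T, q):         # x_i = (1/T_i) * sqrt(1.5 / q_i)
    return sqrt(1.5 / q) / T

def x_fast(alpha, q):     # x_i = alpha_i / q_i
    return alpha / q

q_loss = 0.01
for T in (0.01, 0.1):               # 10 ms vs 100 ms RTT
    print(T, x_reno(T, q_loss))     # 10x the RTT -> 1/10th the rate

print(x_fast(50.0, 0.005))          # no RTT in the formula at all
```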
Consequences
Excessive backlog
Unfairness to older sources
Theorem
A relative error $\epsilon_s$ in propagation delay estimation distorts the utility function
Validation
Source rates (pkts/ms), measured (theory):
#  src1         src2         src3         src4         src5
1  5.98 (6)
2  2.05 (2)     3.92 (4)
3  0.96 (0.94)  1.46 (1.49)  3.54 (3.57)
4  0.51 (0.50)  0.72 (0.73)  1.34 (1.35)  3.38 (3.39)
5  0.29 (0.29)  0.40 (0.40)  0.68 (0.67)  1.30 (1.30)  3.28 (3.34)

#  queue (pkts)  baseRTT (ms)
1  19.8 (20)     10.18 (10.18)
2  59.0 (60)     13.36 (13.51)
3  127.3 (127)   20.17 (20.28)
4  237.5 (238)   31.50 (31.50)
5  416.3 (416)   49.86 (49.80)
Mathematical models
Why mathematical models?
Dynamical systems model of CC
Convex optimization primer
Reverse engr: equilibrium properties
Forward engr: FAST TCP
Reno design
Reno TCP
Packet level
ACK: $W \leftarrow W + 1/W$
Loss: $W \leftarrow W - 0.5\,W$
Flow level
Equilibrium: $w_i = x_i T_i = \sqrt{1.5/q_i}$ pkts
Dynamics: $\dot w_i(t) = \frac{1}{T_i}\left(1 - \frac{2}{3}\, w_i^2(t)\, q_i(t)\right)$
Reno design
Packet level
Designed and implemented first
Flow level
Understood afterwards
Forward engineering
1. Decide congestion measure
Loss, delay, both
Forward engineering
Tight integration of theory, design, experiment
Performance analysis done at design time
Not after
Reno: AIMD(1, 0.5)
ACK: $W \leftarrow W + \frac{1}{W}$;  Loss: $W \leftarrow W - 0.5\,W$
HSTCP: AIMD($a(w)$, $b(w)$)
ACK: $W \leftarrow W + \frac{a(w)}{W}$;  Loss: $W \leftarrow W - b(w)\,W$
STCP: MIMD($a$, $b$)
ACK: $W \leftarrow W + 0.01$;  Loss: $W \leftarrow W - 0.125\,W$
FAST
RTT: $W \leftarrow \frac{\text{baseRTT}}{\text{RTT}}\, W + \alpha$
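The four packet-level rules side by side, as a sketch (HSTCP's actual $a(w)$, $b(w)$ tables are omitted here and passed in as functions):

```python
# Window update rules for Reno, HSTCP, STCP, FAST
# (per-ACK, per-loss, and, for FAST, once per RTT).

def reno_ack(W):      return W + 1.0 / W        # AIMD(1, 0.5)
def reno_loss(W):     return W - 0.5 * W

def hstcp_ack(W, a):  return W + a(W) / W       # AIMD(a(w), b(w))
def hstcp_loss(W, b): return W - b(W) * W

def stcp_ack(W):      return W + 0.01           # MIMD(a, b)
def stcp_loss(W):     return W - 0.125 * W

def fast_rtt(W, baseRTT, RTT, alpha):           # once per RTT
    return (baseRTT / RTT) * W + alpha
```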
Flow level:
$\dot x_i(t) = \kappa_i(t)\left(1 - \frac{q_i(t)}{U_i'(x_i(t))}\right)$
$\kappa$: control gain;  $U_i'$: flow-level goal (marginal utility)
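As a worked check (a sketch using the simplified Reno model from earlier), Reno fits this common structure:

```latex
\dot x_i \;=\; \frac{1}{T_i^2} - \frac{x_i^2}{2}\, q_i
         \;=\; \underbrace{\frac{1}{T_i^2}}_{\kappa_i}
               \left(1 - \frac{q_i}{U_i'(x_i)}\right),
\qquad U_i'(x_i) = \frac{2}{T_i^2\, x_i^2}
```

so Reno's flow-level goal is a utility with $U_i' = 2/(T_i^2 x_i^2)$, an RTT-weighted version of $\alpha = 2$ (min-delay) fairness.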
[Timeline 2001-2007: FAST TCP theory (NetLab, Lee Center, rsrg, SISL; IPAM workshop), WAN-in-Lab testbed, experiments, deployment]
theory: $\max_{x \ge 0} \sum_i U_i(x_i)$  s.t. $Rx \le c$
$\dot x_i = \frac{\gamma_i}{T_i}\left(\alpha_i - x_i(t) \sum_l R_{li}\, p_l(t)\right)$,  $\dot p_l = \frac{1}{c_l}\left(\sum_i R_{li}\, x_i(t) - c_l\right)$
theory → experiment → testbed → deployment (SC02 demo)
WAN-in-Lab: one-of-a-kind wind-tunnel in academic networking, with 2,400 km of fiber, optical switches, routers, servers, accelerators
[Deployment milestones: SC02 demo, SC 2004, Internet2 Land Speed Record, SuperComputing Bandwidth Challenge]
FAST in a box
[Figure: throughput $x_i$ with FAST vs. without FAST]
Some benefits
Transparent interaction among components (TCP, AQM)
Clear understanding of structural properties
With FAST: throughput 120 Mbps, SF to New York, June 3, 2007
[Figure: average download speed, 8/24-30, 2009, CDN customer (10G appliance): FAST vs. TCP stacks in BSD, Windows, Linux]
Agenda
9:00 Congestion control protocols
10:00 break
10:15 Mathematical models
11:15 break
11:30 Advanced topics
12:30 lunch
ADVANCED
TOPICS
Advanced topics
Heterogeneous protocols
Layering as optimization
decomposition
Some implications
                                   homogeneous    heterogeneous
equilibrium                        unique         ?
bandwidth allocation vs. AQM       independent    ?
bandwidth allocation vs. arrival   independent    ?
FAST throughput, buffer size = 80 pkts
Multiple equilibria: throughput depends on arrival
Dummynet experiment
         eq 1   eq 2
path 1   52M    13M
path 2   61M    13M
path 3   27M    93M
(a third equilibrium, eq 3, is unstable)
[Figure: FAST throughput trajectories converging to eq 1 or eq 2 depending on flow arrival]
Duality model:
$\max_{x \ge 0} \sum_i U_i(x_i)$  s.t. $Rx \le c$
Equilibrium: $x_i^* = F_i\left(\sum_l R_{li}\, p_l^*,\; x_i^*\right)$
e.g. FAST: $F_i = x_i + \frac{\gamma_i}{T_i}\left(\alpha_i - x_i \sum_l R_{li}\, p_l\right)$
e.g. Reno: $F_i = x_i + \frac{1}{T_i^2} - \frac{x_i^2}{2} \sum_l R_{li}\, p_l$
AQM: $p_l(t+1) = g_l\left(p_l(t), \sum_i R_{li}\, x_i(t)\right)$
e.g. $\dot p_l = \frac{1}{c_l}\left(\sum_i R_{li}\, x_i(t) - c_l\right)$ when $p_l(t) > 0$
Homogeneous protocol
$x_i(t+1) = F_i\left(\sum_l R_{li}\, p_l(t),\; x_i(t)\right)$
— the same price $p_l(t)$ for all sources

Heterogeneous protocol
$x_i^j(t+1) = F_i^j\left(\sum_l R_{li}\, m_l^j(p_l(t)),\; x_i^j(t)\right)$
— heterogeneous prices $m_l^j(p_l(t))$ for type-$j$ sources
Heterogeneous protocols
Equilibrium: $p$ that satisfies
$x_i^j(p) = f_i^j\left(\sum_l R_{li}\, m_l^j(p_l)\right)$
$y_l(p) := \sum_{i,j} R_{li}\, x_i^j(p) \le c_l$, with equality if $p_l > 0$
Notation
Simpler notation: $p$ is an equilibrium if $y(p) = c$ on bottleneck links
Jacobian: $J(p) := \frac{\partial y}{\partial p}(p)$
Existence
Theorem
Equilibrium $p$ exists, despite the lack of an
underlying utility maximization
Generally non-unique
There are networks with a unique bottleneck
set but infinitely many equilibria
There are networks with multiple bottleneck
sets, each with a unique (but distinct)
equilibrium
Regular networks
Definition
A regular network is a tuple $(R, c, m, U)$ for which all equilibria $p$ are locally unique, i.e.
$\det J(p) := \det \frac{\partial y}{\partial p}(p) \ne 0$
Theorem
Almost all networks are regular
A regular network has finitely many equilibria, and their number is odd (e.g. 1)
Global uniqueness
Theorem
If price heterogeneity is small, i.e. $m_l^j \in [a_l,\, 2^{1/L} a_l]$ for some $a_l > 0$, or $m_l^j \in [a^j,\, 2^{1/L} a^j]$ for some $a^j > 0$, then the equilibrium is globally unique
Implication
A network of RED routers with slopes inversely proportional to link capacity almost always has a globally unique equilibrium
Local stability
Theorem
If price heterogeneity is small ($m_l^j \in [a_l,\, 2^{1/L} a_l]$ for some $a_l > 0$, or $m_l^j \in [a^j,\, 2^{1/L} a^j]$ for some $a^j > 0$), then the unique equilibrium $p$ is locally stable
If all equilibria $p$ are locally stable, then the equilibrium is globally unique
Linearized dual algorithm: $\dot p = \gamma\, J(p^*)\, p(t)$, stable if $\mathrm{Re}\,\lambda\big(J(p^*)\big) < 0$
Summary
                                   homogeneous    heterogeneous
equilibrium                        unique         non-unique
bandwidth allocation vs. AQM       independent    dependent
bandwidth allocation vs. arrival   independent    dependent
Efficiency
Result
Every equilibrium $p^*$ is Pareto efficient
Proof: every equilibrium $p^*$ yields a (unique) rate vector $x(p^*)$ that solves
$\max_{x \ge 0} \sum_{i,j} \lambda_i^j(p^*)\, U_i^j(x_i^j)$  s.t. $Rx \le c$
Efficiency
Result
Every equilibrium $p^*$ is Pareto efficient
Measure of optimality: $V^* := \max_{x \ge 0} \sum_{i,j} U_i^j(x_i^j)$  s.t. $Rx \le c$
Achieved: $V(p^*) := \sum_{i,j} U_i^j\big(x_i^j(p^*)\big)$
Loss of optimality: $\dfrac{V(p^*)}{V^*} \ge \dfrac{\min_{l,j} m_l^j}{\max_{l,j} m_l^j}$
Intra-protocol fairness
Result
Fairness among flows within each type is unaffected, i.e. still determined by their utility functions and Kelly's problem, with reduced link capacities
Proof idea:
Each equilibrium $p$ chooses a partition of link capacities among types, $c^j := c^j(p)$
Rates $x^j(p)$ then solve
$\max_{x^j \ge 0} \sum_i U_i^j(x_i^j)$  s.t. $R^j x^j \le c^j$
Inter-protocol fairness
Theorem
Any fairness is achievable with a linear scaling of utility functions:
$x^j := \arg\max_{x^j \ge 0} \sum_i \mu_i^j\, U_i^j(x_i^j)$  s.t. $R^j x^j \le c$
with slow-timescale adaptation of the scaling factors:
$x_i^j(t) = f_i^j\left(\frac{q_i^j(t)}{\mu_i^j(t)}\right)$,  $\mu_i^j(t+1) = \beta_i^j\, \mu_i^j(t) + (1 - \beta_i^j)\, \frac{\sum_l m_l^j(p_l(t))}{\sum_l p_l(t)}$
[Figures: FAST throughput, ns2 simulations with buffer = 80 pkts and buffer = 400 pkts]
Advanced topics
Heterogeneous protocols
Layering as optimization
decomposition
Layering as optimization decomposition: generalized NUM
$\max_{x \ge 0} \sum_i U_i(x_i)$  subject to  $Rx \le c(p)$, $x \in X$
IP: routing ($R$);  Link: scheduling;  Phy: power ($p$)
Each layer corresponds to a subproblem; the layers are coordinated by decomposition methods, through functions of primal or dual variables
Examples
Optimal web layer: Zhu, Yu, Doyle 01
HTTP/TCP: Chang, Liu 04
Provides
basic structure of key algorithms
framework to aid protocol design
Lijun Chen, Steven H. Low and John C. Doyle. Computer Networks Journal, Special Issue
Generalized NUM with routing and scheduling:
$\max_{x, f \ge 0} \sum_{(s,d)} U_s^d(x_s^d)$
s.t. flow conservation at every node: for all $i \in N$, $d \in D$, the source rate $x_i^d \ge 0$ (nonzero only if $i \in S$) plus the incoming flow $\sum_j f_{ji}^d$ does not exceed the outgoing flow $\sum_j f_{ij}^d$, with $\{f_{ij}^d\}$ schedulable
Dual decomposition
[Diagram: utility function $U_i^d$ drives congestion control (transmission rate $x_i^d$); local prices $p_i^d$ drive routing (which output queue $d^*$ to service); queue lengths $p_i^k$ and the conflict graph drive scheduling (which links $(i,j)$ transmit)]
Algorithm architecture
[Diagram: Application (utility $U_i^d$) feeds Congestion Control, which sets $x_i^d$ from the local price $p_i^d$; Routing services queues (En/Dequeue) using queue lengths $p_j^d$, $\lambda_{ij}^d$; Scheduling/MAC transmits (Xmit/Rcv) using weights $w_{i,j}^{d*}$ and a conflict graph estimated with other nodes; alongside Security Mgt, Topology Control, Radio Mgt, Mobility Management, Physical Xmission]