0% found this document useful (0 votes)
130 views12 pages

Deridex Sigcomm 2023-16

Uploaded by

sherryli101
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
130 views12 pages

Deridex Sigcomm 2023-16

Uploaded by

sherryli101
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Improving Network Availability with Protective ReRoute

David Wetherall, Abdul Kabbani∗ , Van Jacobson, Jim Winget, Yuchung Cheng,
Charles B. Morrey III, Uma Moravapalle, Phillipa Gill, Steven Knight, and Amin Vahdat
Google

ABSTRACT In our experience, the main barrier to high availability in large


We present PRR (Protective ReRoute), a transport technique for networks is the non-negligible number of faults for which routing
shortening user-visible outages that complements routing repair. does not quickly restore connectivity. Sometimes routing fails to
It can be added to any transport to provide benefits in multipath help due to configuration mistakes, software bugs, or unexpected
networks. PRR responds to flow connectivity failure signals, e.g., interactions that are only revealed after a precipitating event, such
retransmission timeouts, by changing the FlowLabel on packets as a hardware fault. For example, bugs in switches may cause pack-
of the flow, which causes switches and hosts to choose a different ets to be dropped without the switch also declaring the port down.
network path that may avoid the outage. To enable it, we shifted our Or switches may become isolated from their SDN controllers so
IPv6 network architecture to use the FlowLabel, so that hosts can that part of the network cannot be controlled. And routing or traffic
change the paths of their flows without application involvement. engineering may use the wrong weights and overload links.
PRR is deployed fleetwide at Google for TCP and Pony Express, While unlikely, these complex faults do occur in hyperscaler
where it has been protecting all production traffic for several years. networks due to their scale. They result in prolonged outages due
It is also available to our Cloud customers. We find it highly ef- to the need for higher-level repair workflows. This has an out-
fective for real outages. In a measurement study on our network size impact on availability: a single 5 min outage means <99.99%
backbones, adding PRR reduced the cumulative region-pair outage monthly uptime. Qualitatively, outages that last minutes are highly
time over TCP with application-level recovery by 63–84%. This is disruptive for customers, while brief outages lasting seconds may
the equivalent of adding 0.4–0.8 “nines” of availability. not be noticed.
Perhaps surprisingly, routing is insufficient even for the out-
CCS CONCEPTS ages that it can repair. Traffic control systems operate at multiple
timescales. Fast reroute [3, 4] performs local repair within seconds
• Networks → Data path algorithms; Network reliability; End
to move traffic to backup paths for single faults. Routing performs
nodes;
a global repair, taking tens of seconds in very large networks to
propagate topology updates, compute new routes, and install them
KEYWORDS
at switches. Finally, traffic engineering responds within minutes to
Network availability, Multipathing, FlowLabel fit demand to capacity by adjusting path weights.
ACM Reference Format: For routing alone to maintain high availability, we need fast
David Wetherall, Abdul Kabbani, Van Jacobson, Jim Winget, Yuchung reroute to always succeed. Yet a significant fraction of outages only
Cheng, Charles B. Morrey III, Uma Moravapalle, Phillipa Gill, Steven Knight, fully recover via global routing and traffic engineering. This is be-
and Amin Vahdat. 2023. Improving Network Availability with Protective cause fast reroute depends on limited backup paths. These paths
ReRoute. In ACM SIGCOMM 2023 Conference (ACM SIGCOMM ’23), Sep- may not cover the actual fault because they are pre-computed for
tember 10, 2023, New York, NY, USA. ACM, New York, NY, USA, 12 pages.
the Shared Risk Link Group (SRLG) [9] of a planned fault. Backup
https://doi.org/10.1145/3603269.3604867
paths that do survive the fault are less performant than the failed
paths they replace. Because the number of paths grows exponen-
1 INTRODUCTION tially with path length, there are fewer local options for paths be-
Hyperscalers operate networks that support services with billions tween the switches adjacent to a fault than global end-to-end paths
of customers for use cases such as banking, hospitals and com- that avoid the same fault. With limited options plus the capacity
merce. High network availability is a paramount concern. Since it loss due to the fault, backup paths are easily overloaded.
is impossible to prevent faults in any large network, it is necessary Our reliability architecture shifts the responsibility for rapid
to repair them quickly to avoid service interruptions. The classic repair from routing to hosts. Modern networks provide the oppor-
repair strategies depend on routing protocols to rapidly shift traffic tunity to do so because they have scaled by adding more links, not
from failed links and switches onto working paths. only faster links. To add capacity, the links must be disjoint. This
leads to multiple paths between pairs of endpoints that can fail
*The author contributed to this work while at Google.
independently. Thus the same strategy that scales capacity also
Permission to make digital or hard copies of part or all of this work for personal or adds diversity that can raise reliability.
classroom use is granted without fee provided that copies are not made or distributed Hosts typically see bimodal outage behavior: some connections
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
take paths to a destination that are black holes which discard pack-
For all other uses, contact the owner/author(s). ets in the network, while other connections to the same destination
ACM SIGCOMM ’23, September 10, 2023, New York, NY, USA take paths that continue to work correctly. This partial loss of
© 2023 Copyright held by the owner/author(s). connectivity is a consequence of multiple paths plus standard tech-
ACM ISBN 979-8-4007-0236-5/23/09.
https://doi.org/10.1145/3603269.3604867 niques to avoid single points of failure. For example, it is unlikely
FL#1

FL#2 X
FL#3
FL#4
X
Src DCN DCN Dst

Site A Site B Site A

Figure 1: Multipath between hosts can route around faults shown in dotted red.

that a fault will impact geographically distinct fiber routes, or all 2 DESIGN
shards of a network entity that is partitioned for reliability. We describe PRR, starting with our network architecture, and then
To restore connectivity, hosts must avoid failed paths and use how outage detection and repathing work in it.
working ones. This poses a dilemma because the IPv4 architecture
does not let hosts influence the network path of a flow. Instead, with
the rise of multiple paths, Equal-Cost Multi-Path (ECMP) [24, 48]
routing at switches uses transport identifiers to spread flows across 2.1 Multipath Architecture
paths. This design ties a connection to a path. Fortunately, now that
PRR is intended for networks in which routing provides multiple
we have migrated Google networks to IPv6, we can leverage the
paths between each pair of hosts, as is the case for most modern
IPv6 FlowLabel [2, 34] in its intended architectural role to influence
networks. For example, in Fig 1 we see an abstract WAN in which a
the path of connections.
pair of hosts use four paths at once out of many more possibilities.
With Protective ReRoute (PRR), switches include the FlowLabel
When there is a fault, some paths will lose connectivity but others
in the ECMP hash. Hosts then repath connections by changing the
may continue to work. In the figure, one path fails due to a switch
FlowLabel on their packets. Linux already supports this behavior
failure, and another due to a link failure. But two paths do not fail
for IPv6, building on the longstanding notion of host rerouting on
despite these faults.
transport failures, e.g., TCP timeouts. We completed this support
Multipath provides an opportunity for hosts to repair the fault
by handling outages encountered by acknowledgement packets.
for users. To do so, we need hosts to be able to control path selection
PRR repairs failed paths at RTT timescales and without added
for their connections. We shifted our IPv6 network architecture to
overhead. It is applicable to all reliable transports, including multi-
use the FlowLabel [2] to accomplish this goal.
path ones (§2.5). It has a small implementation, and is incrementally
In the original IPv4 Internet, hosts were intentionally oblivious
deployable across hosts and switches.
to the path taken by end-to-end flows. This separation between end
For the last several years, PRR has run 24x7 on Google networks
hosts and routing was suited to a network that scaled by making
worldwide to protect all TCP and Pony Express [31] (an OS-bypass
each link run faster and thus had relatively few IP-level paths be-
datacenter transport) traffic. Recently, we have extended it to cover
tween hosts. Over the past two decades, networks have increasingly
Google Cloud traffic. The transition was lengthy due to the ven-
scaled capacity by adding many links in parallel, resulting in numer-
dor lead times for IPv6 FlowLabel hashing, plus kernel and switch
ous additional viable paths. Flows must be spread across these paths
upgrades. But the availability improvements have been well worth-
to use the capacity well, and so multipathing with ECMP [24, 48]
while: fleetwide monitoring of our backbones over a 6-month study
has become a core part of networks. The same paths that spread
found that PRR reduced the cumulative region-pair outage time
flows to use capacity for performance also increase diversity for
(sum of outage time between all pairs of regions in the network)
reliability.
over TCP with application-level recovery by 63–84%.
The IPv6 architecture allows hosts to directly manage the multi-
This work makes two contributions. We describe the nature of
plicity of paths by using the FlowLabel to tell switches which set of
outages in a large network, using case studies to highlight long
packets needed to take the same path, without knowing the specifics
outages that undermine availability. And we present data on the
of the path [34]. This arrangement lets transports run flows over
effectiveness of PRR for our fleet as a whole, plus learnings from
individual paths while letting the IP network scale via parallel links.
production experience. We hope our work will encourage others to
However, IPv6 was not widely deployed until around 2015, and
leverage the IPv6 FlowLabel with an architecture in which routing
this usage did not catch on. In the interim, ECMP used transport
provides hosts with many diverse paths, and transports shift flows
headers for multipathing, e.g., each switch hashes IP addresses and
to improve reliability.
TCP/UDP ports of each packet to pseudo-randomly select one of
This work does not raise ethical issues.
the available next-hops. This approach eases the scaling problem,
but limits each TCP connection to a fixed network path.
The adoption of IPv6 allows us to use the FlowLabel as intended. X
X
RTO X RTO
At switches, we include the FlowLabel in the ECMP hash, a capa- Time X
Dup
bility which is broadly available across vendors1 . As a pragmatic RTO RTO
Dup
Repair Repair
step, we extend ECMP to use the FlowLabel along with the usual
4-tuple. The key benefit is to let hosts change paths without chang-
ing transport identifiers. In Fig 1, one connection may be shifted Figure 2: Example recovery of a Unidirectional Forward (left) and
across the four paths simply by changing the FlowLabel. Reverse (right) fault. Non-solid lines indicate a changed FlowLabel.

2.2 High-Level Algorithm ACK Path. Interestingly, RTOs are not sufficient for detecting
PRR builds on the concept of host rerouting in response to transport reverse path outages. Failure of the ACK path will not cause an
issues. An instance of PRR runs for each connection at a host to RTO for the ACK sender because ACKs are not themselves reliably
protect the forward path to the remote host. This is necessary acknowledged. Consider request/response traffic. The response will
because connections use different paths due to routing and ECMP, not be sent until the request has been fully received, which will not
so one instance of PRR cannot learn working paths from another. happen if the pure ACKs for a long request are lost.
An outage for a connection occurs when its network path loses We use the reception of duplicate data beginning with the sec-
IP connectivity. If the connection is active during an outage, PRR ond occurrence as a signal that the ACK path has failed. A single
will detect an outage event via the transport after a short time, duplicate is often due to a spurious retransmission or use of Tail
as described in §2.3. The outage may be a forward, reverse, or Loss Probes (TLP) [10], whereas a second duplicate is highly likely
bidirectional path failure. Unidirectional failures are quite common to indicate ACK loss.
since routes between hosts are often asymmetric, due to traffic Control Path. Finally, we detect outages in an analogous way
engineering onto non-shortest paths [23]. when a new connection is being established. We use SYN timeouts in
PRR triggers repathing intended to recover connectivity in re- the client to server direction, and reception of SYN retransmissions
sponse to the outage event, as described in §2.4. If the new path suc- in the server to client direction.
cessfully recovers connectivity, the outage ends for the connection
Performance. The performance of PRR depends on how quickly
(even though the fault may persist). If not, PRR will detect a subse-
these sequences are driven by RTOs. Outside Google, a reasonable
quent outage event after an interval, and again trigger repathing. It
heuristic for the first RTO on established connections is RTO = 3RTT,
will continue in this manner until either recovery is successful, the
with a minimum of 200ms. Inside Google, we use the default Linux
outage ends for exogenous reasons (e.g. repair of the fault by the
TCP RTO formula [36] but reduce the lower-bound of RTTVAR and
control plane), or the connection is terminated.
the maximum delayed ACK time to 5ms and 4ms from the default
A fundamental ambiguity in this process is that a host alone
200ms and 40ms, respectively [8]. Thus a reasonable heuristic is
cannot distinguish forward from reverse or bidirectional path fail-
RTO = RTT + 5ms. It yields RTOs as low as single digit ms for
ure. As a result, outage events may cause PRR to perform spurious
metropolitan areas, tens of ms within a continent, and hundreds of
repathing, which can slow recovery by causing a forward path fail-
ms for longer paths. These lower RTOs speed PRR by 3–40X over
ure where none existed. Fortunately, this does not affect correctness
the outside heuristic. For new connections, a typical first SYN time-
because outage events will continue to trigger repathing until both
out is 1s, at the upper end of RTOs. This implies that connection
forward and reverse paths recover.
establishment during outages will take significantly longer than
repairing existing connections.
2.3 TCP Outage Detection
Examples. Recovery for request/response traffic with a unidirec-
PRR can be implemented for all reliable transports. We describe tional path fault is shown in Fig 2. For simplicity the figures focus
how it works with TCP as an important example. Since network on the initial transmission and later ones involving repathing, while
outages are uncommon, PRR must have acceptable overhead when omitting other packets present for fast recovery and TLP, etc. Each
there is no outage. This means PRR must be very lightweight in non-solid line indicates a changed FlowLabel.
terms of host state, processing and messages. Our approach is to In the forward case, the behavior is simple: each RTO triggers
detect outages as part of normal TCP processing. retransmission with repathing (dashed line). This continues until a
Data Path. There are many different choices for how to infer working forward path is found. Recovery is then direct since the re-
an outage, from packet losses to consecutive timeouts. For estab- verse path works. The number of repaths (here two) is probabilistic.
lished connections, PRR takes each TCP RTO (Retransmission Time- The expected value depends on the fraction of working paths.
Out) [35] in the Google network as an outage event. This method The reverse case differs in two respects. The RTOs cause spuri-
provides a signal that recurs at exponential backoff intervals while ous repathing (dot-dash line) since the forward path was already
connectivity is sufficiently impaired that the connection is unable working. However, it is not harmful because only the reverse di-
to make forward progress. It is possible that an RTO is spurious rection is faulty. Duplicate detection at the remote host (after TLP
or indicates a remote host failure, but repathing is harmless in which is not shown) causes reverse repathing until a working path
these situations (as it is either very likely to succeed or won’t help is found. The recovery time is the same as the forward case despite
because the fault is not network related). these differences.
Bidirectional faults are more involved because the repair be-
1 Support was limited when we began but has increased steadily. havior depends on whether the connection initially failed on the
X
X
probability. However, the key question is how often a new path
RTO X RTO X
will intersect the fault.
RTO RTO
X
Dup X
The fraction of failed paths for the outage must be measured
empirically since it depends on the nature of the fault. It also varies
RTO X RTO X
for each prefix-pair and changes over time. Often the outage fraction
is small, leading to rapid recovery. For example, with a 25% outage
RTO RTO (in which a quarter of the paths fail) a single random draw will
Dup X
Repair
succeed 75% of the time. More generally, for an IP prefix-pair with
a 𝑝% outage, the probability of a connection being in outage after
RTO
Time X
Dup 𝑁 rerouting attempts falls as 𝑝 𝑁 .
Avoiding Cascades Repathing needs to avoid cascade failures,
RTO where shifting a large amount of traffic in response to a problem
Dup
Repair
focuses load elsewhere and causes a subsequent problem. PRR is
superior to routing in this respect: fast-reroute shifts many “live”
Figure 3: Example recovery of a Bidirectional fault when only the
connections in the same way at the same time, while PRR shifts
Reverse path (left) or both directions (right) initially failed. Non-
solid lines indicate a changed FlowLabel. Red dash-dot lines indi- traffic more gradually and smoothly.
cate harmful repathing. The shift is gradual because each TCP connection independently
reacts to the outage, which spreads reaction times out at RTO
timescales. Each connection is quiescent when it moves, following
forward path, reverse path, or in both directions. Unlike the case an RTO, and will ramp its sending rate under congestion control.
of a unidirectional fault, we may have harmful spurious repathing The shift is smooth because random repathing loads working
and delayed reverse repathing. paths according to their routing weights. The expected load increase
If the connection initially experiences a forward path failure but on each working path due to repathing in one RTO interval is
(by chance) no reverse failure, then recovery proceeds as for the bounded by the outage fraction. For example, it is 50% for a 50%
unidirectional case. When the forward repair succeeds, it is the outage: half the connections repath and half of them (or a quarter)
first time the packet reaches the receiver. Thus the receiver will not land on the other half of paths that remain. This increase is at most
repath. It will use its working reverse path to complete recovery. 2X, and usually significantly lower, which is no worse than TCP
Conversely, if the connection initially fails only on the reverse slow-start [25] and comfortably within the adaptation range of
path (Fig 3 left) then we see different behavior. At the first RTO, congestion control. Moreover, PRR can only use policy-compliant
spurious forward repathing occurs. Now it can be harmful, and is in paths which have acceptable properties such as latency, while there
this example (dot-dash red line). It causes a forward path loss that are no such guarantees for bypass routes.
requires subsequent RTOs to repair. When the forward path works, A related concern is that repathing in response to an outage
duplicate reception causes reverse repathing. We must draw both a will leave traffic concentrated on a portion of the network after the
working forward and reverse path at the same time to recover. outage has concluded. However, this does not seem to be the case in
The longest recovery occurs when the connection has failed in practice: routing updates spread traffic by randomizing the ECMP
both directions (Fig 3 right). Reverse repathing is needed for recov- hash mapping, and connection churn also corrects imbalance.
ery but delayed until after two repairs of the forward path. This
is because the receiver first repaths on the second duplicate recep- 2.5 Alternatives
tion and the original transmission and TLP are lost. To complete
recovery, we need a subsequent joint forward and reverse repair. Multipath Transports. A different approach is to use multipath
As before, spurious forward repathing may slow this process. transports such as MPTCP [6] or SRD [38] since connections that
use several paths are more likely to survive an outage; these trans-
2.4 Repathing with the FlowLabel ports can also improve performance in the datacenter [33]. However,
PRR repaths as a local action by using the FlowLabel. It does not PRR may be applied to any transport to boost reliability, including
rely on communication with other network components (e.g., SDN multipath ones. For example, MPTCP can lose all paths by chance,
controllers or routing) during outages. and it is vulnerable during connection establishment since sub-
flows are only added after a successful three-way handshake. PRR
Random Repathing. For each outage event, PRR randomly changes
protects against these cases.
the FlowLabel value used in the packets of the connection. This
Moreover, many connections are lightly used, so the resource
behavior has been the default in Linux via txhash since 2015, with
and complexity cost of maintaining multiple paths does not have
our last upstream change (for repathing ACKs) landing in 2018. The
a corresponding performance benefit. The prospect of migrating
OS takes this action without involving applications, which greatly
all datacenter usage to a new transport was also a consideration.
simplifies deployment.
Instead, we apply PRR to TCP and Pony Express to increase the
With a good ECMP hash function, different inputs are equivalent
reliability of our existing transports.
to a random draw of the available next-hops at each switch. If we
define a path as the concatenation of choices at each switch, then Application-Level Recovery. Applications can approximate PRR
paths more than a few switches long will change with very high without IPv6 or the FlowLabel by reestablishing TCP connections
0.3 0.3 0.6
RTO=1.0 RTO =0.5 (No Spread) RTO=0.1 UNI 50% UNI 25% BI 25%+25% All Both Reverse Forward Oracle
Fault Duration Fault Duration Fault Duration
Failed Fraction

Failed Fraction

Failed Fraction
0.2 0.2 0.4

0.1 0.1 0.2

0.0 0.0 0.0


0 20 40 60 80 0 25 50 75 100 0 25 50 75 100
Time (seconds) Time (RTOs) Time (RTOs)
Figure 4: (a) Effect of RTO (b) Uni- and bi-directional repair curves (c) Breakdown of bidirectional repair

that have not made progress after a timeout2 . We relied on this clustered around a median of 0.5s, with most mass from 0.45 to 0.55s.
approach before PRR. However, using low RPC timeout values is They are generated with a log-normal distribution, LogN(0, 0.06),
much more expensive than PRR, as well as less performant. This is scaled by the median RTO of 0.5s. The connections also have 1s of
because establishing an RPC channel takes several RTTs and has jitter in their start times. The aggregate behavior is a “step” pattern,
computational overhead for security handshakes. Coverage is also where the failed fraction is reduced by 50% when the connections
more limited since it relies on application developers. Adding PRR repath at each step. Note that the failed fraction starts at around 0.2,
to TCP covers all manner of applications, including control traffic much lower than the 50% of connections that were initially black
such as BGP and OpenFlow, whether originating at switches or holed. This is because the black holed connections RTO and most
hosts, and achieves the ideal of avoiding application recovery. recover before the 2s timeout.
PLB. Finally, PRR is closely related to Protective Load Balanc- While instructive, we only see this step pattern on homogeneous
ing [32]. In our implementation, they are unified to use the same subsets of data. More typical are the smooth curves of the top and
repathing mechanism but for different purposes. PLB repaths using bottom lines due to connections repathing at different times. They
congestion signals (from ECN and network queuing delay) to bal- have median RTOs of 1s and 100ms, respectively, and are generated
ance load. PRR repaths using connectivity signals (e.g., timeouts) with LogN(0, 0.6) scaled by the median RTO. This distribution
to repair outages. These signals coexist without difficulty, but there spreads the RTOs to have a standard deviation that is 10X larger
is one point of interaction. PRR activates during an outage to move than before. The bottom line shows a much faster repair due to the
traffic to a new working path. Since outages reduce capacity, it is lower median RTO and has become smooth due to the RTO spread.
possible that PLB will then activate due to subsequent network The smaller 100ms RTO makes a large difference. It both reduces
congestion and repath back to a failed path. Therefore, we pause the initial failed fraction and reaches successive RTOs more quickly.
PLB after PRR activates to avoid oscillations and a longer recovery. The top line, with initial RTOs around 1s, models the failure rate for
new connections as well as long RTTs. It shows a correspondingly
3 SIMULATION RESULTS slower repair due to the larger RTO.
We use a simple model to predict how PRR reduces the fraction For both curves, the failed fraction of connections, 𝑓 , falls poly-
of failed connections. Our purpose is to provide the reader with nomially over time. Suppose the outage fails a fraction paths 𝑝.
a mental model for how PRR is expected to perform; we will see After 𝑁 RTOs, 𝑓 is 𝑝 𝑁 below its starting value, which is exponen-
these effects in our case studies (§4.2). tially lower. However, the RTOs are also exponentially-spaced in
We simulate repathing driven by TCP exponential backoff for an time, 𝑡, so we have 𝑡 ≈ 2𝑁 for the Nth RTO. Combining expressions,
ensemble of 20K long-lived connections under various fault models. 𝑓 ≈ 𝑝 𝑙𝑜𝑔2 (𝑡 ) = 1/(𝑡) 𝐾 , for 𝐾 = −𝑙𝑜𝑔𝑝 (2). Thus for 𝑝 = 12 , the failure
This workload represents the active probing that we use to measure probability falls as 1/𝑡. For 𝑝 = 14 , it falls as 1/𝑡 2 .
connectivity. The fault starts at 𝑡 = 0 and impacts each connection A notable effect is that the fault (dashed line) ends at 𝑡 = 40s, yet
when it first sends at 𝑡 ≥ 0. We model black hole loss and ignore some connections still lack connectivity until 𝑡 = 80s. That is, the
congestive loss. A connection is considered failed if a packet is not failures visible to TCP can last longer than the IP-level outage. The
acknowledged within 2s. The repair behavior is strongly influenced reason is exponential backoff. It is not until the first retry after the
by two factors that we consider in turn: the RTO (Retransmission fault has resolved that the connection will succeed in delivering a
TimeOut), and the outage fraction of failed paths. packet and recover. If the fault ends at 𝑡 = 40s then some connec-
RTO. The RTO depends on the connection RTT and state, and tions may see failures in the interval [20, 40). These connections
the TCP implementation. For our network the RTT ranges from will increase their RTO and retry in the interval [40, 80).
O(1ms) in a metro, to O(10ms) in a continent, to O(100ms) globally.
For new connections, the SYN RTO is commonly 1s. This variation Outage Fraction. The severity of the fault has a similarly large
has a corresponding effect on the speed of recovery. impact on recovery time. Fig 4(b) shows repair for three different
Fig 4(a) shows the repair of a 50% outage (i.e., half the paths long-lived faults. We normalize time in units of median initial RTOs,
fail) in one direction for three different RTO scenarios. The middle spread RTOs as before, and use a timeout of twice the median RTO.
line in the graph is the repair curve for connections having RTOs The top solid line is for a 50% outage in one direction. It cor-
responds to the 1s RTO curve from before. The bottom solid line
2 Linux fails TCP connections after ∼15 mins by default. Application timeouts are shows the repair of a 25% outage in one direction. It has the same
shorter.
effects, but starts from a lower failed fraction and falls more quickly. We use three types of probes to observe loss at different network
Now, each RTO repairs 75% of the remaining connections. layers. First, UDP probes measure packet loss at the IP level. We
The dashed line shows the repair of a bidirectional outage in refer to these probes as L3. They let us monitor the connectivity
which 25% of paths fail in each direction. For each connection, the of the underlying network, which highlights faults and routing
forward and reverse paths fail independently to model asymmetric recovery, but not how services experience the network.
routing. This curve is similar to the 50% unidirectional outage, even To measure application performance before PRR, we use empty
though it might be expected to recover more quickly since the Stubby RPCs as probes; Stubby is an internal version of the open
probability of picking a working forward and reverse path is 16 9, source gRPC [18] remote procedure call framework. We refer to
1
which is larger than 2 . The reason is that the bidirectional outage these probes as L7. They benefit from TCP reliability and RPC
has three components that repair at different rates, as we see next. timeouts, which reestablish TCP connections that are not making
In Fig 4(c), we break the repair of a long-lived 50% forward and progress. An L7 probe is considered lost if the RPC does not com-
50% reverse outage (solid line) into its components (dashed lines) plete within 2s. Stubby reestablishes TCP connections after 20s to
as described in §2.3. This outage is demanding, with 75% of the match the gRPC default timeout.
round-trip paths having failed, so the tail falls by only one quarter Finally, to measure application performance with PRR, we is-
at each RTO. Connections that initially failed in one direction only, sue the L7 probes with PRR enabled, which we refer to as L7/PRR.
either forward or reverse, are repaired most quickly. Connections These probes benefit from PRR repathing as well as TCP reliability
that initially failed in both directions are repaired slowly due to and RPC timeouts. Note that the network may be in outage while
spurious repathing and the delayed onset of reverse repathing. To applications are not, due to PRR, TCP and RPC repair mechanisms.
see the cost of these effects, the Oracle line (dotted) shows the how Thus comparing the three sets of probes lets us understand the ben-
the failed fraction improves without them. efits of PRR relative to applications without it and the underlying
network.
Summary. We conclude that for established connections with Our fleet summaries use probe data for 6 months to the middle
small RTOs, PRR will repair >95% of connections within seconds of 2023 between all region-pairs in our core network, and on both
for faults that black hole up to half the paths. This repair is fast backbones. Since the functionality to disable PRR is not present
enough that the black holes are typically not noticed. Larger RTOs on all probe machines, we note that there are slightly fewer L7
and new connections will require tens of seconds for the same level probes (29%) than L7/PRR (37%) and L3 probes (33%), but we have
of repair, and show up as a small service interruption. PRR cannot no reason to believe the results are not representative.
avoid a noticeable service interruption for the combination of large
RTOs and faults that black hole the vast majority of paths, though
it will still drive recovery over time. 4.2 Case Studies
We begin with case studies of how PRR behaves during a diverse
4 PRODUCTION RESULTS set of significant outages. Most outages are brief or small outages.
PRR runs in Google networks 24x7 and fleetwide for TCP and Pony The long and large outages we use for case studies are exceptional.
Express traffic. We present a measurement study to show how it They are worthy of study because they are highly disruptive to
is able to maintain high network availability for users, beginning users, unless repaired by PRR or other methods.
with case studies of outages and ending with fleet impact.
Case Study 1: Complex B4 Outage. The first outage is the longest
Our study observes real outages at global scale across the entire
we consider, lasting 14 mins. It impacted region-pairs connected
production network. It covers two backbones, B2 and B4 [23, 26],
by the B4 backbone. We use it to highlight multiple levels of repair
that use widely differing technologies (from MPLS to SDN) and
operating at different timescales. It was caused by a rare dual power
so have different fault and repair behaviors. Further, the results
failure that took down one rack of switches in a B4 supernode [23]
we derive are consistent with service experience to the best of our
and disconnected the remainder of the supernode from an SDN
knowledge. One limitation of our study is that we are unable to
controller. It was prolonged by a repair workflow that was blocked
present data on service experience. We report results for long-lived
by unrelated maintenance. Long outages have diverse causes and
probing flows instead. Still, our measurements are comprehensive,
typically complex behaviors. In this case, a single power failure
with literally trillions of probes.
would not have led to an outage, and a successful repair workflow
would have led to a much shorter outage, but in this unlucky case
4.1 Measuring loss three events happened together.
We monitor loss in our network using active probing. We send The probe loss during the outage is shown in Fig 5. The top
probes between datacenter clusters using multiple flows, defined graph shows the loss versus time for L3, L7 and L7/PRR probes
as source/destination IP and ports. Flows take different paths due over impacted inter-continental region-pairs over B4. The bottom
to ECMP. Each flow sends ∼120 probes per minute. Each pair of graph shows the same for intra-continental pairs. We separate them
clusters is probed by at least 200 flows. This arrangement lets us since the behaviors are qualitatively different. One datapoint covers
look at loss over time and loss over paths, with high resolution 0.5s, so the graphs show a detailed view of how the fault degraded
in each dimension. We also aggregate measurements to pairs of connectivity over time. Each datapoint is averaged over many thou-
network regions, where each region is roughly a metropolitan area sands of flows sending in the same interval, which exercised many
and contains multiple clusters. thousands of paths.
L7/PRR L7 L3 0.50% 60.0% L7/PRR L7 L3
Average Probe Loss Ratio

Average Probe Loss Ratio


10.00%

10.0% 0.25%
40.0% 5.00%

0.00%
0 25 50 0.00%
0 10 20
5.0%
20.0%

0.0% 0.0%
0 200 400 600 800 0 20 40
Time Since Start of Event (seconds) Time Since Start of Event (seconds)

(a) Inter-continental probe loss (a) Inter-continental probe loss


L7/PRR L7 L3 0.50% 60.0% L7/PRR L7 L3
Average Probe Loss Ratio

Average Probe Loss Ratio


7.5% 10.00%
0.25%
40.0% 5.00%
5.0%
0.00%
0 25 50 0.00%
0 10 20
20.0%
2.5%

0.0% 0.0%
0 100 200 300 400 0 20 40
Time Since Start of Event (seconds) Time Since Start of Event (seconds)

(b) Intra-continental probe loss (b) Intra-continental probe loss

Figure 5: Probe loss during a complex B4 outage. Figure 6: Probe loss during an optical link failure on B4.

Consider first the L3 line, which shows the IP-level connectivity. TCP connection; and (2) PRR operates at RTT timescales, which
Since the SDN control plane was disconnected by the fault, it could are much shorter than the 20s RPC timeout.
not program fixes to drive a fast repair process. Around 100s, global Some TCP flows required multiple attempts to find working
routing systems intervened to reroute traffic not originating from paths, so there was a tail as connectivity was restored. The subse-
or destined for the outage neighborhood. This action reduced the quent spikes arose when routing updates changed paths, altering
severity of the outage but did not fix it completely. After more the ECMP mapping and causing some working connections to be-
than 10 mins, the drain workflow removed the faulty portion of the come black holed. TCP retransmissions were of little help in this
network from service to complete the repair. process, since a connection either worked or lost all of its packets
The loss rate stayed below 13% throughout the outage because at a given time.
only one B4 supernode was affected by the fault so most flows tran- Finally, the L7/PRR line shows the outage recovery using the
sited healthy supernodes. However, the outage was more disruptive same RPC probes as the L7 case but with PRR enabled. The loss
than the loss rate may suggest because the failure was bimodal, as rate was greatly reduced, to the point of being visible only in the
is typical for non-congestive outages: all flows taking the faulty inset panel. The repair was roughly 100X more rapid than the L7
supernode saw 100% loss, while all flows taking the healthy supern- case, especially for the intra-continental case due to its shorter RTT.
odes saw normal, low loss. For customers using some of the faulty It achieved the desired result: most customers were unaware that
paths, the network was effectively broken (without PRR) because there was an outage because the connectivity interruption is brief
it would stall applications even though most paths still functioned and does not trigger application recovery mechanisms.
properly. This case study highlights outage characteristics that we observe
The L7 line shows the outage recovery behavior prior to the more broadly. Many outages black hole a fraction of paths between
development of PRR. The L7 loss rate started out the same as L3, a region-pair while leaving many other paths working at the same
but dropped greatly after 20s, after which it decayed slowly, with time. And some outages are not repaired by fast reroute so they
occasional spikes. The L7 improvement is due to the RPC layer, have long durations that are disruptive for users without quick
i.e., application-level recovery, which opened a new connection recovery. Outages with both characteristics provide an opportunity
after 20s without progress. These new connections with different for PRR to raise network availability.
port numbers avoided the outage by taking different network paths
due to ECMP. Most of the new paths worked by chance because Case Study 2: Optical failure. Next we consider an optical link
the L3 loss rate tells us that on average only 13% of paths initially failure that resulted in partial capacity loss for the B4 backbone.
failed. This is similar to the repathing done by PRR except that (1) Fig 6 shows the probe loss over time for inter- and intra-continental
by using the FlowLabel, PRR can repath without reestablishing the paths during the outage. In this case, L3 loss was around 60% when
20.0%
L7/PRR L7 L3 L7/PRR L7 L3
2.00% 15.00%
Average Probe Loss Ratio

Average Probe Loss Ratio


60.0%
15.0% 10.00%
1.00%
5.00%
40.0%
10.0%
0.00% 0.00%
30 40 50 0 100 200

5.0% 20.0%

0.0% 0.0%
0 100 200 300 400 500 0 200 400 600
Time Since Start of Event (seconds) Time Since Start of Event (seconds)

Figure 7: Inter-continent probe loss during a device failure on B2. Figure 8: Intra-continental probe loss for a regional fiber cut in B2.
(No intra-continent probe loss was observed.) (The inter-continental graph is omitted as similar.)

the event began, indicating that most paths had failed but there 100%
L7 vs. L3 L7/PRR vs. L7 L7/PRR vs. L3
was still working capacity.

Reduction in Cumulative
Immediately after the outage began, fast routing repair mecha- 75%

Outage Minutes
nisms acted to reduce the L3 loss to ∼40% within 5s. Further im-
provements gradually reduced the L3 loss to ∼20% by around 20s
50%
from the start of the event. The cause of this sluggish repair was
congestion on bypass links due to the loss of a large fraction of
outgoing capacity, and SDN programming delays due to the need to 25%

update many routes at once. Finally, the outage was resolved after
60 seconds when unresponsive data plane elements were avoided 0%
using traffic engineering. B4:Inter B4:Intra B2:Inter B2:Intra

The L7 line starts out slightly lower then the L3 one because L7
probes have a timeout of 2s, during which time routing was able to Figure 9: Reduction in outage minutes for B2 and B4 intra- and
repair some paths. We see that TCP was unable to mitigate probe inter-continental paths.
loss during the first 20s since retransmissions do not help more than
routing recovery. In fact, after around 10s, L7 loss exceeded L3 loss
because the detection of working paths was delayed by exponential
backoff. Around 20s, RPC channel reestablishment roughly halves peak probe loss seen by L3 was 19%. L7/PRR reduced this peak loss
the loss rate for the remainder of the outage for both intra- and over 15X to 1.2% and, as with the prior outage, quickly lowered the
inter-continental paths. All these effects are consistent with our loss level to near zero after 20 seconds. Conversely, the L7 probe
simulation results (§3). loss has a large peak of 14% and persists for significantly longer
In contrast, PRR lowered peak probe loss and quickly resolved the than L7/PRR.
outage. For intra-continental paths, L7/PRR reduced the peak probe
loss to 2.4% and had completely mitigated the loss by 20s into the Case Study 4: Regional fiber cut. Finally, we present an outage
outage. L7/PRR similarly performed well for inter-continental paths that challenged PRR (Fig 8). In this outage, a fiber cut caused a
where probe loss peaked at around 11%, which is over 5X less than significant loss of capacity. The average L3 probe loss peaked at
the peak L3 probe loss. This outage illustrated how the path RTT 70% and remained around 50% or higher for 3 mins. This severe fault
affects PRR. Consistent with simulation (§3), intra-continental paths impacted many paths. Fast reroute did not mitigate it because the
that have lower RTTs observed a lower peak and faster resolution bypass paths were overloaded due to the large capacity reduction.
than inter-continental paths. In both cases, PRR greatly reduced After ∼3 mins, global routing moved traffic away from the outage,
probe loss beyond L7. lowering the loss rate and alleviating congestion on the bypass
paths.
Case Study 3: Line card issues on a single device. The next L7/PRR reduced the peak loss to 14%, a 5X improvement. It was
outage involved a single device in our B2 backbone (Fig 7). During much more effective than L7, which reduced peak loss to 65%, but
the outage, the device had two line-cards malfunction, which caused was not able to fully repair this outage because of the large fraction
probe loss for some inter-continental paths. Due to the nature of the of path loss. This path loss was exacerbated by routing updates
malfunction, routing did not respond. The outage was eventually during the event: PRR moved connections to working paths only
mitigated when an automated procedure drained load from the for some of the connections to shift back to failed paths when the
device and took it out of service. ECMP mapping was changed. As a result, we see a pattern in which
While the cause of this outage is different than the others, we see L7/PRR loss fell over time but was interrupted with a series of
similar results: PRR was able to greatly reduce loss. In this case, the spikes.
4.3 Aggregate Improvements 100%
L7/PRR vs. L3 L7/PRR vs. L7 L7 vs. L3

Case studies provide a detailed view of how PRR repairs individual

Reduction in Daily
outages. To quantify how PRR raises overall network availability, 75%

Outage Minutes
we aggregate measurements for all outages across all region-pairs
in the Google network for the 6-month study period. The vast 50%

majority of the total outage time is comprised of brief or small


outages. Our results show that PRR is effective for these outages, 25%
as well as long and large outages.
Outage Minutes Our goal is to raise availability, which is defined 0%
Jan Apr Jul
as 𝑀𝑇 𝐵𝐹 /(𝑀𝑇 𝐵𝐹 + 𝑀𝑇𝑇 𝑅), where 𝑀𝑇 𝐵𝐹 is the Mean Time Be- Date
tween Failures and 𝑀𝑇𝑇 𝑅 is the Mean Time to Repair. This formula
is equivalent to 1 minus the fraction of outage time. Since we are Figure 10: Fraction of outage minutes reduced over time.
unable to report absolute availability measures due to confiden-
tiality, we report relative reductions in outage time across L3, L7,
and L7/PRR layers. These relative reductions translate directly to
availability gains. For instance, a 90% reduction in outage time is minutes repaired. We see some variation over time, reflecting the
equivalent to adding one “nine” to availability, e.g., hypothetically varying nature of outages, while PRR consistently delivers large
improving from 99% to 99.9% reductions in outage minutes throughout the study period.
We measure outage time for each of L3, L7 and L7/PRR in minutes
derived from flow-level probe loss. Specifically, we compute the 4.4 Effectiveness for Region-Pairs
probe loss rate of each flow over each minute. If a flow has more Outages affect multiple region-pairs, each of which may see a dif-
than 5% loss, such that it exceeds the low, acceptable loss of normal ferent repair behavior. We next look at how the benefit of PRR is
conditions, then we mark it as lossy. If a 1-minute interval between distributed across region-pairs in our network. Fig 11 shows the
a pair of network regions has more than 5% of lossy flow, such Complementary Cumulative Distribution Function (CCDF) over
that it is not an isolated flow issue, then it is an outage minute for region-pairs of the fraction of outage minutes repaired between
that region-pair. We further trim the minute to 10s intervals having layers. This graph covers all region-pairs in the fleet over our en-
probe loss to avoid counting a whole minute for outages that start tire study period. Points higher and further to the right are better
or end within the minute. outcomes as they mean a larger fraction of region-pairs repaired a
In Fig 9, we show the percent of total outage minutes that were greater fraction of outage minutes.
repaired for L7/PRR probes relative to L3 probes. We give results PRR performs well for a variety of paths. We see that the vast
for both B2 and B4 backbones, and broken out by intra- and inter- majority of region-pairs see a large benefit from L7/PRR over L3 on
continental paths. Similarly, we compare L7 (without PRR) with L3 both backbones. It is able to repair 100% of outage minutes for 50%
and L7/PRR with L7 to understand where the gains come from. The and 16% of B2 intra- and inter-continental region-pairs, respectively.
results are computed across many thousands of region-pairs and PRR performance is more varied for B4 where outage minutes are
include hundreds of outage events. decreased by half for 63% and 77% of intra- and inter-continental
PRR reduces outage minutes by 64-87% over L3. PRR com- region-pairs, respectively.
bined with transport and application layer recovery mechanisms
PRR improves significantly over L7. As expected, L7/PRR pro-
is very effective at shortening IP network outages for applications.
vides much greater benefit than L7. The lines showing PRR gain
We see large reductions in outage minutes when using L7/PRR for
are quite similar whether they are relative to L3 or L7 and show
both backbone networks. The reductions range from 64% for inter-
a reduction in outage minutes for nearly all region-pairs. (The ex-
continental paths on B4, to 87% for intra-continental paths on B2.
ceptions tend to be region-pairs with very few outage minutes for
Note that, unlike for individual outages, we do not see a consistent
which L7/PRR dynamics for sampling were unlucky.) Conversely,
pattern between inter- and inter-continental results across outages.
L7 without PRR increases the number of outage minutes relative to
This is because PRR effectiveness depends on topological factors
L3 for 3-16% of region pairs. This counter-intuitive result is possible
and not only the RTT.
because TCP exponential backoff on failed paths tends to prolong
PRR reduces outage minutes by 54-78% over L7. PRR is able to outage times (until RPC timeouts are reached).
repair most of the outage minutes that are not repaired by the TCP
and application-level recovery of L7 probes. This result confirms
5 DISCUSSION
that most of the L7/PRR improvement over L3 is not due to the
RPC layer and TCP retransmissions; L7 reduces the cumulative We briefly discuss some additional aspects of PRR.
outage minutes by only 15–42% relative to L3. We believe that a Other Transports. PRR can be applied to protect all transports,
large portion of this gain is coming from RPC reconnects, since including multipath ones, since all reliable transports detect de-
TCP retransmission is ineffective for black holes. livery failures. For example, we use PRR with Pony Express [31]
PRR performs well over time. As a check, we also look at how OS-bypass traffic with minor differences from TCP. User-space UDP
PRR behavior varies over time. Fig 10 shows the Generalized Addi- transports can implement repathing by using syscalls to alter the
tive Model (GAM) smoothing [43] of the fraction of daily outage FlowLabel when they detect network problems. Even protocols
100% Propagate FL
Network Region Pairs

75% PSP PSP


IPv6 UDP IPv6 L4 L4 Payload
Percentage of

Header Trailer
50%
VM Packet

25%
Figure 12: PSP Encapsulation
L7/PRR vs. L3 L7/PRR vs. L7 L7 vs. L3
0%
0% 25% 50% 75% 100%
Fraction of Outage Minutes Repaired by Between Protocols
Now, when the guest OS has PRR and a TCP connection detects an
(a) intra-continental (B2) outage and changes its FlowLabel, the encapsulation headers also
100%
change and hence ECMP causes the connection to be repathed. For
different encapsulation formats, e.g., IPSEC, the details will vary,
Network Region Pairs

75%
but the propagation approach is the same.
Percentage of

We also use encapsulation to enable PRR for Cloud IPv4 traffic.


50%
The gve [20] driver passes connection metadata to the hypervisor,
25%
which hashes it into the encapsulation headers. This technique
works for other IPv4 networks as well: encapsulate with IPv4 to
0%
L7/PRR vs. L3 L7/PRR vs. L7 L7 vs. L3 provide a layer of indirection, and propagate inner header entropy
0% 25% 50% 75% 100% to an outer UDP header (since there is no FlowLabel).
Fraction of Outage Minutes Repaired by Between Protocols
Deployment. Backwards-compatibility and incremental deploy-
(b) inter-continental (B2)
ment are highly desirable to protect existing deployments. PRR
100%
excels in this respect. Its rollout was lengthy due to vendor partici-
pation, upstream kernel changes, and switch upgrades, but other-
Network Region Pairs

75%
Percentage of

wise straightforward. Hosts could be upgraded in any order to use


50%
the FlowLabel for outgoing TCP traffic. Switches could be concur-
rently upgraded in any order to ECMP hash on the FlowLabel; this
25% change is harmless given the semantics of the FlowLabel.
It is not necessary for all switches to hash on the FlowLabel
0%
L7/PRR vs. L3 L7/PRR vs. L7 L7 vs. L3 for PRR to work, only some switches upstream of the fault. Often,
0% 25% 50% 75% 100%
Fraction of Outage Minutes Repaired by Between Protocols substantial protection is achieved by upgrading only a fraction of
switches. This property has implications for the reliability of IPv6
(c) intra-continental (B4)
traffic in the global Internet. Not only can each provider enable
100%
FlowLabel hashing to protect their own network, but upstream
providers have some ability to work around downstream faults by
Network Region Pairs

75%
changing the ingress into downstream networks.
Percentage of

50%
6 RELATED WORK
25% Most work on improving network reliability focuses on network
internals such as fast reroute [3, 4], backbones [23, 40, 47], or dat-
L7/PRR vs. L3 L7/PRR vs. L7 L7 vs. L3
0% acenters [44]. Fewer papers report on the causes of outages [17,
0% 25% 50% 75% 100%
Fraction of Outage Minutes Repaired by Between Protocols 21, 41] or availability metrics suited to them, such as windowed-
(d) inter-continental (B4) availability [22], which separates short from long outages.
Hosts have the potential to raise availability using the FlowLa-
Figure 11: CCDF of improvement across region pairs. bel [5], but no large-scale deployment or evaluation has been pre-
sented to the best of our knowledge. Most host-side work focuses
on multipath transports like MPTCP [6], SRD [38], and multipath
such as DNS and SNMP can change the FlowLabel on retries to QUIC [13] that send messages over a set of network paths. While
improve reliability. they primarily aim to improve performance, these transports also
Cloud & Encapsulation. PRR must be extended to protect IPv6 increase availability, e.g., MPTCP may reroute data in one subflow
traffic from Cloud customers because virtualization affects ECMP. to another upon RTO. However, they use only a small set of paths
Google Cloud virtualization [12] uses PSP encryption [19], adding and may not protect the reliability of connection establishment, e.g.,
IP/UDP/PSP headers to the original VM packet as shown in Fig 12. MPTCP adds paths only after a successful three-way handshake [5].
In the network, switches use the outer headers for ECMP and PRR can be added to multipath transports to increase reliability by
ignore the VM packet headers. To enable the VM to repath via exploring new paths until it finds working ones, and protecting
the FlowLabel, we hash the VM headers into the outer headers. connection establishment.
There is a large body of work on a related host behavior: multi- [9] Sid Chaudhuri, Gisli Hjalmtysson, and Jennifer Yates. 2000. Control of lightpaths
pathing for load balancing [1, 6, 14–16, 27–30, 32, 37, 39, 42, 45, 46]. in an optical network. In Optical Internetworking Forum.
[10] Yuchung Cheng, Neal Cardwell, Nandita Dukkipati, and Priyaranjan Jha. 2021.
Some of this work considers reliability, for example, CLOVE [28] The RACK-TLP Loss Detection Algorithm for TCP. RFC 8985. (Feb. 2021). https:
uses ECMP to map out working paths. PRR shows this mapping //doi.org/10.17487/RFC8985
Finally, [7] argues that hosts should play a greater role in path selection, instead of routers reacting to failures. PRR is one realization of this argument.
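Concretely, the host-side role is small. The sketch below is illustrative Python with invented names (Flow, on_retransmission_timeout), not the TCP or Pony Express implementation described in this paper: it draws a fresh FlowLabel whenever a retransmission timeout fires, including timeouts during connection establishment, so that FlowLabel-aware ECMP moves the flow onto a different path.

    import random

    FLOWLABEL_BITS = 20

    class Flow:
        """Toy per-flow state: the current FlowLabel and the labels tried."""
        def __init__(self):
            self.flowlabel = random.getrandbits(FLOWLABEL_BITS)
            self.tried_labels = {self.flowlabel}
            self.established = False

    def on_retransmission_timeout(flow):
        """Repath on an RTO by drawing a FlowLabel not yet tried, so that
        FlowLabel-aware ECMP hashes the flow onto a different path. Because
        this runs whether or not the handshake has completed, connection
        establishment (e.g., a lost SYN) is protected as well."""
        label = random.getrandbits(FLOWLABEL_BITS)
        while label in flow.tried_labels:
            label = random.getrandbits(FLOWLABEL_BITS)
        flow.tried_labels.add(label)
        flow.flowlabel = label
        return label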
7 CONCLUSION

PRR is deployed fleet-wide at Google, where it has protected the reliability of nearly all production traffic for several years. It is also available to our Cloud customers. PRR greatly shortens user-visible outages. In a 6-month study on two network backbones, it reduced the cumulative region-pair outage time over TCP with application-level recovery by 63–84%. This is the equivalent of adding 0.4–0.8 “nines” to availability. We now require that all our transports use PRR.

PRR (and its sister technique, PLB [32]) represent a shift in our network architecture. In the early Internet [11], the network instructed hosts how to behave, e.g., ICMP Source Quench. This did not work well, and in the modern Internet neither hosts nor routers instruct each other: hosts send packets, and routers decide how to handle them.

In our architecture, hosts instruct the network how to select paths for their traffic by using the IPv6 FlowLabel. This shift has come about because networks scale capacity by adding parallel links, which has greatly increased the diversity of paths. To make the most of this diversity, we rely on routing to provide hosts access to many paths, and hosts to shift traffic flows across paths to increase reliability and performance. We hope this architectural approach, enabled by the FlowLabel, will become widespread.

REFERENCES

[1] Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, and George Varghese. 2014. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM '14). Association for Computing Machinery, New York, NY, USA, 503–514. https://doi.org/10.1145/2619239.2626316
[2] Shane Amante, Jarno Rajahalme, Brian E. Carpenter, and Sheng Jiang. 2011. IPv6 Flow Label Specification. RFC 6437. (Nov. 2011). https://doi.org/10.17487/RFC6437
[3] Alia Atlas, George Swallow, and Ping Pan. 2005. Fast Reroute Extensions to RSVP-TE for LSP Tunnels. RFC 4090. (May 2005). https://doi.org/10.17487/RFC4090
[4] Alia Atlas and Alex D. Zinin. 2008. Basic Specification for IP Fast Reroute: Loop-Free Alternates. RFC 5286. (Sept. 2008). https://doi.org/10.17487/RFC5286
[5] Alexander Azimov. 2020. Self-healing Network or The Magic of Flow Label. https://ripe82.ripe.net/presentations/20-azimov.ripe82.pdf. (2020).
[6] Olivier Bonaventure, Christoph Paasch, and Gregory Detal. 2017. Use Cases and Operational Experience with Multipath TCP. RFC 8041. (Jan. 2017). https://doi.org/10.17487/RFC8041
[7] Matthew Caesar, Martin Casado, Teemu Koponen, Jennifer Rexford, and Scott Shenker. 2010. Dynamic Route Recomputation Considered Harmful. ACM SIGCOMM Computer Communication Review 40, 2 (Apr. 2010), 66–71. https://doi.org/10.1145/1764873.1764885
[8] Neal Cardwell, Yuchung Cheng, and Eric Dumazet. 2016. TCP Options for Low Latency: Maximum ACK Delay and Microsecond Timestamps, IETF 97 tcpm. https://datatracker.ietf.org/meeting/97/materials/slides-97-tcpm-tcp-options-for-low-latency-00. (2016).
[9] Sid Chaudhuri, Gisli Hjalmtysson, and Jennifer Yates. 2000. Control of lightpaths in an optical network. In Optical Internetworking Forum.
[10] Yuchung Cheng, Neal Cardwell, Nandita Dukkipati, and Priyaranjan Jha. 2021. The RACK-TLP Loss Detection Algorithm for TCP. RFC 8985. (Feb. 2021). https://doi.org/10.17487/RFC8985
[11] David Clark. 1988. The Design Philosophy of the DARPA Internet Protocols. SIGCOMM Comput. Commun. Rev. 18, 4 (aug 1988), 106–114. https://doi.org/10.1145/52325.52336
[12] Mike Dalton, David Schultz, Ahsan Arefin, Alex Docauer, Anshuman Gupta, Brian Matthew Fahs, Dima Rubinstein, Enrique Cauich Zermeno, Erik Rubow, Jake Adriaens, Jesse L Alpert, Jing Ai, Jon Olson, Kevin P. DeCabooter, Marc Asher de Kruijf, Nan Hua, Nathan Lewis, Nikhil Kasinadhuni, Riccardo Crepaldi, Srinivas Krishnan, Subbaiah Venkata, Yossi Richter, Uday Naik, and Amin Vahdat. 2018. Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization. In 15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018. USENIX Association, Renton, WA, 373–387.
[13] Quentin De Coninck and Olivier Bonaventure. 2017. Multipath QUIC: Design and Evaluation. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies.
[14] Advait Dixit, Pawan Prakash, Y. Charlie Hu, and Ramana Rao Kompella. 2013. On the impact of packet spraying in data center networks. In 2013 Proceedings IEEE INFOCOM. 2130–2138. https://doi.org/10.1109/INFCOM.2013.6567015
[15] Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, and Mohammad Alizadeh. 2016. Juggler: A Practical Reordering Resilient Network Stack for Datacenters. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys '16). Association for Computing Machinery, New York, NY, USA, Article 20, 16 pages. https://doi.org/10.1145/2901318.2901334
[16] Soudeh Ghorbani, Zibin Yang, P. Brighten Godfrey, Yashar Ganjali, and Amin Firoozshahian. 2017. DRILL: Micro Load Balancing for Low-Latency Data Center Networks. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '17). Association for Computing Machinery, New York, NY, USA, 225–238. https://doi.org/10.1145/3098822.3098839
[17] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. 2011. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. SIGCOMM Comput. Commun. Rev. 41, 4 (aug 2011), 350–361.
[18] Google. 2015. gRPC Motivation and Design Principles (2015-09-08). https://grpc.io/blog/principles/. (2015).
[19] Google. 2022. PSP Architecture Specification (2022-11-17). https://github.com/google/psp/blob/main/doc/PSP_Arch_Spec.pdf. (2022).
[20] Google. 2022. Using Google Virtual NIC. https://cloud.google.com/compute/docs/networking/using-gvnic. (2022).
[21] Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat. 2016. Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM '16). Association for Computing Machinery, New York, NY, USA, 58–72. https://doi.org/10.1145/2934872.2934891
[22] Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan. 2020. Meaningful Availability. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). USENIX Association, Santa Clara, CA, 545–557. https://www.usenix.org/conference/nsdi20/presentation/hauer
[23] Chi-Yao Hong, Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi, Kondapa Naidu B., Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev, Steve Padgett, Faro Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn, Jonathan Zolla, Joon Ong, and Amin Vahdat. 2018. B4 and after: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '18). Association for Computing Machinery, New York, NY, USA, 74–87. https://doi.org/10.1145/3230543.3230545
[24] Christian Hopps and Dave Thaler. 2000. Multipath Issues in Unicast and Multicast Next-Hop Selection. RFC 2991. (Nov. 2000). https://doi.org/10.17487/RFC2991
[25] Van Jacobson. 1988. Congestion Avoidance and Control. SIGCOMM Comput. Commun. Rev. 18, 4 (aug 1988), 314–329. https://doi.org/10.1145/52325.52356
[26] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. 2013. B4: Experience with a Globally-Deployed Software Defined Wan. SIGCOMM Comput. Commun. Rev. 43, 4 (aug 2013), 3–14.
[27] Abdul Kabbani, Balajee Vamanan, Jahangir Hasan, and Fabien Duchene. 2014. FlowBender: Flow-Level Adaptive Routing for Improved Latency and Throughput in Datacenter Networks. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies (CoNEXT '14). Association for Computing Machinery, New York, NY, USA, 149–160. https://doi.org/10.1145/2674005.2674985
[28] Naga Katta, Aditi Ghag, Mukesh Hira, Isaac Keslassy, Aran Bergman, Changhoon Kim, and Jennifer Rexford. 2017. Clove: Congestion-Aware Load Balancing at the Virtual Edge. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). Association for Computing Machinery, New York, NY, USA, 323–335. https://doi.org/10.1145/3143361.3143401
[29] Naga Katta, Mukesh Hira, Changhoon Kim, Anirudh Sivaraman, and Jennifer Rexford. 2016. HULA: Scalable Load Balancing Using Programmable Data Planes. In Proceedings of the Symposium on SDN Research (SOSR '16). Association for Computing Machinery, New York, NY, USA, Article 10, 12 pages. https://doi.org/10.1145/2890955.2890968
[30] Ming Li, Deepak Ganesan, and Prashant Shenoy. 2009. PRESTO: Feedback-Driven Data Management in Sensor Networks. IEEE/ACM Transactions on Networking 17, 4 (2009), 1256–1269. https://doi.org/10.1109/TNET.2008.2006818
[31] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Mike Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Mike Ryan, Erik Rubow, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat. 2019. Snap: a Microkernel Approach to Host Networking. In ACM SIGOPS 27th Symposium on Operating Systems Principles. New York, NY, USA.
[32] Mubashir Adnan Qureshi, Yuchung Cheng, Qianwen Yin, Qiaobin Fu, Gautam Kumar, Masoud Moshref, Junhua Yan, Van Jacobson, David Wetherall, and Abdul Kabbani. 2022. PLB: Congestion Signals Are Simple and Effective for Network Load Balancing. In Proceedings of the ACM SIGCOMM 2022 Conference (SIGCOMM '22). Association for Computing Machinery, New York, NY, USA, 207–218. https://doi.org/10.1145/3544216.3544226
[33] Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. 2011. Improving Datacenter Performance and Robustness with Multipath TCP. SIGCOMM Comput. Commun. Rev. 41, 4 (aug 2011), 266–277.
[34] Jarno Rajahalme, Alex Conta, Brian E. Carpenter, and Dr. Steve E Deering. 2004. IPv6 Flow Label Specification. RFC 3697. (March 2004). https://doi.org/10.17487/RFC3697
[35] Matt Sargent, Jerry Chu, Dr. Vern Paxson, and Mark Allman. 2011. Computing TCP's Retransmission Timer. RFC 6298. (June 2011). https://doi.org/10.17487/RFC6298
[36] Pasi Sarolahti and Alexey Kuznetsov. 2002. Congestion Control in Linux TCP. In 2002 USENIX Annual Technical Conference (USENIX ATC 02). USENIX Association, Monterey, CA. https://www.usenix.org/conference/2002-usenix-annual-technical-conference/congestion-control-linux-tcp
[37] Siddhartha Sen, David Shue, Sunghwan Ihm, and Michael J. Freedman. 2013. Scalable, Optimal Flow Routing in Datacenters via Local Link Balancing. In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies (CoNEXT '13). Association for Computing Machinery, New York, NY, USA, 151–162. https://doi.org/10.1145/2535372.2535397
[38] Leah Shalev, Hani Ayoub, Nafea Bshara, and Erez Sabbag. 2020. A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC. IEEE Micro 40, 6 (2020), 67–73. https://doi.org/10.1109/MM.2020.3016891
[39] Shan Sinha, Srikanth Kandula, and Dina Katabi. 2004. Harnessing TCP's burstiness with flowlet switching. In Proc. 3rd ACM Workshop on Hot Topics in Networks (Hotnets-III).
[40] Sucha Supittayapornpong, Barath Raghavan, and Ramesh Govindan. 2019. Towards Highly Available Clos-Based WAN Routers. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM '19). Association for Computing Machinery, New York, NY, USA, 424–440. https://doi.org/10.1145/3341302.3342086
[41] Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. 2010. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM Comput. Commun. Rev. 40, 4 (aug 2010), 315–326. https://doi.org/10.1145/1851275.1851220
[42] Erico Vanini, Rong Pan, Mohammad Alizadeh, Parvin Taheri, and Tom Edsall. 2017. Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 407–420. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/vanini
[43] Simon N Wood. 2017. Generalized additive models: an introduction with R (second ed.). Chapman and Hall/CRC, Boca Raton. https://doi.org/10.1201/9781315370279
[44] Dingming Wu, Yiting Xia, Xiaoye Steven Sun, Xin Sunny Huang, Simbarashe Dzinamarira, and T. S. Eugene Ng. 2018. Masking Failures from Application Performance in Data Center Networks with Shareable Backup. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '18). Association for Computing Machinery, New York, NY, USA, 176–190. https://doi.org/10.1145/3230543.3230577
[45] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz. 2012. DeTail: Reducing the Flow Completion Time Tail in Datacenter Networks. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '12). Association for Computing Machinery, New York, NY, USA, 139–150. https://doi.org/10.1145/2342356.2342390
[46] Hong Zhang, Junxue Zhang, Wei Bai, Kai Chen, and Mosharaf Chowdhury. 2017. Resilient Datacenter Load Balancing in the Wild. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '17). Association for Computing Machinery, New York, NY, USA, 253–266. https://doi.org/10.1145/3098822.3098841
[47] Zhizhen Zhong, Manya Ghobadi, Alaa Khaddaj, Jonathan Leach, Yiting Xia, and Ying Zhang. 2021. ARROW: Restoration-Aware Traffic Engineering. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference (SIGCOMM '21). Association for Computing Machinery, New York, NY, USA, 560–579. https://doi.org/10.1145/3452296.3472921
[48] Junlan Zhou, Malveeka Tewari, Min Zhu, Abdul Kabbani, Leon Poutievski, Arjun Singh, and Amin Vahdat. 2014. WCMP: Weighted Cost Multipathing for Improved Fairness in Data Centers. Article No. 5. https://dl.acm.org/doi/10.1145/2592798.2592803
