Network-Integrated Decoding System For Real-Time Quantum Error Correction With Lattice Surgery
Abstract—Existing real-time decoders for surface codes are limited to isolated logical qubits and do not support logical operations […] results of the decoder instances to solve the global decoding problem.
arXiv:2504.11805v1 [quant-ph] 16 Apr 2025
Fig. 1: Merge and split of two surface code patches over time. X-type ancilla qubits connected to two data qubits at the boundary become X-type ancilla qubits connecting both patches.
II. BACKGROUND
We provide brief overviews of the implementation of logical
operations using lattice surgery, the requirements for decoders
to support such operations, and the role of qubit controllers.
A. Lattice Surgery
Lattice surgery [3] is a resource-efficient method to implement logical operations with more than one logical qubit
encoded in surface codes. As lattice surgery requires only
nearest-neighbor physical qubit interactions, it enables logical
circuits to be executed on planar quantum hardware without
the need for long-range connections between qubits. This
hardware-friendly property makes lattice surgery a promising
candidate for scalable fault-tolerant quantum computing. As a
result, researchers have embraced lattice surgery over alternative methods such as transversal gates [15–19].

Fig. 2: Lattice surgery modifies the structure of the decoding graph, shown under a phenomenological noise model for a d = 5 surface code. (top) Decoding graph of Z-ancillas over five measurement rounds. (bottom) Decoding graph of a system involving two logical qubits that are merged for five rounds and then split. The decoding graph contains additional vertices (marked in red) when the two logical qubits are merged.

Lattice surgery involves merging and splitting surface code patches to perform multi-qubit logical operations. A merge operation joins the adjacent boundaries of two surface code patches by measuring ancilla qubits that interact with data qubits from both logical qubits across the boundary, as illustrated in Figure 1. Depending on whether the XL or ZL boundary is merged, this process consists of two concurrent steps: (1) applying additional gates between data qubits at the boundary of one logical qubit and ancilla qubits at the boundary of the other, and (2) measuring previously unused ancilla qubits at the boundary. In Figure 1, merging occurs along the ZL boundary: it converts the X-type ancilla qubits at the boundary into ancilla qubits connecting both logical qubits, and measures previously unused Z-type ancillas (highlighted with red borders). This approach is generalizable to more than two logical qubits, and two distant logical qubits can be merged using a chain of ancillary logical qubits [3].

Splitting is the reverse operation of merging. After merging, the combined logical patch can be split back into two separate logical qubits by ceasing joint stabilizer measurements and resuming individual stabilizer measurements for each patch.

Lattice surgery operations, especially merge, pose additional challenges to QEC decoding. The merging of logical qubits results in shared error syndromes between the qubits. Since any two logical qubits in the system can potentially be merged, the worst-case scenario requires the decoder to handle a merged patch that spans the entire qubit array. This necessitates a decoder capable of processing all logical qubits collectively, as independent decoding of each logical qubit would fail to accurately decode the shared error syndromes introduced by merging.

In comparison, splitting is much simpler. Once a patch is split, two decoders can decode the resulting logical qubits independently. Decoders can exploit this parallelism to improve decoder throughput when managing multiple logical qubits.

B. QEC Decoders

When measurements of the ancilla qubits are ready, a decoder, using classical computing, identifies the most likely set of physical errors that could cause the observed defect measurements. To mitigate the effects of measurement errors, the decoder processes at least d rounds of measurements together, collectively referred to as a syndrome. With lattice surgery, the decoder must decode d rounds of measurements together after each merge or split operation [3].
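The decoding-graph structure used throughout the paper can be sketched in a few lines. The following is a minimal, hypothetical model (a simplified 1-D slice of ancillas under the phenomenological noise model, not the paper's implementation): vertices are (round, ancilla) pairs, space-like edges model data-qubit errors, and time-like edges model measurement errors.

```python
# Sketch: decoding graph for d rounds of ancilla measurements under a
# phenomenological noise model. This is a simplified, hypothetical 1-D slice
# (a single column of ancillas), not the paper's implementation. Vertices are
# (round, ancilla) pairs; each edge is one potential error mechanism.
d = 5                                  # code distance = number of rounds
ancillas = list(range(d - 1))          # ancillas along one column

vertices = [(r, a) for r in range(d) for a in ancillas]
edges = []
for r in range(d):
    # space-like edges: a data-qubit error flips two neighboring ancillas
    for a in range(len(ancillas) - 1):
        edges.append(((r, a), (r, a + 1)))
for r in range(d - 1):
    # time-like edges: a measurement error flips the same ancilla twice
    for a in ancillas:
        edges.append(((r, a), (r + 1, a)))

def defects(measurements):
    """A defect is an ancilla outcome that differs from the previous round."""
    found, prev = [], [0] * len(ancillas)
    for r, row in enumerate(measurements):
        found += [(r, a) for a in ancillas if row[a] != prev[a]]
        prev = row
    return found
```

In this picture, a merge appends vertices for the previously unused boundary ancillas (the red vertices of Fig. 2) plus edges crossing the old boundary, and a split removes them again, which is why the graph structure is only known at runtime.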
Given the syndrome, the problem of finding the most likely errors is commonly represented by a decoding graph. Each vertex in the decoding graph corresponds to an ancilla measurement, and each edge represents a potential error. Figure 2 (top) illustrates the decoding graph for a single logical qubit encoded in a surface code with d = 5. For example, Sparse Blossom [7] and Parity Blossom [20] solve this problem exactly, while Union-Find (UF) [21] solves this problem approximately [22] but faster.

1) Dynamic Nature of Decoding Graphs: With lattice surgery, the decoding graph becomes dynamic because the graph structure depends on the outcomes of previous measurements and conditional operations in the circuit. For instance, whether two logical qubits should be merged may be determined by the result of a measurement on a third qubit, available only at runtime, as in the case of a T-gate implemented using magic state injection [23]. As a result, the complete decoding graph cannot be precomputed, and the decoder must construct the decoding graph as the computation progresses.

This runtime graph construction poses new challenges to decoding. In particular, the decoder must not only process syndrome data in real time, but also update the decoding graph itself based on conditional control flow. This requires tight integration with the control stack and minimal latency to keep up with real-time circuit execution. In Figure 2 (bottom), two logical qubits are merged for d rounds and then split apart. Whether this merge occurs can depend on a prior measurement result, meaning that the decoding graph, especially the inclusion of vertices corresponding to shared ancillas during the merge stage (shown in red), is not known ahead of time.

2) Performance Metrics: The performance of a decoder is evaluated using three metrics: accuracy, throughput, and latency. Accuracy determines the size of the surface codes needed to achieve a given logical error rate; decoders with lower accuracy require much more hardware. For current physical error rates in superconducting quantum computers, p = 0.001, a Union-Find decoder requires d ≈ 29 to achieve a logical error rate of 10⁻¹⁵. Latency determines the rate of logical operations, as intermediate decoding results are often needed for time-sensitive operations such as T-gates [24]. As a result, latency determines the overall execution time of the circuit and the probability of failure. For superconducting quantum computers, researchers typically assume a decoding latency budget of 10 µs when estimating the timing of quantum algorithms [25]. Throughput, or the rate of decoding, must match the measurement rate; otherwise, a backlog of undecoded measurements can accumulate, exponentially slowing down the quantum computer. For superconducting quantum computing, which has the most stringent requirements, the decoding rate should be 1 µs per measurement round [26].

In this context, a real-time decoder is one whose throughput exceeds the measurement rate and whose latency is comparable to the time to measure d rounds. For state-of-the-art superconducting quantum systems, this corresponds to a throughput greater than one million measurements per second and a total latency of approximately d µs [27].

C. Qubit Controller

In a quantum computer, the decoder must receive measurements from the qubit controllers. Using either an FPGA or specialized hardware, a controller generates the control signals to manipulate a qubit. A quantum computer needs hundreds of qubits, and therefore hundreds of controllers operating in parallel, to support experiments with many logical qubits [28–34].

This poses specific design requirements for the decoder. First, the decoder must be tightly integrated with qubit controllers to aggregate data from all the controllers with minimal latency. Second, the decoder's compute and I/O communication capabilities must scale with the number of qubit controllers to prevent it from becoming a system bottleneck.

III. RELATED WORK

To our knowledge, DECONET/HELIOS is the first scalable real-time decoder system that provides empirical results on decoding hundreds of interacting logical qubits simultaneously, including support for dynamic lattice surgery circuits. DECONET/HELIOS draws inspiration from several works that explored decoding lattice surgery-based dynamic graphs. It is the first decoder system that is capable of using network-integrated compute resources and supports real-time decoding with thousands of logical qubits.

Wu et al. [35] suggest a distributed decoding system in which a coordinator assigns decoding blocks to multiple compute resources and merges them using a fusion operation. DECONET/HELIOS can be considered the first implementation of this distributed decoding system, with new implementation ideas such as combining fusion with windowing to reduce latency and statically assigning decoding blocks to hardware units for more efficient execution. Bombin et al. [18] propose a modular decoding framework that partitions the decoding graph into subgraphs corresponding to commonly used logical operations. The framework first decodes the boundaries between these subgraphs, followed by parallel decoding of the individual subgraphs. The authors also considered dynamic circuits and proposed a hardware architecture in which multiple processing units, sharing a global memory, decode subgraphs in parallel, without providing a hardware implementation or any empirical data on decoding latency or throughput.

Lin et al. [17], Skoric et al. [36], and Tan et al. [37] propose partitioning the decoding graph into spatial regions, or windows, that can be decoded in parallel with limited inter-window dependencies. Lin et al. further show analytically that inter-window communication latency does not impose a scalability bottleneck. Our work builds on both these efforts to implement window-based decoding across FPGAs and, for the first time, empirically validates that inter-FPGA communication is not a scalability bottleneck.

To our knowledge, QULATIS [19] is the only reported hardware design that supports lattice surgery-based dynamic decoding graphs. It decodes errors using a greedy algorithm
that sacrifices accuracy for simplicity. Without an implementation, the authors evaluated the decoding latency of QULATIS using SPICE simulations. Compared to QULATIS, DECONET/HELIOS features a much more scalable architecture that leverages network-integrated FPGAs. Furthermore, our implementation of DECONET/HELIOS employs a Union-Find-based decoder, achieving at least two orders of magnitude higher decoding accuracy than the greedy algorithm used in QULATIS.

IV. DECONET OVERVIEW

A. Design Goals

Our goal is to build a system capable of real-time decoding of multiple interacting logical qubits implemented by the surface code. In graph-based decoding, this means handling a dynamic decoding graph comprising all logical qubits in the quantum computer. The system should be both scalable and adaptable. Scalability refers to the system's capability to handle increasingly larger decoding graphs by incrementally and trivially adding compute resources without significant redesign. Adaptability refers to the system's ability to support a dynamic decoding graph whose complete structure is not fully known before runtime.

B. Key Ideas

DECONET achieves its design goals by integrating several key ideas:

Partitioning the decoding graph into decoding blocks: Since a monolithic decoder cannot scale efficiently due to hardware resource constraints, we partition the decoding graph into smaller, independently decodable units called decoding blocks, as inspired by [35]. The system partially decodes each decoding block separately using a decoder and combines them to form a global decoding solution.

Network-integrated resources for scalability: DECONET uses network-integrated compute resources to scale beyond the limitations of a single node, computer, or FPGA. Unlike board- or chip-level integration, network integration allows us to incrementally and trivially add resources to scale the design.

Fusion-based and window-based approaches: We use two complementary methods to combine decoding blocks into a global decoding solution. Within a single computational resource (such as a single FPGA), we use a fusion-based approach [20, 35]. Across multiple network-integrated compute resources, we adopt a window-based approach [17]. We detail these methods and our partitioning strategy in §V.

Hybrid tree-grid network topology: To minimize communication latency among compute resources, we organize the compute resources in a novel hybrid topology, which we describe in detail in §VI. The hybrid topology uses a grid structure to facilitate local communication required for lattice surgery operations among neighboring compute resources, and a tree structure to ensure minimal worst-case latency when communicating decoding outcomes between compute resources. Additionally, the hybrid topology allows I/O resources to scale proportionally with the number of compute resources.

The design of DECONET is agnostic to the choice of decoder implementation. This flexibility is crucial as decoding technologies rapidly evolve, each offering different trade-offs in accuracy, latency, and resource utilization. DECONET imposes only two requirements on the decoder: first, it must decode a decoding block of ≈ d³ vertices in real time; and second, it must support the fusion of decoding blocks across both space and time. These requirements are reasonable in view of recent advancements in decoder designs. Many decoders meet the real-time requirement for moderate values of d [6–13, 20, 27]. Several recent decoders also support fusion [13, 20]; both UF and MWPM decoders can be extended to support fusion.

We implement DECONET using Helios as decoder instances, referred to as DECONET/HELIOS (detailed in §VII). This implementation spans five FPGAs and can decode up to 100 logical qubits at d = 5.

V. DECONET DECODING ALGORITHM

This section describes the decoding algorithm used in DECONET. It outlines how DECONET partitions the decoding graph into decoding blocks, distributes them across compute resources to decode them in parallel, and combines them to form the complete graph.

A. Decoding Block

A decoding block is the basic unit of decoding in DECONET. For a logical qubit of code distance d, a decoding block consists of d rounds of ancilla measurement outcomes. As FTQC requires decoding at least d rounds of measurements after a logical operation, a decoding block is the smallest subgraph in the decoding graph capable of producing a meaningful result.

Dividing the decoding graph into decoding blocks enables fast construction of a dynamic decoding graph. Each decoding block has a static graph structure that DECONET generates offline. At runtime, DECONET constructs the dynamic decoding graph from these pre-constructed blocks. Thus DECONET represents graph changes, such as merges and splits between logical qubits, as merges and splits between decoding blocks. By updating only the connections between affected decoding blocks, DECONET avoids reconstructing the entire decoding graph at runtime, which would be time-intensive.

Decoding blocks also increase parallelism. When logical qubits are split, DECONET decodes their decoding blocks in parallel. DECONET combines relevant blocks only when logical qubits merge, after initially decoding them in parallel.

B. Combining Decoding Blocks

We consider two scenarios when combining decoding blocks: (1) combining within a single compute resource and (2) combining across network-integrated compute resources.

1) Intra-resource Decoding Block Fusion: DECONET uses a fusion operation previously suggested by Wu et al. [20, 35] to merge blocks within the same compute resource, both spatially and temporally. Fusion provides several advantages compared
to alternative combining methods such as parallel window decoding and sliding window decoding.

• Reduced redundant computation: Conventional methods require overlapping windows of at least d/2 width along every direction to combine blocks, causing redundant computations in the overlapping region. Furthermore, the redundant computation region scales as O(d³). In contrast, when using a clustering-based decoding algorithm, fusion avoids redundant computations as it preserves the cluster details from the first decoding stage when starting the fusion operation. Furthermore, fusion only affects clusters incident to the boundary between merged blocks, which scales as O(d²).

• Increased parallelism: Conventional methods impose sequential dependencies due to artificial defects along boundaries [1, 17, 36]. Fusion eliminates this bottleneck by allowing simultaneous decoding of all the blocks prior to combining. This is especially advantageous when merging blocks along the time domain. When using fusion decoding, the first block can start after the first d rounds are available, whereas sliding window decoders must wait for all 2d rounds to start decoding.

2) Inter-resource Decoding Block Combination via Parallel Window Decoding: We use the parallel window decoding method, proposed by Lin et al. [17] and Skoric et al. [36], to combine decoding blocks across compute resources due to its lower communication overhead. This method reduces communication latency by requiring only one unidirectional message per pair of adjacent decoding blocks, unlike fusion-based methods that involve multiple data exchanges. This reduction is critical when combining decoding blocks across the network, where communication latency often exceeds decoding latency. For example, state-of-the-art FPGA decoders can decode a d = 21 surface code in 240 ns [38], but communication between FPGAs over gigabit transceivers can take around 300 ns [39].

The parallel window decoding method partitions the decoding graph into multiple windows [17, 36]. In the context of DECONET, each window consists of decoding blocks mapped to the same compute resource. Parallel window decoding groups these windows into multiple groups such that no two adjacent windows belong to the same group. This method then decodes the groups sequentially, performing parallel decoding of windows within each group and transmitting boundary information between groups in order. This boundary information represents the most probable physical qubit errors that cross window boundaries.

Prior work shows that at least three groups are required to satisfy the adjacency constraint [17, 36]. Because the number of groups directly impacts decoding latency, DECONET groups the network-integrated compute resources into three groups to minimize latency while satisfying the constraint.

C. Pipelined Decoding Procedure

Next, we describe the general decoding procedure of DECONET that organizes the compute resources into three groups, as motivated above. Figure 3 illustrates the pipelined execution of this procedure across the three groups.

Fig. 3: Timeline of the decoding procedure pipelined across three compute resource groups. Each group performs parallel decoding of its assigned decoding blocks (white boxes). After decoding, DECONET fuses adjacent decoding blocks in space and time (blue). The system then shares the boundary information between groups (blue arrows). Numbers in the white boxes denote the round index of decoding blocks. Groups 2 and 3 lag behind Group 1 due to data dependencies.

When the (d·i)th round of measurements becomes available, Group 1 decodes the corresponding decoding blocks assuming they are isolated. After local decoding, Group 1 fuses adjacent decoding blocks corresponding to merged logical qubits and performs time-domain fusion with decoding blocks from the (i−1)th round corresponding to the same logical qubit. Once fusion completes, Group 1 commits the (i−1)th round and transmits boundary information to Group 2. Group 2 then decodes and fuses the (i−1)th round using this information and commits the (i−2)th round. Similarly, Group 3 processes and commits the (i−3)th round.

This pipelined execution introduces a minimum decoding latency of 3d rounds but does not affect the throughput of decoding. Each group begins decoding a new round of decoding blocks immediately after it commits the previous round and receives the required boundary information.

VI. DECONET NETWORK ARCHITECTURE

The key to DECONET's scalability is to exploit network-integrated compute resources, allowing it to go beyond a single compute node (e.g., an FPGA), which has limited prior decoder implementations such as Helios [38], Lilliput [11], and Micro Blossom [13]. We next describe the network architecture of DECONET, which organizes the compute resources in a hybrid tree-grid topology for scalable and efficient decoding.

A. Hybrid Tree-Grid Network Topology

Figure 4 illustrates the hybrid tree-grid network topology employed by DECONET. Compute resources at the lowest level (leaf nodes) run decoder instances, while intermediate nodes act as routers. The root node serves as the central interface for user interaction, configuration, and experiment monitoring.

This hybrid tree-grid topology brings the benefits of both tree and grid. The grid structure efficiently handles local communication between adjacent leaf nodes. Through careful mapping of decoding blocks to leaf nodes (§VI-C), DECONET restricts boundary information exchange exclusively to adjacent compute resources in the grid. By restricting boundary […]
Fig. 4: Network architecture of DECONET, showing the hybrid tree-grid structure. The tree's leaf nodes run decoder instances, while intermediary nodes route information. The root exchanges instructions (input) and decoded results (output); the leaves receive measurements from the qubit controllers and send feedback to them.

Fig. 5: Internal architecture of DECONET leaf nodes, comprising a coordinator, router, and multiple decoder instances.
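The routing rule implied by the hybrid tree-grid topology can be sketched as follows. This is an illustrative model, not the DECONET implementation; the `Node` and `route` names are hypothetical, and a two-level tree is assumed so that non-local traffic always passes through the root.

```python
# Sketch: message routing over a hybrid tree-grid topology (illustrative
# model, not the DECONET implementation). Leaves sit on a 2-D grid for
# neighbor-to-neighbor boundary traffic; every leaf also has a tree parent
# for global traffic toward the root.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    parent: "Node | None" = None               # tree link toward the root
    grid_pos: "tuple[int, int] | None" = None  # set for leaf nodes only

def route(src: Node, dst: Node) -> list:
    """Boundary defects between grid-adjacent leaves take the direct grid
    link (one hop); other traffic climbs the two-level tree via the root."""
    if src.grid_pos is not None and dst.grid_pos is not None:
        dx = abs(src.grid_pos[0] - dst.grid_pos[0])
        dy = abs(src.grid_pos[1] - dst.grid_pos[1])
        if dx + dy == 1:
            return [src.name, dst.name]
    return [src.name, src.parent.name, dst.name]

root = Node("root")
leaf = [Node(f"leaf{i}", parent=root, grid_pos=(i // 2, i % 2))
        for i in range(4)]
```

Here `route(leaf[0], leaf[1])` uses the grid link, while `route(leaf[0], leaf[3])` goes through the root; in a deeper tree the climb would stop at the lowest common ancestor instead.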
Fig. 6: Implementation of DECONET/HELIOS using five FPGAs. The rightmost FPGA is the root of the tree, and the other four FPGAs are leaf nodes.

[…] featuring a Versal VM1802 FPGA-based SoC [14]. We chose these boards because they are one of the readily available Versal FPGA development boards with a very high LUT count and a high number of gigabit transceivers. Figure 6 shows the implementation of DECONET/HELIOS. The network of our implementation is a two-level tree, with one root node and four leaf nodes, which we also connect in a grid. We next describe the important choices made in our implementation.

A. Network Connection between FPGAs

We implemented the FPGA interconnection using gigabit transceivers to maximize fan-out. Each VMK180 board includes 30 GTs exposed through QSFP, SFP+, and FMC+ links, allowing each parent node to connect to up to 25 child nodes. This high fan-out capability significantly reduces DECONET's tree height, enhancing scalability as the system grows.

The Aurora core, the standard Xilinx IP core designed for high-throughput communication [39], incurs a high core-to-core latency of approximately 320 ns due to its optimization for throughput, making it unsuitable for real-time decoding. To address this limitation, we developed a custom low-latency transceiver, the Eos core [40], which reduces latency to 95 ns by sacrificing throughput. The Eos core breaks messages into smaller chunks and inserts more frequent error correction codes, allowing the transmitter and receiver to process data in fewer cycles, minimizing overall latency. Despite the trade-off in throughput, the Eos core supports stable operation at 16 Gbps, which is sufficient for real-time decoding. We validated its reliability through 24-hour stress testing, confirming its suitability for DECONET. The Eos core is open source and available at [40].

1) Messaging format: We use a fixed 64-bit format for messages exchanged between FPGAs. The first 8 bits specify the destination FPGA, the next 8 bits define the message header, and the remaining 48 bits serve as the payload. This simple format enables faster routing at each node, which is necessary for low-latency communication.

B. Choice of Decoder Instance: Helios

We chose the Helios decoder as the decoder instance due to its favorable trade-off between scalability, latency, and accuracy [12]. Helios is a distributed implementation of the Union-Find (UF) decoder [21], which offers slightly lower accuracy than minimum-weight perfect matching (MWPM) decoders such as those used in Micro Blossom [13]. However, Helios achieves significantly lower latency and can decode surface codes up to d = 21 in under 250 ns. This low latency enables us to demonstrate that DECONET can support low-latency decoding.

Since Helios is based on the Union-Find (UF) algorithm, we extended it to support the fusion operation, referred to as Fusion Union-Find (Fusion UF). We provide implementation details in §VII-E.

The choice of decoder instance presents a design trade-off. More accurate but resource-intensive decoders such as Micro Blossom [13] can improve logical error rates but increase decoding latency and reduce the number of logical qubits that can be supported per FPGA. In contrast, more resource-efficient UF-based decoders such as LCD [27] can support more logical qubits per FPGA but incur higher decoding latency. Helios strikes a balance by offering low latency and moderate resource usage while maintaining acceptable decoding accuracy, making it suitable for our implementation.

C. Implementation of FPGA Logic

We implemented the FPGA logic using Verilog and Tcl scripts, comprising approximately 9000 lines of code. The source code is publicly available at [41]. To support multi-qubit decoding, where boundaries dynamically change, we maintain an array of registers to track the state of each boundary. The coordinator has write access to these registers and updates them based on instructions from the root. Each Helios instance within the FPGA has read access to these registers, allowing it to determine the current state of the boundary for the logical qubit it is decoding.

D. Resource Usage

Table I presents the resource usage breakdown for the DECONET/HELIOS implementation configured to decode 100 logical qubits at d = 5. This configuration requires close to 1 million LUTs distributed across five FPGAs.

In the leaf nodes, decoder instances consume approximately 95% of the LUTs in the implementation, which is expected since decoding is the most compute-intensive task in the system. Adopting more resource-efficient decoder instances, such as [42], could potentially increase the system's decoding capacity per FPGA. We further explore these scalability limitations in §VIII-D.

In contrast to the decoder instances, the remaining logic in the leaf nodes consumes only around 14,000 LUTs and 21,000 registers, accounting for approximately 1.5% of the FPGA's resources. This small footprint leaves sufficient room for the compute-intensive decoder instances. The 1.5% includes the coordinator logic, inter-FPGA communication links, block-RAM-based FIFOs for inter-module communication, and peripheral logic for interacting with the ARM core.

The root node utilizes only 2% of the resources on the evaluation board, making it possible to use FPGAs with
TABLE I: Breakdown of resource usage for the configuration decoding
maximum number of logical qubits, 100 (d = 5) • Accuracy: How does the fusion-UF approach compare in
accuracy to alternative methods?
Component LUTs Registers BRAMs
We first describe our methodology and then present empiri-
Root Node
cal results answering these questions. To validate the system’s
Eos core 4389 5365 0
practicality, we decode a set of micro benchmarks.
Root Coordinator 53 646 0
Residual Logic 14004 8274 46 A. Methodology
Leaf Node (per FPGA)
Coordinator 25,038 3,859 0 To evaluate the accuracy of fusion-UF, we extend the soft-
Decoder Instances 201,022 98,981 0 ware simulation library [43] to support fusion-UF. Software
Eos core 6141 7317 0 simulations enable us to test higher code distances (d) without
Residual Logic 7753 13542 40 encountering hardware resource constraints.
For latency and throughput evaluations, we measure FPGA
clock cycles required for syndrome decoding using D E -
lower LUT counts for non-leaf nodes in D ECO N ET. How- CO N ET /H ELIOS. We define latency as the time between the
ever, existing evaluation boards with smaller LUT capacities availability of the last measurement round for a decoding
typically have fewer transceiver links, preventing us from block at the decoder and the availability of the decoded result.
pursuing this option. Using a board with fewer transceiver Inverse throughput is the time interval between a decoder
links would increase the tree height, leading to higher decoding instance accepting consecutive decoding blocks, normalized by
latency. Alternatively, the unused resources in non-leaf nodes d. We analyze the trends in latency and throughput to identify
can potentially be repurposed for other tasks in the quantum system limitations.
control stack, such as logical qubit routing. 1) Experimental setup: We use the five-FPGA Helios-
E. Fusion Union-Find

We introduce Fusion Union-Find (Fusion UF), our novel approach for merging multiple partitions of a decoding graph that are decoded using the Union-Find algorithm. It is an alternative to conventional merging techniques such as sliding window decoding and parallel window decoding [36, 37]. Fusion UF draws inspiration from Fusion Blossom [20], which uses a similar methodology to speed up MWPM-based decoding.

1) Merging using Fusion Union-Find: Fusion UF merges two blocks as follows. Initially, the Distributed Union-Find decoder processes the two blocks independently, treating their shared face as an artificial boundary. Any cluster with fully grown edges that touch this artificial boundary is considered even and does not grow further.

After completing the clustering phase in both blocks, the system removes this artificial boundary and recalculates the cluster parities. If any cluster in either block is odd, the decoder resumes the growing and merging phase in both blocks simultaneously until no odd clusters remain. Finally, the system moves to the peeling phase in each block.

In §VIII-C, we present empirical evidence demonstrating that Fusion UF achieves lower latency than other methods.

VIII. EVALUATION

The main objective of our evaluation is to assess the scalability of DECONET. To that end, we answer the following key questions:
• Latency and throughput growth: How do latency and throughput scale with the code distance and the number of merged logical qubits?
• Limitations: What are the scalability limits of DECONET?

…NET implementation described in §VII as the experimental platform. The ARM cores on each evaluation board generate sample syndromes and transfer them to the decoder instances in the FPGAs. To verify the correctness of our multi-FPGA implementation, we compare the logical state of each decoding block after every decoding round with results from offline simulations of the original UF decoder by Delfosse et al. [21], executed on the ARM core. The comparison reveals identical results between the software simulation and the hardware implementation. We perform 10⁶ trials for each error rate and code distance.

We use two configurations of the Helios decoder as decoder instances in our evaluation. In most experiments, we use the default Helios (Helios-1) configuration, which offers the best latency scalability with d. To demonstrate decoding of the maximum number of logical qubits, we use Helios-n, where n = d. Helios-n requires fewer FPGA resources than Helios-1, allowing the implementation to support more decoding blocks.

We use d = 5 as our default configuration, as we believe it would be a reasonable distance for the first experiments with multiple logical qubits on actual quantum hardware.

2) Noise Model: We use the phenomenological noise model [1] with measurement errors for our experiments. Prior studies widely use this model for single-qubit decoders [6, 38, 44]. The system can be easily extended to other noise models, such as circuit-level noise and erasure errors, as the Helios decoder supports both.

For most of our experiments, we use a default noise level of p = 0.001, consistent with prior works [6, 38, 42]. This is a reasonable assumption, as p = 0.001 is more than 10 times below the threshold, which is necessary to exponentially reduce the logical error rates. Furthermore, for scalability evaluations, we randomly merge and split adjacent decoding blocks with a 50% probability after every d rounds.
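The merge procedure of Fusion UF described in §E can be sketched as follows. This is our illustrative Python simplification, not the paper's FPGA implementation: a cluster is summarized by its defect count plus flags recording whether it grew into the artificial boundary (the shared face) or a real code boundary, and all clusters frozen at the shared face are assumed to join into one cluster once that boundary is removed.

```python
# Minimal sketch of the Fusion UF merge step (our illustration, not the
# DECONET RTL).  During independent per-block decoding, clusters whose
# growth reached the shared face were frozen and treated as even.

def fuse_and_regrow(block_a, block_b):
    """Fuse two independently decoded blocks and resolve odd clusters.

    Each block is a list of clusters:
        {'defects': int, 'artificial': bool, 'real': bool}
    Simplifying assumption: every cluster frozen at the shared face joins
    into a single fused cluster when the artificial boundary is removed.
    """
    fused = {'defects': 0, 'artificial': False, 'real': False}
    clusters = []
    for c in block_a + block_b:
        if c['artificial']:                 # boundary removed: recombine
            fused['defects'] += c['defects']
            fused['real'] = fused['real'] or c['real']
        else:
            clusters.append(dict(c))
    if fused['defects']:
        clusters.append(fused)

    def odd(c):  # odd parity and no real boundary to match into
        return c['defects'] % 2 == 1 and not c['real']

    # Resume the grow-and-merge phase; growth is abstracted here as merging
    # an odd cluster with a neighbor (or a real boundary if it is alone).
    while any(odd(c) for c in clusters):
        i = next(k for k, c in enumerate(clusters) if odd(c))
        if len(clusters) == 1:
            clusters[0]['real'] = True      # grew out to a real boundary
            break
        j = (i + 1) % len(clusters)
        a, b = clusters[i], clusters[j]
        merged = {'defects': a['defects'] + b['defects'],
                  'artificial': False,
                  'real': a['real'] or b['real']}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters   # every cluster now even or matched to a boundary
```

For example, two blocks that each finish phase 1 with one cluster frozen at the shared face (defect counts 1 and 1) fuse into a single even cluster, so no regrowth is needed and the decoder can proceed directly to peeling.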
[Fig. 7: Accuracy and scalability of the fusion-UF approach. (a) Logical error rate vs. physical error rate for two merged blocks (d = 3 to 13): fusion-UF shows no notable accuracy loss relative to a global UF decoder. (b) Decoding latency vs. number of merged blocks: fusion achieves lower latency than the parallel approach.]

[Figure: latency and inverse throughput (inv. tp.), broken down into decoder-instance and communication time. (a) Latency grows with d (5 to 11). (b) Latency grows with the number of merged logical qubits (4 to 64).]
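Table II reports inverse throughput normalized by d, and §F argues that decoding stays backlog-free because these values sit well below the 1 µs measurement round. As a quick arithmetic cross-check (the numbers come from Table II; the variable names are ours):

```python
# Cross-check of the backlog-free condition using Table II's inverse
# throughput column (ns per measurement round, normalized by d).  The
# decoder avoids a syndrome backlog when this value stays below the
# measurement round time, here 1 us = 1000 ns.
ROUND_NS = 1000.0

inv_throughput_ns = {          # values copied from Table II
    'meas_feedback':  86.9,
    'merge_split':    84.5,
    'cnot':           86.5,
    'magic_15_to_1': 123.1,    # worst case in the table
}

for name, tp in inv_throughput_ns.items():
    assert tp < ROUND_NS, f'{name} would accumulate a backlog'
    print(f'{name}: {ROUND_NS / tp:.1f}x headroom over the measurement rate')
# worst case: 1000 / 123.1 ~ 8.1x, matching the ~8x figure quoted in §F
```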
TABLE II: Microbenchmark results showing latency and inverse throughput (standard deviation in brackets). Inverse throughput is normalized by d.

Microbenchmark              # L. Qubits   # Rounds   Latency (ns)   Inv. Thpt (ns)
Meas.-based feedback        1             d          916 (191.1)    86.9 (22.2)
Merge + split               2             3d         2003 (355.0)   84.5 (35.1)
Move qubit                  3             3d         2087 (238.9)   87.2 (22.4)
CNOT                        3             3d         3258 (619.9)   86.5 (24.6)
CNOT (plane layout)         6             3d         3351 (484.8)   91.5 (19.8)
Single-ctrl multi-CNOT      5             3d         3249 (482.7)   91.7 (34.8)
State expansion             4             2d         2751 (536.4)   86.4 (19.7)
15-1 magic state distill.   24            5d         5701 (633.0)   123.1 (35.9)

… ns to 121 ns, which remains well below the 1 µs threshold. Based on the experimental results showing that the inverse throughput scales sublinearly with d, the largest decoding block that can fit in an FPGA determines the maximum d. On VMK-180 SoCs, we support up to d = 13. However, the design could potentially scale up to d = 23 when using a VU19P FPGA, the largest commercially available option. Even though the number of decoding blocks per FPGA is also limited by logic utilization, this bottleneck is easily addressed by distributing blocks across additional leaf nodes.

The number of leaves and inter-FPGA latency do not impose immediate limits on scalability, as they impact latency but not throughput. However, increasing inter-FPGA latency reduces the logical operation frequency, leading to a polynomial increase in circuit execution time. Additionally, when the number of leaves increases, the system will eventually bottleneck at the root node, which determines circuit control paths based on prior decoding results. We can potentially avoid this bottleneck by making distributed control decisions at intermediate nodes, but this direction requires further investigation.

E. Decoding 100 logical qubits

We decode up to 100 logical qubits of d = 5 using our five-FPGA implementation by employing the resource-efficient Helios-d configuration as the decoder instance. To support 100 logical qubits, each FPGA processes 25 logical qubits. Scaling beyond this point causes the decoding rate to fall below the measurement rate due to the limited scalability of the Helios-d configuration [12]. When decoding 100 logical qubits, the system achieves an average latency of 12.01 µs and an inverse throughput of 0.84 µs. While these values are significantly higher than those of the Helios-1 configuration, they remain within the bounds required for real-time decoding.

In terms of scalability, the Helios-d configuration reaches the decoding time limit before exhausting FPGA resources. Based on latency growth trends, we estimate that for an error rate of p = 0.001, Helios-d can support up to d = 25.

F. Microbenchmarks

We report the decoding latency and inverse throughput of selected microbenchmarks in Table II. Each entry lists the number of logical qubits and the number of measurement rounds required to implement the circuit and achieve fault-tolerant decoding. Because DECONET organizes decoding across FPGAs in a pipelined manner (Figure 3), some FPGAs process additional measurement rounds to complete the decoding of a logical circuit.

All microbenchmarks exhibit an average decoding latency below 30% of the measurement acquisition time. The inverse throughput is 8× faster than the measurement rate of 1 µs, ensuring that DECONET/HELIOS operates backlog-free across all benchmarks. We highlight two microbenchmarks that demonstrate key capabilities of DECONET: measurement-based feedback and 15-1 magic state distillation.

In measurement-based feedback, the system decodes a logical qubit and transmits the result to another FPGA, a control flow required by the T gate. The quantum hardware must idle until the result is available, and circuit execution can slow polynomially due to this latency [36]. In our experiments, the average latency for decoding and transmitting the result to another FPGA is 0.91 µs, more than five times faster than the time to measure d rounds. Across 10⁶ trials, we observe a worst-case latency of 2.25 µs, which is still over twice as fast as the time to measure d rounds. As DECONET/HELIOS can consistently deliver results before the next measurement round becomes available, it introduces minimal slowdown to quantum circuit execution.

15-1 magic state distillation, spanning 24 logical qubits, represents one of the largest merge-and-split-based operations required for FTQC. This circuit is essential for generating T gates and involves five rounds of CNOT gates [45]. A practical decoder must process this circuit faster than the rate of measurement to be useful for large-scale FTQC. DECONET decodes the corresponding 5d rounds in 5.7 µs for d = 5, achieving inverse throughput 8× faster than the rate of measurement.

The decoding latency of each microbenchmark also depends on how many FPGAs participate in decoding, which is determined by the placement of the logical circuit. To evaluate worst-case behavior, we map each circuit across FPGA boundaries whenever possible, maximizing inter-FPGA communication. Mapping a circuit to a single FPGA reduces latency; for example, a CNOT gate requires 1297 ns on a single FPGA versus 3258 ns across three. However, inverse throughput remains statistically unchanged due to DECONET's pipelined execution across all FPGAs.

G. Comparison with related work

We compare DECONET/HELIOS with QULATIS [19], the only prior hardware decoder design in the literature that provides detailed decoding latencies.

DECONET/HELIOS achieves significantly higher decoding accuracy than QULATIS, while QULATIS outperforms DECONET/HELIOS in latency and throughput. At p = 0.001 and d = 5, DECONET/HELIOS provides over two orders of magnitude better accuracy than QULATIS, and this gap widens with increasing d. This is because QULATIS uses a greedy decoding algorithm, which has orders of magnitude lower accuracy than Union-Find. QULATIS achieves lower latency due to its higher operating frequency. For 15-1 magic
state distillation at d = 5 and p = 0.001, QULATIS reports a latency of 1.16 µs based on SPICE-level simulation at 2 GHz, whereas DECONET/HELIOS achieves 5.7 µs when running at 100 MHz. In terms of cycle count, QULATIS requires 2235 cycles, while DECONET/HELIOS completes decoding in 570 cycles. For the same circuit, QULATIS achieves an inverse throughput of 46.7 ns, compared to 123.1 ns for DECONET/HELIOS, again primarily due to the clock frequency difference.

QULATIS also faces more stringent scalability limits. Its power consumption at cryogenic temperatures (4 K) limits the number of supported logical qubits. The authors' estimation suggests it can run 40 concurrent 15-1 distillation circuits of d = 9. In contrast, DECONET's scalability is limited by the root node's capacity to process decoded results, a bottleneck that does not emerge until scaling to thousands of logical qubits.

IX. CONCLUSION

We present DECONET, a network-integrated decoding system for fault-tolerant quantum computers that decodes a dynamic graph of multiple interacting logical qubits. DECONET introduces a scalable architecture that expands compute and I/O resources by trivially adding hardware, enabling the decoding of thousands of logical qubits. Using a five-FPGA implementation, DECONET/HELIOS decodes 100 logical qubits of distance five in real time. To the best of our knowledge, this is the highest number of interacting logical qubits decoded faster than the measurement rate by any system. Given this scalability, we consider DECONET a strong candidate for the logical qubit decoding layer in future fault-tolerant quantum computers.

ACKNOWLEDGMENT

This work was supported in part by Yale University and NSF MRI Award #2216030.
REFERENCES

[1] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, "Topological quantum memory," Journal of Mathematical Physics, 2002.
[2] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, "Surface codes: Towards practical large-scale quantum computation," Physical Review A, 2012.
[3] C. Horsman, A. G. Fowler, S. Devitt, and R. Van Meter, "Surface code quantum computing by lattice surgery," New Journal of Physics, 2012.
[4] F. Battistel, C. Chamberland, K. Johar, R. W. J. Overwater, F. Sebastiano, L. Skoric, Y. Ueno, and M. Usman, "Real-time decoding for fault-tolerant quantum computing: Progress, challenges and outlook," Nano Futures, 2023.
[5] E. Campbell, "A series of fast-paced advances in quantum error correction," Nature Reviews Physics, vol. 6, no. 3, 2024.
[6] P. Das, C. A. Pattison, S. Manne, D. M. Carmean, K. M. Svore, M. Qureshi, and N. Delfosse, "AFS: Accurate, fast, and scalable error-decoding for fault-tolerant quantum computers," in Proc. IEEE Int. Symp. High-Performance Computer Architecture (HPCA), 2022.
[7] O. Higgott and C. Gidney, "Sparse Blossom: correcting a million errors per core second with minimum-weight matching," arXiv preprint arXiv:2303.15933, 2023.
[8] S. Vittal, P. Das, and M. Qureshi, "Astrea: Accurate quantum error-decoding via practical minimum-weight perfect-matching," in Proc. ACM/IEEE Int. Symp. Computer Architecture (ISCA), 2023.
[9] B. Barber, K. M. Barnes, T. Bialas, O. Buğdaycı, E. T. Campbell, N. I. Gillespie, K. Johar, R. Rajan, A. W. Richardson, L. Skoric, C. Topal, M. L. Turner, and A. B. Ziad, "A real-time, scalable, fast and resource-efficient decoder for a quantum computer," Nature Electronics, vol. 8, no. 1, 2025.
[10] W. Liao, Y. Suzuki, T. Tanimoto, Y. Ueno, and Y. Tokunaga, "WIT-Greedy: Hardware system design of weighted iterative greedy decoder for surface code," in Proc. ACM Asia & South Pacific Design Automation Conf. (ASPDAC), New York, NY, USA, 2023.
[11] P. Das, A. Locharla, and C. Jones, "LILLIPUT: a lightweight low-latency lookup-table decoder for near-term quantum error correction," in Proc. ACM Int. Conf. Architectural Support for Programming Languages & Operating Systems (ASPLOS), 2022.
[12] N. Liyanage, Y. Wu, S. Tagare, and L. Zhong, "FPGA-based distributed union-find decoder for surface codes," IEEE Trans. Quantum Engineering, 2024.
[13] Y. Wu, N. Liyanage, and L. Zhong, "Micro Blossom: Accelerated minimum-weight perfect matching decoding for quantum error correction," in Proc. ACM Int. Conf. Architectural Support for Programming Languages & Operating Systems (ASPLOS), 2025.
[14] Xilinx Inc., "VMK180 Evaluation Board," 2021. [Online]. Available: https://www.xilinx.com/products/boards-and-kits/vmk180.html
[15] L. Lao, B. v. Wee, I. Ashraf, J. v. Someren, N. Khammassi, K. Bertels, and C. G. Almudever, "Mapping of lattice surgery-based quantum circuits on surface code architectures," Quantum Science and Technology, vol. 4, no. 1, 2018.
[16] G. Watkins, H. M. Nguyen, K. Watkins, S. Pearce, H.-K. Lau, and A. Paler, "A high performance compiler for very large scale surface code computations," Quantum, vol. 8, 2024.
[17] S. F. Lin, E. C. Peterson, K. Sankar, and P. Sivarajah, "Spatially parallel decoding for multi-qubit lattice surgery," arXiv preprint arXiv:2403.01353, 2024.
[18] H. Bombín, C. Dawson, Y.-H. Liu, N. Nickerson, F. Pastawski, and S. Roberts, "Modular decoding: parallelizable real-time decoding for quantum computers," arXiv preprint arXiv:2303.04846, 2023.
[19] Y. Ueno, M. Kondo, M. Tanaka, Y. Suzuki, and Y. Tabuchi, "QULATIS: A quantum error correction methodology toward lattice surgery," in Proc. IEEE Int. Symp. High-Performance Computer Architecture (HPCA), 2022.
[20] Y. Wu and L. Zhong, "Fusion Blossom: Fast MWPM decoders for QEC," in Proc. IEEE Int. Conf. Quantum Computing & Engineering (QCE), 2023.
[21] N. Delfosse and N. H. Nickerson, "Almost-linear time decoding algorithm for topological codes," Quantum, 2021.
[22] Y. Wu, N. Liyanage, and L. Zhong, "An interpretation of union-find decoder on weighted graphs," arXiv preprint arXiv:2211.03288, 2022.
[23] D. Litinski, "A game of surface codes: Large-scale quantum computing with lattice surgery," Quantum, vol. 3, 2019.
[24] B. M. Terhal, "Quantum error correction for quantum memories," Reviews of Modern Physics, 2015.
[25] C. Gidney and M. Ekerå, "How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits," Quantum, 2021.
[26] Z. Chen, K. J. Satzinger, J. Atalaya, A. N. Korotkov, A. Dunsworth, D. Sank, C. Quintana, M. McEwen, et al., "Exponential suppression of bit or phase errors with cyclic error correction," Nature, 2021.
[27] A. B. Ziad, A. Zalawadiya, C. Topal, J. Camps, G. P. Gehér, M. P. Stafford, and M. L. Turner, "Local clustering decoder: a fast and adaptive hardware decoder for the surface code," arXiv preprint arXiv:2411.10343, 2024.
[28] N. Ofek, A. Petrenko, R. Heeres, P. Reinhold, Z. Leghtas, B. Vlastakis, Y. Liu, L. Frunzio, S. M. Girvin, L. Jiang, et al., "Extending the lifetime of a quantum bit with error correction in superconducting circuits," Nature, 2016.
[29] Y. Xu, G. Huang, N. Fruitwala, A. Rajagopala, R. K. Naik, K. Nowrouzi, D. I. Santiago, and I. Siddiqi, "QubiC 2.0: An extensible open-source qubit control system capable of mid-circuit measurement and feed-forward," arXiv preprint arXiv:2309.10333, 2023.
[30] C. A. Ryan, B. R. Johnson, D. Ristè, B. Donovan, and T. A. Ohki, "Hardware for dynamic quantum computing," Review of Scientific Instruments, 2017.
[31] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell, et al., "Quantum supremacy using a programmable superconducting processor," Nature, 2019.
[32] Quantum Machines, 2023, accessed: 2023-09-17. [Online]. Available: https://www.quantum-machines.co/resources/brochure/
[33] Zurich Instruments, "Programmable quantum system controller," 2025. [Online]. Available: https://docs.zhinst.com/pdf/ziPQSC UserManual.pdf
[34] Qblox, "Cluster series mainframe," 2023. [Online]. Available: https://assets-global.website-files.com/653289e64ff83c71222f6bf2/65425680a033f49deaac899b QBLOX PRODUCTSHEET CLUSTER MAINFRAME V1 6 1.pdf
[35] Y. Wu, N. Liyanage, and L. Zhong, "LEGO: QEC decoding system architecture for dynamic circuits," arXiv preprint arXiv:2410.03073, 2024.
[36] L. Skoric, D. E. Browne, K. M. Barnes, N. I. Gillespie, and E. T. Campbell, "Parallel window decoding enables scalable fault tolerant quantum computation," Nature Communications, 2023.
[37] X. Tan, F. Zhang, R. Chao, Y. Shi, and J. Chen, "Scalable surface code decoders with parallelization in time," PRX Quantum, 2022.
[38] N. Liyanage, Y. Wu, A. Deters, and L. Zhong, "Scalable quantum error correction for surface codes using FPGA," in Proc. IEEE Int. Conf. Quantum Computing & Engineering (QCE), 2023.
[39] Aurora 64B/66B v12.0 LogiCORE IP Product Guide, AMD Inc., Nov. 2023. [Online]. Available: https://docs.amd.com/r/en-US/pg074-aurora-64b66b
[40] "Eos Core," https://github.com/yale-paragon/EosCore, 2024.
[41] "DecoNet: Network-Integrated Decoding System," https://github.com/yale-paragon/DecoNet, 2025.
[42] B. Barber, K. M. Barnes, T. Bialas, O. Buğdaycı, E. T. Campbell, N. I. Gillespie, K. Johar, R. Rajan, A. W. Richardson, L. Skoric, C. Topal, M. L. Turner, and A. B. Ziad, "A real-time, scalable, fast and highly resource efficient decoder for a quantum computer," arXiv preprint arXiv:2309.05558, 2023.
[43] "QEC Playground," https://github.com/yuewuo/QEC-Playground, 2023.
[44] A. Holmes, M. R. Jokar, G. Pasandi, Y. Ding, M. Pedram, and F. T. Chong, "NISQ+: Boosting quantum computing power by approximating quantum error correction," in Proc. ACM/IEEE Int. Symp. Computer Architecture (ISCA), 2020.
[45] A. G. Fowler and C. Gidney, "Low overhead quantum computation using lattice surgery," arXiv preprint arXiv:1808.06709, 2019.