Fast Algorithms for Spiking Neural Network Simulation with FPGAs

Björn A. Lindqvist and Artur Podobas

arXiv:2405.02019v1 [cs.NE] 3 May 2024
cessors (CPUs and GPUs), their underlying compute fabric is composed of a large number of reconfigurable blocks of different types. The most common blocks are look-up tables (LUTs, often several hundred thousands), digital signal processing blocks (DSPs – capable of tens of TFLOP/s), or on-chip random-access memory (BRAM – tens to hundreds of MB) (Langhammer et al., 2021; Murphy and Fu, 2017). These resources allow designers to create custom hardware that sacrifices generality for better performance and lower energy consumption than general-purpose devices, as well as transcending the latters' limitations (i.e., the mentioned von Neumann bottleneck). Importantly, these devices can, and often already do (Sano et al., 2023; Meyer et al., 2023; Boku et al., 2022), live side-by-side in HPC nodes and can be reconfigured before runtime (Vipin and Fahmy, 2018): when the user is running a brain simulation, the FPGA will be configured as an efficient brain simulator, while for a different application a different accelerator will be used. In short, FPGAs facilitate the use of special hardware accelerators during application runtime but are general enough that different accelerators can be configured between applications. This makes them an attractive choice for neuroscience simulation since they can be reconfigured for different neuron, synapse, and axon models, which dedicated neuromorphic ASIC hardware, whose silicon is immutable, cannot be.

Historically, FPGAs have been designed using low-level hardware description languages (HDLs) such as VHDL and Verilog (Perry, 2002). These languages have a steep learning curve and require specialist knowledge to be comfortable with. However, with the increased maturity of High-Level Synthesis (HLS) tools in the last two decades (Nane et al., 2015), there has been a resurgence of interest in using FPGAs for HPC. HLS allows designers to describe hardware in relatively high-level languages such as C and C++ whose learning curves are shallower (Podobas et al., 2017; Zohouri et al., 2016). For example, FPGAs have been used to accelerate Computational Fluid Dynamics (Karp et al., 2021; Faj et al., 2023), Quantum circuit simulations (Podobas, 2023; Aminian et al., 2008), Molecular Dynamics (Sanaullah and Herbordt, 2018), N-Body systems (Del Sozzo et al., 2018; Menzel et al., 2021; Huthmann et al., 2019), and much more, demonstrating advantages over alternative solutions.

In this work, we use HLS to design simulators for the Potjans-Diesmann cortical microcircuit (Potjans and Diesmann, 2014). While there is ample prior work on FPGA-based neuromorphic systems (see section 5.4 for related work), our system is (to the best of our knowledge) the most energy-efficient simulator of the Potjans-Diesmann circuit in existence (25 nJ/event), while reaching a faster-than-realtime (≈1.2x) simulation speed on a single FPGA. We use Intel's OpenCL SDK for FPGA HLS toolchain (Czajkowski et al., 2012) to design our simulators, but our designs are modular enough to easily be ported to other HLS-based systems (e.g., Vivado (O'Loughlin et al., 2014) or OneAPI) and other FPGAs. Our contributions are:

• The first simulators of the previously mentioned circuit on a single FPGA, running faster than real-time.

• The most energy-efficient simulators for the circuit when measured by energy per synaptic event.

• The presentation and analysis of the algorithms, thought processes, trade-offs, and lessons learned while designing these simulators.

• An empirically motivated analysis of what hardware features are required to simulate the circuit even faster than what we are capable of.

The rest of this article is structured as follows. In section 2, we discuss SNNs in general and the microcircuit in particular. We explain how FPGAs work and we briefly introduce the HLS design methodology. In section 3 we discuss SNN simulation and present the algorithms and ideas underlying our simulators. The utility of many of the ideas has already been demonstrated in other parts of Computer Science, but not in connection with SNN simulation. Hence, we believe they deserve a thorough treatment. We evaluate many different variants and parametrizations of our simulators in section 4. Finally, in section 5 we put our results in perspective and compare them with the state-of-the-art.
2 Material and Methods

We begin with an overview of spiking neural networks (SNNs) before discussing the Potjans-Diesmann cortical microcircuit, an SNN for simulating a small part (microcircuit) of the mammalian brain. In section 2.3 we explain how FPGAs work and what makes them different from conventional hardware. In sections 2.4 to 2.6 we introduce HLS and how we use OpenCL for HLS.

2.1 Spiking Neural Networks

An SNN is an artificial neural network (ANN) that transfers signals in time-dependent bursts, i.e. spikes. Unlike other ANNs, SNNs are designed with biological plausibility in mind, making them useful for neuroscience. SNNs are usually modelled as directed (multi-)graphs, where vertices represent neurons and edges synaptic connections between neurons. Neurons have a membrane potential that varies over time. When the potential exceeds a threshold the neuron discharges – spikes – and sends current via its synapses to its neighbours, which they receive after a synapse-specific delay. The amount of current as well as the transfer time is synapse-specific (Han et al., 2020). The neuron's statefulness and the non-differentiable, discontinuous signal transfer function are two fundamental aspects distinguishing SNNs from other ANNs. While these basic operating principles are enough to describe most SNNs, SNNs vary in neuron model and other parameters. In this work, we use the basic leaky integrate-and-fire (LIF) neuron model, defined by

$$RC\frac{du}{dt} = u_{rest} - u(t) + RI(t). \tag{1}$$

The equation describes the membrane potential over time. The variables R and C are the resistance and capacitance of the membrane, u(t) its potential at time t, I(t) the amount of current it receives at time t from its neighbours, and u_rest its resting potential. With τm = RC and u(0) = u_rest = 0, the solution to the equation is

$$u(t) = RI(t) - RI(t)e^{-t/\tau_m}. \tag{2}$$

Solving the equation using forward Euler produces a recurrent, discrete representation of the neuron's potential over time. For small ∆t,

$$\tau_m \frac{u(t+\Delta t) - u(t)}{\Delta t} = -u(t) + RI(t) \tag{3}$$

$$\implies u(t+\Delta t) = u(t) + \frac{\Delta t}{\tau_m}\bigl(-u(t) + RI(t)\bigr) \tag{4}$$

$$\implies u(t+\Delta t) = \Bigl(1 - \frac{\Delta t}{\tau_m}\Bigr)u(t) + \frac{\Delta t\,R}{\tau_m}I(t). \tag{5}$$

The solution forms the basis of step-wise simulation of LIF neurons.

After a neuron spikes it enters its refractory state. Its potential becomes fixed at u_reset and it ceases to respond to stimuli for a duration controlled by the τref parameter. Usually, the refractory period is in the order of milliseconds, and for simplicity one sets u_reset = u_rest = 0. With r(t) denoting how long the neuron will stay refractory at time t, u_thr the neuron's spiking threshold, and ∆t arbitrarily set to one, we can incorporate refractoriness into the LIF model:

$$r(t+1) = \begin{cases} \tau_{ref} & \text{if } u(t+1) \ge u_{thr} \\ r(t) - 1 & \text{else if } r(t) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

$$u(t+1) = \begin{cases} \bigl(1 - \frac{1}{\tau_m}\bigr)u(t) + \frac{R}{\tau_m}I(t) & \text{if } r(t) = 0 \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

2.1.1 Network topology

A major difference between conventional ANNs and SNNs is that the former often are layered; all neurons in one layer only receive inputs from neurons in the previous layer and only send outputs to neurons in the following layer. If all neurons in a layer send output only to all neurons in the following layer, the layer is said to be fully-connected and its signal transfer can be represented as a matrix-vector multiplication. SNNs can similarly be layered, and this has been found to be a good approach for classification (Zheng et al., 2021). But it is not suitable for simulation, as the neurons in real brains are not organized into fully-connected layers. Instead, their topology is
"chaotic" and full of recurrent connections, self-loops (autapses), and multiple edges (multapses). This has far-reaching consequences for what data structures are appropriate for SNNs. An adjacency matrix, for example, is not enough to represent their topological richness.

Figure 1: A fully-connected neural network with two hidden layers, two input neurons, and three output neurons.

Name    Value   Description (Unit)
∆t      0.1     Time step duration (ms)
Cm      250     Membrane capacity (pF)
τm      10      Membrane time constant (ms)
τref    2       Refractory period (ms)
τsyn    0.5     Postsyn. current time constant (ms)
urest   −65     Resting and reset potential (mV)
uthr    −50     Spiking threshold (mV)
vth     8       Thl. neurons' mean spiking rate (Hz)
ωext    0.15    Thl. spikes amplitude (mV)

Table 1: Microcircuit's general and simulation parameters.

2.2 Potjans-Diesmann's Microcircuit

In 2014 Potjans and Diesmann (2014) compiled the results of a dozen empirical studies to create a full-scale model of a cortical microcircuit. Its 77,169 neurons are grouped into eight populations spanning four neocortical layers;¹ L23, L4, L5, and L6.² Each layer is subdivided into one excitatory population that increases neural activity and one inhibitory population that decreases it. Around 300 million synapses connect the neurons.

The microcircuit is a balanced random network: neural activity is balanced between excitatory neurons that increase activity and inhibitory ones that dampen it (Brunel, 2000). Connectivity and features are sampled from parametric probability distributions, rather than set explicitly. Table 2 and table 3 specify these distributions' parameters. For example, the initial potential of every inhibitory neuron in L23 is set by sampling a Gaussian with mean −63.16 mV and standard deviation 4.57 mV, and the expected number of synapses from population L23/inh to population L5/exc is 5,834 · 4,850 · 0.0755 ≈ 2 million. Synapses are sampled with replacement – multiple synapses can connect the same neuron pair. While the neurons are arranged in terms of neocortical layers, the layers' connection probabilities show that the network's topology is not layered; neurons in most populations can connect to neurons in any of the other populations.

In addition to the synapses within the cortical column, the circuit receives spikes from external neurons – thalamic input. Column K in table 2 specifies the number of thalamic neurons a given population's neurons receive spikes from, and parameter vth in table 1 how frequently thalamic neurons spike.³ The expected number of thalamic spikes received per second by all neurons in a population is vth·K. For example, neurons in population L23/exc receive about 8 · 1,600 = 12,800 thalamic spikes per second. The amplitude of all thalamic synapses is fixed at ωext = 0.15 mV. As thalamic spikes can be computationally expensive to simulate, Potjans and Diesmann (2014) suggest approximating them with constant direct current injected at a rate of vth·K·ωext·τsyn mV per second. In our simulator we use this approximation.

¹ These layers are not analogous to layers in conventional ANNs.

i  Pop.     Ni      K      uinit             ωi               δi
1  L23/exc  20,683  1,600  N(−68.28, 5.36)   N(0.15, 0.015)   N(1.5, 0.75)
2  L23/inh  5,834   1,500  N(−63.16, 4.57)   N(−0.6, 0.06)    N(0.75, 0.325)
3  L4/exc   21,915  2,100  N(−63.33, 4.74)   N(0.15, 0.015)   N(1.5, 0.75)
4  L4/inh   5,479   1,900  N(−63.45, 4.94)   N(−0.6, 0.06)    N(0.75, 0.325)
5  L5/exc   4,850   2,000  N(−63.11, 4.94)   N(0.15, 0.015)   N(1.5, 0.75)
6  L5/inh   1,065   1,900  N(−61.66, 4.55)   N(−0.6, 0.06)    N(0.75, 0.325)
7  L6/exc   14,395  2,900  N(−66.72, 5.46)   N(0.15, 0.015)   N(1.5, 0.75)
8  L6/inh   2,948   2,100  N(−61.43, 4.48)   N(−0.6, 0.06)    N(0.75, 0.325)

Table 2: Population-specific parameters. The first four columns denote the index, i, the name, the size, Ni, and the number of thalamic connections, Ki, of the eight populations. The last three columns denote the Gaussians from which the neurons' initial potential (mV), the neurons' synapse amplitudes (mV), and the delays (ms) of excitatory postsynaptic potential are sampled. However, synapse amplitudes from population L4/exc to L23/exc are sampled from N(0.3, 0.03) and not N(0.15, 0.015).

Table 3: Probability that a random neuron in the population specified by the rows is connected to a random neuron in the population specified by the columns.
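To make the sampling scheme concrete, the following Python sketch (ours, for illustration; the population names and parameters come from tables 1 to 3) reproduces the two expected-value calculations above and draws one per-neuron parameter:

    import random

    # Population sizes and thalamic fan-in from table 2.
    N_L23_INH, N_L5_EXC = 5_834, 4_850
    K_L23_EXC, V_TH = 1_600, 8          # thalamic fan-in; thalamic rate (Hz)

    # Expected L23/inh -> L5/exc synapse count: each of the N_pre * N_post
    # pairs is connected with probability p (table 3); sampling is with
    # replacement, so the same pair may be connected several times.
    p = 0.0755
    print(N_L23_INH * N_L5_EXC * p)      # ~2.1 million

    # Expected thalamic spikes per second into all of L23/exc.
    print(V_TH * K_L23_EXC)              # 8 * 1600 = 12800

    # Per-neuron features come from the Gaussians in table 2, e.g. the
    # initial potential (mV) of one L23/inh neuron.
    u_init = random.gauss(-63.16, 4.57)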
The synaptic parameters are also scaled: the synapse amplitude is scaled by wf, which is a function of the membrane time constant, τm, the membrane capacity, Cm, and the postsynaptic time constant, τsyn, that maps postsynaptic potential to postsynaptic current:⁴

$$d = \tau_{syn} - \tau_m \tag{8}$$
$$p = \tau_{syn}\tau_m \tag{9}$$
$$q = \tau_m/\tau_{syn} \tag{10}$$
$$w_f = \frac{C_m d}{p\,(q^{\tau_m/d} - q^{\tau_{syn}/d})} \approx 585 \tag{11}$$

The constants p22 and p11 define the membranes' and presynaptic currents' decay rates:

$$p_{11} = \exp(-\Delta t/\tau_{syn}) \approx 0.82 \tag{12}$$
$$p_{22} = \exp(-\Delta t/\tau_m) \approx 0.99 \tag{13}$$

The injection of the presynaptic current is scaled by p21:

$$\beta = \tau_{syn}\tau_m/(\tau_m - \tau_{syn}) \tag{14}$$
$$\gamma = \beta/C_m \tag{15}$$
$$p_{21} = p_{11}\,\gamma\,(\exp(\Delta t/\beta) - 1) \approx 0.00036 \tag{16}$$

The subscripts match those found in the source code for the NEST simulator (Plesser et al., 2015) and have no deeper meaning here.⁵ Taken together, this gives us the following discrete recurrences for the step-wise update of membrane potential, ut:

$$u_{t+1} = \begin{cases} u_{reset} & \text{if } r_t > 0 \\ p_{22}u_t + I_t p_{21} & \text{otherwise,} \end{cases} \tag{17}$$

presynaptic current, It:

$$I_{t+1} = p_{11}I_t + T_t w_f \omega_{ext}, \tag{18}$$

and refractoriness, rt:

$$r_{t+1} = \begin{cases} \tau_{ref} & \text{if } u_{t+1} \ge u_{thr} \\ r_t - 1 & \text{else if } r_t > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{19}$$

The variable Tt denotes the number of thalamic spikes received by the presynapse at time t and is modelled as a Poisson-distributed random variable with mean vth·K·∆t.

⁴ See Hanuschkin et al. (2010) for the derivation.
⁵ https://github.com/nest/nest-simulator/blob/
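As a sanity check on the constants and recurrences above, the following sketch (ours, not the authors' code) evaluates equations (8) to (16) from the table 1 parameters and applies equations (17) to (19) for one neuron. It assumes the shifted-potential convention u_reset = u_rest = 0, under which the spiking threshold sits 15 mV above rest:

    from math import exp

    DT, C_M, TAU_M, TAU_SYN = 0.1, 250.0, 10.0, 0.5   # table 1
    U_THR = -50.0 - (-65.0)     # threshold relative to u_rest (mV)
    T_REF_TICS = 20             # 2 ms refractory period / 0.1 ms tick

    d = TAU_SYN - TAU_M
    p = TAU_SYN * TAU_M
    q = TAU_M / TAU_SYN
    wf = C_M * d / (p * (q ** (TAU_M / d) - q ** (TAU_SYN / d)))  # ~585
    p11 = exp(-DT / TAU_SYN)                                      # ~0.82
    p22 = exp(-DT / TAU_M)                                        # ~0.99
    beta = TAU_SYN * TAU_M / (TAU_M - TAU_SYN)
    p21 = p11 * (beta / C_M) * (exp(DT / beta) - 1)               # ~0.00036

    def step(u, i_syn, r, recurrent_in, thalamic_in):
        # One tick of equations (17)-(19) for a single neuron.
        # recurrent_in: weighted spike current arriving this tick
        # (delivered via the spike buffer in the later algorithms);
        # thalamic_in: T_t * wf * w_ext.
        u = 0.0 if r > 0 else p22 * u + i_syn * p21          # eq. (17)
        i_syn = p11 * i_syn + recurrent_in + thalamic_in     # eq. (18)
        if u >= U_THR:                                       # eq. (19)
            u, r = 0.0, T_REF_TICS
        elif r > 0:
            r -= 1
        return u, i_syn, r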
Having presented the theoretical foundations for LIF SNNs and the specifics of the Potjans-Diesmann microcircuit, we now present the technologies we implement our simulator with. We return to SNN simulation in section 3, where we both delve deep into simulation methods and present our simulators.

2.3 Field-Programmable Gate Arrays

An FPGA is a type of reprogrammable integrated circuit. First marketed by Altera and Xilinx in the 1980s, FPGAs have found uses in many niches of the electronics industry – in avionics, in telecommunications, and in VLSI design – because of their unique blend of performance, flexibility, and development costs (Trimberger, 2015). FPGAs are not as performant as Application-Specific Integrated Circuits (ASICs), but ASICs are extremely expensive to develop, which makes them cost-prohibitive unless the number of units produced runs in the tens of millions. Furthermore, ASICs are not reprogrammable and thus suitable only for the specific tasks they were developed for. Central Processing Units (CPUs), on the other hand, are flexible and inexpensive, but have low performance. Graphics Processing Units (GPUs) sit between CPUs and ASICs on the flexibility-performance spectrum. While GPUs can run any computation, they generally only excel at highly regular, numerically intensive computations. In particular, they do not handle divergent control flow well.

FPGAs trade performance for flexibility in a different manner than CPUs, GPUs, and ASICs. They consist of configurable blocks organized in a grid-like fabric, and each block can be configured to compute a small and specific function, like the logical and of two one-bit signals or comparing two eight-bit numbers. The number of configurable blocks varies enormously from one FPGA to another; high-end FPGAs can contain hundreds of thousands or even millions of blocks. For performance-sensitive workloads, FPGAs' advantage is the lack of instruction processing overhead. CPUs and GPUs load programs from memory and decode and process their instructions one after the other, while FPGAs merely pass data through a pre-configured, fixed-function circuit, similar to how ASICs operate.⁶ However, the reprogrammability of FPGAs comes at a great cost, and their raw performance cannot rival that of ASICs or GPUs unless algorithms are specifically designed for them. They also run at lower clock frequencies than comparable CPUs from the same generation. For example, our simulators run at around 600 MHz,⁷ while Intel Core i9 processors operate at over 3 GHz.

⁶ It is of course true that superscalar and pipelined processors can handle many instructions in parallel. But the point stands; the overhead caused by the need to decode and dispatch instructions is large.
⁷ However, Langhammer and Constantinides (2023) ran a
Figure 2: A LUT for computing any two-variable boolean function, implemented as one 4:1 multiplexer and four memory bits. The bit values determine which function the LUT computes.

To stay competitive, FPGA implementations need to carry out much more work per clock cycle than equivalent CPU implementations. These potential drawbacks aside, researchers have deployed FPGAs for performance-sensitive workloads, with promising results.

Bitstreams are files that configure FPGA blocks, and specialized tools write them to the FPGA's fabric. Bitstreams are to FPGAs what machine code is to microprocessors. Tools that take descriptions of the digital circuit the FPGA should implement and produce bitstreams are called synthesizers. The descriptions are usually written in hardware description languages (HDLs), but many tools support higher-level languages as well.

A configurable block is a logic block that functions as a small combinational circuit that computes the boolean function it was configured for. Depending on configuration, the same logic block can serve as an and-gate, or-gate, xor-gate, etc. Memory embedded adjacent to the block stores its configuration, and the block's output is retrieved from this memory. Because the blocks look up their results from memory they are called look-up tables (LUTs). Figure 2 shows a LUT for a two-input and-gate built with a multiplexer and four memory bits. Flip-flops (FFs) are the FPGA's basic memory blocks and work as small and fast RAMs. Each FF stores one bit, but multiple FFs can be combined into registers to store larger datums.
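As a software analogy (ours, not from the paper), a two-input LUT is nothing more than four stored configuration bits addressed by the inputs, exactly as in figure 2; choosing the bits selects which of the 16 possible two-variable boolean functions the block computes:

    def make_lut2(bits):
        # bits[0..3] is the stored configuration; the two inputs form the
        # address that selects one bit, like a 4:1 multiplexer.
        def lut(a, b):
            return bits[(a << 1) | b]
        return lut

    and_gate = make_lut2([0, 0, 0, 1])   # outputs 1 only when a = b = 1
    xor_gate = make_lut2([0, 1, 1, 0])
    assert and_gate(1, 1) == 1 and and_gate(1, 0) == 0
    assert xor_gate(1, 0) == 1 and xor_gate(1, 1) == 0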
In addition to configuring the FPGA's logic and memory blocks, bitstreams define how the blocks should be connected through the FPGA's interconnect network. This is called routing and is one of the synthesizer's most important tasks. Routing is critical to the design, as the interconnect occupies most of the FPGA's fabric (Betz and Rose, 1999). Good routing should minimize the total length of the interconnect, the number of blocks required, and the length of the longest path connecting two blocks, as the operating frequency of the design is bounded by this length. Routing is a challenging problem in computer science, and finding high-quality routing of large designs is both difficult and time-consuming.

Due to their reconfigurability, digital logic implemented in LUTs is slower and requires more components than equivalent logic implemented in non-reconfigurable ASIC gates. Memories built using FFs have lots of overhead because FFs only store one bit. Therefore, modern FPGAs come with non-reconfigurable blocks for arithmetic and storage. One can view these blocks as small ASICs embedded in the FPGA that the designer connects via the configurable blocks. For example, our Agilex 7 FPGA has five different types of blocks: 487,200 Adaptive Logic Modules (ALMs) which function as LUTs, 1,948,800 flip-flops, 7,110 M20K RAM blocks, two 18.432 Mb eSRAM blocks, and 4,510 DSPs. We describe these blocks in the following sections.

2.3.1 Adaptive Logic Module

The Adaptive Logic Module (ALM) in figure 3 is the basic building block of Intel's family of FPGAs. Consisting of one fractured eight-input LUT, four FFs, and two full-adders, it is a versatile multi-purpose block and can – depending on its configuration – compute two four-input boolean functions, one six-input boolean function, or perform four-bit addition with carry. It can also serve as a four-bit memory.
Figure 3: Simplified schematic of the Agilex 7 ALM. The multiplexers' (trapezoids) control signals (not shown) and the contents of the LUT configure the ALM. The ALM can serve as – among other things – a four-bit adder, a four-bit memory, or as combinational logic of six inputs, depending on configuration.

2.3.2 Digital Signal Processing

The Digital Signal Processing (DSP) block contains functions for multiplication, addition, subtraction, and accumulation. The block functions as the FPGA's arithmetic logic unit (ALU). Agilex 7 has two types of DSPs; one for integer arithmetic and one for IEEE 754 floating-point arithmetic in single- and half-precision modes. Pipeline registers organized into three stages are contained within the DSP. Data can be routed through one or more of the stages or bypass them completely to achieve a given latency. This can be useful if one operand is available multiple cycles before the other.

2.3.3 Memory Hierarchy

In processor-based hardware, registers, caches, and main memory are organized into a fixed hierarchy. Applications must be structured around the memory hierarchy to use it optimally. Typically, data transfer between the hierarchy's levels is implicit and not directly under the programmer's control. On an FPGA the designer constructs the memory hierarchy from the available memory blocks and has fine-grained control over exactly where data is stored. This allows the memory system to be tailored to the application rather than the other way around.

The FPGA's memory is on-chip if it is embedded in the FPGA fabric itself and off-chip otherwise. On-chip memory is much faster and smaller than off-chip memory. The Agilex 7 has three types of on-chip memory: Memory Logic Array Blocks (MLABs), M20Ks, and embedded SRAM (eSRAM). Scalars and shallow FIFOs are often stored in MLABs, small arrays and caches in M20Ks, and larger buffers in eSRAM. MLABs are not memory blocks per se, but instead a technique for using ALMs as memory. The Agilex 7 can combine ten unused ALMs into one 640-bit register. As many FPGA designs do not use all logic blocks, repurposing them can be very useful. MLABs have very low latencies since they are close to the logic that uses them. Xilinx FPGAs implement the same concept under the name distributed RAM or LUTRAM. On-chip M20Ks are also known as block RAM (BRAM) and are much larger than MLABs. The Agilex 7 contains around seven thousand M20Ks, each of which can hold 20 kbit of data, as their name suggests – in total, about 139 Mbit. The board also has two 36 Mbit embedded SRAM blocks.
These have high bandwidth and high random transaction rates and complement the other on-chip memory blocks. Unfortunately, the Intel OpenCL FPGA compiler cannot infer eSRAM, so they are unused in this work. On the Agilex 7, four 8 GB DRAM sticks constitute the board's off-chip memory. Accessing off-chip memory takes much longer than accessing on-chip memory, with access times measured in the hundreds of cycles.

In addition to controlling the types of memory used, the designer can also configure individual memory blocks. Multipumping, for example, can sometimes double memory throughput at the expense of significantly lowering the design's operating frequency. Banking can increase expected memory access concurrency, but also increase stalls due to conflicts, so the technique is best reserved for evenly distributed data.

2.4 High-Level Synthesis

Designers traditionally design circuits in Hardware Description Languages (HDLs) such as VHDL and (System)Verilog. With these the designer specifies the behaviour of all the circuit's logic gates and flip-flops. The result is a Register-Transfer Level (RTL) design, so called because it models the register-to-register transfer of the circuit's signals. A synthesizer takes the RTL design and produces a low-level representation (netlist) of it, which can be lowered further to create an ASIC or transformed into an FPGA bitstream. The synthesizer's job is far from straightforward and it must – among other things – place every component of the circuit on a two-dimensional grid and ensure that it can operate at the desired clock frequency.

While HDLs offer a great deal of control over the resulting circuit, they are low-level and lack support for many high-level programming constructs, so using them can be tedious and error-prone. It often makes sense to work in a higher-level language instead. That workflow is called High-Level Synthesis (HLS) and is supported by many tools. For example, Intel's FPGA software generates synthesizable Verilog code from designs coded in the imperative C-like language OpenCL.

The main worry of using HLS is that the hardware will not be as efficient as if an HDL had been used. As the designer works in an imperative, high-level language, their view of the hardware may be obscured and not easy to visualize. Furthermore, the machine-generated HDL produced by HLS tools is often opaque and near impossible to understand. These fears may be unfounded, however, as many studies have found that HLS does not degrade performance and sometimes even improves it (Lahti et al., 2019).

2.5 OpenCL

In 2009 the Khronos Group published the first version of the OpenCL (Open Computing Language) standard (Munshi, 2009). The goal was to replace all vendor-specific languages and application programming interfaces (APIs) with a common and portable standard for writing high-performance code across all kinds of devices. An OpenCL application (unlike one written in a competing technology such as CUDA) can run unmodified on any device with a compliant OpenCL implementation. Today OpenCL runs on millions of CPUs, GPUs, DSPs, and other computing devices.
The OpenCL standard consists of three parts: a specification for a programming language, a host API, and a device API. The host API runs on the user's main computing device (e.g., a PC) and controls their accelerator (e.g., a GPU), whose functionality is accessible through the device API. Thus, OpenCL prescribes both how the programmer should program the accelerator and how to manage it from an external device.

The OpenCL language closely resembles C. It supports functions, loops, multiple variable scopes, aggregate data types, and many other programming constructs familiar to C programmers. A big difference from C is that OpenCL comes with three explicit pointer spaces: global, local, and private. These grant the programmer fine-grained control over data storage. What constitutes global, local, and private memory is device-specific. Generally the largest and slowest memory space is global, while the smallest and fastest is private. Global memory may be sized in gigabytes, while private memory may only be a few kilobytes. OpenCL has native support for SIMD types to make it easier to write algorithms for parallel hardware. Unlike C, OpenCL specifies the bit width of most builtin types – an int is always 32 bits and a long 64 bits.

OpenCL was designed with massively parallel architectures in mind (e.g., GPUs) and has builtin support for concurrency in the form of work items. A work item is a small indivisible unit of work with, ideally, no dependencies on other work items. The OpenCL runtime can schedule independent work items to maximally exploit the hardware's potential. Consider the kernels in listing 4 for computing the element-wise vector product.

    __kernel void mul_sd(
        __global float *A,
        __global float *B,
        __global float *C,
        int N) {
      for (uint i = 0; i < N; i++)
        C[i] = A[i] * B[i];
    }
    __kernel void mul_nd(
        __global float *A,
        __global float *B,
        __global float *C,
        int N) {
      uint i = get_global_id(0);
      if (i < N)
        C[i] = A[i] * B[i];
    }

Figure 4: Single and multiple work item OpenCL kernels for the element-wise vector product.

Launching the first kernel – mul_sd – causes OpenCL to instantiate one kernel on one core which runs N iterations of the for loop. Launching the second kernel, however, causes OpenCL to instantiate up to N kernels, each of which runs on an available core and computes one iteration of the for loop. Since the second kernel runs more computations in parallel, it likely runs much faster.

Sometimes the algorithm's work cannot feasibly be separated into completely independent work items. For such situations, work items that must be synchronized can be organized into work groups, which share local memory. However, synchronization is impossible between work groups.
2.6 OpenCL on Intel FPGAs

As FPGAs work very differently from processor-based hardware, Intel's OpenCL implementation for FPGAs differs in important ways from other OpenCL implementations. One major difference concerns parallelism. CPUs and GPUs have multiple cores to run multiple computations in parallel. Thus, the workload of an OpenCL program organized into multiple work items can be mapped onto multiple cores. FPGAs do not have any cores in the usual sense, and it is difficult for Intel's OpenCL FPGA compiler to synthesize code structured around work items into efficient FPGA bitstreams. Instead, Intel recommends designers to structure OpenCL code as single-work-item kernels, and to derive most of the parallelism from executing multiple loop iterations concurrently; a type of concurrency known as "pipeline parallelism".

    for (uint i = 0; i < N; i++) {
      B[i] = h(g(f(A[i])));
    }

Figure 5: A loop that could benefit from pipeline parallelism. The functions f, g, and h are assumed to be short and inlineable.

For example, assume the three functions f, g, and h in listing 5 are short, side-effect free, and represent computations that can be performed in less than one clock cycle. The FPGA compiler can create a three-stage pipeline for the loop, where combinational circuits for f, g, and h constitute the pipeline's three stages. This pipeline can process three iterations of the loop in parallel; while the h-circuit processes data for the first, the g-circuit processes data for the second, and the f-circuit for the third iteration, and so on. While the latency of every iteration is three cycles, the throughput (initiation interval) is one iteration per cycle, and the latency for the loop as a whole is N + 2 cycles, since it takes 2 cycles to fill the pipeline. Though, this assumes that the latency of every function is fixed and predictable. If one function executes operations with variable latencies, such as memory accesses, the pipeline may need to stall. Moreover, pipeline parallelism cannot increase the throughput of the loop beyond one iteration per cycle – for that one has to use other techniques.

The FPGA compiler supports many compiler directives ("pragmas") that can help it better optimize difficult loops or loops containing invariants it cannot prove on its own. Two important directives are #pragma ivdep and #pragma unroll N. The first tells the compiler that the loop contains no loop-carried dependencies and that it therefore can reorder the loop iterations as it pleases. The second tells it to duplicate the loop body N times. This means that the resulting circuit will use N times as many gates but, potentially, run N times faster. Other compilers ignore these directives.

The FPGA compiler extends OpenCL with syntax for declaring channels for inter-kernel communication. Channels resemble the pipes feature of OpenCL 2.0, which the FPGA compiler does not support. Channels are implemented in on-chip memory as first-in first-out (FIFO) buffers of the desired depths and are very fast. Figure 6 shows a simple producer-consumer example, where one kernel calls write_channel_intel to put jobs on the FIFO and the other calls read_channel_intel to remove them. Both functions block if the FIFO is full or empty.

    channel uint jobs
        __attribute__ ((depth(512)));
    __kernel void consumer() {
      while (true) {
        uint job =
            read_channel_intel(jobs);
        if (!job)
          break;
        process_job(job);
      }
    }
    __kernel void producer(uint N) {
      for (uint i = 0; i < N; i++) {
        uint job = create_job(i);
        write_channel_intel(jobs, job);
      }
      write_channel_intel(jobs, 0);
    }

Figure 6: Producer-consumer kernels communicating via a channel.
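A rough software analogy of the channel semantics in figure 6, using Python threads and a bounded queue (ours; an FPGA channel is a hardware FIFO, not a thread construct, but the blocking behaviour is the same):

    import threading, queue

    jobs = queue.Queue(maxsize=512)   # like a channel of depth 512

    def producer(n):
        for i in range(1, n + 1):
            jobs.put(i)               # blocks while the FIFO is full
        jobs.put(0)                   # sentinel, like the final write

    def consumer():
        while True:
            job = jobs.get()          # blocks while the FIFO is empty
            if not job:
                break
            print("processing", job)

    t = threading.Thread(target=consumer)
    t.start()
    producer(4)
    t.join()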
3 SNN Simulation

Having reviewed SNNs, the Potjans-Diesmann microcircuit, and our implementation tools, we now discuss methods for simulating SNNs efficiently. We explore SNN simulation in general before introducing the ideas and algorithms we use to optimize our simulators. The end result is a taxonomy of twelve simulator families, grouped by their algorithms and implementation styles. We illustrate most of our points with Pythonesque pseudo-code, extended with two keywords: parfor and atomic. The former for loops whose iterations are independent and therefore can be executed in parallel, the latter for operations that must execute as one indivisible unit. Text in {brackets} explains what the pseudo-code should do at that point.

Broadly speaking, SNN simulation can be categorized based on whether it is synchronous (time-stepping) or asynchronous (event-driven). Synchronous simulation updates the state of every simulated object at every tick of a clock, regardless of whether it is necessary or not (Brette et al., 2007). Asynchronous simulation only updates simulated objects when they receive external stimuli, i.e., events. In an asynchronous SNN simulation, spike emission and reception constitute the events, since the state of a neuron at a time between two events can be calculated easily (and generally is unimportant). Hybrid strategies, with asynchronous updates for some parts of the SNN and synchronous updates for other parts, are possible.

For SNN simulation, asynchronicity offers precision advantages. The neuron's membrane potential only has to be recomputed when it receives spikes. If this happens only rarely, the simulator can afford to use more sophisticated methods than (repeated) forward Euler to solve the LIF equation (equation (1)). Also, spikes can be sent and received at any time and do not have to be confined to a discrete grid. For example, a synchronous simulator with 0.1 ms time steps may not be able to represent spikes sent at times that are not multiples of 0.1 ms. Asynchronicity may also have scalability advantages, as the neurons' states do not have to be synchronized to a global clock. However, asynchronicity entails irregular computation and irregular memory accesses – traits that are extremely undesirable on modern hardware. Furthermore, dense SNNs have many more synapses than neurons, which results in cascading effects. One spiking neuron "wakes up" thousands of its neighbours, causing them to spike and in turn wake up thousands of their neighbours. On the whole, the event-processing overhead may dominate over whatever computational savings not being bound by a global clock brings. For example, Pimpini et al. (2022) present a sophisticated asynchronous CPU-based SNN simulator that supports speculative execution, so that future neuron states can be computed in advance and then rolled back if received spikes ("stragglers") invalidate their predicted state. While the authors measured accuracy improvements, the performance was lackluster. For these reasons, most high-performance SNN simulators, including ours, are synchronous.

Synchronous simulation splits the simulation task into two phases: one for updating neurons and one for transferring spikes. Listing 7 shows the basic algorithm.

    1 for t in range(n_tics):
    2     for i in range(n_neurons):
    3         N[i] = update_neuron(N[i])
    4         if spikes(N[i]):
    5             Q = enqueue(Q, i, t)
    6     for i in range(n_neurons):
    7         c = collect(Q, i, t)
    8         N[i] = update_psc(N[i], c)
    9         Q = dequeue(Q, i, t)

Figure 7: Synchronous SNN simulation.

It uses two data structures: an indexable data structure, N, to store the state of all neurons, and a queue, Q, to keep track of spikes in flight. For every neuron the algorithm calls update_neuron to update its membrane potential, presynaptic current, and refractoriness in accordance with equations (17) to (19). If spikes indicates that the neuron spikes, it enqueues the neuron's index i and the current time step t in Q. In the next phase (lines 6 to 9) the algorithm calls collect on the queue to aggregate current destined for the i:th neuron at time step t. The call to update_psc adds the aggregated current to the neuron's presynaptic current. Finally, the algorithm removes the current it just handled from the queue. In this listing we include the for t in range(n_tics) loop, which shows that the algorithm repeats the two phases n_tics times. However, for brevity's sake, we omit this outer loop in the following listings.

The work required for updating the neurons' state for one tick is in the order of O(N), where N is the number of neurons – i.e. linear. However, for transferring spikes it is O(f·∆t·p·N²), where p is the probability that two randomly chosen neurons are connected by one or more synapses, f the average spiking rate, and ∆t the tick duration. So the transfer time is proportional to the network's density and quadratic in N – i.e. for large N it dominates. Furthermore, updating the membranes and presynapses requires a handful of multiplications per neuron – cheap on modern hardware – whereas spike transfer requires expensive reads and writes to and from non-contiguous memory. Hence, in the remainder of this section we focus on the transfer phase, which is where synchronous simulators spend most of their time.
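To get a feel for the magnitudes involved, a back-of-the-envelope calculation with the microcircuit's numbers (77,169 neurons, roughly 300 million synapses, and the roughly 23 spiking neurons per 0.1 ms tick reported in section 3.3):

    n_neurons = 77_169
    n_synapses = 300e6
    spikes_per_tic, dt = 23, 1e-4        # ~23 spikes per 0.1 ms tick

    spikes_per_s = spikes_per_tic / dt   # ~230,000 spikes/s network-wide
    fanout = n_synapses / n_neurons      # ~3,900 outgoing synapses/neuron
    events_per_s = spikes_per_s * fanout # ~0.9 billion synaptic events/s
    print(f"{events_per_s:.2e} synaptic events per second")

Every one of those events is a read-modify-write to an effectively random address, which is why the transfer phase dominates.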
3.1 Pushing and Pulling

Researchers have identified "pushing" and "pulling" as two general strategies for designing algorithms for graph problems (Besta et al. (2017); Grossman et al. (2018); Ahangari et al. (2023)). A push strategy transfers signals from a node to its neighbours by writing to the neighbours' incoming signal buffers. In contrast, a pull strategy transfers signals to a node from its neighbours by sweeping through the node's neighbours and checking whether they have any signals to be delivered to the node. I.e., the receiver node has to "go and ask" its neighbours whether they have signals for it. SNN simulation is a graph problem involving node-to-node transfer of signals, and it too can be characterized in terms of pushing and pulling. Listings 8 and 9 illustrate the two strategies. In both listings, the values of the scalars p11, p21, and p22 come from equations (12), (13), and (16). The arrays U, I, and R contain the neurons' membrane potentials, presynaptic currents, and refractory counters. When a neuron's membrane potential exceeds U_thresh the neuron spikes and becomes refractory for t_ref_tics tics. The expression T[t, i] denotes the number of thalamic spikes received by the i:th neuron at the t:th time step.

    1 parfor i in range(n_neurons):
    2     spikes = False
    3     if R[i] == 0:
    4         x = p22*U[i] + p21*I[i]
    5         spikes = x >= U_thresh
    6         if spikes:
    7             x = 0
    8             R[i] = t_ref_tics
    9         U[i] = x
    10    else:
    11        U[i] = 0
    12        R[i] -= 1
    13    if spikes:
    14        parfor j, d, w in syns_from(i):
    15            atomic W[t + d, j] += w
    16    I[i] = p11*I[i] + T[t, i]*wpsn
    17    I[i] += W[t, i]

Figure 8: One time step of push-based spike transfer.

    1 parfor i in range(n_neurons):
    2     A[t, i] = False
    3     if R[i] == 0:
    4         x = p22*U[i] + p21*I[i]
    5         if x >= U_thresh:
    6             A[t, i] = True
    7             x = 0
    8             R[i] = t_ref_tics
    9     else:
    10        x = 0
    11        R[i] -= 1
    12    U[i] = x
    13    s = 0
    14    for j, d, w in syns_to(i):
    15        if A[t - d, j]:
    16            s += w
    17    I[i] = p11*I[i] + T[t, i]*wpsn + s

Figure 9: One time step of pull-based spike transmission.

The strategies differ in how they transfer spikes. The push strategy transfers them when spikes indicates that a neuron spikes. It calls syns_from to fetch an iterator over all synapses originating from the i:th neuron. The three-tuples (j, d, w) it retrieves represent the synapses: j the index of the destination neuron, d the delay in time steps, and w the current. For every three-tuple it writes to an element of W, a two-dimensional array buffering current to be delivered. The element W[t, i] is the amount of presynaptic current the i:th neuron receives at the t:th time step. The syns_to call in listing 9 works exactly like the syns_from call, except it returns all synapses terminating at the i:th neuron, and j in the three-tuples (j, d, w) denotes the originating neuron. Array elements A[t, i] indicate whether the i:th neuron spikes at time t or not. The size of the longest synapse delay is in practice bounded and small, so both A and W can be implemented as statically sized wrap-around arrays – a technique we cover in section 3.2. Note that every neuron can be handled independently, so we use the parfor construct for both algorithms' outer loops.

Pushing of synaptic current to the neighbours happens on lines 14 and 15 of the push strategy's listing. Neurons spike infrequently, but when they do, the algorithm must retrieve all their synapses and write their current to the array W. This part of the algorithm is performance-critical. First, the algorithm accesses all the neuron's synapses, which can be expensive, even if they are stored in contiguous memory.
Second, the algorithm gathers and scatters data from and to uncorrelated indices of W. These are costly operations, since the accesses to W cannot be coalesced or cached.⁸ To make matters worse, at one time step multiple neurons can write to the same indices of W. I.e., W[t + d, j] += w has to be executed atomically to avoid data races. As the model is densely connected, races are not uncommon and should be accounted for. Partition-awareness, as suggested by Besta et al. (2017), is not an option because of the density and randomness of the connections, which make most of them remote and not local.

The pull-based algorithm instead makes the receiver neurons responsible for "pulling in" current. Spiking merely sets an element in A to true and does not trigger current transfer. On subsequent time steps, neurons connected to that neuron check whether it spiked at time t − d and, if so, add its synaptic current. The upside of this algorithm is that it is synchronization-free; neurons do not write to shared memory. The major drawback is that every neuron at every time step must check all its incoming synapses to see from which of them it receives current. Additionally, the algorithm reads from scattered memory on line 15. The large amount of data it reads probably makes it inefficient, unless the number of computational units is large and memory reads are substantially cheaper than writes. Neither of which is true for our FPGA. Consequently, we focus on push-based spike transfer in this work.

⁸ The memory addresses are too far apart for any coalescing or caching scheme to be efficient.

Listing 10 shows a variant of push-based spike transfer that first collects spiking neurons and then transfers their synapses' current in a dedicated phase. Collection can either be done with a marking array, at the expense of wasting memory, or with a queue (as in the listing), at the expense of making loop iterations dependent. The choice depends on the target platform. Though, splitting the update and spike transfer into two phases reduces the number of cache conflicts, which is advantageous. Lines 13 to 15 of the basic push algorithm can flush prefetched and cached parts of the U, I, and R arrays.

    1 for i in range(n_neurons):
    2     {Update neuron state as before}
    3     if spiked:
    4         Q = enqueue(Q, i)
    5 for i in contents(Q):
    6     parfor j, d, w in syns_from(i):
    7         W[t + d, j] += w
    8 Q = clear(Q)

Figure 10: Deferred push-based spike transfer.

3.2 Buffer Sizing and Wrapping

To reduce the size of the W array we use "wrap-around indexing". The indices of the rows in W written to in one time step lie within the interval t + 1 to t + dmax − 1, where dmax − 1 is the network's largest synaptic delay and is small. Moreover, at time step t, rows 0 to t − 1 will not be read again, so that space can be reused. We do that by setting the number of rows in W to dmax and using modular arithmetic to index rows. The expressions for accessing W on lines 15 and 17 in listing 8 become W[(t + d) % D_MAX, j] and W[t % D_MAX, i], respectively. We choose a large enough value for dmax by evaluating the cumulative distribution function of the Gaussian we sample the slower excitatory synapse delays from, N(1.5, 0.75):

$$P(N(1.5, 0.75) \le 6.4) \approx 0.999999999968. \tag{20}$$

This shows that with ∆t = 0.1, 64 rows are more than enough, since the probability of sampling a synaptic delay longer than 6.4 ms is astronomically low. Sixty-four is also a power of two, so we use masking to realize the modular arithmetic. With this scheme we also have to clear W[t, i] after reading it to avoid double reads.
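Both the probability in equation (20) and the masked indexing are easy to check (a minimal sketch; the normal CDF is evaluated via the error function):

    from math import erf, sqrt

    # P(N(1.5, 0.75) <= 6.4): standardize, then apply the normal CDF.
    z = (6.4 - 1.5) / 0.75
    print(0.5 * (1.0 + erf(z / sqrt(2.0))))   # ~0.999999999968

    # With D_MAX a power of two, the modulo reduces to a bit mask.
    D_MAX = 64
    MASK = D_MAX - 1
    t, d = 100, 26
    assert (t + d) % D_MAX == (t + d) & MASK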
Even with only 64 rows and assuming four-byte floats, the W array still consumes 64 · 4 · 77,169 ≈ 20 megabytes, which is more than we can fit in on-chip memory. We could store the array in off-chip memory – which is plentiful – or use half-precision two-byte floats instead. Neither solution is satisfactory. As we argued in section 3.1, we need fast reads and writes to uncorrelated addresses of W, which off-chip memory doesn't give us. We also prefer not to lower the numeric precision, as that makes the simulation less accurate. A third option is to store all spiking neurons in a queue and only activate a subset of their synapses at a given time step.

3.3 Just-In-Time Spike Transfer

By activating all the neuron's synapses at once, the push algorithm works harder than necessary. Obviously, all synapses must eventually be activated, but right now only synapses with a delay of one time step must be activated, since the current they transfer will be read at the next time step. This observation leads us to a "lazy" just-in-time algorithm which keeps all spiking neurons in a queue for a fixed number of time steps. At every time step it activates those synapses whose current is read at the next time step. Thus, a neuron that spikes at time t will have its one-tick-delay synapses activated at time t + 1, its two-tick-delay synapses at time t + 2, and so on. Listing 11 sketches a three-phase algorithm built on this idea.

    1 parfor i in range(n_neurons):
    2     {Update U[i], R[i] as before}
    3     spiked[i] = {True if neuron spiked}
    4     I[i] = p11*I[i] + T[t, i]*wpsn
    5 for rt in range(D_MAX):
    6     delay = (t - rt) % D_MAX
    7     if delay < D_MAX - 1:
    8         for n in enqueued_at(Q, rt):
    9             syns = syns_from(n, delay)
    10            parfor j, d, w in syns:
    11                I[j] += w
    12    else:
    13        Q = dequeue(Q, rt)
    14 rt = t % D_MAX
    15 for i in range(n_neurons):
    16     if spiked[i]:
    17         Q = enqueue(Q, rt, i)

Figure 11: Three-phase just-in-time spike transfer.

The first phase (lines 1 to 4) updates U and R as before and marks spiking neurons in spiked. The third phase (lines 14 to 17) enqueues the marked neurons with a "relative timestamp", rt. The second phase (lines 5 to 13) scans the queue and calls enqueued_at to fetch previously enqueued neurons. The syns_from function works as before, but now has a second parameter to select synapses with the given delay. Suppose that the simulator simulates time step 100 and that rt is 10 in one iteration of the loop. Since 26 ≡ (100 − 10) mod dmax (dmax = 64), all synapses with 26 time steps of delay of queued neurons with a relative timestamp of 10 will be activated. These neurons ought to have been stored at time step t = 100 − 26 = 74, which is indeed the case since 10 ≡ 74 mod dmax. After the neurons have stayed in the queue for D_MAX − 1 time steps, the dequeue call evicts them. Though with a sensible FIFO implementation eviction is a no-op, so we omit it in future pseudo-code.

How much memory does the just-in-time algorithm save? Through experimentation we found that the average number of spiking neurons per time step is 23, meaning that the expected number of elements in Q is 23 · dmax = 1,472. We generously round it up to 4,096 and, as neuron indices take four bytes to store, the total size of the queue is 16 kilobytes, which comfortably fits in on-chip memory.⁹

⁹ As in section 3.2, we could use the cumulative distribution function to show that 4,096 elements is enough to make the risk of overflow virtually zero.

To improve the just-in-time algorithm we use "lanes". Essentially, we duplicate the W array so that synapses of multiple spiking neurons can be activated simultaneously. The algorithm writes the synaptic current of the first neuron it handles in the first lane, of the second neuron in the second lane, and so on. Synaptic current of neurons in different lanes does not interfere. The update phase has to be adjusted accordingly and must sum the incoming current from all lanes. As each lane consumes about 320 kb of memory we can fit at most 16 lanes.
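The modular bookkeeping in the worked example above can be verified mechanically (our sketch):

    D_MAX = 64

    def delay_served(t, rt):
        # The synapse delay activated at tick t for queue slot rt
        # (line 6 of listing 11).
        return (t - rt) % D_MAX

    t = 100
    assert delay_served(t, rt=10) == 26
    # A neuron whose 26-tick synapses are activated now must have spiked
    # at t - 26 = 74, and 74 mod 64 == 10 is exactly the slot read from.
    assert (t - 26) % D_MAX == 10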
3.4 Horizon-Based Spike Transfer

The just-in-time algorithm requires a fair amount of bookkeeping. It activates the spiking neuron's synapses over dmax iterations instead of just once, and the loop on lines 10 and 11 iterates many fewer times than the corresponding loop on lines 14 and 15 of the basic push algorithm. This loop is an important source of parallelism, and running it many times with few iterations is much worse for performance than running it few times with many iterations. Our solution is to reintroduce a smaller version of the spike buffer whose number of rows, h, is a factor of dmax, so that when a neuron spikes we write to the buffer dmax/h times. Listing 12 shows the concept.

    1 parfor i in range(n_neurons):
    2     I[i] += W[t % H, i]
    3     W[t % H, i] = 0
    4     {update U[i] and R[i] as before}
    5     I[i] = p11*I[i] + wpsn*T[t, i]
    6 for i in range(D_MAX / H):
    7     rt = (t - H*i - 1) % D_MAX
    8     d_from = H*i + 1
    9     d_to = d_from + H
    10    for n in enqueued_at(Q, rt):
    11        syns = syns_from(
    12            n, d_from, d_to)
    13        parfor j, d, w in syns:
    14            W[(d + t) % H, j] += w
    15 rt = t % D_MAX
    16 for i in range(n_neurons):
    17     if spiked[i]:
    18         Q = enqueue(Q, rt, i)

Figure 12: Three-phase spike transfer with configurable horizon.

The syns_from function now retrieves synapses of the neuron whose delay is within the range d_from to d_to − 1. Suppose dmax = 64, t = 100 and h = 16. The loop on lines 6 to 14 iterates four times, the relative timestamps assume the values 35, 19, 3, and 51, and the half-open intervals the values [1, 17), [17, 33), [33, 49), and [49, 65). Thus, the inner loop on lines 10 to 14 activates all synapses with the given delays of the neurons stored at the given relative timestamps. With this scheme we trade off on-chip memory for better concurrency. Note that we add the current from the spike buffer to the presynaptic current on line 2, before we update the membrane's potential, and that we add one to d_from on line 8. This is necessary since the algorithm transfers spikes one time step later than the basic push algorithm.

3.5 Storing Synapses

The previous sections' pseudo-code implies that it is very important that synapses can be queried by their sender neuron quickly. In particular, synapses should be stored so that one syns_from call only accesses memory in one contiguous chunk. We fulfill these goals by storing the synapses in an array sorted on sender neuron, delay, and receiver neuron, which we query with a prebuilt index keyed on sender neuron and delay. This means that finding all synapses for a particular neuron or neuron-delay combination requires only two index lookups: one for the first synapse and one for the last synapse. Moreover, as the index implicitly stores the sender neuron, we only need eight bytes to represent a synapse: four for the single-precision weight (32 bits), three for the destination neuron's id (17 bits), and one for the synaptic delay (6 bits).¹⁰ Listing 13 shows how syns_from performs index look-ups. The two-dimensional array X represents the index and S the synapse array, so that X[n, d_from] contains the index in S where the first synapse of neuron n with delay d_from is stored. Ideally, the synapses should also be prefetched.

    1 def syns_from(n, d_from, d_to):
    2     start = X[n, d_from]
    3     end = X[n, d_to]
    4     for i in range(start, end):
    5         yield S[i]

Figure 13: Indexed access to synapse data.

¹⁰ With just-in-time transfer the delay is also stored implicitly.
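The 32 + 17 + 6 = 55 payload bits fit in an eight-byte word. One possible bit layout (our illustration; the field order in the actual simulator may differ):

    import struct

    def pack_synapse(weight, dest, delay):
        # 32-bit float weight | 17-bit destination id | 6-bit delay.
        assert 0 <= dest < 1 << 17 and 0 <= delay < 1 << 6
        (w_bits,) = struct.unpack("<I", struct.pack("<f", weight))
        return w_bits | (dest << 32) | (delay << 49)

    def unpack_synapse(s):
        (weight,) = struct.unpack("<f", struct.pack("<I", s & 0xFFFFFFFF))
        return weight, (s >> 32) & 0x1FFFF, (s >> 49) & 0x3F

    s = pack_synapse(0.15, dest=70_000, delay=15)
    w, dest, delay = unpack_synapse(s)
    assert (dest, delay) == (70_000, 15) and abs(w - 0.15) < 1e-6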
3.6 Disjoint Synapses

A headache for push-based spike transfer is the data race caused by multiple synapses delivering current to the same neuron at the same time step. This is why we can't use parfor on line 5 of listing 10, for example. We cannot completely solve this problem, but we can alleviate it by partitioning the synapses into disjoint classes, so that synapses of different classes never trigger writes to the same memory addresses at the same time. Then every class can be handled by a separate thread. The pseudocode in listing 14 modifies the transfer phase of the deferred push algorithm from listing 10 to exploit this idea.¹¹

    1 parfor c in range(N_CLS):
    2     for i in contents(Q):
    3         syns = syns_from(i, c)
    4         for j, d, w in syns:
    5             W[t + d, j] += w
    6 Q = clear(Q)

Figure 14: Spike transfer with partitioned synapses.

The constant N_CLS denotes the number of synapse classes, and the extra argument to syns_from selects which synapse class to query. Many ways of partitioning the synapses into disjoint classes are possible. For example, by delay, so that one thread writes one-time-step synapses, the next thread two-time-step synapses, and so on. Another is by destination neuron, so that one thread handles synapses going to neurons whose index is between 0 and 199, another those between 200 and 399, and so on.

¹¹ We can of course use the same technique to improve all other algorithms we have discussed.

We choose to partition by the destination neuron's congruence class. The method has low overhead and evenly distributes the synapses over the classes, since the least significant bits of the neuron index are almost random. To retain the contiguous storage we interleave the synapses. That is, if a neuron's synapses are found at indices o0 to o1 − 1, then all synapses of class c are stored at indices o0 + nc·i + c, where nc is the number of classes and i is a non-negative integer. Interleaving synergizes with the banked memory common on many GPUs. On our FPGA, it means that we can have conflict-free dedicated memory ports for every synapse class.

    1 o0 = X[i][0]
    2 o1 = X[i][D_MAX]
    3 parfor c in range(N_CLS):
    4     for o in range(o0 + c, o1, N_CLS):
    5         j, d, w = S[o]
    6         W[(t + d) % D_MAX, j] += w

Figure 15: Spike transfer over interleaved synaptic storage.

The method causes some memory waste, however. For example, if there are four classes and all destination neuron indices of all synapses of some neuron happen to be congruent with 2 mod 4, then all indices other than o0 + 4i + 2 will be vacant. I.e., 75% of the space will go to waste. In general, the memory consumption for storing a neuron's synapses grows from ns·s, where ns is its number of synapses and s the synapse size, to nc·l·s, where l is the number of synapses in the neuron's largest class. In practice, the memory waste is manageable – on the order of 5-30% depending on the number of classes and the horizon. The more classes and the shorter the horizon, the more uneven the classes become and the more waste. We do not mark vacancies and instead fill them with synapses carrying no current and terminating at idempotent neurons. This way, the code in listing 15 does not need to check whether the index is vacant.
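A sketch of the interleaved layout and its padding overhead (ours): a synapse's class is its destination index modulo nc, class c occupies slots c, c + nc, c + 2nc, and so on, and short classes are padded with inert filler synapses:

    N_CLS = 4
    FILLER = (0, 0, 0.0)   # zero-current synapse to an idempotent neuron

    def interleave(synapses):
        # synapses: list of (dest, delay, weight) for one sender neuron.
        classes = [[s for s in synapses if s[0] % N_CLS == c]
                   for c in range(N_CLS)]
        longest = max(len(cls) for cls in classes)
        slots = []
        for i in range(longest):
            for cls in classes:
                slots.append(cls[i] if i < len(cls) else FILLER)
        return slots

    # Worst case from the text: every destination congruent to 2 mod 4.
    layout = interleave([(2, 1, 0.15), (6, 3, 0.15), (10, 2, 0.15)])
    assert len(layout) == 12   # 3 real synapses, 9 filler: 75% waste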
3.7 More on Parallelism

Our algorithms' source of parallelism is the parfor keyword. Iterations of such loops are independent and can be executed concurrently in duplicated hardware. The replication is realized differently on different targets. On CPUs we use SIMD, and on GPUs the single-instruction multiple-threads (SIMT) execution model "automatically" parallelizes the computation. With Intel's OpenCL SDK we use the #pragma unroll N and #pragma ivdep compiler directives to instruct the compiler to replicate loop hardware. Essentially, this "widens" data paths, allowing more data to be processed in parallel. But FPGAs can also run more data paths in parallel, akin to how multiple CPU cores can run multiple threads. We implement this parallelism by dividing the algorithm into multiple kernels.

    1 parfor i in range(n_neurons):
    2     {Update neuron state as before}
    3     if spikes:
    4         write(to_transfer, i)
    5 write(to_transfer, DONE)
    6 read(to_update)

    1 while True:
    2     i = read(to_transfer)
    3     if i == DONE:
    4         break
    5     parfor j, d, w in syns_from(i):
    6         W[t + d, j] += w
    7 write(to_update, True)
{Horizon update kernel:}
1 for i in range(n_neurons):
2     I[i] += read(to_update)
3 A = []
4 for i in range(n_neurons):
5     {Update neuron state as before}
6     if spiked[i]:
7         A.append(i)
8 write(to_transfer, A)

{Horizon transfer kernel:}
1 for i in range(n_neurons):
2     write(to_update, W[t % H, i])
3     W[t % H, i] = 0
4 for i in range(D_MAX / H):
5     rt = (t - H*i - 1) % D_MAX
6     for n in enqueued_at(Q, rt):
7         syns = syns_from(n, rt + H)
8         parfor j, d, w in syns:
9             W[(d + t) % H, j] += w
10 for n in read(to_transfer):
11     Q = enqueue(Q, t % D_MAX, n)
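The essential data structure in these kernels is the on-chip ring buffer W that holds only H future time steps of synaptic current. A minimal stand-alone Python sketch of the deposit/drain cycle, for delays that fit within the horizon (longer delays go through the queue Q as in the listing), might look as follows; the helper names and sizes are ours, chosen for illustration.

H, N = 16, 8                           # toy horizon length and neuron count
W = [[0.0] * N for _ in range(H)]      # ring buffer: H future steps of current
I = [0.0] * N                          # per-neuron input current

def deposit(t, j, d, w):
    # a spike reaching neuron j after delay 1 <= d <= H lands in slot (t+d) % H
    W[(t + d) % H][j] += w

def drain(t):
    # at the start of step t, move the current slot into I and clear it
    for i in range(N):
        I[i] += W[t % H][i]
        W[t % H][i] = 0.0

deposit(0, 3, 5, 0.25)
for t in range(6):
    drain(t)
print(I[3])                            # 0.25 arrives at step t = 5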
Figure 19: Schematic of the monolithic gmem implementations.
Figure 21: Spike plot of the first 1000 ms of simulation. The rhythmic nature of the microcircuit’s spiking
pattern is apparent.
et al. (2018); (Potjans and Diesmann, 2014); and others used. I.e., we simulate the microcircuit initialized with the same random seed with our simulators and with grid-based (double-precision) NEST, which we treat as our reference. We discard spikes during the first second and compute the following statistics over the remaining nine seconds over the eight populations: rate of spikes per second, coefficient of variation of interspike intervals, and Pearson correlation coefficient over binned spike trains. We smooth each distribution with Gaussian kernel density estimation with bandwidth selected using Scott's Rule.

Figure 22 shows the distributions plotted for NEST in blue, the gmem/mono/s family in gold, and the jit/multi/s family in green. The plots indicate that the simulators produce distributions that are very similar to NEST's, which implies that the accuracy loss caused by the reduced numerical precision is negligible. We do not plot the distributions for our other families of single-precision simulators as they are even closer to NEST's distributions. Neither do we plot distributions for our double-precision simulators as they produce results that are spike-for-spike identical to NEST's. The Kullback-Leibler (KL) divergence between the distributions, shown in figure 23, quantifies the apparent similarities. The figure's blue, gold, and green bars show the KL divergences between two NEST simulations initialized with different random seeds, between NEST and gmem/mono/s, and between NEST and jit/mono/s, the latter two initialized with identical seeds. The divergence between two random seeds is much larger than between NEST and our simulators, indicating that they are accurate.

The non-associativity of IEEE 754 floating-point is the main reason for the small differences. The order in which presynaptic current is added depends on the spike transfer algorithm, which affects rounding. The differences are stochastic and do not bias the result.

4.2 Simulation speed

Table 4 presents the performance and resource usage of our fastest simulator configurations. The first column shows the simulator's family in slash-notation (see section 3.8). The next five show its parameters: update width (UW), synapse unroll (SU), horizon length (H), synapse classes (SC), and lane count (LC). The following column shows its real-time factor (RTF), defined as the time taken to run the simulation – wall-clock time – divided by the duration of the simulated biological time (10 seconds). We measure the wall-clock time as the time from the first
Figure 22: Kernel density estimates of a) spikes per second, b) coefficient of variation of interspike intervals, and c) Pearson correlation coefficient between binned spike trains for neuron samples. Blue lines for NEST, gold for gmem/mono/s, and green for jit/mono/s.
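For readers who want to reproduce this analysis, the sketch below computes the three statistics and a KL divergence between smoothed distributions along the lines described above. It is our reading of the procedure rather than the code we ran: scipy's gaussian_kde defaults to Scott's Rule, but the spike trains, bin edges, and grid here are synthetic stand-ins.

import numpy as np
from scipy.stats import gaussian_kde

def firing_rates(trains, t_start=1.0, t_stop=10.0):
    # discard the first second; rate over the remaining nine seconds
    return np.array([np.sum((t >= t_start) & (t < t_stop)) / (t_stop - t_start)
                     for t in trains])

def cv_isi(train):
    # coefficient of variation of the interspike intervals
    isi = np.diff(train)
    return isi.std() / isi.mean()

def pearson_binned(a, b, bins):
    # Pearson correlation coefficient between two binned spike trains
    ha, _ = np.histogram(a, bins)
    hb, _ = np.histogram(b, bins)
    return np.corrcoef(ha, hb)[0, 1]

def kl_divergence(samples_p, samples_q, grid):
    # smooth both samples with Gaussian KDE (Scott's Rule bandwidth),
    # then compare the densities on a common grid
    p = np.clip(gaussian_kde(samples_p)(grid), 1e-12, None)
    q = np.clip(gaussian_kde(samples_q)(grid), 1e-12, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# synthetic stand-ins for two simulators' spike trains (100 neurons, 10 s)
rng = np.random.default_rng(0)
ref = [np.sort(rng.uniform(0, 10, rng.poisson(40))) for _ in range(100)]
sim = [np.sort(rng.uniform(0, 10, rng.poisson(42))) for _ in range(100)]

grid = np.linspace(0, 10, 256)
print(kl_divergence(firing_rates(ref), firing_rates(sim), grid))
print(np.mean([cv_isi(t) for t in ref if len(t) > 2]))
print(pearson_binned(ref[0], sim[0], bins=np.linspace(0, 10, 101)))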
Simulator UW SU H SC LC RTF Freq. (MHz) ALUT Reg. ALM (%) M20K (%) DSP (%)
gmem/mono/s 8 2 n/a n/a n/a 4.79 601 126k 313k 21% 23% 1%
” 64 2 n/a n/a n/a 4.93 585 155k 391k 26% 24% 2%
gmem/mono/d 8 1 n/a n/a n/a 5.38 608 141k 371k 24% 27% 2%
” 8 2 n/a n/a n/a 4.71 608 147k 394k 25% 28% 2%
” 16 2 n/a n/a n/a 4.61 605 182k 477k 30% 28% 4%
gmem/multi/s 4 4 n/a n/a n/a 4.91 600 122k 294k 20% 22% 0%
” 64 1 n/a n/a n/a 5.65 585 128k 340k 22% 22% 6%
gmem/multi/d 8 1 n/a n/a n/a 5.52 609 139k 365k 23% 26% 2%
” 8 2 n/a n/a n/a 4.86 605 145k 360k 24% 27% 2%
horiz/mono/s 32 n/a 16 32 n/a 0.81 608 127k 329k 22% 79% 4%
” 64 n/a 16 16 n/a 0.85 604 143k 375k 25% 80% 7%
horiz/mono/d 4 n/a 8 4 n/a 1.74 609 122k 333k 21% 55% 1%
” 32 n/a 16 32 n/a 0.82 601 258k 653k 42% 85% 9%
horiz/multi/s 4 n/a 8 4 n/a 1.73 610 108k 287k 19% 51% 1%
” 16 n/a 16 16 n/a 0.81 605 123k 330k 22% 80% 2%
” 32 n/a 16 32 n/a 0.79 601 133k 333k 23% 81% 4%
horiz/multi/d 4 n/a 8 4 n/a 1.79 583 126k 319k 22% 57% 1%
” 16 n/a 16 16 n/a 0.80 607 187k 492k 32% 85% 5%
” 32 n/a 16 32 n/a 0.79 600 265k 664k 43% 86% 9%
jit/mono/s 16 n/a n/a 16 16 1.47 590 116k 319k 22% 79% 7%
jit/mono/d 4 n/a n/a 4 8 2.23 611 122k 317k 21% 56% 2%
” 16 n/a n/a 16 16 1.44 597 182k 478k 32% 85% 10%
” 32 n/a n/a 32 16 1.50 584 268k 704k 44% 85% 20%
jit/multi/s 16 n/a n/a 16 16 1.43 593 120k 325k 22% 79% 7%
jit/multi/d 4 n/a n/a 4 8 2.21 600 125k 317k 22% 55% 2%
” 8 n/a n/a 8 16 1.56 602 147k 393k 26% 84% 5%
” 16 n/a n/a 16 16 1.34 606 186k 492k 33% 85% 10%
Table 4: Speed and resource usage for some simulators. The table includes each family’s fastest simulator
and some others for comparison purposes.
Figure 24: RTF as a function of update width for the fastest gmem/mono/s (blue), gmem/mono/d (gold), gmem/multi/s (green), and gmem/multi/d (red) simulators. Update widths larger than four do not decrease the RTF.
because the FPGA's DDR interface is 512 bits wide. Neither the multi- nor single-kernel gmem simulators benefit from more synapse unroll. The reason could be that the unrolled versions of the loop contain a false memory dependency: there is no way of letting the OpenCL compiler know that two iterations of the loop body – W[t + d, j] += w – write to distinct memory locations, so the compiler refuses to schedule multiple writes per clock cycle.

The jit and horiz simulators are markedly faster than the gmem simulators because they transfer spikes in on-chip memory. The best gmem simulator has an RTF of 4.61, while it is 1.50 for the best jit simulator and 0.79 for the best horiz simulator. The latter simulators also use up to 90% of the on-chip memory for storing the horizon and lane buffers. The results suggest that the larger these buffers are, the better the performance. Unfortunately, our FPGA cannot fit horizons longer than 16 time steps or more than 16 lanes. The horiz simulators' performance edge over the jit simulators is due to them running the spike transfer loop fewer times: the jit simulators run it dmax times (i.e., 64) for every spiking neuron, while the horizon simulators only run it dmax/h times (i.e., 64/16 = 4 times). The horiz simulators also do not have to sum the incoming current from multiple lanes when updating the neurons' state.

Figure 25 plots RTF as a function of the number of synapse classes for the horiz and jit simulators. For both, performance improves until the parameter reaches 32. The more classes, the more synapses can be activated in parallel, which, clearly, is important for performance. The drawback of increasing the number of classes is memory waste, especially for the jit simulators, whose spike transfer loops run many fewer iterations per invocation. With 16 classes the average occupancy is only about 74%. It decreases to 47% with 32 classes, meaning that the simulator accesses more than twice as much data as it needs. Interestingly, there is no performance penalty: the benefit of handling many synapses in parallel is worth lots of extra off-chip memory reads, perhaps because we store the synapses in contiguous memory, making reading them quite cheap.

All simulators run at around 600 MHz, considered very good for HLS. In our experience, operating frequencies much lower than that were caused by inefficiently banked on-chip memory, unaccounted-for loop-carried dependencies, or similar issues.

4.3 Energy Usage

We use the Terasic Dashboard GUI to measure the energy usage of our fastest simulator – horiz/multi/s with the parameters H=16, UW=32, and SC=32. The dashboard connects to the Agilex 7 via a MAX 10 device that continuously monitors the voltage and current rails going into the board and the FPGA itself (Intel, 2023), allowing us to measure both a "pessimistic" power consumption for the whole board (which includes unused peripherals that draw power) and an "optimistic" one for only the FPGA fabric. Due to the device's low sampling rate, we compute the total energy usage as the product of the maximum power draw and the simulation time. According to our measurements, the simulator pessimistically requires 44.9 W · 8.13 s = 101 mWh and optimistically 16.3 W · 8.13 s = 37 mWh to simulate 10 seconds of biological time. The pessimistic energy per synaptic event (metric defined in van Albada et al. (2018)) is 21 nJ and the optimistic one 9 nJ. These values are upper bounds and the actual energy usage may be lower.

5 Discussion

Our main contribution is the presentation and analysis of methods for creating FPGA-based simulators competitive in speed and energy usage with the state-of-the-art in SNN simulation. We hope that others will create even faster simulators by adopting and refining our algorithms and implementation techniques for their hardware. Some methods are endogenous to our hardware. For example, by running multiple kernels in parallel that communicate with each other via channels, we overlap different phases of the simulation algorithm. This technique has no direct GPU equivalent, as different kernel types cannot communicate. It also relies on the Intel-specific channel extension and may not work well even on other vendors' FPGAs. On the other hand, the technique for interleaved synaptic storage should carry over to other FPGAs thanks to their malleability.
Figure 25: The horizon and JIT algorithms’ RTF as a function of the number of synapse classes. The horizon
length and lane count parameters are both 16. The four colors represent the four implementation styles.
However, FPGAs, and particularly tools for designing for FPGAs, have many glaring weaknesses. Unlike compilers for software, which emit the same machine code for the same source code, synthesizers use optimization algorithms based on randomness, so good performance is contingent on choosing lucky random seeds. Even if the differences between lucky and unlucky seeds are small, the randomness makes squeezing out the last few percent of performance frustrating. An algorithmic change or parameter tuning causing a small performance improvement may be due to chance. Moreover, a single synthesis can take several hours even on top-of-the-line hardware, compounding this problem. Writing performance-critical code is an experimental process, wherein one needs to test hundreds or thousands of ideas to see what yields the best performance. Long turn-around times slow the process down. Simulators do not help, as their performance does not reflect the performance of real hardware. For these reasons, FPGAs are decidedly more complex to develop for than GPUs.

What is a good distribution of FPGA resources for HPC? For us, our board's distribution is far from perfect. Our fastest designs used almost all on-chip memory, with plenty of ALUTs, FFs, and DSPs to spare. If our use-case and designs are close to the norm, then it would be wise for FPGA vendors to trade off logic resources for more on-chip memory. And trends in the HPC field indicate that memory – not arithmetic – very much is a limiting factor for many problems. However, inevitably, whatever resource distribution the vendor chooses, it will not be ideal for some designs. Optimizing an FPGA so that as many designs as possible can take advantage of as much of its resources as possible seems exceedingly difficult.

5.2 HLS for HPC

In this work we chose HLS over traditional design methods because of its purported productivity advantages. How much performance did HLS cost us, and how much longer would it have taken us to implement the simulators in an HDL? The question is an instance of the classic dichotomy between performance and productivity that appears in many corners of computer science. The literature suggests that in general the productivity gains of HLS are large and the performance losses are either non-existent or low (Lahti et al., 2019; Pelcat et al., 2016). However, one can ask whether this holds true for performance-critical designs.

HLS definitely allowed us to explore algorithmic ideas at a rapid pace. In particular, scheduling stallable loops and interfacing with DDR would have been at least an order of magnitude more work to implement (and debug!) in an HDL. It also helped that most – but not all – of the OpenCL code we wrote was runnable verbatim on non-FPGA targets. However, for performance, judicious use of compiler directives is essential, something we struggled with. For example, adding #pragma disable_loop_pipelining on a loop in a gmem simulator increased its operating frequency by almost 200 MHz, since the compiler could deduce that a memory port would not be shared across loop iterations. On the other hand, removing the directive on a loop in a horiz simulator more than doubled its performance! On several occasions, misplaced #pragma ivdep directives caused bugs that were difficult to troubleshoot. Having to "nudge" the compiler via directives to make the right decisions made us feel like we were not fully in control and was at times frustrating.

As we have no baseline to compare with, estimating the performance cost of HLS is difficult. Our designs run at over 600 MHz, which – while a fair bit lower than the theoretical limit – is in the upper range of what typical Agilex 7 HDL designs run at. If we are correct that performance is mostly bounded by memory resources, then non-optimal performance is due to algorithmic choices and not due to the choice of implementation language.

5.3 Future Research

Time constraints forced us to leave many ideas for better performance unexplored. We list some of them here.

Our results demonstrate that single-precision floating-point is sufficient. Half-precision or some
Figure 28: RTFs of some SNN simulators.
Work Simulator Hardware Node (nm) RTF Syn. ev. en. (nJ)
This work horiz/mono/s 1 Agilex 7 FPGA 10 0.81 25
Heittmann et al. (2022) IBM INC-3000 432 Xilinx XC Z7045 SoC 28 0.25 783
Kauth et al. (2023b) neuroAIx 35 NetFPGA SUME 28 0.05 48
Golosio et al. (2021) NeuronGPU 1 GeForce RTX 2080 Ti 12 1.06 180
Golosio et al. (2021) NeuronGPU 1 Tesla V100 12 1.64 -
Knight and Nowotny (2018) GeNN 1 GeForce RTX 2080 Ti 12 1.40 -
Knight and Nowotny (2018) GeNN 1 Tesla V100 12 2.16 470
van Albada et al. (2018) SpiNNaker 217 ASIC 130 20.00 5900
Rhodes et al. (2020) SpiNNaker 318 ASIC 130 1.00 600
Kurth et al. (2022) NEST 2 AMD EPYC Rome 14 0.53 480
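As a back-of-the-envelope check of the energy column, the arithmetic from section 4.3 is a few lines of code. The sketch below uses the measured peak powers and runtime quoted there; the synaptic-event count is an assumption we back-solved purely for illustration, as it is not restated in this section.

P_PESSIMISTIC = 44.9   # W, whole board (measured peak)
P_OPTIMISTIC = 16.3    # W, FPGA fabric only (measured peak)
T_WALL = 8.13          # s, wall-clock time for 10 s of biological time

def energy_mwh(p_watts, t_seconds):
    # energy as max power times runtime; 1 mWh = 3.6 J
    return p_watts * t_seconds / 3.6

def nj_per_event(p_watts, t_seconds, n_events):
    # energy per synaptic event in nanojoules
    return p_watts * t_seconds / n_events * 1e9

print(energy_mwh(P_PESSIMISTIC, T_WALL))   # ~101 mWh
print(energy_mwh(P_OPTIMISTIC, T_WALL))    # ~37 mWh
n_events = 1.7e10  # assumed count, chosen to reproduce ~21 nJ/event
print(nj_per_event(P_PESSIMISTIC, T_WALL, n_events))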
Kauth et al. (2023b) report a much lower RTF than us, but use 35 boards, each drawing 26.54 W on average, resulting in a higher total energy usage for the same amount of biological time. They, like us, report an "all-inclusive" value for the energy usage, so a dedicated platform – without any unused peripherals – could consume much less energy. It goes without saying that fairly comparing the performance of systems implemented on different architectures, with different design trade-offs and accuracy constraints, is tricky. One system with excellent performance may be inadequate in other regards. For example, less configurability may improve a system's performance but make it unusable for certain applications. It should be noted that most simulators, unlike ours, replace thalamic spikes with DC input, which limits their applicability.

Author Contributions

BAL and AP designed the study. BAL implemented the SNN framework and performed the experiments. BAL and AP analyzed the results and co-wrote the paper.

Data Availability Statement

All source code will eventually be available under a permissible Open Source license.

References

Ahangari, H., Özdal, M. M., and Öztürk, O. (2023). Hls-based high-throughput and work-efficient synthesizable graph processing template pipeline. ACM Trans. Embed. Comput. Syst. 22. doi:10.1145/3529256

Aminian, M., Saeedi, M., Zamani, M. S., and Sedighi, M. (2008). Fpga-based circuit model emulation of quantum algorithms. In 2008 IEEE Computer Society Annual Symposium on VLSI (IEEE), 399–404

Besta, M., Podstawski, M., Groner, L., Solomonik, E., and Hoefler, T. (2017). To push or to pull: On reducing communication and synchronization in graph computations. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. 93–104

Brette, R., Rudolph, M., Carnevale, T., Hines, M., Beeman, D., Bower, J. M., et al. (2007). Simulation of networks of spiking neurons: a review of tools and strategies. Journal of computational neuroscience 23, 349–398
Brunel, N. (2000). Persistent activity and the single-cell frequency–current curve in a cortical network model. Network: Computation in Neural Systems 11, 261–280

Carpegna, A., Savino, A., and Di Carlo, S. (2022). Spiker: an fpga-optimized hardware accelerator for spiking neural networks. In 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (IEEE), 14–19

Carpegna, A., Savino, A., and Di Carlo, S. (2024). Spiker+: a framework for the generation of efficient spiking neural networks fpga accelerators for inference at the edge. arXiv preprint arXiv:2401.01141

Cheung, K., Schultz, S. R., and Luk, W. (2016). Neuroflow: a general purpose spiking neural network simulation platform using customizable processors. Frontiers in neuroscience 9, 516

Czajkowski, T. S., Aydonat, U., Denisenko, D., Freeman, J., Kinsner, M., Neto, D., et al. (2012). From opencl to high-performance hardware on fpgas. In 22nd international conference on field programmable logic and applications (FPL) (IEEE), 531–534

Del Sozzo, E., Rabozzi, M., Di Tucci, L., Sciuto, D., and Santambrogio, M. D. (2018). A scalable fpga design for cloud n-body simulation. In 2018 IEEE 29th international conference on application-specific systems, architectures and processors (ASAP) (IEEE), 1–8

Efnusheva, D., Cholakoska, A., and Tentov, A. (2017). A survey of different approaches for overcoming the processor-memory bottleneck. International Journal of Computer Science and Information Technology 9, 151–163

Faj, J., Kenter, T., Faghih-Naini, S., Plessl, C., and Aizinger, V. (2023). Scalable multi-fpga design of a discontinuous galerkin shallow-water model on unstructured meshes. In Proceedings of the Platform for Advanced Scientific Computing Conference. 1–12

Furber, S. B., Lester, D. R., Plana, L. A., Garside, J. D., Painkras, E., Temple, S., et al. (2013). Overview of the spinnaker system architecture. IEEE Transactions on Computers 62, 2454–2467. doi:10.1109/TC.2012.142

Ghosh-Dastidar, S. and Adeli, H. (2009). Third generation neural networks: Spiking neural networks. In Advances in computational intelligence (Springer). 167–178

Golosio, B., Tiddia, G., De Luca, C., Pastorelli, E., Simula, F., and Paolucci, P. S. (2021). Fast simulations of highly-connected spiking cortical models using gpus. Frontiers in Computational Neuroscience 15. doi:10.3389/fncom.2021.627620

Grossman, S., Litz, H., and Kozyrakis, C. (2018). Making pull-based graph processing performant. SIGPLAN Not. 53, 246–260. doi:10.1145/3200691.3178506

Gupta, S., Vyas, A., and Trivedi, G. (2020). Fpga implementation of simplified spiking neural network. In 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS). 1–4. doi:10.1109/ICECS49266.2020.9294790

Han, J., Li, Z., Zheng, W., and Zhang, Y. (2020). Hardware implementation of spiking neural networks on fpga. Tsinghua Science and Technology 25, 479–486

Hanuschkin, A., Kunkel, S., Helias, M., Morrison, A., and Diesmann, M. (2010). A general and efficient method for incorporating precise spike times in globally time-driven simulations. Frontiers in Neuroinformatics 4. doi:10.3389/fninf.2010.00113

Heittmann, A., Psychou, G., Trensch, G., Cox, C. E., Wilcke, W. W., Diesmann, M., et al. (2022). Simulating the cortical microcircuit significantly faster than real time on the ibm inc-3000 neural supercomputer. Frontiers in Neuroscience 15. doi:10.3389/fnins.2021.728460

Huthmann, J., Shin, A., Podobas, A., Sano, K., and Takizawa, H. (2019). Scaling performance for n-body stream computation with a ring of fpgas. In Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. 1–6

Intel (2023). Intel Agilex® 7 FPGA F-Series Development Kit User Guide. Intel Corporation, 2023-06-14 edn. Available at https://www.intel.com/content/www/us/en/docs/programmable/683024/current/overview.html

Jouppi, N., Young, C., Patil, N., and Patterson, D. (2018). Motivation for and evaluation of the first tensor processing unit. IEEE Micro 38, 10–19

Karp, M., Podobas, A., Jansson, N., Kenter, T., Plessl, C., Schlatter, P., et al. (2021). High-performance spectral element methods on field-programmable gate arrays: implementation, evaluation, and future projection. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (IEEE), 1077–1086

Kauth, K., Stadtmann, T., Sobhani, V., and Gemmeke, T. (2023a). neuroaix: Fpga cluster for reproducible and accelerated neuroscience simulations of snns. In 2023 IEEE Nordic Circuits and Systems Conference (NorCAS). 1–7. doi:10.1109/NorCAS58970.2023.10305473

Kauth, K., Stadtmann, T., Sobhani, V., and Gemmeke, T. (2023b). neuroaix-framework: design of future neuroscience simulation systems exhibiting execution of the cortical microcircuit model 20× faster than biological real-time. Frontiers in Computational Neuroscience 17. doi:10.3389/fncom.2023.1144143

Knight, J. C. and Nowotny, T. (2018). Gpus outperform current hpc and neuromorphic solutions in terms of speed and energy when simulating a highly-connected cortical model. Frontiers in Neuroscience 12. doi:10.3389/fnins.2018.00941

Kuon, I., Tessier, R., Rose, J., et al. (2008). Fpga architecture: Survey and challenges. Foundations and Trends® in Electronic Design Automation 2, 135–253

Kurth, A. C., Senk, J., Terhorst, D., Finnerty, J., and Diesmann, M. (2022). Sub-realtime simulation of a neuronal network of natural density. Neuromorphic Computing and Engineering 2, 021001

Lahti, S., Sjövall, P., Vanne, J., and Hämäläinen, T. D. (2019). Are we there yet? a study on the state of high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38, 898–911. doi:10.1109/TCAD.2018.2834439

[Dataset] Langhammer, M. and Constantinides, G. (2023). egpu: A 750 mhz class soft gpgpu for fpga

Langhammer, M., Nurvitadhi, E., Pasca, B., and Gribok, S. (2021). Stratix 10 nx architecture and applications. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 57–67

Li, S., Zhang, Z., Mao, R., Xiao, J., Chang, L., and Zhou, J. (2021). A fast and energy-efficient snn processor with adaptive clock/event-driven computation scheme and online learning. IEEE Transactions on Circuits and Systems I: Regular Papers 68, 1543–1552. doi:10.1109/TCSI.2021.3052885

Liu, H., Chen, Y., Zeng, Z., Zhang, M., and Qu, H. (2023). A low power and low latency fpga-based spiking neural network accelerator. In 2023 International Joint Conference on Neural Networks (IJCNN). 1–8. doi:10.1109/IJCNN54540.2023.10191153

Makin, S. (2019). The four biggest challenges in brain simulation (https://www.nature.com/articles/d41586-019-02209-z). Nature 571

Menzel, J., Plessl, C., and Kenter, T. (2021). The strong scaling advantage of fpgas in hpc for n-body simulations. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15, 1–30

Meyer, M., Kenter, T., and Plessl, C. (2023). Multi-fpga designs and scaling of hpc challenge benchmarks via mpi and circuit-switched inter-fpga networks. ACM Transactions on Reconfigurable Technology and Systems 16, 1–27
Munshi, A. (2009). The opencl specification. In 2009 IEEE Hot Chips 21 Symposium (HCS) (IEEE), 1–314

Murphy, C. and Fu, Y. (2017). Xilinx all programmable devices: A superior platform for compute-intensive systems. Xilinx White Paper

Nane, R., Sima, V.-M., Pilato, C., Choi, J., Fort, B., Canis, A., et al. (2015). A survey and evaluation of fpga high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 1591–1604

O'Loughlin, D., Coffey, A., Callaly, F., Lyons, D., and Morgan, F. (2014). Xilinx vivado high level synthesis: Case studies (IET)

Pani, D., Meloni, P., Tuveri, G., Palumbo, F., Massobrio, P., and Raffo, L. (2017). An fpga platform for real-time simulation of spiking neuronal networks. Frontiers in Neuroscience 11. doi:10.3389/fnins.2017.00090

Pelcat, M., Bourrasset, C., Maggiani, L., and Berry, F. (2016). Design productivity of a high level synthesis compiler versus hdl. In 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). 140–147. doi:10.1109/SAMOS.2016.7818341

Perry, D. L. (2002). VHDL: programming by example (McGraw-Hill Education)

Pimpini, A., Piccione, A., Ciciani, B., and Pellegrini, A. (2022). Speculative distributed simulation of very large spiking neural networks. In Proceedings of the 2022 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (New York, NY, USA: Association for Computing Machinery), SIGSIM-PADS '22, 93–104. doi:10.1145/3518997.3531027

Plesser, H. E., Diesmann, M., Gewaltig, M.-O., and Morrison, A. (2015). NEST: the Neural Simulation Tool (New York, NY: Springer New York). 1849–1852. doi:10.1007/978-1-4614-6675-8_258

Podobas, A. (2023). Q2logic: A coarse-grained fpga overlay targeting schrödinger quantum circuit simulations. In 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (IEEE), 460–467

Podobas, A., Sano, K., and Matsuoka, S. (2020). A survey on coarse-grained reconfigurable architectures from a performance perspective. IEEE Access 8, 146719–146743

Podobas, A., Zohouri, H. R., Maruyama, N., and Matsuoka, S. (2017). Evaluating high-level design strategies on fpgas for high-performance computing. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL) (IEEE), 1–4

Potjans, T. C. and Diesmann, M. (2014). The cell-type specific cortical microcircuit: relating structure and activity in a full-scale spiking network model. Cerebral cortex 24, 785–806

Rhodes, O., Peres, L., Rowley, A. G. D., Gait, A., Plana, L. A., Brenninkmeijer, C., et al. (2020). Real-time cortical simulation on neuromorphic hardware. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378, 20190160. doi:10.1098/rsta.2019.0160

Sanaullah, A. and Herbordt, M. C. (2018). Unlocking performance-programmability by penetrating the intel fpga opencl toolflow. In 2018 IEEE High Performance extreme Computing Conference (HPEC) (IEEE), 1–8

Sano, K., Koshiba, A., Miyajima, T., and Ueno, T. (2023). Essper: Elastic and scalable fpga-cluster system for high-performance reconfigurable computing with supercomputer fugaku. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. 140–150

Shama, F., Haghiri, S., and Imani, M. A. (2020). Fpga realization of hodgkin-huxley neuronal model. IEEE Transactions on Neural Systems and Rehabilitation Engineering 28, 1059–1068
Theis, T. N. and Wong, H.-S. P. (2017). The end of moore's law: A new beginning for information technology. Computing in science & engineering 19, 41–50

Trensch, G. and Morrison, A. (2022). A system-on-chip based hybrid neuromorphic compute node architecture for reproducible hyper-real-time simulations of spiking neural networks. Frontiers in Neuroinformatics 16. doi:10.3389/fninf.2022.884033

Trimberger, S. M. (2015). Three ages of fpgas: A retrospective on the first thirty years of fpga technology. Proceedings of the IEEE 103, 318–331. doi:10.1109/JPROC.2015.2392104

van Albada, S. J., Rowley, A. G., Senk, J., Hopkins, M., Schmidt, M., Stokes, A. B., et al. (2018). Performance comparison of the digital neuromorphic hardware spinnaker and the neural network simulation software nest for a full-scale cortical microcircuit model. Frontiers in Neuroscience 12. doi:10.3389/fnins.2018.00291