An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload
ABSTRACT
Since their breakthrough, the complexity of Deep Neural Networks (DNNs) has been rising steadily. As a result, accelerators for DNNs are now used in many domains. However, designing and configuring an accelerator that perfectly meets the requirements of a given application is a challenging task. In this paper, we therefore present our approach to support the accelerator design process. With an analytical model of a systolic array, we can estimate performance, energy consumption and area for each design option. To determine these metrics, usually a cycle-accurate simulation is performed, which is a time-consuming task. Hence, the design space has to be restricted heavily. Analytical modelling, however, allows for fast evaluation of a design using a mathematical abstraction of the accelerator. For DNNs, this works especially well since the dataflow and memory accesses have high regularity. To show the correctness of our model, we perform an exemplary realization with the state-of-the-art systolic array generator Gemmini and compare it with a cycle-accurate simulation and state-of-the-art modelling tools, showing less than 1% deviation. We also conducted a design space exploration, showing the analytical model's capabilities to support an accelerator design. In a case study on ResNet-34, we demonstrate that our model and DSE tool reduce the time to find the best-fitting solution by four or two orders of magnitude compared to a cycle-accurate simulation or state-of-the-art modelling tools, respectively.

CCS CONCEPTS
• Computing methodologies → Machine learning; Modeling and simulation; • Computer systems organization → Systolic arrays; Embedded systems.

KEYWORDS
Analytical Modelling, Neural Networks, Design Space Exploration

ACM Reference Format:
Tim Hotfilter, Patrick Schmidt, Julian Hoefer, Fabian Kreß, Tanja Harbaum, Juergen Becker. 2023. An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload. In Proceedings of the 2023 Workshop on System Engineering for constrained embedded systems (RAPIDO 2023), January 17–18, 2023, Toulouse, France. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3579170.3579258

1 INTRODUCTION
Deep Neural Networks (DNNs) have entered more and more domains over the last decade due to their higher prediction performance compared to traditional algorithms. In image recognition, for example, face recognition is used in assistive robotics to support the elderly [12], and in particle physics DNNs support the compression of large data streams [2]. While DNNs already show great performance in many tasks, their computational complexity and memory requirements have grown rapidly over time to fulfill ever more sophisticated tasks. Especially considering yet unsolved problems such as autonomous driving, this complexity is foreseen to grow even further. The trend poses a challenge to the underlying hardware architecture executing the DNN. Since the computation of DNNs is a highly dataflow-driven and memory-bound task, traditional computing devices like CPUs or GPUs cannot keep pace with the fast-rising demands. To address this challenge, dedicated DNN accelerator architectures, like systolic arrays, are currently state of the art. These DNN accelerators can compute operations in parallel and reuse data to achieve high performance and efficiency. In addition, accelerators can incorporate optimization techniques like pruning or quantization [6].

While DNN accelerators can support fast and efficient inference, the design parameters of such an accelerator have to be evaluated carefully. The level of exploited parallelism, the number of processing elements, and the size of memories and their interfaces have a strong impact on design metrics like throughput, latency, power consumption and area. Besides the architecture parameters, the DNN workload itself strongly influences the performance, since a proper mapping of the workload is also important. All these parameters open a large design space from which one solution has to be carefully picked to reach high performance and efficiency. However, due to the complexity of DNN accelerators, determining the hyperparameters and specifications for a given accelerator configuration is costly, since each configuration has to be elaborated individually. Highly accurate results can be achieved through cycle-accurate simulation of the whole workload [8, 10], which takes a long time for each iteration. Considering the large design space, cycle-accurate simulation is not feasible for an extensive design space evaluation.

In this paper, we therefore present our analytical model of systolic arrays, which are a very common type of DNN accelerator. The analytical model is the centerpiece of our evaluation tool, shown in Figure 1. It estimates performance, area, and energy consumption for a given design configuration and DNN workload in a fast and accurate way. To this end, our approach uses the well-established roofline model for performance estimation and bottleneck identification. During the design process, constraints, for example an upper area or power limit, can be defined. This allows us to find a solution that meets all design requirements. With our analytical model, we are able to evaluate a design in up to 12000x less time compared to a cycle-accurate simulation. We verify found solutions by comparing them against the cycle-accurate simulation, showing less than 1% deviation using a 16×16 systolic array. In a case study on ResNet-34, we use our evaluation tool for a design space exploration (DSE) to show its capabilities in finding an optimal DNN accelerator for this workload.
Figure 1: Overview of our DNN accelerator analysis tool with our analytical model of a systolic array at its center

2 RELATED WORK
Over the last decade, various DNN accelerators have been presented and have established themselves. One very prominent accelerator is Eyeriss by Chen et al. [4]. It implements a 12×14 array of compute elements, each equipped with a small memory to buffer inputs and weights. Its row-stationary dataflow allows for efficient inference by minimizing the data movement to the main memory. However, Eyeriss has a fixed architecture and cannot be scaled for different performance requirements. SIMBA [11] is a chiplet accelerator made from multiple processing elements (PEs). In contrast to Eyeriss, the architecture can be scaled and configured towards the different performance requirements of the DNN. However, both Eyeriss and SIMBA are standalone chips and cannot be integrated into a System-on-Chip (SoC) for full flexibility. The systolic array generator Gemmini by Genc et al. [5] allows generating a fully flexible DNN accelerator design, which can be integrated into an SoC design. Within the Chipyard [1] project, Gemmini can be coupled with a RISC-V processor. Besides the flexible hardware architecture, Gemmini also offers a rich software stack and is compatible with common DNN frameworks.

As stated before, the choice of design parameters for complex DNN accelerators is a yet unsolved challenge. Therefore, some research on modelling these accelerators has been carried out. Timeloop [9] is a flexible tool capable of performing analytical simulation for a wide range of architectures, which can be modelled through a set of primitives. Additionally, it allows defining the mapspace, i.e., how workloads can be mapped to the accelerator, and supports finding optimal mappings. To estimate performance and energy, Timeloop exploits the regularity of DNN workloads to analytically calculate action counts of the various system components. However, Timeloop limits itself to convolutional and fully-connected layers while neglecting other operations, like activations and pooling layers. Further, describing the mapspace requires prior knowledge of the dataflow the targeted accelerator uses, and hence it is difficult to generalize it for all systolic arrays. In contrast, ScaleSim [10] reduces the complexity by limiting itself to the simulation of only systolic arrays. The memory hierarchy inside ScaleSim consists of two input memories for input and weight data, and a separate memory to store results. All common dataflows for systolic arrays (weight stationary, output stationary and input stationary) are supported. Mapping of complex problems onto the compute array is determined automatically. However, ScaleSim performs a cycle-accurate simulation, which leads to a very high simulation time. In addition, ScaleSim lacks support for pooling operations and batched data. To get a very accurate simulation, Chen et al. [3] propose a custom simulation model that reflects the underlying hardware directly. Their model includes various aspects of the accelerator, such as the number of PEs, their arrangement in the array, the mapping and the available bandwidth. When all influences of these parameters are understood, it is possible to analytically determine the performance of the system. The downside of this approach is that such models only work for the specific accelerator they are designed for. The benefit of these models, however, is their close relation to the underlying hardware. Unlike the previously mentioned simulators, no additional modelling of the accelerator is necessary, as all relevant information can be extracted from the software layer.

3 CONCEPT OF OUR ANALYTICAL MODEL
Our analytical model enables fast and systematic exploration of a systolic array to find the best-fitting solution in the vast design space. As stated before, crucial design parameters like buffer sizes, the number of PEs or the interface bandwidths can have a very strong impact on the performance, energy efficiency and chip area consumption of the DNN accelerator. Hence, our model has to deliver fast and accurate estimates of the accelerator characteristics. Therefore, we base our model on the well-established roofline model introduced by Williams et al. [13] and use Accelergy [14] for area and energy estimation. These tools allow us to design a highly abstracted model of the underlying hardware architecture. In general, the roofline model, as shown in Figure 2, can be applied to all computational tasks.
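To make the estimation concrete, the core roofline calculation can be sketched in a few lines of Python (the model itself is implemented in Python, see Section 5). This is a minimal illustration of the textbook roofline formula, not the tool's actual code; the parameter names mirror the peak_comp and bw_peak quantities defined in Section 4.

    def roofline_performance(ops, bytes_moved, peak_comp, bw_peak):
        """Attainable performance [Ops/s] under the roofline model.

        ops         -- total operations of the workload (e.g. MACs)
        bytes_moved -- data moved between memory and the array [Bytes]
        peak_comp   -- compute ceiling of the accelerator [Ops/s]
        bw_peak     -- peak memory bandwidth [Bytes/s]
        """
        operational_intensity = ops / bytes_moved  # [Ops/Byte]
        # The workload is either compute bound (flat roof) or memory
        # bound (slanted roof); the lower ceiling limits performance.
        return min(peak_comp, bw_peak * operational_intensity)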
With the roofline model, we can calculate … between different architectures. We will discuss the calculation of their number for an example architecture in section 4.
3.3 Design Space Exploration
The main objective of our model is to speed up the design process. To enable automatic exploration of valid designs, we first have to put all possible design parameter options into the architecture description. Some parameters might be fixed, like the layout of the on-chip memories, the number of memory banks and the rows per bank. The user can limit the valid design space through a set of Design Space Constraints; for example, a maximum size of the systolic array and maximum sizes of the on-chip memories can be specified. Based on these inputs, our array configuration generator automatically constructs the design space and generates valid architecture descriptions that adhere to all constraints of the accelerator. Our analytical model then evaluates each architecture description for the given DNN workload layer-by-layer and emits action and cycle counts. Based on the action counts, we can estimate energy consumption and area using Accelergy. After the DNN workload has been evaluated on all generated architectures, we can post-process and analyze the results. The first step checks found solutions for constraint violations, like a too large area, and removes them. Next, the Pareto front and the global optimum are determined. The Pareto front is evaluated with regard to area, power, and performance, while the global optimum depends on a user-given target function that, for example, minimizes energy. Finally, all results are visualized and stored. Results can also be imported for later evaluation with a different set of constraints.
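The post-processing described above can be sketched as follows; this is a simplified illustration with assumed data structures (Result and explore are hypothetical names, not the tool's actual interface):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Result:
        area: float    # [mm^2], estimated by Accelergy
        energy: float  # per inference, estimated by Accelergy
        cycles: int    # emitted by the analytical model

    def dominates(a, b):
        """a dominates b if it is no worse in every metric and better in one."""
        le = a.area <= b.area and a.energy <= b.energy and a.cycles <= b.cycles
        lt = a.area < b.area or a.energy < b.energy or a.cycles < b.cycles
        return le and lt

    def explore(results, max_area, target=lambda r: r.energy):
        # Step 1: remove solutions violating constraints, e.g. a too large area.
        feasible = [r for r in results if r.area <= max_area]
        # Step 2: determine the Pareto front over area, power and performance.
        front = [r for r in feasible
                 if not any(dominates(o, r) for o in feasible)]
        # Step 3: pick the global optimum under a user-given target function.
        return front, min(front, key=target)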
4 USING GEMMINI AS AN EXEMPLARY SYSTOLIC ARRAY
To demonstrate and evaluate our proposed analytical model, we selected Gemmini [5], an open-source systolic array generator. Gemmini offers a high degree of freedom in its design parameters and supports a wide range of DNN workloads. It is integrated into the Chipyard framework [1]. The architecture consists of a scratchpad to store operands, an accumulator memory in which results are stored, and the systolic array performing the computations. To integrate Gemmini into our systolic array analytical model, we have to adjust the operation unroll engine to match Gemmini's tiling and account for Max-Pooling, which is performed during write-backs to the main memory. While Gemmini is very flexible, there are some architecture constraints we have to consider. Most importantly, due to the address generation, the size of the systolic array and the number of rows in each memory have to be a power of two. In the following, we denote Gemmini's systolic array size as DIM × DIM. Additionally, the available bandwidth is fixed to 128 bit per transfer. Hence, we define peak_comp = DIM · DIM and bw_peak = 128 bit. To model energy, we added models of the scratchpad, the accumulator and the systolic array to Accelergy.
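These constraints translate directly into code when enumerating candidate configurations. A small sketch, assuming only what is stated above (power-of-two sizes, peak_comp = DIM · DIM and the fixed 128-bit transfer width); the function names are ours:

    def is_pow2(x):
        """Gemmini's address generation requires powers of two."""
        return x > 0 and (x & (x - 1)) == 0

    def peak_parameters(dim):
        """Roofline ceilings for a DIM x DIM Gemmini instance."""
        assert is_pow2(dim), "array size must be a power of two"
        peak_comp = dim * dim  # one MAC per PE and cycle
        bw_peak = 128          # fixed bandwidth: 128 bit per transfer
        return peak_comp, bw_peak

    # valid array sizes in the range we evaluate later (8x8 to 32x32)
    valid_dims = [d for d in range(8, 33) if is_pow2(d)]  # [8, 16, 32]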
To calculate the number of MAC operations and data movements, we analyze all individual instructions that Gemmini can execute. This enables us to model the performance of the DNN inference. Taking each instruction into account, we can also derive action counts for each component, which allows us to estimate the energy consumption.

For the number of MAC operations, we evaluate the two compute instructions: compute_preload and compute_accumulate. The number of MAC operations follows the considerations from Equation 1. To account for spatial fragmentation, we define the scaling factor η for m and n so that they match the array dimension DIM. This way, we can model each of these instructions blocking the full systolic array. For temporal fragmentation, in the compute_accumulate instruction the scaling factor is set to one, as this effect does not occur. For compute_preload, we have to scale l to match DIM, as no more calculations are performed after l cycles, but the next computation cannot begin. To analyze the data movement, we can utilize Equation 2 to get Equation 4.

    η = (DIM / m) · (DIM / n)
    δ_preload = DIM / l
    δ_accumulate = 1                                          (4)
    macs_scale,preload = l · m · n · η · δ_preload
    macs_scale,accumulate = l · m · n · η · δ_accumulate

Besides the performance metrics, we also want to estimate the energy consumption and thus need action counts. For the number of MAC operations, we can apply Equation 1 as discussed previously. Determining the number of memory accesses requires knowledge of the memory organization. For the transfer of a row × col block of data, a total of ⌈col / DIM⌉ · row memory accesses are performed. To match the behavior of Gemmini, we integrate these formulas into the roofline evaluation model of our analytical model.
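Written out in code, the fragmentation factors of Equation 4 and the block-transfer count read as follows (a direct transcription under our own function names; l, m and n denote the tile dimensions used above):

    import math

    def scaled_macs(l, m, n, dim, preload):
        """MAC count of one compute instruction on a DIM x DIM array,
        scaled for spatial (eta) and temporal (delta) fragmentation (Eq. 4)."""
        eta = (dim / m) * (dim / n)        # m, n padded up to the array dimension
        delta = dim / l if preload else 1  # preload blocks the array for DIM cycles
        return l * m * n * eta * delta

    def block_accesses(row, col, dim):
        """Accesses needed to transfer a row x col block: ceil(col/DIM) * row."""
        return math.ceil(col / dim) * row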
5 EVALUATION
Our analytical model is implemented in Python to allow for straightforward integration into common DNN frameworks like PyTorch. As the workload for all experiments, we choose ResNet [7], since it is a well-established CNN and features a wide range of different kernel sizes. The network's input size is 3×224×224, and we set the batch size to one. The results of our model are compared with a cycle-accurate simulation, which is provided by the Chipyard framework [1], and with the state-of-the-art CNN accelerator simulator ScaleSim [10]. Since we target energy- and area-constrained applications like embedded systems, we picked 8×8, 16×16 and 32×32 as systolic array sizes. For area estimation, all components for Accelergy are assumed to be implemented in a 40 nm technology node. All experiments with our model are performed on a single AMD EPYC 7702P core running Rocky Linux; multiple cores can be used to run individual experiments in parallel.
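As an illustration of this integration, the layer shapes a workload description needs can be collected from a PyTorch model with forward hooks; the final evaluate_layer call is a hypothetical stand-in for the model's interface:

    import torch
    from torchvision.models import resnet34

    model = resnet34()
    layers = []

    def record(module, inputs, output):
        # collect shape information for every convolution layer
        if isinstance(module, torch.nn.Conv2d):
            layers.append((module.in_channels, module.out_channels,
                           module.kernel_size, tuple(output.shape)))

    hooks = [m.register_forward_hook(record) for m in model.modules()]
    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))  # batch size one, as in our experiments
    for h in hooks:
        h.remove()

    # cycles = sum(evaluate_layer(shape, config) for shape in layers)  # hypothetical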
5.1 Estimation Accuracy and Simulation Time
The results of the accuracy and runtime evaluation can be found in Figure 4 and Table 1, respectively. The plot depicts the estimated cycle count for the three different array sizes. For each size, the first three ResNet layers, representing all kernel sizes occurring in a ResNet, are shown. Each experiment is performed using our model and ScaleSim. In addition, a cycle-accurate simulation of Gemmini serves as a cycle count reference. The first layer with 7×7 convolutions (L1) has, in contrast to the others, a Max-Pooling operation.

Figure 4: Evaluation of different layer configurations across different array sizes on a logarithmic scale. L1, L2 and L3 represent 7×7, 1×1 and 3×3 convolution operations, respectively
Table 1: Simulation time comparison of our work with cycle-accurate simulation and ScaleSim on a 16 × 16 systolic array

    Workload  | This Work | ScaleSim [10] | Cycle-Accurate
    L1        | 1.8 s     | 165 s (55x)   | 9179 s (5099x)
    L2        | 0.17 s    | 16 s (138x)   | 2201 s (12947x)
    L3        | 0.76 s    | 147 s (18x)   | 2667 s (3509x)
    ResNet-34 | 28 s      | 1 h           | > 48 h

Figure 5: ResNet-34 inference cycle count (in millions) for different memory configurations on a 16×16 array

                         Accumulator memory
    Scratchpad |  64k | 128k | 256k | 512k | 1024k
    memory     |      |      |      |      |
       256k    | 29.3 | 29.1 | 28.8 |  -   |  -
       512k    | 28.8 | 28.0 | 27.9 | 27.8 |  -
      1024k    | 28.4 | 27.8 | 27.6 | 27.6 | 27.7
      2048k    | 28.3 | 27.8 | 27.4 | 27.3 | 27.3
Especially here, the differences between the cycle-accurate simulation and ScaleSim are significant. This can be explained by two factors: First, ScaleSim does not model Max-Pooling operations at all, while in Gemmini pooling and convolutions are fused into one layer. In general, pooling has an impact on the performance estimates. In our experiments, we have observed that 23% more cycles are required for layers with pooling. For this reason, modelling these effects is crucial to accurately model performance. Besides pooling, the underlying mapping plays a role. ScaleSim assumes a different mapping compared to Gemmini, allowing for a higher utilization and therefore a deviating cycle count. We are able to take both of these effects into account in our model. Hence, our model reflects the cycle count observed during the simulation more accurately than ScaleSim. Similar trends can be observed over the different array sizes. Looking at the 1×1 convolution operation (L2), ScaleSim is also unable to accurately reflect the correct cycle counts. The calculated values are too low, since ScaleSim assumes a too high bandwidth, leading to fewer stalls than are actually present. In case of a 3×3 convolution (L3), all tools are able to give close estimates.

Besides accuracy, the simulation time for one design evaluation is another very important metric, since faster evaluation aids faster design space exploration. Table 1 shows the simulation time of our approach compared to ScaleSim and the cycle-accurate simulation using the same array sizes and ResNet layers. It has to be noted that a cycle-accurate simulation of an entire ResNet-34 takes multiple days, making it infeasible for design space exploration. Depending on the workload, the table shows that our approach provides a speed-up of up to 12947x and 138x compared to cycle-accurate simulation and ScaleSim, respectively.

5.2 Impact of Memory and Array Size
Providing sufficient on-chip memory is a major challenge when designing a DNN accelerator. On-chip memory is very expensive; hence, it is advisable to choose the memory sizes carefully to achieve high efficiency. To show how our tool can help to choose the memory size, we perform an exploration of a wide range of memory sizes for a ResNet-34 workload, while keeping the array size fixed at 16×16. For our evaluation we assume an off-chip memory with a fixed latency, since Gemmini has an L2 cache between the off-chip DRAM and the local memories, making this memory hierarchy difficult to model. However, looking at the local memories is still very important, since they have a significant impact on the area. The impact of different memory sizes on the cycle count, determined using our analytical model, is shown in Figure 5. In general, memory sizes affect the tiling of data across the scratchpad and accumulator. Larger memories tend to have a greater impact on area and energy than on performance. A 16×16 array in which the memories are set to the largest configuration (2048k and 1024k) results in 7x more area (in total 11.1 mm²) and only 7% more performance, in comparison to the smallest configuration, which only requires 1.5 mm². Due to the significant increase in area, making the memory larger might not always be the correct optimization choice. In comparison, an increase of the array size from 16×16 to 32×32 with fixed 256k scratchpad and 64k accumulator memories adds 73% area and increases performance by 217%. Hence, a larger array can be a better choice than larger memories. Especially in area-constrained embedded designs, increasing the array size is the preferable choice.
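As a back-of-the-envelope check of this recommendation (our own arithmetic on the numbers above, not a figure from the evaluation), comparing the performance gained per unit of added area for both options:

    # relative performance-per-area of the grown design vs. the baseline
    mem_option = 1.07 / 7.00    # ~0.15: larger memories, efficiency collapses
    array_option = 3.17 / 1.73  # ~1.83: larger array, efficiency improves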
Figure 6: Achieved FPS for ResNet-34 of Pareto optimal array configurations and the associated area and energy

6 CONCLUSION
… over all array configurations amounts to 7%. Compared to state-of-the-art systolic array simulators, we demonstrated an improvement in cycle count estimation accuracy and were able to include more instructions, like Max-Pooling, in our simulation. Moreover, we coupled our analytical model with Accelergy to obtain estimates of energy consumption and area besides the raw cycle count, making a design space exploration feasible. As an example, we performed a case study on ResNet-34, revealing valuable insights into how different design parameters influence energy consumption, area and overall performance.