
An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload


Tim Hotfilter, Patrick Schmidt, Julian Hoefer, Fabian Kreß, Tanja Harbaum, Juergen Becker
{hotfilter,patrick.schmidt2,julian.hoefer,fabian.kress,harbaum,becker}@kit.edu
Karlsruhe Institute of Technology (KIT)
Karlsruhe, Germany

ABSTRACT
Since their breakthrough, the complexity of Deep Neural Networks (DNNs) has risen steadily. As a result, accelerators for DNNs are now used in many domains. However, designing and configuring an accelerator that perfectly meets the requirements of a given application is a challenging task. In this paper, we therefore present our approach to support the accelerator design process. With an analytical model of a systolic array, we can estimate performance, energy consumption and area for each design option. To determine these metrics, usually a cycle-accurate simulation is performed, which is a time-consuming task. Hence, the design space has to be restricted heavily. Analytical modelling, however, allows for fast evaluation of a design using a mathematical abstraction of the accelerator. For DNNs, this works especially well since the dataflow and memory accesses have high regularity. To show the correctness of our model, we perform an exemplary realization with the state-of-the-art systolic array generator Gemmini and compare it with a cycle-accurate simulation and state-of-the-art modelling tools, showing less than 1% deviation. We also conducted a design space exploration, showing the analytical model's capabilities to support an accelerator design. In a case study on ResNet-34, we demonstrate that our model and DSE tool reduce the time to find the best-fitting solution by four or two orders of magnitude compared to a cycle-accurate simulation or state-of-the-art modelling tools, respectively.

CCS CONCEPTS
• Computing methodologies → Machine learning; Modeling and simulation; • Computer systems organization → Systolic arrays; Embedded systems.

KEYWORDS
Analytical Modelling, Neural Networks, Design Space Exploration

ACM Reference Format:
Tim Hotfilter, Patrick Schmidt, Julian Hoefer, Fabian Kreß, Tanja Harbaum, Juergen Becker. 2023. An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload. In Proceedings of the 2023 Workshop on System Engineering for constrained embedded systems (RAPIDO 2023), January 17–18, 2023, Toulouse, France. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3579170.3579258

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
RAPIDO 2023, January 17–18, 2023, Toulouse, France
© 2023 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0045-3/23/01. https://doi.org/10.1145/3579170.3579258

1 INTRODUCTION
Deep Neural Networks (DNNs) have entered more and more domains and areas over the last decade due to their higher prediction performance compared to traditional algorithms. In image recognition, for example, face recognition is used in assistive robotics to support the elderly [12], and in particle physics DNNs support the compression of large data streams [2]. While DNNs already show great performance in many tasks, over time their computational complexity and memory requirements grew rapidly to fulfill even more sophisticated tasks. Especially considering yet unsolved problems such as autonomous driving, the complexity is foreseen to grow even further. This trend poses a challenge to the underlying hardware architecture, which executes the DNN. Since the computation of DNNs is a highly dataflow-driven and memory-bound task, traditional computing devices like CPUs or GPUs cannot keep pace with the fast-rising demands. To address this challenge, dedicated DNN accelerator architectures, like systolic arrays, are currently state of the art. Those DNN accelerators can compute operations in parallel and reuse data to achieve high performance and efficiency. In addition, accelerators can incorporate optimization techniques like pruning or quantization [6].

While DNN accelerators can support fast and efficient inference, the design parameters of such an accelerator have to be evaluated carefully. The level of exploited parallelism, the number of processing elements, and the size of memories and their interfaces have a strong impact on various design metrics like throughput, latency, power consumption or area. Besides the architecture parameters, the DNN workload itself has a strong influence on the performance, since proper mapping of the workload is also important. All these parameters open a large design space from which one solution has to be carefully picked to reach high performance and efficiency. However, due to the complexity of DNN accelerators, determining the hyperparameters and specifications for a given accelerator configuration is costly, since each configuration has to be elaborated individually. Highly accurate results can be achieved through cycle-accurate simulation of the whole workload [8, 10], which takes a long time for each iteration. Considering the large design space, cycle-accurate simulation is not feasible for an extensive design space evaluation.

In this paper, we therefore present our analytical model of systolic arrays, which are a very common type of DNN accelerator. The analytical model is the centerpiece of our evaluation tool, shown in Figure 1. Our analytical model estimates performance, area, and energy consumption for a given design configuration and DNN workload in a fast and accurate way. Therefore, our approach uses the well-established roofline model for performance estimation and bottleneck identification. During the design process, constraints,

Figure 1: Overview of our DNN accelerator analysis tool with our analytical model of a systolic array at its center

Figure 2: Roofline model showing peak computational and memory performance [13]

for example, an upper area or power limit, can be defined. This allows us to find a solution which meets all design requirements. With our analytical model, we are able to evaluate a design in up to 12000x less time compared to a cycle-accurate simulation. We verify found solutions by comparing them against the cycle-accurate simulation, showing less than 1% deviation using a 16×16 systolic array. In a case study on ResNet-34, we use our evaluation tool for a design space exploration (DSE) to show its capabilities in finding an optimal DNN accelerator for this workload.

2 RELATED WORK
Over the last decade, various DNN accelerators were presented and have established themselves. One very prominent accelerator is Eyeriss by Chen et al. [4]. It implements a 12×14 array of compute elements, each equipped with a small memory to buffer inputs and weights. Its row-stationary dataflow allows for an efficient inference by minimizing the data movement to the main memory. However, Eyeriss has a fixed architecture and cannot be scaled for different performance requirements. SIMBA [11] is a chiplet accelerator made from multiple processing elements (PEs). In contrast to Eyeriss, the architecture can be scaled and configured towards the different performance requirements of the DNN. However, both Eyeriss and SIMBA are standalone chips and cannot be integrated into a System-on-Chip (SoC) for full flexibility. The systolic array generator Gemmini by Genc et al. [5] allows generating a fully flexible DNN accelerator design, which can be integrated into an SoC design. Within the Chipyard [1] project, Gemmini can be coupled with a RISC-V processor. Besides the flexible hardware architecture, Gemmini also offers a rich software stack and is compatible with common DNN frameworks.

As stated before, the choice of design parameters for complex DNN accelerators is a yet unsolved challenge. Therefore, some research on modelling these accelerators has been carried out. Timeloop [9] is a flexible tool capable of performing analytical simulation for a wide range of architectures, which can be modelled through a set of primitives. Additionally, it allows defining the mapspace, i.e., how workloads can be mapped to the accelerator, and supports finding optimal mappings. To estimate performance and energy, Timeloop exploits the regularity of DNN workloads to analytically calculate action counts of the various system components. However, Timeloop limits itself to convolutional and fully-connected layers while neglecting other operations, like activations and pooling layers. Further, describing the mapspace requires prior knowledge of the dataflow the targeted accelerator uses, and hence it is difficult to generalize it for all systolic arrays. In contrast, ScaleSim [10] reduces the complexity by limiting itself to the simulation of only systolic arrays. The memory hierarchy inside ScaleSim consists of two input memories for input and weight data, and a separate memory to store results. All common dataflows for systolic arrays (weight stationary, output stationary and input stationary) are supported. Mapping of complex problems onto the compute array is determined automatically. However, ScaleSim performs a cycle-accurate simulation, which leads to a very high simulation time. In addition, ScaleSim lacks support for pooling operations and batched data. To get a very accurate simulation, Chen et al. [3] propose a custom simulation model that reflects the underlying hardware directly. Their model includes various aspects of the accelerator, such as the number of PEs, their arrangement in the array, mapping and available bandwidth. When all influences of these parameters are understood, it is possible to analytically determine the performance of the system. The downside of this approach is that such models only work for the specific accelerator they are designed for. The benefit of these models, however, is their close relation to the used hardware. Unlike the previously mentioned simulators, no additional modelling of the accelerator is necessary, as all relevant information can be extracted from the software layer.

3 CONCEPT OF OUR ANALYTICAL MODEL
Our analytical model enables fast and systematic exploration of a systolic array to find the best-fitting solution in the vast design space. As stated before, crucial design parameters like buffer sizes, the number of PEs or the interface bandwidths can have a very strong impact on the performance, energy efficiency and chip area consumption of the DNN accelerator. Hence, our model has to deliver fast and accurate estimates of the accelerator characteristics. Therefore, we base our model on the well-established roofline model introduced by Williams et al. [13] and use Accelergy [14] for area and energy estimation. Those tools allow us to design a highly abstracted model of the underlying hardware architecture. In general, the roofline model as shown in Figure 2 can be applied to all computational tasks. With the roofline model, we can calculate
Figure 3: Overview of our analytical model

the peak operational performance (comp_peak) and peak memory performance (op_intensity · bw_peak), as well as the operational intensity. The horizontal roof gives the maximum computational performance, while the diagonal line gives the peak memory performance.

3.1 Analytical Modelling of a Systolic Array
The components and computation steps of our proposed analytical model are shown in Figure 3. To compute cycle and action counts of a DNN inference, our model takes two inputs: a DNN workload description that holds the layer shapes, and an architecture description that features, e.g., the array and memory sizes. Those inputs are evaluated in the Operation Unroll Engine, which splits large matrix operations into smaller tiles that match the underlying systolic array size. Based on the tiles, the Action Count Calculation module generates action counts for Accelergy's energy estimation. They represent how often an action is performed by a component, e.g., the number of memory accesses.

The Computation and Data Movement module takes the same outputs from the Operation Unroll Engine and estimates the number of MAC operations performed and the amount of data moved over the bus. MAC operations account for all compute cycles. Their number can be derived from a matrix-matrix multiplication between an l×m input matrix A and the m×n weight matrix B in the systolic array. In total, this accounts for l·m·n MAC operations. However, we cannot map an arbitrary matrix onto a systolic array directly; we have to account for mapping fragmentation effects as defined by Chen et al. [3]. For example, spatial mapping fragmentation occurs when the dimensions of the matrices A or B are smaller than the size of the systolic array. In this case, we have to pad the matrix such that it fits the array size. This is captured by the scaling factor η. For example, a 10×10 matrix multiplication on a 16×16 array gives η = 16/10 = 1.6. Temporal mapping fragmentation can occur when one input matrix is larger than the array. For example, a 24×16 matrix computed on the same 16×16 array leaves eight columns after one full iteration. Thus, in the second pass the array is only 50% utilized, resulting in an overall utilization of 75%. To consider this effect, we use a scaling factor δ. In the example, δ = ⌈24/16⌉ · 16/24 = 4/3. We can then describe the number of scaled MAC operations in each computation with Equation 1.

    macs_scale = l · m · n · η · δ    (1)

For an accurate modelling of the performance, it is also important to consider the data movement, as DNN inference is a very memory-intense task that is strongly dependent on the bus bandwidth. In general, data movement occurs in the form of block transfers between external and on-chip memories. We can view these data blocks as matrices of size row × col which are transferred row-wise. Similar to the systolic array matrices, we have to pad bus transfers. For example, even if only 1 B is transferred over a bus, it still effectively blocks the full bus-width. As such, we scale every single transfer to match the maximum bus-width bw_peak. Individual rows of the data blocks are split up into multiple transfers if they are larger than the bus-width. Additionally, we have to account for idle periods on the bus when data movement and computations do not overlap. This leads to additional overhead, which we account for through a scaling factor α. The calculation of the number of bus transfers N_bw and the scaled data moved is given in Equation 2.

    N_bw = ⌈col / bw_peak⌉ · row
    data_scale = N_bw · bw_peak · α    (2)

Based on the adjusted data movements and MAC operations, we can apply the roofline model to determine the cycle count for a given DNN workload on the systolic array. This evaluation happens in the Roofline Evaluation module by applying Equation 3. Besides the data movements and MAC operations, we also need to take architectural constraints into account. This is done through comp_peak and bw_peak, which represent the number of MAC units in the systolic array and the available bandwidth. With these four variables, we can compute the performance, from which we can calculate the cycle count.

    op_intensity = macs_scale / data_scale
    performance = min(op_intensity · bw_peak, comp_peak)    (3)
    cycles = macs_scale / performance

3.2 Estimation of Energy and Area
To estimate the energy and area of the systolic array, we need the action counts derived from our model. Action counts include information about which module has performed which operation and how often. The collected action counts are evaluated by Accelergy [14] to estimate the energy consumption and area. Accelergy provides a set of primitives, e.g., MAC units and memories, from which more complex architectures can be modelled. In general, a systolic array can be modelled as a DIM × DIM array of MAC units and registers. The on-chip memory can be modelled as banked SRAM.

To get the number of performed MAC operations, we look at Equation 1. For the number of memory accesses, we have to take the memory organization into account. As such, this can differ
between different architectures. We will discuss the calculation of their number for an example architecture in Section 4.

3.3 Design Space Exploration
The main objective of our model is to speed up the design process. To enable automatic exploration of valid designs, we first have to put all possible design parameter options in the architecture description. Some parameters might be fixed, like the layout of the on-chip memories, the number of memory banks and the rows per bank. The user can limit the valid design space through a set of Design Space Constraints; for example, a maximum size of the systolic array and maximum sizes of the on-chip memories can be specified. Based on these inputs, our array configuration generator will automatically construct the design space and generate valid architecture descriptions that adhere to all constraints of the accelerator. Our analytical model then evaluates each architecture description for the given DNN workload layer-by-layer and emits action and cycle counts. Based on the action counts, we can estimate energy consumption and area using Accelergy. After the DNN workload has been evaluated on all generated architectures, we can post-process and analyze the results. The first step checks found solutions for constraint violations, like a too large area, and removes them. Next, the Pareto front and the global optimum are determined. The Pareto front is evaluated with regard to area, power, and performance, while the global optimum depends on a user-given target function that, for example, minimizes energy. Finally, all results are visualized and stored. Results can also be imported for later evaluation with a different set of constraints.

4 USING GEMMINI AS AN EXEMPLARY SYSTOLIC ARRAY
To demonstrate and evaluate our proposed analytical model, we selected Gemmini [5], an open-source systolic array generator. Gemmini offers a high degree of freedom in its design parameters and supports a wide range of DNN workloads. It is integrated into the Chipyard framework [1]. The architecture consists of a scratchpad to store operands, an accumulator memory in which results are stored, and the systolic array performing the computations. To integrate Gemmini into our systolic array analytical model, we have to adjust the Operation Unroll Engine to match Gemmini's tiling and account for Max-Pooling, which is performed during write-backs to the main memory. While Gemmini is very flexible, there are, however, some architecture constraints we have to consider. Most importantly, due to the address generation, the size of the systolic array and the number of rows in each memory have to be a power of two. In the following, we denote Gemmini's systolic array size as DIM × DIM. Additionally, the available bandwidth is fixed to 128 bit per transfer. Hence, we define comp_peak = DIM · DIM and bw_peak = 128 bit. To model energy, we added models of the scratchpad, the accumulator and the systolic array to Accelergy.

To calculate the number of MAC operations and data movements, we analyze all individual instructions that Gemmini can execute. This enables us to model the performance of the DNN inference. Taking each instruction into account, we can also derive action counts of each component, which allows us to estimate the energy consumption.

For the number of MAC operations, we evaluate the two compute instructions: compute_preload and compute_accumulate. The number of MAC operations follows the considerations from Equation 1. To account for spatial fragmentation, we define the scaling factor η for m and n so that they match the according array dimension DIM. This way, we can model each of these instructions blocking the full systolic array. For temporal fragmentation, in the compute_accumulate instruction the scaling factor is set to one, as this effect does not occur. For compute_preload, we have to scale l to match DIM, as no more calculations are performed after l cycles but the next computation cannot begin. To analyze the data movement, we can utilize Equation 2 to get Equation 4.

    η = (DIM / m) · (DIM / n)
    δ_preload = DIM / l
    δ_accumulate = 1    (4)
    macs_scale,preload = l · m · n · η · δ_preload
    macs_scale,accumulate = l · m · n · η · δ_accumulate

Besides the performance metrics, we also want to estimate the energy consumption and thus need action counts. For the number of MAC operations, we can apply Equation 1 as discussed previously. Determining the number of memory accesses requires knowledge of the memory organization. For the transfer of a row × col block of data, a total of ⌈col / DIM⌉ · row memory accesses is performed. To match the behavior of Gemmini, we integrate these formulas into the roofline evaluation module of our analytical model.

5 EVALUATION
Our analytical model is implemented in Python to allow for straightforward integration into common DNN frameworks like PyTorch. As workload for all experiments, we choose ResNet [7], since it is a well-established CNN and features a wide range of different kernel sizes. The network's input size is 3×224×224, and we set the batch size to one. The results of our model are compared with a cycle-accurate simulation, which is provided by the Chipyard framework [1], and with the state-of-the-art CNN accelerator simulator ScaleSim [10]. Since we target energy- and area-constrained applications like embedded systems, we picked 8×8, 16×16 and 32×32 as systolic array sizes. For area estimation, all components for Accelergy are assumed to be implemented in a 40 nm technology node. All experiments with our model are performed on a single AMD EPYC 7702P core running Rocky Linux; multiple cores can be used to run individual experiments in parallel.

5.1 Estimation Accuracy and Simulation Time
The results of the accuracy and runtime evaluation can be found in Figure 4 and Table 1, respectively. The plot depicts the estimated cycle count for the three different array sizes. For each size, the first three ResNet layers, representing all kernel sizes occurring in a ResNet, are shown. Each experiment is performed using our model and ScaleSim. In addition, a cycle-accurate simulation of Gemmini serves as a cycle count reference. The first layer with 7×7 convolutions (L1) has, in contrast to the others, a Max-Pooling operation.
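To make concrete what the analytical model computes for each layer in this comparison, the following minimal Python sketch composes Equations 1–3 with Gemmini's scaling factors from Equation 4 for a single tile. All function and parameter names here are illustrative assumptions, not the actual API of our tool; tiles are assumed to be pre-cut by the Operation Unroll Engine, so l, m, n ≤ DIM.

```python
from math import ceil

# Illustrative sketch (not the tool's actual API) of the per-tile roofline
# estimate: Equations 1-3 combined with Gemmini's eta/delta from Equation 4.

def macs_scaled(l, m, n, dim, preload=False):
    """Scaled MAC count l*m*n*eta*delta (Eq. 1)."""
    eta = (dim / m) * (dim / n)            # spatial fragmentation: pad m, n to DIM
    delta = (dim / l) if preload else 1.0  # temporal factor, preload only (Eq. 4)
    return l * m * n * eta * delta

def data_scaled(row, col, bw_peak, alpha=1.0):
    """Scaled data volume for a row x col block transfer (Eq. 2)."""
    n_bw = ceil(col / bw_peak) * row       # number of bus transfers
    return n_bw * bw_peak * alpha          # each transfer blocks the full bus width

def cycles(macs, data, comp_peak, bw_peak):
    """Roofline evaluation (Eq. 3)."""
    op_intensity = macs / data
    performance = min(op_intensity * bw_peak, comp_peak)
    return macs / performance

# Example: a full 16x16x16 tile on a 16x16 array with a 16 B bus width
dim = 16
macs = macs_scaled(16, 16, 16, dim)             # 4096 MACs, eta = delta = 1
data = data_scaled(row=16, col=16, bw_peak=16)  # 256 B moved
print(cycles(macs, data, comp_peak=dim * dim, bw_peak=16))  # -> 16.0
```

In a full evaluation, such a per-tile estimate would be accumulated over all tiles the unroll engine emits for a layer.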
Figure 4: Evaluation of different layer configurations across different array sizes on a logarithmic scale. L1, L2 and L3 represent 7×7, 1×1 and 3×3 convolution operations, respectively

Table 1: Simulation time comparison of our work with cycle-accurate simulation and ScaleSim on a 16 × 16 systolic array

    Workload    This Work    ScaleSim [10]    Cycle-Accurate
    L1          1.8 s        165 s (55x)      9179 s (5099x)
    L2          0.17 s       16 s (138x)      2201 s (12947x)
    L3          0.76 s       147 s (18x)      2667 s (3509x)
    ResNet-34   28 s         1 h              > 48 h

Figure 5: ResNet-34 inference cycle count (in millions) for different memory configurations on a 16×16 array

    Scratchpad      Accumulator memory
    memory          64k     128k    256k    512k    1024k
    256k            29.3    29.1    28.8    -       -
    512k            28.8    28.0    27.9    27.8    -
    1024k           28.4    27.8    27.6    27.6    27.7
    2048k           28.3    27.8    27.4    27.3    27.3

Especially here, the differences between the cycle-accurate simulation and ScaleSim are significant. This can be explained by two factors. First, ScaleSim does not model Max-Pooling operations at all, while in Gemmini pooling and convolutions are fused into one layer. In general, pooling has an impact on the performance estimates: in our experiments, we have observed that 23% more cycles are required for layers with pooling. For this reason, modelling these effects is crucial to accurately model performance. Besides pooling, the underlying mapping plays a role. ScaleSim assumes a different mapping compared to Gemmini, allowing for a higher utilization and therefore a deviating cycle count. We are able to take both of these effects into account in our model. Hence, our model reflects the cycle count observed during the simulation more accurately than ScaleSim. Similar trends can be observed over the different array sizes. Looking at the 1×1 convolution operation (L2), ScaleSim is also unable to accurately reflect the correct cycle counts. The calculated values are too low, since ScaleSim assumes a too high bandwidth, leading to fewer stalls than are actually present. In case of a 3×3 convolution (L3), all tools are able to give close estimates.

Besides accuracy, the simulation time for one design evaluation is another very important metric, since faster evaluation aids faster design space exploration. Table 1 shows the simulation time of our approach compared to ScaleSim and the cycle-accurate simulation using the same array sizes and ResNet layers. It has to be noted that a cycle-accurate simulation of an entire ResNet-34 takes multiple days, making it infeasible for design space exploration. Depending on the workload, the table shows that our approach provides a speed-up of up to 12947x and 138x compared to cycle-accurate simulation and ScaleSim, respectively.

5.2 Impact of Memory and Array Size
Providing sufficient on-chip memory is a major challenge when designing a DNN accelerator. On-chip memory is very expensive; hence, it is advisable to carefully choose the memory sizes to achieve a high efficiency. To show how our tool can help to choose the memory size, we perform an exploration of a wide range of memory sizes for a ResNet-34 workload, while keeping the array size fixed to 16×16. For our evaluation, we assume an off-chip memory with a fixed latency, since Gemmini has an L2-cache in between the off-chip DRAM and the local memories, making this memory hierarchy difficult to model. However, looking at the local memories is still very important, since they have a significant impact on the area. The impact of different memory sizes on the cycle count using our analytical model is shown in Figure 5. In general, memory sizes affect the tiling of data across the scratchpad and accumulator. Larger memories tend to have a greater impact on area and energy than on performance. A 16×16 array in which the memories are set to the largest configuration (2048k and 1024k) results in 7x more area (in total 11.1 mm²) and only 7% more performance, in comparison to the smallest configuration, which only requires 1.5 mm². Due to the significant increase in area, making the memory larger might not always be the correct optimization choice. In comparison, an increase of the array size from 16×16 to 32×32 with fixed 256k scratchpad and 64k accumulator memories adds 73% area and increases performance by 217%. Hence, it should be considered that a larger array can be a better choice than larger memories. Especially in area-constrained embedded designs, increasing the array size is the preferable choice.
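Trade-offs like these are what the post-processing step from Section 3.3 surfaces automatically: constraint filtering followed by Pareto-front extraction over (area, energy, cycles), where lower is better in every dimension. A minimal sketch is shown below; the design-point values and the max_area parameter are made-up illustrations, not results from our exploration.

```python
# Sketch of the Section 3.3 post-processing: discard constraint violations,
# then keep the Pareto front over (area, energy, cycles), lower being better
# in every dimension. Values below are illustrative, not measured.

def dominates(a, b):
    """a dominates b if it is no worse anywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points, max_area=None):
    # Constraint check, e.g. an upper area limit from the Design Space Constraints
    valid = [p for p in points if max_area is None or p[0] <= max_area]
    # Keep only non-dominated design points
    return [p for p in valid if not any(dominates(q, p) for q in valid)]

# Hypothetical design points as (area_mm2, energy_J, cycles_M)
designs = [(1.5, 20.0, 29.3), (11.1, 22.0, 27.3), (2.0, 18.0, 28.0),
           (2.5, 21.0, 29.5)]                 # last point is dominated
print(pareto_front(designs, max_area=12.0))   # three Pareto-optimal points
```

A global optimum can then be picked from the front with a user-given target function, e.g. min(front, key=lambda p: p[1]) to minimize energy.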
RAPIDO 2023, January 17–18, 2023, Toulouse, France Hotfilter et al.

22 80 design points is infeasible. Our model speeds up the evaluation of


Estimated Energy [J]

a design configuration accurately. We verified our model with a


20 cycle-accurate evaluation of the same architecture, showing less
70 than 1% deviation on a 16×16 array, while the average deviation

FPS
18
over all array configurations amounts to 7%. Compared to state-of-
16 the-art systolic array simulators, we demonstrated an improvement
60 in cycle count estimation accuracy and were able to include more
14 instructions like Max-Pooling into our simulation. Moreover, we
2 4 6 8 10 12 14 coupled our analytical model with Accelergy to get estimates of
Estimated Area [𝑚𝑚 2 ] energy consumption and area, besides the raw cycle count, making
a design space exploration feasible. Exemplary, we performed this
on a case study with ResNet-34, revealing valuable insights on how
Figure 6: Achieved FPS for ResNet-34 of Pareto optimal array
different design parameters influence energy consumption, area
configurations and the associated area and energy
and overall performance.

5.3 Exemplary Design Space Evaluation of ResNet-34

To showcase the insights our analytical model can generate, we use our evaluation tool for design space exploration. We explore ResNet-34 as workload, which demonstrates good prediction accuracy on image processing tasks. For the case study, we assume an inference use case on an embedded compute platform. Our objective is to minimize the energy consumption while maintaining high performance. The clock frequency is assumed to be 700 MHz.

The design space is limited by a set of architecture constraints. We use the same array sizes as before, but apply less restrictive constraints to the memories. The scratchpad memory size can be set between 128 kB and 4 MB and the accumulator memory between 64 kB and 2 MB. Finally, we add a performance constraint: all architectures have to yield at least 30 FPS on ResNet-34 to be considered valid.

With the given constraints, the full design space consists of 57 points, 13 of which are Pareto optimal. Figure 6 shows the Pareto points with the associated area, energy consumption and cycle count. The performance of the design points ranges between 34 and 117 FPS. From the plot, we can see gaps in the performance domain instead of a continuous trend. This is caused by Gemmini's architecture constraints: since array sizes cannot be chosen arbitrarily, we have to move, for example, from a 16×16 array directly to a 32×32 array, resulting in a large performance gap. Note that none of the 8×8 array configurations satisfies the performance requirement. From the results, we determine that the array size is the main indicator of performance. Considering our envisaged use case, we found an energy-efficiency-to-performance sweet spot at an array size of 32×32, a scratchpad memory of 256 kB and an accumulator memory of 64 kB. Our tool estimates this design configuration at 3.25 mm² of area and 14.17 J of total energy per inference. The total performance settles at 59 FPS.

6 CONCLUSION

In this paper, we have introduced our analytical model for systolic arrays. For an efficient and fast inference of a DNN, it is crucial to design the right DNN accelerator for a given application. Since the design space of DNN accelerators is very large and the complexity of the workload is high, a cycle-accurate evaluation of all

ACKNOWLEDGMENTS

This work was funded by the German Federal Ministry of Education and Research (BMBF) under grant number 16ME0454 (EMDRIVE). The responsibility for the content of this publication lies with the authors.

REFERENCES

[1] Alon Amid et al. 2020. Chipyard: Integrated Design, Simulation, and Implementation Framework for Custom SoCs. IEEE Micro 40, 4 (2020), 10–21. https://doi.org/10.1109/MM.2020.2996616
[2] Steffen Baehr et al. 2019. Low Latency Neural Networks using Heterogenous Resources on FPGA for the Belle II Trigger. arXiv:1910.13679 [hep-ex, physics:physics] (Oct 2019). http://arxiv.org/abs/1910.13679
[3] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. 2018. Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks. CoRR abs/1807.07928 (2018). http://arxiv.org/abs/1807.07928
[4] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (Jan. 2017), 127–138. https://doi.org/10.1109/JSSC.2016.2616357
[5] Hasan Genc et al. 2021. Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration. In 2021 58th ACM/IEEE Design Automation Conference (DAC). 769–774. https://doi.org/10.1109/DAC18074.2021.9586216
[6] Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs] (Feb 2016). http://arxiv.org/abs/1510.00149
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. https://doi.org/10.48550/ARXIV.1512.03385
[8] Tim Hotfilter, Julian Hoefer, Fabian Kreß, Fabian Kempf, and Juergen Becker. 2021. FLECSim-SoC: A Flexible End-to-End Co-Design Simulation Framework for System on Chips. In 2021 IEEE 34th International System-on-Chip Conference (SOCC). 83–88. https://doi.org/10.1109/SOCC52499.2021.9739212
[9] Angshuman Parashar et al. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 304–315. https://doi.org/10.1109/ISPASS.2019.00042
[10] Ananda Samajdar, Yuhao Zhu, Paul N. Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN Accelerator. CoRR abs/1811.02883 (2018). http://arxiv.org/abs/1811.02883
[11] Yakun Sophia Shao et al. 2019. Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). Association for Computing Machinery, New York, NY, USA, 14–27. https://doi.org/10.1145/3352460.3358302
[12] Iris Walter et al. 2021. Embedded Face Recognition for Personalized Services in the Assistive Robotics. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer International Publishing, Cham, 339–350.
[13] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM 52, 4 (April 2009), 65–76. https://doi.org/10.1145/1498765.1498785
[14] Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2019. Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8. https://doi.org/10.1109/ICCAD45719.2019.8942149
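The selection procedure in Section 5.3 — enumerate candidate configurations, reject those below 30 FPS, and keep the energy/performance Pareto set — can be sketched in a few lines. This is an editor's illustration: the `DesignPoint` fields, the two-objective dominance criterion, and all numbers are hypothetical, not the actual interface of the paper's evaluation tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DesignPoint:
    array_size: int    # systolic array dimension, e.g. 16 for a 16x16 array
    spad_kb: int       # scratchpad memory size in kB
    acc_kb: int        # accumulator memory size in kB
    fps: float         # estimated ResNet-34 performance
    energy_mj: float   # estimated energy per inference (illustrative unit)
    area_mm2: float    # estimated silicon area

def dominates(q: DesignPoint, p: DesignPoint) -> bool:
    """q dominates p if it is at least as fast and as frugal,
    and strictly better in at least one of the two objectives."""
    return (q.fps >= p.fps and q.energy_mj <= p.energy_mj
            and (q.fps > p.fps or q.energy_mj < p.energy_mj))

def pareto_front(points: list[DesignPoint],
                 min_fps: float = 30.0) -> list[DesignPoint]:
    """Apply the validity constraint (>= 30 FPS), then keep non-dominated points."""
    valid = [p for p in points if p.fps >= min_fps]
    return [p for p in valid if not any(dominates(q, p) for q in valid)]

# Four hypothetical candidates: the 8x8 design fails the 30 FPS constraint,
# the remaining three are mutually non-dominated and form the front.
candidates = [
    DesignPoint(8, 128, 64, fps=21, energy_mj=9.0, area_mm2=1.10),
    DesignPoint(16, 256, 64, fps=40, energy_mj=12.0, area_mm2=2.00),
    DesignPoint(32, 256, 64, fps=59, energy_mj=14.0, area_mm2=3.25),
    DesignPoint(32, 4096, 2048, fps=60, energy_mj=25.0, area_mm2=6.00),
]
front = pareto_front(candidates)
```

Area could be tracked as a third objective by extending `dominates` in the same way; Figure 6 reports it alongside energy for each Pareto point.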
