RM Merged Files
RM Merged Files
https://doi.org/10.1007/s11265-021-01651-5
Received: 4 May 2020 / Revised: 4 November 2020 / Accepted: 10 February 2021 / Published online: 15 March 2021
© The Author(s) 2021
Abstract
In recent years, Convolutional Neural Network CNN have been incorporated in a large number of applications, including
multimedia retrieval and image classification. However, CNN based algorithms are computationally and resource intensive
and therefore difficult to be used in embedded systems. FPGA based accelerators are becoming more and more popular in
research and industry due to their flexibility and energy efficiency. However, the available resources and the size of the on-
chip memory can limit the performance of the FPGA accelerator for CNN. This work proposes an High-Level Synthesis
HLS library for CNN algorithms. It contains seven different streaming-capable CNN (plus two conversion) functions for
creating large neural networks with deep pipelines. The different functions have many parameter settings (e.g. for resolution,
feature maps, data types, kernel size, parallelilization, accuracy, etc.), which also enable compile-time optimizations. Our
functions are integrated into the HiFlipVX library, which is an open source HLS FPGA library for image processing and
object detection. This offers the possibility to implement different types of computer vision applications with one library.
Due to the various configuration and parallelization possibilities of the library functions, it is possible to implement a
high-performance, scalable and resource-efficient system, as our evaluation of the MobileNets algorithm shows.
Our contribution consists of a generic, template-based SDSoC [26] from Xilinx or the OpenCL SDK [10] from
and open source HLS library for a fast implementation of Intel) reduces the programming hurdle and shortens the
CNNs on FPGA-based embedded or HPC systems. The development time of FPGA-based hardware accelerators.
library consists of 7 different layers, which are used in Consequently, many HLS implementations have been
common CNN algorithms. It operates on parameterizable introduced for the acceleration from CNNs, like from
fixed-point data types and floating-point data types, and has Tapiador et al. [30], Zhang et al. [37] or Venieris et al. [32],
been optimized for performance and resource efficiency. to implement energy-efficient and effective HLS-based
The different compile time parameters and data types of neural network accelerators. While some papers present
the library functions offer multiple opportunities for an approaches for cloud applications with sufficient resources
optimized design and extensive design space exploration. [3], others present designs for embedded applications
One benefit is that all functions are streaming capable to with limited resources [37]. For example, Yao et al. [3]
allow a deep pipeline. Creating streaming applications with implemented an HLS-based library for cloud systems, like
multiple nodes or layers gives FPGAs the ability to achieve the AWS. To address the resource limitation on FPGAs,
higher performance and power efficiency for computer many optimizations and implementations were carried out
vision algorithms compared to other architectures, such as to reduce the resource usage. Suda et al. [28] propose an
CPUs and GPUs, as Kalms et al. [14] or Qasaimeh et al. [24] implementation using HLS for a lighter data type (fixed-
show. Furthermore, we have researched and implemented point 16-bit) while our proposed work supports multiple
different possibilities of parallelization in order to achieve a data types (32-bit floating-point and 8-bit, 16-bit or 32-bit
high performance with an efficient use of resources, which fixed-point with an adjustable size of the fraction part).
we show in this paper using our implementation of the Guo et al. [7] proposed a flexible CNN accelerator
MobileNets algorithm [9]. Our library is integrated into with bit-width reduction using quantization, improving
the HiFlipVX library, which is an open source HLS FPGA the performance of OpenCL-based FPGA accelerators for
library for image processing [17] and object detection [15]. CNNs. Liu et al. [19] integrated the pointwise separable
This offers the possibility to design and implement different convolution, which is needed in different neural networks
kinds of computer vision applications with one library. Most like MobileNets. Some of the previous studies focused only
functions of the libraries are based on the OpenVX standard. on the acceleration of the convolution layers of CNNs. For
This simplifies the design of applications on heterogeneous example, Liu et al. [20] only used models with convolution
systems containing different types of architectures (e.g. layers without any Fully Connected layer. Therefore, it is
CPU, GPU and FPGA), due to the different existing hard to be used for accelerating different CNN algorithms.
implementations from different vendors. Memory bandwidth issues in CNNs are discussed
In the following, Section 2 provides information about by Zhang et al. [37] and Zhang et al. [39]. Guan
the related work, Section 3 describes the implementation et al. [6] proposed FP-DNN, which is an end-to-end
of the neural network library and MobileNets, Section 4 framework that automatically generates optimized FPGA-
evaluates the achieved results and Section 5 contains based implementations of deep neural networks (DNNs)
conclusion and outlook. using an RTL/HLS hybrid library. Another HLS based
library is the Caffeine FPGA engine [38] that uses an
HLS-based systolic-like architecture to implement matrix
2 Related Work multiplication kernels. It allows changing parameters such
as the number of (PEs), precision, and feature map size.
State-of-the-art CNN architectures for large-scale visual The proposed CNN library is highly parametrizable, has
recognition use a multitude of layers with millions of a rich set of functions and is therefore applicable for various
computations. FPGA designers for embedded applications algorithm designs. All functions are streaming capable and
encountered three major challenges to efficiently map can be easily connected to each other. High performance
CNNs on hardware such as a difficult programming frame- with an efficient use of resources can be achieved through
work, limited FPGA resources and memory bandwidth. the streaming approach and the various parallelization
Many implementations have been proposed to address the parameters. The integration of the proposed CNN library
above mentioned challenges on FPGAs. Wang et al. [33] into the HiFlipVX image processing library [17], which has
built an RTL library to map neural networks on FPGAs. been extended for object detection [15], increases the range
However, RTL implementations suffer from high costs and of possible applications that can be implemented in the field
time-to-market which makes RTL-based custom hardware of computer vision. Following the OpenVX standard [5]
accelerators infeasible for most cases. makes it easier to create a heterogeneous system consisting
The availability of HLS tools, using OpenCL, C or C++, of different architectures (e.g. CPU, GPU and FPGA)
from FPGA vendors (e.g. Vivado HLS [34] from Xilinx, from different vendors. Since the library does not require
J Sign Process Syst (2021) 93:513–529 515
vendor-specific or other external libraries, it can be ported library have integrated vectorization, which can be applied
to other platforms more easily. This also improves the on their Input Feature Map (IFM) and/or their Output
verification and integration process in frameworks like Feature Map (OFM). Unsigned and signed 8-bit and 16-bit
Tensorflow [1] or Caffe [13]. fixed-point and 32-bit floating-point data types are possible
for the inputs, outputs, weights and biases, to be applicable
for many hardware designs. The size of the fraction can
3 Implementation be configured as a parameter of the function. For an 8-
bit unsigned integer data type, this value can be between 0
This section first describes the architecture and implemen- and 8. If the fixed-point position is set to 5, the fractional
tation of the different Neural Network layers. The library part is 5 and the integer part is 3. Functions that need
contains 7 neural network functions, which are described in trained coefficients buffer them on first use, if configured,
Section 3.2. All functions are streaming-capable to exploit to reduce the amount of global memory access. The fixed-
the advantages of an FPGA. Section 3.3 describes how to point implementations contain policies for rounding and
create an algorithm using the library components by using overflow. If an overflow occurs data can either be truncated
MobileNets as an example. It also adds two additional or saturated to its maximum/minimum value. For fixed-
functions needed to create an efficient implementation. point arithmetic operations, the data can be rounded to zero
or the nearest number.
3.1 The HiFlipVX Library Seven different neural network layers were designed
and implemented: 3D Convolution, Depthwise Convolution,
HiFlipVX is an open source [16] HLS FPGA library for Pooling, Activation, Batch Normalization, Fully Connected
image processing [17], which has been extended for object and Softmax. The I/Os of the different layer functions are
detection [15]. The library contains 46 C++ streaming- the input vector, the output vector and, if required, the
capable functions, which are mostly based on the OpenVX weights vector and the biases vector. Since all functions
standard. OpenVX is an open, royalty-free standard for are streaming capable, we can use the simple AXI4-Stream
cross platform acceleration of computer vision applications interface for the Xilinx implementation. It is a simple
[5]. The library functions are parametrizable using C++ protocol, with ready and valid signal for handshaking
templates and highly optimized for performance and FPGA and the corresponding data signal. The HLS interface
resources. In comparison to the xfOpenCV library from axis directive in the library functions automatically
Xilinx [36], it only consumed in average 39% FFs and 32% creates this interface. For all interface parameters we
Lookup Table (LUTs) for a selected set of functions [17]. use the vx image data<DATA TYPE, VEC SIZE> of
In addition to the OpenVX standard, most functions support the HiFlipVX library. It is a vector data type, with
different vectorization options (2x, 4x, 8x) and additional two additional configurable signals for the AXI4-Stream
data types (8-, 16- or 32-bit signed/unsigned integers). The interface that can be activated by using macros. These
use of vectorization does not only increases performance, signals indicate the Start of Frame (SoF) (last signal) and
but also the energy efficiency, as shown by Akguen et al. [2]. End of Frame (EoF) (user signal) and are needed when
The functions of our proposed library were integrated connected to the DMA or Video DMA (VDMA) blocks
into the HiFlipVX library. They use the same data types and from Xilinx. The remaining library function parameters are
the function headers have a similar structure. Therefore it template parameters (e.g. input/output image size, kernel
is easy to connect the functions of the two libraries, either size, IFM, OFM, etc.).
directly or e.g. by using data-width converters. Furthermore, The library has been optimized and tested for Vivado
certain pre-processing for the CNNs can be done with the HLS [34] and SDSoC [26] 2019.1, but also works with other
functions of the HiFlipVX library, e.g. changing the image versions. Internally SDSoC uses HLS, but builds a complete
size or the image format. The integration also makes it easier system around the accelerated functions containing the
to use existing OpenVX-based frameworks, which did not hardware and software layers. To create such a system,
observe CNNs, like AFFIX [29] or JANUS [22]. SDSoC adds some restrictions that basically affect the
interfaces of the function. One of these limitations is that
3.2 Neural Network Layers only structs with more than 1 element are synthesizable.
This has been solved by automatically using native data
One goal of the library was the streaming capability of types instead of structs for these kind of interfaces. This
the library functions. Since all functions are pipelined, a is also possible, since SDSoC adds the SoF and EoF
pipeline interval of 1 was a key objective to achieve an signals to the AXI4-Stream interface by itself. Furthermore,
optimized performance. Additionally, all functions in the interface arrays need a known amount of elements. The
516 J Sign Process Syst (2021) 93:513–529
proposed library does not need vendor specific or external be a drawback of using HLS. The general equation for
libraries. For some mathematical operations or signals, calculating a 3D convolution is:
Xilinx libraries have been used for a resource efficient
implementation. By using a macro, alternatives are applied y −1 Kx −1
M−1 K
I F
if other tools are used. dsty,x,o =
i=0 n=0 m=0
3.2.1 3D Convolution
× src(y+n− Ky ),(x+m− Kx ),i · Wo,i,n,m +Bo (1)
2 2
The process of 3D convolution is the most computational
intensive layer in most feed forward networks. The main Parallelization The performance benchmark for most 3D
goal of the proposed image and loop dimension ordering convolution layers is the number of multiplications pro-
was to achieve a streaming capable function. Under this cessed per second. For a multiplication mostly internal Dig-
constraint we developed a structure that is optimized for ital Signal Processor (DSPs) are used on an FPGA. When
performance and resource usage. Therefore, the ordering of increasing the number of multiplications, the amount of data
some dimensions is different from the OpenVX standard. that is needed simultaneously and thus the required memory
Listing 1 shows the general structure of the hardware bandwidth increases. To implement an efficient streaming
implementation, which is be explained throughout this capable function, data of the input image as well as the coef-
subsection. Therefore the total latency can be derived ficients should be buffered. This can limit the maximum
from the total number of loop iterations plus the pipeline resolution of the image to be processed. These buffers are
stages. The order of the image and coefficient dimensions usually implemented using Block RAM (BRAM). However,
is: BRAM has a limited bandwidth to read and write data. To
increase the bandwidth, data can be distributed over sev-
– Input Image: BAT CH × ROWsrc × COLsrc × I F M eral BRAMs. However, this can lead to fragmentation, if
– Output Image: BAT CH ×ROWdst ×COLdst ×OF M the BRAM is not fully utilized and can therefore limit the
– Weights: OF M × I F M × Ky × Kx data to be stored. For this reason, fragmentation should be
– Biases: (0) ∨ (OF M) ∨ (BAT CH × ROWdst × kept as small as possible while increasing the amount of
COLdst × OF M) multiplications.
Various loop variables are suitable for parallelization, as
As shown, different sizes for the Bias are possible. A illustrated in Listing 1. One possibility of parallelization
stride is set, when input and output resolution differ. In would be in the direction of the (COL) as in HiFlipVX.
the proposed implementation the stride only effects the However, this type of parallelization would increase the
condition when a result is written to the output. It has no bit-width of various buffers and therefore lead to a high
effect to the latency. However, loop iterations could also fragmentation of BRAM. Additional buffers would also
be skipped in dependence of the stride. However, the used have to be introduced to restructure the input and output
HLS compiler only allows ”perfect loops”, which could data. Therefore, we have concentrated on the parallelization
of the inner loops, as shown by the parameters (Vo ) and
(Vi ) in Listing 1. Both (OF M) and (I F M) parallelization
would increase the bit-width of the coefficient buffer.
Additionally, the parallelization of (I F M) increases the bit-
width of the input buffers. In some cases (Vi ) can be raised
to a certain point without causing additional fragmentation
of the input buffer.
Figure 1 The input stage buffers the input image for the 3D convolu- steps. 1. step reads new input vector of size Vi (dashed lines) while
tion function with a Ky × Kx window/kernel size (here 3 × 3) and an computing the first Output Feature Map [o = 0] 2. step updates win-
input vector size of Vi . The input stage contains input registers on the dow registers (continuous lines). 3. step updates buffers (dotted lines).
left (white), big line buffers (dark gray), small input/window buffers 4. step sends all data from the sliding window registers to the com-
(light gray) and sliding window registers on the right (white). The pro- pute stage (dashed lines). I F M = Input Feature Map; COLsrc = Input
cess of buffering the input image can be separated into 4 pipelined Image Columns.
fragmentation. The input buffer is a line buffer that does not buffer. The other elements of the sliding window get their
have to store the entire image row. Instead, it is sufficient data from the window buffers. Additionally, the algorithm
to store the ( I FViM ) elements of the current iteration of (x). checks whether valid data should be present in the buffers.
The sliding window updates its complete elements in each Otherwise a zero is loaded into the corresponding sliding
clock cycle, because all feature maps of (x) are calculated window elements, to apply zero padding. The proposed
before the window is moved one element to the right. The implementation always applies zero padding of ( K2x ) on
K
window buffers are needed for this, since only 1 element can both sides in x-direction and of ( 2y ) on both sides in
be read from each line/input buffer in a clock cycle. Each of y-direction.
the (Ky · (Kx − 1)) window buffers has ( I FViM ) elements. 3) Update Buffers: This stage reads the data from the
The different computation steps of the 3D convolution are window and writes it to the different buffers, as shown in
described below in chronological order. Figure 1 by the dotted lines. The input buffer receives its
1) Read Input Vector: Reads vector of Vi elements from data from the bottom left element in the window. Since the
the input image, if the following condition is met: (y ≤ input data can only be read once, it must be buffered. The
ROWsrc ) ∧ (x ≤ COLsrc ) ∧ (o = 0). line buffer receives its data from the right column of the
window in the last iteration of (o = OFVo − 1). This moves
2) Update Sliding Window: In this stage, the data is read M
from the different buffers and stored in the sliding window, the data of the image one line up so that it is available again
as shown in Figure 1 by the dashed lines. Each element in at the next iteration of (x). The window buffer receives its
the sliding window of size (Ky × Kx ) contains (Vi ) vector data from the left columns of the window in the last iteration
elements. The left column of the sliding window get its of (o = OF Vo − 1). The sliding window effect results from
M
data from the line buffers and the input buffer. If (o = 0) the offset reading and writing between the window buffer
new data is read from the input image instead of the input and the window.
518 J Sign Process Syst (2021) 93:513–529
Figure 2 Computation stages of the 3D convolution implementation. gray) buffer weights/biases if configured. I F M = Input Feature Map;
The input comes from the sliding window of the input stage (Figure 1). OF M = Output Feature Map; Vi = Input Vector Size; Vo = Output
Some stages are for floating-point or fixed-point numbers only. Buffers Vector Size; Kx × Ky = Kernel Size.
for weights/biases are marked in light gray. Reader functions (dark
4) Compute Convolution: Figure 2 shows the computa- is checked for overflow and saturated if the corresponding
tion stage of the 3D convolution process. As stated, some policy is set.
of the blocks in the image are only used for fixed-point 4) Write Output Vector: Writes back a vector
or floating-point calculations. The gray blocks show the of Vo elements to the output image, if the follow-
K ROWsrc −1
data whose contents needs to be maintained between loop ing condition is met: (y − 2y ) mod ( ROW −1 ) ∧
iterations and stored in buffers. The weight and bias coeffi-
dst
COLsrc −1
cients can be buffered within the function if the user sets the (x − K2x ) mod ( COL dst −1
) ∧ i = I FViM − 1 . The
appropriate parameter. On first use, they are read from the condition includes the stride computation, expressed with
interfaces and stored in the buffers. If the same coefficients the modulus operation. The value for the stride must be an
are needed again, they can be accessed from the buffers. element of the natural numbers.
In the first step, the input data is taken from the sliding
window and multiplied by the corresponding weights. In 3.2.2 Depthwise Convolution
total (Vo × Vi ) 2D-convolutions of the size (Ky × Kx ) are
calculated. Then (Vi ) 2D-convolutions of the different (Vo ) The Depthwise Convolution can be considered as a 2D
are added together to partially calculate the 3D convolution convolution that is applied to each feature maps of a 3D
of each (Vo ). input image separately. This layer is usually used together
The operation of calculating a sum over several loop with a “pointwise” convolution of (1 × 1), as in MobileNets
iterations violated the desired pipeline interval of 1 by a [9]. This means that for a (3 × 3) convolution, a (3 × 3)
factor of 5 when using floating-point numbers with the pointwise convolution and a (1 × 1) pointwise convolution
Xilinx tools. Therefore, we convert floating-point numbers is used. The advantage of this approach is that less
for this summation to a value that is saturated to a 32- multiplications and weights are required for the convolution
bit fixed-point number. The user sets the parameter for process. Comparing it to a classic 2D convolution, it has a
the fixed-point position of this variable. In the next step similar effect to the separable filter shown in [17].
the partial 3D convolutions are added to the final 3D The amount of feature maps in the input and output
convolution until all 2D-convolutions are summed up. Then image are the same for this function. When comparing
the result is converted back if the final output should be a with the structure of Listing 1, the loop over OF M is
floating-point number. eliminated. Consequently, the total latency is reduced by
When using fixed-point numbers, the multiplication that factor and there is only one parallelization term (Vi ).
in the 2D-convolution increases the fixed-point position. The rest of the basic structure in Listing 1 remains. The total
Therefore, the value is shifted back to the fixed-point number of multiplications and weights is reduced by a factor
position, while ensuring the overflow policy. This process is of OF M compared to a pointwise convolution. Therefore
done before adding the bias, because it has the same fixed- fewer weights must be stored in the internal buffers. On
point position as the output. After adding the bias, the result the other hand, the size of biases remains unchanged. It
J Sign Process Syst (2021) 93:513–529 519
the third dimension of size I F M. It first calculates the mean Batch Normalization is very resource-efficient. The latency
(μ) of the pixel values, as shown in Eq. 4. Using the the of the hardware function is: ROW S · COLS · I FVM + P .
mean value, the variance (σ 2 ) is calculated, as shown in
Eq. 5. Using the mean, variance and a set of pre-trained 1
values (γi , βi ), the output image pixels are calculated, as dsti = γi · (srci − μi ) · ci + βi , ci = (7)
shown in Eq. 6. σi2 +
I
FM
1 3.2.6 Fully Connected
μ= · xi (4)
IFM
i=1
The Fully Connected layer is an essential component of
I
FM most CNNs. It is one of the last layers and is used for
1
σ2 = · (xi − μ)2 (5) the final classification decision. Simplified it is a 3D
IFM
i=1 convolution with a (1 × 1) kernel on an image with (1 × 1)
xi − μ pixel. However, the I F M and OF M can be very large.
yi = γi · √ + βi (6) The weights, biases and input image are buffered on first
σ2 +
use. However, since each weight/bias is read only once per
A straightforward way to compute this function would
image, it is recommended not to buffer them if the weight
be in three separate loops iterating over the 3rd dimension,
matrix becomes too large. The summation of Eq. 8 has been
nested in the loops iterating over the 1st and 2nd
implemented using fixed-point numbers for the floating-
dimensions. With this approach only one output pixel is
point implementation. Therefore, the multiplication result
generated every 3 clock cycles for a parallelization degree
is converted and saturated to a 32-bit wide number before
of zero. Therefore, we created three functions to compute
summation and converted back afterwards. Fixed-point
μ, σ 2 and yi , which are used in a pipelined manner inside
numbers were used, since a summation with floating-point
the three nested loops. As a result, the overall latency is as
numbers increased the total latency by a factor of 5. The
follows: (ROW S · COLS + 2) · I FVM + P . Two times I FVM
fixed-point position is set by a parameter. Depending on the
additional clock cycles are required, since each mini-batch
degree of parallelization, V multiplications are calculated
must pass through these three stages in a pipeline manner.
in parallel and added together. After summation, the data
The input data of a mini-batch (B) is stored in a buffer in the
must be shifted back due to the fixed-point multiplication
first stage to be used for the next two stages. Since there are
according to the rounding policy. Then the bias is added.
3 stages, 3 consecutive input vectors of size I F M must be
When using fixed-point values, the result is converted back
stored in buffers. The weight vectors γ and β are read in the
to the output format according to the overflow policy. The
third stage at the first use and buffered for the further use.
latency of the hardware function is: OF M · I FVM + P .
To calculate (μ) and (σ 2 ) a sum of values must
be computed. Like in the convolution filter, the sum is
computed using fixed-point numbers, because a floating- I
FM
point sum increases the latency by a factor of 5. Therefore, dstof m = (srcif m · weightof m,if m ) + biasof m (8)
floating-point numbers are converted and saturated to a 32- if m=1
bit wide integer value. Since the normalization ( I F1M ) is a
constant, it can be pre-computed to replace the division by a 3.2.7 Softmax
multiplication. Both calculations are easy to vectorize, since
only the sum needs to be parallelized. For the parallelization The Softmax layer normalizes an input vector into a
of the last stage which computes yi , the term √ 12 can be probability distribution and limits the output to a range
σ + between 0 and 1. It is used to determine the probability of
pre-computed once. This has a big impact on the resource several classes at once. The calculation shown in Eq. 9 is
usage when vectorizing the function, since the division and done in two parts. The first part computes the sum and stores
square root are the most resource consuming functions. Due the exponents of the inputs into a buffer. Due to the high
to the accuracy, √ 12 is calculated using floating-point range of values in this function all operations are done using
σ +
numbers. floating-point operations. However, for the same reason
There are different variants of the Batch Normalization. as in the previous functions, the summation is calculated
One of them avoids the calculation of (μ) and (σ 2 ). In using fixed-point numbers. Therefore the exponent result
this variant, the two values are passed to the function as is converted and saturated into a 32-bit fixed-point number
additional parameters, as shown in Eq. 7. Since both values before summation. For each element in the input vector,
are constants, the value of (ci ) can be pre-calculated after (V ) exponents are calculated, stored and added to the
training the neural network. As a result, this variant of the summation. The second part calculates the division of
J Sign Process Syst (2021) 93:513–529 521
Eq. 9. For fixed-point numbers, the division result must The modules 2 to 14 all have the same structure with
be shifted (multiplied) to the correct position according different parameter settings. The first three layers, which
to the rounding policy. Depending on the parallelization include a depthwise convolution, Batch Normalization and
degree, (V ) output elements are computed. The latency of activation layer, all have the same degree of parallelization
the hardware function is: 2 · I FVM + P . (VDW ). Again, the data for pointwise convolution must be
converted to (VI F M ) and then to (VOF M ). The last two
esrcif m layers have a parallelization degree of (VP W ). With these
dstif m = I F M (9)
srci ) 4 vectorization parameters (VDW , VI F M , VOF M , VP W ) the
i=1 (e
optimal configuration for the desired amount of resources
3.3 MobileNets Architecture can be found, as shown in the evaluation. Data width
converters can also be connected between the various
MobileNets [9] were presented by Google Inc. and were modules. However, they were not needed in the final
developed for mobile and embedded vision applications. configuration.
MobileNets utilizes a combination of depthwise separable Module 15 contains the last layer and its output is
convolutions and pointwise convolution to form lightweight therefore connected to a e.g. 64-bit wide DMA via a data
deep neural networks. These networks also introduced two width converter. This module only needs the parameter
global parameters for the width and resolution multiplier (VDW ) for pooling and the parameter (VI F M ) for the input
to define different sizes of the networks. The different of the Fully Connected layer. The Softmax layer is not
networks based on these parameters have different latency computationally intensive enough to become a bottleneck.
and accuracy. This allows using optimum networks to match In general, the vector parameters must be set so that no
the design requirements of the system. single layer becomes a bottleneck, since the slowest layer
MobileNets architecture is based on depthwise separable limits the speed of the others. The different modules contain
convolution as mentioned before. A standard convolution scatter engines to distribute all coefficients to the local
can be factorized into a depthwise convolution and a buffers. This allows all coefficients to be preloaded with
pointwise convolution. Depthwise separable convolution optimal utilization of the memory bandwidth. The scatter
separates filtering and combines inputs into two layers, engine also reduces the number of DMAs needed to access
on contrary to a standard convolution. This factorization memory to load new coefficients. They require data width
of the convolution layer results in a reduction in model converters, since each local buffer has a different depth of
size and computation requirements of the algorithm. This its elements depending on the degree of parallelization of
concept is used in MobileNets in order to have light- the corresponding layer. The HiFlipVX data converter can
weighted neural networks. The first layer of MobileNets also convert between widths that are not multiples of each
is a full convolution layer. Later layers are a combination other. Therefore the data for the different local buffers must
of depthwise convolution and pointwise convolutions. All be aligned to the data type of the scatter engine in global
convolutions are followed by a Batch Normalization layer memory.
and activation layer (ReLU). The final Fully Connected
layer has no non-linearity and is followed by a Softmax 3.4 High-Level Synthesis Directive Usage
layer. Before the final Fully Connected layer an average
pooling is used to reduce the spatial resolution. In total In this work we use 8 different directives (pragma
MobileNets has 28 layers. HLS): inline, interface, data pack, dataflow,
Figure 3 shows the hardware implementation of the stream, resource, pipeline, array partition.
different layers of MobileNets, which is parameterizable. All internal and callable library functions are inlined using
The different modules are interconnected, with module 1 the inline directive.
containing the first layer and module 15 the last layer, The interface directive is only needed in wrapper
creating a very deep pipeline. The input of the first layer functions, which instantiate the library functions and set the
in module 1 is connected to a data width converter and template parameters. There is an example test bench for
gets its data from the global memory. To optimize the each function of the library and the different MobileNets
memory bandwidth, it receives an e.g. 64-bit wide input layers in the main file. When using Xilinx SDSoC no
and converts it to the desired vector size (VI F M ) of the 3D interface directive is be needed. For the SDSoC tool
convolution. The output vector (VOF M ) is then converted we set the ap fifo protocol for all ports. For Xilinx
to the vectorization (VP W ) of the Batch Normalization and Vivado HLS we set the AXI4-Stream (axis) protocol
activation layers. All layers and conversion units of the as interface for the ports. It is a simple handshaking
pipeline are connected via very small FIFO buffers. They protocol most Xilinx IP-Cores use. Additionally, we
are marked by a thicker line in the figure. deactivate the control port of all IP-Cores in Vivado HLS
522 J Sign Process Syst (2021) 93:513–529
Figure 3 Block design of the MobileNets hardware implementation. buffers are marked in light gray. Data movers blocks are marked in
MobileNets has been separated into 15 modules. The modules are dark gray. Multiple scatter units can be connected to the same DMA.
directly connected to each other in the order of their numbering. Local
(ap ctrl none port=return). This port should not The dataflow directive enable task-level pipelining. It
be deactivated for SDSoC. The ( SDSCC ) macro is is needed to create the streaming applications in the three
globally set by the SDSoC tool and is used by our different MobileNets layers shown in Figure 3. To enable
library to automatically switch between the two Xilinx streaming between the different functions of the MobileNets
tools. Setting the ap fifo ports and using the C99 layers, FIFOs are needed. Therefore, the stream directive
style for arrays, our library does not need any specific is used for these FIFOs using a depth of 8. The small
SDSoC directives (pragma SDS). As mentioned in depth allows to use LUTs instead of BRAMs for the FIFOs,
Section 3.2, we use the (vx image data<DATA TYPE, since BRAM is often a limiting resource. Within all library
VEC SIZE>) data type for the function ports, to apply functions we use the pipeline directive with the goal to
vectorization and set the last and user bits of the AXI4- achieve an initiation interval of one. Since all loops below
Stream protocol if needed. To achieve the full bandwidth, the pipeline directive are unrolled automatically, there is
all callable library functions use the data pack directive no need of using the unroll directive.
for their ports. Additionally, we use the data pack Therefore, the resource directive is set to
directive for our internal buffers and FIFOs, to reduce the FIFO LUTRAM for these FIFOs. For most internal buffers,
fragmentation of the utilized (BRAMs). shown in Figures 1 and 2, we set the resource directive to
J Sign Process Syst (2021) 93:513–529 523
use LUTs (RAM 2P LUTRAM) or BRAMs (RAM 2P BRAM) images that are processed one after the other. Increasing
depending on their size. RAM 2P BRAM has been used for this parameter has the advantage that a function can read in
most weight and line buffers. RAM 2P LUTRAM has been pixels of a new image before it has finished the calculation
used for most bias, window and input buffers. The use of of the last image. Coefficients can be buffered on first use,
the resource directives should be used with caution, which does not need to be repeated for the other input
since it can also have a negative effect. In most cases it is images (batches). The batch size can be set for all functions
advisable to give the tool the choice, because then it can in the library. The input resolution can differ from the output
select according to the total resource usage, bit-widths and resolution, but has to be bigger. This is only possible for
selected frequency. The array partition directive is the two convolution functions and the pooling function to
needed if the LUT and BRAM memories do not provide implement a stride. Only the 3D convolution and the Fully
the required bandwidth. E.g. to separate the window buffers Connected layers have both (I F M) and (OF M), all other
of Figure 1. The array partition directive is also functions only have one feature map (I F M). For resolution,
used quite often to completely partition C++ arrays into batch amount and the feature map size we allow a value
registers, like for the white boxes in Figure 1. between 1 and 2028. The bias size can be (0), (OF M) or
(BAT CH ES·OF M) for the two convolution functions and
the Fully Connected layer. The kernel size can be changed
4 Evaluation for the two convolution functions and is (n × m), where (n)
and (m) can be different but must be odd numbers and must
In this section a detailed evaluation of the different be in the range of 1 and 9. It is the same for the pooling
functions of the library is made. Different parameter settings size, but the numbers can also be even. Pooling and padding
are evaluated to make general assumptions. Furthermore, sizes can only be set for the pooling function. The padding
designing larger algorithms is evaluated using MobileNets. size can be between 0 and the half of the pooling size. The
convolution functions automatically
use a padding, which is
4.1 Single Layers Ky,x
the half of the kernel size 2 .
On the right side of the table there are parameters
This part evaluates the different layers of the proposed that are more specific to the FPGA design, such as
library. We tested the design on a ZCU104 MPSoC FPGA frequency changes. The (VI F M ) parallelization is used
from Xilinx using the 2019.1 tool chain including SDSoC in all functions. The (VOF M ) parallelization is only
and Vivado HLS. To obtain the implementation results, needed for 3D convolution and Fully Connected layers,
we built a design with SDSoC and took the results of the for exploration and to further improve the performance.
single functions from the Vivado project. All functions in For both parallelization parameters we allow a value
the library have several parameters which can be changed at between 1 and 128. We allow different data types for
compile time. Table 2 shows the default configuration of the the inputs/outputs and weights of the different layers
parameters of the different layers tested in this evaluation. (int8, uin8, int16, uint16, float32). The
The table also shows the high configurability of the library. biases can have a different data type if fixed-point
On the left side of the table are the normal parameters numbers are used (int8, uin8, int16, uint16,
of a neural network, which are also needed in non-FPGA int32, uint32, float32). This approach has been
designs. Additionally we support 2 pooling types and 9 suggested by some CNN algorithm implementations. The
activation function types. In our terminology, batches are fixed point position determines the size of the fraction
and must be below the number of digits of the data type.
For signed data types, at least 1 bit is required for the
Table 2 Default configurations for the changeable parameters of the integer part. For arithmetic calculations, mainly for fixed-
different layers. point numbers, we must check for overflow and perform
batches 4 vif m 1 the wanted rounding policy. If an overflow occurs data can
input 64x64 vof m 1 either be truncated or saturated to its maximum/minimum
output 64x64 frequency 100 Mhz value. For fixed-point arithmetic operations, the data can be
IFM 32 data type uint8 rounded to zero or the nearest number. Coefficients (weights
OFM 32 bias data type uint8
and biases) can be buffered during execution within the
bias size OFM fixed point position 8
function (buffer coefficients). In Figure 3, this
kernel size 3x3 overflow saturate
is done outside the function to increase efficiency of the
coefficient reading process.
pooling size 2x2 rounding to zero
To verify the correctness of the library functions, we
padding size 1x1 buffer coefficients yes
calculate the mean absolute percentage error (MAPE) of
524 J Sign Process Syst (2021) 93:513–529
Table 3 MAPE (mean absolute percentage error) between the 32- large for FFs and LUTs. The implementation results in
bit floating point baseline software implementation and the various the table do not include the additional blocks that SDSoC
hardware implementations.
integrates into the HW design. The DSP behavior is dif-
layers uint16 int16 float32 ferent because we let the tool decide whether to use LUTs
or DSPs for the arithmetic calculation, as this can vary
3D convolution 0.3413 0.6804 0.00003 depending on the application. The Fully Connected layer
depthwise conv. 0.0127 0.0261 0.00000 usually has many coefficients and therefore requires a lot
pooling (max) 0.0000 0.0000 0.00000 of BRAM. Therefore, it may make sense not to buffer
activation (relu) 0.0000 0.0000 0.00000 the weights, since each weight is only required once per
batch normalization 0.0390 0.1012 0.00004 batch. The 3D convolution consumes more BRAM than
fully connected 0.0000 0.3421 0.00000 the depthwise convolution because it has to buffer more
softmax 0.2104 0.4245 0.00001 coefficients.
In addition, the table shows the estimated latency per
batch. As it is well known, the process of 3D convolution
our hardware implementation compared to a floating-point is the most computationally intensive part in many CNN
baseline one. Table 3 shows the results using the default algorithms and must therefore be parallelized more. The
configuration and quantized random input numbers in the Softmax function is the least computationally intensive
range between 0 and 1, where {x ∈ R|0 ≤ x < 1}. The function and could therefore be executed on a CPU in a
calculation of MAPE has a problem if the divisor is zero. HW/SW co-design, as this function is also quite resource
Therefore we do not consider results where the divisor is intensive. By adding the proposed multi-stage pipelining
less than 10−6 . The fixed-point positions for the data type approach, Batch Normalization can calculate the three
in the table are 16 (uint16), 15 (int16) and 24 (float32). The internal functions almost in the same time as the activation
MAPE of 0.68% for the 3D convolution is due to the high layer. Due to this approach and the computationally
number of multiplications and additions for each output intensive operations like division and square root, more
pixel. A similar behavior can be observed with the other resources are needed. Depthwise convolution and pooling
functions, where many variables have to be added and/or require some additional cycles due to the line buffers.
multiplied together. The float32 computation can have a Table 5 shows the resource usage of the implemented
very small error for the functions that have to calculate a design from the various activation functions using unsigned
sum over several loops, because we had to use fixed point 16-bit data types. As expected, all functions that include an
arithmetic for this summation. Of course, if numbers had to exponent, logarithm, or division in their equation consume
be saturated, the MAPE would be higher, but this was not more resources. Using exponential functions instead of
meant to be proven by this approach, since it is generally the the hyperbolic functions could reduce resource usage. For
case for fixed point numbers. the square root function, there is an option for relaxed
Table 4 shows the resource utilization of the implemented mathematical calculation to reduce resource usage by
and synthesized (grey) designs using the default configura- reducing the precision of the fraction part. The difference
tion. In this table, the Softmax and Fully Connected layers in accuracy can be seen with a MAPE of 0.37 %. Due to
have 256 I F M and 256 OF M, since the resolution for the accuracy, mainly floating point operations were used
these layers is (1 × 1). As it can be seen from the table, for the computational-intensive functions. However, due to
the difference between the estimated synthesis results from quantization, there is still a small error rate left for these
SDSoC and the implemented results from Vivado is quite functions.
Table 4 Resource utilization and latency per batch of implemented (black) and synthesized (grey) designs.
Fully Connected and Softmax layers have 256 IFM and 256 OFM
J Sign Process Syst (2021) 93:513–529 525
Table 5 MAPE (mean absolute percentage error) and resource FFs (43% on average), but also the LUTs (8% on average).
utilization of the implemented design of the various activation However, it has no effect on the BRAMs or DSPs. For
functions using unsigned 16-bit data types.
designs with higher accuracy or for fast integration and
BRAM DSP FF LUT MAPE testing, the library also supports floating point numbers.
They have no effect on the latency of the various library
logistic 0 15 1362 2146 0.00126 functions, except for additional pipeline stages, but have a
hyperbolic 0 17 1549 2370 0.02543 high impact on resource utilization: +432% LUTs, +784%
relu 0 0 26 17 0.00000 FFs and +423% DSPs. When using 16-bit fixed-point values
brelu 0 0 26 25 0.00000 to increase accuracy, there is only a small increase for
softrelu 0 28 1418 2112 0.00144 LUTs (25%), FFs (17%) and DSPs (2%). This again shows
abs 0 0 26 17 0.00000 the importance of quantization in FPGAs. The BRAM
square 0 1 28 28 0.00000 usage always scales with the bit width of the data type
sqrt 0 0 164 384 0.00139 used. Increasing the kernel size has a similar effect for 3D
sqrt (relaxed) 0 0 113 243 0.36990 and depthwise convolution. In both cases the DSPs grow
linear 0 0 26 25 0.00000 with the kernel size. The BRAM increase depends on the
coefficient size (ky × kx ) and the line buffer amount of
(ky − 1). LUTs and FFs are only increased by 85% and 65%
respectively for 2.78× the amount of weights.
Figure 4 shows the relative resource usage for various A more detailed investigation of parallelization was done,
parameter settings compared to the default configuration. because finding the right parameters is important for an
As expected, a change in frequency mainly increases the efficient and performant design. The Batch Normalization
Figure 4 Relative resource utilization for various settings compared to the default configuration. Value is not reported if it is zero. 3D convolution
has a vectorization of vif m × vof m .
526 J Sign Process Syst (2021) 93:513–529
layer scales well with parallelization because resource- configuration with (VOF M = 8), (VI F M = 8) and a fre-
intensive functions do not need to be calculated multiple quency of 200 MHz, an acceleration of 260 was achieved
times as described in Section 3.2.5. Only the increase when the convolution function was executed on the real
in DSPs approximates to a linear behavior. The DSPs system using SDSoC. The measurements were performed
of all other functions scale linearly with the degree of with the ARM processor, on which no operating system is
parallelization. The LUTs and FFs of the Pooling layer running. The consumed resources for the convolution func-
scale less than linearly with the degree of parallelization. tion are: 8858 LUTs, 7679 FFs, 576 DSPs and 66 BRAMs.
The Fully Connected layer even shows a reduction of FFs The BRAM has increased due to fragmentation and a high
and BRAMs due to fragmentation. The 3D convolution has demand of on-chip bandwidth. The execution time of the
a combined vectorization of (VI F M × VOF M ). Different hardware is 837μs, which includes the cache flushing and
combinations of (VI F M ) and (VOF M ) were tested to find an data movement between the FPGA and DMA.
optimized combination. The combined vectorization results
in a parallelization (V ) for 2 (1 × 2|2 × 1), 4 (4 × 1|4 × 4.2 MobileNets
4|2 × 2), 8 (8 × 1|1 × 8|4 × 2|2 × 4) or 16 (16 × 1|1 ×
16|8 × 2|2 × 8|4 × 4). Some assumptions can be made when Before implementing the MobileNets layers onto hardware,
comparing these combinations. The greater the imbalance the optimal parameters must be set. When creating a deep
between (VI F M ) and (VOF M ), the more resources are used pipeline, the system normally is as fast as its slowest
on average. If, for the same (V ), (VI F M ) is greater than component. Table 6 shows our offline calculations for an
(VOF M ), the average usage of LUTs and FFs increases optimal setting of the different modules containing the
slightly by 6% and 10% respectively. On the contrary, a high MobileNets layers. All parameters, which are not reported
(VI F M ) can cause more BRAM to be used if it worsens line in the table use the default configuration. The parameter
buffer fragmentation. values for the resolution and feature maps are set by the
Additionally, one 3D convolution layer has been imple- algorithm. Using the latency equations of Section 4.1, the
mented with a high parallelization to show the perfor- estimated latency can be calculated. The number of pipeline
mance improvement in comparison to a baseline imple- stages was ignored in this estimation as it has almost no
mentation, which is running on the ARM processor of the impact. The maximum latency in the right column shows the
ZCU104 MPSoC at a frequency of 1.2 GHz in release mode bottleneck of the design. In the next step, the vectorization
using the O3 optimization option. Using the same default (V ) settings discussed in Section 3.3 are adapted to improve
Table 6 Shows proposed vectorization (V ) setting for MobileNets layers of Section 3.3.
input output IFM OF M vdw vif m vof m vpw dwconv dwbn pwconv pwbn max
Latency is calculated for functions in Figure 3 separately without pipeline stages. Depthwise (dw) & pointwise (pw) latency of Batch
Normalization (bn) & convolution are reported. Maximum latency of all functions within a layer is shown on the right
J Sign Process Syst (2021) 93:513–529 527
Table 7 Final results of the three MobileNets modules shown in is some overhead for streaming multiple functions in a
Section 3.3 executed separately on the ZCU104. pipeline, data moving between the DDR and cache flushing.
module 1 module 2 module 15 This overhead is 90.8%, 76.6% and 52.9% for the modules
1, 2 and 15. When executing the layers sequentially, these
ARM (ms) 34.166 53.221 9.944 numbers would be higher. To verify the propagation of the
FPGA (ms) 0.966 0.902 0.489 error, the MAPE value was computed for 16-bit unsigned
speed-up 35.4 59.0 20.3 fixed-point numbers. It was 0.21%, 0.79% and 0.78% for
LUT 11881 16914 10579 the modules 1, 2 and 15. The resources listed in the table
FF 13265 16660 5773 contain only the modules and no DMAs. When considering
DSPs 237 140 27 the ZCU104 the resource usage is sufficient to fit all layers.
BRAM 1 20 263.5 For this case, the Ultra Rams would be needed and the Fully
Connected layer in module 15 should not buffer its weights.
The FPGA runs at 200 MHz
4.3 Comparisons to Related Work
the max latency, while keeping the available resources for Hassan et al. [8] presented a HW/SW co-design implemen-
DSPs and BRAMs into account. Since these two resources tation of AlexNet on an FPGA. They performed the first
can be easily estimated and are in most cases the limiting layer of AlexNet on hardware and achieved 2147483647
resources for CNNs. The activation layer is not taken into clock cycles, which would be approximately 10.7 ms when
account, since it has the same parallelization as the Batch considering a frequency of 0.2 GHz. For comparison,
Normalization, but a slightly lower latency. In the table: a similar convolution layer was implemented using our
(vdw ) refers to (dwconv ) and (dwbn ); (vif m ) and (vof m ) library with the same frequency, same parameters and 8-bit
refer to (pwconv ); (vpw ) refers to (pwbn ). For module 15, unsigned integer data types. The implemented convolution
(dwconv ) refers to the Pooling layer and (pwbn ) to the layer had a latency of 3.31 ms, which is a speed-up of
Softmax layer. 3.23. For the same layer, our work shows almost 73% less
Table 7 shows the final implemented design executed BRAM usage, demonstrating the proposed library’s ability
on the ZCU104 MPSoC in baremetal. A baseline software to reduce the memory consumption of large neural networks
implementation uses 32-bit floating point numbers and on FPGAs.
runs on the ARM processor at a frequency of 1.2 GHz Liu et al. [19] proposed and developed a CNN accelerator
in release mode using the O3 optimization option. Our for the Xilinx ZYNQ-7100 platform. They implemented the
proposed implementation uses 8-bit unsigned numbers and SSD-MobileNets-V1 [31] layers as test application for their
runs on the FPGA at a frequency of 0.2 GHz. The time proposed work. The proposed work is also HLS based and
measurements have been done using the ARM processor. A uses Vivado HLS 2016.4. We implemented the most time
good speed-up has been achieved for the single modules. consuming SSD-MobileNets-V1 layers and compared them
Module 2 has the highest speed-up, since it has the highest with the work of Liu et al. in Table 8. For our hardware
parallelization degree and contains most functions executed and software implementations we used the Zynq ZCU102.
in a streaming manner. For module 1 and 2 also a frequency For measurements we executed the algorithms on the board
of 300 MHz was possible. When combining all modules to and measured them from the ARM processor. We show
a very deep pipeline this speed-up would be even higher. the CPU and FPGA results of our work and of Liu et
When comparing the FPGAs computation time with the al. [19]. Both implementations run at 100 MHz, to have
estimated time of the slowest function in Table 6, there a fair comparison, but higher frequencies can be achieved
All results are in ms. VI F M : Parallelization of the Input Feature Map. VOF M : Parallelization of the Output Feature Map
528 J Sign Process Syst (2021) 93:513–529
with our implementation. The table shows the execution times for different parallelization settings for IFM and OFM of our implementation. The Proposed 2 settings should be the maximum possible in terms of available resources, if the complete algorithm is ported to the ZCU102 and the different functions stream their results between each other. If we compare the results of Layer 27 and Layer 29 of the SSD-Mobilenet-V1 network, our execution time is 11.2x and 18.7x faster. When computing the complete streaming network of SSD-Mobilenet-V1, Layers 1 and 27 would be the bottleneck. This is intended, since layers 1, 27, 29, 31 and 33 of SSD-Mobilenet-V1 are the only layers with a 3 × 3 convolution kernel. Therefore, we used these layers as the roofline, since they consume more DSPs than the other layers. For our work, we propose to use a quantization technique, like the TensorFlow post-training quantization [1]. This will allow us to use smaller parameters, thus saving resources and power. For our work, we used unsigned 8-bit integers as data types for inputs, outputs, weights and biases. Wu et al. [35] investigated the mathematical aspect of quantization parameters on different neural networks. They also present an 8-bit quantization workflow that maintains an accuracy within 1% of the floating-point baseline. Therefore, quantized parameters with smaller bit widths should be used instead of floating-point parameters, as they maintain a reasonable accuracy, achieve a higher speed-up and save resources.

5 Conclusion

In this work we have shown an HLS FPGA library for neural networks. It contains 7 different streaming-capable functions to create large neural networks with deep pipelines. Due to the high parameterization of its functions, the library is suitable for embedded and HPC systems. The integration into HiFlipVX allows the use of further image processing functions. Although the library was optimized using Xilinx HLS directives, it was implemented in a vendor-independent way. The different parameter settings and parallelization possibilities were investigated in the evaluation to draw conclusions for the user. The evaluation also shows the low error rate, high performance, scalability and resource efficiency of the library. Using the MobileNets algorithm, we show how to efficiently create and optimize larger designs. An efficient approach to transfer coefficients and a way to find the optimal vectorization parameters were shown. In the future, we plan to extend the library to a framework that uses the OpenVX graph-based approach.

Acknowledgements This work has been funded partially by the German Federal Ministry of Education and Research (BMBF) as part of the PARIS project under grant agreement number 16ES0657 and partially by the COllective Research NETworking (CORNET) project AITIA: Embedded AI Techniques for Industrial Applications. CORNET-AITIA is funded by the BMWi (Federal Ministry for Economic Affairs and Energy) under the IGF-project number 249 EBG.

Funding Open Access funding enabled and organized by Projekt DEAL.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283). https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md.
2. Akgün, G., Kalms, L., Göhringer, D. (2020). Resource efficient dynamic voltage and frequency scaling on Xilinx FPGAs. In International symposium on applied reconfigurable computing (ARC) (pp. 178–192).
3. Chen, Y., He, J., Zhang, X., Hao, C., Chen, D. (2019). Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 73–82). https://doi.org/10.1145/3289602.3293915.
4. Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., Temam, O. (2014). DaDianNao: A machine-learning supercomputer. In 47th annual IEEE/ACM international symposium on microarchitecture (pp. 609–622).
5. Giduthuri, R., & Pulli, K. (2016). OpenVX: A framework for accelerating computer vision. In SIGGRAPH ASIA 2016 Courses (pp. 14:1–14:50). https://doi.org/10.1145/2988458.2988513.
6. Guan, Y., Liang, H., Xu, N., Wang, W., Shi, S., Chen, X., Sun, G., Zhang, W., Cong, J. (2017). FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In 25th annual international symposium on field-programmable custom computing machines (FCCM) (pp. 152–159).
7. Guo, K., Sui, L., Qiu, J., Yu, J., Wang, J., Yao, S., Han, S., Wang, Y., Yang, H. (2018). Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(1), 35–47.
8. Hassan, R., & Mostafa, H. (2020). Implementation of deep neural networks on FPGA-CPU platform using Xilinx SDSoC. Analog Integrated Circuits and Signal Processing. https://doi.org/10.1007/s10470-020-01638-5.
9. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
10. Intel (2020). Intel FPGA SDK for OpenCL Pro Edition: Programming Guide 19.4.
11. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
12. Ji, S., Xu, W., Yang, M., Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on multimedia (pp. 675–678).
14. Kalms, L., & Göhringer, D. (2017). Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs. In 27th international conference on field programmable logic and applications (FPL) (pp. 1–4). https://doi.org/10.23919/FPL.2017.8056847.
15. Kalms, L., & Göhringer, D. (2020). Accelerated high-level synthesis feature detection for FPGAs using HiFlipVX, chap. 7 (pp. 115–135). New York: Springer.
16. Kalms, L., & Göhringer, D. (2020). HiFlipVX: Open source high-level synthesis FPGA library for image processing. https://github.com/TUD-ADS/HiFlipVX.
17. Kalms, L., Podlubne, A., Göhringer, D. (2019). HiFlipVX: An open source high-level synthesis FPGA library for image processing. In Applied reconfigurable computing (pp. 149–164).
18. Krizhevsky, A., Sutskever, I., Hinton, G.E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386.
19. Liu, B., Zou, D., Feng, L., Feng, S., Fu, P., Li, J. (2019). An FPGA-based CNN accelerator integrating depthwise separable convolution. Electronics, 8, 281.
20. Liu, Z., Chow, P., Xu, J., Jiang, J., Dou, Y., Zhou, J. (2019). A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs. Electronics, 8, 65.
21. Long, J., Shelhamer, E., Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3431–3440).
22. Omidian, H., & Lemieux, G.G.F. (2018). Janus: A compilation system for balancing parallelism and performance in OpenVX. Journal of Physics: Conference Series (JPCS), 1004, 012011. https://doi.org/10.1088/1742-6596/1004/1/012011.
23. Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M.P., Shyu, M.L., Chen, S.C., Iyengar, S.S. (2018). A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys, 51(5). https://doi.org/10.1145/3234150.
24. Qasaimeh, M., Denolf, K., Lo, J., Vissers, K., Zambreno, J., Jones, P.H. (2019). Comparing energy efficiency of CPU, GPU and FPGA implementations for vision kernels. In International conference on embedded software and systems (ICESS) (pp. 1–8).
25. Ren, S., He, K., Girshick, R., Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.
26. Sekar, C., & Hemasunder (2017). Tutorial T7: Designing with Xilinx SDSoC. In 30th international conference on VLSI design and 16th international conference on embedded systems (VLSID) (pp. xl–xli). https://doi.org/10.1109/VLSID.2017.97.
27. Song, L., Wang, Y., Han, Y., Zhao, X., Liu, B., Li, X. (2016). C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization. In Proceedings of the 53rd annual design automation conference (DAC). https://doi.org/10.1145/2897937.2897995.
28. Suda, N., Chandra, V., Dasika, G., Mohanty, A., Ma, Y., Vrudhula, S., Seo, J.S., Cao, Y. (2016). Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 16–25). https://doi.org/10.1145/2847263.2847276.
29. Taheri, S., Behnam, P., Bozorgzadeh, E., Veidenbaum, A., Nicolau, A. (2019). AFFIX: Automatic acceleration framework for FPGA implementation of OpenVX vision algorithms. In International symposium on field-programmable gate arrays (FPGA) (pp. 252–261). https://doi.org/10.1145/3289602.3293907.
30. Tapiador Morales, R., Rios-Navarro, A., Linares-Barranco, A., Kim, M., Kadetotad, D., Seo, J.S. (2016). Comprehensive evaluation of OpenCL-based convolutional neural network accelerators in Xilinx and Altera FPGAs. CoRR.
31. TensorFlow (2020). SSD MobileNet V1. https://tensorflow.org/lite/models/object_detection/overview.
32. Venieris, S.I., & Bouganis, C. (2017). Latency-driven design for FPGA-based convolutional neural networks. In 27th international conference on field programmable logic and applications (FPL) (pp. 1–8).
33. Wang, Y., Xu, J., Han, Y., Li, H., Li, X. (2016). DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In 53rd design automation conference (DAC) (pp. 1–6).
34. Winterstein, F., Bayliss, S., Constantinides, G.A. (2013). High-level synthesis of dynamic data structures: A case study using Vivado HLS. In International conference on field-programmable technology (FPT) (pp. 362–365). https://doi.org/10.1109/FPT.2013.6718388.
35. Wu, H., Judd, P., Zhang, X., Isaev, M., Micikevicius, P. (2020). Integer quantization for deep learning inference: Principles and empirical evaluation.
36. Xilinx (2019). xfOpenCV. https://github.com/Xilinx/xfopencv.
37. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J. (2015). Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 161–170). https://doi.org/10.1145/2684746.2689060.
38. Zhang, C., Sun, G., Fang, Z., Zhou, P., Pan, P., Cong, J. (2018). Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
39. Zhang, J., & Li, J. (2017). Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 25–34). https://doi.org/10.1145/3020078.3021698.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
arXiv:2002.05796v2 [cs.PL] 21 Jul 2020

Abstract—FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or to design proper abstractions.
In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: it relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas in the source code. In order to validate achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-the-art Domain-Specific Language (DSL) approaches for this domain.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) consist of a network of reconfigurable digital logic cells that can be configured to implement any combinatorial logic or sequential circuits. This allows the design of custom, application-tailored hardware. In particular, memory-intensive applications benefit from FPGA implementations by exploiting fast on-chip memory for high throughput. These features make FPGA implementations orders of magnitude faster or more energy-efficient than CPU implementations in these areas. However, FPGA programming poses challenges to programmers unacquainted with hardware design.
FPGAs are traditionally programmed at the Register-Transfer Level (RTL). This requires modeling digital signals, their timing, the flow between registers, as well as the operations performed on them. Hardware Description Languages (HDLs) such as Verilog or VHDL allow for the explicit description of arbitrary circuits but require significant coding effort and verification time. This makes design iterations time-consuming and error-prone, even for experts: the code needs to be rewritten for different performance or area objectives. In recent languages such as Chisel [1], VeriScala [2], and MyHDL [3], programmers can create a functional description of their design but still stick to the RTL.
High-Level Synthesis (HLS) increases the abstraction level to an untimed high-level specification similar to imperative programming languages and automatically solves low-level design issues such as clock-level timing, register allocation, and structural pipelining [4]. However, HLS code that is optimized for the synthesis of high-performance circuits is fundamentally different from a software program delivering high performance on a CPU. This is due to the significant gap between the programming paradigms. An HLS compiler has to optimize the memory hierarchy of a hardware implementation and parallelize its data paths [5].
In order to achieve good Quality of Results (QoR), HLS languages demand that programmers also specify the hardware architecture of an application instead of just its algorithm. For this reason, HLS languages offer hardware-specific pragmas. This ad-hoc mix of software and hardware features makes it difficult for programmers to optimize an application. In addition, most HLS tools rely on their own C dialect, which prevents code portability. For example, Xilinx Vivado HLS [6] uses C++ as its base language, while the Intel SDK [7] (formerly Altera) uses OpenCL C. These severe restrictions make it hard to use existing HLS languages in a portable and modular way.
In this paper, we advocate describing FPGA designs using functional abstractions and partial evaluation to generate optimized HLS code. Consider Figure 1 for an example from image processing: with a functional language, we separate the description of the sobel_x operator from its realization in hardware. The hardware realization make_local_op is a function that specifies the data path, the parallelization, and the memory architecture. Thus, the algorithm and the hardware architecture are described by a set of higher-order functions. A partial evaluator, ultimately, combines these functions to generate HLS code that delivers high-performance circuit designs when compiled with HLS tools. Since the initial descriptions are high-level, compact, and functional, they are reusable and distributable as a library. We leverage the AnyDSL compiler framework [8] to perform partial evaluation and extend it to generate input code for HLS tools targeting Intel and Xilinx FPGA devices. We claim that this approach leads to modular and portable code, in contrast to existing HLS approaches, and is able to produce highly

Figure 1. AnyHLS example: The algorithm description sobel_x is decoupled from its realization in hardware make_local_op. The hardware realization is a function that specifies important transformations for the exploitation of parallelism and the memory architecture. The function generate(vhls) selects the backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm and realization functions.
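Figure 1's hardware realization can be hard to picture without code. The following is a rough, vendor-neutral C++ sketch (our illustration, not AnyHLS output) of the streaming structure that make_local_op stands for: line buffers that cache the previous image rows, a small shift window, and the operator applied once per input pixel. The names, the 3x3 Sobel-x kernel and the simplified border handling are assumptions.

    #include <array>
    #include <cstdint>
    #include <vector>

    // Illustrative only: a 3x3 local operator over a W x H image with
    // line buffers and a sliding window, as sketched in Figure 1.
    constexpr int W = 1024, H = 1024, KW = 3, KH = 3;
    using Pixel = int16_t;

    // Assumed operator: horizontal Sobel on a 3x3 window.
    static Pixel sobel_x(const std::array<std::array<Pixel, KW>, KH>& w) {
        return (w[0][2] + 2 * w[1][2] + w[2][2]) -
               (w[0][0] + 2 * w[1][0] + w[2][0]);
    }

    void local_op(const std::vector<Pixel>& in, std::vector<Pixel>& out) {
        // Two line buffers cache the previous image rows (on-chip memory in HLS).
        std::array<std::array<Pixel, W>, KH - 1> line_buf{};
        // The sliding window holds the 3x3 neighborhood (registers in HLS).
        std::array<std::array<Pixel, KW>, KH> window{};

        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                // Shift the window one column to the left.
                for (int r = 0; r < KH; ++r)
                    for (int c = 0; c < KW - 1; ++c)
                        window[r][c] = window[r][c + 1];
                // Insert a new column: two cached pixels plus the fresh input pixel.
                Pixel p = in[y * W + x];
                window[0][KW - 1] = line_buf[0][x];
                window[1][KW - 1] = line_buf[1][x];
                window[2][KW - 1] = p;
                // Update the line buffers for the next row.
                line_buf[0][x] = line_buf[1][x];
                line_buf[1][x] = p;
                // Border handling omitted: the first rows and columns are not valid.
                out[y * W + x] = sobel_x(window);
            }
        }
    }

In AnyHLS this structure is not written by hand but assembled from the library abstractions described in Sections III and IV, and the data path would additionally be replicated v times for vectorization.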
application domain by themselves (see Section III). AnyHLS is thereby built on top of AnyDSL [8] (see Section II-C). AnyDSL offers partial evaluation to enable shallow embedding [34] without the need for modifying a compiler. This means that there is no need to change the compiler when adding support for a new application domain, since programmers can design custom control structures. Partial evaluation specializes algorithmic variants of a program at compile time. Compared to metaprogramming, partial evaluation operates in a single language and preserves the well-typedness of programs [8]. Furthermore, different combinations of static/dynamic parameters can be instantiated from the same code. Previously, we have shown how to abstract image border handling implementations for Intel FPGAs using AnyDSL [35]. In this paper, we present AnyHLS and an image processing library to synthesize FPGA designs in a modular and abstract way for both Intel and Xilinx FPGAs.

Thus, the calls

    let z = pow(x, 5);        let z = pow(3, 5);

will result in the following equivalent sequences of instructions after specialization:

    let y = x * x;            let z = 243;
    let z = x * y * y;

As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function.
FPGA implementations must be statically defined for QoR: types, loops, functions, and interfaces must be resolved at compile time [16], [18], [19]. Partial evaluation has many advantages compared to metaprogramming, as discussed in Section II-B. Hence, Impala's partial evaluation is particularly useful to optimize HLS descriptions.

2 https://anydsl.github.io
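To make the contrast with metaprogramming concrete, here is a small C++ analogue of the pow example (our illustration, not from the paper): constexpr folds the fully static call at compile time, but the mixed static/dynamic call pow_int(x, 5) remains an ordinary runtime call unless the optimizer happens to specialize it, which is exactly the case Impala's partial evaluator handles explicitly.

    #include <cstdio>

    // constexpr lets the C++ compiler fold pow_int(3, 5) to 243 at compile time.
    constexpr int pow_int(int x, int n) {
        int r = 1;
        for (int i = 0; i < n; ++i) r *= x;
        return r;
    }

    int main() {
        constexpr int z1 = pow_int(3, 5);   // evaluated at compile time: 243
        int x = 7;
        int z2 = pow_int(x, 5);             // plain runtime call; any unrolling or
                                            // strength reduction is left to the optimizer
        std::printf("%d %d\n", z1, z2);
        return 0;
    }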
Figure 2. FPGA code generation flows for Halide, Hipacc, and AnyHLS (from left to right). VHLS and AOCL are used as acronyms for Vivado HLS and Intel FPGA SDK for OpenCL, respectively. Halide and Hipacc rely on domain-specific compilers for image processing that instantiate template libraries. AnyHLS allows defining all abstractions for a domain in a language called Impala and relies on partial evaluation for code specialization. This ensures maintainability and extensibility of the provided domain-specific library, for image processing in this example.

2) Generators: Because iteration on various domains is a common pattern, Impala provides syntactic sugar for invoking certain higher-order functions. The loop

    for var1, ..., varn in iter(arg1, ..., argn) { /* ... */ }

is desugared by wrapping the loop body into an anonymous function

    |var1, ..., varn| { /* ... */ }

that is passed to iter as the last argument. We call functions that are invokable like this generators. Domain-specific libraries implemented in Impala make heavy use of these features as they allow programmers to write custom generators that take advantage of both domain knowledge and certain hardware features, as we will see in the next section.
Generators are particularly powerful in combination with partial evaluation. Consider the following functions:
    type Body = fn(int) -> ();
    fn @(?a & ?b) unroll(a: int, b: int, body: Body) -> () {
      if a < b { body(a); unroll(a+1, b, body) }
    }
    fn @ range(a: int, b: int, body: Body) -> () {
      unroll($a, b, body)
    }

Both generators iterate from a (inclusive) to b (exclusive) while invoking body each time. The filter unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site.

III. THE ANYHLS LIBRARY

Efficient and resource-friendly FPGA designs require application-specific optimizations. These optimizations and transformations are well known in the community. For example, de Fine Licht et al. [20] discuss the key transformations of HLS codes such as loop unrolling and pipelining. They describe the whole hardware design from the low-level memory layout to the operator implementations with support for low-level loop transformations throughout the design. In our setting, the programmer defines and provides these abstractions using AnyDSL for a given domain in the form of a library. We rely on partial evaluation to combine those abstractions and to remove the overhead associated with them. Ultimately, the AnyDSL compiler synthesizes optimized HLS code (C++ or OpenCL C) from a given functional description of an algorithm as shown in Figure 2. The generated code goes to the selected HLS tool. This is in contrast to other domain-specific approaches like Halide-HLS [25] or Hipacc [27], which rely on domain-specific compilers to instantiate predefined templates or macros. Hipacc makes use of two distinct libraries to synthesize algorithmic abstractions to Vivado HLS and Intel AOCL, while AnyHLS uses the same image processing library that is described in Impala.

A. HLS Code Generation

For HLS code generation, we implemented an intrinsic named vhls in AnyHLS to emit Vivado HLS code and an intrinsic named opencl to emit AOCL:

    with vhls() { body() }        with opencl() { body() }

With opencl we use a grid and block size of (1, 1, 1) to generate a single work-item kernel, as the official AOCL documentation recommends [7]. We extended AnyDSL's OpenCL runtime by the extensions of the Intel OpenCL SDK. To provide an abstraction over both HLS backends, we create a wrapper generate that expects a code generation function:

    type Backend = fn(fn() -> ()) -> ();
    fn @ generate(be: Backend, body: fn() -> ()) -> () {
      with be() { body() }
    }

Switching backends is now just a matter of passing an appropriate function to generate:

    let backend = vhls; // or opencl
    with generate(backend) { body() }

B. Building Abstractions for FPGA Designs

In the following, we present abstractions for the key transformations and design patterns that are common in FPGA design. These include (a) important loop transformations, (b) control flow and data flow descriptions such as reductions and Finite State Machines (FSMs), and (c) the explicit utilization of different memory types. Approaches like Spatial [15] expose these patterns within the language; new patterns require dedicated support from the compiler. Hence, these languages and compilers are restricted to the specialized application domain they have been designed for. In AnyHLS, Impala's functional language and partial evaluation allow us to design the abstractions needed for FPGA synthesis in the form of a library. New patterns can be added to the library without dedicated support from the compiler. This makes AnyHLS easier to extend compared to the approaches mentioned before.

1) Loop Transformations: C++ compilers usually provide certain preprocessor directives that perform particular code transformations. A common feature is to unroll loops (see the left-hand side):
    for (int i=0; i<N/W; ++i) {            for i in range(0, N/W) {
      for (int w=0; w<W; ++w) {              for w in unroll(0, W) {
        #pragma unroll                          body(i*W + w);
        body(i*W + w);                        }
      }                                      }
    }

Such pragmas are built into the compiler. The Impala version (shown on the right) uses generators that are entirely implemented as a library. Partial evaluation optimizes Impala's range and unroll abstractions as well as the input body function according to their static inputs, i.e., N and W. The residual program consists of consecutive instances of the body function according to the value of W, as shown in Figure 3. This generates concise and clean code for the target HLS compiler, which is drastically different from using a pragma.

Figure 3. Parallel processing (no unrolling, unroll inner loop, unroll outer loop, unroll inner and outer loop).

Generators, unlike C++ pragmas, are first-class citizens of the Impala language. This allows programmers to implement sophisticated loop transformations. For example, the following function tile returns a new generator. It instantiates a tiled loop nest of the specified tile size with the loops inner and outer:

    type Loop = fn(int, int, fn(int) -> ()) -> ();
    fn @ tile(size: int, inner: Loop, outer: Loop) -> Loop {
      @|beg, end, body| outer(0, (end-beg)/size,
        |i| inner(i*size + beg, (i+1)*size + end, |j| body))
    }

    let schedule = tile(W, unroll, range);
    for i in schedule(0, N) {
      body(i)
    }

Passing W for the tiling size, unroll for the inner loop, and range for the outer loop yields a generator that is identical to the loop nest at the beginning of this paragraph. With this design, we can reuse or explore iteration techniques without touching the actual body of a for loop. For example, consider the processing options for a two-dimensional loop nest as shown in Figure 3: when just passing range as inner and outer loop, the partial evaluator will keep the loop nest and, hence, not unroll body and instantiate it only once. Unrolling the inner loop replicates body and increases the bandwidth requirements accordingly. Unrolling the outer loop also replicates body, but in a way that benefits data reuse from the temporal locality of an iterative algorithm. Unrolling both loops replicates body for increased bandwidth and data reuse for temporal locality.
C/C++-based HLS solutions often use a pragma to mark a loop amenable for pipelining. This means parallel execution of the loop iterations in hardware. For example, the following code on the left uses an initiation interval (II) of 3:

    for (int i=0; i<N; ++i) {          let II = 3;
      #pragma HLS pipeline II=3        for i in pipeline(II, 0, N) {
      body(i);                           body(i)
    }                                  }

Instead of a pragma (on the left), AnyHLS uses the intrinsic generator pipeline (on the right). Unlike the above loop abstractions (e.g., unroll), Impala emits a tool-specific pragma for the pipeline abstraction. This provides portability across different HLS tools. Furthermore, it allows the programmer to invoke and pass around pipeline, just like any other generator.
2) Reductions: Reductions are useful in many contexts. The following function takes an array of values, a range within it, and an operator:

    type T = int;
    fn @(?beg & ?end) reduce(beg: int, end: int, input: &[T],
                             op: fn(T, T) -> T) -> T {
      let n = end - beg;
      if n == 1 {
        input(beg)
      } else {
        let m = (end + beg) / 2;
        let a = reduce(beg, m, input, op);
        let b = reduce(m, end, input, op);
        op(a, b)
      }
    }

In the above filter, the recursion will be completely unfolded if the range is statically known. Thus,

    reduce(0, 4, [a, b, c, d], |x, y| x + y)

yields: (a + b) + (c + d).
3) Finite State Machines: AnyHLS models computations that depend not only on the inputs but also on an internal state with an FSM. To define an FSM, programmers need to specify states and a transition function that determines when to change the current state based on the machine's input. This is especially beneficial for modeling control flow. To describe an FSM in Impala, we start by introducing types to represent the states and the machine itself:

    type State = int;
    struct FSM {
      add: fn(State, fn() -> (), fn() -> State) -> (),
      run: fn(State) -> ()
    }

An object of type FSM provides two operations: adding one state with add or running the computation. The add method takes the name of the state, an action to be performed for this state, and a transition function associated with this state. Once all states are added, the programmer runs the machine by passing the initial state as an input parameter. The following example adds 1 to every element of an array:

    let buf = /*...*/;
    let mut (idx, pixel) = (0, 0);
    let fsm = make_fsm();
    fsm.add(Read,    || pixel = buf(idx),
                     || if idx >= len { Exit } else { Compute });
    fsm.add(Compute, || pixel += 1,         || Write);
    fsm.add(Write,   || buf(idx++) = pixel, || Read);
    fsm.run(Read);

Similar to the other abstractions introduced in this section, the constructor for an FSM is not a built-in function of the compiler but a regular Impala function. In some cases, we want to execute the FSM in a pipelined way. For this scenario, we add a second method run_pipelined. As all the methods, e.g., make_fsm, add, and run, are annotated for partial evaluation (by @), input functions to these methods will be optimized according to their static inputs. Ultimately, AnyHLS will emit the states of an FSM as part of a loop according to the selected run method.
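As a rough idea of what "the states of an FSM as part of a loop" can look like once lowered to C++ for an HLS tool, consider the following sketch (our own naming and structure, not actual AnyHLS output); the bounds check is hoisted before the read for safety:

    #include <cstddef>

    enum class State { Read, Compute, Write, Exit };

    // Sketch: the three FSM states from the example above, emitted as a
    // switch inside a single loop. An HLS tool can pipeline this loop
    // when the pipelined run method is selected.
    void add_one(int* buf, std::size_t len) {
        State state = State::Read;
        std::size_t idx = 0;
        int pixel = 0;
        while (state != State::Exit) {
            switch (state) {
            case State::Read:
                if (idx >= len) { state = State::Exit; break; }
                pixel = buf[idx];
                state = State::Compute;
                break;
            case State::Compute:
                pixel += 1;
                state = State::Write;
                break;
            case State::Write:
                buf[idx++] = pixel;
                state = State::Read;
                break;
            default:
                break;
            }
        }
    }

With run_pipelined, the surrounding loop would presumably also receive the tool-specific pipeline directive discussed above.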
global memory on-chip memory register stream
4) Memory Types and Memory Abstractions: FPGAs have
different memory types of varying sizes and access properties.
Impala supports four memory types specific to hardware design Figure 4. Memory types provided for FPGA design
(see Figure 4): global memory, on-chip memory, registers, and
streams. Global memory (typically DRAM) is allocated on the OnChipArray
Regs2D StreamArray
host using our runtime and accessed through regular pointers. Regs1D
On-chip memory (e.g., BRAM or M10K/M20K) for the FPGA
is allocated using the reserve_onchip compiler intrinsic.
1D register array
Memory accesses using the pointer returned by this intrinsic 2D register array stream array
on-chip array
will map to on-chip memory. Standard variables are mapped
to registers, and a specific stream type is available to allow
for the communication between FPGA kernels. Memory-wise, Figure 5. Memory abstractions
a stream is mapped to registers or on-chip memory by the
HLS tools. These FPGA-specific memory types in Impala will
the smaller array. The generator (make_regs1d) returns an
be mapped to their corresponding tool-specific declarations in
Impala variable that can be read and written by index values
the residual program (on-chip memory will be defined as local
(regs in the following code), similar to C arrays.
memory for AOCL whereas it will be defined as an array in
let regs = make_regs1d(size);
Vivado HLS).
a) Memory partitioning: an array partitioning pragma However, it defines size number of registers in the residual
must be defined as follows to implement a C array with program instead of declaring an array and partitioning it by
hardware registers using Vivado HLS [6]: tool-specific pragmas as in Listing 1. The generated code
typedef int T; does not contain any compiler directives; hence it can be
T Regs1D[size];
#pragma HLS variable=Regs1D array_partition dim=0
used for different HLS tools (e.g., Vivado HLS, AOCL). Since
we annotated make_regs1d, read, and write for partial
Listing 1. A typical way of partitioning an array by using pragmas in existing
HLS tools. evaluation, any call to these functions will be inlined recursively.
This means that the search to find the register to read to or
Other HLS tools offer similar pragmas for the same task. write from will be performed at compile time. These registers
Instead, AnyHLS provides a more concise description of a will be optimized by the AnyDSL compiler, just like any other
register array without using any tool-specific pragma by the variables: unnecessary assignments will be avoided, and a clean
recursive declaration of registers as follows: HLS code will be generated.
type T = int; Correspondingly, AnyHLS provides generators (similar to
struct Regs1D { Listing 2) for one and two-dimensional arrays of on-chip
read: fn(int) -> T,
write: fn(int, T) -> (), memory (e.g., line buffers in Section IV), global memory, and
size: int streams (as illustrated in Figure 5) instead of using memory
}
fn @ make_regs1d(size: int) -> Regs1D { partitioning pragmas encouraged in existing HLS tools (as in
if size == 0 { Listing 1).
Regs1D {
read: @|_| 0,
write: @|_, _| (),
size: size IV. A L IBRARY FOR I MAGE P ROCESSING ON FPGA
}
} else {
AnyHLS allows for defining domain-specific abstractions
let mut reg: T; and optimizations that are used and applied prior to generating
let others = make_regs1d(size - 1);
Regs1D {
customized input to existing HLS tools. In this section, we
read: @|i| if i+1 == size { reg } introduce a library that is developed to support HLS for the
else { others.read(i) },
write: @|i, v| if i+1 == size { reg = v }
domain of image processing applications. It is based on the
else { others.write(i, v) }, fundamental abstractions introduced in Section III-B. Our low-
size: size
}
level implementation is similar to existing domain-specific
} languages targeting FPGAs [24], [27]. For this reason, we focus
}
on the interface of our abstractions as seen by the programmer.
Listing 2. Recursive description of a register array using partial evalution We design applications by decoupling their algorithmic
instead of declaring an array and partitioning it by HLS pragmas.
description from their schedule and memory operations. For
instance, typical image operators, such as the following
When the size is not zero, each recursive call to this
Sobel filter, just resort to the make_local_op generator.
function allocates a register variable named reg, and creates
Similarly, we implement a point operator for RGB-to-gray
a smaller register array with one element less named others.
color conversion as follows (Listing 3):
The read and write functions test if the index i is equal
fn sobel_edge(output: &mut [T], input: &[T]) -> () {
to the index of the current register. In the case of a match, let img = make_raw_mem2d(width, height, input);
the current register is used. Otherwise, the search continues in let dx = make_raw_mem2d(width, height, output);
let sobel_extents = extents(1, 1); // for 3x3 filter access an element of the vector. This increases data reuse and
let operator = make_local_op(4, // vector factor
sobel_operator_x, sobel_extents, mirror, mirror);
DRAM-to-on-chip memory bandwidth [42].
with generate(hls) { operator(img, dx); } 2) Stream Processing: Inter-kernel dependencies of an
}
algorithm should be accessed on-the-fly in combination with
fn rgb2gray(output: &mut [T], input: &[T]) -> () { fine-granular communication in order to pipeline the full
let img = make_raw_img(width, height, input);
let gray = make_raw_img(width, height, output); implementation with a fixed throughput. That is, as soon as a
let operator = make_point_op(@ |pix| { block produces one data, the next block consumes it. In the
let r = pix & 0xFF;
let g = (pix >> 8) & 0xFF; best case, this requires only a single register of a small buffer
let b = (pix >> 16) & 0xFF; instead of reading/writing to temporary images:
(r + g + b) / 3
});
Mem1D Mem1D Mem1D Mem1D
with generate(hls) { operator(img, gray); }
} Kernel1 Kernel2 Kernel3
Listing 3. Sobel filter and RGB-to-gray color conversion as example
applications described by using our library.
We define a stream between two kernels as follows:
The image data structure is opaque. The target platform fn make_mem_from_stream(size: int, data: stream) -> Mem1D;
mapping determines its layout. AnyHLS provides common
border handling functions as well as point and global operators 3) Line Buffers: Storing an entire image to on-chip memory
such as reductions (see Section III-B2). These operators are before execution is not feasible since on-chip memory blocks
composable to allow for more sophisticated ones. are limited in FPGAs. On the other hand, feeding the data
on demand from main memory is extremely slow. Still, it is
possible to leverage fast on-chip memory by using it as FIFO
A. Vectorization
buffers containing only the necessary lines of the input images
Image processing applications consist of loops that possess a (W pixels per line).
very high degree of spatial parallelism. This should be exploited Mem2D (1, h, v)
to reach the bandwidth speed of memory technologies. A
line buffer
resource-efficient approach, so-called vectorization or loop
coarsening, is to aggregate the input pixels to vectors and
process multiple input data at the same time to calculate line buffer
Mem1D (W, v)
multiple output pixels in parallel [39]–[41]. This replicates only
the arithmetic operations applied to data (so-called datapath) line buffers (W, h, v)
instead of the whole accelerator, similar to Single Instruction
Multiple Data (SIMD) architectures. Vectorization requires a This enables parallel reads at the output for every pixel read
control structure specialized to a considered hardware design. at the input. We model a line buffer as follows:
We support the automatic vectorization of an application by type LineBuf1D = fn(Mem1D) -> Mem1D;
a given factor v when using our image processing library. In fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D
particular, our library use the vectorization techniques proposed
in [40]. For example, the make_local_op function has Akin to Regs1D (see Section III-B4), a recursive call builds
an additional parameter to specify the desired vectorization an array of line buffers (each line buffer will be declared by a
and will propagate this information to the functions it uses separate memory component in the residual program similar
internally: make_local_op(op, v). For brevity, we omit to on-chip array in Figure 5).
the parameter for the vectorization factor for the remaining 4) Sliding Window: Registers are the most amenable re-
abstractions in this section. sources to hold data for highly parallelized access. A sliding
window of size w × h updates the constituting shift registers by
B. Memory Abstractions for Image Processing a new column of h pixels and enables parallel access to w · h
1) Memory Accessor: In order to optimize memory access pixels.
Mem2D (w, h, 1)
and encapsulate the contained memory type (on-chip memory,
etc.) into a data structure, we decouple the data transfer from Mem2D
(1, h, v)
the data use via the following memory abstractions:
struct Mem1D { struct Mem2D {
read: fn(int) -> T, read: fn(int, int) -> T,
write: fn(int, T)->(), write: fn(int, int, T)->(),
update: fn(int) -> (), update: fn(int, int) -> (), sliding window
size: int width: int, height: int
} } This provides high data reuse for temporal locality and avoids
Similar to hardware design practices, these memory abstractions waste of on-chip memory blocks that might be utilized for a sim-
require the memory address to be updated before the ilar data bandwidth. Our implementation uses make_regs2d
read/write operations. The update function transfers data for an explicit declaration of registers and supports pixel-based
from/to the encapsulated memory to/from staging registers indexing at the output. This will instantiate w · h registers in
using vector data types. Then, the read/write functions the residual program, as explained in Section III-B4.
type Swin2D = fn(Mem2D) -> Mem2D; type LocalOp = fn(Mem1D) -> Mem1D;
fn @ make_sliding_window(w: int, h: int) -> Swin2D { fn @ make_local_op(v: int, op: Op, ext: Extents,
let win = make_regs2d(w, h); bh_lower: FnBorder,
// ... bh_upper: FnBorder) -> LocalOp {
} @ |img, out| {
let mut (col, row, idx) = (0, 0, 0);
let wait = /* initial latency */
C. Loop Abstractions for Image Processing let fsm = make_fsm();
fsm.add(Read, || img.update(idx), || Compute);
1) Point Operators: Algorithms such as image scaling and fsm.add(Compute, || {
line_buffer.update(col);
color transformation calculate an output pixel for every input sliding_window.update(row);
pixel. The point operator abstraction (see Listing 4) in AnyHLS col_sel.update(col);
for i in unroll(0, v) {
yields a vectorized pipeline over the input and output image. out.write(i, op(col_sel.read(i)));
This abstraction is parametric in its vector factor v and the }
}, || if idx > wait { Write } else { Index });
desired operator function op. fsm.add(Write, || out.update(idx-wait-1), || Index);
fsm.add(Index, || {
type PointOp = fn(Mem1D) -> Mem1D;
idx++; col++;
fn @ make_point_op(v: int, op: Op) -> PointOp {
if col == img_width { col=0; row++; }
@ |img, out| {
}, || if idx < img.size { Read } else { Exit });
for idx in pipeline(1, 0, img.size) {
fsm.run_pipelined(Read, 1, 0, img.size);
img.update(idx);
}
for i in unroll(0, v) {
}
out.write(i, op(img.read(i)));
}
out.update(idx); Listing 5. Implementation of the local operator abstraction.
}
}
}
Compared to the local operator in Figure 1, we also support
Listing 4. Implementation of the point operator abstraction. boundary handling. We specify the extent of the local operator
(filter size / 2) as well as functions specifying the boundary
handling for the lower and upper bounds. Then, row and column
The total latency is
selection functions apply border handling correspondingly in x-
L = Larith + ⌈W/v⌉ · H cycles (2) and y-directions by using one-dimensional multiplexer arrays
similar to Özkan et al. [40].
where W and H are the width and height of the input image,
and Larith is the latency of the data path. V. E VALUATION AND R ESULTS
2) Local Operators: Algorithms such as Gaussian blur and In the following, we compare the Post Place and Route
Sobel edge detection calculate an output pixel by considering (PPnR) results using AnyHLS and other state-of-the-art domain-
the corresponding input pixel and a certain neighborhood of it specific approaches including Halide-HLS [25] and Hipacc [27].
in a local window. Thus, a local operator with a w × h window The generated HLS codes are compiled using Intel FPGA SDK
requires w · h pixel reads for every output. The same (w − 1) · h for OpenCL 18.1 and Xilinx Vivado HLS 2017.2 targeting a
pixels are used to calculate results at the image coordinates Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA,
(x, y) and (x + 1, y). This spatial locality is transformed into respectively.
temporal locality when input images are read in raster order for The generated hardware designs are evaluated for their
burst mode, and subsequent pixels are sequentially processed throughput, latency, and resource utilization. FPGAs possess
with a streaming pipeline implementation. The local operator two types of resources: (i) computational: LUTs and DSP
implementation in AnyHLS (shown in Listing 5) consists of blocks; (ii) memory: Flipflops (FFs) and on-chip memory
line buffers and a sliding window to hold dependency pixels (BRAM/M20K). A SLICE/ALM is comprised of look-up tables
in on-chip memory and calculates a new result for every new (LUTs) and flip flops, thus indicate the resource usage when
pixel read. considered with the DSP block and on-chip memory blocks.
Mem2D Mem2D Mem2D
The implementation results presented for Vivado HLS feature
Mem1D
(1, h, v) (w + v − 1, h, 1)(w + v − 1, h, 1)
Mem1D
only the kernel logic, while those by Intel OpenCL include
(W × H, v) line buffer op1 (W × H, v) PCIe interfaces. The execution time of an FPGA circuit (Vivado
row col
line buffer
sel sel
...
HLS implementation) equals to Tclk · latency, where Tclk is
opv
the clock period of the maximum achievable clock frequency
line buffers sliding window
(lower is better). We measured the timing results for Intel
local operator
OpenCL by executing the applications on a Cyclone V GT
This provides a throughput of v pixels per clock cycle at the 5CGTD9 FPGA. This is the case for all analyzed applications.
cost of an initial latency (v is the vectorization factor) We have no intention nor license rights [43, §4] [44, §2] to
benchmark and compare the considered FPGA technologies or
Linitial = Larith + (⌊h/2⌋ · ⌈W/v⌉ + ⌊⌈w/v⌉/2⌋) (3)
that is spent for caching neighboring pixels of the first
calculation. The final latency is thus: A. Applications
In our experimental evaluation, we consider the following
L = Linitial + (⌈W/v⌉ · H) (4)
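As a worked example of Equations (3) and (4), take values used later in the evaluation (a 1024 × 1024 image, a 5 × 5 kernel, v = 1, and Larith = 14); the exact constant may differ slightly from the cycle counts reported by the tools:

    Linitial = 14 + (⌊5/2⌋ · ⌈1024/1⌉ + ⌊⌈5/1⌉/2⌋) = 14 + (2 · 1024 + 2) = 2064
    L = 2064 + ⌈1024/1⌉ · 1024 = 2064 + 1,048,576 ≈ 1.05 · 10^6 cycles

This is in line with the roughly one-million-cycle Gauss latencies reported in Section V.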
Harris
2) Vectorization: Many FPGA implementations benefit from
FChain parallel processing in order to increase memory bandwidth.
AnyHLS implicitly parallelizes a given image pipeline by a
Harris naïve vectorization factor v. As an example, Figure 7 shows the
FChain streaming pipeline PPnR results, along with the achieved memory throughput for
0 16 35 107 different vectorization factors for the mean filter on a Cyclone V.
Execution time [ms] The memory-bound of the Cyclone V is reported by Intel’s
Figure 6. Execution time for naïve and streaming pipeline implementations Memory Bound [MB/s]
Resource Usage in %
•
On-Chip Mem Blocks Logic Resources
as a pre-processing algorithm 30
• bilateral filter (Bilateral), a 5 × 5 floating-point kernel
as an edge-preserving and noise-reducing function based 25
on exponential functions
• mean filter (MF), a 5×5 filter that determines the average 20
within a local window via 8-bit arithmetic
15
• SobelLuma, an edge detection algorithm provided as a
1 2 4 8 16 32
design example by Intel. The algorithm consists of RGB Vectorization factor (v)
to Luma color conversion, Sobel filters, and thresholding
Figure 7. PPnR results of AnyHLS’s mean filter implementation on an Intel
B. Library Optimizations Cyclone V. The memory bound of the device for our setup is 1344.80 MB/s.
latency given in Equation (4), which is L = Larith + AnyHLS 8 1646 16 1050641 801.8
Halide-HLS 16 2096 50 1060897 458.7
1.042.442 clock cycles for Gauss when v = 1. Larith = Hipacc 8 1709 16 1052693 820.1
14 for AnyHLS’ Gauss implementation as shown in
Table II.
ii) Halide-HLS pads input images according to the selected has control over code generation. Extending AnyHLS’ image
border handling mode (even when no border handling is processing library only requires adding new functions in Impala
defined). This increases the input image size from (W , (see Figure 2). Our intention to compare AnyHLS with these
H) to (W + w − 1, H + h − 1), thus the latency. DSLs is to show that we can generate equally good designs
iii) Hipacc does not pad input images, but run (H + bh/2c · without creating an entire compiler backend.
(W + bw/2c)) loop iterations for a (W × H) image 2) Experiments using Intel FPGA SDK for OpenCL (AOCL):
and (w × h) window. This is similar to the convolution Table IV presents the implementation results for an edge
example in the Vivado Design Suite User Guide [6], but detection algorithm provided as a design example by Intel. The
not optimal. algorithms consist of RGB to Luma color conversion, Sobel
The execution time of an implementation equals to Tclk · filters, and thresholding. Intel’s implementations consist of a
latency, where Tclk is the clock period of the maximum single-work item kernel that utilizes shift registers according
achievable clock frequency (lower is better). Overall, AnyHLS to the FPGA design paradigm. These types of techniques are
processes a given image faster than the other DSL implemen- recommended by Intel’s optimization guide [7] despite that
tations. the same OpenCL code performs drastically bad on other
Halide-HLS uses more on-chip memory for line buffers (see computing platforms.
Section IV-C2) compared to Hipacc and AnyHLS because of its
image padding for border handling. Let us consider the number Table IV
PP N R RESULTS OF AN EDGE DETECTION APPLICATION FOR THE I NTEL
of BRAMs utilized for the Gaussian blur: The line buffers need C YCLONE V. I MAGE SIZES ARE 1024 × 1024. N ONE OF THE
to hold 4 image lines for the 5 × 5 kernel. The image width IMPLEMENTATIONS USE DSP S .
is 1024 and the pixel size is 32 bits. Therefore, AnyHLS and
v Framework #M10K #ALM #DSP Throughput [MB/s]
Hipacc use eight 18K BRAMs as shown in Table II. However,
Halide-HLS stores 1028 integer pixels, which require 16 18K Intel’s Imp. 290 23830 0 419.5
1 AnyHLS 291 23797 0 422.5
BRAMs to buffer four image lines. This doubles the number Hipacc 318 25258 0 449.1
of BRAMs usage (see Table III). Intel’s Imp. - - 0 -
AnyHLS uses the vectorization architecture proposed in [40]. 16 AnyHLS 337 29126 0 1278.3
Hipacc 362 35079 0 1327.7
This improves the use of the registers compared to Hipacc and
Intel’s Imp. - - 0 -
Halide. 32 AnyHLS 401 38069 0 1303.8
The performance metrics and resource usage reported by Hipacc 421 44059 0 1320.0
Vivado HLS correlate with our Impala descriptions, hence we
claim that the HLS code generated from AnyHLS’ image We described Intel’s handwritten SobelLuma example using
processing library does not entail severe side effects for Hipacc and AnyHLS. Both Hipacc and AnyHLS provide a
the synthesis of Vivado HLS. Hipacc and Halide-HLS have higher throughput even without vectorization. In order to reach
dedicated compiler backends for HLS code generation. These memory-bound, we would have to rewrite Intel’s hand-tuned
can be improved to achieve similar performance to AnyHLS. design example to exploit further parallelism. AnyHLS uses
However, this is not a trivial task and prone to errors. The slightly less resource, whereas Hipacc provides slightly higher
advantage of AnyDSL’s partial evaluation is that the user throughput for all the vectorization factors. Similar to Figure 7,
REFERENCES
16 AnyHLS Table V
103 PP N R FOR THE I NTEL C YCLONE V. M ISSING NUMBERS (-) INDICATE THAT
Throughput in [MPixel/s]
NDRange
8 THE GENERATED IMPLEMENTATIONS DO NOT FIT THE BOARD .
4
App v Framework #M10K #ALM #DSP Throughput [MB/s]
2
CU4/SIMD16 16 AnyHLS 401 37509 0 1330.1
102 1 Gauss
16 Hipacc 402 35090 0 1301.2
16 AnyHLS 370 31446 0 1328.8
Jacobi
CU1/SIMD1 16 Hipacc 372 30296 0 1282.9
CU16/SIMD1
1 AnyHLS 399 79270 153 326.6
Bilat.
1 Hipacc 422 79892 159 434.7
20 30 40 50 60 70 80
16 AnyHLS 400 39266 0 1255.68
Hardware resources (logic utilization [%]) MF 16 Hipacc - - - -
8 Hipacc 351 31796 0 1275.9
8 AnyHLS 418 44807 0 1230.6
Figure 8. Design space for a 5 × 5 mean filter using an NDRange kernel FChain
8 Hipacc 645 64225 0 427.4
(using the num_compute_units / num_simd_work_items attributes)
8 AnyHLS 442 50537 96 1158.5
and AnyHLS (using the vectorization factor v) for an Intel Cyclone V. Harris
8 Hipacc 668 74246 96 187.14
Hipacc AnyHLS
Throughput in [MPixel/s]
10
VI. C ONCLUSIONS
2
In this paper, we advocate the use of modern compiler
29
technologies for high-level synthesis. We combine functional
abstractions with the power of partial evaluation to decouple a
high-level algorithm description from its hardware design that
28
implements the algorithm. This process is entirely driven by
code refinement, generating input code to HLS tools, such as
Harris Gauss Bilateral Jacobi FChain MF Vivado HLS and AOCL, from the same code base. To specify
important abstractions for hardware design, we have introduced
Figure 9. Throughput measurements for an Intel Cyclone V for the a set of basic primitives. Library developers can rely on these
implementations generated from AnyHLS and Hipacc. Resource utilization primitives to create domain-specific libraries. As an example,
for the same implementations are shown in Table V.
we have implemented an image processing library for synthesis
to both Intel and Xilinx FPGAs. Finally, we have shown that
our results are on par or even better in performance compared
both frameworks yield throughputs very close to the memory
to state-of-the-art approaches.
bound of the Intel Cyclone V.
The OpenCL NDRange kernel paradigm conveys multiple
ACKNOWLEDGMENTS
concurrent threads for data-level parallelism. OpenCL-based
HLS tools exploit this paradigm to synthesize hardware. AOCL This work is supported by the Federal Ministry of Education
provides attributes for NDRange kernels to transform its iter- and Research (BMBF) as part of the Metacca, MetaDL,
ation space. The num_compute_units attribute replicates ProThOS, and REACT projects as well as the Intel Visual
the kernel logic, whereas num_simd_work_items vector- Computing Institute (IVCI) at Saarland University. It was
3
izes the kernel implementation . Combinations of those provide also partially funded by the Deutsche Forschungsgemein-
a vast design space for the same NDRange kernel. However, as schaft (DFG, German Research Foundation) – project number
Figure 8 demonstrates, AnyHLS achieves implementations that 146371743 – TRR 89 “Invasive Computing”. Many thanks to
are orders of magnitude faster than using attributes in AOCL. our colleague Puya Amiri for his work on the pipeline support.
Finally, Table V and Figure 9 present a comparison between
AnyHLS and the AOCL backend of Hipacc [45]. As shown R EFERENCES
in Figure 2, Hipacc has an individual backend and template [1] J. Bachrach et al., “Chisel: Constructing hardware in a Scala
library written with preprocessor directives to generate high- embedded language”, in Proc. of the 49th Annual Design
Automation Conf. (DAC), IEEE, Jun. 3–7, 2012.
performance OpenCL code for FPGAs. In contrast, the ap-
[2] Y. Liu et al., “A scala based framework for developing accel-
plication and library code in AnyHLS stays the same. The eration systems with FPGAs”, Journal of Systems Architecture,
generated AOCL code consists of a loop that iterates over the input image. Compared to Hipacc, AnyHLS achieves similar performance but outperforms Hipacc for multi-kernel applications such as the Harris corner detector. This shows that AnyHLS optimizes the inter-kernel dependencies better than Hipacc (see Section IV-B2).

3 These parallelization attributes are suggested in [7] for NDRange kernels, not for the single work-item kernels using shift registers such as the edge detection application shown in Table IV.

[3] J. Decaluwe, "MyHDL: A Python-based hardware description language", Linux Journal, no. 127, 2004.
[4] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 30, no. 4, 2011.
[5] J. Cong et al., "Automated accelerator generation and optimization with composable, parallel and pipeline architecture", in Proc. of the 55th Annual Design Automation Conf. (DAC), ACM, Jun. 24–29, 2018.
[6] Xilinx, Vivado Design Suite user guide: high-level synthesis, UG902, 2017.
[7] Intel, Intel FPGA SDK for OpenCL: Best practices guide, 2017.
[8] R. Leißa et al., "AnyDSL: A partial evaluation framework for programming high-performance libraries", Proc. of the ACM on Programming Languages (PACMPL), vol. 2, no. OOPSLA, Nov. 4–9, 2018.
[9] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, 2013.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, 2015.
[11] G. Martin and G. Smith, "High-level synthesis: Past, present, and future", IEEE Design & Test of Computers, vol. 26, no. 4, 2009.
[12] D. F. Bacon et al., "FPGA programming for the masses", Communications of the ACM, vol. 56, no. 4, 2013.
[13] S. A. Edwards, "The challenges of synthesizing hardware from C-like languages", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[14] J. Sanguinetti, "A different view: Hardware synthesis from SystemC is a maturing technology", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[15] D. Koeplinger et al., "Spatial: A language and compiler for application accelerators", in Proc. of the 39th ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 18–22, 2018.
[16] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[17] J. S. da Silva et al., "Module-per-object: A human-driven methodology for C++-based high-level synthesis design", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[18] D. Richmond et al., "Synthesizable higher-order functions for C++", Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[19] M. A. Özkan et al., "A highly efficient and comprehensive image processing library for C++-based high-level synthesis", in Proc. of the 4th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2017.
[20] J. de Fine Licht et al., "Transformations of high-level synthesis codes for high-performance computing", The Computing Research Repository (CoRR), 2018. arXiv: 1805.08288 [cs.DC].
[21] G. Ofenbeck et al., "Spiral in Scala: Towards the systematic construction of generators for performance libraries", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 27–28, 2013.
[22] P. Milder et al., "Computer generation of hardware for linear digital signal processing transforms", ACM Trans. on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, 2012.
[23] J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines", ACM Trans. on Graphics (TOG), vol. 33, no. 4, 2014.
[24] J. Hegarty et al., "Rigel: Flexible multi-rate image processing hardware", ACM Trans. on Graphics (TOG), vol. 35, no. 4, 2016.
[25] J. Pu et al., "Programming heterogeneous systems from an image processing DSL", ACM Trans. on Architecture and Code Optimization (TACO), vol. 14, no. 3, 2017.
[26] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines", in Proc. of the Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 16–19, 2013.
[27] O. Reiche et al., "Generating FPGA-based image processing accelerators with Hipacc", in Proc. of the Int'l Conf. on Computer Aided Design (ICCAD), IEEE, Nov. 13–16, 2017.
[28] N. Chugh et al., "A DSL compiler for accelerating image processing pipelines on FPGAs", in Proc. of the Int'l Conf. on Parallel Architecture and Compilation Techniques (PACT), ACM, Sep. 11–15, 2016.
[29] Y. Chi et al., "SODA: Stencil with optimized dataflow architecture", in 2018 IEEE/ACM Int'l Conf. on Computer-Aided Design (ICCAD), IEEE, 2018.
[30] R. Stewart et al., "A dataflow IR for memory efficient RIPL compilation to FPGAs", in Proc. of the Int'l Conf. on Algorithms and Architectures for Parallel Processing (ICA3PP), Springer, Dec. 14–16, 2016.
[31] M. Kristien et al., "High-level synthesis of functional patterns with Lift", in Proc. of the 6th ACM SIGPLAN Int'l Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY@PLDI 2019), Phoenix, AZ, USA, Jun. 22, 2019.
[32] R. Baghdadi et al., "Tiramisu: A polyhedral compiler for expressing fast and portable code", in Proc. of the IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), IEEE, Feb. 16–20, 2019.
[33] E. Del Sozzo et al., "A unified backend for targeting FPGAs from DSLs", in Proc. of the 29th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2018.
[34] R. Leißa et al., "Shallow embedding of DSLs via online partial evaluation", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 26–27, 2015.
[35] M. A. Özkan et al., "A journey into DSL design using generative programming: FPGA mapping of image border handling through refinement", in Proc. of the 5th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2018.
[36] N. D. Jones et al., Partial evaluation and automatic program generation. Prentice Hall, 1993.
[37] Y. Futamura, "Partial computation of programs", in Proc. of the RIMS Symposia on Software Science and Engineering, 1982.
[38] C. Consel, "New insights into partial evaluation: The SCHISM experiment", in Proc. of the 2nd European Symp. on Programming (ESOP), Springer, Mar. 21–24, 1988.
[39] M. Schmid et al., "Loop coarsening in C-based high-level synthesis", in Proc. of the 26th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, 2015.
[40] M. A. Özkan et al., "Hardware design and analysis of efficient loop coarsening and border handling for image processing", in Proc. of the Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2017.
[41] G. Stitt et al., "Scalable window generation for the Intel Broadwell+Arria 10 and high-bandwidth FPGA systems", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, Feb. 25–27, 2018.
[42] Y.-k. Choi et al., "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms", in Proc. of the 53rd Annual Design Automation Conf. (DAC), ACM, Jun. 5–9, 2016.
[43] Core evaluation license agreement, version 2014.06, Xilinx, Inc., Jun. 2014. [Online]. Available: https://www.xilinx.com/products/intellectual-property/license/core-evaluation-license-agreement.html.
[44] Intel program license subscription agreement, version Rev. 10/2009, Intel Corporation, Oct. 2009. [Online]. Available: https://www.intel.com/content/www/us/en/programmable/downloads/software/license/lic-prog_lic.html.
[45] M. A. Özkan et al., "FPGA-based accelerator design from a domain-specific language", in Proc. of the 26th Int'l Conf. on Field-Programmable Logic and Applications (FPL), IEEE, Aug. 29–Sep. 2, 2016.
1 INTRODUCTION
High-performance systems-on-chip (SoCs) are increasingly based on heterogeneous architectures
that combine general-purpose processor cores and specialized hardware accelerators [4, 8, 22]. Ac-
celerators are hardware devices designed to perform specific functions. Accelerators have become
popular because they guarantee considerable gains in both performance and energy efficiency
with respect to the corresponding software executions [9–11, 20, 23, 29, 41, 48]. However, the
integration of several specialized hardware blocks into a complex accelerator is a difficult design
and verification task. In response to this challenge, we advocate the application of two key prin-
ciples. First, to cope with the increasing complexity of SoCs and accelerators, most of the design
effort should move away from the familiar register-transfer level (RTL) by embracing system-level
design (SLD) [18, 42] with high-level synthesis (HLS) [32, 39]. Second, it is necessary to create
reusable and flexible components, also known as intellectual property (IP) blocks, which can be
easily (re)used across a variety of architectures with different performance targets and cost metrics.
Fig. 1. COSMOS: a methodology to coordinate HLS and memory optimization for the DSE of hardware
accelerators.
replication), rather than in time. The application of this knob generally leads to a faster, but larger,
implementation of the initial specification.
Despite the advantages of HLS, performing this design-space exploration (DSE) is still a compli-
cated task, especially for complex hardware accelerators. First, the support for memory generation
and optimization is limited in current HLS tools. Some HLS tools still require third-party gener-
ators to provide a description of the memory organization and automatize the DSE process [36,
37]. Several studies, however, highlight the importance of private memories to sustain the parallel
datapath of accelerators: on a typical accelerator design, memory takes from 40% to 90% of the
area [16, 30]; hence, its optimization cannot be an independent task. Second, HLS tools are based
on heuristics, whose behavior is not robust and often hard to predict [24]. Small changes to the
knobs, e.g., changing the number of iterations unrolled in a loop, can cause significant and un-
expected modifications at the implementation level. This increases the DSE effort because small
changes to the knobs can take the exploration far from Pareto optimality.
1.3 Contributions
To address these limitations, we present COSMOS¹: an automatic methodology for the DSE of
complex hardware accelerators, which are composed of several components. COSMOS is based on
a compositional approach that coordinates both HLS tools and memory generators. First, thanks to
the datapath and memory co-design, COSMOS produces a large set of Pareto-optimal implemen-
tations for each component, thus increasing both performance and cost spans. These spans are
defined as the ratios between the maximum value and the minimum value for performance and
cost, respectively. Second, COSMOS leverages compositional design techniques to significantly re-
duce the number of invocations to the HLS tool and the memory generator. In this way, COSMOS
focuses on the most critical components of the accelerator and quickly converges to the desired
trade-off point between cost and performance for the entire accelerator. The COSMOS methodol-
ogy consists of two main steps (Figure 1). First, COSMOS uses an algorithm to characterize each
component of the accelerator individually by efficiently coordinating multiple runs of the HLS and
memory generator tools. This algorithm finds the regions in the design space of the components
that include the Pareto-optimal implementations (Component Characterization in Figure 1). Sec-
ond, COSMOS performs a DSE to identify the Pareto-optimal solutions for the entire accelerator
by efficiently solving a linear programming (LP) problem instance (Design-Space Exploration).
We evaluate the effectiveness and efficiency of the COSMOS methodology on a complex accel-
erator for wide-area motion imagery (WAMI) [3, 38], which consists of approximately 7000 lines
of SystemC code. While exploring the design space of WAMI, COSMOS returns an average perfor-
mance span of 4.1× and an average area span of 2.6×, as opposed to 1.7× and 1.2× when memory
1 COSMOS stands for “COordination of high-level Synthesis and Memory Optimization for hardware acceleratorS”. We also
adopt the name COSMOS for our methodology since it is the opposite of CHAOS (in the Greek creation myths). In our
analogy, CHAOS corresponds to the complexity of the DSE process.
optimization is not considered and only standard dual-port memories are used. Further, COSMOS
achieves the target data-processing throughput for the WAMI accelerator while reducing the
number of invocations to the HLS tool per component by up to 14.6×, with respect to an
exhaustive exploration approach.
1.4 Organization
The paper is organized as follows. Section 2 provides the necessary background for the rest of the
paper. Section 3 describes a few examples to show the effort required in the DSE process. Section 4
gives an overview of the COSMOS methodology, which is then detailed in Sections 5 (Component
Characterization) and 6 (Design-Space Exploration). Section 7 presents the experimental results.
Section 8 discusses the related work. Finally, Section 9 concludes the paper.
2 PRELIMINARIES
This section provides the necessary background concepts. We first describe the main characteris-
tics of the accelerators targeted by COSMOS in Section 2.1. Then, we present the computational
model we adopt for the DSE in Section 2.2.
by exchanging the data through an on-chip interconnect network that implements transaction-
level modeling (TLM) [19] channels. These channels synchronize the components by absorbing
the potential differences in their computational latencies with a latency-insensitive communica-
tion protocol [7]. This ensures that the components of an accelerator can always be replaced with
different Pareto-optimal implementations without affecting the correctness of the accelerator im-
plementation. COSMOS employs channels with a fixed bitwidth (256 bits) and does not explore
different design alternatives to implement the communication among the components. It can be
extended, however, to support this type of DSE by using, for example, the XKnobs [35] or buffer-
restructuring techniques [13]. Each component includes a datapath, which is organized in a set of
loops, to read and store input and output data and to compute the required functionality. There
are also private local memories (PLMs), or scratchpads, where data resides during the computation.
PLMs are multi-bank memory architectures that provide multiple read and write ports to allow
accelerators to perform parallel accesses. We generate optimized memories for our accelerators
by using the Mnemosyne memory generator [37]. Several analyses highlight the importance of
the PLMs in sustaining the parallel datapath of accelerators [16, 30]. PLMs play a key role in
the performance of accelerators [25], and they occupy from 40% to 90% of the entire area of the
components of a given accelerator [30].
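To make the latency-insensitive channel concrete, the following C++ sketch models it in software as a bounded FIFO with blocking read and write operations. It is only an illustrative analogy, not the SystemC/TLM implementation used by the authors, and the class and method names are ours.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// A bounded FIFO channel: Write() blocks while the channel is full and Read()
// blocks while it is empty, so producer and consumer tolerate each other's latency.
template <typename Token>
class Channel {
 public:
  explicit Channel(std::size_t depth) : depth_(depth) {}

  void Write(const Token& t) {
    std::unique_lock<std::mutex> lock(m_);
    not_full_.wait(lock, [&] { return q_.size() < depth_; });
    q_.push(t);
    not_empty_.notify_one();
  }

  Token Read() {
    std::unique_lock<std::mutex> lock(m_);
    not_empty_.wait(lock, [&] { return !q_.empty(); });
    Token t = q_.front();
    q_.pop();
    not_full_.notify_one();
    return t;
  }

 private:
  const std::size_t depth_;
  std::mutex m_;
  std::condition_variable not_empty_, not_full_;
  std::queue<Token> q_;
};

A producer writing into such a channel and a consumer reading from it can be swapped with faster or slower Pareto-optimal implementations without changing the functional behavior, which is the property exploited by COSMOS.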
behaviors, they are a practical model to analyze stream processing accelerators for many classes
of applications, e.g., image and signal processing applications. A PN is a bipartite graph defined as
a tuple $(P, T, F, w, M_0)$, where $P$ is a set of $m$ places, $T$ is a set of $n$ transitions, $F \subseteq (P \times T) \cup (T \times P)$
is a set of arcs, $w : F \to \mathbb{N}^+$ is an arc weighting function, and $M_0 \in \mathbb{N}^m$ is the initial marking, i.e.,
the number of tokens at each $p \in P$. A PN is strongly connected if for every pair of places $p_i$ and
$p_j$ there exists a sequence of transitions and places such that $p_i$ and $p_j$ are mutually reachable in
the net. A PN can be organized in a set of strongly-connected components, i.e., the maximal sets of
places that are strongly connected. A TMG is a PN such that (i) each place has exactly one input
and one output transition, and (ii) $w : F \to \{1\}$, i.e., every arc has a weight equal to 1. To measure
performance, TMGs are extended with a transition firing-delay vector $\tau \in \mathbb{R}^n$, which represents
the duration of each particular firing.
The minimum cycle time of a strongly-connected TMG is defined as $\max\{D_k / N_k \mid k \in K\}$,
where $K$ is the set of cycles of the TMG, $D_k$ is the sum of the transition firing delays in cycle
$k$, and $N_k$ is the number of tokens in cycle $k$ [40]. In this paper, we use the TMG model to formally
describe the accelerators. We use the term system to indicate a complex accelerator that is made of
multiple components. Each component of the system is represented with a transition in the TMG
whose firing delay is equal to its effective latency. The effective latency λ of a component is defined
as the product of its clock cycle count and target clock period. The maximum sustainable effective
throughput θ of the system is then the reciprocal of the minimum cycle time of its TMG, if the TMG
is strongly connected. Otherwise, it is the minimum θ among its strongly-connected components.
We use λ and θ as performance figures for the single components and the system, respectively. We
use the area α as the cost metric for both the components and the system.
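As a small worked example of these definitions, the following C++ sketch computes the minimum cycle time and the resulting maximum sustainable effective throughput of a strongly-connected TMG whose cycles have already been enumerated; the numeric values are hypothetical and not taken from WAMI.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Cycle {
  std::vector<double> firing_delays_ms;  // effective latencies of the transitions in the cycle
  int tokens;                            // number of tokens circulating in the cycle (N_k)
};

// Minimum cycle time = max over cycles of (sum of firing delays) / (tokens in the cycle).
double MinimumCycleTime(const std::vector<Cycle>& cycles) {
  double mct = 0.0;
  for (const Cycle& k : cycles) {
    double d_k = 0.0;
    for (double d : k.firing_delays_ms) d_k += d;
    mct = std::max(mct, d_k / k.tokens);
  }
  return mct;
}

int main() {
  // Hypothetical two-cycle TMG; the numbers are illustrative only.
  std::vector<Cycle> cycles = {
      {{2.0, 3.5}, 1},       // D_k = 5.5 ms, N_k = 1 -> 5.5 ms
      {{1.0, 1.0, 1.0}, 2},  // D_k = 3.0 ms, N_k = 2 -> 1.5 ms
  };
  double mct = MinimumCycleTime(cycles);
  std::printf("minimum cycle time: %.2f ms, effective throughput: %.3f 1/ms\n",
              mct, 1.0 / mct);
  return 0;
}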
3 MOTIVATIONAL EXAMPLES
Performing an accurate and as exhaustive as possible DSE for a complex hardware accelerator is
a difficult task for three main reasons: (i) HLS tools do not always support PLM generation and
optimization (Section 3.1), (ii) HLS tools are based on heuristics that make it difficult to configure
the knobs (Section 3.2), and (iii) HLS tools do not handle the simultaneous optimization of multiple
components (Section 3.3). Next, we detail these issues with some examples.
3.1 Memories
The joint optimization of the accelerator datapath and PLM architecture is critical for an effective
DSE. Figure 4 depicts the design space of Gradient, a component we designed for WAMI. The
graph reports different design points, each characterized in terms of area (mm2 ) and effective
latency (milliseconds), synthesized for an industrial 32nm ASIC technology library. The points
with the same color (shape) are obtained by partially unrolling the loops for different numbers
of iterations. The different colors (shapes) indicate different numbers of ports for the PLM2 . By
increasing the number of ports, we notice a significant impact on both latency and area. In fact,
multiple ports allow the component to read and write more data in the same clock cycle, thus
increasing the hardware parallelism. Multi-port memories, however, require much more area
since more banks may be used depending on the given memory-access pattern. Note that ignoring
the role of the PLM limits considerably the design space. By changing the number of ports of
the PLM, we obtain a latency span of 7.9× and an area span of 3.7×. By using standard dual-port
memories, we have only a latency span of 1.4× and an area span of 1.2×. This motivates the need
2 Here and in the rest of the paper, the number of ports indicates the number of read ports to the memories containing the
input data of the component and the number of write ports containing the output data of the component, i.e., the ports
that allow parallelism in the compute phase of the component.
Fig. 4. Example of application of two HLS knobs (number of ports, number of unrolls) to Gradient, a com-
ponent of WAMI. The nested graph magnifies the design points with 2 read and 2 write ports. The numbers
indicate the numbers of iterations unrolled.
of considering the optimization of PLMs in the DSE process. COSMOS takes into consideration
the PLMs by generating optimized memories with Mnemosyne [37].
3.3 Compositionality
Complex accelerators need to be partitioned into multiple components to be efficiently synthesized
by current HLS tools. This reduces the synthesis time and improves the quality of results, but sig-
nificantly increases the DSE effort. Figure 5 reports a simple example to illustrate this problem.
On the top, the figure reports two graphs representing a small subset of Pareto-optimal points for
Gradient and Grayscale, two components of WAMI. Assuming that they are executed sequen-
tially in a loop, their aggregate throughput is the reciprocal of the sum of their latencies. On the
bottom, the figure reports all the possible combinations of the design points of the two components,
differentiating the Pareto-optimal combinations from the Pareto-dominated combinations. These
design points are characterized in terms of area (mm2 ) and effective throughput (1/milliseconds).
In order to find the Pareto-optimal combinations at the system level, an exhaustive search method
Fig. 5. Example of composition for Gradient and Grayscale, two components of WAMI. The graphs on
the top report some Pareto-optimal points for the two components. The graph on the bottom shows all the
possible combinations of these components, assuming they are executed sequentially in a loop. In the graph
of the composition, the effective throughput is used as the performance metric.
would apply the following steps: (i) synthesize different points for each component by varying
the settings of the knobs, (ii) find the Pareto-optimal points for each component, and (iii) find the
Pareto-optimal combinations of the components at the system level. This approach is impractical
for complex accelerators. First, step (i) requires trying all the combinations of the knob settings (e.g.,
different numbers of ports and unrolls). Second, step (iii) requires evaluating an exponential
number of combinations at the system level to find those that are Pareto-optimal. In fact,
if we have $n$ components with $k$ Pareto-optimal points each, then the number of combinations to
check is $O(k^n)$. This example motivates the need for a smart compositional method that identifies
the most critical components of an accelerator and minimizes the invocations to the HLS tool. In
order to do that, COSMOS reduces the number of combinations of knob settings that are used for
synthesis and prioritizes the synthesis of the components depending on their level of contribution
to the effective throughput of the entire accelerator.
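A quick back-of-the-envelope computation shows how fast this blows up; the values of n and k below are purely illustrative.

#include <cmath>
#include <cstdio>

int main() {
  const int n = 12;  // number of components (hypothetical; WAMI happens to have 12)
  const int k = 12;  // Pareto-optimal points per component (hypothetical)
  // k^n system-level combinations would have to be checked by an exhaustive search.
  std::printf("combinations to check: %.2e\n", std::pow(double(k), n));  // about 8.9e12
  return 0;
}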
(1) Component Characterization (Section 5): in this step COSMOS analyzes each component
of the system individually; for each component it identifies the boundaries of the regions
that include the Pareto-optimal designs; starting from the HLS-ready implementation of
each component (in SystemC), COSMOS applies an algorithm that generates knob and
memory configurations to automatically coordinate the HLS and memory generator tools;
the algorithm takes into account the memories of the accelerators and tries to deal with
the unpredictability of HLS tools;
(2) Design-Space Exploration (Section 6): in this step COSMOS analyzes the design space of
the entire system; the system is modeled with a TMG to find the most critical components
for the system throughput; then, COSMOS:
5 COMPONENT CHARACTERIZATION
Algorithm 1 reports the pseudocode used for the component characterization. The designer pro-
vides the clock period, the maximum number of ports for the PLMs (mainly constrained by the
target technology and the memory generator) and the maximum number of loop unrolls. In order
to keep the delay of the logic for selecting the memory banks negligible, the number of ports should
be a power of two. Note that this constraint can be partially relaxed without requiring Euclidean
division for the selection logic [46]. The number of unrolls depends on the loop complexity. Loops
with few iterations can be completely unrolled, while more complex loops can be only partially
unrolled. In fact, unrolling loops replicates the hardware resources, thus making the scheduling
more complex for the HLS tool. The algorithm identifies regions in the design space of the com-
ponent. A region includes design points that have the same number of ports and is bounded
by an upper-left ($\lambda_{min}$, $\alpha_{max}$) and a lower-right ($\lambda_{max}$, $\alpha_{min}$) point. These regions represent the
design space of the component that will be used for the DSE at the system level, as explained in
Section 6.
ALGORITHM 1: Component Characterization
Input: clock, max_ports, max_unrolls
Output: set of regions (λ_max, α_min, λ_min, α_max)
 1  for ports = 1 up to max_ports do
 2      // Identification of the max-λ, min-α point
 3      (λ_max, α_min) = hls_tool(ports, ports, clock);
 4      // Identification of the min-λ, max-α point
 5      for unrolls = max_unrolls down to ports + 1 do
 6          (λ_min, α_max) = hls_tool(unrolls, ports, clock);
 7          if λ-constraint_ports(unrolls) is sat then break;
 8      // Generation of the PLM of the component
 9      α_plm = memory_generator(ports);
10      α_min += α_plm;  α_max += α_plm;
11      // Save the region of the design space
12      save(ports, unrolls, λ_max, α_min, λ_min, α_max);
Tool interfaces: hls_tool(unrolls, ports, clock); memory_generator(ports).
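The following C++ sketch mirrors the structure of Algorithm 1. Here hls_tool, memory_generator, and lambda_constraint_sat are placeholders for the actual tool invocations (Cadence C-to-Silicon and Mnemosyne in the paper), so only the coordination logic is shown, under our own naming.

#include <vector>

struct Point { double latency, area; };                    // (lambda, alpha)
struct Region { int ports, unrolls; Point slow, fast; };   // lower-right and upper-left extremes

Point hls_tool(int unrolls, int ports, double clock);      // placeholder: runs HLS with the given knobs
double memory_generator(int ports);                        // placeholder: returns the PLM area
bool lambda_constraint_sat(int ports, int unrolls);        // placeholder: checks the bound of Eq. (1)

std::vector<Region> Characterize(double clock, int max_ports, int max_unrolls) {
  std::vector<Region> regions;
  for (int ports = 1; ports <= max_ports; ports *= 2) {    // ports restricted to powers of two
    // Line 3: unrolls == ports gives the max-lambda, min-alpha (lower-right) point.
    Point slow = hls_tool(ports, ports, clock);
    // Lines 5-7: largest unroll factor whose schedule satisfies the lambda-constraint.
    Point fast = slow;
    int unrolls = ports;
    for (int u = max_unrolls; u >= ports + 1; --u) {
      Point candidate = hls_tool(u, ports, clock);
      if (lambda_constraint_sat(ports, u)) { fast = candidate; unrolls = u; break; }
    }
    // Lines 8-10: add the cost of the generated PLM to both extremes.
    double plm_area = memory_generator(ports);
    slow.area += plm_area;
    fast.area += plm_area;
    // Line 12: save the region of the design space.
    regions.push_back(Region{ports, unrolls, slow, fast});
  }
  return regions;
}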
Algorithm 1 starts by identifying the lower-right point of the region. To identify this design
point, it sets the number of unrolls equal to the current number of ports (line 3). This ensures that
all the ports of the PLM are exploited and the obtained point is not redundant. In fact, this point
cannot be obtained by using a lower number of ports. On the other hand, finding the upper-left
point is more challenging. A complete unroll (which could lead to the point with the minimum
latency) is infeasible for complex loops. Indeed, it is not always guaranteed that, by increas-
ing the number of unrolls, the HLS tool returns an implementation of the component that gives
lower latency in exchange for higher area occupation. To overcome these problems, Algorithm 1
introduces a constraint, called the λ-constraint in the rest of the paper, that defines the maximum number
of states that the HLS tool can insert in the body of a loop. This helps in constraining the behavior
of the HLS tool to be more deterministic and in removing some of the Pareto-dominated points.
Thus, Algorithm 1 uses the following function to estimate the number of states that should be
sufficient to schedule one iteration of the loop that includes read and write operations:
$$h_{\mathit{ports}}(\mathit{unrolls}) = \left\lceil \frac{\gamma_r \cdot \mathit{unrolls}}{\mathit{ports}} \right\rceil + \left\lceil \frac{\gamma_w}{\mathit{ports}} \right\rceil + \eta \qquad (1)$$
where $\gamma_r$ is the maximum number of read accesses to the same array per loop iteration, $\gamma_w$ is
the maximum number of write accesses to the same array per loop iteration, and $\eta$ accounts for
the latency required to perform the operations that do not access the PLM. These parameters are
inferred by traversing the control data flow graph (CDFG) created by the HLS tool for scheduling
the lower-right point. This function is used as an upper bound of the number of states that the
HLS tool can insert. If this upper bound is not sufficient, then the synthesis fails and the point is
discarded. A synthesis run with a lower number of unrolls is performed to find another point to
be used as the upper-left extreme (lines 5-7).
Example 1. Figure 6 shows an example of using the λ-constraint. The loop (reported on the left)
contains two read operations to two distinct arrays, i.e., γr = 1, and one write operation, i.e., γw = 1.
We assume that all the operations that are neither read nor write operations can be performed in
one clock cycle, i.e., η = 1. The two diagrams (on the right) show the results of the scheduling by
using two ports for the PLM and by unrolling two or three times the loop, respectively. In the first
case (unrolls = 2), the HLS tool can schedule all the operations in a maximum of $h_2(2) = 3$ clock
cycles. Thus, this point would be chosen by Algorithm 1 to be used as upper-left extreme. In the
second case (unrolls = 3), the HLS tool is not able to complete the schedule within $h_2(3) = 4$ clock
cycles (it needs at least 5 clock cycles). Thus, this point is discarded.
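The small program below re-checks the numbers of Example 1, assuming the per-term ceilings that the example implies for Equation (1); the helper name h() is ours.

#include <cassert>
#include <cmath>

// Estimated number of states for one loop iteration (Equation (1), with ceilings).
int h(int ports, int unrolls, int gamma_r, int gamma_w, int eta) {
  return static_cast<int>(std::ceil(double(gamma_r * unrolls) / ports)) +
         static_cast<int>(std::ceil(double(gamma_w) / ports)) + eta;
}

int main() {
  // Example 1: gamma_r = 1, gamma_w = 1, eta = 1, PLM with two ports.
  assert(h(/*ports=*/2, /*unrolls=*/2, 1, 1, 1) == 3);  // schedule fits: point kept
  assert(h(/*ports=*/2, /*unrolls=*/3, 1, 1, 1) == 4);  // schedule needs 5 states: point discarded
  return 0;
}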
Note that the λ-constraint is not guaranteed to obtain a Pareto-optimal point due to the intrinsic
variability of the HLS results. Still, this point can serve as an upper bound of the region in the
design space. Note also that the λ-constraint cannot be applied to loops that (i) require data from
sub-components through blocking interfaces or (ii) do not present memory accesses to the PLM.
In these cases, in fact, it is necessary to extend the definition of the estimation function given in
Equation (1) to handle such situations. Alternatively, COSMOS can optionally run some synthesis
in the neighbourhood of the maximum number of unrolls and use a local Pareto-optimal point as
the upper-left extreme.
6 DESIGN-SPACE EXPLORATION
After the characterization of the single components of a given accelerator, COSMOS uses an LP
formulation to find the Pareto-optimal design points at the system level. The DSE problem at the
system level can be formulated as follows:
Problem 1. Given a TMG model of the system where each component has been characterized, an
HLS tool, and a target granularity $\delta > 0$, find a Pareto curve $\alpha$ versus $\theta$ of the system, such that:
(i) given two consecutive points $d$, $d'$ on the Pareto curve, they have to satisfy $\max\{d'_\alpha/d_\alpha - 1,\ d'_\theta/d_\theta - 1\} < \delta$;
this ensures a maximum distance between two design points on the curve;
(ii) the HLS tool must be invoked as few times as possible.
This formulation is borrowed from [28], where the authors propose a solution that requires the
manual effort of the designers to characterize the components. In contrast, COSMOS solves this
problem by leveraging the automatic characterization method in Section 5 and by dividing it into
two steps: Synthesis Planning and Synthesis Mapping.
where the function $f_i$ returns the implementation cost ($\alpha$) of the $i$-th component given the firing-delay $\tau_i$ of transition $t_i$,
$\sigma \in \mathbb{R}^n$ is the transition-firing initiation-time vector, $M_0 \in \mathbb{N}^m$ is the initial
marking, $\tau^- \in \mathbb{R}^m$ is the input-transition firing-delay vector, i.e., $\tau_i^-$ is the firing-delay of the
transition $t_k$ entering place $p_i$ (note that $\tau^-_{min}$ and $\tau^-_{max}$ correspond to the extreme $\lambda_{min}$ and $\lambda_{max}$
$$A[i,j] = \begin{cases} +1 & \text{if } t_j \text{ is an output transition of } p_i, \\ -1 & \text{if } t_j \text{ is an input transition of } p_i, \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
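For reference, building the matrix A of Equation (3) is straightforward once each place is annotated with its single input and output transition (a TMG property); the data layout below is our own illustration.

#include <cstddef>
#include <vector>

struct Place {
  int input_transition;   // the transition that deposits tokens into this place
  int output_transition;  // the transition that consumes tokens from this place
};

// A has one row per place and one column per transition, filled as in Equation (3).
std::vector<std::vector<int>> IncidenceMatrix(const std::vector<Place>& places,
                                              int num_transitions) {
  std::vector<std::vector<int>> A(places.size(), std::vector<int>(num_transitions, 0));
  for (std::size_t i = 0; i < places.size(); ++i) {
    A[i][places[i].output_transition] = +1;  // t_j is an output transition of p_i
    A[i][places[i].input_transition] = -1;   // t_j is an input transition of p_i
  }
  return A;
}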
The objective function minimizes the implementation costs of the components, while satisfying
the system throughput requirements. Given the component extreme latencies λmin and λmax , it is
possible to determine the values of θmin and θmax by labeling the transitions of the TMG of the
system with such latencies. By iterating from θmin to θmax with a ratio of (1 + δ ), we can then find
the optimal values of λ for the components that solve Problem 1. This formulation guarantees that
the components that are not critical for the system throughput are selected to minimize their cost.
The cost functions $f_i$ in Equation (2) are unknown a priori, but they can be approximated with
convex piecewise-linear functions. This LP formulation can be solved in polynomial time [5], and
it can be extended to the case of non-strongly-connected TMGs.
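The θ sweep described above can be sketched as follows. Here solve_lp() stands in for building and solving the LP instance (done with GLPK in the paper), so only the geometric spacing of the throughput targets is shown.

#include <vector>

struct Plan {
  double theta;                             // target system throughput
  std::vector<double> component_latencies;  // optimal lambda chosen for each component
};

Plan solve_lp(double theta_target);         // placeholder for the LP of Equation (2)

// One LP solve per target throughput, with targets spaced by a ratio of (1 + delta).
std::vector<Plan> PlanSyntheses(double theta_min, double theta_max, double delta) {
  std::vector<Plan> plans;
  for (double theta = theta_min; theta <= theta_max; theta *= (1.0 + delta)) {
    plans.push_back(solve_lp(theta));
  }
  return plans;
}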
mapping function that returns the number of unrolls that should be applied, given a specific value
for the latency (we apply the ceiling function to get an integer value). For instance, if a point with
latency of 20 s is required, the mapping function returns 11 as the number of unrolls. Note that by
specifying the maximum latency, the function returns the minimum number of unrolls, while by
specifying the minimum latency, it returns the maximum number of unrolls.
It is possible that the mapping may fail by choosing a value for $\mu_{target}$ that does not satisfy the λ-
constraint (Section 5). In this case, COSMOS tries to increase the number of unrolls to preserve the
throughput. Further, if $\lambda_{target}$ is not included in any region, COSMOS uses the slowest point of the
next region that has a larger number of ports. This does not require a synthesis run (because that
point has been synthesized during the characterization), and it is a conservative solution because,
as in the case of failure of the λ-constraint, we are willing to trade area to preserve the throughput.
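Section 6.2 only states that the mapping from a planned latency to a number of unrolls is derived from Amdahl's Law and rounded up with a ceiling; the sketch below therefore assumes an Amdahl-style model λ(u) = λ_seq + λ_par/u fitted from the two extremes of a region, which is our interpretation rather than the paper's exact formula.

#include <cmath>

struct RegionExtremes {
  double lambda_max; int unrolls_min;  // lower-right point (unrolls == ports)
  double lambda_min; int unrolls_max;  // upper-left point
};

// Fit lambda(u) = s + p/u through the two extremes, then invert it and round up.
int MapLatencyToUnrolls(const RegionExtremes& r, double lambda_target) {
  double inv_lo = 1.0 / r.unrolls_min;
  double inv_hi = 1.0 / r.unrolls_max;
  double p = (r.lambda_max - r.lambda_min) / (inv_lo - inv_hi);  // "parallel" latency share
  double s = r.lambda_max - p * inv_lo;                          // "sequential" latency share
  return static_cast<int>(std::ceil(p / (lambda_target - s)));   // lambda_target must exceed s
}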
7 EXPERIMENTAL RESULTS
We implement the COSMOS methodology with a set of tools and scripts to automatize the DSE.
Specifically, COSMOS includes: (i) Mnemosyne [37] to generate multi-bank memory architectures
as described in Section 5, (ii) a tool to extract the information required by Mnemosyne from the
database of the HLS tool, (iii) a script to run the synthesis and the memory generator according
to Algorithm 1, (iv) a program that creates and solves the LP model by using the GLPK Library3
(Section 6.1), and (v) a tool that maps the LP solutions to the HLS knobs and runs the synthesis
(Section 6.2).
We evaluate the effectiveness and efficiency of COSMOS by considering the WAMI applica-
tion [38] as a case study. The original specification of the WAMI application is available in C in
the PERFECT Benchmark Suite [3]. Starting from this specification, we design a SystemC acceler-
ator to be synthesized with a commercial HLS tool, i.e., Cadence C-to-Silicon. We use an industrial
32nm ASIC technology as target library4 . We choose the WAMI application as our case study due
to (i) the different types of computational blocks it includes and (ii) its complexity. The hetero-
geneity of its computational blocks allows us to develop different components for each block and
show the vast applicability of COSMOS. The C specification is roughly 1000 lines of code. The
specification of our accelerator design is roughly 7000 lines of SystemC code.
                            COSMOS               No Memory
Component        Regions   λ span   α span     λ span   α span
Debayer             3      2.89×    1.99×      1.04×    1.36×
Grayscale           4      6.91×    3.41×      2.75×    1.14×
Gradient            4      7.89×    3.65×      1.39×    1.22×
Hessian             4      7.70×    7.30×      1.44×    1.30×
SD-Update           4      9.87×    2.01×      2.78×    1.79×
Matrix-Sub          4      2.75×    3.98×      1.88×    1.05×
Matrix-Add          3      1.53×    1.01×      1.26×    1.01×
Matrix-Mul          3      2.88×    3.05×      1.92×    1.14×
Matrix-Resh         1      1.02×    1.04×      1.02×    1.04×
Steep.-Descent      1      1.95×    1.46×      1.95×    1.46×
Change-Det.         1      2.21×    1.04×      2.21×    1.04×
Warp                1      1.09×    1.03×      1.09×    1.03×
Average             -      4.06×    2.58×      1.73×    1.22×
overall a richer DSE, as evidenced by the average results. For some components the algorithm ex-
tracts only one region because multiple ports can incur additional area with no latency gains.
This happens when (i) the algorithm cannot exploit multiple accesses in memory, or (ii) the data
is cached into local registers which can be accessed in parallel in the same clock cycle, e.g., for
Change-Detection. On the other hand, in most cases COSMOS provides significant gains in
terms of area and latency spans compared to a DSE that does not consider the memories.
Figure 9 shows the design space of four representative components of WAMI. The rectangles in
the figures are the regions found by Algorithm 1. For completeness, in addition to the design points
corresponding to the extreme points of the regions, the graphs show also the intermediate points
that could be selected by the mapping function. The small graphs on the right magnify the cor-
responding regions reported on the left. As in the examples discussed in Section 3, increasing the
number of ports has a significant impact on the DSE, while loop unrolling has a local effect within
each region. Another aspect that is common among many components is that the regions become
smaller as we keep increasing the number of ports. For example, for Grayscale in Figure 9(c), we
note that by increasing the number of ports, we reach a point where the gain in latency is no longer
significant. This effect, called diminishing returns [1], is the same effect that can be observed in the
parallelization of software algorithms. In some cases, changing the ports increases only the area
with no latency gains as discussed in the previous paragraph. This is highlighted in Figure 9(d),
where for Change-Detection we report two additional regions with respect to those specified
in Table 1. The diminishing-return effect can also be observed by increasing the number of unrolls
inside a region, e.g., Figure 9(b). This is why COSMOS exploits Amdahl’s Law (Section 6.2). On the
other hand, we notice some discontinuities of the Pareto-optimal points within some regions, e.g.,
the region in the bottom-right corner of Figure 9(a). Even by applying the λ-constraint (Section 5),
it is not possible to completely discard the Pareto-dominated implementations. In fact,
by further restricting the imposed constraints, i.e., by reducing the number of states that the
HLS tool can insert in each loop, we observe that also the Pareto-optimal implementations are
discarded. Thus, it is not always possible to obtain a curve composed only of Pareto-optimal points
within a certain region. Finally, the Pareto-optimal points outside the regions are not discarded by
COSMOS. They can be chosen when it is necessary to perform the mapping (Section 6.2).
The mismatch between a planned point and the corresponding mapped point is measured as
$$\sigma(d_p, d_m) = \frac{|d_m - d_p|}{d_p}$$
where $d_p$ is the area of a planned point $p$, while $d_m$ is the area of the corresponding mapped
point $m$. Each planned point in Figure 10 is labeled with its corresponding $\sigma$% value. Note that the
curve obtained with LP is a theoretical curve because the points found at the system level do not
guarantee the existence of a corresponding set of implementations for the components. The error
is mainly due to the impact of the memory, which determines a significant distance between two
consecutive regions (e.g., the points with more than 10% mismatch in Figure 10). In fact, if a point
is mapped between two regions, it must be approximated with the lower-right point of the next
region with lower effective latency. This choice almost always satisfies the throughput requirements,
but at the expense of additional area. In fact, even if Equation (2) is constrained by
the system throughput, the same throughput is not always guaranteed, because a mapped point
with exactly the same latency as a planned point may not exist.
To solve this issue, one could try to reduce the clock period and satisfy the throughput
requirements.
Fig. 11. Number of invocations of the HLS tool for an exhaustive exploration (bars on the left) and COSMOS
(on the right).
Finally, to demonstrate the efficiency of COSMOS, Figure 11 shows the number of invocations
to the HLS tool. For each component of WAMI, the right bars report the breakdown of the syn-
thesis calls performed in each phase of the algorithm. At least two invocations are necessary for
each region to characterize a component. Then, we have to consider the invocations that fail due
to the λ-constraint, and finally the invocations required at the system level for the most critical
components (mapping). Some components do not play any role in the efficiency of the system.
For example, for Matrix-Mul, there are no invocations after the characterization because only the
slowest version has been requested by Equation (2) (to save area). This component is not important
to guarantee a high throughput for the entire system. Moreover, some synthesized points belong
to multiple solutions of the LP problem, as in the case of Debayer. Therefore, COSMOS avoids
invoking the HLS tool with the same knobs more than once. On the other hand, the
left bars in Figure 11 report the number of invocations required for an exhaustive exploration. Such
an exploration requires (i) synthesizing all the possible configurations of unrolls and memory ports
for each component, (ii) finding the Pareto-optimal design points for each component, and (iii) com-
posing all the Pareto-optimal designs to find the Pareto curve at the system level (Section 3). The left
bars in Figure 11 show the number of invocations to the HLS tool required in step (i). COSMOS
reduces the total number of invocations for WAMI by 6.7× on average and up to 14.6× for the
single components, compared to the exhaustive exploration. Further, while COSMOS returns the
Pareto-optimal implementations at the system level, to find the combinations of the components
that are Pareto optimal with an exhaustive search method, one has to combine the huge number
of solutions for the single components. In the case of WAMI, the number of combinations, i.e.,
the product of the number of Pareto-optimal points of each component, is greater than $9 \times 10^{12}$.
This motivates the need for a compositional method like COSMOS for the DSE of complex
accelerators.
7.4 Summary
We report a brief summary of the achieved results:
• COSMOS guarantees a richer DSE with respect to the approaches that do not consider the
memory as an integral part of the DSE: for WAMI, COSMOS guarantees an average perfor-
mance span of 4.06× and an average area span of 2.58× as opposed to 1.73× and 1.22×,
respectively, when only standard dual-port memories are used; COSMOS obtains a richer
set of Pareto-optimal implementations thanks to memory generation and optimization;
• COSMOS guarantees a faster DSE compared to exhaustive search methods: for WAMI,
COSMOS reduces the number of invocations to the HLS tool by 6.7× on average and by up
to 14.6× for the single components; COSMOS is able to reduce the number of invocations
thanks to the compositional approach discussed in Section 6;
• COSMOS is an automatic and scalable methodology for DSE: the approach is intrinsically
compositional, and thus with larger designs the gains are expected to be at least
as good as with smaller ones, if not better. While an exhaustive method has to explore all the
alternatives, COSMOS focuses on the most critical components.
8 RELATED WORK
This section describes the most closely related methods to perform DSE. We distinguish the meth-
ods that explore single-component designs (reported in Section 8.1) from those that are composi-
tional like COSMOS (in Section 8.2).
to account for the high variability and partial unpredictability of the HLS tools. Such constraints
consider both the dependency graph of the specification and the memory references in each loop.
Thus, COSMOS identifies larger regions of Pareto-optimal implementations.
Other methods, such as Aladdin [47], perform a DSE without using HLS tools and without gener-
ating the RTL implementations, estimating the performance and costs of high-level specifications
(C code for Aladdin). COSMOS differs from these methods because it aims at generating efficient
RTL implementations by using HLS and memory generator tools. Indeed, such methods can be
used before applying COSMOS to pre-characterize the different components of an accelerator that
is not ready to be synthesized with HLS tools. Since the design of HLS-ready specifications requires
significant effort [39], this can help designers focus only on the most critical components,
i.e., those that are expected to return good performance gains over software executions. After this
pre-characterization, COSMOS can be used to perform a DSE of such components and obtain the
Pareto-optimal combinations of their RTL implementations.
9 CONCLUDING REMARKS
We presented COSMOS, an automatic methodology for compositional DSE that coordinates both
HLS and memory generator tools. COSMOS takes into account the unpredictability of the current
HLS tools and considers the PLMs of the components as an essential part of the DSE. The method-
ology of COSMOS is intrinsically compositional. First, it characterizes the components to define
the regions of the design space that contain Pareto-optimal implementations. Then, it exploits an
LP formulation to find the Pareto-optimal solutions at the system level. Finally, it identifies the
knobs for each component that can be used to obtain the corresponding implementations at RTL.
We showed the effectiveness and efficiency of COSMOS by considering the WAMI accelerator as
a case study. Compared to methods that do not consider the PLMs, COSMOS finds a larger set of
Pareto-optimal implementations. Additionally, compared to exhaustive search methods, COSMOS
reduces the number of invocations to the HLS tool by up to one order of magnitude.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their valuable comments and help-
ful suggestions that helped us improve the paper considerably. This work was supported in part
by DARPA PERFECT (C#: R0011-13-C-0003), the National Science Foundation (A#: 1527821), and
C-FAR (C#: 2013-MA-2384), one of the six centers of STARnet, a Semiconductor Research Corpo-
ration program sponsored by MARCO and DARPA.
REFERENCES
[1] G. M. Amdahl. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In
Proc. of the ACM Spring Joint Computer Conference (AFIPS).
[2] N. Baradaran and P. C. Diniz. 2008. A Compiler Approach to Managing Storage and Memory Bandwidth in Config-
urable Architectures. ACM Transaction on Design Automation of Electronic Systems (2008).
[3] K. Barker, T. Benson, D. Campbell, D. Ediger, R. Gioiosa, A. Hoisie, D. Kerbyson, J. Manzano, A. Marquez,
L. Song, N. Tallent, and A. Tumeo. 2013. PERFECT (Power Efficiency Revolution For Embedded Computing Tech-
nologies) Benchmark Suite Manual. Pacific Northwest National Laboratory and Georgia Tech Research Institute.
http://hpc.pnl.gov/PERFECT/.
[4] S. Borkar and A. Chien. 2011. The Future of Microprocessors. Communication of the ACM (2011).
[5] S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
[6] J. Campos, G. Chiola, J. M. Colom, and M. Silva. 1992. Properties and Performance Bounds for Timed Marked Graphs.
IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications (1992).
[7] L. P. Carloni. 2015. From Latency-Insensitive Design to Communication-Based System-Level Design. Proc. of the IEEE
(2015).
[8] L. P. Carloni. 2016. The Case for Embedded Scalable Platforms. In Proc. of the ACM/IEEE Design Automation Conference
(DAC). (Invited).
[9] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. 2014. DaDianNao:
A Machine-Learning Supercomputer. In Proc. of the Annual ACM/IEEE International Symposium on Microarchitecture
(MICRO).
[10] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks. IEEE Journal of Solid-State Circuits (2017).
[11] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman. 2014. Accelerator-Rich Architectures:
Opportunities and Progresses. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[12] J. Cong, P. Li, B. Xiao, and P. Zhang. 2016. An Optimal Microarchitecture for Stencil Computation Acceleration Based
on Nonuniform Partitioning of Data Reuse Buffers. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems (2016).
[13] J. Cong, P. Wei, C. H. Yu, and P. Zhou. 2017. Bandwidth Optimization Through On-Chip Memory Restructuring for
HLS. In Proc. of the Annual Design Automation Conference (DAC).
[14] J. Cong, P. Zhang, and Y. Zou. 2011. Combined Loop Transformation and Hierarchy Allocation for Data Reuse Opti-
mization. In Proc. of the ACM/IEEE International Conference on Computer-Aided Design (ICCAD).
[15] J. Cong, P. Zhang, and Y. Zou. 2012. Optimizing Memory Hierarchy Allocation with Loop Transformations for High-
Level Synthesis. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[16] E. G. Cota, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2015. An Analysis of Accelerator Coupling in Hetero-
geneous Architectures. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[17] F. Ferrandi, P. L. Lanzi, D. Loiacono, C. Pilato, and D. Sciuto. 2008. A Multi-objective Genetic Algorithm for Design
Space Exploration in High-Level Synthesis. In Proc. of the IEEE Computer Society Annual Symposium on VLSI.
[18] A. Gerstlauer, C. Haubelt, A. D. Pimentel, T. P. Stefanov, D. D. Gajski, and J. Teich. 2009. Electronic System-level
Synthesis Methodologies. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2009).
[19] F. Ghenassia. 2006. Transaction-Level Modeling with SystemC. Springer-Verlag.
[20] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A High-Performance and Energy-
Efficient Accelerator for Graph Analytics. In Proc. of the Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO).
[21] C. Haubelt and J. Teich. 2003. Accelerating Design Space Exploration Using Pareto-Front Arithmetics [SoC design].
In Proc. of the ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC).
[22] M. Horowitz. 2014. Computing’s energy problem (and what we can do about it). In Proc. of the IEEE International
Solid-State Circuits Conference (ISSCC).
[23] L. W. Kim. 2017. DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.
IEEE Transactions on Neural Networks and Learning Systems (2017).
[24] S. Kurra, N. K. Singh, and P. R. Panda. 2007. The Impact of Loop Unrolling on Controller Delay in High Level Synthesis.
In Proc. of the ACM/IEEE Conference on Design, Automation and Test in Europe (DATE).
[25] B. Li, Z. Fang, and R. Iyer. 2011. Template-based Memory Access Engine for Accelerators in SoCs. In Proc. of the
ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC).
[26] H. Y. Liu and L. P. Carloni. 2013. On Learning-Based Methods for Design-Space Exploration with High-Level Synthe-
sis. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[27] H. Y. Liu, I. Diakonikolas, M. Petracca, and L. P. Carloni. 2011. Supervised Design Space Exploration by Compositional
Approximation of Pareto Sets. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[28] H. Y. Liu, M. Petracca, and L. P. Carloni. 2012. Compositional System-Level Design Exploration with Planning of
High-Level Synthesis. In Proc. of the ACM/IEEE Conference on Design, Automation, and Test in Europe (DATE).
[29] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen. 2016. High Level Synthesis of Complex Appli-
cations: An H.264 Video Decoder. In Proc. of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA).
[30] M. J. Lyons, M. Hempstead, G. Y. Wei, and D. Brooks. 2012. The Accelerator Store: A Shared Memory Framework for
Accelerator-based Systems. ACM Transactions on Architecture and Code Optimization (2012).
[31] A. Mahapatra and B. Carrion Schafer. 2014. Machine-learning based Simulated Annealer Method for High Level
Synthesis Design Space Exploration. In Proc. of the Electronic System Level Synthesis Conference (ESLsyn).
[32] W. Meeus, K. Van Beeck, T. Goedemé, J. Meel, and D. Stroobandt. 2012. An Overview of Today’s High-Level Synthesis
Tools. Design Automation for Embedded Systems (2012).
[33] V. K. Mishra and A. Sengupta. 2014. PSDSE: Particle Swarm Driven Design Space Exploration of Architecture and
Unrolling Factors for Nested Loops in High Level Synthesis. In Proc. of the IEEE International Symposium on Electronic
System Design (ISED).
[34] T. Murata. 1989. Petri Nets: Properties, Analysis and Applications. Proc. of the IEEE (1989).
[35] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. Broadening the Exploration of the Accelerator
Design Space in Embedded Scalable Platforms. In Proc. of the IEEE High Performance Extreme Computing Conference
(HPEC).
[36] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2014. System-level Memory Optimization for High-level
Synthesis of Component-based SoCs. In Proc. of the ACM/IEEE International Conference on Hardware/Software Code-
sign and System Synthesis (CODES+ISSS).
[37] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. System-Level Optimization of Accelerator Local
Memory for Heterogeneous Systems-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems (2017).
[38] R. Porter, A. M. Fraser, and D. Hush. 2010. Wide-Area Motion Imagery. IEEE Signal Processing Magazine (2010).
[39] A. Qamar, F. B. Muslim, F. Gregoretti, L. Lavagno, and M. T. Lazarescu. 2017. High-Level Synthesis for Semi-Global
Matching: Is the Juice Worth the Squeeze? IEEE Access (2017).
[40] C. V. Ramamoorthy and G. S. Ho. 1980. Performance Evaluation of Asynchronous Concurrent Systems Using Petri
Nets. IEEE Transaction on Software Engineering (1980).
[41] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Y. Wei, and D. Brooks.
2016. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. In Proc. of the ACM/IEEE
Annual International Symposium on Computer Architecture (ISCA).
[42] A. Sangiovanni-Vincentelli. 2007. Quo Vadis, SLD? Reasoning About the Trends and Challenges of System Level
Design. Proc. of the IEEE (2007).
[43] B. Carrion Schafer. 2016. Probabilistic Multiknob High-Level Synthesis Design Space Exploration Acceleration. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems (2016).
[44] B. Carrion Schafer, T. Takenaka, and K. Wakabayashi. 2009. Adaptive Simulated Annealer for High Level Synthe-
sis Design Space Exploration. In Proc. of the IEEE International Symposium on VLSI Design, Automation and Test
(VLSI-DAT).
[45] B. Carrion Schafer and K. Wakabayashi. 2012. Machine Learning Predictive Modelling High-Level Synthesis Design
Space Exploration. IET Computers Digital Techniques (2012).
[46] A. Seznec. 2015. Bank-interleaved Cache or Memory Indexing Does Not Require Euclidean Division. In Proc. of the
Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD).
[47] Y. S. Shao, B. Reagen, G. Y. Wei, and D. Brooks. 2014. Aladdin: A Pre-RTL, Power-performance Accelerator Simulator
Enabling Large Design Space Exploration of Customized Architectures. In Proc. of the ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA).
[48] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks. In Proc. of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA).
Extending High-Level Synthesis for Task-Parallel
Programs
Yuze Chi∗ , Licheng Guo∗ , Jason Lau∗ , Young-kyu Choi∗† , Jie Wang∗ , Jason Cong∗
∗ University of California, Los Angeles, † Inha University
{chiyuze,cong}@cs.ucla.edu
Abstract—C/C++/OpenCL-based high-level synthesis (HLS) takes only a few minutes for a simple design or a component
becomes more and more popular for field-programmable gate in a modular design.
array (FPGA) accelerators in many application domains in recent
Thanks to the advances in HLS scheduling algorithms [13–
arXiv:2009.11389v2 [cs.AR] 6 May 2021
1
[…] compilation on each task and unnecessarily slows down code generation. Programmers can manually synthesize tasks separately and instantiate them in RTL, but doing so requires debugging RTL code, which is time-consuming and error-prone. We think such processes should be automated.
Limited productivity support for task-parallel programs significantly elongates the development cycle and undermines the benefits brought by HLS. One may argue that programmers should always go for data-parallel implementations when designing FPGA accelerators using HLS, but data parallelism may be inherently limited, for example, in applications involving graphs. Moreover, research shows that even for data-parallel applications such as neural networks [3] and stencil computation [9], task-parallel implementations show better scalability and higher frequency than their data-parallel counterparts due to the localized communication pattern [26]. In fact, at least 6 of the 28 research papers published at the ACM FPGA 2020 conference [11, 27–31] use task-parallel implementations with HLS, and another 3 papers [32–34] use RTL implementations that would have required a task-parallel implementation if written in HLS.
In this paper, we extend the HLS C++ language and present our framework, TAPA (task-parallel)¹, as a solution to the aforementioned limitations of HLS productivity. Our contributions include:
• Convenient programming interfaces: We show that, with peeking and transactions added to the programming interfaces, TAPA can be used to program task-parallel kernels with a 22% reduction in lines of code (LoC) on average. By unifying the interface used for the kernel and host, TAPA further reduces the LoC on the host side by 51% on average.
• Unconstrained software simulation: We demonstrate that our proposed simulator can correctly simulate task-parallel programs that existing software simulators fail to simulate. Moreover, the correctness verification cycle can be shortened by a factor of 3.2× on average.
• Hierarchical code generation: We show that by modularizing a task-parallel program and using a hierarchical approach, RTL code generation can be accelerated by a factor of 6.8× on our server with 32 hyper-threads.
• Fully automated open-source framework: TAPA is open-source at https://github.com/UCLA-VAST/tapa/.
Table I summarizes the related work. Among all general HLS tools (Section VI-A) and streaming frameworks (Section VI-B): ① none of them supports peeking in their kernel APIs; ② only Intel HLS stream and Vivado HLS axis support transactions; ③ only Merlin allows the accelerator kernel to be called from the host as if it were a C/C++ function; ④ Vivado HLS, Merlin, and both streaming frameworks (ST-Accel [36] and Fleet [37]) execute tasks sequentially for simulation, which works only for limited applications, while the others launch one thread per task instance, which does not scale well; ⑤ all general HLS tools treat a task-parallel program as a monolithic design and generate RTL code for each instance of a task separately, except that Vivado HLS axis allows programmers to manually instantiate tasks using a configuration file when running logic synthesis and implementation. To the best of our knowledge, TAPA is the only work that provides convenient programming interfaces, unconstrained software simulation, and hierarchical code generation for general task-parallel programs on FPGAs using HLS.

TABLE I: Summary of related work.
Related Work         | Peeking | Transaction | Host Iface. | Software Simulation | RTL Code Generation
Fleet [37]           | No      | No          | N/A         | Sequential          | N/A
Intel HLS (pipe)     | No      | No          | N/A         | Multi-thread        | Monolithic
Intel HLS (stream)   | No      | Yes         | N/A         | Multi-thread        | Monolithic
Intel OpenCL         | No      | No          | OpenCL      | Multi-thread        | Monolithic
LegUp [38, 39]       | No      | No          | N/A         | Multi-thread        | Monolithic
Merlin [40]          | No      | No          | C++         | Sequential          | Monolithic
ST-Accel [36]        | No      | No          | VFS         | Sequential          | Hierarchical
Vivado HLS (ap_fifo) | No      | No          | OpenCL      | Sequential          | Monolithic
Vivado HLS (axis)    | No      | Yes         | OpenCL      | Multi-thread        | Manual
Xilinx OpenCL        | No      | No          | OpenCL      | Multi-thread        | Monolithic
TAPA                 | Yes     | Yes         | C++         | Coroutine           | Hierarchical

¹ While a prior work TAPAS [35] and our work TAPA share similarity in name, our work focuses on statically mapping tasks to hardware, whereas TAPAS specializes in dynamically scheduling tasks.

II. BACKGROUND

A. Task-Parallel Program
Task-level parallelism is a form of parallelization of computer programs across multiple processors. In contrast to data parallelism, where the workload is partitioned on data and each processor executes the same program (e.g., OpenMP [41]), different processors in a task-parallel program often behave differently, while data are passed between processors. Examples of task-parallel programs include image processing pipelines [9–11], graph processing [42–45], and network switching [33]. Task-parallel programs are often described using dataflow models [46–50], where tasks are called processes. Processes communicate only through unidirectional channels. Data exchanged through channels are called tokens. In this paper, we borrow the terms channel and token, and focus on the problem of statically mapping tasks to hardware. That is, instances of tasks are synthesized to different areas in an FPGA accelerator. We plan to address dynamic scheduling [35, 39, 51] in our future work.
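To make the process/channel/token vocabulary above concrete, here is a minimal software sketch of a dataflow pipeline with bounded, unidirectional channels; the Channel class, depths, and process names are illustrative and are not part of TAPA's API.

#include <cstddef>
#include <cstdio>
#include <queue>

template <typename T>
class Channel {  // bounded, unidirectional channel carrying tokens of type T
 public:
  explicit Channel(std::size_t depth) : depth_(depth) {}
  bool full() const { return q_.size() >= depth_; }
  bool empty() const { return q_.empty(); }
  void write(const T& token) { q_.push(token); }      // caller checks full()
  T read() { T t = q_.front(); q_.pop(); return t; }  // destructive read
  const T& peek() const { return q_.front(); }        // non-destructive read
 private:
  std::size_t depth_;
  std::queue<T> q_;
};

// Two processes of a tiny pipeline; each consumes and produces tokens only
// through its channels, as in the dataflow models referenced above.
void Producer(Channel<int>& out, int n) {
  for (int i = 0; i < n && !out.full(); ++i) out.write(i);
}
void Doubler(Channel<int>& in, Channel<int>& out) {
  while (!in.empty() && !out.full()) out.write(2 * in.read());
}

int main() {
  Channel<int> a(8), b(8);
  Producer(a, 4);
  Doubler(a, b);
  while (!b.empty()) std::printf("%d\n", b.read());
}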
B. A Motivating Example
An on-chip ring network is a commonly used topology to provide all-to-all interconnection among many task-parallel processing elements (PEs) in a single FPGA accelerator, which is particularly useful in graph processing [52–58] where each vertex may be connected to any other vertex. A ring network has the advantages of simplicity and high routability, but implementing a customized ring network in HLS faces several issues that make such designs verbose to write, hard to read, and error-prone. In this section, we use a simplified real-world design to illustrate the productivity issues of implementing
such a ring network in HLS, which serves as a motivating example for our work.

Fig. 1: An accelerator with 4 PEs connected via a ring network.

Fig. 1 shows an example where PEs in an accelerator are interconnected via a ring network. In this example, network nodes form a cyclic ring, and each ring node is connected to a PE via a bidirectional link. Each PE can send packets to other PEs through its associated node, specifying the destination PE in the packet header. Each node forwards packets either to its next node or to its associated PE, based on the packet header. We assume packets are sent infrequently and channels between nodes are provisioned so that they will never be full. Furthermore, we would like to insert packets from PEs into the network as soon as possible so that PEs will not stall due to back pressure from the ring nodes. While such a ring node can be written using Vivado HLS (Listing 1), we found that the following features are missing or hard to use in the HLS tools, which significantly degrades productivity.
1) Peeking: Peeking is defined as reading a token from a channel without consuming it. Compared with a normal destructive read, peeking is non-destructive because the token may be read many times. For example, in our ring network, when Node 1 receives incoming packets from both PE 1 (via pe_in) and Node 0 (via node_in), it will forward the packet from PE 1 to Node 2 (via node_out) to prevent PE 1 from being stalled due to back pressure. In the same clock cycle, the packet from Node 0 cannot be forwarded unless the destination of that packet is PE 1 (via pe_out), because we cannot write two tokens to the same output channel (node_out) in the same clock cycle. This requires us to conditionally read tokens based on the content of the tokens. Without a peek API, one has to manually maintain a buffer for the incoming values, as shown in Lines 7–15 of Listing 1. This not only increases the programming burden, but also makes the design prone to errors in the state transitions of the buffer.
2) Transactions: A sequence of tokens may constitute a single logical communication transaction. Using the same ring network example, we consider the whole accelerator execution as a logical communication transaction and let each PE control the termination of its RingNode, as shown in Line 11 of Listing 1. Without an eot API, one has to manually add a special bit to the data structure to indicate the end of transaction (Lines 1–4 of Listing 1). Note that the Pkt struct may be used elsewhere, thus it may be infeasible to add the eot bit directly to the Pkt struct. Moreover, determining the end of transaction must be a peek operation; otherwise, the HLS compiler will be unable to schedule the exit condition in the first stage of the pipeline, leading to an initiation interval (II) greater than 1. This further complicates the HLS implementation (Listing 1).
3) System integration: To offload a computation kernel from the host CPU to PCIe-based FPGA accelerators, programmers need to write host-side code to interface the accelerator kernel with the host. FPGA vendors adopt the OpenCL standard to provide such functionality. While the standard OpenCL host-kernel interface infrastructure relieves programmers from writing their own operating system drivers and low-level libraries, it is still inconvenient and hard to use. Programmers often have to write and debug tens of lines of code just to set up the host-kernel interface. This includes manually setting up environment variables for simulation, and creating and maintaining OpenCL Context, CommandQueue, Program, Kernel, etc. data structures [59]. Task-parallel accelerators often make the situation worse because the parallel tasks are often described as distinct OpenCL kernels [24], which significantly increases the programmers' burden of managing multiple kernels in the host-kernel interface. In our experiments, more than 60 lines of host code are created just for the host-kernel integration, which constitutes more than 20 percent of the whole source code. Yet, what we want is just a single function invocation of the synthesized FPGA bitstream given proper arguments.
4) Software simulation: C does not have explicit parallel semantics by itself. Vivado HLS uses the dataflow model and allows programmers to instantiate tasks by invoking each of them sequentially [23]. While this is very concise to write (Listing 2), it leads to incorrect simulation results because the communication between a ring node and its corresponding PE is bidirectional, yet sequential execution can only send tokens from nodes to PEs because of their invocation order. This problem was also pointed out in [60]. In order to run software simulation correctly, the programmer can change the source code to run tasks in multiple threads, but doing so requires the same piece of task instantiation code to be written twice, once for synthesis and once for simulation, reducing productivity. While there exist other tools (e.g., [24]) that can run tasks in parallel threads and do not have the same correctness problem, we will show in Section V-D that such simulators do not scale well when the number of task instances increases.
5) RTL code generation: In our ring network example, the same ring node is instantiated many times. While state-of-the-art HLS compilers can recognize multiple instances of the same function and reuse HLS results for regular non-task-parallel programs, task-parallel programs are always treated as a monolithic design. This means instances of the same task in a task-parallel program are treated as if they were different, possibly in order to explore different communication interfaces for each instance. This significantly elongates the code generation time when the number of instances is large (Section V-E). We can manually perform hierarchical code generation, i.e., synthesize each task separately and connect the generated RTL code, but doing so forces us to debug RTL code and spend tens of minutes to verify correctness for each code modification, which defeats the purpose of adopting HLS.
In this paper, we present the TAPA framework and address […]
1  struct PktEoT {                  // Auxiliary struct for termination control;
2    Pkt pkt;                       // eot stands for "end of transaction".
3    bool eot;
4  };
5  void RingNode(stream<Pkt>& node_in, stream<PktEoT>& pe_in,
6                stream<Pkt>& node_out, stream<Pkt>& pe_out) {
7    Pkt node_pkt;                  // Manually maintained input buffers
8    bool node_pkt_valid = false;   // to implement non-destructive
9    PktEoT pe_pkt;                 // read (i.e., peek).
10   bool pe_pkt_valid = false;
11   while (!(pe_pkt_valid && pe_pkt.eot)) {
12     if (!pe_pkt_valid)           // Manually update
13       pe_pkt_valid = pe_in.read_nb(pe_pkt);
14     if (!node_pkt_valid) […]
Listing 1: RingNode written with Vivado HLS streams.

void RingNode(istream<Pkt>& node_in, istream<Pkt>& pe_in,
              ostream<Pkt>& node_out, ostream<Pkt>& pe_out) {
  while (!pe_in.eot()) {
    if (!pe_in.empty()) {
      node_out.write(pe_in.read());
      if (!node_in.empty() && IsForThisNode(node_in.peek()))
        pe_out.write(node_in.read());
    } else if (!node_in.empty()) {
      Pkt pkt = node_in.read();
      (IsForThisNode(pkt) ? pe_out : node_out).write(pkt);
    }
  }
}
// In the original, destructive read operations and the
// non-destructive read (peek) operations are highlighted.
Listing 3: RingNode written with the TAPA interfaces.
void Kernel(...) {
  channel<Pkt, 2> node_0_1, node_1_2, ...
  channel<Pkt, 2> from_pe_0, to_pe_0, from_pe_1, to_pe_1, ...
  // Instantiates other channels...
  task()
      .invoke(RingNode, node_0_1, node_1_2, from_pe_1, to_pe_1)
      .invoke(RingNode, node_1_2, node_2_3, from_pe_2, to_pe_2)
      // Instantiates other ring nodes and PEs...
}
Listing 4: Accelerator task instantiation in TAPA.

[…] the runtime environment properly. As a user of TAPA, the programmer can use a single function invocation in the same source code to run software simulation, hardware simulation, and on-board execution, with the only difference being the kernel binary that is specified.

IV. TAPA FRAMEWORK IMPLEMENTATION

A. Software Simulation
State-of-the-Art Approach: There are two state-of-the-art approaches to running software simulation for task-parallel applications: the sequential approach and the multi-thread approach. A sequential simulator invokes tasks sequentially in the invocation order [23]. Sequential simulators are fast, but cannot correctly simulate the capacity of channels and applications with tasks communicating bidirectionally, as discussed in Section II-B. A multi-thread simulator invokes tasks in parallel by launching a thread for each task. This enables the capacity of channels and bidirectional communication to be simulated correctly. However, such simulators may perform poorly due to the inefficient context switches handled by the operating system. The FLASH simulator [60, 61] proposed an alternative to the above, which uses HLS scheduling information to create an interleaved execution of all tasks. Note that although FLASH is also single-threaded, it is different from a sequential simulator because it interleaves tasks via source-to-source transformation while a sequential simulator does not. Compared with a sequential simulator, FLASH is on average 1.7× slower [61], due to the additional scheduling information taken into consideration for cycle-accurate modeling. Besides, generating the simulation executable becomes slower due to the need for the HLS scheduler output for cycle accuracy, which is not needed for correctness verification.
In this section, we present an alternative approach to running software simulation of task-parallel applications. Given that the inefficiency of multi-thread execution is mainly caused by the preemptive nature of operating system threads, we propose an approach that uses collaborative coroutines [62, 63] instead of preemptive threads for each task. Note that fast and/or cycle-accurate debugging in general [64] is out of the scope of this paper; we focus on the correctness and scalability issues of task-parallel programs.
Coroutine-Based Approach: Routines in programming languages are the units of execution contexts, e.g., functions in C/C++ [65]. Coroutines [66] are routines that execute collaboratively; more specifically, coroutines can be explicitly suspended and resumed. A coroutine can invoke subroutines and suspend from and resume to any subroutine [63]. A context switch between coroutines takes only 26 ns on modern CPUs [63], while a preemptive thread context switch takes 1.2~2.2 µs [67], which is two orders of magnitude slower.
TAPA leverages coroutines to perform software simulation as follows. When a task is instantiated, a coroutine is launched but suspended immediately. Once all tasks are instantiated, the simulator starts to resume the suspended coroutines. A resumed task will be suspended again if any input channel is accessed when empty or any output channel is accessed when full, which means that no progress can be made by this task. A different task will then be selected and resumed by the simulator. Moreover, the coroutines can be distributed in a thread pool. The thread pool launches one thread per CPU core and can bind each thread to the corresponding core, which prevents the threads from preempting each other. This improves simulation parallelism without introducing the high context switch overhead of the multi-thread simulators. We will show in Section V-D that the coroutine-based simulator outperforms the existing simulators by 3.2× on average. The TAPA software simulator is implemented as a C++ library, which can be compiled by any compatible C++ compiler.
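The scheduling idea behind the coroutine-based simulator can be sketched as follows. Instead of real coroutines, this simplified model gives each task a resumable step function that reports whether it made progress; the simulator keeps resuming tasks until none can proceed. All names and channel depths are illustrative.

#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

// Cooperative simulation sketch: each "task" exposes a resumable step that
// returns false when it is blocked (input empty / output full) or finished.
struct Fifo {
  std::deque<int> q;
  std::size_t depth = 2;
  bool full() const { return q.size() >= depth; }
  bool empty() const { return q.empty(); }
};

int main() {
  Fifo a, b;
  int produced = 0, consumed = 0;

  auto producer = [&]() -> bool {          // blocked when its output is full
    if (produced >= 8 || a.full()) return false;
    a.q.push_back(produced++);
    return true;
  };
  auto relay = [&]() -> bool {             // blocked when in empty or out full
    if (a.empty() || b.full()) return false;
    b.q.push_back(a.q.front() * 2);
    a.q.pop_front();
    return true;
  };
  auto consumer = [&]() -> bool {          // blocked when its input is empty
    if (b.empty()) return false;
    std::printf("token %d\n", b.q.front());
    b.q.pop_front();
    ++consumed;
    return true;
  };

  std::vector<std::function<bool()>> tasks = {producer, relay, consumer};
  bool progress = true;
  while (progress) {   // resume tasks round-robin until none can make progress
    progress = false;
    for (auto& t : tasks) progress |= t();
  }
  std::printf("consumed %d tokens\n", consumed);
}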
B. RTL Code Generation
State-of-the-Art Approach: Current HLS tools treat the whole task-parallel program as a monolithic design, treat channels as global variables, and compile different instances of tasks as if they were completely unrelated. This can lead to a significant amount of repeated work. For example, the dataflow architecture generated by the stencil accelerator compiler SODA [7, 9] is highly modularized and has many functionally identical modules. However, both the Vivado HLS and Intel FPGA OpenCL backends generate RTL code for each module separately. When the design scales out to hundreds of modules, RTL code generation can easily run for hours, taking even longer than logic synthesis and implementation. While we recognize that a programmer can manually generate RTL code for each task and glue the parts together at the RTL level, doing so defeats the purpose of using HLS for high productivity. We also recognize that fast RTL code generation in general is an interesting problem, but we focus on the inefficiency exacerbated by task-parallel programs in this paper.
Modularized Approach: Thanks to the hierarchical programming model, TAPA can keep the program hierarchy, recognize different instances of the same task, and compile each task only once. As such, the total amount of time spent on RTL code generation is reduced. Moreover, modularized compilation makes it possible to compile tasks in parallel, further reducing RTL code generation time on multi-core machines. TAPA implements this by invoking the vendor tools in parallel for each task. On average, TAPA reduces HLS compilation time by 4.9× (Section V-E).
Fig. 2 shows how RTL code is generated by TAPA, which is composed of four steps. First, TAPA extracts the HLS code for each task and the metadata information of the whole design, including the communication topology among tasks, the token types exchanged between tasks, and the capacity of each
channel. Source-to-source transformation is applied in this step to insert HLS pragmas where necessary (e.g., to generate proper RTL interfaces). Then, the vendor HLS tool is used to generate RTL code and an HLS report for each task. While TAPA uses libraries to implement kernel APIs extensively, e.g., for read, write, and the end-of-transaction bit, not all APIs, e.g., peeking, can be implemented as libraries, due to the lack of support from the HLS scheduler. To support peeking, TAPA adds a scalar argument to each istream, and connects this port to the output of the first-word-fall-through FIFO when the RTL code is assembled in the next step.
Using the metadata extracted in the first step, TAPA assembles the per-task RTL code to create the complete kernel. In this step, for each parent task, TAPA instantiates the children tasks and channels, and generates a small state machine that controls the start of the children tasks and the termination of the parent task. Finally, TAPA packages the assembled RTL code into a format that the vendor implementation tool can recognize (an xo file for Vitis).

[Figure: all steps handled automatically by TAPA — TAPA C++ code → source-to-source transformation (per task) → HLS compiler (per task) → RTL code and HLS report; extracted metadata (task info, channel info) → instantiation of tasks, channels, and their control logic → complete kernel RTL code and C++ host code with OpenCL function calls.]
Fig. 2: TAPA code generation. The host-kernel interface code is generated together with the kernel RTL code using metadata of the top-level task.
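A minimal sketch of the modularized idea: deduplicate task definitions, run one HLS job per unique task in parallel, and reuse the result for every instance. The run_hls.sh command and the task descriptors are hypothetical placeholders, not TAPA's actual driver.

#include <cstdlib>
#include <future>
#include <set>
#include <string>
#include <vector>

// Hypothetical task descriptor extracted from the program metadata.
struct TaskInst {
  std::string task_name;
  std::string instance_name;
};

int main() {
  std::vector<TaskInst> instances = {
      {"RingNode", "node0"}, {"RingNode", "node1"},
      {"RingNode", "node2"}, {"PE", "pe0"}, {"PE", "pe1"}};

  // Deduplicate: each unique task is synthesized once, not once per instance.
  std::set<std::string> unique_tasks;
  for (const auto& inst : instances) unique_tasks.insert(inst.task_name);

  // Launch one vendor-HLS job per unique task, in parallel.
  // "./run_hls.sh <task>" stands in for the real tool invocation.
  std::vector<std::future<int>> jobs;
  for (const auto& task : unique_tasks) {
    jobs.emplace_back(std::async(std::launch::async, [task] {
      return std::system(("./run_hls.sh " + task).c_str());
    }));
  }
  int failures = 0;
  for (auto& job : jobs) failures += (job.get() != 0);
  // The generated per-task RTL would then be instantiated once per instance
  // and stitched together with the channel and control logic.
  return failures == 0 ? 0 : 1;
}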
V. EVALUATION
We prototype TAPA on Xilinx devices using Vivado HLS as the backend; support for Intel devices will be added later. We compare the productivity of TAPA with two vendor tools that provide an end-to-end high-level programming experience (including host-kernel communication): the Xilinx Vitis 2019.2 suite and the Intel FPGA SDK for OpenCL Pro Edition 19.4. The experimental results are obtained on an Ubuntu 18.04 server with 2 Xeon Gold 6244 processors.

A. Benchmarks
Table II summarizes the benchmarks used in this paper. All implementations (Vivado HLS, Intel OpenCL, and TAPA) of each benchmark are written in such a way that tasks in each implementation have a one-to-one correspondence, corresponding loops are scheduled with the same initiation interval (II), and each task performs the same computation. This not only guarantees that the source codes for all tools are functionally equivalent, but also makes all tools generate consistent quality of results (QoR), which enables a fair comparison of tool run time. Note that we aim to compare the productivity of the HLS tools, not QoR (although we want to make sure there is no QoR degradation). In particular, we were unable to guarantee that the generated RTL codes have exactly the same cycle-accurate behavior without having access to the HLS compiler's scheduling algorithm. For example, the bucket sort network implemented in TAPA has a total latency of 3 cycles while the Vivado HLS implementation has a total latency of 6. This is inevitable because, using Vivado HLS, the manually maintained buffer forces an additional latency of 1 cycle at each network stage. The shallower pipeline makes TAPA use 40% fewer LUTs and 39% fewer FFs for network. For the other benchmarks, TAPA uses 0.4% fewer LUTs and 1% fewer FFs on average. This shows that the additional APIs provided by TAPA do not add resource overhead.

TABLE II: Benchmarks used in this paper. Each task may be instantiated multiple times, so the task instance count (#Inst.) and channel count (#Chan.) are greater than the task count (#Task).
Benchmark | Application                        | #Task | #Inst. | #Chan.
cannon    | Cannon's algorithm [25]            | 5     | 91     | 344
cnn       | VGG [68] convolutional network [3] | 14    | 209    | 366
gaussian  | Gaussian stencil filter [9]        | 15    | 564    | 1602
gcn       | Graph convolutional network [52]   | 5     | 12     | 25
gemm      | General matrix multiplication [3]  | 14    | 207    | 364
network   | Bucket sort w/ Omega network [69]  | 3     | 14     | 32
page_rank | PageRank citation ranking [54]     | 4     | 18     | 89

B. Lines of Kernel Code
TAPA simplifies the kernel code in two aspects. First, the TAPA communication interfaces simplify the code with built-in support for peeking and transactions. This not only simplifies the body of each task definition, but also removes the necessity for many struct definitions. Second, the TAPA instantiation interfaces simplify the code by allowing tasks to be launched concisely. Fig. 3 shows the lines-of-kernel-code comparison for each benchmark. On average, TAPA reduces the lines of kernel code by 22%. Note that only synthesizable kernel code is counted; code added for multi-thread software simulation is not counted for Vivado HLS.

[Figure] Fig. 3: LoC comparison for kernel code (Vivado HLS, Intel OpenCL, TAPA; lines of code normalized, per benchmark). Lower is better.

C. Lines of Host Code
The host code used in the benchmarks contains a minimal test bench to verify the correctness of the kernel code. The TAPA system-integration API automatically interfaces with the OpenCL host APIs and relieves the programmer from writing repetitive code just to connect the kernel to a host program. Fig. 4 shows the lines-of-host-code comparison. On average, the length of the host code is reduced by 51%.
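To illustrate the host-side savings, the sketch below folds the usual OpenCL bookkeeping into a single call whose argument list mirrors the kernel signature; the invoke wrapper shown here is an illustrative stand-in that simply runs the kernel in software, not TAPA's exact API.

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// A hand-written OpenCL host program typically performs these steps
// explicitly: discover the platform and device, create a Context and
// CommandQueue, load the bitstream and build a Program, create each Kernel,
// create Buffers and copy inputs, set kernel arguments (per kernel for
// task-parallel designs), enqueue, synchronize, copy outputs back, and
// release every object.
//
// A unified host interface folds those steps into one call. `invoke` here is
// an illustrative stand-in that runs the kernel as a plain function; a real
// implementation would dispatch to the FPGA runtime when given a hardware
// binary.
template <typename Kernel, typename... Args>
void invoke(Kernel&& kernel, const std::string& binary, Args&&... args) {
  std::printf("running with binary: %s\n", binary.c_str());
  std::forward<Kernel>(kernel)(std::forward<Args>(args)...);
}

// Kernel top function; in HLS this would also be the synthesis top.
void VecAdd(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c) {
  for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i] + b[i];
}

int main() {
  std::vector<float> a{1, 2, 3}, b{4, 5, 6}, c(3);
  // Software simulation, hardware simulation, and on-board execution would
  // differ only in which binary is passed here.
  invoke(VecAdd, "vecadd.sw_emu.xclbin", a, b, c);
  std::printf("c[0]=%g\n", c[0]);
}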
[Figure] Fig. 4: LoC comparison for host code (Vivado HLS, Intel OpenCL, TAPA; lines of code normalized, per benchmark). Lower is better.

D. Software Simulation Time
Fig. 5 shows four simulators, that is, the sequential Vivado HLS simulator, the multi-thread Vivado HLS simulator, the multi-thread Intel OpenCL simulator, and the coroutine-based TAPA simulator. Among them, the sequential simulator fails to correctly simulate benchmarks that require feedback data paths (cannon and page_rank). Due to the larger memory footprint required for storing the tokens transmitted between tasks and the lack of parallelism, the sequential simulator is outperformed by the coroutine-based simulator in all but one of the benchmarks (network). The two multi-thread simulators correctly simulate all benchmarks, except that Intel OpenCL cannot handle gaussian because its large number of task instances (564) exceeds the maximum allowed by the simulator (256). However, the multi-thread simulators perform poorly on benchmarks that are communication-intensive (e.g., network) or have more tasks than the number of available threads (e.g., gaussian). Although the coroutine-based TAPA simulator is not always the fastest simulator for all benchmarks, its worst-case slowdown is only 6%, which is not significant in comparison with the multi-thread simulators, which can be 11× slower. On average, TAPA is 3.2× faster than the other simulators.

[Figure] Fig. 5: Simulation time in log scale (Vivado HLS multi-thread vs. TAPA coroutine, per benchmark). Lower is better. The sequential simulator fails to simulate cannon and page_rank correctly. The Intel OpenCL multi-thread simulator cannot simulate gaussian due to its large number of task instances.

E. RTL Code Generation Time
Fig. 6 shows the RTL code generation time comparison. Thanks to the hierarchical programming model and modularized code generator, TAPA shortens the HLS compilation time by 6.8× on average. This is because ① TAPA runs HLS for each task only once even if it is instantiated many times, while Vivado HLS and Intel OpenCL run HLS for each task instance, and ② TAPA runs HLS in parallel on multi-core machines.

[Figure] Fig. 6: RTL code generation time in log scale (Vivado HLS, Intel OpenCL, TAPA; per benchmark). Lower is better.

VI. RELATED WORK
Two domain-specific streaming frameworks are discussed in Section VI-B. SystemC and pthread are two well-known alternative API paradigms that support task-parallel programs. We will discuss and compare them with TAPA in Section VI-C.

A. HLS Support for Task-Parallel Programs
Intel HLS supports two different inter-task communication interfaces: pipe and stream. pipe implements a simple FIFO interface with data, valid, and ready signals, while stream implements an Avalon-ST interface that supports transactions. Tasks are instantiated using launch and collect.
Intel FPGA OpenCL supports the simple FIFO interface via two sets of APIs, i.e., the standard OpenCL pipe and the Intel-specific channel. Tasks are instantiated by defining OpenCL __kernels, which forces instances of the same task to be synthesized separately as different OpenCL kernels.
Vivado (Vitis) HLS provides two different streaming interfaces: ap_fifo and axis. ap_fifo generates the simple FIFO interface. Tasks are instantiated by invoking the corresponding functions in a dataflow region (Listing 2). axis generates an AXI-Stream interface with transaction support. It requires the programmers to instantiate channels and tasks in a separate configuration file when running logic synthesis and implementation. This allows different instances of the same task to be synthesized only once, but takes longer to learn and implement compared with ap_fifo.
Xilinx OpenCL supports the standard OpenCL pipe, which generates AXI-Stream interfaces similar to Vivado HLS axis, but pipe does not provide APIs to support transactions.
LegUp supports the simple FIFO interface via FIFO. Tasks are instantiated using the pthread API (Section VI-C).
Merlin [40] allows programmers to call the FPGA kernel as a C/C++ function and provides OpenMP-like simple pragmas with automated design space exploration based on machine learning. To support task-parallel programs, Merlin leverages the programming interfaces of its backend vendor HLS tools.
Their limitations are summarized in Table I. Note that a common limitation of HLS tools (including TAPA) is that they cannot guarantee that the software description produces deterministic output sequences for task-parallel programs. For instance, the emptiness test on an input channel is prone to breaking determinism, yet it is available in all HLS tools for performance and expressiveness reasons: merging two input channels using non-blocking reads would produce
an output sequence determined by the relative arrival order of the input tokens. An implication of non-determinism is that we cannot assert that a program is deadlock-free just because its simulation succeeds. This is different from deterministic programs, e.g., Kahn process networks [47], whose successful simulation generally implies deadlock-free on-board execution. For applications that can be efficiently written without breaking determinism, e.g., streaming applications, there are dedicated frameworks developed specifically for them, which are discussed in the next section.
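As a concrete illustration of the non-determinism discussed above, the sketch below merges two input channels with non-blocking reads; the merged order depends on which tokens happen to be available when each channel is polled, so different arrival timings yield different output sequences. The queue-based channel model is illustrative only.

#include <cstdio>
#include <optional>
#include <queue>

// Non-blocking channel read: returns a token only if one is available.
template <typename T>
std::optional<T> TryRead(std::queue<T>& ch) {
  if (ch.empty()) return std::nullopt;
  T t = ch.front();
  ch.pop();
  return t;
}

// Merge with non-blocking reads: whichever input currently holds a token
// wins, so the merged order depends on relative arrival times.
void Merge(std::queue<int>& in0, std::queue<int>& in1, std::queue<int>& out,
           int expected_tokens) {
  int received = 0;
  bool turn = false;  // alternate which input is polled first
  while (received < expected_tokens) {
    auto& first = turn ? in1 : in0;
    auto& second = turn ? in0 : in1;
    if (auto t = TryRead(first)) { out.push(*t); ++received; }
    else if (auto t = TryRead(second)) { out.push(*t); ++received; }
    turn = !turn;
  }
}

int main() {
  std::queue<int> a, b, merged;
  // In hardware, tokens arrive over time; the pre-filled queues here stand in
  // for one particular arrival order among many possible ones.
  for (int i : {1, 3, 5}) a.push(i);
  for (int i : {2, 4, 6}) b.push(i);
  Merge(a, b, merged, 6);
  while (!merged.empty()) { std::printf("%d ", merged.front()); merged.pop(); }
  std::printf("\n");
}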
B. Streaming Framework
ST-Accel [36] is a high-level programming platform that features a highly efficient host-kernel communication interface exposed as a virtual file system (VFS). It uses Vivado HLS as its backend for hardware generation.
Fleet [37] is a massively parallel streaming framework for FPGAs that features highly efficient memory interfaces for massive instances of parallel processing elements. Programmers write Fleet programs in a domain-specific RTL language based on Chisel [70].
TAPA aims to support more general task-parallel applications beyond streaming.

C. Alternative APIs
SystemC is a set of C++ classes and macros that provide detailed hardware modeling and event-driven simulation. It supports both cycle-accurate and untimed simulation, and many simulator implementations are available [71, 72]. The official open-source SystemC simulator implementation uses coroutines without thread pooling. Some HLS tools support a subset of untimed SystemC as the input [23]. SystemC supports task-parallel programs natively via the SC_MODULE constructs and tlm_fifo interfaces, which support peeking. While SystemC supports peeking FIFOs and coroutine-based simulation for task-parallel programs, it is limited by its special and verbose coding style. Listing 5 shows the example discussed in Section II-B written in SystemC. Compared with other C-like HLS languages, SystemC is more verbose and less productive due to its special language constructs: for the TAPA code snippets shown in Listing 3 and Listing 4, the equivalent SystemC kernel code would be 86% longer. On the host side, SystemC generates the main function in sc_main by itself for simulation, and programmers need to spend time incorporating the SystemC test bench with other parts of their program. This is not a problem if the whole system is defined by the kernel in SystemC, e.g., as in embedded systems, but in data center applications where the FPGA accelerator is only part of the system, this introduces a non-trivial complication.

SC_MODULE(RingNode) {
  sc_port<tlm_fifo_get_if<Pkt>> node_in;
  sc_port<tlm_fifo_get_if<PktEoT>> pe_in;
  sc_port<tlm_fifo_put_if<Pkt>> node_out, pe_out;
  SC_CTOR(RingNode) { SC_THREAD(thread); }
  void thread() { while (...) {...} }
};
SC_MODULE(Kernel) {
  tlm_fifo<Pkt> node_0_1{/*depth=*/2}, node_1_2{2}, ...
  // Other channels...
  RingNode node1, node2, ...
  // Other tasks...
  SC_CTOR(Kernel) {
    node1.node_in(node_0_1);
    node1.node_out(node_1_2);
    // Other argument bindings...
  }
};
Listing 5: SystemC TLM API example.
Pthread API is a set of widely used standard APIs that can be used to implement task-parallel programs using threads. Pthread requires programmers to explicitly create and join threads, and each argument needs to be manually packed and passed. Listing 6 shows an example using the accelerator discussed in Section II-B. Compared with the invoke API used by TAPA, the pthread APIs require more effort to program: for the TAPA code snippets shown in Listing 3 and Listing 4, the equivalent pthread-based code would be 2.4× as long.

struct RingNode_Arg {
  FIFO<Pkt> *node_in, *node_out, *pe_out;
  FIFO<PktEoT>* pe_in;
};
void* RingNode(void* arg) {
  FIFO<Pkt>* node_in = ((RingNode_Arg*)arg)->node_in;
  // Unpack other arguments...
  while (...) {...}
  pthread_exit(NULL);
}
void Kernel(...) {
  FIFO<Pkt> node_0_1, node_1_2, ...
  // Instantiate other channels...
  RingNode_Arg node1_arg, node2_arg, ...
  node1_arg.node_in = &node_0_1;
  // Pack other arguments...
  pthread_t node1_pid, node2_pid, ...;
  pthread_create(&node1_pid, NULL, RingNode, &node1_arg);
  // Create other threads...
  pthread_join(node1_pid, NULL);
  // Join other threads...
}
Listing 6: Pthread API example.

In summary, while the API alternatives do exist in their own domains, they are more verbose and thus less productive compared with TAPA for task-parallel FPGA acceleration.

VII. CONCLUSION AND FUTURE WORK
In this paper, we present TAPA as an HLS C++ language extension to enhance the programming productivity of task-parallel programs on FPGAs. TAPA has multiple advantages over state-of-the-art HLS tools: on average, ① its enhanced programming interface helps to reduce the lines of kernel code by 22%, ② its unified system integration interface reduces the lines of host code by 51%, ③ its coroutine-based software simulator shortens the correctness verification development cycle by 3.2×, and ④ its modularized code generation approach shortens the QoR tuning development cycle by 6.8×. As a fully automated and open-source framework, TAPA aims to provide a highly productive development experience for task-parallel programs using HLS. For future work, we plan to extend our work to support dynamic tasks on FPGAs.
ACKNOWLEDGMENT [24] Intel, “Intel FPGA SDK for OpenCL Pro Edition: Programming Guide,”
2020.
The authors would like to thank the anonymous reviewers [25] H.-J. Lee, J. P. Robertson, and J. A. Fortes, “Generalized Cannon’s
and our labmate, Linghao Song, for their valuable comments Algorithm for Parallel Matrix Multiplication,” in ICS, 1997.
and helpful suggestions. This work is partially supported by [26] J. Cong, P. Wei, C. H. Yu, and P. Zhou, “Latte: Locality Aware
Transformation for High-Level Synthesis,” in FCCM, 2018.
a Google Faculty Award, the NSF RTML program (CCF- [27] T. Young-Schultz, L. Lilge, S. Brown, and V. Betz, “Using OpenCL
1937599), NIH Brain Initiative (U01MH117079), the Xilinx to Enable Software-like Development of an FPGA-Accelerated Biopho-
Adaptive Compute Clusters (XACC) program, and CRISP, one tonic Cancer Treatment Simulator,” in FPGA, 2020.
[28] V. Rybalkin and N. Wehn, “When Massive GPU Parallelism Ain’t
of six JUMP centers. Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network,”
in FPGA, 2020.
R EFERENCES [29] A. Sohrabizadeh, J. Wang, and J. Cong, “End-to-End Optimization of
[1] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, Deep Learning Applications,” in FPGA, 2020.
“High-Level Synthesis for FPGAs: From Prototyping to Deployment,” [30] J. De Fine Licht, G. Kwasniewski, and T. Hoefler, “Flexible Com-
TCAD, 2011. munication Avoiding Matrix Multiplication on FPGA with High-Level
[2] X. Wei, Y. Liang, and J. Cong, “Overcoming Data Transfer Bottlenecks Synthesis,” in FPGA, 2020.
in FPGA-based DNN Accelerators via Layer Conscious Memory Man- [31] J. Jiang, Z. Wang, X. Liu, J. Gómez-Luna, N. Guan, Q. Deng, W. Zhang,
agement,” in DAC, 2019. and O. Mutlu, “Boyi: A Systematic Framework for Automatically De-
[3] J. Cong and J. Wang, “PolySA: Polyhedral-Based Systolic Array Auto- ciding the Right Execution Model of OpenCL Applications on FPGAs,”
Compilation,” in ICCAD, 2018. in FPGA, 2020.
[4] Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, and [32] H. Zeng and V. Prasanna, “GraphACT: Accelerating GCN training on
Z. Zhang, “HeteroCL: A Multi-Paradigm Programming Infrastructure CPU-FPGA heterogeneous platforms,” in FPGA, 2020.
for Software-Defined Reconfigurable Computing,” in FPGA, 2019. [33] P. Papaphilippou, J. Meng, and W. Luk, “High-Performance FPGA
[5] H. R. Zohouri, A. Podobas, and S. Matsuoka, “Combined Spatial Network Switch Architecture,” in FPGA, 2020.
and Temporal Blocking for High-Performance Stencil Computation on [34] H. Chen, S. Madaminov, M. Ferdman, and P. Milder, “FPGA-
FPGAs Using OpenCL,” in FPGA, 2018. Accelerated Samplesort for Large Data Sets,” in FPGA, 2020.
[6] M. Koraei, O. Fatemi, and M. Jahre, “DCMI: A Scalable Strategy for [35] S. Margerm, A. Sharifian, A. Guha, A. Shriraman, and G. Pokam,
Accelerating Iterative Stencil Loops on FPGAs,” TACO, vol. 16, no. 4, “TAPAS: Generating Parallel Accelerators from Parallel Programs,” in
2019. MICRO, 2018.
[7] Y. Chi and J. Cong, “Exploiting Computation Reuse for Stencil Accel- [36] Z. Ruan, T. He, B. Li, P. Zhou, and J. Cong, “ST-Accel: A High-
erators,” in DAC, 2020. Level Programming Platform for Streaming Applications on FPGA,”
[8] J. de Fine Licht, A. Kuster, T. De Matteis, T. Ben-Nun, D. Hofer, and in FCCM, 2018.
T. Hoefler, “StencilFlow: Mapping Large Stencil Programs to Distributed
[37] J. Thomas, P. Hanrahan, and M. Zaharia, “Fleet: A Framework for
Spatial Computing Systems,” in CGO, 2021.
Massively Parallel Streaming on FPGAs,” in ASPLOS, 2020.
[9] Y. Chi, J. Cong, P. Wei, and P. Zhou, “SODA : Stencil with Optimized
Dataflow Architecture,” in ICCAD, 2018. [38] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson,
[10] J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, and S. Brown, and T. Czajkowski, “LegUp: High-Level Synthesis for FPGA-
M. Horowitz, “Programming Heterogeneous Systems from an Image Based Processor/Accelerator Systems,” in FPGA, 2011.
Processing DSL,” TACO, vol. 14, no. 3, 2017. [39] J. Choi, S. D. Brown, and J. H. Anderson, “From Pthreads to Multicore
[11] J. Li, Y. Chi, and J. Cong, “HeteroHalide: From Image Processing DSL Hardware Systems in LegUp High-Level Synthesis for FPGAs,” TVLSI,
to Efficient FPGA Acceleration,” in FPGA, 2020. vol. 25, no. 10, 2017.
[12] UCLA-VAST, “TAPA Sample Applications.” [Online]. Available: [40] J. Cong, M. Huang, P. Pan, D. Wu, and P. Zhang, “Software Infras-
https://github.com/UCLA-VAST/tapa/tree/master/apps tructure for Enabling FPGA-Based Accelerations in Data Centers,” in
[13] J. Cong and Z. Zhang, “An Efficient and Versatile Scheduling Algorithm ISLPED, 2016.
Based On SDC Formulation,” in DAC, 2006. [41] L. Dagum and R. Menon, “OpenMP: An Industry Standard API for
[14] J. Cheng, S. T. Fleming, Y. T. Chen, J. H. Anderson, and G. A. Shared-Memory Programming,” IEEE Computational Science and En-
Constantinides, “EASY: Efficient Arbiter SYnthesis from Multi-threaded gineering, vol. 5, no. 1, 1998.
Code,” in FPGA, 2019. [42] G. Dai, Y. Chi, Y. Wang, and H. Yang, “FPGP: Graph Processing
[15] J. Cheng, L. Josipović, G. A. Constantinides, P. Ienne, and J. Wickerson, Framework on FPGA A Case Study of Breadth-First Search,” in FPGA,
“Combining Dynamic & Static Scheduling in High-level Synthesis,” in 2016.
FPGA, 2020. [43] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang, “ForeGraph:
[16] H. Hsiao and J. Anderson, “Thread Weaving: Static Resource Scheduling Exploring Large-scale Graph Processing on Multi-FPGA Architecture,”
for Multithreaded High-Level Synthesis,” in DAC, 2019. in FPGA, 2017.
[17] A. Haj-Ali, Q. Huang, W. Moses, J. Xiang, K. Asanovic, J. Wawrzynek, [44] S. Zhou, R. Kannan, V. K. Prasanna, G. Seetharaman, and Q. Wu,
and I. Stoica, “AutoPhase: Juggling HLS Phase Orderings in Random “HitGraph: High-throughput Graph Processing Framework on FPGA,”
Forests with Deep Reinforcement Learning,” in MLSys, 2020. TPDS, 2019.
[18] Y. T. Chen, J. H. Kim, K. Li, G. Hoyes, and J. H. Anderson, “High- [45] Y. Wang, J. C. Hoe, and E. Nurvitadhi, “Processor Assisted Worklist
Level Synthesis Techniques to Generate Deeply Pipelined Circuits for Scheduling for FPGA Accelerated Graph Processing on a Shared-
FPGAs with Registered Routing,” in FPT, 2019. Memory Platform,” in FCCM, 2019.
[19] L. Guo, J. Lau, Y. Chi, J. Wang, C. H. Yu, Z. Chen, Z. Zhang, and [46] C. A. R. Hoare, “Communicating Sequential Processes,” Communica-
J. Cong, “Analysis and Optimization of the Implicit Broadcasts in FPGA tions of the ACM, vol. 21, no. 8, 1978.
HLS to Improve Maximum Frequency,” in DAC, 2020.
[47] G. Kahn, “The Semantics of a Simple Language for Parallel Program-
[20] L. Josipović, S. Sheikhha, A. Guerrieri, P. Ienne, and J. Cortadella,
ming,” in IFIP, 1974.
“Buffer Placement and Sizing for High-Performance Dataflow Circuits,”
in FPGA, 2020. [48] E. A. Lee and D. G. Messerschmitt, “Synchronous Data Flow,” IEEE,
[21] L. Guo, Y. Chi, J. Wang, J. Lau, W. Qiao, E. Ustun, Z. Zhang, vol. 75, no. 9, 1987.
and J. Cong, “AutoBridge: Coupling Coarse-Grained Floorplanning and [49] J. T. Buck, “Scheduling Dynamic Dataflow Graphs with Bounded
Pipelining for High-Frequency HLS Design on Multi-Die FPGAs,” in Memory Using the Token Flow Model,” Ph.D. dissertation, 1993.
FPGA, 2021. [50] J. L. Peterson, “Petri Nets,” ACM Computing Surveys, vol. 9, no. 3,
[22] J. Cong, P. Wei, C. H. Yu, and P. Zhang, “Automated Accelerator 1977.
Generation and Optimization with Composable, Parallel and Pipeline [51] M. Abeydeera and D. Sanchez, “Chronos: Efficient Speculative Paral-
Architecture,” in DAC, 2018. lelism for Accelerators,” in ASPLOS, 2020.
[23] Xilinx, “Vivado Design Suite User Guide: High-Level Synthesis [52] T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph
(UG902),” 2020. Convolutional Networks,” in ICLR, 2017.
[53] C. Deng, Z. Zhao, Y. Wang, Z. Zhang, and Z. Feng, “GraphZoom:
A Multi-level Spectral Approach for Accurate and Scalable Graph
Embedding,” in ICLR, 2020.
[54] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation
Ranking: Bringing Order to the Web,” Tech. Rep., 1998.
[55] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community
Structure in Large Networks: Natural Cluster Sizes and the Absence of
Large Well-Defined Clusters,” Internet Mathematics, vol. 6, no. 1, 2009.
[56] J. Mcauley, “Learning to Discover Social Circles in Ego Networks,” in
NIPS, 2012.
[57] Y. Chi, G. Dai, Y. Wang, G. Sun, G. Li, and H. Yang, “NXgraph:
An Efficient Graph Processing System on a Single Machine,” in ICDE,
2016.
[58] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and
H. Yang, “GraphH: A Processing-in-Memory Architecture for Large-
scale Graph Processing,” TCAD, 2018.
[59] Xilinx, “Vitis Accel Hello World Example.” [Online]. Available:
https://github.com/Xilinx/Vitis_Accel_Examples/blob/21bb0cf788ace59
3c6075accff7f7783588ae8b4/hello_world/src/host.cpp#L58-L115
[60] Y. Chi, Y.-k. Choi, J. Cong, and J. Wang, “Rapid Cycle-Accurate
Simulator for High-Level Synthesis,” in FPGA, 2019.
[61] Y.-k. Choi, Y. Chi, J. Wang, and J. Cong, “FLASH: Fast, ParalleL, and
Accurate Simulator for HLS,” TCAD, 2020.
[62] A. L. de Moura and R. Ierusalimschy, “Revisiting Coroutines,” TOPLAS,
vol. 31, no. 2, 2009.
[63] O. Kowalke, “Boost Library Documentation, Coroutine2,” 2014.
[Online]. Available: https://boost.org/doc/libs/1_65_0/libs/coroutine2/d
oc/html/coroutine2/intro.html
[64] A. S. Jamal, E. Cahill, J. Goeders, and S. J. E. Wilton, “Fast Turnaround
HLS Debugging using Dependency Analysis and Debug Overlays,”
TRETS, vol. 13, no. 1, 2020.
[65] D. E. Knuth, Fundamental Algorithms. The Art of Computer Program-
ming 1, 3rd ed., 1997.
[66] M. E. Conway, “Design of a Separable Transition-Diagram Compiler,”
Communications of the ACM, vol. 6, no. 7, 1963.
[67] E. Bendersky, “Measuring context switching and memory overheads
for Linux threads,” 2018. [Online]. Available: https://eli.thegreenplace.
net/2018/measuring-context-switching-and-memory-overheads-for-linu
x-threads/
[68] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” in ICLR, 2015.
[69] D. H. Lawrie, “Access and Alignment of Data in an Array Processor,”
ToC, vol. C-24, no. 12, 1975.
[70] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis,
J. Wawrzynek, and K. Asanović, “Chisel: Constructing Hardware in a
Scala Embedded Language,” in DAC, 2012.
[71] T. Schmidt, G. Liu, and R. Dömer, “Exploiting Thread and Data Level
Parallelism for Ultimate Parallel SystemC Simulation,” in DAC, 2017.
[72] M. K. Chung, J. K. Kim, and S. Ryu, “SimParallel: A High Performance
Parallel SystemC Simulator Using Hierarchical Multi-threading,” in
ISCAS, 2014.
From Software to Accelerators with LegUp High-Level Synthesis
Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort,
Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, Jason Anderson
ECE Department, University of Toronto, Toronto, ON, Canada
legup@eecg.toronto.edu
[Figure 1: LegUp design flow — program code is compiled and profiled on the µP; profiling data (execution cycles, power, cache misses) identify suggested program segments to target to hardware; high-level synthesis produces hardened program segments on the FPGA fabric, together with an altered SW binary that calls the HW accelerators.]
2.1 LegUp System Architecture
LegUp can target two Altera FPGAs: the Cyclone II on the Altera DE2 board [5], and the Stratix IV on the Altera DE4 board [6]. The target system architecture is shown in Fig. 2. The system comprises the MIPS soft processor, hardware accelerators, on-chip cache, as well as off-chip memory (8MB SDRAM on the DE2 board or 2GB DDR2-SDRAM on the DE4 board). An accelerator may have local memories for storing data that is not shared with the processor or other accelerators. These local memories are implemented in on-chip block RAMs, instantiated within a hardware accelerator. Data shared between the processor and hardware accelerators is stored in off-chip memory, which can be accessed using the on-chip cache. The components of the system communicate via the Avalon Interconnect, Altera's on-chip interface, which is generated automatically by Altera's SOPC Builder tool [7]. Avalon is a point-to-point network, which allows multiple independent transfers to occur simultaneously via memory-mapped addresses. When multiple components are connected to a single component, such as the on-chip data cache, a round-robin arbiter is generated to arbitrate among simultaneous accesses.

2.2 Multi-ported caches
When many accelerators are operating in parallel, memory bandwidth can easily become a performance bottleneck. The on-chip RAMs on current commercial FPGAs have two ports, meaning that for a given memory block, there can only be up to two memory accesses at a time. However, for systems with many accelerators that need to access memory concurrently, two ports may not be adequate and cache accesses may limit performance. A typical way to increase memory bandwidth is to use multiple coherent memory blocks, with extra circuitry to manage memory coherency between the memory blocks. However, by implementing memory coherency we add area and latency overhead. Thus, we take an alternate approach, where we implement multi-ported memories (that have more than 2 ports) using existing dual-ported memory blocks. We can then use these multi-ported memories to implement multi-ported caches suitable for many-accelerator systems.
We have investigated two types of multi-ported caches, called the LVT cache and the MP cache [10], both of which allow multiple concurrent accesses to all regions of the cache in every clock cycle. The LVT cache is based on memory replication, whereas the MP cache uses memory multi-pumping (operating the memory at a higher clock rate than the surrounding system). The main advantage of both cache architectures is that they offer higher on-chip memory bandwidth than what is typically available on the FPGA fabric, while providing a shared memory space which acts as a single piece of memory. These caches also require no cache coherency scheme, avoiding the area and latency costs for synchronization.

3. Hardware/Software Partitioning
With the LegUp design methodology, the program is partitioned into both a hardware portion and a software portion. The chosen partitioning depends on the designer's objective, which is often to reduce overall execution time. Towards this goal, the MIPS soft processor contains a hardware profiler to determine which sections of the original program are taking the most execution time. LegUp can also estimate the speedup associated with migrating a particular program segment into hardware versus leaving it in software.
3.1 Hardware Profiling
The hardware profiler in the MIPS soft processor is called LEAP, which stands for Low-overhead and Extensible Architecture for Profiling [2]. For each function in a program, the profiler can be used to quickly and accurately obtain the exact number of clock cycles spent executing the function. In software-based profiling, the program being profiled must be modified with instrumentation to gather profiling data during its execution. In contrast, our hardware-based approach allows the program to execute in its original unmodified form at full speed on the processor. The MIPS processor is augmented with additional circuitry that automatically gathers profiling data as the program executes. Such hardware profiling is superior in speed and accuracy when compared to software profiling.

[Figure: flow chart — monitor the instruction and PC; on a PC change, check "Is call?" / "Is return?"; on a call, hash the function number, store/update the Data Counter, push the function number onto a stack, and reset the Data Counter; on a return, store/update the Data Counter, pop the function number off the stack, and reset the Data Counter; otherwise increment the Data Counter.]
Figure 3. High-level flow chart for instruction-count profiling.

The high-level operation of LEAP is shown in Fig. 3. LEAP profiles the execution of the program by monitoring the processor's program counter and instruction bus. During execution, LEAP maintains a counter, called a Data Counter, that tracks the number of times an event has occurred. Two modes are available: the profiler can count dynamic instructions, or clock cycles.
LEAP organizes the collected data on a per-function basis by allocating a storage counter for each software function. LEAP identifies function boundaries by decoding (in hardware) the executing instruction to determine if it is a function call or return. If a call is detected, the Data Counter is added to any previously stored values associated with the function containing the call instruction (from previous invocations of the function). The Data Counter is then reset to 0 to begin counting the events in the called function. If a function return is detected, the Data Counter value is added to the counter associated with the current function, and once again the Data Counter is reset.
In order to determine the counter associated with a particular function, other hardware profilers, such as SnoopP [30] (a hardware profiler for FPGA-based processors), use a large number of comparators to associate program counter address ranges with individual counters. A novel aspect of LEAP is the use of perfect hashing hardware to associate function addresses with counters. A set of hashing parameters is generated during the software compilation stage (step ➀ in Fig. 1) and used to configure the profiler on the FPGA. No modifications of the hardware profiler circuit (e.g., resynthesis or reprogramming) are needed to profile a new program. The use of hashing leads to significantly less hardware overhead when compared to other hardware profilers. Specifically, relative to SnoopP, our design requires up to 18× less area [2].
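A small software model of the per-function bookkeeping described above; it uses an ordinary hash map in place of LEAP's compile-time perfect hash and counts abstract events, so it only illustrates the call/return counter updates.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Software model of LEAP-style per-function event counting.
// On a call: charge the accumulated count to the caller, push, reset.
// On a return: charge the count to the returning function, pop, reset.
struct Profiler {
  std::unordered_map<uint32_t, uint64_t> per_function;  // addr -> events
  std::vector<uint32_t> call_stack;
  uint64_t data_counter = 0;

  void OnEvent() { ++data_counter; }  // e.g., one clock cycle or instruction

  void OnCall(uint32_t callee_addr) {
    if (!call_stack.empty()) per_function[call_stack.back()] += data_counter;
    call_stack.push_back(callee_addr);
    data_counter = 0;
  }
  void OnReturn() {
    per_function[call_stack.back()] += data_counter;
    call_stack.pop_back();
    data_counter = 0;
  }
};

int main() {
  Profiler p;
  p.OnCall(0x100);                        // enter main
  for (int i = 0; i < 3; ++i) p.OnEvent();
  p.OnCall(0x200);                        // enter helper
  for (int i = 0; i < 5; ++i) p.OnEvent();
  p.OnReturn();                           // back to main
  p.OnEvent();
  p.OnReturn();
  for (auto& [addr, events] : p.per_function)
    std::printf("func 0x%x: %llu events\n", addr, (unsigned long long)events);
}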
3.2 Accelerator Speedup Prediction
By using the LEAP profiler, the user can identify time-consuming program segments. However, these compute-intensive functions may not be suitable for hardware acceleration, perhaps because they contain a sequential algorithm with minimal instruction-level parallelism or because they are too memory intensive. Ideally, we would know exactly how much execution time would be saved by synthesizing a segment of software code into a hardware accelerator.
One way we could gauge the speedup achieved by hardware acceleration is to actually convert the segments into hardware circuits and then run the program on the board to measure the results. But this approach is too time consuming if there are many alternatives to investigate, as it requires running FPGA synthesis and place-and-route tools for each alternative. Alternately, one could run an RTL simulation to measure the execution time, in cycles, of the final hybrid system. However, this method is also time-consuming and becomes infeasible in real applications.
To aid in the task of software/hardware partitioning, LegUp provides an estimate of the total number of clock cycles consumed by a function if it is accelerated in hardware, which can then be compared to the LEAP profiling results described above. This approach uses profiling (in software) to estimate the execution flow of the processor/accelerator hybrid system and then uses early high-level synthesis scheduling information to predict the number of cycles required by portions of the program after being synthesized to hardware.
When synthesizing a software program into a hybrid system, LegUp replaces the functions being accelerated with wrapper functions to enable communication between the processor and accelerators. The call to a wrapper function starts the accelerator's execution, and the return from the wrapper function indicates that the accelerator has finished its execution. Our estimation approach considers the hardware cycles spent on three operations: 1) the execution of the hardware accelerator, 2) the accelerator's initialization performed by the software wrapper function, and 3) reads and writes to the shared memory space.
We estimate the cycles taken during the accelerator's execution in two steps. First, we perform HLS scheduling for the accelerated function to determine the number of clock cycles required for each basic block in the function. A basic block is a contiguous set of instructions with a single entry point (at its beginning) and exit point (at its end). Next, we execute the program in software using representative inputs to estimate the number of times each basic block is executed. Finally, we estimate the total cycle count of the accelerated function by multiplying the estimated number of times each basic block is executed by the number of clock cycles required by the corresponding basic block in its schedule.
To estimate the time taken by the software wrapper function running on the processor, we count the number of instructions required by the wrapper. The instruction count is sufficient for wrapper function estimation, as the wrapper function is small and only contains simple operations to communicate with hardware accelerators, and we have found empirically that the instructions-per-cycle of the MIPS processor is close to one.
We estimate the cycles spent accessing shared memory in three steps. First, we run the program with a representative set of inputs using a MIPS emulator to determine the address sequence accessed by the software program (without hardware accelerators). Next, we predict the address sequence accessed by the hybrid processor/accelerator system by eliminating any addresses that are stored in local memory of the hardware accelerators. Then, we use a cache simulator to determine the number of cache hits and misses. Finally, we use the estimated cost of a cache hit or miss (in cycles) to predict the total cycles spent on shared memory accesses.
Experimental results show that our approach has an average error rate of about 7% compared to the results obtained from RTL simulation, but with 184× less run-time on average.
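The cycle-estimation arithmetic of Section 3.2 can be summarized as: estimated accelerator cycles = sum over basic blocks of (profiled execution count × scheduled cycles per execution), plus the wrapper and shared-memory estimates. The sketch below uses made-up numbers purely for illustration.

#include <cstdint>
#include <cstdio>
#include <vector>

struct BasicBlock {
  uint64_t sched_cycles;  // cycles per execution, from early HLS scheduling
  uint64_t exec_count;    // executions observed with representative inputs
};

// Estimated accelerator cycles = sum_b exec_count(b) * sched_cycles(b).
uint64_t EstimateAcceleratorCycles(const std::vector<BasicBlock>& blocks) {
  uint64_t total = 0;
  for (const auto& b : blocks) total += b.exec_count * b.sched_cycles;
  return total;
}

int main() {
  // Hypothetical function with three basic blocks.
  std::vector<BasicBlock> f = {{4, 1}, {6, 100}, {2, 100}};
  uint64_t accel = EstimateAcceleratorCycles(f);  // 4 + 600 + 200 = 804
  uint64_t wrapper = 25;      // ~instruction count of the software wrapper
  uint64_t shared_mem = 300;  // from cache-simulated hits and misses
  std::printf("estimated cycles: %llu\n",
              (unsigned long long)(accel + wrapper + shared_mem));
}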
3.3 Partitioning Example
An example of hardware/software partitioning is provided in Table 1 for four functions of the jpeg benchmark in the CHStone […]
4.5 Multi-Pumping
For applications that involve many multiplication operations, LegUp uses a new approach to resource sharing that allows multiple operations to be performed by a single multiply functional unit in one clock cycle [9]. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2×, allowing multiple computations to complete in a single system cycle. This method is particularly effective for the DSP blocks on modern FPGAs. The hardened DSP blocks in modern FPGAs can operate at speeds exceeding 500 MHz, whereas typical system speeds are less than 300 MHz. We have found that multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact on circuit performance. For a given constraint on the number of DSPs, multi-pumping can deliver considerably higher performance than resource sharing. Empirical results over digital signal processing benchmarks show that multi-pumping achieves the same DSP reduction as resource sharing, but with a lower impact on circuit performance: decreasing circuit speed by only 5% instead of 80%.
use software techniques to specify parallelism to the LegUp HLS
4.6 Bitwidth Minimization tool, with the tool then implementing the specified parallelism in a
Software programs today use standard datatypes that are 8, 16, 32, hardware circuit.
or 64-bits in length. As such, programs are over engineered in the LegUp provides support for two standard parallel programming
sense that variables are frequently represented using more bits than methodologies which software engineers are likely familiar with –
are actually required, e.g. a 32-bit int datatype may be used for a Pthreads and OpenMP. Parallelism described in the software code
loop index that is known to have a range from 0 to 100. Because is automatically synthesized into parallel hardware accelerators that
processor datapaths are of fixed widths, there is little to be gained in perform the corresponding computations concurrently. Parallel pro-
term’s a software program’s performance by optimizing bitwidths. gramming in software often requires the use of synchronization
However, in HLS, hardware quality (area, speed and power) is constructs that, for example, manage which threads may execute
impacted considerably by the bit-level representation of program a given code segment at any given moment. Recognizing this, we
variables. also provide HLS support for two key thread synchronization con-
LegUp uses two strategies to statically (i.e. at compile time) structs in the Pthreads/OpenMP library: mutexes and barriers. The
or dynamically (i.e. using run-time profiling) determine minimized approach we take is to automatically instantiate parallel hardware
representations of variables: 1) range analysis and 2) bitmask anal- for parallel threads. That is, each software thread is mapped auto-
ysis. Range analysis seeks to determine the maximum and mini- matically into a hardware accelerator. The remaining (sequential)
mum values that variables take on in a program’s execution and in portions of the program are executed in software on the MIPS soft
so doing, bound the number of bits required to represent the vari- processor.
able. Variable ranges can be deduced from constants in the source Table 4 shows a list of Pthreads and OpenMP library func-
code, and then propagated through a program’s control-dataflow tions which are currently supported by LegUp. In addition to those
graph to infer ranges for other variables. Bitmask analysis, on the listed in the table, OpenMP clauses to set the number of threads
other hand, seeks to characterize the individual bits in a variable. (num threads), the scopes of variables (e.g. public, private)
For example, assume that A and B are unknown 16-bit values and and the division of work among threads (static scheduling of any
consider the C-language statement: Z = A & (B << 2). In this chunk size) are also supported. Note that all of the OpenMP/Pthreads
case, the two right-most bits of Z are guaranteed to be logic-0 and functions in Table 4 are automatically compiled in our framework,
this property can be applied to minimize the size of hardware that requiring no manual code changes by the user. Meaning that, the
uses Z as an operand (e.g. if Z feeds into a multiplier, the two right- input C program with calls to the Pthreads/OpenMP API can be
most bits of the product are guaranteed to be logic-0). Note that compiled to a hybrid processor/accelerator system as is. The com-
while bitmask analysis guarantees that Z’s two LSBs are 0, range plete system, including the MIPS processor, on-chip cache, off-chip
analysis can infer nothing regarding Z’s min and max values. The memory controller, as well as parallel accelerators, can be created
two forms of analysis thus offer complementary information. with a single make target.
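The following short C++ example illustrates the style of coarse-grained parallel code that a Pthreads-based HLS flow such as the one described above accepts; each thread would be mapped to its own accelerator. The kernel, array sizes, and thread count are invented purely for illustration and are not taken from the LegUp distribution.

#include <pthread.h>
#include <stdint.h>

// Hypothetical example of coarse-grained parallelism expressed with Pthreads:
// each thread sums one slice of an array and could become a hardware accelerator,
// while the sequential portion runs on the soft processor.
#define N_THREADS 4
#define N 1024

static int32_t data[N];
static int64_t partial[N_THREADS];

static void *sum_slice(void *arg) {
    long id = (long)arg;                                  // thread index selects a data slice
    int64_t acc = 0;
    for (long i = id * (N / N_THREADS); i < (id + 1) * (N / N_THREADS); i++)
        acc += data[i];
    partial[id] = acc;
    pthread_exit(NULL);
}

int64_t parallel_sum(void) {
    pthread_t threads[N_THREADS];
    for (long t = 0; t < N_THREADS; t++)
        pthread_create(&threads[t], NULL, sum_slice, (void *)t);  // one worker per thread
    int64_t total = 0;
    for (long t = 0; t < N_THREADS; t++) {
        pthread_join(threads[t], NULL);                   // wait for each worker to finish
        total += partial[t];
    }
    return total;
}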
Table 3. Bitwidth minimization Cyclone II implementation results.

Benchmark | LUTs (Baseline / Bitmask+Range / Dynamic+Bitmask) | Registers (Baseline / Bitmask+Range / Dynamic+Bitmask) | FMax in MHz (Baseline / Bitmask+Range / Dynamic+Bitmask)
dhrystone | 5244 / 4120 / 3738 | 3575 / 3131 / 2438 | 117.94 / 114.09 / 115.96
fft | 2046 / 2043 / 1880 | 1048 / 1028 / 746 | 92.89 / 91.3 / 91.3
adpcm | 21695 / 18631 / 7036 | 11039 / 10020 / 4291 | 55.46 / 56.04 / 56.16
aes | 19784 / 15792 / 8871 | 11470 / 9162 / 4066 | 49.38 / 49.82 / 46.47
blowfish | 10621 / 10590 / 10296 | 7412 / 7353 / 7040 | 75.41 / 73.61 / 71.62
gsm | 9787 / 9645 / 7807 | 6612 / 6487 / 5029 | 33.2 / 32.39 / 32.98
jpeg | 33618 / 31083 / 22057 | 20688 / 19388 / 11885 | 18.02 / 17.53 / 19.15
mips | 3384 / 3358 / 2116 | 1620 / 1590 / 999 | 98.8 / 95.56 / 110.22
motion | 4054 / 4020 / 2946 | 2526 / 2526 / 1656 | 112.18 / 111.83 / 125.85
sha | 10686 / 8243 / 7612 | 7779 / 5838 / 5371 | 99.42 / 106.68 / 109.42
Geomean | 8655 / 7838 / 5711 | 5230 / 4794 / 3217 | 65.7 / 65.2 / 67.3
Ratio | 1.00 / 0.91 / 0.66 | 1.00 / 0.92 / 0.62 | 1.00 / 0.99 / 1.02
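To make the bitmask analysis of Section 4.6 (whose effect is measured in Table 3) concrete, the short sketch below propagates known-zero bits through the example statement Z = A & (B << 2). This is an illustrative stand-alone program, not LegUp's implementation; a mask bit of 1 means "this bit is guaranteed to be logic-0".

#include <cstdint>
#include <cstdio>

// Known-zero-bit propagation for Z = A & (B << 2), with A and B unknown 16-bit values.
int main() {
    uint16_t zeroA = 0x0000;                                   // A: nothing known
    uint16_t zeroB = 0x0000;                                   // B: nothing known
    uint16_t zeroShift = (uint16_t)((zeroB << 2) | 0x0003);    // B << 2: the two LSBs become 0
    uint16_t zeroZ = zeroA | zeroShift;                        // AND: a bit is 0 if it is 0 in either operand
    printf("known-zero mask of Z = 0x%04X\n", zeroZ);          // prints 0x0003: Z's two LSBs are 0
    return 0;
}

As noted in the text, range analysis would learn nothing here, since A and B are unbounded; the two analyses are complementary.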
Figure 6. (a) Schedule Gantt chart, (b) control flow graph, (c) loop pipeline schedule.

Table 4. Supported Pthreads functions/OpenMP pragmas.

Pthreads Functions | Description
pthread_create(..) | Invoke thread
pthread_join(..) | Wait for thread to finish
pthread_exit(..) | Exit from thread, can be used to return data
pthread_mutex_lock(..) | Lock mutex
pthread_mutex_unlock(..) | Unlock mutex
pthread_barrier_init(..) | Initialize barrier
pthread_barrier_wait(..) | Synchronize on barrier object

OpenMP Pragmas | Description
omp parallel | Parallel section
omp parallel for | Parallel for loop
omp master | Parallel section executed by master thread only
omp critical | Critical section
omp atomic | Atomic section
reduction(operation: var) | Reduce a var with operation

OpenMP Functions | Description
omp_get_num_threads() | Get number of threads
omp_get_thread_num() | Get thread ID

6. Visualization and Debugging
LegUp provides visualization tools for analyzing the internal HLS algorithms. For instance, we have a graphical viewer for the scheduling report file produced by LegUp that shows a Gantt chart of the scheduled instructions for the program and also can visualize loop pipeline scheduling. Fig. 6 shows three screenshots of the LegUp visualization tool for a matrix multiply kernel. Fig. 6a shows a Gantt chart for LegUp's high-level synthesis schedule. On the left side, the "Explorer" panel lists each basic block for each function, in this case the user has selected the basic block labeled "BB 1". In the "Schedule Chart" window pane the schedule viewer gives a list of all LLVM instructions inside the selected basic block. Each LLVM instruction corresponds to a hardware operation in the synthesized circuit. The user can highlight any instruction to display the data dependencies between all predecessor and successor instructions. Fig. 6b shows the control flow graph for the kernel, where each node in the graph is a basic block. Fig. 6c shows the loop pipeline schedule after the basic block has been pipelined. The pipeline initiation interval is two, which means a new loop iteration begins every two clock cycles. The area highlighted in black is the steady-state operation of the pipeline; observe that three iterations of the loop are executing in parallel.
In addition to visualization, we have been focusing recently on adding debugging capabilities to LegUp. Debugging tools are ubiquitous in the software development community because they raise productivity by providing insight into the execution state as a program executes. In contrast, most hardware designers are accustomed to using simulation waveforms to debug their digital circuits. With LegUp, we want to bridge this gap by offering users a software-like debugging platform for the hybrid hardware/software coprocessor system. LegUp's debugging platform will help developers gain insight into problems with their applications at a higher level of abstraction than traditional RTL simulation and waveform analysis.
To implement the debugger, LegUp leverages the LLVM compiler debugging meta-data, which maps each C statement to a set of one or more simple instructions in LLVM's intermediate representation (IR). Fig. 7 depicts this mapping. Next, we map the IR instructions to LegUp-synthesized hardware elements. Each LLVM
IR instruction is scheduled to run in one or more states of the finite state machine. Also, each IR instruction can be synthesized into several hardware units and signals. Some hardware signals, such as the memory controller signals, can be shared between multiple instructions, depending on the state.

Figure 7. Mapping from C statements to LLVM intermediate representation instructions.

Our goal is to have an integrated debugging system that is capable of capturing, and displaying to the user, hardware signals while the hybrid processor/accelerator system runs on the board. Fig. 8 shows a screenshot of the LegUp debugging platform, which is "work-in-progress". Currently, the debugging platform is for simulation only; that is, we communicate with the simulation tool, ModelSim, to inspect signal values and control the simulation cycle by cycle. By examining the state of the finite state machine, we can detect the current state being executed and highlight the active C statements associated with the current state. There may be more than one active C statement per state, due to the instruction-level parallelism in hardware (see Fig. 8). By clicking on a C statement, the corresponding synthesized Verilog code is highlighted. Single-stepping is supported, which runs the circuit simulation until the next C statement is reached. Note that C statements may take more than one clock cycle to complete. Developers can also step over a C statement to reach the next executing statement, or can step into a C statement to see IR-level and hardware-level details related to that statement on a cycle-by-cycle basis. Hardware signal names and current values are displayed based on the circuit's current state so that developers can track signal value changes (right panel in the figure).
The LegUp debugging platform is still under development. Supporting break-points, enabling the debugging of hybrid processor/accelerator applications and on-chip hardware debugging are all future work.

Figure 8. Screenshot of debugging platform.

7. Conclusion
LegUp is a high-level synthesis (HLS) framework that allows software methodologies to be used for the synthesis of a hybrid system comprising an embedded processor and one or more FPGA-based accelerators. Since the original LegUp release in March 2011, it has been downloaded over 600 times by researchers around the world (at the time of writing). As described in this paper, the current LegUp 3.0 release includes functionality to assist with hardware/software partitioning, multi-ported caches to ease memory bottlenecks, support for Pthreads and OpenMP, and improvements to the core HLS algorithms, including loop pipelining, multi-pumping, bitwidth optimization, and tools to select profitable compiler optimization passes to improve hardware quality. One of the few open-source frameworks of its kind, we hope the tool will be useful to the embedded systems research community as a platform to explore new design methodologies and synthesis strategies. The LegUp project website, http://legup.eecg.toronto.edu, includes documentation, tutorials on how to use and modify the tool, related publications, as well as links to download the source code.

8. Acknowledgements
The financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and Altera Corporation is gratefully acknowledged.

References
[1] The OpenCL specification, version 1.0, document revision 48, 2009.
[2] M. Aldham, J. Anderson, S. Brown, and A. Canis. Low-cost hardware profiling of run-time and energy in FPGA embedded processors. In IEEE ASAP, pages 61–68, 2011.
[3] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. Finding effective compilation sequences. In ACM LCTES, pages 231–239, 2004.
[4] Cyclone-II Data Sheet. Altera Corp., San Jose, CA, 2004.
[5] DE2 Development and Education Board. Altera Corp., San Jose, CA, 2010.
[6] DE4 Development Board. Altera Corp., San Jose, CA, 2010.
[7] SOPC Builder User Guide. Altera Corp., San Jose, CA, 2010.
[8] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski. LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In ACM/SIGDA FPGA, pages 33–36, 2011.
[9] A. Canis, J. H. Anderson, and S. D. Brown. Multi-pumping for resource reduction in FPGA high-level synthesis. In IEEE DATE, pages 194–197, 2013.
[10] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. Impact of cache architecture and interface on performance and area of FPGA-based processor/parallel-accelerator systems. In IEEE FCCM, pages 17–24, 2012.
[11] J. Cong and Z. Zhang. An efficient and versatile scheduling algorithm based on SDC formulation. In ACM DAC, volume 43, pages 433–438, 2006.
[12] J. Cong and Y. Zou. FPGA-based hardware acceleration of lithographic aerial image simulation. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2(3):1–29, 2009.
[13] P. Coussy, D. Gajski, M. Meredith, and A. Takach. An introduction to high-level synthesis. IEEE Design & Test of Computers, 26(4):8–17, July 2009.
[14] M. Gort and J. H. Anderson. Range and bitmask analysis for hardware optimization in high-level synthesis. In ASP-DAC, pages 773–779, 2013.
[15] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis. Journal of Information Processing, 17:242–254, 2009.
[16] Calypto Catapult. http://calypto.com/en/products/catapult/overview, 2013.
[17] OpenCL for Altera FPGAs. http://www.altera.com/products/software/opencl/opencl-index.html, 2013.
[18] C-to-Verilog. http://www.c-to-verilog.com, 2013.
[19] Forte Design Systems – The high level design company. http://www.forteds.com/products/cynthesizer.asp, 2013.
[20] LLVM Compiler Infrastructure Project. http://www.llvm.org, 2010.
[21] Xilinx: Vivado Design Suite. http://www.xilinx.com/products/design-tools/vivado/vivado-webpack.htm, 2013.
[22] C. Huang, Y. Che, Y. Lin, and Y. Hsu. Data path allocation based on bipartite weighted matching. In ACM/IEEE DAC, pages 499–504, 1990.
[23] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson. The effect of compiler optimizations on high-level synthesis for FPGAs. In IEEE FCCM, pages 89–96, 2013.
[24] H. Kuhn. The Hungarian method for the assignment problem. In 50 Years of Integer Programming 1958–2008, pages 29–47. Springer, 2010.
High Level Synthesis Based Hardware Accelerator Design for Processing SQL
Queries
2. ACCELERATOR IMPLEMENTATION
In order to support full and complex database analytics in hardware, we have focused on accelerating data filtering, arithmetic, logic, sorting, aggregation and equi-join operations. All the units are designed to work at 200 MHz on a Virtex-7 xc7vx690tffg1761-2 FPGA.
2.1 Data Filtering, Arithmetic and Logic Operations
Database filtering operations are relational operations that test numerical or logical relations between columns, numerical and/or boolean values. For this purpose, we designed a pipelined, parametrizable-width, n-way compute engine that takes rows as inputs, applies a filtering operation on the desired columns and produces an output bitmap. This bitmap determines the selected rows for further processing after the filtering operation. The main importance of filtering operations in an SQL query is to filter out unwanted data from further processing, thus reducing the size of the input set. The most important design choice for filtering operations is selecting the correct parallelism for the maximum utilization of memory bandwidth. Similarly, we have designed pipelined, parametrizable, n-way arithmetic and logical compute engines. The arithmetic compute engine supports integer ADD, SUB, MULT and DIV operations, whereas the logical compute engine supports the logical AND, OR and NAND operations.
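The following simplified C++ sketch is written in the spirit of the n-way filtering engine described above: it compares one column of N rows against a constant and emits a selection bitmap. The row layout, operator set, and parallelism factor are illustrative assumptions, not the authors' implementation; in an HLS flow the loop would be unrolled and pipelined.

#include <cstdint>
#include <array>

// Simplified sketch of an n-way row-filtering engine producing a selection bitmap.
constexpr int N_WAY  = 8;   // rows processed per call (parallelism factor, illustrative)
constexpr int N_COLS = 4;   // columns per row (illustrative)

using Row = std::array<int32_t, N_COLS>;

enum class CmpOp { LT, LE, EQ, GE, GT };

// Returns an N_WAY-bit bitmap: bit i set means row i satisfies the predicate.
uint32_t filter_rows(const std::array<Row, N_WAY>& rows,
                     int col, CmpOp op, int32_t constant) {
    uint32_t bitmap = 0;
    for (int i = 0; i < N_WAY; ++i) {        // unrolled/pipelined by the HLS tool
        int32_t v = rows[i][col];
        bool hit = false;
        switch (op) {
            case CmpOp::LT: hit = v <  constant; break;
            case CmpOp::LE: hit = v <= constant; break;
            case CmpOp::EQ: hit = v == constant; break;
            case CmpOp::GE: hit = v >= constant; break;
            case CmpOp::GT: hit = v >  constant; break;
        }
        bitmap |= static_cast<uint32_t>(hit) << i;
    }
    return bitmap;
}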
Figure 5: Runtimes for 3 queries (in ms, log scale)

tial and temporal planning which enables/disables compute units. The biggest difference of ASIC design compared to HLS is the flexibility. HLS can adjust to different sizes of blocks such as database columns by extending or shrinking data sizes in high-level source code. Hence, it is a more flexible solution for hardware design.
The authors in [12] discuss efficient methodologies for decoupling accelerators from their host. In contrast, our work presents in-memory database acceleration where the memory is controlled by a host system. Thus, in this manner, our accelerator can be classified as a tightly coupled accelerator.
Complementary to our work, the authors in [4] discuss implementation challenges of merge sort and join operations. They have designed and customized a merge-sort join implementation for a specific platform and have studied scaling and parallelization capabilities. Our work focuses on using HLS in the database analytics domain without any custom enhancements based on the underlying architecture.
While HLS produces hardware from high-level software, Glacier compiles VHDL code from algebraic expressions [11]. This adds additional steps in the system design because algebraic expressions must be created from SQL expressions or they are taken directly from a query planner. Our approach binds accelerators to SQL operators semantically. Then, the query plan is made accordingly. In this work, the query plans are generated manually. The query plan creation is out of the scope of this work. We plan to automate our resource allocation in the near future.
Runtime query processing has been made possible by the runtime reconfiguration capabilities of FPGAs. The authors of [7] have built a database operations library which at runtime forms the data path based on the given SQL query. They have focused on data filtering operations. The main advantage of runtime reconfiguration is to eliminate the synthesis of queries, if the available runtime operator library can execute the given query. The flexibility that HLS provides is at compile time rather than runtime. Hence, there is a possibility to combine runtime reconfiguration with HLS technology. Although our design could be enhanced by the use of dynamic reconfiguration, it does not necessarily require it, since the static design is able to work with different parameters and arguments, and it can fit in our FPGA.

5. CONCLUSIONS
As our results demonstrate while simulating an in-memory database accelerator, HLS tools can present high performance gains running complete database queries and are a promising way to address the big data explosion. Although designing most of the database operations is straightforward in HLS, certain issues such as the memory–accelerator communication synthesis might require handwritten

6. ACKNOWLEDGEMENT
Funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement No 318633, UPC project TIN2012-34557, the Turkish Ministry of Development under the TAM Project, number 2007K120610, as well as the Severo Ochoa Mobility grant program support was received.

7. REFERENCES
[1] K. E. Batcher. Sorting networks and their applications. In Proc. Spring Joint Computer Conference, pages 307–314. ACM, 1968.
[2] A. Becher, F. Bauer, D. Ziener, and J. Teich. Energy-aware SQL query acceleration through FPGA-based dynamic partial reconfiguration. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pages 1–8. IEEE, 2014.
[3] Canis et al. LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.
[4] J. Casper and K. Olukotun. Hardware acceleration of database operations. In FPGA '14, pages 151–160.
[5] Chung et al. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 97–106. ACM, 2011.
[6] T. P. P. Council. TPC-H benchmark specification. Published at http://www.tpc.org/tpch/spec/tpch2.6.0.pdf, 2008.
[7] C. Dennl, D. Ziener, and J. Teich. On-the-fly composition of FPGA-based SQL query accelerators using a partially reconfigurable module library. In Proc. FCCM, pages 45–52. IEEE, 2012.
[8] Halstead et al. Accelerating join operation for relational databases with FPGAs. In Proc. FCCM, pages 17–20, 2013.
[9] István et al. A flexible hash table design for 10 Gbps key-value stores on FPGAs. In FPL, pages 1–8, 2013.
[10] D. Koch and J. Torresen. FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In FPGA '11, pages 45–54.
[11] R. Mueller, J. Teubner, and G. Alonso. Glacier: A query-to-hardware compiler. In SIGMOD '10, pages 1159–1162.
[12] A. Parashar et al. Triggered instructions: A control paradigm for spatially-programmed architectures. SIGARCH Comput. Archit. News, pages 142–153.
[13] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross. Q100: The architecture and design of a database processing unit. In ASPLOS '14, pages 255–268.
ABSTRACT Hardware accelerators based on field programmable gate array (FPGA) and system on chip
(SoC) devices have gained attention in recent years. One of the main reasons is that these devices contain
reconfigurable logic, which makes them feasible for boosting the performance of applications. High-level
synthesis (HLS) tools facilitate the creation of FPGA code from a high level of abstraction using different
directives to obtain an optimized hardware design based on performance metrics. However, the complexity
of the design space depends on different factors such as the number of directives used in the source code,
the available resources in the device, and the clock frequency. Design space exploration (DSE) techniques
comprise the evaluation of multiple implementations with different combinations of directives to obtain
a design with a good compromise between different metrics. This paper presents a survey of models,
methodologies, and frameworks proposed for metric estimation, FPGA-based DSE, and power consumption
estimation on FPGA/SoC. The main features, limitations, and trade-offs of these approaches are described.
We also present the integration of existing models and frameworks in diverse research areas and identify the
different challenges to be addressed.
INDEX TERMS Computing models, design space exploration, field programmable gate array (FPGA),
system on chip (SoC), power consumption.
R. S. Molina et al.: High-Level Synthesis Hardware Design for FPGA-Based Accelerators: Models, Methodologies, and Frameworks
HLS tools support C/C++, SystemC, and OpenCL [12] codes to generate the final RTL code. These tools provide the designer with a detailed report for each algorithmic solution, including information about the estimation of latency, resource utilization (also known as area occupied), and throughput. The use of directives allows code optimization through parallel techniques, such as loop pipelining, loop unrolling, array partitioning, and array reshaping. For each solution, the designer can specify different combinations of directives; comparing the reports provided by these tools, the best option can be determined according to different performance metrics.
Furthermore, these tools allow a design space exploration (DSE), which involves the evaluation of multiple implementations with different combinations of user design constraints, FPGA features, and directives (also known as knobs or optimizations). Setting these optimizations to obtain a hardware design with the desired characteristics is a problem that increases exponentially as the designer applies more directives, and the program has more complex code structures. The generated hardware is directly associated with the applied directives, but sometimes applying and tuning directives requires a considerable endeavour to obtain a proper hardware implementation. An optimal DSE process grants a hardware design with a good compromise between metrics such as latency, area, throughput, and power consumption.
Over the years, parallel computing models have proven their benefits across different architectures, such as clusters of distributed processors with single cores and multicores, GPU, and cloud. These models act as a bridge between the architecture and the software developer. The current trend in parallel computer architectures demonstrates progress toward hybrid architectures combining many cores, superscalars, single instruction/multiple data (SIMD), hardware accelerators, and on-chip communication systems, among others, which require handling computations and data locality at several levels to achieve suitable performance [13].
Using computing models, and also methodologies and frameworks, to predict the performance of FPGA/SoC architectures may reduce design times and improve productivity, which are critical issues when choosing these architectures. In this survey, a model is an abstraction that represents a simplified system. A methodology describes the steps involved in the process for systematically solving a problem. A framework provides the structure needed in the form of a template or conceptual scheme to simplify the elaboration of a task.

A. CONTRIBUTION
In this paper, we present a thorough analysis of the computing models, methodologies, and frameworks proposed for reconfigurable hardware accelerators based on FPGA. We compare their main features, including the inputs, outputs, and techniques employed for their development. Then, we show how these approaches for FPGA/SoC can be applied in different research fields, exposing their benefits in improving the design process and productivity. Consequently, the reader will become more confident about the fundamental and technical aspects of the computing models, methodologies, and frameworks designed for FPGA/SoC, acquiring a clear idea of the main parameters required by each one. We highlight the importance of having simple approaches with few parameters, such as those proposed for other parallel architectures, so that they have a greater scope and can be widely used. Based on this literature review, the FPGA developer can select the approach that best suits the application, hardware architecture, and programming skills.
Some survey articles are available in the literature for FPGA-based reconfigurable hardware. Schafer and Wang [14] divide HLS DSE techniques into two main groups: synthesis-based and model-based. In addition to this classification, a third group appears, including DSE synthesis-based and supervised learning. According to [15], HLS DSE can be developed using model-based and model-free techniques. Model-based techniques are composed of tools and methodologies that use analytical models, whereas model-free techniques include approaches where the HLS tool is treated as a black box. A survey of automatic high-level code deployment for HLS tools and toolchains is presented in [16]. The authors analyze commercial HLS tools, academic HLS tools, HLS code generation tools, domain-specific language tools for HLS, dataflow HLS tools, and automatic code deployment tools (including automated DSE). Yehya et al. [17] focus on power consumption. They classify different estimation techniques as analytical, table-based, polynomial-based, and neural networks. The work in [18] analyzes different performance and power estimation models for CPU, GPU, and FPGA. Moreover, reconfigurable architectures can be categorized as coarse-grained and fine-grained according to [19], [20]. In this work, we focus on FPGA and FPGA/SoC architectures included in the last category.
To the best of our knowledge, there is no previous work that jointly:
• describes the models, methodologies, and frameworks developed for the estimation of metrics, FPGA-based DSE, and power consumption estimation on FPGA/SoC,
• shows their application in different research areas,
• analyzes the challenges to be addressed to widely use them for FPGA/SoC,
• compares them with the commonly used parallel computing models for CPU, GPU, and multicore processors.

B. METHODOLOGY
This survey is conducted by collecting the latest contributions, focusing on the models, methodologies, and frameworks for FPGA-based devices. The paper collection process has been performed mainly using models, methodologies, FPGA/SoC, parallel computing models, DSE, and Pareto-optimal design keywords in well-known scientific databases such as IEEE Xplore, Scopus, Web of Science, ScienceDirect, arXiv, and the Directory of Open Access Journals (DOAJ). The collected contributions are from the last six
C. OUTLINE
The remainder of this paper is organized as follows. Section II briefly presents the most widely used parallel computing models for CPU, GPU, and multicore processors. Section III introduces the FPGA-based reconfigurable hardware accelerator architectures, hardware/software co-design, DSE and metrics, and the techniques to improve latency, area, and power for this technology. In Section IV, we describe previous works on models, methodologies, and frameworks proposed for FPGA/SoC according to their main features: metrics estimation (IV-A), FPGA-based DSE (IV-B), and power consumption estimation (IV-C); and in Section IV-D, we present a summary and discussion. The integration of models and frameworks for FPGA-based reconfigurable hardware accelerators in different research fields is exposed in Section V. Challenges are analyzed in Section VI. Finally, conclusions are presented in Section VII.

II. PARALLEL COMPUTING MODELS FOR PERFORMANCE ESTIMATION
Computing models allow us to easily analyze algorithms by simplifying the computational world to a reduced set of parameters that define the cost of arithmetic and memory access operations and communication. These models contribute to the search for efficient algorithms for a given architecture, improving the productivity of designers, programmers, and engineers. A small amount of communication, a small number of operations, and a high degree of parallelism are key points that directly contribute to the efficiency of a parallel algorithm.
This section summarizes the characteristics of the most widely used parallel computing models for performance estimation. It is not aimed at providing a comprehensive presentation or a thorough classification of parallel models, languages, and architectures. In addition, we present some examples of their application in different architectures.

A. RANDOM ACCESS MACHINE AND PARALLEL RANDOM ACCESS MACHINE
The random access machine (RAM) model is proposed in [21] for sequential algorithms. It is composed of a memory, control unit, processor, and program. In 1978, Fortune and Wyllie proposed the parallel random access machine (PRAM) model [22] based on the RAM model. The main idea behind PRAM is that there is a shared memory m connected to several processing units with a global clock, as shown in Fig. 1. In this scenario, one processor P can execute one operation (arithmetic, memory access, or logic) within one single clock cycle. However, this model does not consider the communication or synchronization overheads.
PRAM sub-models like the exclusive read exclusive write (EREW), exclusive read concurrent write (ERCW), concurrent read exclusive write (CREW), and concurrent read concurrent write (CRCW) are introduced to handle read/write operations in a shared memory model [23].

B. BULK SYNCHRONOUS PARALLEL MODEL
The bulk synchronous parallel (BSP) model [24], proposed for distributed computing, is a bridging model between hardware and algorithms that offers a high degree of abstraction. The BSP program is divided into supersteps separated by a barrier synchronization. Each superstep comprises several blocks of computation and communication. Fig. 2 shows the workflow of the BSP model.
A BSP computer is represented by parameters P, s, L, and G, where:
• P: number of processors of the BSP computer.
• s: processor speed.
• L: cost, in steps, to complete a barrier synchronization.
• G: cost, in words, of delivering a message.
The normalized cost G is defined by Eq. 1:

G = Op_local / W_sec    (1)

where Op_local is the number of local operations executed in a processor and W_sec is the number of words communicated by the network per second. L represents the barrier synchronization cost at the end of each superstep.
The sum of G and L is the superstep cost. The former represents the number of maximum local computations executed on parallel processors. The latter represents a cost composed of the cost of the communications plus the synchronization at the end of the superstep.
The multi-BSP model [25] extends the BSP to multicore architectures by considering the architecture as a tree with d
TABLE 1. Features of the computing models PRAM, BSP, LogP, CCM, multi-BSP, and Roofline.
FIGURE 7. High-level overview of FlexCL, based on [98]. The input is the OpenCL kernel code, which is transformed to
LLVM IR through Clang. Information from the source code is extracted by a kernel analyzer, which is sent to a computation
model, a communication model, and a global memory model. The results of each model are integrated in one model to
estimate the final kernel execution time.
3) FRAMEWORKS
Pyramid, developed by Makrani et al. [101], is a machine
learning based framework to estimate timing and resource
utilization, and to overcome the differences between the
post-implementation results and intellectual property (IP)
cores created with HLS. It is developed by employing ensem-
ble machine learning techniques, such as linear regression,
artificial neural networks, support vector machines, and ran-
dom forests. As part of the framework, Minerva [102],
which is an automated hardware optimization tool based on
a heuristic model, is used to obtain a good throughput and
throughput-to-area ratio for the RTL code generated by HLS.
Wang et al. [103] present a framework based on a
performance analysis model combined with code tuning tech-
niques for OpenCL applications only on FPGAs, assuming
that an incremental development model is adopted by designers [104]. The model includes four FPGA-centric metrics to detect possible bottlenecks related to memory, parallelism, and computation.

4) SUMMARY
For metric estimation, a few contributions have considered the use of the traditional parallel computing models such as BSP and PRAM [94], [96] on FPGA. Nevertheless, the Roofline model has been widely adopted for estimating performance and bottlenecks on FPGA devices due to its intuitiveness and simplicity [48], [99], [100]. Furthermore, the differences between the metric estimation reported by HLS tools and the post-implementation results are a key point to consider when designing the estimators of performance metrics [101].

B. FPGA-BASED DESIGN SPACE EXPLORATION
Design space explorers aim to minimize HLS tool execution times, which are highly dependent on the size of the space to be analyzed. Different methodologies, models, and frameworks have been proposed based on the analysis of HLS directives, where the exploration of the design space [105], [106] is important because it increases exponentially with the use of directives. The challenge is to find a set of hardware designs, also known as Pareto-optimal designs. Considering that there is a limited number of resources (LUT, BRAM, DSP, and FF) available in the reconfigurable architecture, the hardware design cannot request more resources than those available in the FPGA.
The comparison among diverse design space explorers is useful for observing the strengths and weaknesses of each. This can be achieved using benchmarks, composed of computational kernels suitable for hardware acceleration. Some of these are MachSuite [107], CHStone (C-based) [108], S2CBench (SystemC-based) [109], Rosetta [110], and Spector (OpenCL-based) [111].
Surveys related to this topic are presented in [63] and [14]. In particular, the last one proposes a classification of HLS DSE techniques into two groups, as depicted in Fig. 8: synthesis-based and model-based. In this classification, the third category is composed of a combination of supervised learning and DSE synthesis-based techniques.

FIGURE 8. Classification of HLS DSE techniques, based on [14].

According to Sohrabizadeh et al. [15], HLS DSE can be developed using model-based and model-free techniques. Model-based techniques comprise tools and methodologies that use analytical models. They estimate the resources and performance of each point in the design space. Model-free techniques include approaches in which the HLS tool is treated as a black box, such as Bayesian optimization and reinforcement learning techniques [112], [113], [114], [115].

1) METHODOLOGIES
The Roofline model has been introduced within methodologies to explore the design space, targeting HPC applications based on HLS [116], [117], [118].
Nabi et al. [117] propose the TyTra flow that integrates performance and cost models based on Roofline analysis to obtain an optimized FPGA solution for scientific HPC applications. The methodology adopts the models defined in the OpenCL standard: platform and memory hierarchy, kernel execution, memory execution, and data pattern. The Roofline model is the base for the design space explorer and is used to assist the selection of the best instance to be downloaded into the hardware. Additionally, the authors propose an intermediate representation language (TyTra-IR). For the calculation of resource utilization to obtain scalability of the system, the authors consider a maximum utilization of the FPGA of 80%, as suggested by [119].
Siracusa et al. [118] propose a DSE methodology, presented in Fig. 9. The system input is the C/C++ source code, which is translated to an LLVM IR trace, obtaining the baseline of performance estimation and resource utilization through the synthesis process. From this base implementation, the Roofline model chart (RooflineOrig) determines memory bottlenecks. Afterward, an automated DSE estimates resources and performance, generating the optimal design points. The Roofline for the best feasible design is plotted along with the RooflineOrig chart, to compare the current design's performance and the performance of the solution derived by the DSE. The explorer includes resource sharing and HLS-specific IR optimizations during sample estimations. This work is extended in [116], with the hierarchical version of Roofline, estimating peak performance analytically and integrating a guide to reaching memory-transfer and data-locality optimizations.

FIGURE 9. A DSE methodology presented in [116], [118]. The input source code is translated to an LLVM IR trace, obtaining the baseline for performance estimation and resource utilization. Subsequently, the Roofline model chart estimates memory bottlenecks. An automated DSE phase allows resource and performance estimations, and the best feasible design is plotted along with the original Roofline chart.

Ferretti et al. [120] propose a method for inferring knowledge from past design explorations, as shown in Fig. 10. The authors introduce signature encoding for code and directives, composed of specification encoding (SE), configuration space descriptor (CSD), and the similarity metric longest common subsequence (LCS). The methodology uses signature encoding to create a string with design and configuration spaces (directives and their modes), combining CSD and SE. On the other hand, the LCS metric is used to measure the similarity between the actual and previous DSE stored in a database.

FIGURE 10. A DSE methodology presented in [120] that uses past design explorations to infer knowledge. The signature encoding is used to create a string with the design and configuration spaces. The new signature is compared with the ones obtained from previous DSE (DSE database). After the similarity evaluation, the selected signature is used as input for the inference stage, to finally obtain the optimal configuration.

COSMOS, an automatic and scalable methodology for DSE, is introduced by Piccolboni et al. [121] for complex accelerators. It generates a set of Pareto-optimal designs and reduces the number of HLS invocations. It comprises two main phases: component characterization and DSE (based on two steps: synthesis planning and mapping). The computing model used for DSE is based on timed marked graphs. COSMOS includes memory as part of the DSE process and applies synthesis constraints to reduce the variability of the HLS tools.
The adaptive threshold non-Pareto elimination strategy (ATNE) [122] focuses on inaccuracy estimation, to address the exploration of the design space on FPGA for implementations based on OpenCL. The ATNE algorithm is based on a random forest for regression. The prediction quality is obtained using two metrics: average distance from reference set (ADRS) and hypervolume error (HVE). The results are shown for matrix multiplication, Sobel filter, finite impulse response (FIR) filter, histogram, and discrete cosine transform.
Xu et al. [123] propose a methodology for performing DSE using MPSoC devices. This work presents three methods to automatically carry out the exploration: two based on simulation (cycle-accurate and fast cycle-accurate) and one based on hardware acceleration. For this purpose, the authors consider several IP cores in an FPGA. The proposed methodology is called fast explorer for behavioral systems (FEBS), and it accepts the number N of IP cores and their testbenches as input. The output is a set of dominant systems with an area vs performance trade-off. In this methodology, design space exploration is performed for each IP core. The general overview for this design space explorer is shown in Fig. 11.

FIGURE 11. MPSoC DSE, based on [123]. Different IP cores coexist in the MPSoC: some developed with HLS tools (IP1 and IP2) and others using RTL description. A design space is generated with the HLS tools. The system level exploration receives as input the number of IP cores described in ANSI-C or SystemC and their testbenches. The output is a Pareto design with a throughput-area trade-off. The system level exploration is composed of three methods: two based on simulation and one based on hardware acceleration.

2) MODELS
Lo et al. [113] propose a sequential model-based optimization, using a transfer-learning mechanism, to select directive configurations in HLS, minimizing the number of tool evaluations/executions while obtaining solutions with LUTs-latency optimal trade-offs.
Kwon et al. [124] propose the mixed-sharing multidomain model for reusing the knowledge obtained from previous HLS DSE while exploring a new target design space, showing its effectiveness when approximating quality of results (QoR) without running HLS tools.
Dai et al. [125] present a fast and accurate QoR estimation based on HLS. For this purpose, they use final HLS reports from a set of synthesized applications to identify relevant features and metrics, and construct the dataset to be used for training machine learning models (linear regression, artificial neural networks, and gradient tree boosting). To create the dataset, the authors employ the information obtained from HLS reports for different directives and targeting different FPGA platforms. In addition, a C-to-bitstream flow for different clock periods is performed to obtain features such as post-implementation resources and the worst negative slack. Finally, the authors obtain 234 features, which were reduced to 87 after an elimination process to remove irrelevant features.
Other models focused on the DSE process are presented in [126], [127], [128], and [129].

3) FRAMEWORKS
Mehrabi et al. propose the Prospector framework [114], which uses Bayesian techniques to obtain the best configurations
with fewer resources and reduced latency near Pareto-efficient designs. The HLS tool is considered as a black box (or function), which has to be modelled and optimized. Prospector is shown in Fig. 12, where the inputs are the source code, clock frequency, and directives, and the outputs are the synthesized designs. The Bayesian optimization unit (BOU) is used to explore the design space and control the selection of directives. The HLS tool is used to generate RTL from the high-level source code. At the end of the process, the framework can obtain different designs with a latency-area trade-off, which belong to the Pareto frontier.

FIGURE 12. Prospector framework, based on [114]. The inputs are the source code, clock frequency, and directives; the outputs are the synthesized designs with a trade-off between latency and area. The directives are encoded and sent to the BOU. The source code and clock frequency are the inputs for the HLS tools. Performance and cost values are obtained from the HLS tool and the Place & Route process.

Lin-Analyzer [130] is a tool that allows accurate and fast FPGA performance estimation and DSE, considering fine-grained parallelism. With this framework, runtime scales linearly while increasing the design space complexity; however, only a few optimizations are considered, mainly loop unrolling, loop pipelining, and array partitioning. Regarding resource utilization, the authors assume that DSP and BRAM are the bottlenecks in accelerator designs. The communication cost between the FPGA and global memory is not considered. The framework is divided into three main stages: instrumentation, optimization of dynamic data dependence graph (DDDG) generation, and DDDG scheduling. In the last stage, latency is used as a performance metric under resource constraints. Lina is proposed in [131] as an extension of Lin-Analyzer, and it includes non-perfect loop nests and timing analyses.
MPSeeker is proposed by Zhong et al. [132] to estimate the performance and resource utilization from a given code (C/C++), considering fine- and coarse-grained parallelism, allowing fast DSE. Because MPSeeker contemplates multi-parallelism using the loop tiling technique, a gradient boosted machine is proposed to obtain an accurate resource model for FF and LUT, while Lin-Analyzer is used for BRAM and DSP estimation. The authors also extend the features of Lin-Analyzer by including the data communication cost. The performance cost in MPSeeker is modelled as the sum of the kernel computation and data communication costs.
Choi et al. [78] present a DSE and clock cycle estimator using HLS, including code transformations in the presence of variable loop bounds. They propose a resource prediction method based on HLS reports through shareable and non-shareable operators from a loop. Using linear interpolation, non-shareable resources are obtained, whereas the resources estimated for shareable operators are computed as the maximum of all loops. An analytical model is proposed for clock cycle prediction. In this framework, the design with the best performance is the output.
COMBA [77], [133] is a framework that focuses on selecting the optimal configuration of directives in HLS, taking into account the use and availability of hardware resources, and provides an estimation of performance and resource utilization. The authors propose the metric-guided DSE II (MGDSE-II) algorithm to prune and explore the design space based on three metrics: the number of DSP, BRAM, and LUT. An overview of COMBA, which is composed of a recursive data collector, analytical models (latency and resources), and DSE, is presented in Fig. 13. In COMBA, the input is the C/C++ source code, which is transformed into an LLVM IR trace through Clang. The IR trace is the input for the recursive data collector, which extracts static and dynamic information that will be used for the analytical models. MGDSE-II then evaluates the configuration and establishes the next set of directives to be applied to the input code. This iteration is repeated until a high-performance configuration is obtained.

FIGURE 13. COMBA framework overview, based on Zhao et al. [77]. LLVM IR is extracted from the source code. This trace is the input for the recursive data collector, which will extract the parameters used by the analytical models (latency and resource). MGDSE-II evaluates the configuration and defines the next set of directives to be applied. The output of the complete flow is the high-performance configuration.

Ferretti et al. [134] present a framework for HLS DSE using a cluster-based heuristic integrally developed in MATLAB. The algorithm identifies different clusters in the DSE, reducing the number of regions to be analyzed; intra-clustering is performed, followed by inter-cluster exploration.
A lattice-traversing DSE framework [135] is proposed to explore the design space by transforming it into a lattice representation. The framework includes three stages: lattice creation and initial sampling, selection of lattice Pareto-neighbours, and synthesis and lattice labelling.
IronMan [115] is an end-to-end flexible and automated framework for DSE composed of a performance and resource predictor based on a graph neural network (GPP), a multi-objective DSE engine based on reinforcement learning (RLMD), and a code transformer (CT). One of the main features of this framework is that it retrieves the final code with the discovered optimizations, ready to generate the corresponding RTL through HLS.
Sherlock [136], introduced by Gautier et al., is a DSE framework based on multi-objective optimizations devoted to finding Pareto-optimal solutions (or the Pareto front), handling multiple conflicting optimization objectives. This framework uses active learning to exploit a surrogate design space model to find the Pareto-optimal designs as quickly as possible.
Other frameworks devoted to DSE are introduced in [15], [136], [137], and [138].

4) SUMMARY
A summary of most of the contributions devised for DSE and presented in this section is listed in Table 2, considering the following aspects:
• Reference.
• Pruning of the design space (P-DS).
• Whether it is based on the Roofline model.
• Whether it considers quality of results (QoR) in relation to the place and route estimation.
• Whether it applies transfer learning (TL).
TABLE 2. Summary of most of the contributions devised for DSE and presented in this section. The acronyms used in the table are: P-DS: pruning of the design space, QoR: quality of results in relation to the place and route estimation, TL: transfer learning, N Resource: number of estimated resources or NS (not specified).

by [113], BRAM and LUT are computed by [137]. COMBA [77], [133] estimates DSP, BRAM, and LUT. Lin-Analyzer [130] computes BRAM and DSP, whereas MPSeeker [132] estimates FF and LUT, combining Lin-Analyzer for DSP and BRAM utilization. Nevertheless, overestimating resource utilization can lead to pruning valid design points in the exploration phase. LUT, FF, DSP, and BRAM post-implementation estimation is performed by [125]. A challenge with HLS tools is efficiently predicting resource sharing for unrolling factors and array partitions when using HLS pragmas [78], [118].
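Several of the surveyed model-free DSE approaches treat the HLS tool as a black box that maps a directive configuration to a (latency, area) pair and retain only Pareto-optimal points. The C++ sketch below is a purely schematic illustration of such a loop; evaluate_design() is a hypothetical stand-in for an actual HLS and place-and-route run, and none of the names correspond to a specific surveyed framework.

#include <vector>

// Schematic model-free DSE loop: query a black-box cost function per candidate
// configuration and keep only Pareto-optimal (latency, area) design points.
struct Design { int config_id; double latency; double area; };

// Hypothetical stand-in: a real explorer would run HLS here and parse its reports.
Design evaluate_design(int config_id) {
    return { config_id, /*latency=*/1000.0 - config_id, /*area=*/10.0 * config_id };
}

static bool dominates(const Design& a, const Design& b) {
    return a.latency <= b.latency && a.area <= b.area &&
           (a.latency < b.latency || a.area < b.area);
}

std::vector<Design> explore(const std::vector<int>& candidate_configs) {
    std::vector<Design> pareto;
    for (int cfg : candidate_configs) {
        Design d = evaluate_design(cfg);                 // black-box cost query
        bool dominated = false;
        for (const Design& p : pareto)
            if (dominates(p, d)) { dominated = true; break; }
        if (dominated) continue;
        std::vector<Design> kept;                        // drop points the new design dominates
        for (const Design& p : pareto)
            if (!dominates(d, p)) kept.push_back(p);
        kept.push_back(d);
        pareto = kept;
    }
    return pareto;                                       // current Pareto front
}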
and instrumentation (implying the accumulation of events to monitor the relevant signals). A linear model is used to estimate the power contribution of the overall system by computing the power consumption of each IP core.

In the context of approximate computing, Xu et al. [144] investigate the use of linear regression and multilayer perceptron (MLP) models to generate a new approximated RTL design with a trade-off between area and power. Using this approach, the search space is extended by reducing the precision of the weights obtained for the predictive models. The proposed method is divided into three stages: kernel extraction and training data generation, model fitting and substitution, and model precision optimization with bit-width reduction.

2) MODELS

Lorandel et al. [145] propose the use of neural networks to estimate the dynamic power consumption and output signal activities for different IP cores involved in a system. In this study, two stages are considered: IP characterization and high-level system modelling. Nasser et al. [146] present a model for the characterization phase by extracting the relevant information for each component that has an impact on power.

Tripathi et al. [147] introduce an MLP architecture to calculate power consumption, using LLVM IR instructions as input, and modelling only dynamic power.

Verma et al. [148] present a power estimation model that improves Deng's model [149], and is designed using nonlinear regression techniques. For this purpose, they use the power data of different types of digital circuits (described in VHDL) after the synthesis process. The data is divided into designs with and without clock gating, and based on this separation, two power models are developed.

In [150], two techniques are proposed by Verma et al., remarking the importance of predicting the power consumption in an early stage of the accelerator design: a heuristic approach based on a backpropagation neural network and a regression based on statistics.

FlexCL is extended in [151] through the incorporation of three modes of communication for the memory model (direct, burst, and stream access patterns) and an analytical power model for dynamic and static power.
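The per-IP-core linear modelling style that recurs in the methodologies and models above can be summarized in a few lines of C++. This is only a schematic illustration written for this survey context: real approaches such as [143], [145], [146] derive their coefficients from monitored signal activities or trained neural networks, whereas the coefficients and names below are placeholders.

```cpp
#include <vector>

// Schematic per-IP-core power model (illustrative only): total dynamic power
// is approximated as a sum of per-core contributions, each a linear function
// of an activity estimate, plus a static term for the whole device.
struct IpCore {
  double activity;  // estimated switching activity (e.g., toggle rate)
  double alpha_w;   // fitted activity coefficient, in watts per activity unit
  double idle_w;    // per-core baseline when instantiated but idle
};

double estimate_power_w(const std::vector<IpCore>& cores, double static_w) {
  double total = static_w;
  for (const IpCore& c : cores)
    total += c.idle_w + c.alpha_w * c.activity;  // linear per-core model
  return total;
}
```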
3) FRAMEWORKS

HLSPredict, developed by O'Neal et al. [152], is a framework based on an ensemble of ten machine learning models to predict performance and power consumption without analytical models or HLS-in-the-loop. Two types of IP cores are considered: without directives (base IP core) or with directives (optimized IP core). Accelerators for training the models are based on a template with DMA for memory transactions, which implies that for every source code implemented through HLS, the functionality of the IP core is encapsulated and integrated within the hardware template.

HL-Pow, proposed by Lin et al. [153], is based on machine learning techniques and overcomes the gap between the HLS synthesis phase and power consumption estimation (usually performed after the RTL implementation flow). A DSE is introduced to obtain the latency vs power trade-off, with pruning to reduce the design space when finding Pareto-optimal designs. For the machine learning implementation, the training dataset is constructed by a feature construction (HLS report) and power collection (post-implementation report), with a total of 256 elements per feature. The experiments are performed with different machine learning models, including linear regression, support vector machines, tree-based models, and neural networks.

PowerGear, described by Lin et al. [154], is a graph-learning-assisted power estimator for FPGA HLS, and is composed of a graph construction flow and a power-aware graph neural network model called HEC-GNN. This study considers the impact of interconnections in the hardware design that affects the power modelling. The authors benefit from the HLS front-end and HLS back-end to recover dataflow graphs because it is possible to obtain the IR traces and finite state machine with data path information. PowerGear can be used to guide a design space explorer with a trade-off between latency and power to obtain the Pareto frontier.

Aladdin, introduced by Shao et al. [155], estimates the performance, power, and area of accelerators. It generates a dependence graph from the input code and produces a fast cycle estimate before RTL construction.

HAPE, presented by Makni et al. [156], is a framework for area-power estimation based on analytical models, and it aims to assist the DSE in reducing HLS runtime. HAPE focuses only on the main subtraces present in a source code containing the directives provided by the designer. HAPE integrates Lin-Analyzer for computation cost.

4) SUMMARY

Regarding the power consumption, there is an evident trend in estimating this metric in the early stages of design using HLS tools. Moreover, some of the presented frameworks integrate the performance, power, and area estimations with a DSE engine.

D. SUMMARY AND DISCUSSION

The studies described in this section are summarized in Table 3, including for each one:
• Reference and year of publication.
• Whether it is a model, a methodology, or a framework.
  – In the case of a model, the number of input parameters is included. For example, the model presented in [157] uses more than 10 input parameters (10+), and the model presented in [98] uses 21 parameters. The symbol (−) indicates that the number of parameters is not defined in the corresponding study.
• Whether it includes DSE.

TABLE 3. Contributions presented in the literature for metric estimation, FPGA-based DSE, and power consumption. The acronyms used in the table are: A: area, L: latency, P: power consumption, QoR: quality of result, C: communication, T: throughput, E: energy, S: speed-up, RT: reconfiguration time, S-C: SystemC, I-C: Impulse C, HDL: hardware description language, MH: meta-heuristics, Em: empirical, and PN: Petri Nets.
research area in which the model is applied. The fourth and fifth columns are the aim and type of model used, respectively, and the last one is the target platform.

We can observe that most contributions focus on CNN accelerators, and that the models are devoted to carrying out DSE and performance estimation and are mainly based on Roofline. The use of this model is based on the premise that communication and computation are two basic constraints to improve the throughput of an accelerator, especially when developing hardware for highly demanding applications.
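The roofline relation these contributions build on can be written as AP = min(PC, CI × PMB), using the acronyms defined in Appendix A (AP: attainable performance, PC: peak computation, CI: computational intensity, PMB: peak memory bandwidth). The C++ sketch below only illustrates how a designer might bound the number of useful PE replicas with this relation; the function names and the example numbers are ours, not taken from the surveyed works.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative roofline helper: attainable performance (GOP/s) of a kernel,
// given the device peak computation PC (GOP/s), the kernel computational
// intensity CI (operations per byte moved), and the peak memory bandwidth
// PMB (GB/s). The bound is AP = min(PC, CI * PMB).
double attainable_performance(double pc_gops, double ci_ops_per_byte,
                              double pmb_gbps) {
  return std::min(pc_gops, ci_ops_per_byte * pmb_gbps);
}

int main() {
  // Hypothetical numbers for a single processing element (PE) and a device.
  const double pe_gops = 2.0;      // throughput of one PE replica
  const double device_pc = 200.0;  // peak computation of the device
  const double ci = 0.5;           // operations per byte for the kernel
  const double pmb = 12.0;         // external memory bandwidth in GB/s

  double ap = attainable_performance(device_pc, ci, pmb);
  // Replicating PEs beyond the roofline bound only adds area, not speed.
  int useful_replicas = static_cast<int>(ap / pe_gops);
  std::printf("attainable %.1f GOP/s -> at most %d useful PE replicas\n",
              ap, useful_replicas);
  return 0;
}
```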
B. FRAMEWORKS

Frameworks (or toolflows) have been proposed to map ML inference and training into SoC-based architectures, integrating models to mainly estimate hardware resource utilization, latency, and throughput. An exhaustive survey is presented in [166].

Concerning training acceleration, Geng et al. [177] developed FPDeep, a toolflow for a scalable CNN training acceleration on deeply-pipelined FPGA clusters, proposing a model for operator graph partitioning and hardware resource allocation (with a distinction between small and large FPGA clusters). Roofline is used to evaluate the throughput, because of its dependency on communication and computation.

F-CNN, introduced by Zhao et al. [178], is an automatic framework for CNN training based on the reconfiguration of a streaming data path at runtime. The proposed models for resource and bandwidth estimation guide the space exploration under design constraints to obtain an optimal performance.

HP-GNN, proposed by Lin et al. [179], is a framework for training graph neural networks (GNN) on a CPU-FPGA platform. It incorporates an engine dedicated to exploring the design space through an exhaustive search using performance and resource utilization models. HP-GNN also incorporates hardware templates to implement different GNN architectures.

Regarding inference acceleration, Ghaffari et al. [180] present CNN2Gate, a framework based on OpenCL to map a CNN onto an FPGA with fixed-point arithmetic, including a hardware-aware DSE based on resource utilization. It is implemented using manual directive tuning, reinforcement learning, and hill-climbing methods.

Venieris et al. [181] propose the fpgaConvNet toolflow to map a CNN onto an FPGA, thereby optimizing the neural network workload. It includes a DSE using a multi-objective algorithm (simulated annealing), where the explorer optimizes the design according to latency, throughput, or maximum throughput with a latency constraint. Performance estimation and resource utilization models are proposed for DSE.

Cloud-DNN [182], introduced by Chen et al., is a framework for mapping DNN to cloud-FPGA, generating the corresponding HLS project to obtain the final IP core. The proposed accelerator model is based on hardware resource cost (considering DSP and BRAM) and a performance model for each layer (convolutional, max pooling, and fully connected). A greedy algorithm is employed to search for the best accelerator configuration under constraints such as the DSP, BRAM, bandwidth, and DNN layers.

FRED [183], developed by Biondi et al., is a framework for real-time applications that benefits from dynamic partial reconfiguration (DPR). It includes a hardware task model for the tasks carried out by the FPGA with partial reconfiguration enabled, a software model for the tasks executed on the processor, and a scheduling infrastructure.

Mu et al. present [184] a collaborative framework to obtain OpenCL-based hardware designs for CNN implementation. A DSE based on LoopTrees is generated and pruned to reduce the design space. Fine-grained and coarse-grained analytical models are introduced to generate the final optimized solution. The former estimates the latency and resource utilization, whereas the latter applies further optimization to the best candidate designs obtained after applying the fine-grained model.

The heterogeneous image processing acceleration framework (Hipacc), proposed by Reiche et al. [185], allows the generation of image processing accelerators. Several steps are performed by analyzing the IR trace: data dependency analysis, dependency graph restructuring, and transformations (streaming objects, memory allocation, and replication of the innermost kernel to improve throughput).

A framework named Spark-to-FPGA-Accelerator (S2FA), introduced by Yu et al. [186], transforms Scala computational kernels based on Apache Spark applications into optimized accelerator designs. For this, a learning-based DSE is employed to obtain high-performance RTL designs using an ensemble of reinforcement learning algorithms: uniform greedy mutation, differential evolution genetic algorithm, particle swarm optimization, and simulated annealing. The HLS tool is executed in the loop to verify each optimization.

AutoDNNchip [187] is proposed by Xu et al. to facilitate fast chip designs based on DNN, targeting FPGA and ASIC platforms. The main factors involved in the DNN acceleration process are bit precision, clock frequency, memory technology, PE architecture, width for data transfer, memory allocation, and DNN mapping. AutoDNNchip is composed of a chip predictor and a chip builder. The former predicts metrics such as area, latency, energy, and throughput, whereas the latter performs the DSE optimizing the chip design using the results obtained by the predictor. The chip predictor operates in two modes: (i) coarse-grained and (ii) fine-grained. In (i), analytical models are used to obtain the energy, critical path, and area for a DNN model, while in (ii), an algorithm is implemented to obtain the final latency through runtime simulations, considering the results of the coarse-grained mode. The chip builder is composed of a DSE based on two phases: early-stage architecture and IP configuration exploration, and inter-IP pipeline exploration and IP optimization. Finally, the RTL is generated and executed to validate the results.

Table 5 summarizes the frameworks used in the contributions described in this section. The first two columns are the reference and the year of publication. The third column is the research area in which the model is applied. The fourth is the name of the framework and the last is the target platform.

As we can observe, most frameworks are devoted to mapping ML-based inference into FPGA/SoC architectures. The components of these frameworks are usually expressed as pre-defined optimized templates, mainly implemented in C++ and OpenCL, where parallelism can be controlled by changing the parameters associated with the different directives.

TABLE 5. Utilization of frameworks FPGA/SoC on different research areas. PDR: Partial dynamic reconfiguration.
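Several of the toolflows above (e.g., fpgaConvNet [181] and S2FA [186]) search the directive space with stochastic meta-heuristics such as simulated annealing. The C++ sketch below is a minimal, generic illustration of that idea only; the cost function, the directive encoding, and all names are ours and are not taken from any of the cited tools, which couple the search to their own estimation models or to HLS runs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

// A candidate HLS configuration: the two directive parameters are placeholders.
struct Config {
  int unroll;     // loop unroll factor
  int partition;  // array partition factor
};

// Placeholder cost: stands in for an analytical latency/area model or an HLS
// report; real toolflows query their estimator here.
double cost(const Config& c) {
  double latency = 1024.0 / c.unroll + 64.0 / c.partition;
  double area = 4.0 * c.unroll + 2.0 * c.partition;
  return latency + 0.5 * area;  // scalarized multi-objective trade-off
}

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> step(-1, 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  Config cur{1, 1}, best = cur;
  double t = 100.0;  // initial temperature
  for (int iter = 0; iter < 2000; ++iter, t *= 0.995) {
    // Propose a small random move in the directive space (factors 1..16).
    Config next = cur;
    next.unroll = std::min(16, std::max(1, cur.unroll + step(rng)));
    next.partition = std::min(16, std::max(1, cur.partition + step(rng)));
    double delta = cost(next) - cost(cur);
    // Accept improvements always, worse moves with Boltzmann probability.
    if (delta < 0.0 || unif(rng) < std::exp(-delta / t)) cur = next;
    if (cost(cur) < cost(best)) best = cur;
  }
  std::printf("best: unroll=%d partition=%d cost=%.1f\n",
              best.unroll, best.partition, cost(best));
  return 0;
}
```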
VI. CHALLENGES

Nowadays, the explosive growth of accelerators promises greater computational capabilities. FPGA/SoC devices are widely used as hardware accelerators in different areas of research and development. However, the structured study we have presented in the previous sections indicates the necessity to address some challenges. Coping with them will permit a more widespread adoption of models, methodologies, and frameworks for performance estimation of HLS-based hardware designs for FPGA/SoC technology.

Even using HLS tools, reconfiguring an FPGA/SoC with an efficient hardware design is a challenging task. This is easily made apparent by some observations:
• Physical resources, such as memory bandwidth, reconfigurable hardware (LUTs, CLBs, and slices), and static hardware (DSPs and BRAMs) are limited in FPGA/SoC devices. Thus, the available physical resources should be used skilfully, considering techniques to improve the latency, area, and power, as introduced in Section III-C.
• Code restructuring techniques aid creating efficient FPGA implementations using HLS tools, modifying the original source code of the application according to the FPGA architecture. Suggestions for this topic are presented in [82].
• The number of PE replicas in a hardware design, and consequently the level of coarse-grain parallelism that can be obtained, is limited to the available physical resources. Therefore, different strategies should be implemented to exploit the architecture so as to increase the scalability of the system.
• There is a trade-off between the different metrics to be optimized, as was presented in Section III-B. As an example, the area occupied is likely to increase if the latency is reduced, and vice versa. Thus, the FPGA designer should choose a good compromise between the metrics in terms of resources, computing operations, throughput, among others.
• The hardware generated through HLS tools is directly associated with the applied directives, but sometimes applying and tuning directives require a considerable endeavour to obtain a proper FPGA implementation. Moreover, generating a solution for each directive combination is associated with the synthesis time, reducing productivity.
• The exploration of the design space is linked to the human effort of performing combinations of directives, user design constraints, FPGA features, and code restructuring, among others.

We can cope with the above considerations through models, methodologies, and frameworks to reduce design time, as follows:
• The level of coarse-grain parallelism can be obtained by means of a model such as Roofline, identifying the computation-to-communication ratio, exposing the relationship between communication bottlenecks, computations, and number of replicas, as was presented in Section II-E and demonstrated in contributions such as [48], [118].
• Design space explorers aim to identify the optimal combination of directives to obtain an HLS-based hardware design with the best trade-off among different metrics, generating the Pareto-optimal set of designs. Reducing the design space and avoiding HLS in the exploration process can improve the design time, as was described in Section IV-B.
• Models integrated within a methodology or framework can automatically estimate the performance of HLS-based hardware designs without executing HLS tools, as presented in Section IV.
• Some frameworks and methodologies including DSE provide automatic directive-insertion optimizations and code transformation insights, as in contributions such as [115], [116], [118].

Nevertheless, the literature review shows that a number of challenges still have to be addressed in order to make optimal use of models, methodologies, and frameworks, such as:
• Recent HLS tools generate more comprehensive reports with more accurate information on total resource availability, latency, clock frequency, and resource utilization. These reports can be integrated with models, methodologies, and frameworks to estimate metrics and provide an initial value for the replication factor of a single PE. However, the report generation is linked to the synthesis time of the FPGA implementation. Reducing the design time is an important factor when using FPGA/SoC without losing hardware quality to reconfigure the platform. Thus, if the HLS tool is in the loop for performance estimation using reports, it can lead to an increased design time. One way to overcome this is to use approaches such as [113], [121], [124], [152], [156], which avoid running HLS in the loop or reduce its invocation.
• The performance metrics reported by HLS tools make them suitable to be combined with a parallel computation model to reduce the time required to obtain the necessary statistics for each implementation for a specific application. However, there is a gap between the HLS report and the real hardware implementation [101] that can be addressed with a performance model that includes the results obtained from the sourceCode-to-bitstream flow using the values related to final hardware utilization, power consumption, and timing reports.
• Computing models for FPGA-based reconfigurable hardware accelerators have to consider that the inherent hardware is not fixed. Rather, it is defined by how the application is described. Therefore, a higher number of parameters have to be included in the model, such as hardware resources (DSP, BRAM, LUT, and FF), programmable logic clock, latency, byte-operations (Bops), scalability in the number of PE, and power consumption (a minimal sketch of such a parameter set is given after this list). This contrasts with the computing models proposed for other parallel platforms, such as PRAM or BSP, that use a few parameters. Nevertheless, including more parameters in the model increases the analysis accuracy, but affects the complexity of the model analysis. Therefore, the trade-off between these two features has to be addressed. In addition, the parameters should be adjusted according to the particular combination of directives applied to the source code.
• The compatibility among different versions of HLS tools is not granted by models, methodologies, and frameworks. As a consequence, calibration techniques can help maintain compatibility between high-level tools, thereby avoiding being tied to one version of HLS tool in particular [14].
• Methodologies and frameworks are typically linked to a tool [77], [130], [131], [136]. However, most such tools are not easily available or do not have user support. This is a critical point in the adoption of methodologies and frameworks for performance estimation, which makes [...] plays an important role and different strategies provided by commercial tools can be used in this phase, adding another factor to be analyzed.
• It is fundamental to consider the application of HLS-specific compiler optimizations, due to the impact that they have on the hardware quality, in terms of latency, area, and power consumption [190].
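As a concrete illustration of the parameter count involved, the following C++ structure lists the kind of quantities such an FPGA-oriented computing model would have to carry per design point. The structure and field names are ours, chosen only to mirror the parameters enumerated in the bullet above; they are not a definition taken from any surveyed model.

```cpp
#include <cstdint>

// Illustrative (not from the surveyed works): one design point of an
// FPGA-oriented performance model carries far more parameters than classic
// parallel models such as PRAM or BSP.
struct FpgaDesignPoint {
  // Hardware resources consumed by the generated accelerator.
  std::uint32_t dsp;
  std::uint32_t bram;
  std::uint32_t lut;
  std::uint32_t ff;

  double pl_clock_mhz;          // programmable logic clock
  std::uint64_t latency_cycles; // estimated or reported latency
  double byte_operations;       // Bops: operations per byte moved
  std::uint32_t pe_replicas;    // scalability in the number of PEs
  double power_w;               // estimated power consumption

  // Directive combination that produced this point (placeholders).
  std::uint8_t unroll_factor;
  std::uint8_t array_partition_factor;
  bool pipeline_enabled;
};
```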
Fig. 18 summarizes the main aspects presented in this section, considering those to create efficient hardware to reconfigure the FPGA, how some of these aspects may be coped with through models, methodologies, and frameworks, and the challenges that need to be considered to bridge the gap between designers and FPGA-based reconfigurable hardware accelerators.

VII. CONCLUSION

In this survey, different models, methodologies, and frameworks proposed for metrics estimation, FPGA-based design space exploration, and power consumption estimation on FPGA/SoC have been described. The main features and limitations, as well as trade-offs of these approaches, have been presented, and different challenges to be addressed have been identified.

The integration of models and frameworks in different research areas has also been described, indicating a growing tendency to apply them in the field of machine learning accelerators for diverse applications.

Based on our literature review, it can be observed that existing models, methodologies, and frameworks are very difficult to compare against one another. One reason is the lack of standards limiting their evaluation on different hardware and applications, together with the fact that the different approaches do not analyze the same performance metrics. In addition, it can be affirmed that the inherent hardware reconfigurability of FPGA/SoC affects the complexity of the associated models. Indeed, the models for FPGA/SoC usually have a higher complexity than those commonly used for CPU, GPU, multicore processors, among other architectures.

We believe this survey can help readers understand the benefits of integrating models, methodologies, and frameworks for FPGA-based hardware accelerators into the design flow. Therefore, the FPGA designer can select the approach that best suits the application, hardware architecture, and programming skills.

The literature review shows that several challenges still have to be addressed to make optimal integration of models, methodologies, and frameworks in the design flow. By highlighting these challenges, this survey reveals what has to be considered to bridge the gap between the FPGA designer and hardware accelerators based on FPGA.

APPENDIX A. LIST OF ACRONYMS
A Area.
ADRS Average distance from reference set.
AP Attainable performance.
ASIC Application specific integrated circuit.
BRAM Block RAM.
BSP Bulk synchronous parallel.
CCM Collective computing model.
CDFG Control data flow graph.
CFD Computational fluid dynamics.
CI Computational intensity.
CLB Configurable logic block.
CNN Convolutional neural network.
CRCW Concurrent read concurrent write.
CREW Concurrent read exclusive write.
CUDA Compute Unified Device Architecture.
D Design space.
DDDG Dynamic data dependence graph.
DMA Direct memory access.
DNN Deep neural network.
DSE Design space exploration.
DSP Digital signal processor.
ERCW Exclusive read concurrent write.
EREW Exclusive read exclusive write.
ERT Empirical Roofline toolkit.
FF Flip-flop.
FIR Finite impulse response filter.
FPGA Field programmable gate array.
GNN Graph neural network.
HDL Hardware description language.
HLS High-level synthesis.
HPC High-performance computing.
HPM Hierarchical model for parallel computations.
HVE Hypervolume error.
I/O Input/Output.
IoT Internet of things.
IP Intellectual property.
IR Intermediate representation.
L Latency.
L1 Level-1 cache memory.
L2 Level-2 cache memory.
LLVM IR Low-level virtual machine intermediate representation.
LUT Lookup table.
ML Machine learning.
MLP Multi-layer perceptron.
MOOA Multi-objective optimization algorithms.
MPSoC Multiprocessor system on chip.
PC Peak computation.
PE Processing element.
PF Pareto-optimal frontier.
PMB Peak memory bandwidth.
PRAM Parallel random access machine.
QoR Quality of results.
RAM Random access machine.
RTL Register transfer level.
SIMD Single instruction/multiple data.
SoC System on chip.
SPMD Single program multiple data.
[44] J. L. Roda, F. Sande, C. Leon, J. A. Gonzalez, and C. Rodriguez, "The collective computing model," in Proc. Euromicro Workshop Parallel Distrib., 1999, pp. 19–26.
[45] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
[46] C. Yang, T. Kurth, and S. Williams, "Hierarchical roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 perlmutter system," Concurrency Comput., Pract. Exper., vol. 32, no. 20, p. e5547, Oct. 2020.
[47] C. Yang, Y. Wang, T. Kurth, S. Farrell, and S. Williams, "Hierarchical roofline performance analysis for deep learning applications," in Intelligent Computing (Lecture Notes in Networks and Systems), vol. 284, K. Arai, Ed. Cham, Switzerland: Springer, 2021, doi: 10.1007/978-3-030-80126-7_35.
[48] B. da Silva, A. Braeken, E. H. D'Hollander, and A. Touhafi, "Performance modeling for FPGAs: Extending the roofline model with high-level synthesis tools," Int. J. Reconfigurable Comput., vol. 2013, Jan. 2013, Art. no. 428078.
[49] Y. J. Lo, S. Williams, B. Van Straalen, T. J. Ligocki, M. J. Cordery, N. J. Wright, M. W. Hall, and L. Oliker, "Roofline model toolkit: A practical tool for architectural and program analysis," in High Perform. Comput. Syst. Perform. Modeling, Benchmarking, Simul., 2015, pp. 129–148.
[50] Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams, "Time-based roofline for deep learning performance analysis," in Proc. Workshop Deep Learn. Supercomputers, Atlanta, GA, USA, Nov. 2020, pp. 10–19.
[51] Y. Zhang, G. Chen, G. Sun, and Q. Miao, "Models of parallel computation: A survey and classification," Frontiers Comput. Sci. China, vol. 1, no. 2, pp. 156–165, May 2007.
[52] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir, "A model for hierarchical memory," in Proc. Symp. Theory Comput., 1987, pp. 305–314.
[53] B. Alpern, L. Carter, E. Feig, and T. Selker, "The uniform memory hierarchy model of computation," Algorithmica, vol. 12, nos. 2–3, pp. 72–109, Sep. 1994.
[54] Z. Li, P. Mills, and J. H. Reif, "Models and resource metrics for parallel and distributed computation," Parallel Algorithms Appl., vol. 8, no. 1, pp. 35–59, 1996.
[55] X. Qiao, S. Chen, and L. T. Yang, "HPM: A hierarchical model for parallel computations," Int. J. High Perform. Comput. Netw., vol. 1, no. 3, pp. 117–127, 2004.
[56] S. Pllana, I. Brandic, and S. Benkner, "Performance modeling and prediction of parallel and distributed computing systems: A survey of the state of the art," in Proc. 1st Int. Conf. Complex, Intell. Softw. Intensive Syst. (CISIS), Apr. 2007, pp. 279–284.
[57] A. Riahi, A. Savadi, and M. Naghibzadeh, "Comparison of analytical and ML-based models for predicting CPU–GPU data transfer time," Computing, vol. 102, no. 9, pp. 2099–2116, Sep. 2020.
[58] O. Bringmann, W. Ecker, I. Feldner, A. Frischknecht, C. Gerum, T. Hämäläinen, M. A. Hanif, M. J. Klaiber, D. Mueller-Gritschneder, P. P. Bernardo, S. Prebeck, and M. Shafique, "Automated HW/SW co-design for edge AI: State, challenges and steps ahead: Special session paper," in Proc. Int. Conf. Hardw./Softw. Codesign Syst. Synthesis (CODES+ISSS), 2021, pp. 11–20, doi: 10.1145/3478684.3479261.
[59] C. Pham-Quoc, X.-Q. Nguyen, and T. N. Thinh, "Towards an FPGA-targeted hardware/software co-design framework for CNN-based edge computing," Mobile Netw. Appl., vol. 174, pp. 1–12, May 2022.
[60] Q. Xiao, S. Zheng, B. Wu, P. Xu, X. Qian, and Y. Liang, "HASCO: Towards agile HArdware and software CO-design for tensor computation," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 1055–1068.
[61] Y. Li, R. Chen, B. Sensale-Rodriguez, W. Gao, and C. Yu, "Real-time multi-task diffractive deep neural networks via hardware-software co-design," Sci. Rep., vol. 11, no. 1, pp. 1–9, Dec. 2021.
[62] N. Talati, K. May, A. Behroozi, Y. Yang, K. Kaszyk, C. Vasiladiotis, T. Verma, L. Li, B. Nguyen, J. Sun, and J. M. Morton, "Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design," in Proc. Int. Symp. High-Performance Comput. Archit. (HPCA), 2021, pp. 654–667.
[63] D. R. F. de Bulnes, Y. Maldonado, and L. Trujillo, "Development of multiobjective high-level synthesis for FPGAs," Sci. Program., vol. 2020, Jun. 2020, Art. no. 7095048.
[64] Z. Zeng, R. Sedaghat, and A. Sengupta, "A novel framework of optimizing modular computing architecture for multi objective VLSI designs," in Proc. Int. Conf. Microelectron. (ICM), Dec. 2009, pp. 328–331.
[65] Y. Ma, S. Roy, J. Miao, J. Chen, and B. Yu, "Cross-layer optimization for high speed adders: A Pareto driven machine learning approach," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 12, pp. 2298–2311, Dec. 2018.
[66] D. Roy and A. Sengupta, "Low overhead symmetrical protection of reusable IP core using robust fingerprinting and watermarking during high level synthesis," Future Gener. Comput. Syst., vol. 71, pp. 89–101, Jun. 2017.
[67] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, "Broadening the exploration of the accelerator design space in embedded scalable platforms," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2017, pp. 1–7.
[68] R. Resmi and B. B. T. Sundari, "Allocation of optimal reconfigurable array using graph merging technique," in Proc. Int. Conf. Embedded Syst. (ICES), Jul. 2014, pp. 49–54.
[69] D. S. H. Ram, M. C. Bhuvaneswari, and S. M. Logesh, "A novel evolutionary technique for multi-objective power, area and delay optimization in high level synthesis of datapaths," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Jul. 2011, pp. 290–295.
[70] A. Sengupta, R. Sedaghat, and P. Sarkar, "A multi structure genetic algorithm for integrated design space exploration of scheduling and allocation in high level synthesis for DSP kernels," Swarm Evol. Comput., vol. 7, pp. 35–46, Dec. 2012.
[71] A. Sengupta, R. Sedaghat, and P. Sarkar, "Rapid exploration of integrated scheduling and module selection in high level synthesis for application specific processor design," Microprocessors Microsyst., vol. 36, no. 4, pp. 303–314, Jun. 2012.
[72] B. C. Schafer and K. Wakabayashi, "Design space exploration acceleration through operation clustering," IEEE Trans. Comput.-Aided Design Integr., vol. 29, no. 1, pp. 153–157, Jan. 2009.
[73] B. C. Schafer, T. Takenaka, and K. Wakabayashi, "Adaptive simulated annealer for high level synthesis design space exploration," in Proc. Int. Symp. VLSI Design, Autom. Test, Apr. 2009, pp. 106–109.
[74] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proc. Int. Symp. Code Gener. Optim. (CGO), 2004, pp. 75–86.
[75] LLVM Developer Group. Clang. Accessed: Feb. 1, 2022. [Online]. Available: https://clang.llvm.org
[76] L. Huang, D.-L. Li, K.-P. Wang, T. Gao, and A. Tavares, "A survey on performance optimization of high-level synthesis tools," J. Comput. Sci. Technol., vol. 35, no. 3, pp. 697–720, May 2020.
[77] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), 2017, pp. 430–437, doi: 10.1109/ICCAD.2017.8203809.
[78] Y.-K. Choi and J. Cong, "HLS-based optimization and design space exploration for applications with variable loop bounds," in Proc. Int. Conf. Comput.-Aided Design (ICCAD), 2018, pp. 1–8.
[79] J. S. Monson and B. L. Hutchings, "Using source-level transformations to improve high-level synthesis debug and validation on FPGAs," in Proc. Int. Symp. Field-Program. Gate Arrays, 2015, pp. 5–8.
[80] C. Li, Y. Bi, Y. Benezeth, D. Ginhac, and F. Yang, "High-level synthesis for FPGAs: Code optimization strategies for real-time image processing," J. Real-Time Image Process., vol. 14, no. 3, pp. 701–712, Mar. 2018.
[81] R. Campos and J. M. Cardoso, "On data parallelism code restructuring for HLS targeting FPGAs," in Proc. Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), 2021, pp. 144–151.
[82] J. de Fine Licht, M. Besta, S. Meierhans, and T. Hoefler, "Transformations of high-level synthesis codes for high-performance computing," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 5, pp. 1014–1029, May 2021, doi: 10.1109/TPDS.2020.3039409.
[83] A. C. Ferreira and J. M. Cardoso, "Graph-based code restructuring targeting HLS for FPGAs," in Proc. Int. Symp. Appl. Reconfigurable Comput., 2019, pp. 230–244.
[84] M. Q. Hoang, P. L. Nguyen, H. V. Tran, H. Q. Nguyen, V. T. Nguyen, and C. Vo-Le, "FPGA oriented compression of DNN using layer-targeted weights and activations quantization," in Proc. IEEE 8th Int. Conf. Commun. Electron. (ICCE), Jan. 2021, pp. 157–162.
[85] Q. Zhang, J. Cao, Y. Zhang, S. Zhang, Q. Zhang, and D. Yu, "FPGA implementation of quantized convolutional neural networks," in Proc. Int. Conf. Commun. Technol. (ICCT), 2019, pp. 1605–1610.
[86] P. Bacchus, R. Stewart, and E. Komendantskaya, "Accuracy, training time and hardware efficiency trade-offs for quantized neural networks on FPGAs," in Proc. Int. Symp. Appl. Reconfigurable Comput., 2020, pp. 121–135.
[87] X. Xu, Q. Lu, T. Wang, Y. Hu, C. Zhuo, J. Liu, and Y. Shi, "Efficient hardware implementation of cellular neural networks with incremental quantization and early exit," ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 4, pp. 1–20, Oct. 2018.
[88] N. Grover and M. Soni, "Reduction of power consumption in FPGAs—An overview," Inf. Eng. Electron. Bus., vol. 4, no. 5, p. 50, 2012.
[89] M. Ibro and G. Marinova, "Review on low-power consumption techniques for FPGA-based designs in IoT technology," in Proc. 16th Int. Conf. Telecommun. (ConTEL), Jun. 2021, pp. 110–114.
[90] B. Khaleghi, S. Salamat, M. Imani, and T. Rosing, "FPGA energy efficiency by leveraging thermal margin," in Proc. IEEE 37th Int. Conf. Comput. Design (ICCD), Nov. 2019, pp. 376–384.
[91] H. Kim and K. Choi, "Low power FPGA-SoC design techniques for CNN-based object detection accelerator," in Proc. IEEE 10th Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), Oct. 2019, pp. 1130–1134.
[92] Y. Choi and J. Cong, "HLScope: High-level performance debugging for FPGA designs," in Proc. Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2017, pp. 125–128.
[93] Y. Choi, P. Zhang, P. Li, and J. Cong, "HLScope+: Fast and accurate performance estimation for FPGA HLS," in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), 2017, pp. 691–698.
[94] N. Kapre and H. Patel, "Applying models of computation to OpenCL pipes for FPGA computing," in Int. Workshop OpenC (IWOCL), 2017, pp. 1–9.
[95] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," Proc. IEEE, vol. 75, no. 9, pp. 1235–1245, Sep. 1987.
[96] M. Hora, V. Končický, and J. Tětek, "Theoretical model of computation and algorithms for FPGA-based hardware accelerators," 2018, arXiv:1807.03611.
[97] K. Papadimitriou, A. Dollas, and S. Hauck, "Performance of partial reconfiguration in FPGA systems: A survey and a cost model," ACM Trans. Reconfigurable Technol. Syst., vol. 4, no. 4, pp. 1–36, 2011.
[98] S. Wang, Y. Liang, and W. Zhang, "FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs," in Proc. 54th Annu. Design Automat. Conf., Jun. 2017, p. 27.
[99] E. Calore and S. F. Schifano, "Performance assessment of FPGAs as HPC accelerators using the FPGA empirical roofline," in Proc. 31st Int. Conf. Field-Program. Log. Appl. (FPL), Aug. 2021, pp. 83–90.
[100] T. Nguyen, S. Williams, M. Siracusa, C. MacLean, D. Doerfler, and N. J. Wright, "The performance and energy efficiency potential of FPGAs in scientific computing," in Proc. Perform. Modeling, Benchmarking Simul. High Perform. Comput. Syst. (PMBS), 2020, pp. 8–19.
[101] H. M. Makrani, F. Farahmand, H. Sayadi, S. Bondi, S. M. P. Dinakarrao, H. Homayoun, and S. Rafatirad, "Pyramid: Machine learning framework to estimate the optimal timing and resource usage of a high-level synthesis design," in Proc. 29th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2019, pp. 397–403.
[102] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, "Minerva: Automated hardware optimization tool," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2017, pp. 1–8.
[103] Z. Wang, B. He, W. Zhang, and S. Jiang, "A performance analysis framework for optimizing OpenCL applications on FPGAs," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Mar. 2016, pp. 114–125.
[104] C. Larman and V. R. Basili, "Iterative and incremental developments. A brief history," Computer, vol. 36, no. 6, pp. 47–56, 2003.
[105] J. Cong, W. Jiang, B. Liu, and Y. Zou, "Automatic memory partitioning and scheduling for throughput and power optimization," ACM Trans. Des. Automat. Electron. Syst., vol. 16, no. 2, pp. 1–15, 2011.
[106] N. K. Pham, A. K. Singh, A. Kumar, and M. M. A. Khin, "Exploiting loop-array dependencies to accelerate the design space exploration with high level synthesis," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2015, pp. 157–162.
[107] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks, "MachSuite: Benchmarks for accelerator design and customized architectures," in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Oct. 2014, pp. 110–119.
[108] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, "CHStone: A benchmark program suite for practical C-based high-level synthesis," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 1192–1195.
[109] B. C. Schafer and A. Mahapatra, "S2CBench: Synthesizable SystemC benchmark suite for high-level synthesis," IEEE Embedded Syst. Lett., vol. 6, no. 3, pp. 53–56, Sep. 2014.
[110] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y.-H. Lai, G. Liu, G. A. Velasquez, W. Wang, and Z. Zhang, "Rosetta: A realistic high-level synthesis benchmark suite for software-programmable FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2018, pp. 269–278.
[111] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, "Spector: An OpenCL FPGA benchmark suite," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 141–148.
[112] B. Reagen, J. M. Hernández-Lobato, R. Adolf, M. Gelbart, P. Whatmough, G.-Y. Wei, and D. Brooks, "A case for efficient accelerator design space exploration via Bayesian optimization," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2017, pp. 1–6.
[113] C. Lo and P. Chow, "Model-based optimization of high level synthesis directives," in Proc. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–10.
[114] A. Mehrabi, A. Manocha, B. C. Lee, and D. J. Sorin, "Bayesian optimization for efficient accelerator synthesis," ACM Trans. Archit. Code Optim., vol. 18, no. 1, pp. 1–25, Mar. 2021.
[115] N. Wu, Y. Xie, and C. Hao, "IronMan: GNN-assisted design space exploration in high-level synthesis via reinforcement learning," in Proc. Great Lakes Symp. VLSI (GLSVLSI), Jun. 2021, pp. 39–44.
[116] M. Siracusa, E. Del Sozzo, M. Rabozzi, L. Di Tucci, S. Williams, D. Sciuto, and M. D. Santambrogio, "A comprehensive methodology to optimize FPGA designs via the roofline model," IEEE Trans. Comput., vol. 71, no. 8, pp. 1903–1915, Aug. 2021.
[117] S. W. Nabi and W. Vanderbauwhede, "FPGA design space exploration for scientific HPC applications using a fast and accurate cost model based on roofline analysis," J. Parallel Distrib. Comput., vol. 133, pp. 407–419, Nov. 2019.
[118] M. Siracusa, L. Di Tucci, M. Rabozzi, S. Williams, E. D. Sozzo, and M. D. Santambrogio, "A CAD-based methodology to optimize HLS code via the roofline model," in Proc. 39th Int. Conf. Comput.-Aided Design, Nov. 2020, pp. 1–9.
[119] R. Tessier and H. Giza, "Balancing logic utilization and area efficiency in FPGAs," in Proc. 10th Int. Workshop Field Program. Logic Appl., vol. 1896, 2000, pp. 535–544.
[120] L. Ferretti, J. Kwon, G. Ansaloni, G. D. Guglielmo, L. P. Carloni, and L. Pozzi, "Leveraging prior knowledge for effective design-space exploration in high-level synthesis," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 11, pp. 3736–3747, Nov. 2020.
[121] L. Piccolboni, P. Mantovani, G. D. Guglielmo, and L. P. Carloni, "COSMOS: Coordination of high-level synthesis and memory optimization for hardware accelerators," ACM Trans. Embedded Comput. Syst., vol. 16, no. 5s, pp. 1–22, Oct. 2017.
[122] P. Meng, A. Althoff, Q. Gautier, and R. Kastner, "Adaptive threshold non-Pareto elimination: Re-thinking machine learning for system level design space exploration on FPGAs," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), 2016, pp. 918–923.
[123] S. Xu, S. Liu, Y. Liu, A. Mahapatra, M. Villaverde, F. Moreno, and B. Carrion Schafer, "Design space exploration of heterogeneous MPSoCs with variable number of hardware accelerators," Microprocessors Microsyst., vol. 65, pp. 169–179, Mar. 2019.
[124] J. Kwon and L. P. Carloni, "Transfer learning for design-space exploration with high-level synthesis," in Proc. Workshop Mach. Learn. (CAD), 2020, pp. 163–168.
[125] S. Dai, Y. Zhou, H. Zhang, E. Ustun, E. F. Y. Young, and Z. Zhang, "Fast and accurate estimation of quality of results in high-level synthesis with machine learning," in Proc. IEEE 26th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), Apr. 2018, pp. 129–132.
[126] S. Liu, F. C. Lau, and B. C. Schafer, "Accelerating FPGA prototyping through predictive model-based HLS design space exploration," in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, p. 97.
[127] A. S. B. Lopes and M. M. Pereira, "A machine learning approach to accelerating DSE of reconfigurable accelerator systems," in Proc. 33rd Symp. Integr. Circuits Syst. Design (SBCCI), Aug. 2020, pp. 1–6.
[128] E. Ustun, C. Deng, D. Pal, Z. Li, and Z. Zhang, "Accurate operation delay prediction for FPGA HLS using graph neural networks," in Proc. 39th Int. Conf. Comput.-Aided Design, Nov. 2020, p. 87.
[129] M. Manuel, A. Kreddig, S. Conrady, N. A. Vu Doan, and W. Stechele, "Model-based design space exploration for approximate image processing on FPGA," in Proc. IEEE Nordic Circuits Syst. Conf. (NorCAS), Oct. 2020, pp. 1–7.
[130] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar, "Lin-analyzer: A high-level performance analysis tool for FPGA-based accelerators," in Proc. 53rd Annu. Design Automat. Conf., Austin, TX, USA, Jun. 2016, p. 136.
[131] A. B. Perina, J. Becker, and V. Bonato, "Lina: Timing-constrained high-level synthesis performance estimator for fast DSE," in Proc. Int. Conf. Field-Program. Technol. (ICFPT), 2019, pp. 343–346.
[132] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar, "Design space exploration of FPGA-based accelerators with multi-level parallelism," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 1141–1146.
[133] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "Performance modeling and directives optimization for high-level synthesis on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 7, pp. 1428–1441, Jul. 2019.
[134] L. Ferretti, G. Ansaloni, and L. Pozzi, "Cluster-based heuristic for high level synthesis design space exploration," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 1, pp. 35–43, Jan. 2021.
[135] L. Ferretti, G. Ansaloni, and L. Pozzi, "Lattice-traversing design space exploration for high level synthesis," in Proc. IEEE 36th Int. Conf. Comput. Design (ICCD), Oct. 2018, pp. 210–217.
[136] Q. Gautier, A. Althoff, C. L. Crutchfield, and R. Kastner, "Sherlock: A multi-objective design space exploration framework," ACM Trans. Design Autom. Electron. Syst., vol. 27, no. 4, pp. 1–20, Jul. 2022.
[137] D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K. Olukotun, "Automatic generation of efficient accelerators for reconfigurable hardware," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 115–127.
[138] H. M. Makrani, H. Sayadi, T. Mohsenin, S. Rafatirad, A. Sasan, and H. Homayoun, "XPPE: Cross-platform performance estimation of hardware accelerators using machine learning," in Proc. 24th Asia South Pacific Design Automat. Conf., Jan. 2019, pp. 727–732.
[139] PowerTool MAXPOWERTOOL002 Quick Start Guide, Maxim Integr., San Jose, CA, USA, 2014. Accessed: Feb. 19, 2022. [Online]. Available: https://pdfserv.maximintegrated.com/en/an/UG5981.pdf
[140] USB Interface Adapter Evaluation Module. User's Guide, Texas Instrum., Dallas, TX, USA, 2006. Accessed: Feb. 19, 2022.
[141] Xilinx Power Estimator User Guide. UG-440 (v2021.2), Xilinx, San Jose, CA, USA, 2021. Accessed: Feb. 19, 2022. [Online]. Available: https://china.xilinx.com/content/dam/xilinx/support/documents/sw_manuals/xilinx2021_2/ug440-xilinx-power-estimator.pdf
[142] Intel FPGA Power and Thermal Calculator User Guide, Intel, San Jose, CA, USA, 2021. Accessed: Feb. 19, 2022. [Online]. Available: https://www.intel.com/content/www/us/en/docs/programmable/683445/21-4/overview-of-the.html
[143] J. J. Davis, E. Hung, J. M. Levine, E. A. Stott, P. Y. K. Cheung, and G. A. Constantinides, "KAPow: High-accuracy, low-overhead online per-module power estimation for FPGA designs," ACM Trans. Reconfigurable Technol. Syst., vol. 11, no. 1, pp. 1–22, Mar. 2018.
[144] S. Xu and B. C. Schafer, "Approximating behavioral HW accelerators through selective partial extractions onto synthesizable predictive models," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1–8.
[145] J. Lorandel, J.-C. Prévotet, and M. Hélard, "Efficient modelling of FPGA-based IP blocks using neural networks," in Proc. Int. Symp. Wireless Commun. Syst. (ISWCS), 2016, pp. 571–575.
[146] Y. Nasser, J. Prévotet, and M. Hélard, "Power modeling on FPGA: A neural model for RT-level power estimation," in Proc. Int. Conf. Comput. Frontiers (CF), 2018, pp. 309–313.
[147] A. N. Tripathi and A. Rajawat, "An accurate and quick ANN-based system-level dynamic power estimation model using LLVM IR profiling for FPGA designs," IEEE Embedded Syst. Lett., vol. 12, no. 2, pp. 58–61, Jun. 2020.
[148] G. Verma, V. Khare, and M. Kumar, "More precise FPGA power estimation and validation tool (FPEV_Tool) for low power applications," Wireless Pers. Commun., vol. 106, no. 4, pp. 2237–2246, Jun. 2019.
[149] L. Deng, K. Sobti, and C. Chakrabarti, "Accurate models for estimating area and power of FPGA implementations," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2008, pp. 1417–1420.
[150] G. Verma, T. Singhal, R. Kumar, S. Chauhan, S. Shekhar, B. Pandey, and D. M. Akbar Hussain, "Heuristic and statistical power estimation model for FPGA based wireless systems," Wireless Pers. Commun., vol. 106, no. 4, pp. 2087–2098, Jun. 2019.
[151] Y. Liang, S. Wang, and W. Zhang, "FlexCL: A model of performance and power for OpenCL workloads on FPGAs," IEEE Trans. Comput., vol. 67, no. 12, pp. 1750–1764, Dec. 2018.
[152] K. O'Neal, M. Liu, H. Tang, A. Kalantar, K. DeRenard, and P. Brisk, "HLSPredict: Cross platform performance prediction for FPGA high-level synthesis," in Proc. Int. Conf. Comput.-Aided Design, Nov. 2018, pp. 1–8.
[153] Z. Lin, J. Zhao, S. Sinha, and W. Zhang, "HL-Pow: A learning-based power modeling framework for high-level synthesis," in Proc. Asia South Pacific Design Autom. Conf. (ASP-DAC), 2020, pp. 574–580.
[154] Z. Lin, Z. Yuan, J. Zhao, W. Zhang, H. Wang, and Y. Tian, "PowerGear: Early-stage power estimation in FPGA HLS via heterogeneous edge-centric GNNs," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), 2022, pp. 1341–1346.
[155] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Proc. ACM/IEEE 41st Int. Symp. Comput. Archit. (ISCA), Jun. 2014, pp. 97–108.
[156] M. Makni, S. Niar, M. Baklouti, and M. Abid, "HAPE: A high-level area-power estimation framework for FPGA-based accelerators," Microprocessors Microsyst., vol. 63, pp. 11–27, Nov. 2018.
[157] K. K. W. Poon, A. Yan, and S. J. E. Wilton, "A flexible power model for FPGAs," in Proc. 12th Int. Conf. Field Program. Logic Appl. (FPL), vol. 2438, 2002, pp. 312–321.
[158] C. Du and Y. Yamaguchi, "High-level synthesis design for stencil computations on FPGA with high bandwidth memory," Electronics, vol. 9, no. 8, p. 1275, Aug. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/8/1275
[159] M. Karp, A. Podobas, N. Jansson, T. Kenter, C. Plessl, P. Schlatter, and S. Markidis, "High-performance spectral element methods on field-programmable gate arrays: Implementation, evaluation, and future projection," in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), May 2021, pp. 1077–1086.
[160] K. Nagasu, K. Sano, F. Kono, and N. Nakasato, "FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis," J. Parallel Distrib. Comput., vol. 106, pp. 153–169, Aug. 2017.
[161] C. Du, I. Firmansyah, and Y. Yamaguchi, "FPGA-based computational fluid dynamics simulation architecture via high-level synthesis design method," in Proc. Int. Symp. Appl. Reconfigurable Comput., vol. 12083, 2020, pp. 232–246.
[162] E. Reggiani, G. Natale, C. Moroni, and M. D. Santambrogio, "An FPGA-based acceleration methodology and performance model for iterative stencils," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2018, pp. 115–122.
[163] M. Feickert and B. Nachman, "A living review of machine learning for particle physics," 2021, arXiv:2102.02770.
[164] A. M. C. Deiana, N. Tran, J. Agar, M. Blott, G. Di Guglielmo, J. Duarte, P. Harris, S. Hauck, M. Liu, M. S. Neubauer, and J. Ngadiuba, "Applications and techniques for fast machine learning in science," 2021, arXiv:2110.13041.
[165] S. L. Brunton, B. R. Noack, and P. Koumoutsakos, "Machine learning for fluid mechanics," Annu. Rev. Fluid Mech., vol. 52, no. 1, pp. 477–508, Jan. 2020.
[166] S. I. Venieris, A. Kouris, and C.-S. Bouganis, "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions," ACM Comput. Surv., vol. 51, no. 3, pp. 1–56, Jun. 2018.
[167] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2019, pp. 1–9.
[168] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey of machine learning accelerators," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2020, pp. 1–12.
[169] E. Reggiani, M. Rabozzi, A. M. Nestorov, A. Scolari, L. Stornaiuolo, and M. Santambrogio, "Pareto optimal design space exploration for accelerated CNN on FPGA," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2019, pp. 107–114.
[170] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, "Design space exploration of FPGA-based deep convolutional neural networks," in Proc. 21st Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2016, pp. 575–580.
[171] J. Xu, Z. Liu, J. Jiang, D. Yong, and S. Li, "CaFPGA: An automatic generation model for CNN accelerator," Microprocess. Microsyst., vol. 60, pp. 196–206, Jul. 2018.
[172] J. Shan, M. T. Lazarescu, J. Cortadella, L. Lavagno, and M. R. Casu, "CNN-on-AWS: Efficient allocation of multikernel applications on multi-FPGA platforms," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 2, pp. 301–314, Feb. 2021.
[173] S. O. Ayat, M. Khalil-Hani, and A. A.-H.-A. Rahman, "Optimizing FPGA-based CNN accelerator for energy efficiency with an extended roofline model," Turkish J. Electr. Eng. Comput. Sci., vol. 26, no. 2, pp. 919–935, Mar. 2018.
[174] L. Xie, X. Fan, W. Cao, and L. Wang, "High throughput CNN accelerator design based on FPGA," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 274–277.
[175] C. Park, S. Park, and C. S. Park, "Roofline-model-based design space exploration for dataflow techniques of CNN accelerators," IEEE Access, vol. 8, pp. 172509–172523, 2020.
[176] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Performance modeling for CNN inference accelerators on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 4, pp. 843–856, Apr. 2020.
[177] T. Geng, T. Wang, A. Li, X. Jin, and M. Herbordt, "FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters," 2019, arXiv:1901.01007.
[178] W. Zhao, H. Fu, W. Luk, T. Yu, S. Wang, B. Feng, Y. Ma, and G. Yang, "F-CNN: An FPGA-based framework for training convolutional neural networks," in Proc. IEEE 27th Int. Conf. Appl.-Specific Syst., Architectures Processors (ASAP), Jul. 2016, pp. 107–114.
[179] Y.-C. Lin, B. Zhang, and V. Prasanna, "HP-GNN: Generating high throughput GNN training implementation on CPU-FPGA heterogeneous platform," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2022, pp. 123–133.
[180] A. Ghaffari and Y. Savaria, "CNN2Gate: An implementation of convolutional neural networks inference on FPGAs with automated design space exploration," Electronics, vol. 9, no. 12, p. 2200, Dec. 2020.
[181] S. I. Venieris and C.-S. Bouganis, "FpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 291–292.
[182] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, "Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs," in Proc. Int. Symp. Field-Program. Gate Arrays, 2019, pp. 73–82.
[183] A. Biondi, A. Balsini, M. Pagani, E. Rossi, M. Marinoni, and G. Buttazzo, "A framework for supporting real-time applications on dynamic reconfigurable FPGAs," in Proc. IEEE Real-Time Syst. Symp. (RTSS), Nov. 2016, pp. 1–12.
[184] J. Mu, W. Zhang, H. Liang, and S. Sinha, "A collaborative framework for FPGA-based CNN design modeling and optimization," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), 2018, pp. 139–1397.
[185] O. Reiche, M. A. Özkan, R. Membarth, J. Teich, and F. Hannig, "Generating FPGA-based image processing accelerators with Hipacc," in Proc. Int. Conf. Computer-Aided Design (ICCAD), Nov. 2017, pp. 1026–1033.
[186] C. H. Yu, P. Wei, M. Grossman, P. Zhang, V. Sarker, and J. Cong, "S2FA: An accelerator automation framework for heterogeneous computing in datacenters," in Proc. Design Autom. Conf. (DAC), 2018, p. 153.
[187] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and Y. Lin, "AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2020, pp. 40–50.
[188] L. Ferretti, A. Cini, G. Zacharopoulos, C. Alippi, and L. Pozzi, "A graph deep learning framework for high-level synthesis design space exploration," 2021, arXiv:2111.14767.
[189] Q. Xu, T. Mytkowicz, and N. Kim, "Approximate computing: A survey," IEEE Des. Test, vol. 33, no. 1, pp. 8–22, Jan. 2016.
[190] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson, "The effect of compiler optimizations on high-level synthesis for FPGAs," in Proc. Annu. Int. Symp. Field-Program. Custom Comput. Mach., 2013, pp. 89–96.

ROMINA SOLEDAD MOLINA (Student Member, IEEE) received the master's (Master in Computer Science) degree from the Universidad Nacional de San Luis, Argentina. She is currently pursuing the Ph.D. degree in industrial and information engineering with the Università degli Studi di Trieste, under a Joint-Supervision Program with the Universidad Nacional de San Luis. Her main research interests include digital signal processing, digital control, image analysis, high-performance computing, machine learning, parallel computing models, FPGA, and SOC.

VERONICA GIL-COSTA is currently a Former Researcher at Yahoo! Labs Santiago hosted by the University of Chile. She is also an Associate Professor at the University of San Luis, a Researcher at the National Research Council (CONICET) of Argentina, and a Researcher at the CITIAPS, Chile. Her research work is on parallel computing and distributed systems, with applications in query processing and capacity planning for large scale systems.

MARÍA LIZ CRESPO is currently a Research Officer at The Abdus Salam International Centre for Theoretical Physics (ICTP) and an Associate Researcher of the Italian National Institute of Nuclear Physics (INFN), Trieste, Italy. She is also coordinating the research and training program of the Multidisciplinary Laboratory (MLab), ICTP. She has organized several international schools and workshops on fully programmable systems on chip for nuclear and scientific instrumentation. She is the coauthor of more than 100 scientific publications in prestigious peer-reviewed journals. Her main research interests include advanced scientific instrumentation for particle physics experiments and experimental multidisciplinary research.

GIOVANNI RAMPONI (Life Senior Member, IEEE) was born in 1956. Since 2000, he has been a Full Professor of electronics at the Department of Engineering and Architecture, University of Trieste, Italy. He is the co-inventor of international patents, and has published more than 200 papers in international journals, conference proceedings, and book chapters. His research interests include nonlinear digital signal processing, enhancement and feature extraction in images and image sequences, image visualization, image quality evaluation, and deep learning techniques for image processing. More information can be found at: www.units.it/ramponi.
Abstract—Large language models (LLMs) have catalyzed an upsurge in automatic code generation, garnering significant attention for register transfer level (RTL) code generation. Despite the potential of RTL code generation with natural language, it remains error-prone and limited to relatively small modules because of the substantial semantic gap between natural language expressions and hardware design intent. In response to these limitations, we propose a methodology that reduces the semantic gap by utilizing C/C++ for generating hardware designs via High-Level Synthesis (HLS) tools. Basically, we build a set of C-to-HLS optimization strategies catering to various code patterns, such as nested loops and local arrays. Then, we apply these strategies to sequential C/C++ code through in-context learning, which provides the LLMs with exemplary C/C++-to-HLS prompts. With this approach, HLS designs can be generated effectively. Since LLMs still face problems in determining the optimized pragma parameters precisely, we integrate a design space exploration (DSE) tool for pragma parameter tuning. Furthermore, we also employ profiling tools to pinpoint the performance bottlenecks within a program and selectively convert bottleneck components to HLS code for hardware acceleration. By combining the LLM-based profiling, C/C++-to-HLS translation, and DSE, we have established HLSPilot, the first LLM-enabled high-level synthesis framework, which can fully automate high-level application acceleration on hybrid CPU-FPGA architectures. According to our experiments on real-world application benchmarks, HLSPilot achieves comparable performance in general and can even outperform manually crafted counterparts, thereby underscoring the substantial promise of LLM-assisted hardware designs.

Index Terms—large language model, high-level synthesis, C-to-HLS, code generation.

This work is supported by the National Key R&D Program of China under Grant (2022YFB4500405), and the National Natural Science Foundation of China under Grant 62174162.

I. INTRODUCTION

Hardware designing is a demanding task requiring a high level of expertise. Traditional hardware design involves coding with a register transfer level (RTL) language. However, as the complexity of hardware increases continuously with the computing requirements of applications, RTL coding becomes exceedingly time-consuming and labor-intensive. The emergence of High-Level Synthesis (HLS) enables hardware design at higher abstraction levels [1]. HLS typically employs high-level languages like C/C++ for hardware description, allowing software engineers to also engage in hardware development, which significantly lowers the expertise barrier in hardware design. Designers can focus more on the applications and algorithms rather than the details of low-level hardware implementations. HLS tools automate design tasks such as concurrency analysis of algorithms, interface design, logic unit mapping, and data management, thereby substantially shortening the hardware design cycle.

While HLS offers numerous advantages such as higher development efficiency and lower design barriers [1] [2], there are still some issues in the real-world HLS-based hardware acceleration workflow [3]. Firstly, the overall analysis of the program is of great importance: determining the performance bottlenecks of the program and the co-design between CPU and FPGA remains a challenging issue. Besides, designs based on HLS still encounter a few major performance issues [4] [5]. Foremost, it still requires substantial optimization experience to craft high-quality HLS code and achieve the desired performance in practical development processes [6] [7]. In addition, HLS code often struggles to reach optimality due to the large design space of the various pragma parameters. Some design space exploration (DSE) tools have been proposed [8] [9] [10] [11] to automate the parameter tuning, but these tools do not fundamentally optimize the hardware design. High-quality HLS design turns out to be the major performance challenge from the perspective of general software designers. Some researchers have attempted to address this challenge by using pre-built templates for specific domain applications [12] [13] [14]. For example, ThunderGP [13] has designed a set of HLS-based templates for optimized graph processing accelerator generation, allowing designers to implement various graph algorithms by filling in the templates. However, it demands a comprehensive understanding of both the domain knowledge and the HLS development experience from designers, and there is still a lack of a well-established universal solution to obtain optimized HLS code. Bridging the gap between C/C++ and HLS remains a formidable challenge requiring further efforts.

Large Language Models (LLMs) have recently exhibited remarkable capabilities in various generative tasks, including text generation, machine translation, and code generation, underscoring their advanced learning and imitation skills. These advancements have opened up possibilities for addressing hardware design challenges. Researchers have begun applying LLMs to various hardware design tasks, including general-purpose processor designs, domain-specific accelerator designs, and arbitrary RTL code generation. Among these applications, it can be observed that neural network accelerator generation utilizing a predefined template, as reported in [15], reaches an almost 100% success rate. In contrast, generating register transfer level (RTL) code from natural language
descriptions, such as design specifications, experiences a considerably higher failure rate [16] [17]. This disparity is largely due to the semantic gap between the inputs and the anticipated outputs. Despite the imperfections, these works have demonstrated the great potential of exploring LLMs for hardware designing.

Inspired by prior works, we introduce HLSPilot, an automated framework that utilizes LLMs to generate and optimize HLS code from sequential C/C++ code. Instead of generating RTL code from natural language directly, HLSPilot mainly leverages LLMs to generate the C-like HLS code from C/C++ with a much narrower semantic gap, and eventually outputs RTL code using established HLS tools. Essentially, HLSPilot accomplishes RTL code generation from C/C++ without imposing hardware design tasks with a broad semantic gap on LLMs. Specifically, HLSPilot initiates the process with runtime profiling to pinpoint the code segments that are the performance bottleneck and require optimization. Subsequently, HLSPilot extracts the kernel code segments and applies appropriate HLS optimization strategies to the computing kernels to generate optimized HLS code. Then, HLSPilot employs a design space exploration (DSE) tool to fine-tune the parameters of the generated HLS design. Finally, HLSPilot leverages Xilinx OpenCL APIs to offload the compute kernels to the FPGA, facilitating the deployment of the entire algorithm on a hybrid CPU-FPGA architecture. In summary, LLMs are utilized throughout the entire hardware acceleration workflow, ranging from profiling, HW/SW partitioning, HLS code generation, and HLS code optimization to tool usage, thereby achieving a high degree of design automation.

The major contributions of this work are summarized as follows:
• We propose HLSPilot, the first automatic HLS code generation and optimization framework from sequential C/C++ code using LLMs. This framework investigates the use of LLMs for HLS design strategy learning and tool learning, and builds a complete hardware acceleration workflow ranging from runtime profiling, kernel identification, automatic HLS code generation, and design space exploration to HW/SW co-design on a hybrid CPU-FPGA computing architecture. The framework is open sourced on Github (https://github.com/xcw-1010/HLSPilot).
• We propose a retrieval-based approach to learn the HLS optimization techniques and examples from the Xilinx user manual and utilize an in-context learning approach to apply the learned HLS optimizations on serial C/C++ code and generate optimized HLS code with LLMs for various computing kernels.
• According to our experiments on an HLS benchmark, HLSPilot can generate optimized HLS code from sequential C/C++ code, and the resulting designs can outperform manual optimizations with the assistance of DSE tools in most cases. In addition, we also demonstrate the successful use of HLSPilot as a complete hardware acceleration workflow on a hybrid CPU-FPGA architecture with a case study.

II. RELATED WORK

A. LLM for Hardware Design

Recent works have begun to utilize LLMs to assist hardware designing from different angles [15], [16], [18]–[25]. Generating RTL code with natural language is a typical approach of hardware design with LLMs. For instance, VGen [18] leverages an open-source LLM, CodeGen [26], fine-tuned with a Verilog code corpus to generate Verilog code. Similarly, VerilogEval [19] enhances the LLM's capability to generate Verilog by constructing a supervised fine-tuning dataset; it also establishes a benchmark for evaluating LLM performance. ChipChat [24] achieves an 8-bit accumulator-based microprocessor design through multi-round natural language conversation. ChipGPT [16] proposes a four-stage zero-code logic design framework based on GPT for hardware design. These studies have successfully applied LLMs to practical hardware designing. However, these methods are mostly limited to small functional modules, and the success rate drops substantially when the hardware design gets larger. GPT4AIGchip proposed in [15] can also leverage LLMs to generate efficient AI accelerators based on a hardware template, but it relies on a pre-built hardware library that requires an intensive understanding of both the domain knowledge and the hardware design techniques, which can hinder its use by software developers. Recently, a domain-specific LLM for chip design, ChipNeMo [17], was proposed. ChipNeMo employs a series of domain-adaptive techniques to train an LLM capable of generating RTL code, writing EDA tool scripts, and summarizing bugs. While powerful, domain-specific LLMs face challenges such as high training costs and difficulties in data collection.

B. LLM for Code Generation

Code generation is one of the key applications of LLMs. A number of domain-specific LLMs such as CodeGen [26], CodeX [27], and CodeT5 [28] have been proposed to address the programming of popular languages such as C/C++, Python, and Java, which have large corpora available for pre-training and fine-tuning. In contrast, it can be challenging to collect sufficient corpora for less popular languages. VGen [18] collected and filtered a Verilog corpus from Github and textbooks, obtaining only hundreds of MB of corpus. Hence, prompt engineering in combination with in-context learning provides an attractive approach to leverage LLMs to generate code for domain-specific languages. For instance, the authors in [29] augment code generation by providing the language's Backus–Naur form (BNF) grammar within prompts.

III. HLSPILOT FRAMEWORK

The remarkable achievements of LLMs across a wide domain of applications inspire us to create an LLM-driven automatic hardware acceleration design framework tailored for a hybrid CPU-FPGA architecture. Unlike previous efforts that primarily focused on code generation, our objective is to harness the potential of LLMs to emulate the role of an expert engineer in hardware acceleration. Given that hardware acceleration on a hybrid CPU-FPGA architecture demands a set of different design tasks such as runtime profiling, compute kernel identification, compute kernel acceleration, design space exploration, and CPU-FPGA co-design, LLMs must understand the design guidelines and manipulate the relevant design tools to achieve the desired design objectives, akin to an engineer. Fortunately, LLMs have exhibited powerful capabilities in document comprehension, in-context learning, tool learning, and code generation, all of which align perfectly with the hardware acceleration design requirements. The intended design framework eventually provides an end-to-end high-level synthesis of sequential C/C++ code on a hybrid CPU-FPGA architecture, and is thus named HLSPilot; it will be elaborated in the rest of this section.

A. HLSPilot Overview

HLSPilot, as presented in Fig. 1, takes sequential C/C++ code as design input and mainly includes five major processing stages to generate an optimized hardware acceleration solution on a hybrid CPU-FPGA architecture.

[Fig. 1: HLSPilot overview. Stage 1 profiles the software code and produces a profiling report and the kernel to be optimized; stage 2 (sub-stages 2-1 and 2-2) refactors the kernel; stage 3 covers automated optimization strategy learning (3-1) and strategy retrieval and applying (3-2), where a retrieved strategy's introduction, application scenes, parameter description, and demos are assembled into a prompt together with a system prompt such as "You are an expert in FPGA...".]

Firstly, HLSPilot conducts runtime profiling on the high-level application code to identify the most time-consuming computing kernels, which will be the focus of subsequent optimization. In this work, we profile the target algorithm and analyze the execution time with gprof on a CPU system. Then, a detailed performance report will be generated as needed. With the report, we can conveniently obtain performance information such as the execution time distribution across the algorithm and the number of function calls. Since LLMs are capable of understanding and summarizing the textual reports, the time-consuming functions can be identified conveniently. HLSPilot extracts the computing kernels to be optimized in the next stage based on this profiling information.

Secondly, the computing kernels are organized as dependent tasks and pipelined accordingly. The dependent tasks can be implemented efficiently with the data flow mechanism supported by Xilinx HLS. While the compute kernels can be irregular, we propose a program-tree-based strategy to refactor the program structure of the compute kernels and generate an optimized task flow graph while ensuring equivalent code functionality. Details of the automatic task pipelining will be illustrated in Section III-B.

Thirdly, we start to optimize each task with HLS independently. While there are many distinct HLS optimization strategies applicable to different high-level code patterns, we create a set of HLS optimization strategies based on the Xilinx HLS user guide and leverage LLMs to select and apply the appropriate optimization strategies automatically based on the code patterns in each task. Details of the LLM-based automatic HLS optimization will be presented in Section III-C.

Fourthly, after the code refactoring and the application of various HLS pragmas, the HLS code can be obtained, but parameters such as the initiation interval (II) for pipelining, the factors of loop unrolling, and the size of array partitioning in the HLS code still need to be tuned to produce accelerators with higher performance. However, it remains rather challenging for LLMs to decide design parameters of
a complex design precisely. To address this issue, HLSPilot utilizes external tools to conduct the design space exploration and decides on the optimized solution automatically. According to recent research [30], LLMs are capable of learning and utilizing external APIs and tools efficiently. Hence, HLSPilot leverages LLMs to extract the parameters from the HLS code and invoke the DSE tool proposed in [31] by generating the corresponding execution scripts.

Finally, when the compute kernels are optimized with HLS, they can be compiled and deployed on FPGAs for hardware acceleration. Nonetheless, these accelerators must be integrated with a host processor to provide a holistic hardware acceleration solution. The acceleration system has both host code and device code, which will be executed on the CPU side and the FPGA side respectively. HLSPilot leverages LLMs to learn the APIs provided by the Xilinx runtime (XRT) to manage the FPGA-based accelerators and perform the data transfer between host memory and FPGA device memory. Then, it generates the host code mostly based on the original algorithm code and replaces the compute kernels with the compute APIs that will invoke the FPGA accelerators and the data movement APIs. The device code is mainly the HLS code generated in the prior steps. With both the host code and device code, the entire algorithm can be deployed on the hybrid CPU-FPGA architecture.

B. Program-Tree-based Task Pipelining

While the compute kernel can be quite complex, it needs to be split into multiple tasks for the sake of potential pipelining or parallel processing, which is critical to the performance of the generated accelerator. However, it is difficult to split the compute kernel appropriately, because inappropriate splitting may lead to imbalanced pipelining and low performance. In addition, the splitting usually causes code refactoring, which may produce code with inconsistent functionality and further complicate the problem. To address this problem, we propose a program-tree-based strategy to guide the LLM to produce fine-grained task splitting and pipelining.

The proposed program-tree-based task pipelining strategy is detailed in Algorithm 1. According to the strategy, the LLM iteratively decomposes the compute kernel into smaller tasks and eventually forms a tree structure. An input compute kernel C is denoted as the root node of the tree; hence, the initial node set of the tree is T = {C}. Then, the LLM decides whether each task in T can be further decomposed based on the complexity of the task code. If a decomposition is confirmed for task_i, the LLM will perform the code decomposition. The decomposition of non-loop tasks and loop tasks differs, and both are detailed later in this subsection. If the task cannot be further decomposed, task_i is added to T_new directly.

Algorithm 1: Program-tree-based Pipelining Strategy
Input: Top-level function code C
Output: Task collection T = {task_1, task_2, . . . , task_n}
1  T <- {C}
2  while T has a task that can be further split do
3      T_new <- {}
4      for task_i in T do
5          if the LLM decides to further split task_i then
6              1. For non-loop blocks: split the code based on the functionality of the statement execution
7              2. For loop blocks: split the code based on the minimum parallelizable loop granularity
8              Add the refactored code to T_new
9          else
10             Add task_i to T_new
11         end
12     end
13     T <- T_new
14 end

The major challenge of the program-tree-based task pipelining strategy is the task decomposition metric, which depends on the code structures and can vary substantially. As a result, the metric can be difficult to quantify. Instead of using a predetermined quantitative metric, we leverage LLMs to perform the task decomposition with natural language rules and typical decomposition examples. Specifically, for non-loop code, we have the LLM analyze the semantics of the code statements, recognize the purpose of these statements, and group statements performing the same function into a single task. For loop code, the decomposition is primarily based on the smallest loop granularity that can be executed in parallel. We take advantage of the in-context learning capabilities of LLMs and present a few representative decomposition examples to guide the task decomposition for general scenarios. These examples are detailed as follows.

1) Each iteration of the loop is considered as a task: In the original merge sort loop, each iteration processes all intervals of the same width. Therefore, each iteration can be regarded as a task. For example, task_i merges all intervals with a width equal to 2^i.

// before:
for (int width = 1; width < SIZE; width = 2 * width) {
    for (int i1 = 0; i1 < SIZE; i1 = i1 + 2 * width) {
        int i2 = i1 + width;
        int i3 = i1 + 2 * width;
        if (i2 >= SIZE) i2 = SIZE;
        if (i3 >= SIZE) i3 = SIZE;
        merge(A, i1, i2, i3, temp);
    }
}

// after:
for (int stage = 1; stage < STAGES - 1; stage++) {
    // merge all equally wide intervals
    merge_intervals(temp[stage - 1], width, temp[stage]);
    width *= 2;
}

2) The first and second halves of a loop's traversal are each considered as a task: In histogram statistics, since the
first and second halves of the loop can be executed in parallel, they are considered as two tasks.

// before:
for (int i = 0; i < INPUT_SIZE; i++) {
    val = in[i];
    hist[val] = hist[val] + 1;
}

// after:
for (int i = 0; i < INPUT_SIZE / 2; i++) {
    val = in1[i];
    hist1[val] = hist1[val] + 1;
}
for (int i = 0; i < INPUT_SIZE / 2; i++) {
    val = in2[i];
    hist2[val] = hist2[val] + 1;
}
histogram_reduce(hist1, hist2, hist);

3) Each level of a loop is considered as a task: In the BFS algorithm, there are two loops: the first loop is used to find the frontier vertex and read the corresponding rpao data, and the second loop is used to traverse the neighbors of the frontier vertex, so the kernel can be divided into two tasks accordingly.

// before:
loop1: for (int i = 0; i < vertex_num; i++) {
    char d = depth[i];
    if (d == level) {
        start = rpao[i];
        end = rpao[i + 1];
        loop2: for (int j = start; j < end; j++) {
            ngb_vidx = ciao[j];
            ngb_depth = depth[ngb_vidx];
            if (ngb_depth == -1) {
                depth[ngb_vidx] = level_plus1;
            }
        }
    }
}

// after:
void read_frontier_vertex(int *depth, int vertex_num, int level, int *rpao, ...) {
    ...
    for (int i = 0; i < vertex_num; i++) {
        if (depth[i] == level) {
            int start = rpao[i];
            int end = rpao[i + 1];
            start_stream << start;
            end_stream << end;
        }
    }
}

void traverse(hls::stream<int>& start_stream, hls::stream<int>& end_stream, ...) {
    ...
    while (!start_stream.empty() && !end_stream.empty()) {
        int start = start_stream.read();
        int end = end_stream.read();
        for (int j = start; j < end; j++) {
            ngb_vidx = ciao[j];
            ngb_depth = depth[ngb_vidx];
            if (ngb_depth == -1) {
                depth[ngb_vidx] = level_plus1;
            }
        }
    }
}

4) Multiple levels of loops are considered as a task: In video frame image convolution, there are a total of four levels of loops, where loop1 and loop2 are considered as the tasks for reading the pixels, and loop3 and loop4 are the tasks for calculating the convolution.

// before:
loop1: for (int line = 0; line < img_h; ++line) {
    loop2: for (int pixel = 0; pixel < img_w; ++pixel) {
        float sum_r = 0, sum_g = 0, sum_b = 0;
        loop3: for (int m = 0; m < coeff_size; ++m) {
            loop4: for (int n = 0; n < coeff_size; ++n) {
                int ii = line + m - center;
                int jj = pixel + n - center;
                if (ii >= 0 && ii < img_h && jj >= 0 && jj < img_w) {
                    sum_r += in[(ii * img_w) + jj].r * coeff[(m * coeff_size) + n];
                    sum_g += in[(ii * img_w) + jj].g * coeff[(m * coeff_size) + n];
                    sum_b += in[(ii * img_w) + jj].b * coeff[(m * coeff_size) + n];
                }
                ...
            }

// after:
void read_dataflow(hls::stream<RGBPixel>& read_stream, const RGBPixel *in,
                   int img_w, int elements, int half) {
    int pixel = 0;
    while (elements--) {
        read_stream << in[pixel++];
    }
    ...
}

void compute_dataflow(hls::stream<RGBPixel>& write_stream, hls::stream<RGBPixel>& read_stream,
                      const float* coefficient, int img_width, int elements, int center) {
    static RGBPixel window_mem[COEFFICIENT_SIZE][MAX_WIDTH];
    static fixed coef[COEFFICIENT_SIZE * COEFFICIENT_SIZE];
    for (int i = 0; i < COEFFICIENT_SIZE * COEFFICIENT_SIZE; i++) {
        coef[i] = coefficient[i];
    }
    ...
}

In order to demonstrate the proposed task decomposition strategy, we take BFS, with its relatively complex nested loop, as an example and present the generated program tree in Fig. 2. It shows that the nested loops in BFS are effectively identified and extracted as dependent tasks correctly.

When the tasks are decomposed, the corresponding code segments will be packed into a function and the code needs to be refactored accordingly. Before proceeding to the HLS acceleration, HLSPilot needs to check the correctness of the refactored code. Specifically, we compare the refactored code to the original code by testing the execution results to ensure the computing results are consistent. We follow a bottom-up testing strategy and start from the leaf nodes of the program tree. If an error occurs, it can be traced back to the erroneous leaf node and checked from its parent node. If errors persist
[Fig. 2: Program tree generated for the BFS kernel. The top-level bfs_kernel loop (traverse node, find frontier, process neighbors of the frontier) is decomposed into staged tasks such as read_frontier_vertex (stage 1: traverse nodes and find the frontier), load_depth (stage 1-1: load node depths into depth_inspect_stream), and load_frontier (stage 1-2: load the frontier according to depth).]

Upon retrieving a suitable optimization strategy, the strategy's parameter description information and optimization example information are integrated into the prompt, utilizing the LLM's in-context learning capabilities to generate optimized code.

[Figure: Strategy retrieval prompt. The selected strategies are inserted into the prompt together with the instruction "Please apply these strategies in appropriate places based on their descriptions and examples."]

IV. EXPERIMENT

A. Experiment Setting

In this section, we demonstrate the effectiveness of the HLSPilot framework for automatically generating and optimizing
It is generally accepted that a custom hardware implementation of a set of computations will provide supe-
rior speed and energy-efficiency relative to a software implementation. However, the cost and difficulty of
hardware design is often prohibitive, and consequently, a software approach is used for most applications.
In this paper, we introduce a new high-level synthesis tool called LegUp that allows software techniques to
be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the
program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware
accelerators that communicate through a standard bus interface. In the hybrid processor/accelerator archi-
tecture, program segments that are unsuitable for hardware implementation can execute in software on the
processor. LegUp can synthesize most of the C language to hardware, including fixed-sized multi-dimensional
arrays, structs, global variables and pointer arithmetic. Results show that the tool produces hardware so-
lutions of comparable quality to a commercial high-level synthesis tool. We also give results demonstrating
the ability of the tool to explore the hardware/software co-design space by varying the amount of a program
that runs in software vs. hardware. LegUp, along with a set of benchmark C programs, is open source and
freely downloadable, providing a powerful platform that can be leveraged for new research on a wide range
of high-level synthesis topics.
Categories and Subject Descriptors: B.7 [Integrated Circuits]: Design Aids
General Terms: Design, Algorithms
Additional Key Words and Phrases: High-level synthesis, field-programmable gate arrays, FPGAs, synthesis,
performance, power, hardware/software co-design
1. INTRODUCTION
Two approaches are possible for implementing computations: software (running on a stan-
dard processor) or hardware (custom circuits). A hardware implementation can provide
a significant improvement in speed and energy-efficiency versus a software implementa-
tion (e.g. [Cong and Zou 2009; Luu et al. 2009]). However, hardware design requires writing
complex RTL code, which is error prone and can be notoriously difficult to debug. Software
design, on the other hand, is comparatively straightforward, and mature debugging and
analysis tools are freely accessible. Despite the apparent energy and performance benefits,
hardware design is simply too difficult and costly for most applications, and a software
approach is preferred.
This work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada,
and Altera Corporation.
The authors are with the Dept. of Electrical and Computer Engineering, University of Toronto, Toronto,
ON M5S 3G4 CANADA. T. Czajkowski is with the Altera Toronto Technology Centre, Toronto, ON M5S
1S4 CANADA. E-mail: legup@eecg.toronto.edu
In this paper, we propose LegUp – an open source high-level synthesis (HLS) framework
we have developed that aims to provide the performance and energy benefits of hardware,
while retaining the ease-of-use associated with software. LegUp automatically compiles a
standard C program to target a hybrid FPGA-based software/hardware system-on-chip,
where some program segments execute on an FPGA-based 32-bit MIPS soft processor and
other program segments are automatically synthesized into FPGA circuits – hardware ac-
celerators – that communicate and work in tandem with the soft processor. Since the first
FPGAs appeared in the mid-1980s, access to the technology has been restricted to those
with hardware design skills. However, according to labor statistics, software engineers out-
number hardware engineers by more than 10X in the U.S. [United States Bureau of Labor
Statistics 2010]. An overarching goal of LegUp is to broaden the FPGA user base to include
software engineers, thereby expanding the scope of FPGA applications and growing the size
of the programmable hardware market – a goal we believe will keenly interest commercial
FPGA vendors and the embedded systems community.
The decision to include a soft processor in the target system is based on the notion that
not all C program code is appropriate for hardware implementation. Inherently sequential
computations are well-suited for software (e.g. traversing a linked list); whereas, other com-
putations are ideally suited for hardware (e.g. addition of integer arrays). Incorporating
a processor into the target platform also offers the advantage of increased high-level lan-
guage coverage – program segments that use restricted C language constructs can execute
on the processor (e.g. calls to malloc/free). We note that most prior work on high-level
hardware synthesis has focused on pure hardware implementations of C programs, not a
hybrid software/hardware system.
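To make the intended hardware/software split concrete, the following fragment is our own illustrative C (not taken from the LegUp distribution): pointer-chasing over a linked list is inherently sequential and suits the soft processor, while element-wise addition of integer arrays exposes independent iterations that a synthesis tool can unroll or pipeline.

#include <stddef.h>

struct node { int value; struct node *next; };

/* Well-suited to software on the MIPS soft processor:
 * each step depends on the pointer loaded in the previous step. */
int list_sum(const struct node *head) {
    int sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Well-suited to a hardware accelerator: independent iterations. */
void array_add(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}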
LegUp is written in modular C++ to permit easy experimentation with new HLS algo-
rithms. We leverage the state-of-the-art LLVM (low-level virtual machine) compiler frame-
work for high-level language parsing and its standard compiler optimizations [LLVM 2010],
and we implement hardware synthesis as new back-end compiler passes within LLVM. The
LegUp distribution includes a set of benchmark C programs [Hara et al. 2009] that the user
can compile to pure software, pure hardware, or a combined hardware/software system. For
the hardware portions, LegUp produces RTL code that can be synthesized using standard
commercial synthesis tools. In this paper, we present an experimental study demonstrat-
ing that LegUp produces hardware implementations of comparable quality to a commercial
tool [Y Explorations (XYI) 2010]. We also give results illustrating LegUp’s ability to effec-
tively explore the design space between a pure software implementation and pure hardware
implementation of a given program.
While the promise of high-level hardware synthesis has been touted for decades (consider
that Synopsys introduced its Behavioral Compiler tool in 1996), the technology has yet to
be embraced broadly by industry. We believe its widespread adoption has been impeded by
a number of factors, including a lack of comprehensive C/C++ language support, and, in
some cases, the use of non-standard languages (e.g., [Huang et al. 2008]). While a number
of research groups have developed high-level hardware synthesis tools, few have gained
sustained traction in the research community and the tools have been kept proprietary in
many cases. The open source nature of LegUp is a key differentiator relative to prior work.
Prior high-quality open source EDA projects have had a tremendous impact in spurring
new research advances. As an example, the VPR system has enabled countless studies on
FPGA architecture, packing, placement, and routing [Betz and Rose 1997]. Similarly, the
ABC logic synthesis system has reinvigorated low-level logic synthesis research [Mishchenko
et al. 2006]. High-level hardware synthesis and application-specific processor design can
likewise benefit from the availability of a robust publicly-accessible framework such as LegUp
– a framework used and contributed to by researchers around the world. In fact, at the time
of acceptance, the tool has been downloaded over 350 times by research groups around the
world (since March 2011).
A key usage scenario for the LegUp tool is in the area of FPGA-based embedded systems
design, which frequently include a soft processor [Wayne Marx 2008]. LegUp can improve
computational throughput and energy-efficiency of such systems by allowing computations
to be migrated from the processor to custom hardware. In addition, since LegUp can also
synthesize a program (or a subset of its constituent functions) to pure hardware, it can be
applied to implement the hardware accelerators in a “server style” processor/accelerator
platform, where a high-end processor communicates with FPGA-based accelerators over a
PCIe bus. While the server scenario is certainly possible, it is the embedded systems usage
model that is explored more heavily in this paper.
A preliminary version of a portion of this work appears in [Canis et al. 2011]. In this
extended journal version, we elaborate on all aspects of the proposed framework, including
background on the intermediate representation (IR) within the LLVM compiler, and how
programs represented in the IR are synthesized to hardware circuits. We describe the pro-
cessor/accelerator interconnection approach in further detail, as well as provide additional
information on the benchmark suite and debugging capabilities. Circuit-by-circuit experi-
mental results for speed, area and power are also included (whereas, only average data was
included in the 4-page conference version). We also describe how LegUp can be modified to
support different FPGA architectures, implement a new scheduling algorithm, and support
parallel accelerators.
The remainder of this paper is organized as follows: Section 2 presents related work.
Section 3 introduces the target hardware architecture and outlines the high-level design
flow. The details of the high-level synthesis tool and software/hardware partitioning are
described in Section 4. An experimental evaluation appears in Section 5. Section 6 presents
three case studies that serve to demonstrate the extensibility of the LegUp tool: 1) to
target an alternate FPGA device, 2) to evaluate a different scheduling algorithm, and 3) to
support concurrently running accelerators. Conclusions and suggestions for future work are
given in Section 7.
2. RELATED WORK
2.1. High-Level Synthesis
Automatic compilation of a high-level language program to silicon has been a decades-long
quest in the EDA field, with early seminal work done in the 1980s. We highlight several
recent efforts, with emphasis on tools that target FPGAs.
Several HLS tools have been developed for targeting specific applications. GAUT is a
high-level synthesis tool that is designed for DSP applications [Coussy et al. 2010]. GAUT
synthesizes a C program into an architecture with a processing unit, a memory unit, and
a communication unit, and requires that the user supply specific constraints, such as the
pipeline initiation interval.
ROCCC is an open source high level synthesis tool that can create hardware accelerators
from C [Villarreal et al. 2010]. ROCCC is designed to accelerate critical kernels that perform
repeated computation on streams of data, for instance DSP applications such as FIR filters.
ROCCC does not support several commonly-used aspects of the C language, such as generic
pointers, shifting by a variable amount, non-for loops, and the ternary operator. ROCCC
has a bottom-up development process that involves partitioning one’s application into mod-
ules and systems. Modules are C functions that are converted into computational datapaths
with no FSM, with loops fully unrolled. These modules cannot access memory but have data
pushed to them and output scalar values. Systems are C functions that instantiate modules
to repeat computation on a stream of data or a window of memory, and usually consist of
a loop nest with special function parameters for streams. ROCCC supports advanced op-
timizations such as systolic array generation, temporal common subexpression elimination,
and it can generate Xilinx PCore modules to be used with a Xilinx MicroBlaze proces-
sor. However, ROCCC’s strict subset of C is insufficient for compiling any of the CHStone
benchmarks used in this study and described in Section 4.5. Broadly speaking, ROCCC
works and excels for a specific class of applications (streaming-oriented applications), but it
is not a general C-to-hardware compiler. By supporting the CHStone benchmarks, LegUp
provides researchers with the opportunity to compile larger C programs than is possible
with ROCCC.
General (application-agnostic) tools have also been proposed in recent years. CHiMPS
is a tool developed by Xilinx and the University of Washington that synthesizes programs
into a many cache architecture, taking advantage of the abundant small block RAMs avail-
able throughout the FPGA fabric [Putnam et al. 2008]. LiquidMetal is a compiler being
developed at IBM Research comprising a HLS compiler and a new (non-standard) language,
LIME, that incorporates hardware-specific constructs, such as bitwidth specification on in-
tegers [Huang et al. 2008]. xPilot is a tool that was developed at UCLA [Cong et al. 2006]
and used successfully for a number of HLS studies (e.g., [Chen and Cong 2004]). Trident is
a tool developed at Los Alamos National Labs, with a focus on supporting floating point
operations [Tripp et al. 2007]. xPilot and Trident have not been under active development
for several years and are no longer maintained.
Among prior academic work, the Warp Processor proposed by Vahid, Stitt and Lysecky
bears the most similarity to our framework [Vahid et al. 2008]. In a Warp Processor, soft-
ware running on a processor is profiled during its execution. The profiling results guide the
selection of program segments to be synthesized to hardware. Such segments are disassem-
bled from the software binary to a higher-level representation, which is then synthesized to
hardware [Stitt and Vahid 2007]. The software binary running on the processor is altered
automatically to leverage the generated hardware. We take a somewhat similar approach,
with the key differences being that we compile hardware from the high-level language source
code (not from a disassembled binary) and our tool is open source.
With regard to commercial tools, there has been considerable activity in recent years,
both in start-ups and major EDA vendors. Current offerings include AutoPilot from Au-
toESL [AutoESL ] (a commercial version of xPilot, recently acquired by Xilinx, Inc.), Cata-
pult C from Mentor Graphics [Mentor Graphics 2010], C2R from CebaTech [CebaTech 2010],
eXCite from Y Explorations [Y Explorations (XYI) 2010], CoDeveloper from Impulse Ac-
celerated Technologies [Impulse 2010], Cynthesizer from Forte [Forte 2010], and C-to-Silicon
from Cadence [Cadence 2010]. In our experience, attaining a binary executable for evalu-
ation has not been possible for most tools.
Also on the commercial front is Altera’s C2H tool [Altera, Corp. 2009]. C2H allows a
user to partition a C program’s functions into a hardware set and a software set, where
the software-designated functions execute on a Nios II soft processor, and the hardware-
designated functions are synthesized into custom hardware accelerators that connect to the
Nios II through an Avalon interface (Altera’s on-chip interconnect standard). The C2H
target system architecture closely resembles that targeted by our tool.
Table I shows the release status of each non-commercial tool surveyed above, indicating
whether each is: 1) open source, 2) binary only (i.e., only the binary is publicly available),
or 3) no source or binary available. Tools in category #2 cannot be modified by the research
community to explore new HLS algorithms or new processor/accelerator design styles. Re-
sults produced by tools in category #3 cannot be independently replicated. In the open
source category, the Trident tool was based on an early version of LLVM; however, it has not been actively maintained for several years, and it targeted pure hardware and not a hybrid hardware/processor architecture. ROCCC is actively being worked on; however, it targets a feed-forward pipeline hardware architecture model. To our knowledge, there is currently no open source HLS tool that compiles a standard C program to a hybrid processor/accelerator system architecture, where the synthesized hardware follows a general datapath/state machine model. By supporting nearly all of the commonly-used aspects of the C language, as evidenced by the CHStone benchmark programs [Hara et al. 2009], LegUp provides researchers with the infrastructure needed to compile larger and more general C programs than those supported by ROCCC. Section 6 describes case studies that demonstrate the tool's extensibility.

[Fig. 1: LegUp design flow. The C program is compiled by the MIPS C compiler to run on a self-profiling MIPS processor; profiling data (execution cycles, power, cache misses) suggest program segments to target to hardware; LegUp high-level synthesis hardens those program segments for the FPGA fabric, and the software binary is altered to call the hardware accelerators.]
3. LEGUP OVERVIEW
In this section, we provide a high-level overview of the LegUp design flow and its target
architecture. Algorithmic and implementation details follow in Section 4.
[Fig. 2: LegUp target system architecture. Within the FPGA, a MIPS processor and hardware accelerators are connected through the Avalon interconnect to an on-chip cache and a memory controller that accesses off-chip memory.]
The architecture depicted in Figure 2 represents the target system most natural for an
initial release of the tool. We expect the shared memory to become a bottleneck if many
processors and accelerators are included in the system. The architecture of processor/ac-
celerator systems is an important direction for future research – research enabled by a
framework such as LegUp – with key questions being the investigation of the best on-chip
connectivity and memory architecture. Moreover, in our initial release, the processor and
accelerators share a single clock signal. Multi-clock domain processor/accelerator systems-on-chip are an important avenue to explore.
4.1.1. Low-Level Virtual Machine (LLVM). LegUp leverages the low-level virtual machine
(LLVM) compiler framework – the same framework used by Apple for iPhone/iPad ap-
plication development. At the core of LLVM is an intermediate representation (IR), which
is essentially machine-independent assembly language. C code is translated into LLVM’s
IR then analyzed and modified by a series of compiler optimization passes. Current re-
sults show that LLVM produces code of comparable quality to gcc for x86-based processor
architectures.
Consider an 8-tap finite impulse response (FIR) filter whose output, y[n], is a weighted
sum of the current input sample, x[n] and seven previous input samples. The C code for
calculating the FIR response is given in Figure 3. The unoptimized LLVM IR corresponding
to this C code is given in Figure 4. We highlight a few key elements of the IR here. The LLVM
IR is in single static assignment (SSA) form, which prohibits variable re-use, guaranteeing
a 1-to-1 correspondence between an instruction and its destination register. Register names
in the IR are prefixed by %. Types are explicit in the IR. For example, i32 specifies a 32-bit
integer type and i32* specifies a pointer to a 32-bit integer.
y[n] = 0;
for(i = 0; i < 8; i++) {
y[n] += coeff[i] * x[n - i];
}
In the example IR for the FIR filter in Figure 4, line 1 marks the beginning of a basic
block called entry. A basic block is a contiguous set of instructions with a single entry (at
its beginning) and exit point (at its end). Lines 2 and 3 initialize y[n] to 0. Line 4 is an
unconditional branch to a basic block called bb1 that begins on line 5. phi instructions
are needed to handle control flow-dependent variables in SSA form. For example, the phi
instruction on line 6 assigns loop index register %i to 0 if the previous basic block was
entry; otherwise, %i is assigned to register %i.new, which contains the incremented %i from
the previous loop iteration. Line 7 initializes a pointer to the coefficient array. Lines 8 and
9 initialize a pointer to the sample array x. Lines 10-12 load the sum y[n], sample and
coefficient into registers. Lines 13 and 14 perform the multiply-accumulate. The result is
stored in line 15. Line 16 increments the loop index %i. Lines 17 and 18 compare %i with
loop limit (8) and branch accordingly.
Observe that LLVM instructions are simple enough to directly correspond to hardware
operations (e.g., a load from memory, or an arithmetic computation). Our HLS tool operates
directly with the LLVM IR, scheduling the instructions into specific clock cycles (described
below).
Scheduling operations in hardware requires knowing data dependencies between opera-
tions. Fortunately, the SSA form of the LLVM IR makes this easy. For example, the multiply
instruction (mul) on line 13 of Figure 4 depends on the results of two load instructions on
lines 11 and 12. Memory data dependencies are more problematic to discern; however, LLVM
includes alias analysis – a compiler technique for determining which memory locations a
pointer can reference. In Figure 4, the store on line 15 has a write-after-read dependency
with the load on line 10, but has no memory dependencies with the loads on lines 11 and
12. Alias analysis can determine that these instructions are independent and can therefore
be performed in parallel.
Transformations and optimizations in the LLVM framework are structured as a series
of compiler passes. Passes include optimizations such as dead code elimination, analysis
passes such as alias analysis, and back-end passes that produce assembly for a particular
target machine (e.g. MIPS or ARM). The infrastructure is flexible, allowing passes to be
reordered, substituted with alternatives, and disabled. LegUp HLS algorithms have been
implemented as LLVM passes that fit into the existing framework. Implementing the HLS
steps as distinct passes also allows easy experimentation with different HLS algorithms. For
example, one could modify LegUp to “plug in” a new scheduling algorithm and study its
impact on quality of results.
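As a rough illustration of this "plug in a pass" workflow, the skeleton below shows a minimal legacy-pass-manager FunctionPass in the style of the LLVM documentation. It is our own sketch, not code from LegUp: the header paths and registration follow recent LLVM releases rather than the LLVM version LegUp builds against, and HelloScheduler is a hypothetical name.

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// Hypothetical pass: a place where an alternative HLS step
// (e.g. a new scheduler) could inspect each function's IR.
struct HelloScheduler : public FunctionPass {
  static char ID;
  HelloScheduler() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    errs() << "scheduling function: " << F.getName() << "\n";
    // ... analyze basic blocks, assign operations to states ...
    return false; // the IR is not modified in this sketch
  }
};
} // namespace

char HelloScheduler::ID = 0;
static RegisterPass<HelloScheduler>
    X("hello-scheduler", "Illustrative HLS scheduling pass skeleton");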
4.1.2. Device Characterization. For a given FPGA family, LegUp includes scripts to pre-
characterize the hardware operation corresponding to each LLVM instruction for all sup-
ported bitwidths (typically, 8, 16, 32, 64). The scripts synthesize each operation in isolation
for the target FPGA family to determine the propagation delay, required number of logic
elements, registers, multiplier blocks, and power consumption. This characterization data
allows LegUp to make early predictions of circuit speed and area for the hardware acceler-
ators and also to aid scheduling and binding.
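One plausible shape for the pre-characterization data is a table keyed by operation and bitwidth that the scheduler and binder consult for delay and area estimates. The sketch below is our own illustration, not LegUp's actual data structures, and the numbers are placeholders rather than measured values.

// Placeholder characterization record for one operation at one bitwidth.
struct OpCharacterization {
    double delay_ns;      // propagation delay from isolated synthesis
    int    logic_elements;
    int    registers;
    int    multiplier_blocks;
    double power_mw;      // estimated power
};

enum OpKind { OP_ADD, OP_MUL, OP_DIV, OP_LOAD, NUM_OPS };
enum Width  { W8, W16, W32, W64, NUM_WIDTHS };

// Example table with made-up entries; a real flow would fill this
// from the per-device characterization scripts.
static const OpCharacterization kCharTable[NUM_OPS][NUM_WIDTHS] = {
    /* OP_ADD */ {{1.2, 8, 8, 0, 0.1}, {1.6, 16, 16, 0, 0.2},
                  {2.1, 32, 32, 0, 0.4}, {2.9, 64, 64, 0, 0.8}},
    // ... remaining operations elided ...
};

inline double estimated_delay(OpKind op, Width w) {
    return kCharTable[op][w].delay_ns;
}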
4.1.3. Allocation. The purpose of allocation is to determine the amount of hardware that
may be used to implement the circuit. LegUp reads allocation information from a configura-
tion Tcl file, which specifies the target FPGA device and the resource limits for the device,
1: entry:
2: %y.addr = getelementptr i32* %y, i32 %n
3: store i32 0, i32* %y.addr
4: br label %bb1
5: bb1:
6: %i = phi i32 [ 0, %entry ], [ %i.new, %bb1 ]
7: %coeff.addr = getelementptr [8 x i32]* %coeff,
i32 0, i32 %i
8: %x.ind = sub i32 %n, %i
9: %x.addr = getelementptr i32* %x, i32 %x.ind
10: %0 = load i32* %y.addr
11: %1 = load i32* %coeff.addr
12: %2 = load i32* %x.addr
13: %3 = mul i32 %1, %2
14: %4 = add i32 %0, %3
15: store i32 %4, i32* %y.addr
16: %i.new = add i32 %i, 1
17: %exitcond = icmp eq i32 %i.new, 8
18: br i1 %exitcond, label %return, label %bb1
19:return:
e.g. the number of available multiplier blocks. In general, LegUp HLS operates as though
an unlimited amount of resources are available in the target FPGA. The reason for this is
that resource sharing (i.e. using a single hardware unit to implement multiple operations
within the program being synthesized) requires adding multiplexers to the input ports of a
shared hardware unit, and multiplexers are costly to implement in FPGAs. For example, a
32-bit adder can be implemented using 32 4-input LUTs (and associated carry logic), and
32 2-to-1 multiplexers also require 32 4-input LUTs – the same number of LUTs as the
adder itself! Thus, for the allocation step, LegUp does the following:
— Multiply: Hard multiplier blocks in the FPGA fabric are used. Sharing multipliers is only
done when the benchmark being synthesized requires more multipliers than are available
in the FPGA.
— Divide/Modulus: These operations are implemented with LUTs, and consume significant
area. Therefore, we set the number of divide/remainder units to be the maximum number
used in any cycle of the schedule. Multiplexers are added to the input ports of the unit(s)
to facilitate the resource sharing (described below in the binding section).
4.1.4. Scheduling. Scheduling is the task of assigning operations to clock cycles and building
a finite state machine (FSM). A control flow graph (CFG) of a program is a directed graph
where basic blocks are represented by vertices and branches are represented by edges. For
example, given two basic blocks, b1 and b2 , b1 has an edge to b2 in the CFG if b1 can
branch to b2 . We can think of a CFG as a coarse representation of the FSM needed to
control the hardware being synthesized – the nodes and edges are analogous to those of a
state diagram. What is not represented in this coarse FSM are data dependencies between
operations within a basic block and the latencies of operations (e.g., a memory access may
take more than a single cycle).
Having constructed the coarse FSM from the CFG, LegUp then schedules each basic block
individually, which amounts to splitting each node in the CFG into multiple nodes, each
corresponding to one FSM state (clock cycle). The initial release of LegUp uses as-soon-as-
possible (ASAP) scheduling [Gajski et al. 1992], which assigns an instruction to
the first state after all of its dependencies have been computed. Traversing basic blocks, and
visiting the instructions within each basic block in order, the operands for each instruction
are either: 1) from this basic block and therefore guaranteed to have already been assigned
a state, or 2) from outside this basic block, in which case we can safely assume they will be
available before control reaches this basic block. Note that our scheduler properly handles
instructions with multi-cycle latencies, such as pipelined divides or memory accesses.
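The following is a simplified sketch of ASAP scheduling in the spirit described above. It is our own illustration over a toy dependence representation, not LegUp's implementation, and it ignores chaining and resource constraints such as the single-ported memory controller.

#include <vector>
#include <algorithm>

// Toy instruction: indices of operand-producing instructions in the same
// basic block, plus a fixed latency in cycles (e.g. 2 for loads).
struct Instr {
    std::vector<int> deps;
    int latency;
};

// ASAP: each instruction starts in the first state after all of its
// in-block dependencies have finished; operands defined outside the
// block are assumed available at state 0.
std::vector<int> asap_schedule(const std::vector<Instr> &block) {
    std::vector<int> state(block.size(), 0);
    for (size_t i = 0; i < block.size(); ++i) {
        for (int d : block[i].deps)
            state[i] = std::max(state[i], state[d] + block[d].latency);
    }
    return state; // state[i] = FSM state (clock cycle) of instruction i
}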
In some cases, we can schedule an instruction into the same state as one of its operands.
This is called operation chaining. We perform chaining in cases where the estimated delay of
the chained operations (from allocation) does not exceed the estimated clock period for the
design. Chaining can reduce hardware latency (# of cycles for execution) and save registers
without impacting the final clock period.
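A minimal sketch of the chaining test implied above (our own illustration; the delay and clock-period figures would come from the characterization data): two dependent operations may share a state only if their combined combinational delay fits within the estimated clock period.

// Returns true if operation B may be chained into the same FSM state as
// its producer A without stretching the estimated clock period.
inline bool can_chain(double delay_a_ns, double delay_b_ns,
                      double clock_period_ns) {
    return delay_a_ns + delay_b_ns <= clock_period_ns;
}

// Example: a 2.1 ns add chained with a 1.6 ns compare under a 5 ns clock
// is allowed, since 2.1 + 1.6 <= 5.0.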
Fig. 5 is a Gantt chart showing the ASAP schedule of the FIR filter instructions shown
in Fig. 4. The chart shows the same LLVM instructions, now organized into nine states.
Data dependencies between operations are shown; in this case we do not allow operation
chaining (for clarity). Load instructions have a two cycle latency, allowing us to pipeline
our memory controller for higher speed performance. Once a load has been issued, a new
load can be issued on the next cycle. Because our memory controller is single ported, only
one load can be performed every cycle.
4.1.5. Binding. Binding comprises two tasks: assigning operators from the program being
synthesized to specific hardware units (operation assignment), and assigning program vari-
ables to registers (register allocation). When multiple operators are assigned to the same
hardware unit, or when multiple variables are bound to the same register, multiplexers are
required to facilitate the sharing. We make two FPGA-specific observations in our approach
to binding. First, multiplexers are relatively expensive to implement in FPGAs using LUTs.
Consequently, there is little advantage to sharing all but the largest functional units, namely,
multipliers and dividers. Likewise, the FPGA fabric is register rich – each logic element in
the fabric has a LUT and a register. Therefore, sharing registers is rarely justified.
We have three goals when binding operations to shared functional units. First, we would
like to balance the sizes of the multiplexers across functional units to keep circuit perfor-
mance high. Multiplexers with more inputs have higher delay, so it is desirable to avoid
having a functional unit with a disproportionately large multiplexer on its input. Second,
we want to recognize cases where we have shared inputs between operations, letting us
save a multiplexer if the operations are assigned to the same functional unit. Lastly, during
binding if we can assign two operations that have non-overlapping lifetime intervals to the
same functional unit, we can use a single output register for both operations. In this case
we save a register, without needing a multiplexer. We use the LLVM live variable analysis
pass to check for the lifetime intervals.
To account for these goals we use the following cost function to measure the benefit of
assigning operation op to function unit fu:
Cost(op, fu) = φ · existingMuxInputs(fu) + β · newMuxInputs(op, fu)
               − θ · outputRegisterSharable(op, fu)                    (1)
where φ = 0.1, β = 1, and θ = 0.5 to give priority to saving new multiplexer inputs, then
output registers, and finally balancing the multiplexers. Notice that sharing the output
register reduces the cost, while the other factors increase it.
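Equation (1) translates directly into code. The sketch below is our own rendering of it with the stated weights; the three inputs would be computed from the current partial binding, and the function name is illustrative rather than LegUp's.

// Cost of assigning operation `op` to functional unit `fu`, following Eq. (1).
// existing_mux_inputs: inputs already multiplexed onto fu
// new_mux_inputs:      additional mux inputs this assignment would add
// output_reg_sharable: 1 if op can reuse fu's output register
//                      (non-overlapping lifetimes), 0 otherwise
double binding_cost(int existing_mux_inputs, int new_mux_inputs,
                    int output_reg_sharable) {
    const double phi = 0.1, beta = 1.0, theta = 0.5;
    return phi * existing_mux_inputs
         + beta * new_mux_inputs
         - theta * output_reg_sharable;
}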
The initial release of LegUp uses a weighted bipartite matching heuristic to solve the
binding problem [Huang et al. ]. The binding problem is represented using a bipartite graph
with two vertex sets. The first vertex set corresponds to the operations being bound (i.e.
LLVM instructions). The second vertex set corresponds to the available functional units.
A weighted edge is introduced from a vertex in the first set to a vertex in the second set
if the corresponding operation is a candidate to be bound to the corresponding functional
unit. We set the cost (edge weight) of assigning an operation to a functional unit using (1).
Weighted bipartite matching can be solved optimally in polynomial time using the well-
known Hungarian method [Kuhn 2010]. We formulate and solve the matching problem one
clock cycle at a time until the operations in all clock cycles (states) have been bound.
LegUp automatically generates the multiplexing logic to interpret the tags and steer memory
requests. Tag 000000000 is reserved for the NULL pointer, and tag 000000001 indicates that
the memory access should be steered to the shared memory. The remaining 510 different
tags can be used to differentiate between up to 510 local accelerator memories. Using 9 bits
for the tag implies that 23 bits are available for encoding the address. The decision to use
9-bit tags in the initial release of LegUp was taken because the Altera DE2 board contains
an 8 MB SDRAM which is fully addressable using 23 bits. It is straightforward to change
LegUp to use a different tag width if desired.
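A 32-bit pointer under this scheme can be pictured as a 9-bit tag concatenated with a 23-bit offset. The helper functions below show the corresponding encode/decode arithmetic; the names are illustrative and the constants simply restate the widths described above.

    // Illustrative encode/decode of tagged pointers as described above:
    // bits [31:23] = 9-bit tag, bits [22:0] = 23-bit offset (covers the 8 MB SDRAM).
    #include <cstdint>

    constexpr uint32_t TAG_BITS    = 9;
    constexpr uint32_t OFFSET_BITS = 23;
    constexpr uint32_t OFFSET_MASK = (1u << OFFSET_BITS) - 1;   // 0x007FFFFF

    constexpr uint32_t TAG_NULL   = 0;   // reserved for the NULL pointer
    constexpr uint32_t TAG_SHARED = 1;   // steer the access to the shared memory

    uint32_t makeTaggedPointer(uint32_t tag, uint32_t offset) {
      return (tag << OFFSET_BITS) | (offset & OFFSET_MASK);
    }

    uint32_t tagOf(uint32_t ptr)    { return ptr >> OFFSET_BITS; }
    uint32_t offsetOf(uint32_t ptr) { return ptr & OFFSET_MASK; }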
6.7% overhead on the MIPS processor area when configured to track up to 32 functions
using 32-bit counters. Complete details on the profiler, including how it can be extended
to profile energy consumption, are omitted for lack of space, but can be found in [Aldham
et al. 2011].
Supported Unsupported
Functions Dynamic Memory
Arrays, Structs Floating Point
Global Variables Recursion
Pointer Arithmetic
Unlike many HLS tools, synthesis of fixed-size multi-dimensional arrays, structs, global vari-
ables, and pointer arithmetic are supported by LegUp. Regarding structs, LegUp supports
structs with arrays, arrays of structs, and structs containing pointers. LegUp stores structs
in memory using the ANSI C alignment standards. Functions that return a struct, dynamic
memory allocation, recursion and floating point arithmetic are unsupported in the initial
release of the tool.
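The fragment below illustrates the kind of constructs listed as supported (a global array, a struct containing both a fixed-size array and a pointer, and pointer arithmetic). It is a generic example written for this text, not a program taken from the LegUp distribution.

    /* Generic example of constructs in the supported C subset: global
     * variables, structs with arrays, structs containing pointers, and
     * pointer arithmetic. No dynamic memory, recursion, or floating point. */
    typedef struct {
        int coeffs[4];     /* struct with a fixed-size array */
        int *data;         /* struct containing a pointer    */
    } Filter;

    int samples[8] = {1, 2, 3, 4, 5, 6, 7, 8};     /* global array           */
    Filter f = { {1, 2, 2, 1}, samples };          /* global struct variable */

    int weighted_sum(void) {
        int acc = 0;
        int *p = f.data;                           /* pointer arithmetic below */
        for (int i = 0; i < 4; i++)
            acc += f.coeffs[i] * *(p + i);
        return acc;
    }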
With the LegUp distribution, we include 13 benchmark C programs, summarized in
Table III. Included are all 12 programs in the CHStone high-level synthesis benchmark
suite [Hara et al. 2009], as well as Dhrystone – a standard integer benchmark. The pro-
grams represent a diverse set of computations falling into several categories: arithmetic,
encryption, media, processing and general. They range in size from 232-1692 lines of C
code. The arithmetic benchmarks implement 64-bit double-precision floating-point opera-
tions in software using integer types. Notice that the CHStone suite contains a benchmark
which is a software model of a MIPS processor (which we can then run on a MIPS processor).
A key characteristic of the benchmarks is that inputs and expected outputs are included
in the programs themselves. The presence of the inputs and golden outputs for each pro-
gram gives us assurance regarding the correctness of our synthesis results. Each benchmark
program performs computations whose results are then checked against golden values. This
is analogous to built-in self test in design-for-test methodology. No inputs (e.g. from the
keyboard or a file) are required to run the programs.
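The self-checking structure can be illustrated with a small generic program of the same shape (compute, compare against an embedded golden value, report pass or fail). This is just the pattern, not one of the CHStone benchmarks.

    // Generic self-checking benchmark skeleton: inputs and the expected
    // (golden) result are embedded in the program itself.
    #include <cstdio>

    static const int input[8]   = {3, 1, 4, 1, 5, 9, 2, 6};
    static const int golden_sum = 31;                 // known-correct result

    int compute(void) {
      int sum = 0;
      for (int i = 0; i < 8; i++) sum += input[i];
      return sum;
    }

    int main(void) {
      int result = compute();
      std::printf(result == golden_sum ? "PASS\n" : "FAIL\n");
      return result == golden_sum ? 0 : 1;
    }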
4.6. Debugging
The initial release of LegUp includes a basic debugging capability which consists of auto-
matically adding print statements into the LLVM IR to dump variable values at the end of
each basic block’s execution. When the IR is synthesized to hardware, the Verilog can be
simulated using ModelSim producing a log of variable value changes that can be directly
compared with an analogous log from a strictly software execution of a benchmark. We
found even this limited capability to be quite useful, as it allows one to pinpoint the first
LLVM instruction where computed values differ in hardware vs. software, aiding problem
diagnosis and debugging.
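The comparison step itself can be as simple as a line-by-line diff of the two value dumps. The sketch below finds the first line where a hardware-simulation log and a software-execution log disagree; the log format and default file names are hypothetical and are not those produced by LegUp's scripts.

    // Find the first divergence between a hardware (ModelSim) value dump and
    // the corresponding software dump, assuming one value per line in each log.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char** argv) {
      std::ifstream hw(argc > 1 ? argv[1] : "hw_modelsim.log");
      std::ifstream sw(argc > 2 ? argv[2] : "sw_native.log");

      std::string hwLine, swLine;
      for (long lineNo = 1; ; ++lineNo) {
        bool hwOk = static_cast<bool>(std::getline(hw, hwLine));
        bool swOk = static_cast<bool>(std::getline(sw, swLine));
        if (!hwOk || !swOk) {
          std::cout << "Logs match for " << lineNo - 1 << " lines\n";
          return 0;
        }
        if (hwLine != swLine) {
          std::cout << "First mismatch at line " << lineNo << ":\n"
                    << "  HW: " << hwLine << "\n  SW: " << swLine << "\n";
          return 1;
        }
      }
    }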
5. EXPERIMENTS
The goals of our experimental study are three-fold: 1) to demonstrate that the quality of
results (speed, area, power) produced by LegUp HLS is comparable to that produced by
a commercial HLS tool, eXCite [Y Explorations (XYI) 2010], 2) to demonstrate LegUp’s
ability to effectively explore the hardware/software co-design space, and 3) to compare the
quality of hardware vs. software implementations of the benchmark programs. We chose
eXCite because it was the only commercial tool we had access to that could compile the
benchmark programs. With the above goals in mind, we map each benchmark program
using 5 different flows, representing implementations with successively increasing amounts
of computation happening in hardware vs. software. The flows are as follows (labels appear
in parentheses):
(1) A software-only implementation running on the MIPS soft processor (MIPS-SW).
(2) A hybrid software/hardware implementation where the second most compute-intensive function (and its descendants) in the benchmark is implemented as a hardware accelerator, with the balance of the benchmark running in software on the MIPS processor (LegUp-Hybrid2).
(3) A hybrid software/hardware implementation where the most compute-intensive function (and its descendants) is implemented as a hardware accelerator, with the balance in software (LegUp-Hybrid1).
(4) A pure hardware implementation produced by LegUp (LegUp-HW).
(5) A pure hardware implementation produced by eXCite (eXCite-HW).
The two hybrid flows correspond to a system that includes the MIPS processor and a
single accelerator, where the accelerator implements a C function and all of its descendant
functions.
For the back-end of the flow, we use Quartus II ver. 9.1 SP2 to target the Cyclone II
FPGA. Quartus II was executed in timing-driven mode with all physical synthesis optimizations turned on. The correctness of the LegUp implementations was verified using
post-routed ModelSim simulations and also in hardware using the Altera DE2 board.
Table IV. Speed performance results for the five flows (Cycles: latency in clock cycles; Freq.: post-routed clock frequency in MHz; Time: total execution time in µs).
Benchmark | MIPS-SW: Cycles Freq. Time | LegUp-Hybrid2: Cycles Freq. Time | LegUp-Hybrid1: Cycles Freq. Time | LegUp-HW: Cycles Freq. Time | eXCite-HW: Cycles Freq. Time
adpcm 193607 74.26 2607 159883 61.61 2595 96948 57.19 1695 36795 45.79 804 21992 28.88 761
aes 73777 74.26 993 55014 54.97 1001 26878 49.52 543 14022 60.72 231 55679 50.96 1093
blowfish 954563 74.26 12854 680343 63.21 10763 319931 63.7 5022 209866 65.41 3208 209614 35.86 5845
dfadd 16496 74.26 222 14672 75.01 196 5649 77.41 73 2330 124.05 19 370 24.54 15
dfdiv 71507 74.26 963 15973 77.92 205 4538 65.92 69 2144 74.72 29 2029 43.95 46
dfmul 6796 74.26 92 10784 75.58 143 2471 79.14 31 347 85.62 4 223 49.17 5
dfsin 2993369 74.26 40309 293031 65.66 4463 80678 68.23 1182 67466 62.64 1077 49709 40.06 1241
gsm 39108 74.26 527 29500 61.46 480 18505 61.14 303 6656 58.93 113 5739 41.82 137
jpeg 29802639 74.26 401328 16072954 51.2 313925 15978127 46.65 342511 5861516 47.09 124475 3248488 22.66 143358
mips 43384 74.26 584 6463 75.51 86 6463 75.51 86 6443 90.09 72 4344 76.25 57
motion 36753 74.26 495 34859 73.34 475 17017 79.67 214 8578 91.79 93 2268 42.87 53
sha 1209523 74.26 16288 358405 77.40 4631 265221 75.76 3508 247738 86.93 2850 238009 62.48 3809
dhrystone 28855 74.26 389 25599 77.64 330 25509 76.99 331 10202 85.38 119 - - -
Geomean: 173332.0 74.26 2334.1 86258.3 67.10 1285.9 42700.5 65.65 650.3 20853.8 71.56 291.7 14594.4 40.87 357.1
Ratio: 1 1 1 0.50 0.90 0.55 0.25 0.88 0.28 0.12 0.96 0.12 0.08 0.55 0.15
Three metrics are employed to gauge quality of result: 1) circuit speed, 2) area, and
3) energy consumption. For circuit speed, we consider the cycle latency, clock frequency
and total execution time. Cycle latency refers to the number of clock cycles required for a
complete execution of a benchmark. Clock frequency refers to the reciprocal of the post-
routed critical path delay reported by Altera timing analysis. Total execution time is simply
the cycle latency multiplied by the clock period. For area, we consider the number of used
Cyclone II logic elements (LEs), memory bits, and 9x9 multipliers.
Energy is a key cost metric, as it directly impacts electricity costs, as well as influences
battery life in mobile settings. To measure energy, we use Altera’s PowerPlay power analyzer
tool, applied to the routed design. We gather switching activity data for each benchmark
through a post-route full delay simulation with Mentor Graphics’ ModelSim. ModelSim
produces a VCD (value change dump) file containing activity data for each design signal.
PowerPlay reads the VCD to produce a power estimate for each design. To compute the
total energy consumed by a benchmark for its computational work, we multiply the average
core dynamic power reported by PowerPlay with the benchmark’s total execution time.
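Stated as a formula, with hypothetical numbers used purely for illustration:

    E_bench = P_core,dyn × t_exec,   e.g. 100 mW × 2 ms = 200 µJ.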
5.1. Results
Table IV presents speed performance results for all circuits and flows. Three data columns
are given for each flow: Cycles contains the latency in number of clock cycles; Freq presents the post-routed clock frequency in MHz; Time gives the total execution time in µs (Cycles/Freq). The flows are presented in the order specified above, from pure software on
the left, to pure hardware on the right. The second last row of the table contains geometric
mean results for each column. The dhrystone benchmark was excluded from the geomean
calculations, as eXCite was not able to compile this benchmark. The last row of the table
presents the ratio of the geomean relative to the software flow (MIPS-SW ).
Beginning with the MIPS-SW flow, the data in Table IV indicates that the processor runs
at 74 MHz on the Cyclone II and the benchmarks take between 6.7K-29M cycles to complete
their execution. In terms of program execution time, this corresponds to a range of 92-401K
µS4 . In the LegUp-Hybrid2 flow, where the second most compute-intensive function (and
its descendants) is implemented as a hardware accelerator, the number of cycles needed for
execution is reduced by 50% compared with software, on average. The Hybrid2 circuits run
at 10% lower frequency than the processor, on average. Overall, LegUp-Hybrid2 provides
a 45% (1.8×) speed-up in program execution time vs. software (MIPS-SW ). Moving onto
the LegUp-Hybrid1 flow, which represents additional computations in hardware, Table IV
4 As a comparison, we also ran the benchmarks on the Altera NIOS II/f (fast) soft processor and found the
NIOS II performance to be about twice as fast as Tiger MIPS. Note, however, that NIOS II is not open
source, has a 6-stage pipeline and is specially tuned for Altera devices, whereas, Tiger MIPS has a 5-stage
pipeline and is not optimized for any particular FPGA device architecture.
shows that cycle latency is 75% lower than software alone. However, clock speed is 12%
worse for this flow, which when combined with latency, results in a 72% reduction in program
execution time vs. software (a 3.6× speed-up over software). Looking broadly at the data for
MIPS-SW, LegUp-Hybrid1 and LegUp-Hybrid2, we observe a trend: execution time decreases
substantially as more computations are mapped to hardware. Note that the MIPS processor
would certainly run at a higher clock speed on a 40/45 nm FPGA, e.g. Stratix IV, however
the accelerators would also speed-up commensurately.
The two right-most flows in Table IV correspond to pure hardware implementations. Ob-
serve that benchmark programs mapped using the LegUp-HW flow require just 12% of the
clock cycles of the software implementations, on average, yet they run at about the same
speed in MHz. When benchmarks are mapped using eXCite-HW, even fewer clock cycles are
required to complete their execution – just 8% of that required for software implementations.
However, implementations produced by eXCite run at 45% lower clock frequency than the
MIPS processor, on average. LegUp produces heavily pipelined hardware implementations,
whereas we believe eXCite does more operation chaining, resulting in fewer computation cycles yet longer critical path delays. Considering total execution time of a benchmark, LegUp
and eXCite offer similar results. LegUp-HW provides an 88% execution time improvement
vs. software (8× speed-up); eXCite-HW provides an 85% improvement (6.7× speed-up).
Both of the pure hardware implementations are a significant win over software. The most
favorable LegUp results were for the dfdiv and dfsin benchmarks, for which the speed-up
over pure software was over 30×. The benchmark execution times of LegUp implementa-
tions relative to eXCite are comparable, which bodes well for our framework and gives us
assurance that it produces implementations of reasonable quality.
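The percentage improvements and the quoted speed-ups are two views of the same ratio: a fractional reduction r in execution time corresponds to a 1/(1 − r) speed-up,

    speed-up = t_SW / t_HW = 1 / (1 − r),   1/(1 − 0.88) ≈ 8×,   1/(1 − 0.85) ≈ 6.7×,

which matches the LegUp-HW and eXCite-HW figures quoted above.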
Observe that neither of the hybrid scenarios provide a performance win over pure hard-
ware for these particular benchmark circuits. Moreover, none of the benchmarks use C
language constructs that are unsupported by LegUp. Nevertheless, the hybrid scenarios
do serve to demonstrate LegUp’s ability to synthesize working systems that contain both
hardware and software aspects.
It is worth highlighting a few anomalous results in Table IV. Comparing LegUp-HW
with eXCite-HW for the benchmark aes, LegUp’s implementation provides a nearly 5×
improvement over eXCite in terms of execution time. Conversely, for the motion benchmark,
LegUp’s implementation requires nearly 4× more cycles than eXCite’s implementation. We
believe such differences lie in the extent of pipelining used by LegUp vs. eXCite, especially
for arithmetic operations such as division. In LegUp, we pipeline arithmetic units to the
maximum extent possible, leading to higher cycle latencies, and improved clock periods.
Area results are provided for each circuit in Table V. For each flow, three columns provide
the number of Cyclone II logic elements (LEs), the number of memory bits used (# bits),
as well as the number of 9x9 multipliers (Mults). As in the performance data above, the
geometric mean and ratios relative to MIPS software alone are given in the last two rows
of Table V. Observe that some columns contain a 0 for one or more circuits, invalidating
the geomean calculation. To calculate the geomean for such columns, the 0’s were taken to
be 1’s5 .
Beginning with the area of the MIPS processor, the data in Table V shows it requires
12.2K LEs, 226K memory bits, and 16 multipliers. The hybrid flows include both the MIPS
processor, as well as custom hardware, and consequently, they consume considerably more
area. When the LegUp-Hybrid2 flow is used, the number of LEs, memory bits, and multi-
pliers increase by 2.23×, 1.14×, and 2.68×, respectively, in Hybrid2 vs. the MIPS processor
alone, on average. The LegUp-Hybrid1 flow requires even more area: 2.75× LEs, 1.16×
memory bits, and 3.18× multipliers vs. MIPS. Note that link time optimization in LLVM
was disabled for the hybrid flows, as was necessary to preserve the integrity of the function
5 This convention is used in life sciences studies.
Table V. Area results for the five flows (LEs: Cyclone II logic elements; # bits: memory bits; Mults: 9x9 multipliers).
Benchmark | MIPS-SW: LEs # bits Mults | LegUp-Hybrid2: LEs # bits Mults | LegUp-Hybrid1: LEs # bits Mults | LegUp-HW: LEs # bits Mults | eXCite-HW: LEs # bits Mults
adpcm 12243 226009 16 25628 242944 152 46301 242944 300 22605 29120 300 16654 6572 28
aes 12243 226009 16 56042 244800 32 68031 245824 40 28490 38336 0 46562 18688 0
blowfish 12243 226009 16 25030 341888 16 31020 342752 16 15064 150816 0 31045 33944 0
dfadd 12243 226009 16 22544 233664 16 26148 233472 16 8881 17120 0 9416 0 0
dfdiv 12243 226009 16 28583 226009 46 36946 233472 78 20159 12416 62 9482 0 32
dfmul 12243 226009 16 16149 226009 48 20284 233472 48 4861 12032 32 4536 0 26
dfsin 12243 226009 16 34695 233472 78 54450 233632 116 38933 12864 100 22274 0 38
gsm 12243 226009 16 25148 232576 114 30808 233296 142 19131 11168 70 6114 3280 2
jpeg 12243 226009 16 46432 338096 252 64441 354544 254 46224 253936 172 30420 105278 20
mips 12243 226009 16 18857 230304 24 18857 230304 24 4479 4480 8 2260 3072 8
motion 12243 226009 16 28761 243104 16 18013 242880 16 13238 34752 0 20476 16384 0
sha 12243 226009 16 20382 359136 16 29754 359136 16 12483 134368 0 13684 3072 0
dhrystone 12243 226009 16 15220 226009 16 16310 226009 16 4985 82008 0 - - -
Geomean: 12243 226009 16 27248 258526 43 33629 261260 51 15646 28822 12 13101 496 5
Ratio: 1 1 1 2.23 1.14 2.68 2.75 1.16 3.18 1.28 0.13 0.72 1.07 0.00 0.32
Fig. 9. Speed and area summary for each flow: geometric mean execution time (left axis, µs) and number of LEs (right axis).
boundaries. However, link time optimization was enabled for the MIPS-SW and LegUp-
HW flows, permitting greater compiler optimization for such flows, possibly improving area
and speed.
Turning to the pure hardware flows in Table V, the LegUp-HW flow implementations
require 28% more LEs than the MIPS processor on average; the eXCite-HW implementa-
tions require 7% more LEs than the processor. In other words, on the key area metric of the
number of LEs, LegUp implementations require 19% more LEs than eXCite, on average.
We consider the results to be quite encouraging, given that this is the initial release of an
open source academic HLS tool. In terms of memory bits, both the LegUp-HW flow and
the eXCite-HW flow require much fewer memory bits than the MIPS processor alone. For
the benchmarks that require embedded multipliers, the LegUp-HW implementations use
more multipliers than the eXCite-HW implementations, which we believe is due to more
extensive multiplier sharing in the binding phase of eXCite.
Figure 9 summarizes the speed and area results. The left vertical axis represents geometric
mean execution time; the right axis represents area (number of LEs). Observe that execution
time drops as more computations are implemented in hardware. While the data shows
that pure hardware implementations offer superior speed performance to pure software or
hybrid implementations, the plot demonstrates LegUp’s usefulness as a tool for exploring
the hardware/software co-design space. One can multiply the delay and area values to
produce an area-delay product. On such a metric, LegUp-HW and eXCite-HW are nearly
identical (∼4.6M µS-LEs vs. ∼4.7M µS-LEs) – LegUp-HW requires more LEs vs. eXCite-
HW, however, it offers better speed, producing a roughly equivalent area-delay product.
The area-delay product parity with eXCite gives us further confidence that the HLS results
produced by LegUp are competitive with commercial tools.
Figure 10 presents the geometric mean energy results for each flow. The energy results
bear similarity to the trends observed for execution time, though the trends here are even
more pronounced. Energy is reduced drastically as computations are increasingly imple-
mented in hardware vs. software. The LegUp-Hybrid2 and LegUp-Hybrid1 flows use 47%
and 76% less energy than the MIPS-SW flow, respectively, representing 1.9× and 4.2× en-
ergy reductions. The pure hardware flows are even more promising from the energy stand-
point. With LegUp-HW, the benchmarks use 94% less energy than if they are implemented
with the MIPS-SW flow (an 18× reduction). The eXCite results are similar. Pure hardware
benchmark implementations produced by eXCite use over 95% less energy than software im-
plementations (a 22× reduction). The energy results are promising, especially since energy
was not a specific focus of our initial release.
Porting LegUp to an alternative FPGA device for pure hardware HLS is straightforward,
however, supporting the hybrid processor/accelerator scenario on a non-Altera device is
more involved. In particular, the Tiger MIPS processor makes use of Altera megafunctions
for memory, division and multiplication. The megafunctions would need to be changed to ref-
erence the corresponding modules for the alternate FPGA vendor. Moreover, as described in
Section 3.2, the LegUp hybrid platform uses the Altera Avalon interface for processor/ac-
celerator communication. If a Xilinx FPGA were targeted, processor/accelerator system
generation and communication would need to be modified to use the Xilinx EDK tool and
PLB bus [PLB 2011]. The PLB and Avalon interfaces are quite similar however, as both are
memory-mapped master/slave bus interfaces. We therefore see no significant barriers that
would prevent LegUp from targeting a Xilinx device.
Fig. 12. SDC scheduling results for Cyclone II with various clock period constraints (bars represent performance in MHz; the line represents latency in clock cycles).
The LegUp implementation has a scheduler DAG object that, essentially, annotates each
LLVM instruction with data relevant to its scheduling: its combinational delay, as charac-
terized in the target FPGA, and the instructions on which it depends. The scheduler DAG
object can be viewed as an overlay on the dataflow graph with scheduling-specific infor-
mation. The object contains all of the information needed for us to generate the SDC LP
formulation. After solving the LP, we deposit the cycle assignment for each instruction into
another LegUp data structure called the scheduler mapping. For each LLVM instruction, the
mapping holds the scheduled cycle number. Following scheduling, FSM generation accesses
the mapping object to construct the FSM.
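As a rough illustration of what the LP hand-off looks like, the sketch below walks a list of scheduling dependences and emits SDC difference constraints of the form s_succ − s_pred ≥ c. The data structures are invented for this example; the generated constraints would then be passed to an LP solver (lp_solve appears in the reference list). The clock-period constraint follows the standard SDC formulation of limiting chained combinational delay to the target period, simplified here to individual edges.

    // Build SDC difference constraints (s_v - s_u >= c) from a dependence list.
    // Simplified stand-in for the scheduler DAG described above; the resulting
    // constraints would be handed to an LP solver such as lp_solve.
    #include <cmath>
    #include <vector>

    struct Dep {
      int pred, succ;        // instruction indices
      int latencyCycles;     // cycles the predecessor needs before its result is ready
      double pathDelayNs;    // combinational delay if the two ops were chained
    };

    struct Constraint { int u, v; int minDiff; };  // s_v - s_u >= minDiff

    std::vector<Constraint> buildSdcConstraints(const std::vector<Dep>& deps,
                                                double clockPeriodNs) {
      std::vector<Constraint> cons;
      for (const Dep& d : deps) {
        // Data dependence: the successor starts after the predecessor's latency.
        cons.push_back({d.pred, d.succ, d.latencyCycles});

        // Clock-period constraint: if chaining the pair would exceed the target
        // period P, force extra cycles between their start times.
        int extra = static_cast<int>(std::ceil(d.pathDelayNs / clockPeriodNs)) - 1;
        if (extra > 0)
          cons.push_back({d.pred, d.succ, extra});
      }
      return cons;
    }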
Fig. 12 shows SDC scheduling results for Cyclone II, demonstrating the impact of running
SDC with different clock period constraints. The left axis (bar) gives the geometric mean
post-routed clock frequency across the 12 CHStone circuits and dhrystone; the right axis
(line) gives the geometric mean latency (# of clock cycles to execute). The four datapoints
show SDC scheduling results for clock period constraints of 20, 15, 10, and 7.5 ns, respec-
tively. Observe that circuit clock frequency increases as P is decreased, which demonstrates
the effectiveness of SDC, as well as provides confidence in our operator speed characteriza-
tion. Note that P is a minimum clock period constraint – no effort is made to actually slow
circuits down. Hence, for the P = 20 ns datapoint, the circuits run considerably faster than
50 MHz. As P is decreased, the circuits are more heavily pipelined and take larger numbers
of cycles to execute.
SDC scheduling will be made LegUp’s default scheduling algorithm in a subsequent re-
lease.
6.3. Parallel Accelerators
As a last case study, we demonstrate the capability of LegUp to synthesize multi-accelerator
systems. As a proof-of-concept application, we use array addition for four 1000-element
arrays. Three parallelization scenarios were evaluated: 1) pure software with the MIPS
processor performing all of the work, 2) a single accelerator, called by the processor, per-
forming each of the four array additions sequentially, and, 3) four accelerators, operating
in parallel, with each accelerator performing the addition for one of the four arrays. In the
multi-accelerator case, the processor signals each accelerator to start its work and polls until
all four have completed. We found that a single accelerator doing all of the work sequentially
provides a 5.2× speedup over the pure software case. Using four parallel accelerators yields
a 3.7× speedup vs. using a single accelerator. While this is a simple application, with no po-
tential cache coherency issues, it serves to illustrate that concurrently running accelerators
are feasible with LegUp – a topic we plan to explore further in future work.
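The start-and-poll protocol can be sketched as writes and reads of memory-mapped accelerator registers. The register addresses and layout below are made up for illustration only and do not correspond to the actual Avalon address map used in the case study.

    /* Hypothetical memory-mapped control for four parallel accelerators:
     * a START register and a DONE register per accelerator. All addresses
     * are invented for illustration. */
    #include <stdint.h>

    #define NUM_ACCELS 4
    static volatile uint32_t *const ACCEL_START[NUM_ACCELS] = {
        (uint32_t *)0xC8000000, (uint32_t *)0xC8000020,
        (uint32_t *)0xC8000040, (uint32_t *)0xC8000060};
    static volatile uint32_t *const ACCEL_DONE[NUM_ACCELS] = {
        (uint32_t *)0xC8000010, (uint32_t *)0xC8000030,
        (uint32_t *)0xC8000050, (uint32_t *)0xC8000070};

    void run_parallel_accels(void) {
      /* Kick off all four accelerators, one per array. */
      for (int i = 0; i < NUM_ACCELS; i++)
        *ACCEL_START[i] = 1;

      /* Poll until every accelerator reports completion. */
      for (int i = 0; i < NUM_ACCELS; i++)
        while (*ACCEL_DONE[i] == 0)
          ; /* busy-wait */
    }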
Acknowledgements
The authors thank Dr. Tedd Hadley from Y Explorations for providing the eXCite tool
used in the experimental study. The authors gratefully acknowledge the comments of the
anonymous reviewers that have significantly improved the manuscript.
REFERENCES
2011. CoreConnect, Xilinx, Inc. http://www.xilinx.com/support/documentation/ipembedprocess coreconnect.htm.
2011. lp solve linear programming solver. http://lpsolve.sourceforge.net/5.5/ .
2011. VTR – the Verilog-to-routing project for FPGAs. http://www.eecg.toronto.edu/vtr/ .
Aldham, M., Anderson, J., Brown, S., and Canis, A. 2011. Low-cost hardware profiling of run-time
and energy in FPGA embedded processors. In IEEE Int’l Conference on Application-specific Systems,
Architecture and Processors (ASAP). 61–68.
Altera, Corp. 2009. Nios II C2H Compiler User Guide. Altera, Corp., San Jose, CA.
Altera, Corp. 2010. Avalon Interface Specification. Altera, Corp., San Jose, CA.
Altera, Corp. 2011. Stratix IV FPGA Family Data Sheet. Altera, Corp., San Jose, CA.
AutoESL. AutoESL Design Technologies, Inc. (http://www.autoesl.com). AutoESL.
Betz, V. and Rose, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. In Int’l
Workshop on Field Programmable Logic and Applications. 213–222.
Tripp, J., Gokhale, M., and Peterson, K. 2007. Trident: From high-level language to hardware circuitry.
IEEE Computer 40, 3, 28–37.
United States Bureau of Labor Statistics 2010. Occupational Outlook Handbook 2010-2011 Edition. United
States Bureau of Labor Statistics.
University of Cambridge 2010. The Tiger MIPS processor (http://www.cl.cam.ac.uk/teaching/
0910/ECAD+Arch/mips.html). University of Cambridge.
Vahid, F., Stitt, G., and Lysecky, R. 2008. Warp processing: Dynamic translation of binaries to FPGA circuits.
IEEE Computer 41, 7, 40–46.
Villarreal, J., Park, A., Najjar, W., and Halstead, R. 2010. Designing modular hardware accelerators
in C with ROCCC 2.0. In IEEE Int’l Symposium on Field-Programmable Custom Computing Machines.
127–134.
Wayne Marx, V. A. 2008. FPGAs Are Everywhere - In Design, Test & Control. RTC Magazine.
Y Explorations (XYI) 2010. eXCite C to RTL Behavioral Synthesis 4.1(a). Y Explorations (XYI), San
Jose, CA.
LegUp: High-Level Synthesis for FPGA-Based
Processor/Accelerator Systems
Andrew Canis1 , Jongsok Choi1 , Mark Aldham1 , Victor Zhang1 , Ahmed Kammoona1 ,
Jason Anderson1 , Stephen Brown1 , and Tomasz Czajkowski‡
1 ECE Department, University of Toronto, Toronto, ON, Canada
‡ Altera Toronto Technology Centre, Toronto, ON, Canada
legup@eecg.toronto.edu
can explore the hardware/software design space, where some portions of a program run on a processor, and others as custom hardware circuits. LegUp, along with its suite of benchmark C programs, is a powerful open source platform for HLS research that we expect will enable a variety of research advances in hardware synthesis, as well as in hardware/software co-design. LegUp is available for download at: http://www.legup.org.
*{yz882,zhiruz}@cornell.edu
ABSTRACT
Modern high-level synthesis (HLS) tools greatly reduce the turnaround time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.

ACM Reference Format:
Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, Zhiru Zhang. 2018. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs. In FPGA '18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 25–27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3174243.3174255

⋆ Udit, Gustavo, and Wenping conducted this research when they were affiliated with or visiting Cornell.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
FPGA '18, February 25–27, 2018, Monterey, CA, USA. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5614-5/18/02...$15.00. https://doi.org/10.1145/3174243.3174255

1 INTRODUCTION
Field-programmable gate arrays (FPGAs) have become an attractive option for realizing specialized accelerators thanks to their reconfigurability, massive fine-grained parallelism, and performance per watt advantage. With the extreme-scale integration of modern system-on-chip (SoC) and escalating design complexity of emerging applications, designing at a higher level of abstraction has become crucial to achieving high productivity. To address this challenge, high-level synthesis (HLS) tools have emerged to allow application developers to describe the hardware accelerator using common software programming languages like C/C++ by automatically generating RTL from behavioral descriptions [7, 14]. With the recent advances on HLS techniques and algorithms, modern HLS tools enable designers to explore optimization opportunities that are infeasible at the register-transfer level.
Programming FPGAs with HLS tools is drastically different from writing traditional software code. HLS users typically need to apply many optimization pragmas/directives to meet design constraints. The success of such manual optimization often requires nontrivial hardware design knowledge. For example, in image/video processing, the right combination of SRAM-based line buffers and shift registers is needed to achieve the ideal throughput and resource usage for pipelining the stencil code in hardware. With a more complex dataflow structure, the user needs to further calculate and specify the right FIFO depth to obtain the best pipeline rate without causing too much area overhead. However, these advanced HLS optimizations are rarely used or even required in the existing HLS benchmark suites (e.g., [11], [23]), which primarily include relatively small kernels that are designed to test some of the basic capabilities of an HLS tool such as the synthesis support of high-level language constructs. In addition, for HLS tool developers and the HLS research community at large, there is also a growing demand for a common set of realistic and complex designs to evaluate the efficacy of new synthesis techniques.
To this end, we introduce Rosetta1 — a suite of realistic HLS benchmarks for software programmable FPGAs. Rosetta includes popular machine learning workloads such as logistic regression and neural network inference, as well as real-time video processing applications including image rendering and face detection. Unlike previous efforts, Rosetta presents fully developed applications instead of small kernel programs, and specifies realistic design constraints for each

1 Rosetta gets the name following the convention of a plethora of "stone" benchmark suites. It also symbolizes that our benchmarks are specified in multiple languages (i.e., C++, OpenCL) and useful for evaluating HLS across different tools and platforms.
application. These design constraints are satisfied by applying advanced optimizations of state-of-the-art HLS tools, which are not exercised by existing benchmark suites. With these features, Rosetta is not only a set of practical benchmarks for the HLS community, but also a design tutorial on how to build specialized FPGA accelerators with advanced HLS optimizations. More concretely, our main contributions are threefold:
• We design and present Rosetta, which couples a range of realistic applications with real-world design constraints under different programming models. Current Rosetta designs are written in C++ and OpenCL. The synthesized hardware accelerators are tested on both embedded and cloud FPGA platforms.
• Rosetta demonstrates how to effectively apply advanced optimizations provided by modern HLS tools to meet the design constraints and achieve high quality of results. Examples of these optimizations include fixed-point optimization, dataflow pipelining, and data reuse through customized memory.
• The proposed benchmark suite is freely available in open-source format. We plan to continuously improve Rosetta by strengthening current cases and adding new applications from other domains.
The rest of this paper is organized as follows: in Section 2, we introduce related work on HLS benchmarking and optimizations; Section 3 outlines the Rosetta applications and key HLS optimization techniques leveraged by them; details of each benchmark are described in Section 4; we show our experimental results in Section 5, and conclude this work in Section 6.

In particular, state-of-the-art HLS tools provide many advanced features for achieving high design quality. Examples include arbitrary-precision datatypes, parameterized hardware data structures (e.g., line buffers), and hierarchical dataflow pipelining. These features are often used in combination with other common HLS optimizations such as unrolling, loop pipelining [9, 15, 37], and array partitioning [30, 41]. Moreover, they are typically applied across multiple kernels exhibiting different characteristics to meet the stringent application-level design constraints.
We believe that a new set of full-application benchmarks is desirable to enable more realistic performance reporting of HLS tools and FPGA-based acceleration. Along this line, Liu et al. [16] conducted a comprehensive case study on an H.264 decoder, and they have open sourced their HLS implementation. Rosetta goes one step further by providing a suite of application benchmarks that can be used to (1) facilitate comparisons between HLS tools, (2) evaluate new synthesis techniques, and (3) establish meaningful baselines to track progress of the HLS and FPGA technologies. Each application in Rosetta includes a set of enforceable application-level design constraints based on real-world specifications. These constraints model the realistic use cases for FPGA-based hardware accelerators, which helps standardize the evaluation of future advancements in HLS tools. Furthermore, the applications in Rosetta leverage advanced features of HLS tools to achieve high quality of results (QoRs) across a distinct set of hardware designs. Hence these benchmarks can also serve as useful design tutorials for FPGA programmers to build high-performance hardware accelerators using HLS.
Table 1: The current set of the Rosetta applications — Rosetta contains both compute-bound and memory-bound applications with
different workloads. Kernels in each application expose different sources of parallelism: SLP = subword-level parallelism; DLP = data-level
parallelism; ILP = instruction-level parallelism. Different types of parallelism available in each compute kernel are listed in parentheses.
Application | Categorization | Major Compute Kernels | Major HLS Optimizations
3D Rendering | Video processing; Compute bound; Integer operation intensive | Integer arithmetics (ILP) | Dataflow pipelining; Communication customization
Digit Recognition | Machine learning; Compute bound; Bitwise operation intensive | Hamming distance (SLP, DLP, ILP); KNN voting (ILP) | Loop unrolling; Loop pipelining
Spam Filtering | Machine learning; Memory bound; Fixed-point arithmetic intensive | Dot product (DLP, ILP); Scalar multiplication (DLP, ILP); Vector addition (DLP, ILP); Sigmoid function (ILP) | Dataflow pipelining; Datatype customization; Communication customization
Optical Flow | Video processing; Memory bound; Floating-point arithmetic intensive | 1D convolution (DLP, ILP); Outer product (DLP, ILP) | Dataflow pipelining; Memory customization; Communication customization
Binarized Neural Network (BNN) [39] | Machine learning; Compute bound; Bitwise operation intensive | Binarized 2D convolution (SLP, DLP, ILP); Binarized dot product (SLP, DLP, ILP) | Memory customization; Datatype customization; Communication customization
Face Detection [25] | Video processing; Compute bound; Integer arithmetic intensive | Image scaling (DLP, ILP); Cascaded classifiers (DLP, ILP) | Memory customization; Datatype customization
• Compute customization – Compute customization improves the latency and/or throughput of the design through parallelization and pipelining. Loop unrolling, loop pipelining, and dataflow pipelining fall into this category.
• Memory customization – FPGA accelerators typically demand very high on-chip memory bandwidth to enable highly distributed control and computation. Therefore, it is critical to set up customized memory hierarchy to provide the required bandwidth through data reuse and memory banking.
• Communication customization – The limited data bandwidth between off-chip memories and the FPGA accelerators often becomes the performance bottleneck for memory-bound applications. Hence it is crucial to customize the communication channel and protocol used by the hardware accelerator to fully utilize off-chip memory bandwidth through proper data packing and careful

1  TRIANGLES: for (int i = 0; i < NUM_3D_TRI; i++) {
2  #pragma HLS dataflow
3    // five stages for processing each 3D triangle
4    projection(triangle_3ds, &triangle_2ds, angle);
5    flag = rasterization1(triangle_2ds, max_min,
6                          &triangle_2ds_same, max_index);
7    size = rasterization2(flag, max_min, max_index,
8                          triangle_2ds_same, fragment);
9    size_pixels = zculling(i, fragment, size, pixels);
10   coloringFB(i, size_pixels, pixels, frame_buffer);
11 }

Figure 1: Main loop for 3D Rendering. One triangle is processed by five image processing stages in each iteration.
arithmetic, while rasterization1 and zculling are heavy on integer comparisons. Each triangle requires a large amount of computation relative to its memory size. Therefore, the application is categorized as compute-bound.
3D rendering is a prime example of dataflow optimization, which is applied in the HLS code on line 2 of Figure 1. Dataflow optimization exploits task-level parallelism by overlapping different stages of the image processing pipeline, as shown in Figure 2. Although the latency of processing each triangle is not reduced, dataflow optimization improves throughput and ensures no hardware module in the pipeline is idle in the steady state.
Design parameters. We provide a switch in the source code to enable/disable dataflow optimization.

4.2 Digit Recognition
Digit recognition classifies hand-written digits using the K-nearest-neighbor (KNN) algorithm. The application works on a downsampled subset of the MNIST database [13], with 18000 training samples and 2000 test samples evenly split amongst the ten digit classes. Each MNIST image is downsampled to 14x14 and each pixel is represented as a single bit; thus, each image can be stored as a 196-bit unsigned integer. The KNN algorithm computes the Hamming distance between a test input and each training sample, stores the labels of the training samples with the K shortest distances, and votes among the K labels to decide the label of the test sample. The design objective for digit recognition is to minimize the total latency of classifying the 2000 test samples.
Digit recognition includes two major compute kernels: Hamming distance calculation and KNN voting. The Hamming distance kernel computes the Manhattan distance between two samples; as each sample is comprised of 1-bit pixels, this is done via bitwise XOR on the inputs, followed by computing a population count of the result. The kernel is therefore rich in bitwise logic. The Hamming distance must be calculated between a test input and every training sample. As a result, Hamming distance calculation is the dominant workload of digit recognition. The KNN voting kernel examines the list of Hamming distances to find the K nearest training samples, and outputs the classification result as the most frequent label amongst them. The main workload in this kernel is integer comparison and sorting.
These two kernels have very different characteristics: while we can easily exploit the bit-level and data-level parallelism in the Hamming distance kernel, the KNN voting kernel is harder to parallelize.
Digit recognition has a high compute to communication ratio. For each test instance, Hamming distance calculation requires 100s-1000s of cycles depending on the parallelization factor, and KNN voting requires 10s-100s of cycles depending on K and the parallelization factor. The training samples and their labels are stored on-chip and reused for all test instances. As a result, digit recognition is a compute-bound application.

1  __local WholeDigitType training_set[NUM_TRAINING]
2    __attribute__((xcl_array_partition(block,PAR_FACTOR,1)));
3
4  __attribute__((xcl_pipeline_loop))
5  TRAINING_LOOP:
6  for (int i = 0; i < NUM_TRAINING / PAR_FACTOR; i ++) {
7    __attribute__((opencl_unroll_hint))
8    LANES:
9    for (int j = 0; j < PAR_FACTOR; j ++) {
10     // Read a new instance from the training set
11     int train_id = j * NUM_TRAINING / PAR_FACTOR + i;
12     WholeDigitType training_instance;
13     training_instance = training_set[train_id];
14     // Update the KNN set
15     update_knn(test_instance, training_instance,
16                &knn_set[j*K_CONST]);
17   }
18 }

Figure 3: Main compute loop nest for KNN calculation in OpenCL.

Figure 3 shows the main compute loop nest for KNN calculation, alongside key HLS optimizations. TRAINING_LOOP iterates over training samples, while the inner loop, LANES, instantiates different Hamming distance units. In addition to compute optimizations in the form of loop pipelining and unrolling (lines 4 and 7 of Figure 3), memory optimization is needed since the default implementation of on-chip array training_set only has two memory ports, it cannot supply PAR_FACTOR training instances per cycle. The training_set array is partitioned in line 2. With these optimizations, we can exploit the data-level parallelism between training instances.
Design parameters. The user can tune the following knobs:
• K: number of nearest neighbors.
• PAR_FACTOR: number of parallel Hamming distance units.
These two parameters present an interesting trade-off between classification accuracy, latency, and resource utilization. Increasing PAR_FACTOR reduces the latency of the Hamming distance kernel, but complicates the KNN voting kernel. Parallelization also causes frequency to drop. Furthermore, the complexity of both kernels increases with K. Additional results and analysis on the design space are presented in Section 5.

4.3 Spam Filtering
The spam filtering application uses stochastic gradient descent (SGD) to train a logistic regression (LR) model for spam email classification [19]. The input is a dataset containing 5000 emails, 4500 for training and 500 for testing [26]. Each email is represented as a 1024-dimensional vector whose elements are relative word frequencies stored as 16-bit fixed-point numbers. The SGD training process produces a vector of 32-bit fixed-point parameters for the LR model. We use five training epochs and a minibatch size of one; each epoch processes every training sample once and updates the parameters after each sample.
The performance target of spam filtering is to minimize training latency. Critical resource constraints are the number of hardened DSP blocks and the size of on-chip storage, which limits the level of compute parallelization and the amount of data stored on the FPGA. The SGD algorithm contains kernels commonly found in machine learning applications, including dot product, vector addition, and sigmoid.
Our spam filtering design exploits datatype customization and approximation of complex arithmetic operations on the FPGA. Figure 4 shows the optimized sigmoid function. Lines 1-3 show the customized datatypes used to avoid expensive floating-point arithmetic. We also eliminate most of the compute by taking advantage of the properties of the sigmoid function. Sigmoid asymptotically approaches one when the input is large and zero when the input is small (i.e. large negative). Sigmoid values when the input is between minus four and four are hardcoded in a look-up table.
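A hypothetical rendering of this saturate-plus-lookup scheme is sketched below. It is not the Rosetta Figure 4 code; the real design uses HLS fixed-point datatypes, whereas plain C++ floats are used here to keep the sketch self-contained.

    // LUT-based sigmoid with saturation, in the spirit of the optimization
    // just described (illustrative only).
    #include <array>
    #include <cmath>

    constexpr int LUT_SIZE = 2048;            // table resolution over [-4, 4)
    std::array<float, LUT_SIZE> sigmoid_lut;  // filled once at initialization

    void init_sigmoid_lut() {
      for (int i = 0; i < LUT_SIZE; ++i) {
        float x = -4.0f + 8.0f * i / LUT_SIZE;        // map table index to [-4, 4)
        sigmoid_lut[i] = 1.0f / (1.0f + std::exp(-x));
      }
    }

    float sigmoid_approx(float x) {
      if (x >= 4.0f)  return 1.0f;   // saturate: sigmoid approaches 1 for large inputs
      if (x <= -4.0f) return 0.0f;   // ... and 0 for large negative inputs
      int idx = static_cast<int>((x + 4.0f) * (LUT_SIZE / 8.0f));
      return sigmoid_lut[idx];
    }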
window buffer. Figure 7 gives a pictorial illustration of a 2-row line buffer and a 3x3 window buffer. The line buffer reads in one pixel per cycle and stores pixels in recently visited rows. The window buffer is completely partitioned into registers for parallel data access, and
Zhao et al. implement the BNN model described in [8], which operates on the CIFAR-10 dataset [12]. It contains six convolutional layers, three pooling layers, and three fully-connected layers. Figure 9 shows the hardware diagram of the BNN accelerator, which uses a configurable number of convolvers to exploit data-level parallelism in a scalable manner. The authors target a small FPGA device with limited on-chip storage. As a result, the BNN weights cannot fit on-chip and the accelerator must be invoked multiple times to classify an image; each time new weights are loaded from off-chip memory.
There are two major kernels in BNN: binarized convolution and binarized dot product. Both kernels are intensive of bitwise logic operations. Binarized convolution comprise the majority of operations in classifying an image, and is heavily parallelized as a result. In contrast, the binarized fully-connected layers, which use the dot product kernel, are limited by off-chip memory-bandwidth. We categorize BNN as compute-bound since latency improvement mostly comes from accelerating compute in the convolutional layers.
Since 2D convolutional layers have a sliding window access pattern, line buffers are used to exploit data locality. In particular, a variable-width line buffer (VWLB) is designed to keep the hardware convolvers fully utilized despite the varying sizes of the feature maps. Figure 10 shows how the VWLB works for different input widths. For input feature map with a width of 32, the VWLB operates identically to a conventional line buffer. For a smaller feature map with a width of 8, each row in the VWLB stores multiple rows of the input. The rows are carefully arranged in the VWLB so that the convolutional filter can slide through and produce correct results.

Figure 10: Example usage of variable-width line buffer for 8-wide and 32-wide feature maps (figure adapted from [39]).

Design parameters. The BNN benchmark allows users to tune the number of convolvers in the accelerator. Other parameters such as the size of buffers are automatically scaled.

4.6 Face Detection
The face detection application is adopted from [25]. It uses the Viola-Jones algorithm [28] to detect human faces in a given image. More specifically, the accelerator takes an 320x240 greyscale image as input, which is scaled to construct an image pyramid; afterwards, an integral image is constructed from each image in the image pyramid, and a set of cascaded classifiers are applied to a fixed-size window which scans through the integral image; eventually, the positions and sizes of the human faces are returned.
As mentioned in [25], the throughput target for face detection is 30 frames per second. In addition, the application is subject to hardware constraints including limited on-chip storage and routing resources. The two major compute kernels in face detection are image scaling and cascaded classifiers. Image scaling is a common kernel in feature extraction applications such as SIFT [17], as well as the pooling layers of CNNs. The cascaded classifiers are the dominant workload for the face detection application. The authors of [25] parallelize the first three classifier stages and pipeline the rest of the stages to exploit data-level parallelism. This kernel also exposes an irregular memory access pattern — each classifier accesses either eight or twelve pixels, and the classifiers have different access patterns. This feature itself makes the kernel interesting for HLS memory optimization techniques. Customized memory partitioning is applied to improve kernel frequency and reduce routing effort [41].
The cascaded classifiers operate on a sliding window of the integral image. As a result, face detection can also benefit from the line buffer and window buffer optimization introduced in Section 4.4. However, constructing the whole integral image before applying the classifiers would require a significant amount of on-chip storage and incur performance loss. Therefore, the authors of [25] modified the window buffer to construct the integral image efficiently. The operation of this buffer is depicted in Figure 11, where the modified image window buffer accumulates pixels on the diagonal to compute the pixel values in the integral image.

Figure 11: Specialized line buffer and window buffer for face detection [25] — Here we show a 3x3 example, but the actual implementation uses 25x25 windows. Solid arrows refer to normal register shifting, while dashed arrows refer to addition. The image window buffer accumulates the incoming pixels and construct the integral image on the fly. The integral image window buffer accesses the image window buffer for new data.

5 EXPERIMENTAL RESULTS
We have synthesized the Rosetta benchmarks targeting an embedded FPGA as well as a cloud FPGA instance. We use Xilinx ZC706 for the embedded platform, which contains a Kintex-7 FPGA with a

Table 2: Device capacity of the two FPGA platforms and the resource utilization of the platform logic (shell) on AWS F1 — The last row reports the average resource utilization of the shell, with the standard deviation in parentheses.
              | # LUTs          | # FFs           | # BRAMs   | # DSPs
AWS F1 Total  | 1181768         | 2363536         | 2160      | 6840
ZC706 Total   | 218600          | 437200          | 545       | 900
AWS F1 Shell  | 293209 (±3693)  | 381853 (±5138)  | 545 (±0)  | 12 (±0)
Table 3: Rosetta results on Xilinx ZC706 Platform — The Runtime column shows overall
execution time. Resource numbers show the total resource usage of the designs, including
both kernel function and shell logic. Bitstreams are generated by Xilinx SDSoC 2017.1.
Benchmark # LUTs # FFs # BRAMs # DSPs Runtime (ms) Throughput
3D Rendering 8893 12471 48 11 4.7 213 frames/s
Digit Recognition1 41238 26468 338 1 10.6 189k digits/s
Spam Filtering2 12678 22134 49 160 78.9 285k samples/s
Optical Flow 42878 61078 54 454 24.3 41.2 frames/s
Binarized Neural Network3 46899 46760 102 4 4995.2 200 images/s
Face Detection 62688 83804 121 79 33.0 30.3 frames/s
1. K = 3, PAR_FACTOR = 40. 2. Five epochs, PAR_FACTOR = 32, VDWIDTH = 512.
3. Eight convolvers, 1000 test images.
Table 4: Rosetta results on AWS F1 Platform — Kernel: execution time on the FPGA; Comm.: time of data transfer between
host and global memory; Runtime: overall execution time. Performance-Cost Ratio is calculated based on the hourly rate (in
US Dollar/$) of the AWS f1.2xlarge instance [1]. Resource numbers are for kernel functions only. Bitstreams are generated by
Xilinx SDAccel 2017.1.
Benchmark # LUTs # FFs # BRAMs # DSPs Kernel (ms) Comm. (ms) Runtime (ms) Throughput Performance-Cost Ratio
3D Rendering 6763 7916 36 11 3.6 0.19 4.4 227 frames/s 496k frames/$
Digit Recognition1 39971 33853 207 0 9.9 0.55 11.1 180k digits/s 393M digits/$
Spam Filtering2 7207 17434 90 224 25.1 4.8 30.9 728k samples/s 1.6G samples/$
Optical Flow 38094 63438 55 484 2.6 4.8 8.4 119 frames/s 260k frames/$
Face Detection 48217 54206 92 72 20.2 0.47 21.5 46.5 frames/s 101k frames/$
1. K = 3, PAR_FACTOR = 40. 2. Five epochs, PAR_FACTOR = 32, VDWIDTH = 512.
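As a sanity check on the Performance-Cost Ratio column, take the 3D Rendering row and assume an f1.2xlarge on-demand rate of about $1.65 per hour (an assumed figure; the paper only cites the AWS pricing page [1]):

    227 frames/s × 3600 s/hour ÷ $1.65/hour ≈ 4.95 × 10^5 frames/$,

which is consistent with the reported 496k frames/$.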
Figure 12: Digit recognition design space, results are for AWS F1 platform — (a) Kernel time vs. K value. Difference in kernel time is caused by variance in latency and kernel frequency. (b) LUT usage vs. K value.

Figure 13: Spam filtering design space, results are for AWS F1 platform — Off-chip memory bandwidth is controlled by VDWIDTH. This parameter strictly limits the performance of the hardware kernel, showing that spam filtering is a memory-bound application.
on ZC706 and 227 frames per second on F1. While the throughput measured with our test input is much higher than the target, both kernel time and communication time increase with more triangles in the input. Communication latency is not significant on F1, but the software API calls in the OpenCL runtime incur a 0.6 ms overhead, which is not negligible for this specific application. These API calls initiate data transfers, enqueue the kernel function, and set the kernel arguments.

Table 5 shows the resource utilization and kernel time of a baseline design where dataflow optimization is not applied. Compared with the first row of Table 4, enabling dataflow optimization improves the kernel time by around 30% without significant resource overhead. This result demonstrates the efficacy of dataflow optimization in image processing pipelines.

Digit Recognition. In contrast to the other benchmarks, the performance of digit recognition is currently slightly worse on F1 than on ZC706. The overall throughput is 189k digits per second on ZC706 and 180k digits per second on F1. Although F1 has a shorter kernel time of 9.9 ms, the latency of communication and other overhead in the OpenCL runtime seem to have offset this advantage. According to our analysis, this is likely due to a missing feature in the specific version of the tool we are using, where async_group_copy is not pipelined to the full extent. Hence we expect to achieve higher performance on F1 in the near future, once this issue is resolved.

As mentioned in Section 4.2, digit recognition has a complex design space. Table 6 shows the classification accuracy for different K values. Figure 12 shows the kernel time and resource utilization of different design points. We only show kernel time in Figure 12a, because host-global memory communication time is not affected by the kernel implementation. In Figure 12b, only the most critical resource, LUTs, is shown. As we can see from Table 6 and Figure 12, the two design parameters expose interesting trade-offs. Increasing the K value improves classification accuracy at the cost of a significant increase in kernel time, caused by the frequency drop and the worsened latency of the KNN voting kernel. Additionally, the benefit of increasing PAR_FACTOR diminishes when PAR_FACTOR is already large: when the Hamming distance kernel is highly parallelized, the KNN voting kernel, which is highly sequential, becomes the performance bottleneck. The performance can be further improved by optimizing the KNN voting kernel and finding an optimal combination of the K value and PAR_FACTOR.
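To make the PAR_FACTOR knob concrete, the following sketch illustrates the Hamming-distance stage described above. It is our own illustration, not the Rosetta source; NUM_TRAINING, DigitType, and the 256-bit digit width are placeholders, and in practice the training array would also need to be partitioned to sustain PAR_FACTOR reads per cycle.

#include <ap_int.h>                    // Xilinx HLS arbitrary-precision types

const int PAR_FACTOR   = 40;           // parallelization knob, as in Table 4 (K = 3, PAR_FACTOR = 40)
const int NUM_TRAINING = 18000;        // illustrative training-set size (multiple of PAR_FACTOR here)
typedef ap_uint<256> DigitType;        // assumption: one binarized digit packed into 256 bits

// Hamming-distance stage: PAR_FACTOR distances are produced per loop iteration.
// The sequential KNN voting stage (not shown) consumes dist[] afterwards and
// becomes the bottleneck once this stage is highly parallelized.
void ComputeDistances(const DigitType training[NUM_TRAINING],
                      const DigitType &test,
                      int dist[NUM_TRAINING]) {
  for (int i = 0; i < NUM_TRAINING; i += PAR_FACTOR) {
#pragma HLS PIPELINE II=1
    for (int p = 0; p < PAR_FACTOR; ++p) {
#pragma HLS UNROLL
      DigitType diff = training[i + p] ^ test;   // XOR marks differing bits
      int d = 0;
      for (int b = 0; b < 256; ++b)              // popcount, unrolled by the tool
        d += diff[b];
      dist[i + p] = d;
    }
  }
}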
Spam Filtering. The performance of spam filtering differs significantly between the two platforms. The kernel time on F1 is 3.1x shorter than on ZC706, and the total execution time on F1 is 2.6x shorter, despite the additional 4.8 ms latency for host-global memory communication. In addition to the frequency improvement, this performance gap is mainly caused by the difference in off-chip memory bandwidth. Since we apply dataflow optimization to overlap communication and compute, the overall latency of the design is determined by the maximum of the compute and communication latencies. Because the compute kernels are highly parallel, the low communication bandwidth on ZC706 results in a much longer latency of the dataflow pipeline.

Figure 13 shows the kernel time on AWS F1 with different combinations of PAR_FACTOR and VDWIDTH. Here PAR_FACTOR specifies the degree of parallelism in the vector kernels, and VDWIDTH controls the off-chip communication bandwidth. With the same off-chip bandwidth, increasing PAR_FACTOR beyond 64 does not result in much performance gain, since the communication latency already dominates the compute latency. When the off-chip bandwidth is reduced, communication latency increases further, and kernel time degrades for all PAR_FACTOR values we tested. The best achievable performance improves with higher off-chip memory bandwidth. These results confirm that spam filtering is a memory-bound application.

Optical Flow. The total execution time of optical flow is 8.4 ms on F1 and 24.3 ms on ZC706. Both implementations satisfy the throughput constraint. On the AWS F1 platform, host-global memory communication takes up approximately 60% of the total execution time due to the large input/output data size. If we only consider kernel time, it is 9.3x shorter on F1 than on ZC706. Similar to spam filtering, this behavior is caused by the difference in off-chip memory bandwidth. The optical flow accelerator reads from and writes to the off-chip memory at the same time due to the streaming dataflow optimization. The F1 platform has multiple off-chip DDR banks to handle concurrent read and write requests. On ZC706, however, these concurrent requests cause contention on the off-chip memory, and the accelerator is often stalled due to the lack of input data.
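Both spam filtering and optical flow rely on this overlap of data movement and compute. A generic, hedged sketch of such a dataflow pipeline (our own illustration in Vivado HLS style, with placeholder names; Process() stands in for the benchmark kernel) looks as follows:

#include <hls_stream.h>

static int Process(int x) { return 2 * x; }   // stand-in for the benchmark's compute kernel

void ReadInput(const int *mem, hls::stream<int> &s, int n) {
  for (int i = 0; i < n; ++i) s.write(mem[i]);          // stream in from global memory
}
void Compute(hls::stream<int> &in, hls::stream<int> &out, int n) {
  for (int i = 0; i < n; ++i) out.write(Process(in.read()));
}
void WriteOutput(int *mem, hls::stream<int> &s, int n) {
  for (int i = 0; i < n; ++i) mem[i] = s.read();        // stream out to global memory
}
void Top(const int *in, int *out, int n) {
#pragma HLS DATAFLOW
  hls::stream<int> a, b;
  ReadInput(in, a, n);      // the three stages run concurrently, so total latency is
  Compute(a, b, n);         // governed by the slowest stage (compute- or memory-bound)
  WriteOutput(out, b, n);   // rather than by the sum of the stages
}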
6 CONCLUSIONS AND FUTURE WORK

We have presented Rosetta, an open-source, realistic benchmark suite for high-level synthesis targeting modern FPGA platforms. Rosetta is designed to be a collection of real applications which are optimized for performance and resource constraints. All Rosetta applications are ready to be executed on the supported embedded and cloud platforms. We believe that Rosetta can serve as a useful benchmark suite for HLS algorithms and tools, as well as a set of design tutorials for application developers interested in FPGA-based accelerated computing.

Rosetta will be continuously improved in the future. We will extend Rosetta to include more realistic applications from emerging domains. For the existing benchmarks, we plan to provide both C++ and OpenCL implementations for every benchmark to embrace different programming models commonly supported by HLS tools. The benchmarks will also be further optimized to achieve higher performance and resource efficiency.

ACKNOWLEDGEMENTS

This research was supported in part by a DARPA Young Faculty Award, NSF Awards #1337240 and #1453378, and a research gift from Xilinx, Inc. We thank Dr. Sumit Roy from Xilinx for providing helpful feedback on the Rosetta designs. We also thank Ackerley Tng, Edgar Munoz, Wendian Jiang, Lin Wang, Yun Qing, Nithya Subramanian, Nikita Patil, Surabhi Singh, Judy Stephen, and Ian Thompson for their contributions to the baseline designs of digit recognition, 3D rendering, spam filtering, and optical flow.

REFERENCES
[1] Amazon Web Services. AWS FPGA Developer AMI. https://aws.amazon.com/marketplace/pp/B06VVYBLZZ, Dec 2017.
[2] Amazon Web Services. AWS Shell Interface Specification. https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md, Dec 2017.
[3] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu. An OpenCL Deep Learning Accelerator on Arria 10. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A Naturalistic Open Source Movie for Optical Flow Evaluation. European Conference on Computer Vision (ECCV), Oct 2012.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. Int'l Symp. on Workload Characterization (IISWC), Oct 2009.
[6] P. Colangelo, R. Huang, E. Luebbers, M. Margala, and K. Nealis. Fine-Grained Acceleration of Binary Neural Networks Using Intel Xeon Processor with Integrated FPGA. Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), Apr/May 2017.
[7] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4):473–491, 2011.
[8] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830, Mar 2016.
[9] S. Dai, R. Zhao, G. Liu, S. Srinath, U. Gupta, C. Batten, and Z. Zhang. Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[10] Q. Gautier, A. Althoff, P. Meng, and R. Kastner. Spector: An OpenCL FPGA Benchmark Suite. Int'l Conf. on Field Programmable Technology (FPT), Dec 2016.
[11] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-Based High-Level Synthesis. Journal of Information Processing, Vol. 17, pages 242–254, Oct 2008.
[12] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, Apr 2009.
[13] Y. LeCun. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/, Dec 2017.
[14] Y. Liang, K. Rupnow, Y. Li, D. Min, M. N. Do, and D. Chen. High-Level Synthesis: Productivity, Performance, and Software Constraints. Journal of Electrical and Computer Engineering, 2012:1:1–1:1, Jan 2012.
[15] G. Liu, M. Tan, S. Dai, R. Zhao, and Z. Zhang. Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2017.
[16] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen. High Level Synthesis of Complex Applications: An H.264 Video Decoder. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2016.
[17] D. G. Lowe. Object Recognition from Local Scale-Invariant Features. Int'l Conf. on Computer Vision (ICCV), Oct 1999.
[18] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[19] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[20] J. Pineda. A Parallel Algorithm for Polygon Rasterization. ACM SIGGRAPH Computer Graphics, 22(4):17–20, 1988.
[21] L.-N. Pouchet. Polybench: The Polyhedral Benchmark Suite. http://www.cs.ucla.edu/pouchet/software/polybench, Dec 2017.
[22] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. Polyhedral-Based Data Reuse Optimization for Configurable Computing. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2013.
[23] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. Int'l Symp. on Workload Characterization (IISWC), Oct 2014.
[24] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Int'l Symp. on Computer Architecture (ISCA), Jun 2014.
[25] N. K. Srivastava, S. Dai, R. Manohar, and Z. Zhang. Accelerating Face Detection on Programmable SoC Using C-Based Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[26] The Apache Software Foundation. Public Corpus. http://spamassassin.apache.org/old/publiccorpus/, Apr 2017.
[27] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[28] P. Viola, M. J. Jones, and D. Snow. Detecting Pedestrians Using Patterns of Motion and Appearance. International Journal of Computer Vision, 63(2):153–161, Jul 2005.
[29] S. Wang, Y. Liang, and W. Zhang. FlexCL: An Analytical Performance Model for OpenCL Workloads on Flexible FPGAs. Design Automation Conf. (DAC), Jun 2017.
[30] Y. Wang, P. Li, and J. Cong. Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2014.
[31] Z. Wang, B. He, W. Zhang, and S. Jiang. A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs. Int'l Symp. on High Performance Computer Architecture (HPCA), Mar 2016.
[32] Z. Wei, L. Dah-Jye, and B. E. Nelson. FPGA-Based Real-Time Optical Flow Algorithm Design and Implementation. Journal of Multimedia, 2:38–45, Sep 2007.
[33] H. Yonekawa and H. Nakahara. On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA. Int'l Parallel and Distributed Processing Symp. Workshops (IPDPSW), May 2017.
[34] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2015.
[35] C. Zhang and V. K. Prasanna. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[36] J. Zhang and J. Li. Improving the Performance of OpenCL-Based FPGA Accelerator for Convolutional Neural Network. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[37] Z. Zhang and B. Liu. SDC-Based Modulo Scheduling for Pipeline Synthesis. Int'l Conf. on Computer-Aided Design (ICCAD), Nov 2013.
[38] J. Zhao, L. Feng, S. Sharad, W. Zhang, Y. Liang, and B. He. COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications. Int'l Conf. on Computer-Aided Design (ICCAD), Nov 2017.
[39] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. B. Srivastava, R. Gupta, and Z. Zhang. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[40] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar. Lin-Analyzer: A High-Level Performance Analysis Tool for FPGA-Based Accelerators. Design Automation Conf. (DAC), Jun 2016.
[41] Y. Zhou, K. M. Al-Hawaj, and Z. Zhang. A New Approach to Automatic Memory Banking Using Trace-Based Address Mining. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[42] W. Zuo, P. Li, D. Chen, L.-N. Pouchet, S. Zhong, and J. Cong. Improving Polyhedral Code Generation for High-Level Synthesis. Proc. of the 8th Int'l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Sep/Oct 2013.
IEEE EMBEDDED SYSTEMS LETTERS, VOL. XX, NO. X, DECEMBER 2016 1
in industry to create specialized System-on-Chip (SoC) architectures. Increasing the level of security of these heterogeneous architectures is becoming critical. However, state-of-the-art security countermeasures are still applied only to the code executing on the processor cores or manually implemented into the generated components, leading to suboptimal and sometimes even insecure designs. This paper discusses extensions to HLS tools for creating secure heterogeneous architectures.

Index Terms—High-Level Synthesis, Hardware Security.
microarchitecture of an accelerator, which is composed of the controller and the datapath. The execution of the function is controlled by a finite-state machine (the controller) that, based on a set of conditions, determines which operations are executed by the arithmetic resources (the datapath) in any given clock cycle. These resources elaborate input data, provided through parameters or stored in memories – either in local, directly accessible scratchpads or in external memory accessed through memory controllers – with the possibility of computing on memory addresses (e.g., pointer arithmetic) [9]. This is achieved by daisy-chaining all memory components (i.e., the local scratchpads and the controller for the external memory). In this way, accelerators can automatically identify the memory location accessed by a memory operation based on the dynamic value of the address, broadening the range of applications that can leverage such heterogeneous building blocks.

Since heterogeneous architectures leverage hardware accelerators to provide energy-efficient, high-performance computation, such components are an attractive target for attacks. Current protection mechanisms target software execution on processors [10], [11], are manually implemented [12], and introduce significant overheads [13]. This approach is neither efficient nor scalable when applied to accelerators, requiring the design process to be revisited.

We discuss hardware vulnerabilities listed in the CWE list¹, focusing on how to exploit design errors and alter the accelerator behavior. First, vulnerabilities in hardware accelerators can be exploited to launch software-based attacks. Even if it is not possible to implement a different functionality as is done by exploiting buffer overflows (CWE-121) and code injection (CWE-94), one can manipulate input values (either configuration parameters or memory values) to exploit design errors. For example, attackers may exploit vulnerabilities in the accelerator controller to launch a wide range of attacks (CWE-691: Insufficient Control Flow Management) [14]. Attackers can also exploit vulnerabilities in the SoC architecture. For example, the attacker may tamper with the system bus to insert malicious operations that trigger unauthorized execution of the accelerators (CWE-284: Improper Access Control) [15]. If the system is not adequately protected, the resulting execution may be compromised. One can access internal and sensitive data through the output port or via the memory space shared with the attacker (CWE-485: Insufficient Encapsulation and CWE-922: Insecure Storage of Sensitive Information). Hence, the execution and the outcome of an accelerator are not secure if not adequately verified and protected (see Fig. 2(b)).

Even when the specification of the accelerator is secure, its implementation can be compromised through physical attacks, where the adversary exploits the weaknesses of the implementation (CWE-693: Protection Mechanism Failure). Side-channel attacks can be used to extract secret data from embedded devices and high-end cloud servers. A paradigmatic example is the Advanced Encryption Standard (AES). The algorithm is mathematically secure, but its physical implementations have been attacked using power and timing attacks (CWE-326: Inadequate Encryption Strength). Accelerators can help mitigate side-channel attacks, for instance by ensuring constant execution time and thus making timing attacks infeasible. For example, Intel recently added the AES-NI instructions [11]. Accelerators must be protected from a variety of other attacks, including fault-based attacks and side channels [16]. If not adequately protected, a circuit separated from the rest of the processor can be localized easily, becoming the target of precise power side-channel attacks, ultimately leading to easy key recovery (see Fig. 2(c)).

¹ The Common Weakness Enumeration List (CWE), http://cwe.mitre.org
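To make the constant-execution-time argument concrete (our own illustration, not code from the letter), a data-dependent branch on a secret value can be replaced in the HLS source by a branch-free select, so that the synthesized datapath does the same work in every cycle regardless of the secret:

// Branch-free (constant-time) selection: both operands are always available and the
// result is chosen through a mask derived from the secret bit, so the cycle count of
// the synthesized circuit does not depend on the secret value.
static inline unsigned ct_select(unsigned secret_bit, unsigned a, unsigned b) {
  unsigned mask = 0u - (secret_bit & 1u);   // all ones if the bit is set, zero otherwise
  return (a & mask) | (b & ~mask);
}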
Besides these hardware vulnerabilities for the end user, secure accelerators should be protected from reverse engineering, insertion of hardware Trojans (CWE-912: Hidden Functionality), and unauthorized copying. Otherwise, the technological advantage of the IP provider can be undermined, creating billions of dollars of economic damage [17]. The hardware description of the accelerator depends not only on the initial high-level specification but also on the optimizations selected by the designer and performed by the design tools. Reverse engineering would make all these assets available to unauthorized parties (see Fig. 2(d)).

In a nutshell, since designers are integrating hardware accelerators into their designs, we expect securing these components to become increasingly relevant in the coming years.

III. SECURING HARDWARE ACCELERATORS

The proliferation of third-party applications for embedded systems (e.g., in the Apple App Store or Google Play) is becoming a serious threat to the user's privacy, since such systems can leak personal information without authorization [10].
Received August 13, 2011; revised November 20, 2011; accepted November 30, 2011
ABSTRACT
This paper addresses the challenges of System-on-Chip designs using High-Level Synthesis (HLS). HLS tools convert
algorithms designed in C into hardware modules. This approach is a practical choice for developing complex applica-
tions. Nevertheless, certain hardware considerations are required when writing C applications for HLS tools. Hence, in
order to demonstrate the fundamental hardware design concepts, a case study is presented. Fast Fourier Transform (FFT)
implementation in ANSI C is examined in order to explore the important design issues such as concurrency, data recur-
rences and memory accesses that need to be resolved before generating the hardware using HLS tools. There are addi-
tional language constraints that need to be addressed including use of pointers, recursion and floating point types.
Keywords: System Level Design; High Level Synthesis; Field Programmable Gate Arrays; Fourier Transform
been accomplished using Hardware Description Languages (HDLs) such as VHDL or Verilog. Each expression in HDL represents a group of gates that operate in parallel, as opposed to machine instructions executed sequentially. This concept of instruction-level parallelism is one of the first major hurdles when introducing hardware concepts. Once an RTL module is designed, it can be compiled and simulated. The simulation is done by creating a series of pre-defined inputs, known as a testbench, and recording the outputs. If a module passes the simulation, then a low-level implementation can be created. This low-level implementation then enters the verification process to ensure that all timing dependencies are met. In practice, simulating and verifying an implementation can take 50%-60% of the development time, increasing the time-to-market (TTM) [14]. By automating the simulation and verification process, it is possible to greatly reduce the development time.

Integration of HLS tools into the FPGA or ASIC design flow, as shown in Figure 1, allows software designers to build hardware modules and speed up the TTM significantly. During the generation of an RTL module from a software implementation, simulation and verification are done automatically by using a formal proof provided during the initial steps. Subsequently, by using synthesis tools, the RTL module is implemented and timing verification is done. An independent evaluation of HLS tools for Xilinx FPGAs has been done by Berkeley Design Technology [15]. It shows that using HLS tools with FPGAs can improve the performance of an application by an order of magnitude compared to DSPs. Moreover, this study shows that for a given application, HLS tools achieve results similar to hand-written HDL code with a shorter development time.

The HLS software-based approach to simulation and verification is made possible by SystemC, a language developed by Synopsys, the University of California Irvine, Frontier Design, and IMEC. SystemC is an extension of C++ that provides additional libraries to design an embedded system. The first version was released in 1999, and in 2005 it became IEEE-standardized as IEEE 1666-2005 [16,17]. These additional libraries make it possible to specify the hardware and software components of an embedded system using one unified paradigm and to generate testbenches.

Focusing further on HLS, the design flow is shown in Figure 2. Each module of a system is implemented using high-level languages such as C, C++, Java, or Matlab [2,18], which can then be tested automatically with testbenches provided by the user. After verification of the complete system, the user can specify in the HLS tool which modules will be converted into hardware accelerators in order to speed up the application. This is one of the core elements of hardware/software co-design that software developers need to understand. There are inherent restrictions in the HDLs that are mirrored in the HLS tool. Therefore, the emphasis when teaching HDL to software developers is on its constraints and how they affect the HLS tools.

After generation of the hardware modules along with testbenches, the system is verified and can be implemented using synthesis tools.

This paper, as mentioned earlier, focuses on designing a Fast Fourier Transform. The concept of HLS is presented by using PICO (Program-In Chip-Out) Extreme from Synfora [10,19,20] to generate the RTL code of an FFT. To be specific, PICO takes a C-based description of an algorithm and generates performance-driven, device-dependent synthesizable RTL code, testbench files, application drivers, simulation scripts, as well as SystemC-based Transaction Level Models (TLM) [3,17,18,21]. The PICO design flow is shown in Figure 3.

Figure 1. FPGA high-level synthesis block diagram.
Figure 2. High-level synthesis (HLS) design flow.
With the integration of the PICO design tools into their FPGA flow, designers can create complex hardware sub-systems [20] from sequential, untimed C algorithms. It allows designers to explore programmability, performance, power, area, and clock frequency. This is achieved by providing a comprehensive and robust verification and validation environment. PICO is designed to explore different types of parallelism and will choose the optimal one transparently. Results in terms of throughput and area are given along with detailed reports that help the user with code optimization. When the synthesized performance is satisfactory, RTL code is generated and can be implemented on the targeted platform. Because the testing is done in C, the verification time of the RTL module can be significantly reduced [20].

3. Fast Fourier Transform

In most cases, the first step when using an HLS tool is to create a reference implementation, which is used to verify the synthesized product. The reference code itself can be compiled using any C compiler, and is purely software based. This means that no new concepts have to be taught, making the reference implementation a logical starting point when using HLS.

When creating the reference code for the FFT, there are a few issues that need to be addressed when using HLS tools. The first issue is that arithmetic operations such as division can significantly decrease the performance of the design, and therefore should be avoided whenever possible. Nevertheless, division by a power of two is treated as a bit-shift operation and hence can be used at no cost. The second, and more fundamental, issue is that pointers and recursion are not supported by current HLS tools, because those concepts are purely software constructs and cannot be applied to hardware designs. Finally, HLS tools may not have the capability to synthesize software functions such as cosine and sine. As a result of these constraints, the reference code included in this section does not use divisions, is completely iterative, and has no pointer variables. However, before going into the details of the implementation, the mathematical background of the FFT is presented.

3.1. FFT Algorithm

The Fourier transform takes a signal x in time t and transforms it into a function X in frequency ω:

X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-2 j \pi \omega t}\, dt \qquad (1)

The transform can be computed using the Discrete Fourier Transform (DFT):

X_k = \sum_{n=0}^{N-1} x_n\, e^{-\frac{2 j \pi k n}{N}}, \qquad k = 0, \dots, N-1 \qquad (2)

The direct realization of the DFT requires O(N^2) computational time. To make this computation faster, an entire class of Fast Fourier Transforms (FFT) was developed [8]. In this paper, a radix-2 FFT decimated in time is implemented. This algorithm divides the original DFT into two DFTs of half the length (i.e., decimation). The first step of the decimation is shown below:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2 \pi j (2m) k}{N}} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2 \pi j (2m+1) k}{N}} \qquad (3)

Then the algorithm is recursively applied to each term until the length of each DFT is 1. This recursive deconstruction of the DFT reduces the computational time to O(N log(N)) [8].
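For completeness, Eq. (3) can be rearranged into the butterfly form that Eq. (4) below relies on; this is standard DFT algebra rather than text reproduced from the paper:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2\pi j m k}{N/2}}
    + W_N^k \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2\pi j m k}{N/2}}
    = E_k + W_N^k O_k,
\qquad
X_{k+N/2} = E_k - W_N^k O_k,
\qquad
W_N^k = e^{-2\pi j k / N},

where E_k and O_k denote the N/2-point DFTs of the even- and odd-indexed samples, respectively, for k = 0, ..., N/2-1.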
3.2. Software Implementation of the FFT

In Figure 4, a 16-point radix-2 FFT is shown. A signal is input into the FFT in bit-reversed order and then goes through log2(N) passes, where each pass has N/2 "butterfly" operations. These butterfly operations are defined as:

F = f + W_N^k\, g, \qquad G = f - W_N^k\, g, \qquad W_N^k = e^{-2\pi j k / N} \text{ (called the twiddle factor)} \qquad (4)

Figure 4. 16-point radix-2 FFT. [Butterfly diagram not reproduced in this copy.]

The butterfly operation requires complex-number additions and multiplications. Because of the programming constraints placed on the reference code, most complex number libraries are not useable. Hence, this reference code uses its own complex number representation, shown below:

typedef struct {
    float x;
    float y;
} s_complex;

Moreover, in order to perform the butterfly operation, the W_N^k terms need to be calculated. Since we assume that the HLS library does not support cosine and sine functions, the twiddle factors are pre-computed and stored in a table using the code below:

#include "fft.h"
#include <math.h>
#define pi2 (double)6.28318530717958647692528676655901

extern s_complex fix_float[N / 2];

void table_setup(void)
{
    double a = 0.0;
    double e = pi2 / N;
    float cos_val, sin_val;
    int i;
    for (i = 0; i < N / 2; i++) {
        cos_val = cos(a);
        sin_val = sin(a);
        fix_float[i].x = cos_val;
        fix_float[i].y = sin_val;
        a = a + e;
    }
}

The particular implementation chosen for this reference FFT was provided by [22]. The exact code used is shown in Figure 5. N represents the length of the FFT and must be a power of 2. Before using the function fft_ref, the function table_setup must be executed in order to compute the twiddle factors and store them in the array fix_float. The FFT of an input z can then be executed. The first phase is the bit-reverse operation, where the input data are rearranged as shown in Figure 4. Then, for each pass, the butterfly operations are performed until the FFT is completed. In the next section this code will be made fully synthesizable by applying four modifications to it.

Figure 5. FFT reference C code (fft_ref). [The listing is not legible in this copy; its annotated phases are the bit-reverse operation, obtaining the cosine and sine values for the butterfly operation, and the butterfly calculation.]
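Because the reference listing in Figure 5 is not legible in this copy, the following is a reconstruction sketch of an iterative radix-2 decimation-in-time FFT with the same structure (bit-reverse pass followed by log2(N) butterfly passes over the pre-computed fix_float table). It is our own code, not the exact listing from [22], and the twiddle-factor sign handling is an assumption based on table_setup above.

#include "fft.h"    /* assumed, as in the listings above, to define N and s_complex */

extern s_complex fix_float[N / 2];   /* twiddle table filled by table_setup() */

/* In-place iterative radix-2 decimation-in-time FFT over z[0..N-1]. */
void fft_sketch(s_complex z[N])
{
    /* Bit-reverse reordering of the input samples */
    unsigned int i, j = 0;
    for (i = 1; i < N; i++) {
        unsigned int bit = N >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            s_complex tmp = z[i];
            z[i] = z[j];
            z[j] = tmp;
        }
    }

    /* log2(N) butterfly passes */
    unsigned int span;
    for (span = 1; span < N; span <<= 1) {
        unsigned int stride = N / (2 * span);        /* twiddle index stride */
        unsigned int start, k;
        for (start = 0; start < N; start += 2 * span) {
            for (k = 0; k < span; k++) {
                float wr =  fix_float[k * stride].x;  /* cos term from the table */
                float wi = -fix_float[k * stride].y;  /* assumed sign for the forward transform */
                s_complex f = z[start + k];
                s_complex g = z[start + k + span];
                float tr = wr * g.x - wi * g.y;       /* W_N^k * g, real part */
                float ti = wr * g.y + wi * g.x;       /* W_N^k * g, imaginary part */
                z[start + k].x        = f.x + tr;     /* F = f + W_N^k g */
                z[start + k].y        = f.y + ti;
                z[start + k + span].x = f.x - tr;     /* G = f - W_N^k g */
                z[start + k + span].y = f.y - ti;
            }
        }
    }
}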
4. Code Modification for HLS

The objective of this section is to generate the hardware of an FFT block based on the reference C code using HLS tools. Multiple modifications are needed in order to generate optimal hardware in terms of resource usage and throughput. As an example, we generate an 8-bit 1024-point radix-2 FFT. The output is on 18 bits and will be available in natural order. The data width inside the FFT has been chosen so that the HLS FFT gives the same results as the Xilinx FFT core [23].

4.1. Floating Point to Fixed Point Implementation

Since the reference C code uses floating-point numbers, a fixed-point library is needed. For example, PICO, the HLS tool used in this demonstration, provides such a library. The PICO fixed-point arithmetic library derives its semantics from the SystemC fixed-point library, and it supports signed and unsigned arithmetic operations. Hence, the previous floating-point complex structure must be modified as follows:

typedef pico::s_fixed<22, 18, pico::S_RND, pico::S_SAT, 0> floatP;

typedef struct {
    floatP x;
    floatP y;
} s_complexP;

The FFT is computed using a 22-bit data width, with 18 bits for the integer part and 4 bits for the fractional part. Rounding and saturation are used. The effect of the number of bits allocated to the fractional part on the precision and resource usage of the HLS FFT is presented in Section 5. The twiddle factors are pre-calculated with a precision of 16 bits and stored in an array, eliminating the need for trigonometric functions.

4.2. Input Array to Stream of Input Data

In the reference C code, the input data are passed to the function as an array. This will be translated into memory accesses by the HLS tool, which is not optimal for a hardware implementation. Hence, a stream of input data is used. PICO supports two types of streams: external and internal. External streams are used to stream data from/to global memory and/or other blocks in the system. Internal streams are used to stream data between loops within a multi-loop accelerator designed by PICO. In PICO, streams are specified using explicit procedure calls that transmit a scalar value to an output stream or receive a scalar value from an input stream. These procedures are converted into special opcodes that receive (transmit) data from (to) actual streams. For the FFT application, four streams are needed: input/output streams for the real and imaginary parts:

char pico_stream_input_xin();
char pico_stream_input_yin();
void pico_stream_output_xout(int);
void pico_stream_output_yout(int);

PICO synthesizes a FIFO (within the RTL) for each internal and external stream in the code. Different parameters, such as the length of the FIFO, can be configured using pragmas. The first step of the FFT is the loading phase, where input data are stored into a RAM called z, as shown below:

for (h = 0; h < N; h++) {
    z[h].x = (floatP) pico_stream_input_xin();
    z[h].y = (floatP) pico_stream_input_yin();
}

Finally, after the FFT is computed, the unloading phase is performed:

for (p = 0; p < N; p++) {
    pico_stream_output_xout(z[p].x);
    pico_stream_output_yout(z[p].y);
}
4.3. Bit-Reverse Operation

If we look at the reference C code, the next step would be the bit-reverse stage; this operation takes 1024 cycles. However, it can be integrated into the radix-2 FFT block, hence reducing the total number of cycles required to perform the calculations. This can be done using the bit_swap function:

unsigned short bit_swap(unsigned short in, unsigned short bits) {
    unsigned short out = 0;
    unsigned short k;
    #pragma unroll
    for (k = 0; k < bits; k++) {
        out = (out << 1) | (in & 0x1);
        in = in >> 1;
    }
    return out;
}
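One possible way to fold the reordering into the loading phase of Section 4.2 (a hypothetical sketch, not the paper's code) is to store each streamed sample directly at its bit-reversed address:

/* Hypothetical integration of bit_swap() into the loading phase: each streamed
 * sample is written directly to its bit-reversed address, so the separate
 * 1024-cycle reordering pass is no longer needed. */
for (h = 0; h < N; h++) {
    unsigned short r = bit_swap((unsigned short) h, 10);   /* log2(1024) = 10 bits */
    z[r].x = (floatP) pico_stream_input_xin();
    z[r].y = (floatP) pico_stream_input_yin();
}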
this section, increasing the frequency will increase the resources of the hardware generated by the HLS tool. The throughput (the number of FFTs that can be computed in one second) can also be specified. In order to achieve a high throughput, the HLS tool will parallelize tasks, hence increasing the hardware resources. Finally, the user can specify whether arrays are implemented using block RAMs or look-up tables (LUTs). Hardware implementation results are obtained using the Xilinx ISE 12.1 software with either speed or area optimization for a Virtex-5 FPGA. The twiddle factors have been implemented using LUTs but can also be implemented using RAMs; doing so reduces the total number of slice LUTs but increases the number of block RAM/FIFO. Table 1 shows the hardware usage of the HLS implementation of the FFT with 22-bit data width for different targeted frequencies.

One can see a significant increase in logic slices at 150 MHz operational frequency. This is due to the fact that we have selected optimization for speed in ISE in order to achieve the desired operational frequency after place and route. For frequencies lower than 150 MHz, optimization for area has been selected. For frequencies from 50 MHz to 150 MHz, the total number of clock cycles required by PICO to perform the 1024-point FFT is 7168, but for 175 MHz it increases to 12288 clock cycles. 7168 clock cycles is the minimum latency that can be obtained and is calculated as follows:

latency = loading + FFT + unloading = N + \frac{N}{2} \log_2(N) + N \qquad (5)
latency = 1024 + 512 \cdot 10 + 1024 = 7168 clock cycles

For frequencies higher than 150 MHz, PICO reduces the task parallelism of the FFT in order to achieve the desired frequency. This results in an increase of the latency and a reduction of the hardware resources. The maximum frequency that can be obtained by PICO is around 270 MHz, with a total of 17,408 clock cycles (1024 + 3 \cdot 512 \cdot 10 + 1024) to compute the FFT. Nevertheless, after place and route, the maximum frequency obtained is 180 MHz due to the targeted FPGA.

Area reduction in terms of slices and DSP48E blocks can be achieved by increasing the number of clock cycles required to perform the FFT. Hence, for equivalent throughput, it is better to choose a higher operational frequency and a higher number of clock cycles required to perform the FFT. Table 2 shows the hardware usage of the FFT for a targeted frequency of 150 MHz with different throughputs. For example, from Table 1, at a frequency of 75 MHz the throughput is 10,463. Nevertheless, with a frequency of 150 MHz, a better throughput can be obtained using fewer DSP48E blocks (see Table 2, second row).

Figure 7 shows the error variation with respect to the width of the fractional part, compared to the reference code shown in Figure 5. The relative error of the FFT is given by the formula below:

error = \frac{1}{100} \cdot \frac{1}{1024} \sum_{n=0}^{99} \sum_{k=0}^{1023} \left( \frac{\left| X_{ref}[n][k] - X_{HLS}[n][k] \right|}{\left| X_{ref}[n][k] \right|} + \frac{\left| Y_{ref}[n][k] - Y_{HLS}[n][k] \right|}{\left| Y_{ref}[n][k] \right|} \right) \qquad (6)

where X and Y are the real and imaginary parts, respectively. The relative error is calculated for 100 random input signals of 1024 samples each. Figure 7 shows that the relative error decreases linearly as the number of bits for the fractional part increases. For the implementation of the FFT, -40 dB is achieved, giving the same results as the Xilinx FFT core. Nevertheless, the user can increase the precision at the expense of hardware usage. For 13 bits, the relative error achieved is -73 dB compared to the reference C code based on double-precision floating-point operations.

Table 3 shows the hardware usage with respect to the width of the fractional part for a desired operational frequency of 100 MHz. As expected, the resource usage increases with the number of bits for the fractional part. Nevertheless, the number of block RAM/FIFO used is the same. This is due to the architecture of the selected Virtex-5 FPGA.
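Eq. (6) translates directly into a host-side check. The sketch below is ours, not code from the paper; the array names, the fixed dimensions (100 signals of 1024 bins), and the 20*log10 dB convention are assumptions.

#include <math.h>

/* Host-side check corresponding to Eq. (6): average relative error of the HLS FFT
 * against the double-precision reference, reported in dB. */
double relative_error_db(const double Xref[100][1024], const double Xhls[100][1024],
                         const double Yref[100][1024], const double Yhls[100][1024])
{
    double err = 0.0;
    for (int n = 0; n < 100; n++)
        for (int k = 0; k < 1024; k++)
            err += fabs(Xref[n][k] - Xhls[n][k]) / fabs(Xref[n][k])
                 + fabs(Yref[n][k] - Yhls[n][k]) / fabs(Yref[n][k]);
    err /= 100.0 * 1024.0;
    return 20.0 * log10(err);   /* assumption: dB defined as 20*log10(error) */
}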
Table 1. FFT hardware usage (22-bit data width) for different targeted frequencies.

Targeted frequency   Slice Registers   Slice LUTs   Block RAM/FIFO   DSP48E   Achieved frequency
50 MHz                          749         1700                2        4              50 MHz
75 MHz                          765         1769                2        4              75 MHz
100 MHz                         926         1967                2        4             100 MHz
125 MHz                        1042         1714                2        4             125 MHz
150 MHz                        1546         2004                2        4             150 MHz
175 MHz                        1380         1849                2        2             165 MHz
270 MHz                        1457         1989                2        2             180 MHz
Table 2. FFT hardware usage for different throughputs.

Targeted throughput   Slice Registers   Slice LUTs   Block RAM/FIFO   DSP48E
20926                            1546         2004                2        4
12207                            1351         1693                2        2
8616                             1186         1418                2        2
6658                             1161         1404                2        1

Figure 7. Relative error (in dB) for different bit sizes of the fractional part. [Plot not reproduced in this copy.]

code. Results of the generated FFT for a Virtex-5 FPGA have been presented. The FFT has a broad range of applications in digital signal processing and multimedia. It is a key component that determines most of the design metrics in many signal processing and communication applications. HLS tools facilitate complex algorithms to be realized at a higher level. They can reduce the design cycle significantly while generating results very close to hand-written HDL designs.

7. Acknowledgements

The authors would like to thank Xilinx, Inc. (www.xilinx.com) and Synopsys (www.synopsys.com) for their valuable support.
REFERENCES
[14] [Entry truncated in this copy] ...creases Productivity—A Case Study," IEEE International Symposium on VLSI Design, Automation and Test, Hsinchu, 28-30 April 2009, pp. 96-101. doi:10.1109/VDAT.2009.5158104
[15] Berkeley Design Technology, "An Independent Evaluation of High-Level Synthesis Tools for Xilinx FPGAs," http://www.bdti.com
[16] K. L. Man, "An Overview of SystemCFL," PhD Research in Microelectronics and Electronics, Vol. 1, 2005, pp. 145-148.
[17] P. Schumacher, M. Mattavelli, A. Chirila-Rus and R. Turney, "A Software/Hardware Platform for Rapid Prototyping of Video and Multimedia Designs," Proceedings of the Fifth International Workshop on System-on-Chip for Real-Time Applications, 20-24 July 2005, pp. 30-33. doi:10.1109/IWSOC.2005.27
[18] W. Chen (Ed.), "The VLSI Handbook," 2nd Edition, Chapter 86, CRC Press LLC, Boca Raton, 2007.
[19] S. Van Haastregt and B. Kienhuis, "Automated Synthesis of Streaming C Applications to Process Networks in Hardware," Proceedings of the Conference on Design Automation & Test in Europe, April 2009, pp. 890-893.
[20] P. Coussy and A. Morawiec, "High-Level Synthesis: From Algorithm to Digital Circuits," Springer Science + Business Media, Chapters 1, 4, Berlin, 2008.
[21] N. Hatami, A. Ghofrani, P. Prinetto and Z. Navabi, "TLM 2.0 Simple Sockets Synthesis to RTL," International Conference on Design & Technology of Integrated Systems in Nanoscale Era, Vol. 1, 2000, pp. 232-235.
[22] D. L. Jones, "FFT Reference C Code," University of Illinois at Urbana-Champaign, 1992.
[23] Xilinx Inc., "CoreGen," http://www.xilinx.com
deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing
transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We
systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS
code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation
can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining,
on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various
transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim
to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential
offered by spatial computing architectures using HLS.
In addition to identifying previous work that applies one or more of the transformations defined here, we describe and publish a set of end-to-end "hands-on" examples, optimized from naive HLS codes into high-performance implementations. This includes a stencil code, matrix multiplication, and the N-body problem, all available on github. The optimized codes exhibit dramatic cumulative speedups of up to 29,950x relative to their respective naive starting points, showing the crucial necessity of hardware-aware transformations, which are not performed automatically by today's HLS compilers. As FPGAs are currently the only platforms commonly targeted by HLS tools in the HPC domain, transformations are discussed and evaluated in this context. Evaluating FPGA performance in comparison to other platforms is out of scope of this work. Our work provides a set of guidelines and a cheat sheet for optimizing high-performance codes for reconfigurable architectures, guiding both performance engineers and compiler developers to efficiently exploit these devices.

1.1 From Imperative Code to Hardware

Before diving into transformations, it is useful to form an intuition of the major stages of the source-to-hardware stack, to understand how they are influenced by the HLS code: ❶ High-level synthesis converts a pragma-assisted procedural description (C++, OpenCL) to a functionally equivalent behavioral description (Verilog, VHDL). This requires mapping variables and operations to corresponding constructs, then scheduling operations according to their interdependencies. The dependency analysis is concerned with creating a hardware mapping such that the throughput requirements are satisfied, which for pipelined sections requires the circuit to accept a new input every cycle. Coarse-grained control flow is implemented with state machines, [...] registers, including the logic between them (i.e., the critical path of the circuit), will determine the maximum obtainable frequency. ❹ Bitstream generation translates the final circuit description to a binary format used to configure the device.

Most effort invested by an HLS programmer lies in guiding the scheduling process in ❶ to implement deep, efficient pipelines, but ❷ is considered when choosing data types and buffer sizes, and ❸ can ultimately bottleneck applications once the desired parallelism has been achieved, requiring the developer to adapt their code to aid this process.

1.2 Key Transformations for High-Level Synthesis

This work identifies a set of optimizing transformations that are essential to designing scalable and efficient hardware kernels in HLS. An overview is given in Tab. 1. We divide the transformations into three major classes: pipelining transformations, which enable or improve the potential for pipelining computations; scaling transformations, which increase or expose additional parallelism; and memory enhancing transformations, which increase memory utilization and efficiency. Each transformation is further classified according to a number of characteristic effects on the HLS source code and on the resulting hardware architecture (central columns). To serve as a cheat sheet, the table furthermore lists common objectives targeted by HLS programmers, and maps them to relevant HLS transformations (rightmost columns). Characteristics and objectives are discussed in detail in the relevant transformation sections.

TABLE 1: Overview of transformations, the characteristics of their effect on the HLS code and the resulting hardware, and the objectives that they can target. [Only the last three rows of the table body are legible in this copy: Mem. buffering (§4.2), Mem. striping (§4.3), and Type demotion (§4.4); the symbol legend indicating no/positive/very positive/situational/negative effects is also not reproduced legibly.] The center group of columns marks the following transformation characteristics: (PL) enables pipelining; (RE) increases data reuse, i.e., increases the arithmetic intensity of the code; (PA) increases or exposes more parallelism; (ME) optimizes memory accesses; (RS) does not significantly increase resource consumption; (RT) does not significantly impair routing, i.e., does not potentially reduce maximum frequency or prevent the design from being routed altogether; (SC) does not change the schedule of loop nests, e.g., by introducing more loops; and (CC) does not significantly increase code complexity. The right group of columns marks the following objectives that can be targeted by transformations: (LD) resolve loop-carried dependencies, due to inter-iteration dependencies or resource contention; (RE) increase data reuse; (CU) increase parallelism; (BW) increase memory bandwidth utilization; (PL) reduce pipelining overhead; (RT) improve routing results; (RS) reduce resource utilization.

1.3 The Importance of Pipelining

Pipelining is essential to efficient hardware architectures, as expensive instruction decoding and data movement between memory, caches, and registers can be avoided by sending data directly from one computational unit to the next. We attribute two primary characteristics to pipelines:
• Latency (L): the number of cycles it takes for an input to propagate through the pipeline and arrive at the exit, i.e., the number of pipeline stages.
• Initiation interval or gap (I): the number of cycles that must pass before a new input can be accepted to the pipeline. A perfect pipeline has I=1 cycle, as this is required to keep all pipeline stages busy. Consequently, the initiation interval can often be considered the inverse throughput of the pipeline; e.g., I=2 cycles implies that the pipeline stalls every second cycle, reducing the throughput of all pipeline stages by a factor of 1/2.
To quantify the importance of pipelining in HLS, we consider the number of cycles C it takes to execute a pipeline with latency L (both in [cycles]), taking N inputs, with an initiation interval of I [cycles]. Assuming a reliable producer and consumer at either end, we have:

C = L + I \cdot (N - 1) \; [\text{cycles}]. \qquad (1)

This is shown in Fig. 1. The time to execute all N iterations of this pipeline with clock rate f [cycles/s] is thus C/f.
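As a quick sanity check of Eq. (1), the following small helper (ours, not from the paper) evaluates the cycle count at compile time; for N = 1024 inputs and L = 100 cycles, I = 1 gives 1123 cycles, while I = 2 nearly doubles this for large N:

// Cycle count of a pipeline according to Eq. (1): C = L + I * (N - 1).
constexpr long PipelineCycles(long L, long I, long N) {
  return L + I * (N - 1);
}
static_assert(PipelineCycles(100, 1, 1024) == 1123, "I=1: a new input every cycle");
static_assert(PipelineCycles(100, 2, 1024) == 2146, "I=2: throughput halved for large N");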
previous iteration, which takes multiple cycles to complete (i.e., has multiple internal pipeline stages). If the latency of the operations producing this result is L, the minimum initiation interval of the pipeline will be L. This is a common scenario when accumulating into a single register (see Fig. 2), in cases where the accumulation operation takes Lacc > 1 cycles.
2) Interface contention (intra-iteration): a hardware resource with limited ports is accessed multiple times in the same iteration of the loop. This could be a FIFO queue or RAM that only allows a single read and write per cycle, or an interface to external memory, which only
1 for (int n = 0; n < N; ++n)
2   for (int m = 0; m < M; ++m) {
3     double acc = C[n][m];
4     #pragma PIPELINE
5     for (int k = 0; k < K; ++k)
6       acc += A[n][k] * B[k][m];
7     C[n][m] = acc; }
(a) Naive implementation of general matrix multiplication C = AB + C.

1 for (int n = 0; n < N; ++n) {
2   double acc[M]; // Uninitialized
3   for (int k = 0; k < K; ++k) {
4     double a = A[n][k]; // Only read once
5     #pragma PIPELINE
6     for (int m = 0; m < M; ++m) {
7       double prev = (k == 0) ? C[n][m]
8                              : acc[m];
9       acc[m] = prev + a * B[k][m]; } }
10  for (int m = 0; m < M; ++m) // Write
11    C[n][m] = acc[m]; } // out
(b) Transposed iteration space, same location written every M cycles.

1 for (int n = 0; n < N; ++n)
2   for (int m = 0; m < M/T; ++m) {
3     double acc[T]; // Tiles of size T
4     for (int k = 0; k < K; ++k) {
5       double a = A[n][k]; // M/T reads
6       #pragma PIPELINE
7       for (int t = 0; t < T; ++t) {
8         double prev = (k == 0) ?
9           C[n][m*T+t] : acc[t];
10        acc[t] = prev + a * B[k][m*T+t]; } }
11    for (int t = 0; t < T; ++t) // Write
12      C[n][m*T+t] = acc[t]; } // out
(c) Tiled iteration space, same location written every T cycles.
Listing 1: Interleave accumulations to remove loop-carried dependency.

• The loop-carried dependency is resolved: each location is only updated every M cycles (with M ≥ Lacc in Fig. 3).
• A, B, and C are all read in a contiguous fashion, achieving perfect spatial locality (we assume row-major memory layout; for column-major we would interchange the K-loop and N-loop).
• Each element of A is read exactly once.

The modified code is shown in Lst. 1b. We leave the accumulation buffer defined on line 2 uninitialized, and implicitly reset it on line 8, avoiding M extra cycles to reset (this is a form of pipelined loop fusion, covered in Sec. 2.4).

2.1.2 Tiled Accumulation Interleaving
For accumulations done in a nested loop, it can be sufficient to interleave across a tile of an outer loop to resolve a loop-carried dependency, using a limited size buffer to store intermediate results. This tile only needs to be of size ≥ Lacc, where Lacc is the latency of the accumulation operation. This is shown in Lst. 1c for the transposed matrix multiplication example from Lst. 1b, where the accumulation array has been reduced to tiles of size T (which should be ≥ Lacc, see Fig. 3), by adding an additional inner loop over the tile and cutting the outer loop by a factor of B.

2.1.3 Single-Loop Accumulation Interleaving
If no outer loop is present, we have to perform the accumulation in two separate stages, at the cost of extra resources. For the first stage, we perform a transformation similar to the nested accumulation interleaving, but strip-mine the inner (and only) loop into blocks of size K ≥ Lacc, accumulating partial results into a buffer of size K. Once all incoming values have been accumulated into the partial result buffers, the second phase collapses the partial results into the final output. This is shown in Lst. 2 for K=16. Optionally, the two stages can be implemented to run in a coarse-grained pipelined fashion, such that the first stage begins computing new partial results while the second stage is collapsing the previous results (by exploiting dataflow between modules, see Sec. 3.3).

2   double t[16];
3   #pragma PIPELINE
4   for (int i = 0; i < N; ++i) { // P0
5     auto prev = (i < 16) ? 0 : t[i%16];
6     t[i%16] = prev + arr[i]; }
7   double res = 0;
8   for (int i = 0; i < 16; ++i) // P1
9     res += t[i]; // Not pipelined
10  return res; }
Listing 2: Two stages required for single loop accumulation.

2.1.4 Batched Accumulation Interleaving
For algorithms with loop-carried dependencies that cannot be solved by either method above (e.g., due to a non-commutative accumulation operator), we can still pipeline across an additional loop nested in the accumulation loop. This procedure is similar to Sec. 2.1.2, but only applies to programs where it is relevant to compute the accumulation for multiple data streams, and requires altering the interface and data movement of the program to interleave inputs in batches.
The code in Lst. 3a shows an iterative solver code with an inherent loop-carried dependency on state, with a minimum initiation interval corresponding to the latency LStep of the (inlined) function Step. There are no loops to interchange, and we cannot change the order of loop iterations. While there is no way to improve the latency of producing a single result, we can improve the overall throughput by a factor of LStep by pipelining across N ≥ LStep different inputs (e.g., overlap solving for different starting conditions). We effectively inject another loop over inputs, then perform transposition or tiled accumulation interleaving with this loop. The result of this transformation is shown in Lst. 3b, for a variable number of interleaved inputs N.

1 Vec<double> IterSolver(Vec<double> state, int T) {
2   #pragma PIPELINE // Will fail to pipeline with I=1
3   for (int t = 0; t < T; ++t)
4     state = Step(state);
5   return state; }
(a) Solver executed for T steps with a loop-carried dependency on state.

1 template <int N>
2 void MultiSolver(Vec<double> *in,
3                  Vec<double> *out, int T) {
4   Vec<double> b[N]; // Partial results
5   for (int t = 0; t < T; ++t)
6     #pragma PIPELINE
7     for (int i = 0; i < N; ++i) {
8       auto read = (t == 0) ? in[i] : b[i];
9       auto next = Step(read);
10      if (t < T-1) b[i] = next;
11      else out[i] = next; }} // Write out
(b) Pipeline across N ≥ LStep inputs to achieve I=1 cycle.
Listing 3: Pipeline across multiple inputs to avoid loop-carried dependency.

2.2 Delay Buffering
When iterating over regular domains in a pipelined fashion, it is often sufficient to express buffering using delay buffers, expressed either with cyclically indexed arrays, or with
constant offset delay buffers, also known from the Intel ecosystem as shift registers. These buffers are only accessed in a FIFO manner, with the additional constraint that elements are only popped once they have fully traversed the depth of the buffer (or when they pass compile-time fixed access points, called "taps", in Intel OpenCL). Despite the "shift register" name, these buffers do not need to be implemented in registers, and are frequently implemented in on-chip RAM when large capacity is needed, where values are not physically shifted.

A common set of applications that adhere to the delay buffer pattern are stencil applications such as partial differential equation solvers [27], [28], [29], image processing pipelines [30], [31], and convolutions in deep neural networks [32], [33], [34], [35], [36], all of which are typically traversed using a sliding window buffer, implemented in terms of multiple delay buffers (or, in Intel terminology, a shift register with multiple taps). These applications have been shown to be a good fit for spatial computing architectures [37], [38], [39], [40], [41], [42], [43], as delay buffering is cheap to implement in hardware, either as shift registers in general purpose logic, or in RAM blocks.

Lst. 4b demonstrates the shift register pattern used to express the stencil buffering scheme, which is supported by the Intel OpenCL toolflow. Rather than creating each individual delay buffer required to propagate values, a single array is used, which is "shifted" every cycle using unrolling (lines 6-7). The computation accesses elements of this array using constant indices only (line 10), relying on the tool to infer the partitioning into individual buffers (akin to loop idiom recognition [25]) that we did explicitly in Lst. 4a. The implicit nature of this pattern requires the tool to specifically support it. For more detail on buffering stencil codes we refer to other works on the subject [44], [39].

Opportunities for delay buffering often arise naturally in pipelined programs. If we consider the transposed matrix multiplication code in Lst. 1b, we notice that the read from acc on line 8 and the write on line 9 are both sequential and cyclical, with a period of M cycles. We could therefore also use the shift register abstraction for this array. The same is true for the accumulation code in Lst. 3b.

[Listing 4 is not fully legible in this copy; only the sub-caption "(a) Delay buffering using cyclically indexed line buffers." and a few lines survive, e.g. "14 west = center; center = east; } } // Propagate registers".]

[Unplaced listing fragment from this page: "int bin = CalculateBin(memory[i]); hist[bin] += 1; // Single cycle access ... // write result out to memory".]
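Because Listing 4 survives only in fragments above, the following minimal sketch (ours, with illustrative names and coefficients) shows the delay-buffer idea for a 1D 3-point stencil; the 2D sliding-window buffers discussed above chain one such buffer per row:

// 3-point 1D stencil using a small delay buffer: in[] is streamed exactly once,
// and the last three values are kept in registers instead of being re-read.
void Stencil1D(const float in[], float out[], int N) {
  float window[3] = {0.0f, 0.0f, 0.0f};
  #pragma PIPELINE                       // pragma spelling follows the paper's listings
  for (int i = 0; i < N; ++i) {
    window[0] = window[1];               // shift the window (constant indices only)
    window[1] = window[2];
    window[2] = in[i];
    if (i >= 2)                          // window is valid from the third element on
      out[i - 1] = 0.25f * window[0] + 0.5f * window[1] + 0.25f * window[2];
  }
}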
1 // Pipelined loops executed sequentially
2 for (int i = 0; i < N0; ++i) Foo(i, /*...*/);
3 for (int i = 0; i < N1; ++i) Bar(i, /*...*/);
(a) (L0 + I0(N0−1)) + (L1 + I1(N1−1)) cycles.

1 for (int i = 0; i < N0+N1; ++i) {
2   if (i < N0) Foo(i, /*...*/);
3   else Bar(i - N0, /*...*/); }
(b) L2 + I(N0 + N1 − 1) cycles.

1 for (int i = 0; i < max(N0, N1); ++i) {
2   if (i < N0) Foo(i, /*...*/); // Omit ifs
3   if (i < N1) Bar(i, /*...*/); } // for N0==N1
(c) L3 + I · (max(N0, N1) − 1) cycles.

Listing 5: Two subsequent pipelined loops fused sequentially (Lst. 5b) or concurrently (Lst. 5c). Assume that all loops are pipelined (pragmas omitted for brevity).
For two consecutive loops with latencies/bounds/initiation intervals {L0, N0, I0} and {L1, N1, I1} (Lst. 5a), respectively, the total runtime according to Eq. 1 is (L0 + I0(N0−1)) + (L1 + I1(N1−1)). Depending on which condition(s) are met, we can distinguish between three levels of pipelined loop fusion, with increasing performance benefits:
1) I = I0 = I1 (true in most cases): Loops can be fused by summing the loop bounds, using loop guards to sequentialize them within the same pipeline (Lst. 5b).
2) Condition 1 is met, and only fine-grained or no dependencies exist between the two loops: Loops can be fused by iterating to the maximum loop bound, and loop guards are placed as necessary to predicate each section (Lst. 5c).
3) Conditions 1 and 2 are met, and N = N0 = N1 (same loop bounds): Loop bodies can be trivially fused (Lst. 5c, but with no loop guards necessary).
An alternative way of performing pipeline fusion is to instantiate each stage as a separate processing element, and stream fine-grained dependencies between them (Sec. 3.3).
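As a rough illustration with assumed numbers, take L0 = L1 = 100, I0 = I1 = 1, and N0 = N1 = 1000. Executing the loops back-to-back (Lst. 5a) costs (100 + 999) + (100 + 999) = 2198 cycles. Fusing by summing the bounds (Lst. 5b) costs about 100 + 1999 = 2099 cycles, saving one fill/drain phase, while fusing to the maximum bound (Lst. 5c) costs about 100 + 999 = 1099 cycles, nearly halving the runtime (assuming the fused pipeline latencies remain comparable to L0 and L1).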
2.5 Pipelined Loop Switching

The benefits of pipelined loop fusion can be extended to coarse-grained control flow by using loop switching (as opposed to loop unswitching, which is a common transformation [25] on load/store architectures). Whereas instruction-based architectures attempt to only execute one branch of a conditional jump (via branch prediction on out-of-order processors), a conditional in a pipelined scenario will result in both branches being instantiated in hardware, regardless of whether or how often each is executed. The transformation of coarse-grained control flow into fine-grained control flow is implemented by the HLS tool by introducing predication to the pipeline, at no significant runtime penalty.

Lst. 7 shows a simple example of how the transformation fuses two pipelined loops in different branches into a single loop switching pipeline. The transformation applies to any pipelined code in either branch, following the principles described for pipelined loop fusion (§2.4 and Lst. 5).

(a) Coarse-grained control flow:
1 if (condition)
2   #pragma HLS PIPELINE
3   for (int i = 0; i < N0; ++i)
4     y[i] = Foo(x[i]);
5 else
6   #pragma HLS PIPELINE
7   for (int i = 0; i < N1; ++i)
8     y[i] = Bar(x[i]);

(b) Control flow absorbed into pipeline:
1 auto N = condition ? N0 : N1;
2 #pragma HLS PIPELINE
3 for (int i = 0; i < N; ++i) {
4   if (condition)
5     y[i] = Foo(x[i]);
6   else
7     y[i] = Bar(x[i]); }

Listing 7: Pipelined loop switching absorbs coarse-grained control flow.

The implications of pipelined loop switching are more subtle than the pure fusion examples in Lst. 5, as the total number of loop iterations is not affected (assuming the fused loop bound is set according to the condition, see line 1 in Lst. 7b). There can be a (tool-dependent) benefit from saving overhead logic by only implementing the orchestration and interfaces of a single pipeline, at the (typically minor) cost of the corresponding predication logic. More importantly, eliminating the coarse-grained control can enable other transformations that significantly benefit performance, such as fusion [§2.4] with adjacent pipelined loops, flattening nested loops [§2.6], and on-chip dataflow [§3.3].

2.6 Pipelined Loop Flattening/Coalescing

To minimize the number of cycles spent in filling/draining pipelines (where the circuit is not streaming at full throughput), we can flatten nested loops to move the fill/drain phases to the outermost loop, fusing/absorbing code that is not in the innermost loop if necessary.

Lst. 8a shows a code with two nested loops, and gives the total number of cycles required to execute the program. The latency of the drain phase of the inner loop and the latency of Bar outside the inner loop must be paid at every iteration of the outer loop. If N0 ≫ L0, the cycle count becomes just L1 + N0·N1, but for applications where N0 is comparable to L0, draining the inner pipeline can significantly impact the runtime (even if N1 is large). By transforming the code such that all loops are perfectly nested (see Lst. 8b), the HLS tool can effectively coalesce the loops into a single pipeline, where the next iteration of the outer loop can be executed immediately after the previous one finishes.

(a) L1 + N1·(L0 + N0 − 1) cycles:
1 for (int i = 0; i < N1; ++i) {
2   #pragma PIPELINE
3   for (int j = 0; j < N0; ++j)
4     Foo(i, j);
5   Bar(i); }

(b) L2 + N0·N1 − 1 cycles:
1 for (int i = 0; i < N1; ++i) {
2   #pragma PIPELINE
3   for (int j = 0; j < N0; ++j) {
4     Foo(i, j);
5     if (j == N0 - 1) Bar(i); } }

Listing 8: Before and after coalescing the loop nest to avoid inner pipeline drains.

[Diagram: before coalescing, the nest executes as an outer state and inner states 0 and 1; after coalescing, a single state remains.]

To perform the transformation in Lst. 8, we had to absorb Bar into the inner loop, adding a loop guard (line 5 in Lst. 8b), analogous to pipelined loop fusion (§2.4), where the second pipelined "loop" consists of a single iteration. This contrasts with the loop peeling transformation, which is used by CPU compilers to regularize loops to avoid branch mispredictions and to increase amenability to vectorization. While loop peeling can also be beneficial in hardware, e.g., to avoid deep conditional logic in a pipeline, small inner loops can see a significant performance improvement from eliminating the draining phase.

2.7 Inlining

In order to successfully pipeline a scope, all function calls within the code section must be pipelineable. This typically requires the called functions to be inlined, with additional resources consumed for every additional callsite after the first. This replication is done automatically by HLS compilers on demand, but an additional inline pragma can be specified to directly "paste" the function body into the callsite during preprocessing, removing the function boundary during optimization and scheduling.
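As a small illustration (a sketch under Vivado HLS conventions; the function names are placeholders), the inline pragma removes the function boundary so that the surrounding loop can be scheduled as a single pipeline:

float MultiplyAdd(float a, float x, float y) {
  #pragma HLS INLINE   // Paste the body into every callsite before scheduling
  return a * x + y;
}

void Saxpy(const float x[], const float y[], float out[], int N, float alpha) {
  #pragma PIPELINE
  for (int i = 0; i < N; ++i)
    out[i] = MultiplyAdd(alpha, x[i], y[i]); // Pipelined as if written inline
}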
3 SCALABILITY TRANSFORMATIONS

Parallelism in HLS revolves around the folding of loops, achieved through unrolling. In Sec. 2.1 we used strip-mining and reordering to avoid loop-carried dependencies by changing the schedule of computations in the pipelined loop nest. In this section, we similarly strip-mine and reorder loops, but with additional unrolling of the strip-mined chunks. Pipelined loops constitute the iteration space, the size of which determines the number of cycles it takes to execute the program. Unrolled loops, in a pipelined program, correspond to the degree of parallelism in the architecture, as every expression in an unrolled statement is required to exist as hardware. Parallelizing a code thus means turning sequential/pipelined loops fully or partially into parallel/unrolled loops. This corresponds to folding the sequential iteration space, as the number of cycles taken to execute the program is effectively reduced by the inverse of the unrolling factor.

(a) Before. (b) Horizontal unroll. (c) Vertical unroll. (d) Dataflow.
Fig. 5: Horizontal unrolling, vertical unrolling, and dataflow, as means to increase parallelism. Rectangles represent buffer space, such as registers or on-chip RAM. Horizontal: four independent inputs processed in parallel. Vertical: one input is combined with multiple buffered values. Dataflow: similar to vertical, but input or partial results are streamed through a pipeline rather than broadcast.

3.1 Horizontal Unrolling (Vectorization)

We implement vectorization-style parallelism with HLS by "horizontally" unrolling loops in pipelined sections, or by introducing vector types, folding the sequential iteration space accordingly. This is the most straightforward way of adding parallelism, as it can often be applied directly to an inner loop without further reordering or drastic changes to the nested loop structure. Vectorization is more powerful in HLS than SIMD operations on load/store architectures, as the unrolled compute units are not required to be homogeneous, and the number of units is not constrained to fixed sizes. Horizontal unrolling increases bandwidth utilization by explicitly exploiting spatial locality, allowing more efficient accesses to off-chip memory such as DRAM.

Lst. 9 shows two functionally equivalent ways of vectorizing a loop over N elements by a horizontal unrolling factor of W. Lst. 9a strip-mines a loop into chunks of W and unrolls the inner loop fully, while Lst. 9b uses partial unrolling by specifying an unroll factor in the pragma. As a third option, explicit vector types can be used, such as those built into OpenCL (e.g., float4 or int16), or custom vector classes [48]. These provide less flexibility, but are more concise and are sufficient for most applications.

[Listing 9: Two variants of vectorization by factor W using loop unrolling. (a) Using strip-mining: C[i*W + w] = A[i*W + w]*B[i*W + w]; (b) Using partial unrolling: C[i] = A[i] * B[i];]
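The two variants of Lst. 9 can be sketched as follows (our own reconstruction in the style of the paper's listings, not the original code; W, the function names, and the spelling of the factor clause are assumptions):

constexpr int W = 4; // Vectorization factor (assumed compile-time constant)

// (a) Strip-mine into chunks of W and fully unroll the inner loop.
void VectorizedA(const float A[], const float B[], float C[], int N) {
  #pragma PIPELINE
  for (int i = 0; i < N / W; ++i)
    #pragma UNROLL
    for (int w = 0; w < W; ++w)
      C[i*W + w] = A[i*W + w] * B[i*W + w];
}

// (b) Partially unroll the original loop by a factor of W.
void VectorizedB(const float A[], const float B[], float C[], int N) {
  #pragma PIPELINE
  #pragma UNROLL FACTOR=W // Spelling of the factor clause is tool-dependent
  for (int i = 0; i < N; ++i)
    C[i] = A[i] * B[i];
}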
In practice, the unrolling factor W [operand/cycle] is constrained by the bandwidth B [Byte/s] available to the compute logic (e.g., from off-chip memory), according to Wmax = ⌊B/(f·S)⌋, where f [cycle/s] is the clock frequency of the unrolled logic, and S [Byte/operand] is the operand size in bytes. Horizontal unrolling is usually not sufficient to achieve high logic utilization on large chips, where the available memory bandwidth is low compared to the available amount of compute logic. Furthermore, because the energy cost of I/O is orders of magnitude higher than moving data on the chip, it is desirable to exploit on-chip memory and pipeline parallelism instead (this follows in Sec. 3.2 and 3.3).
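For a rough illustration with assumed numbers: a single DDR4-2400 bank provides B ≈ 19.2 GByte/s, so at f = 300 MHz and S = 4 Byte (e.g., single precision floating point), Wmax = ⌊19.2·10^9 / (300·10^6 · 4)⌋ = 16 operands per cycle, and unrolling wider than 16 would only leave compute units starved for data.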
3.2 Vertical Unrolling

We can achieve scalable parallelism in HLS without relying on external memory bandwidth by exploiting data reuse, distributing input elements to multiple computational units replicated "vertically" through unrolling [49], [38], [50]. This is the most potent source of parallelism on hardware architectures, as it can conceptually scale indefinitely with the available silicon when enough reuse is possible. Viewed from the paradigm of cached architectures, the opportunity for this transformation arises from temporal locality in loops. Vertical unrolling draws on bandwidth from on-chip fast memory by storing more elements temporally, combining them with new data streamed in from external memory to increase parallelism, allowing more computational units to run in parallel at the expense of buffer space. In comparison, horizontal unrolling requires us to widen the data path that passes through the processing elements (compare Fig. 5b and 5c).

When attempting to parallelize a new algorithm, identifying a source of temporal parallelism to feed vertical unrolling is essential to whether the design will scale. Programmers should consider this carefully before designing the hardware architecture. From a reference software code, the programmer can identify scenarios where reuse occurs, then extract and explicitly express the temporal access pattern in hardware, using a delay buffering [§2.2] or random-access [§2.3] buffering scheme. Then, if additional reuse is possible, vertically unroll the circuit to scale up performance.

1 for (int n = 0; n < N / P; ++n) { // Folded by unrolling factor P
2   for (int m = 0; m < M / T; ++m) { // Tiling
3     double acc[T][P]; // Is now 2D
4     // ...initialize acc from C...
5     for (int k = 0; k < K; ++k) {
6       double a_buffer[P]; // Buffer multiple elements to combine with
7       #pragma PIPELINE   // incoming values of B in parallel
8       for (int p = 0; p < P; ++p)
9         a_buffer[p] = A[n*P + p][k];
10      #pragma PIPELINE
11      for (int t = 0; t < T; ++t) // Stream tile of B
12        #pragma UNROLL
13        for (int p = 0; p < P; ++p) // P-fold vertical unrolling
14          acc[t][p] += a_buffer[p] * B[k][m*T + t];
15    } /* ...write back 2D tile of C... */ } }

Listing 10: P-fold vertical unrolling of matrix multiplication.

As an example, we return to the matrix multiplication code from Lst. 1c. In Sec. 2.1.2, we saw that strip-mining
and reordering loops allowed us to move reads from matrix A out of the inner loop, re-using the loaded value across T different entries of matrix B streamed in, while keeping the element of A in a register. Since every loaded value of B eventually needs to be combined with all N rows of A, we realize that we can perform more computations in parallel by keeping multiple values of A in local registers. The result of this transformation is shown in Lst. 10. By buffering P elements (where P was 1 in Lst. 1c) of A prior to streaming in the tile of the B-matrix (lines 8-9), we can fold the outer loop over rows by a factor of P, using unrolling to multiply parallelism (as well as the buffer space required for the partial sums) by a factor of P (lines 12-14).

3.3 Dataflow

For complex codes it is common to partition functionality into multiple modules, or processing elements (PEs), streaming data between them through explicit interfaces. In contrast to conventional pipelining, PEs arranged in a dataflow architecture are scheduled separately when synthesized by the HLS tool. There are multiple benefits to this:
• Different functionality runs at different schedules. For example, issuing memory requests, servicing memory requests, and receiving requested memory can all require different pipelines, state machines, and even clock rates.
• Smaller components are more modular, making them easier to reuse, debug and verify.
• The effort required by the HLS tool to schedule code sections increases dramatically with the number of operations that need to be considered for the dependency and pipelining analysis. Scheduling logic in smaller chunks is thus beneficial for compilation time.
• Large fan-out/fan-in is challenging to route on real hardware (i.e., 1-to-N or N-to-1 connections for large N). This is mitigated by partitioning components into smaller parts and adding more pipeline stages.
• The fan-in and fan-out of control signals (i.e., stall, reset) within each module is reduced, reducing the risk of these signals constraining the maximum achievable frequency.

To move data between PEs, communication channels with a handshake mechanism are used. These channels double as synchronization points, as they imply a consensus on the program state. In practice, channels are always FIFO interfaces, and support the standard queue operations Push and Pop, and sometimes Empty, Full, and Size operations. They occupy the same register or block memory resources as other buffers (Sec. 2.2/Sec. 2.3).
The mapping from source code to PEs differs between HLS tools, but is manifested when functions are connected using channels. In the following example, we will use the syntax from Xilinx Vivado HLS to instantiate PEs, where each non-inlined function corresponds to a PE, and these are connected by channels that are passed as arguments to the functions from a top-level entry function. Note that this functionally diverges from C++ semantics without additional abstraction [48], as each function in the dataflow scope is executed in parallel in hardware, rather than in the sequence specified in the imperative code. In Intel OpenCL, dataflow semantics are instead expressed with multiple kernel functions, each defining a PE, which are connected by global channel objects prefixed with the channel keyword.

To see how streaming can be an important tool to express scalable hardware, we apply it in conjunction with vertical unrolling (Sec. 3.2) to implement an iterative version of the stencil example from Lst. 4. Unlike the matrix multiplication code, the stencil code has no scalable source of parallelism in the spatial dimension. Instead, we can achieve reuse by folding the outer time-loop to treat P consecutive timesteps in a pipeline parallel fashion, each computed by a distinct PE, connected in a chain via channels [37], [51], [38]. We replace the memory interfaces to the PE with channels, such that the memory read and write become Pop and Push operations, respectively. The resulting code is shown in Lst. 11a. We then vertically unroll to generate P instances of the PE (shown in Lst. 11b), effectively increasing the throughput of the kernel by a factor of P, and consequently reducing the runtime by folding the outermost loop by a factor of P (line 3 in Lst. 11a). Such architectures are sometimes referred to as systolic arrays [52], [53].

For architectures/HLS tools where large fan-out is an issue for compilation or routing, an already replicated design can be transformed to a dataflow architecture. For example, in the matrix multiplication example in Lst. 10, we can move the P-fold unroll out of the inner loop, and replicate the entire PE instead, replacing reads and writes with channel accesses [50]. B is then streamed into the first PE, and passed downstream every cycle. A and C should no longer be accessed by every PE, but rather be handed downstream similar to B, requiring a careful implementation of the start and drain phases, where the behavior of each PE will vary slightly according to its depth in the sequence.

1 void PE(FIFO<float> &in, FIFO<float> &out, int T) {
2   // ...initialization...
3   for (int t = 0; t < T / P; ++t) // Fold timesteps T by factor P
4     #pragma PIPELINE
5     for (/* loops over spatial dimensions */) {
6       auto south = in.Pop(); // Value for t-1 from previous PE
7       // ...load values from delay buffers...
8       auto next = 0.25*(north + west + east + south);
9       out.Push(next); }} // Value for t sent to PE computing t+1

(a) Processing element for a single timestep. Will be replicated P times.

1 #pragma DATAFLOW // Schedule nested functions as parallel modules
2 void SystolicStencil(const float in[], float out[], int T) {
3   FIFO<float> pipes[P + 1]; // Assume P is given at compile time
4   ReadMemory(in, pipes[0]); // Head
5   #pragma UNROLL // Replicate PEs
6   for (int p = 0; p < P; ++p)
7     PE(pipes[p], pipes[p + 1], T); // Forms a chain
8   WriteMemory(pipes[P], out); } // Tail

(b) Instantiate and connect P consecutive and parallel PEs.

Listing 11: Dataflow between replicated PEs to compute P timesteps in parallel.

3.4 Tiling

Loop tiling in HLS is commonly used to fold large problem sizes into manageable chunks that fit into fast on-chip memory, in an already pipelined program [38]. Rather than making the program faster, this lets the already fast architecture support arbitrarily large problem sizes. This is in contrast to loop tiling on CPU and GPU, where tiling is used to increase performance. Common to both paradigms is that they fundamentally aim to meet fast memory constraints. As with horizontal and vertical unrolling, tiling relies on strip-mining loops to alter the iteration space.
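As a minimal sketch of this pattern (our own example in the style of the paper's listings; the tile size and the elementwise computation are assumptions), a problem of arbitrary size N is processed through a fixed-size on-chip tile:

constexpr int T = 1024; // Tile size chosen to fit in on-chip memory

void ProcessTiled(const float in[], float out[], int N) { // Assumes N is a multiple of T
  float tile[T]; // On-chip buffer whose size is independent of N
  for (int t = 0; t < N / T; ++t) {
    #pragma PIPELINE
    for (int i = 0; i < T; ++i)  // Load one tile into fast memory
      tile[i] = in[t*T + i];
    #pragma PIPELINE
    for (int i = 0; i < T; ++i)  // Compute on the tile (which could be reused many times)
      out[t*T + i] = 0.5f * tile[i] + 1.0f;
  }
}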
Tiling was already shown in Sec. 2.1.2, when the accumulation buffer in Lst. 1b was reduced to a tile buffer in
Lst. 1c, such that the required buffer space used for partial results became a constant, rather than being dependent on the input size. This transformation is also relevant to the stencil codes in Lst. 4, where it can be used similarly to restrict the size of the line buffers or shift register, so they are no longer proportional to the problem size.

4 MEMORY ACCESS TRANSFORMATIONS

When an HLS design has been pipelined, scheduled, and unrolled as desired, the memory access pattern has been established. In the following, we describe transformations that optimize the efficiency of off-chip memory accesses in the HLS code. For memory bound codes in particular, this is critical for performance after the design has been pipelined.

4.1 Memory Access Extraction

By extracting accesses to external memory from the computational logic, we enable compute and memory accesses to be pipelined and optimized separately. Accessing the same interface multiple times within the same pipelined section is a common cause for poor memory bandwidth utilization and increased initiation interval due to interface contention, since the interface can only service a single request per cycle. In the Intel OpenCL flow, memory extraction is done automatically by the tool, but since this process must be conservative due to limited information, it is often still beneficial to do the extraction explicitly in the code [54]. In many cases, such as for independent reads, this is not an inherent memory bandwidth or latency constraint, but arises from the tool scheduling iterations according to program order. This can be relaxed when allowed by inter-iteration dependencies (which can in many cases be determined automatically, e.g., using polyhedral analysis [55]).

In Lst. 12a, the same memory (i.e., hardware memory interface) is accessed twice in the inner loop. In the worst case, the program will issue two 4 Byte memory requests every iteration, resulting in poor memory performance, and preventing pipelining of the loop. In software, this problem is typically mitigated by caches, always fetching at least one cache line. If we instead read the two sections of A sequentially (or in larger chunks), the HLS tool can infer two burst accesses to A of length N/2, shown in Lst. 12c. Since the schedules of memory and computational modules are independent, ReadA can run ahead of PE, ensuring that memory is always read at the maximum bandwidth of the interface (Sec. 4.2 and Sec. 4.3 will cover how to increase this bandwidth). From the point of view of the computational PE, both A0 and A1 are read in parallel, as shown on line 5 in Lst. 12b, hiding initialization time and inconsistent memory producers in the synchronization implied by the data streams.

[Listing 12 (only fragments recovered): (a) void PE(const int A[N], int B[N/2]) { #pragma PIPELINE // Achieves I=2 ...; the accompanying figure contrasts single-element accesses with burst accesses between DRAM and the compute module.]
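A hedged sketch of this extraction (our own, using Vivado HLS hls::stream objects rather than the paper's exact Listing 12; the chunk size, dataflow region, and function names are assumptions): a dedicated module reads the two halves of A in long sequential bursts and forwards them through FIFOs, so the computational module sees two parallel streams and never touches the memory interface directly.

#include <hls_stream.h>

constexpr int kChunk = 64; // Burst length; the FIFO depth must cover one chunk

void ReadA(const int A[], hls::stream<int> &a0, hls::stream<int> &a1, int N) {
  for (int c = 0; c < N / 2; c += kChunk) { // Assumes N is a multiple of 2*kChunk
    #pragma PIPELINE
    for (int i = 0; i < kChunk; ++i)   // One long burst from the first half of A
      a0.write(A[c + i]);
    #pragma PIPELINE
    for (int i = 0; i < kChunk; ++i)   // One long burst from the second half of A
      a1.write(A[N / 2 + c + i]);
  }
}

void PE(hls::stream<int> &a0, hls::stream<int> &a1, int B[], int N) {
  #pragma PIPELINE
  for (int i = 0; i < N / 2; ++i)      // The PE sees both halves as parallel streams
    B[i] = a0.read() + a1.read();
}

void Top(const int A[], int B[], int N) {
  #pragma HLS DATAFLOW                 // ReadA and PE are scheduled as separate modules
  hls::stream<int> a0, a1;
  #pragma HLS STREAM variable=a0 depth=64
  #pragma HLS STREAM variable=a1 depth=64
  ReadA(A, a0, a1, N);
  PE(a0, a1, B, N);
}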
An important use case of memory extraction appears in the stencil code in Lst. 11, where it is necessary to separate the memory accesses such that the PEs are agnostic of whether data is produced/consumed by a neighboring PE or by a memory module. Memory access extraction is also useful for performing data layout transformations in fast on-chip memory. For example, we can change the schedule of reads from A in Lst. 10 to a more efficient scheme by buffering values in on-chip memory, while streaming them to the kernel according to the original schedule.

4.2 Memory Buffering

When dealing with memory interfaces with an inconsistent data rate, such as DRAM, it can be beneficial to request and buffer accesses earlier and/or at a more aggressive pace than what is consumed or produced by the computational elements. For memory reads, this can be done by reading ahead of the kernel into a deep buffer instantiated between memory and computations, by either 1) accessing wider vectors from memory than required by the kernel, narrowing or widening data paths (a.k.a. "gearboxing") when piping to or from computational elements, respectively, or 2) increasing the clock rate of modules accessing memory with respect to the computational elements.

The memory access function in Lst. 12c allows long bursts to the interface of A, but receives the data on a narrow bus at W·Sint = (1·4) Byte/cycle. In general, this limits the bandwidth consumption to f·W·Sint at frequency f, which is likely to be less than what the external memory can provide.
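A hedged sketch of option 1) above (our own, assuming Vivado HLS ap_uint and hls::stream types; the 512-bit bus width, the element type, and the FIFO depth are assumptions): a reader fetches wide words from memory into a deep FIFO, and a separate module unpacks ("gearboxes") them into the narrow element stream consumed by the kernel.

#include <ap_int.h>
#include <hls_stream.h>

constexpr int kBusBits = 512;            // Width of the memory interface (assumed)
constexpr int kPerBus = kBusBits / 32;   // 32-bit elements packed per bus word

void ReadWide(const ap_uint<kBusBits> mem[], hls::stream<ap_uint<kBusBits> > &wide,
              int numWords) {
  #pragma PIPELINE
  for (int i = 0; i < numWords; ++i)     // Long bursts at the full bus width
    wide.write(mem[i]);
}

void Gearbox(hls::stream<ap_uint<kBusBits> > &wide, hls::stream<ap_uint<32> > &narrow,
             int numWords) {
  ap_uint<kBusBits> buffer;
  #pragma PIPELINE
  for (int i = 0; i < numWords * kPerBus; ++i) {
    const int j = i % kPerBus;
    if (j == 0) buffer = wide.read();    // One wide word feeds kPerBus narrow cycles
    ap_uint<32> element = buffer.range(32 * j + 31, 32 * j);
    narrow.write(element);               // Narrow stream consumed by the kernel
  }
}

void ReadBuffered(const ap_uint<kBusBits> mem[], hls::stream<ap_uint<32> > &toKernel,
                  int numWords) {
  #pragma HLS DATAFLOW
  hls::stream<ap_uint<kBusBits> > wide;
  #pragma HLS STREAM variable=wide depth=512 // Deep FIFO absorbs the inconsistent DRAM rate
  ReadWide(mem, wide, numWords);
  Gearbox(wide, toKernel, numWords);
}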
To better exploit available bandwidth, we can either read [...]
[...] respective interfaces, pushing to FIFO buffers that are read in parallel and combined by another module (for writing: in reverse), exposing a single data stream to the computational kernel. This is illustrated in Fig. 6, where the unlabeled [...]

[Fig. 6 (only fragments recovered): memory accesses striped across the DDR banks DDR0-DDR3.]

[...] moved to a type that is natively supported by the target architecture, such as single precision floating point on Intel's Arria 10 and Stratix 10 devices [56]. Reducing the size of the data type benefits:
• Bandwidth bound architectures, where performance can be improved by up to the same factor that the size of the data type can be reduced by.
• Latency bound architectures, where the data type can be reduced to a lower latency operation, e.g., from floating point to integer.
In the most extreme case, it has been shown that collapsing the data type of weights and activations in deep neural networks to binary [34] can provide sufficient speedup for inference that the increased number of weights makes up for the loss of precision per weight.

5 SOFTWARE TRANSFORMATIONS IN HLS

In addition to the transformations described in the sections above, we include an overview of how well-known CPU-oriented transformations apply to HLS, based on the compiler transformations compiled by Bacon et al. [25]. These transformations are included in Tab. 2, and are partitioned into three categories:
• Transformations directly relevant to the HLS transformations already presented here.
• Transformations that are the same or similar to their software counterparts.
• Transformations with little or no relevance to HLS.

TABLE 2: The relation of traditional CPU-oriented transformations to HLS codes.
- Loop interchange [57], [47] is used to resolve loop-carried dependencies [§2].
- Strip-mining [58], loop tiling [59], [47], and cycle shrinking [60] are central components of many HLS transformations [§2.1, §3.1, §3.2, §2.1.2].
- Loop distribution and loop fission [61], [47] are used to separate differently scheduled function calls.
- I/O format compilation: no I/O is supported directly in HLS.
- Supercompiling is infeasible for HLS due to long synthesis times.
- Loop pushing/embedding: inlining completely is favored to allow pipelining.
- Automatic decomposition and alignment, scalar privatization, array privatization, cache alignment, and false sharing are not relevant for HLS, as there is no (implicit) cache coherency protocol in hardware.
- Procedure call parallelization and split do not apply, as there are no forks in hardware.
- Graph partitioning only applies to explicit dataflow languages.
- There are no instruction sets in hardware, so VLIW transformations do not apply.

It is interesting to note that the majority of well-known transformations from software apply to HLS. This implies that we can leverage much of decades of research into high-performance computing transformations to also optimize hardware programs, including many that can be applied directly (i.e., without further adaptation to HLS) to the imperative source code or intermediate representation before synthesizing for hardware. We stress the importance of support for these pre-hardware generation transformations in HLS compilers, as they lay the foundation for the hardware-specific transformations proposed here.

6 END-TO-END EXAMPLES

To showcase the transformations presented in this work and provide a "hands-on" opportunity for seeing HLS optimizations applied in practice, we will describe the optimization process on a sample set of classical HPC kernels, available as open source repositories on GitHub (https://github.com/spcl?q=hls). These kernels are
written in C++ for Xilinx Vivado HLS [12] with hlslib [48] extensions, and are built and run using the Xilinx Vitis environment. For each example, we will describe the sequence of transformations applied, and give the resulting performance at each major stage.

The included benchmarks were run on an Alveo U250 board, which houses a Xilinx UltraScale+ XCU250-FIGD2104-2L-E FPGA and four 2400 MT/s DDR4 banks (we utilize 1-2 banks for the examples here). The chip consists of four almost identical chiplets with limited interconnect between them, where each chiplet is connected to one of the DDR4 pinouts. This multi-chiplet design allows more resources (1728K LUTs and 12,288 DSPs), but poses challenges for the routing process, which impedes the achievable clock rate and resource utilization for a monolithic kernel attempting to span the full chip. Kernels were compiled for the xilinx_u250_xdma_201830_2 shell with Vitis 2019.2 and executed with version 2.3.1301 of the Xilinx Runtime (XRT). All benchmarks are included in Fig. 7, and the [...]

Fig. 7: Performance progression of kernels when applying transformations. Parentheses show speedup over previous version, and cumulative speedup. (Bar chart in GOp/s for the Stencil, Matrix Multiplication, and N-Body kernels across the Naive, Pipelined, Vectorized, and Systolic variants.)

Fig. 8: Resource usage of kernels from Fig. 7 as fractions of available resources. The maxima are taken as 1728K LUTs, 12,288 DSPs, and 2688 BRAM.
[...] into P parallel processing elements arranged in a systolic array. Each element holds T resident particles, and particles are streamed [§3.3] through the PEs. The second stage gains a factor of 4× corresponding to the latency of the interleaved accumulation, followed by a factor of 42× from unrolling units across the chip. T ≥ L+ can be used to regulate the arithmetic intensity of the kernel. The bandwidth requirements can be reduced further by storing more resident particles on the chip, scaling up to the full fast memory usage of the FPGA. The tiled accumulation interleaving transformation thus enables not just pipelining of the compute, but also minimization of I/O. The optimized implementation is available on GitHub (https://github.com/spcl/nbody_hls).

These examples demonstrate the impact of different transformations on a reconfigurable hardware platform. In particular, enabling pipelining, regularizing memory accesses, and vertical unrolling are shown to be central components of scalable hardware architectures. The dramatic speedups over naive codes also emphasize that HLS tools do not yield competitive performance out of the box, making it critical to perform further transformations. For additional examples of optimizing HLS codes, we refer to the numerous works applying HLS optimizations listed below.

7 RELATED WORK

Optimized applications. Much work has been done in optimizing C/C++/OpenCL HLS codes for FPGA, such as stencils [38], [39], [40], [74], [75], deep neural networks [76], [77], [35], [36], [34], matrix multiplication [78], [75], [50], [79], graph processing [80], [81], networking [82], light propagation for cancer treatment [46], and protein sequencing [49], [83]. These works optimize the respective applications using transformations described here, such as delay buffering, random access buffering, vectorization, vertical unrolling, tiling for on-chip memory, and dataflow.

Transformations. Zohouri et al. [84] use the Rodinia benchmark to evaluate the performance of OpenCL codes targeting FPGAs, employing optimizations such as SIMD vectorization, sliding-window buffering, accumulation interleaving, and compute unit replication across multiple kernels. We present a generalized description of a superset of these transformations, along with concrete code examples that show how they are applied in practice. The DaCe framework [85] exploits information on explicit dataflow and control flow to perform a wide range of transformations, and code generates efficient HLS code using vendor-specific pragmas and primitives. Kastner et al. [86] go through the implementation of many HLS codes in Vivado HLS, focusing on algorithmic optimizations. da Silva et al. [87] explore using modern C++ features to capture HLS concepts in a high-level fashion. Lloyd et al. [88] describe optimizations specific to Intel OpenCL, and include a variant of memory access extraction, as well as the single-loop variant of accumulation interleaving.

Directive-based frameworks. High-level, directive-based frameworks such as OpenMP and OpenACC have been proposed as alternative abstractions for generating FPGA kernels. Leow et al. [89] implement an FPGA code generator from OpenMP pragmas, primarily focusing on correctness in implementing a range of OpenMP pragmas. Lee et al. [90] present an OpenACC to OpenCL compiler, using Intel OpenCL as a backend. The authors implement horizontal and vertical unrolling, pipelining and dataflow by introducing new OpenACC clauses. Papakonstantinou et al. [91] generate HLS code for FPGA from directive-annotated CUDA code.

Optimizing HLS compilers. Mainstream HLS compilers automatically apply many of the well-known software transformations in Tab. 2 [22], [92], [93], but can also employ more advanced FPGA transformations. Intel OpenCL [19] performs memory access extraction into "load store units" (LSUs), does memory striping between DRAM banks, and detects and auto-resolves some buffering and accumulation patterns. The proprietary Merlin Compiler [94] uses high-level acceleration directives to automatically perform some of the transformations described here, as source-to-source transformations to underlying HLS code. Polyhedral compilation is a popular framework for optimizing CPU and GPU loop nests [55], and has also been applied to HLS for FPGA for optimizing data reuse [95]. Such techniques may prove valuable in automating, e.g., memory extraction and tiling transformations. While most HLS compilers rely strictly on static scheduling, Dynamatic [68] considers dynamically scheduling state machines and pipelines to allow reducing the number of stages executed at runtime.

Domain-specific frameworks. Implementing programs in domain specific languages (DSLs) can make it easier to detect and exploit opportunities for advanced transformations. Darkroom [30] generates optimized HDL for image processing codes, and the popular image processing framework Halide [31] has been extended to support FPGAs [96], [97]. Luzhou et al. [53] and StencilFlow [44] propose frameworks for generating stencil codes for FPGAs. These frameworks rely on optimizations such as delay buffering, dataflow, and vertical unrolling, which we cover here. Using DSLs to compile to structured HLS code can be a viable approach to automating a wide range of transformations, as proposed by Koeplinger et al. [98], and the FROST [99] DSL framework.

Other approaches. There are other approaches than C/C++/OpenCL-based HLS languages to addressing the productivity issues of hardware design: Chisel/FIRRTL [100], [101] maintains the paradigm of behavioral programming known from RTL, but provides modern language and compiler features. This caters to developers who are already familiar with hardware design, but wish to use a more expressive language. In the Maxeler ecosystem [102], kernels are described using a Java-based language, but rather than transforming imperative code into a behavioral equivalent, the language provides a DSL of hardware concepts that are instantiated using object-oriented interfaces. By constraining the input, this encourages developers to write code that maps well to hardware, but requires learning a new language exclusive to the Maxeler ecosystem.

8 TOOLFLOW OF XILINX VS. INTEL

When choosing a toolflow to start designing hardware with HLS, it is useful to understand the two distinct approaches taken by the two major vendors: Intel OpenCL wishes to enable writing accelerators using software, making an effort to abstract away low-level details about the hardware, and
present a high-level view to the programmer; whereas Xilinx' Vivado HLS provides a more productive way of writing hardware, by means of a familiar software language. Xilinx uses OpenCL as a vehicle to interface between FPGA and host, but implements the OpenCL compiler itself as a thin wrapper around the C++ compiler, whereas Intel embraces the OpenCL paradigm as their frontend (although they encourage writing single work item kernels [103], effectively preventing reuse of OpenCL kernels written for GPU).

Vivado HLS has a stronger coupling between the HLS source code and the generated hardware. This requires the programmer to write more annotations and boilerplate code, but can also give them a stronger feeling of control. Conversely, the Intel OpenCL compiler presents convenient abstracted views, saves boilerplate code (e.g., by automatically pipelining sections), and implements efficient substitutions by detecting common patterns in the source code (e.g., to automatically perform memory extraction [§4.1]). The downside is that developers end up struggling to write or generate code in a way that is recognized by the tool's "black magic", in order to achieve the desired result. Finally, Xilinx' choice to allow C++ gives Vivado HLS an edge in expressibility, as (non-virtual) objects and templating turn out to be a useful tool for abstracting and extending the language [48]. Intel offers a C++-based HLS compiler, but does not (as of writing) support direct interoperability with the OpenCL-driven accelerator flow.

9 CONCLUSION

The transformations known from software are insufficient to optimize HPC kernels targeting spatial computing systems. We have proposed a new set of optimizing transformations that enable efficient and scalable hardware architectures, and can be applied directly to the source code by a performance engineer, or automatically by an optimizing compiler. Performance and compiler engineers can benefit from these guidelines, transformations, and the presented cheat sheet as a common toolbox for developing high performance hardware using HLS.

ACKNOWLEDGEMENTS

This work was supported by the European Research Council under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880). The authors wish to thank Xilinx and Intel for helpful discussions; Xilinx for generous donations of software, hardware, and access to the Xilinx Adaptive Compute Cluster (XACC) at ETH Zurich; the Swiss National Supercomputing Center (CSCS) for providing computing infrastructure; and Tal Ben-Nun for valuable feedback on iterations of this manuscript.

REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," SIGARCH, 1995.
[2] M. Horowitz, "Computing's energy problem (and what we can do about it)," in ISSCC, 2014.
[3] D. D. Gajski et al., "A second opinion on data flow machines and languages," Computer, 1982.
[4] S. Sirowy and A. Forin, "Where's the beef? why FPGAs are so fast," MS Research, 2008.
[5] A. R. Brodtkorb et al., "State-of-the-art in heterogeneous computing," Scientific Programming, 2010.
[6] D. B. Thomas et al., "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," in FPGA, 2009.
[7] D. Bacon et al., "FPGA programming for the masses," CACM, 2013.
[8] G. Martin and G. Smith, "High-level synthesis: Past, present, and future," D&T, 2009.
[9] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment," TCAD, 2011.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools," TCAD, 2016.
[11] W. Meeus et al., "An overview of today's high-level synthesis tools," DAEM, 2012.
[12] Z. Zhang et al., "AutoPilot: A platform-based ESL synthesis system," in High-Level Synthesis, 2008.
[13] Intel High-Level Synthesis (HLS) Compiler. https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html. Accessed May 15, 2020.
[14] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA, 2011.
[15] Mentor Graphics. Catapult high-level synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls. Accessed May 15, 2020.
[16] C. Pilato et al., "Bambu: A modular framework for the high level synthesis of memory-intensive applications," in FPL, 2013.
[17] R. Nane et al., "DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler," in FPL, 2012.
[18] M. Owaida et al., "Synthesis of platform architectures from OpenCL programs," in FCCM, 2011.
[19] T. Czajkowski et al., "From OpenCL to high-performance hardware on FPGAs," in FPL, 2012.
[20] R. Nikhil, "Bluespec System Verilog: efficient, correct RTL from high level specifications," in MEMOCODE, 2004.
[21] J. Auerbach et al., "Lime: A Java-compatible and synthesizable language for heterogeneous architectures," in OOPSLA, 2010.
[22] ——, "A compiler and runtime for heterogeneous computing," in DAC, 2012.
[23] J. Hammarberg and S. Nadjm-Tehrani, "Development of safety-critical reconfigurable hardware with Esterel," FMICS, 2003.
[24] M. B. Gokhale et al., "Stream-oriented FPGA computing in the Streams-C high level language," in FCCM, 2000.
[25] D. F. Bacon et al., "Compiler transformations for high-performance computing," CSUR, 1994.
[26] S. Ryoo et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in PPoPP, 2008.
[27] G. D. Smith, Numerical solution of partial differential equations: finite difference methods, 1985.
[28] A. Taflove and S. C. Hagness, "Computational electrodynamics: The finite-difference time-domain method," 1995.
[29] C. A. Fletcher, Computational Techniques for Fluid Dynamics 2, 1988.
[30] J. Hegarty et al., "Darkroom: compiling high-level image processing code into hardware pipelines," TOG, 2014.
[31] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in PLDI, 2013.
[32] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," CSUR, 2019.
[33] G. Lacey et al., "Deep learning on FPGAs: Past, present, and future," arXiv:1602.04283, 2016.
[34] M. Courbariaux et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv:1602.02830, 2016.
[35] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," in FPGA, 2017.
[36] M. Blott et al., "FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks," TRETS, 2018.
[37] H. Fu and R. G. Clapp, "Eliminating the memory bottleneck: An FPGA-based solution for 3D reverse time migration," in FPGA, 2011.
[38] H. R. Zohouri et al., "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in FPGA, 2018.
[39] H. M. Waidyasooriya et al., "OpenCL-based FPGA-platform for stencil computation and its optimization methodology," TPDS, May 2017.
[40] Q. Jia and H. Zhou, "Tuning stencil codes in OpenCL for FPGAs," in ICCD, 2016.
[41] X. Niu et al., "Exploiting run-time reconfiguration in stencil computation," in FPL, 2012.
[42] ——, "Dynamic stencil: Effective exploitation of run-time resources in reconfigurable clusters," in FPT, 2013.
[43] J. Fowers et al., "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in FPGA, 2012.
[44] J. de Fine Licht et al., "StencilFlow: Mapping large stencil programs to distributed spatial computing systems," in CGO, 2021.
[45] X. Chen et al., "On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs," in FPL, 2019.
[46] T. Young-Schultz et al., "Using OpenCL to enable software-like development of an FPGA-accelerated biophotonic cancer treatment simulator," in FPGA, 2020.
[47] D. J. Kuck et al., "Dependence graphs and compiler optimizations," in POPL, 1981.
[48] J. de Fine Licht and T. Hoefler, "hlslib: Software engineering for hardware design," arXiv:1910.04436, 2019.
[49] S. O. Settle, "High-performance dynamic programming on FPGAs with OpenCL," in HPEC, 2013.
[50] J. de Fine Licht et al., "Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis," in FPGA, 2020.
[51] K. Sano et al., "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," TPDS, 2014.
[52] H. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Sparse Matrix Proceedings, 1978.
[53] W. Luzhou et al., "Domain-specific language and compiler for stencil computation on FPGA-based systolic computational-memory array," in ARC, 2012.
[54] T. Kenter et al., "OpenCL-based FPGA design to accelerate the nodal discontinuous Galerkin method for unstructured meshes," in FCCM, 2018.
[55] T. Grosser et al., "Polly – performing polyhedral optimizations on a low-level intermediate representation," PPL, 2012.
[56] U. Sinha, "Enabling impactful DSP designs on FPGAs with hardened floating-point implementation," Altera White Paper, 2014.
[57] J. R. Allen and K. Kennedy, "Automatic loop interchange," in SIGPLAN, 1984.
[58] M. Weiss, "Strip mining on SIMD architectures," in ICS, 1991.
[59] M. D. Lam et al., "The cache performance and optimizations of blocked algorithms," 1991.
[60] C. D. Polychronopoulos, "Advanced loop optimizations for parallel computers," in ICS, 1988.
[61] D. J. Kuck, "A survey of parallel machine organization and programming," CSUR, Mar. 1977.
[62] A. P. Yershov, "ALPHA – an automatic programming system of high efficiency," J. ACM, 1966.
[63] M. J. Wolfe, "Optimizing supercompilers for supercomputers," Ph.D. dissertation, 1982.
[64] J. J. Dongarra and A. R. Hinds, "Unrolling loops in Fortran," Software: Practice and Experience, 1979.
[65] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in PLDI, 1988.
[66] C. D. Polychronopoulos, "Loop coalescing: A compiler transformation for parallel machines," Tech. Rep., 1987.
[67] F. E. Allen and J. Cocke, A catalogue of optimizing transformations, 1971.
[68] L. Josipović et al., "Dynamically scheduled high-level synthesis," in FPGA, 2018.
[69] J. Cocke and K. Kennedy, "An algorithm for reduction of operator strength," CACM, 1977.
[70] R. Bernstein, "Multiplication by integer constants," Softw. Pract. Exper., 1986.
[71] G. L. Steele, "Arithmetic shifting considered harmful," ACM SIGPLAN Notices, 1977.
[72] A. V. Aho et al., "Compilers, principles, techniques," Addison Wesley, 1986.
[73] T. De Matteis et al., "Streaming message interface: High-performance distributed memory programming on reconfigurable hardware," in SC, 2019.
[74] D. Weller et al., "Energy efficient scientific computing on FPGAs using OpenCL," in FPGA, 2017.
[75] A. Verma et al., "Accelerating workloads on FPGAs via OpenCL: A case study with OpenDwarfs," Tech. Rep., 2016.
[76] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
[77] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in FPGA, 2017.
[78] E. H. D'Hollander, "High-level synthesis optimization for blocked floating-point matrix multiplication," SIGARCH, 2017.
[79] P. Gorlani et al., "OpenCL implementation of Cannon's matrix multiplication algorithm on Intel Stratix 10 FPGAs," in ICFPT, 2019.
[80] M. Besta et al., "Graph processing on FPGAs: Taxonomy, survey, challenges," arXiv:1903.06697, 2019.
[81] ——, "Substream-centric maximum matchings on FPGA," in FPGA, 2019.
[82] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines," in FCCM, 2019.
[83] E. Rucci et al., "Smith-Waterman protein search with OpenCL on an FPGA," in Trustcom/BigDataSE/ISPA, 2015.
[84] H. R. Zohouri et al., "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in SC, 2016.
[85] T. Ben-Nun et al., "Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures," in SC, 2019.
[86] R. Kastner et al., "Parallel programming for FPGAs," arXiv:1805.03648, 2018.
[87] J. S. da Silva et al., "Module-per-Object: a human-driven methodology for C++-based high-level synthesis design," in FCCM, 2019.
[88] T. Lloyd et al., "A case for better integration of host and target compilation when using OpenCL for FPGAs," in FSP, 2017.
[89] Y. Y. Leow et al., "Generating hardware from OpenMP programs," in FPT, 2006.
[90] S. Lee et al., "OpenACC to FPGA: A framework for directive-based high-performance reconfigurable computing," in IPDPS, 2016.
[91] A. Papakonstantinou et al., "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in SASP, 2009.
[92] S. Gupta et al., "SPARK: a high-level synthesis framework for applying parallelizing compiler transformations," in VLSID, 2003.
[93] ——, "Coordinated parallelizing compiler optimizations and high-level synthesis," TODAES, 2004.
[94] J. Cong et al., "Source-to-source optimization for HLS," in FPGAs for Software Programmers, 2016.
[95] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing," in FPGA, 2013.
[96] J. Pu et al., "Programming heterogeneous systems from an image processing DSL," TACO, 2017.
[97] J. Li et al., "HeteroHalide: From image processing DSL to efficient FPGA acceleration," in FPGA, 2020.
[98] D. Koeplinger et al., "Automatic generation of efficient accelerators for reconfigurable hardware," in ISCA, 2016.
[99] E. D. Sozzo et al., "A common backend for hardware acceleration on FPGA," in ICCD, 2017.
[100] J. Bachrach et al., "Chisel: constructing hardware in a Scala embedded language," in DAC, 2012.
[101] A. Izraelevitz et al., "Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations," in ICCAD, 2017.
[102] Maxeler Technologies, "Programming MPC systems (white paper)," 2013.
[103] Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide, UG-OCL003, revision 2020.04.1. Accessed May 15, 2020.

Johannes de Fine Licht is a PhD student at ETH Zurich. His research topics revolve around spatial computing systems in HPC, and include programming models, applications, libraries, and enhancing programmer productivity.

Maciej Besta is a PhD student at ETH Zurich. His research focuses on understanding and accelerating large-scale irregular graph processing in any type of setting and workload.

Simon Meierhans is studying for his MSc degree at ETH Zurich. His interests include randomized and deterministic algorithm and data structure design.

Torsten Hoefler is a professor at ETH Zurich, where he leads the Scalable Parallel Computing Lab. His research aims at understanding performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms.
Understanding the Potential of FPGA-Based Spatial
Acceleration for Large Language Model Inference
HONGZHENG CHEN, Cornell University, USA
JIAHAO ZHANG∗ , Tsinghua University, China
YIXIAO DU, SHAOJIE XIANG, and ZICHAO YUE, Cornell University, USA
NIANSONG ZHANG, YAOHUI CAI, and ZHIRU ZHANG, Cornell University, USA
Recent advancements in large language models (LLMs) boasting billions of parameters have generated
a significant demand for efficient deployment in inference workloads. While hardware accelerators for
Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal
architectures that reuse hardware units for different network layers and operators. However, these methods
often encounter challenges in achieving low latency due to considerable memory access overhead.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference
on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers,
facilitating direct communication between them through a dataflow architecture while minimizing off-chip
memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial
LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA.
This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can
identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine
the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.
To enable more productive implementations of an LLM model on FPGAs, we further provide a library of
high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as
open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented
BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach
can achieve up to 13.4× speedup when compared to previous FPGA-based accelerators for the BERT model.
For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill
stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA
A100 GPU in the decode stage.
CCS Concepts: • Hardware → Hardware-software codesign; • Computing methodologies → Neural
networks.
Additional Key Words and Phrases: FPGA, high-level synthesis, large language models, hardware acceleration
1 Introduction
The rapid advancement of Transformer-based large language models (LLMs) [5, 74] has sparked a
revolution across a wide range of natural language processing tasks, such as conversational AI [13,
54, 104] and code generation [10, 42, 52]. Recent research has brought to light the phenomenon of
“emergence” in LLMs, where advanced capabilities become evident as the models scale up to billions
of parameters [77, 78]. However, supporting this unprecedented scale poses significant challenges,
particularly in terms of computational and memory resources. At the same time, the increasing use
of LLMs in interactive applications like voice assistants and autonomous systems requires hardware
accelerators capable of providing both low latency and high energy efficiency [17, 54, 62].
Recent efforts have primarily focused on improving the performance of LLM inference on
GPUs [2, 53], although GPUs are known for their high power consumption and are less suitable
for latency-sensitive workloads [32, 62]. There is also an active body of research dedicated to
developing specialized hardware accelerators tailored for Transformer models, with several of these
efforts using FPGAs as the target platforms [23, 26, 39, 46, 59, 63].
∗ Work was done when interning at Cornell.
Fig. 1. Temporal and spatial architectures — PE stands for processing engine; f1-f4 represent different operators in the model.
FPGA-based LLM accelerators can be broadly categorized into two architectural paradigms:
temporal architecture and spatial architecture. In a temporal architecture, a processing engine
(PE) capable of performing various tasks is constructed and reused across different layers and
models, as shown in Figure 1(a). For flexibility, these accelerators typically employ an overlay
approach [23, 28, 39], where a virtual hardware architecture that executes instructions is “laid” on
top of the physical FPGA fabric. Overlays provide a more restricted configuration space, allowing
for quicker compilation with bitstream reuse across multiple models. However, the use of such
temporal architecture requires more frequent off-chip memory access, as intermediate results must
be written back to memory. This incurs a cost in terms of both latency and energy consumption
that is significantly higher than direct on-chip memory access. Additionally, one could argue that
an FPGA overlay will inherently be less efficient than its hardened ASIC counterpart.
In contrast, an FPGA-based spatial architecture typically involves the specialization of distinct
PEs for specific operators or layers, facilitating direct communication between them using streaming
buffers (e.g., FIFOs or multi-buffers) [60, 72, 75, 80], as depicted in Figure 1(b-c). This dataflow-style
execution substantially reduces off-chip memory accesses and enables the concurrent processing of
multiple PEs in a pipelined manner. Moreover, the fine-grained programmability of FPGAs allows
efficient support of model-specific spatial architectures, which can further leverage efficiency
optimizations such as low-bitwidth quantization, custom numerical types, and sparsity [58, 69, 93,
102]. These capabilities can potentially enable highly efficient LLM inference implementations that
surpass GPUs, especially in small-batch low-latency scenarios.
However, implementing a spatial architecture for LLM inference presents significant challenges.
Challenge 1: Navigating diverse parallelism in LLMs. The generative inference process of
LLMs typically consists of two distinct stages: (1) simultaneously processing user prompts and
(2) sequentially generating new tokens in an autoregressive manner. These two stages exhibit
significantly different computational and memory characteristics (detailed in §3), making it nec-
essary to tailor hardware accelerators for their specific needs. This challenge cannot be directly
addressed by leveraging techniques from the traditional convolutional neural network (CNN)
designs [32, 97]. The large number of parameters and intermediate tensors further complicates the
choice between on-chip and off-chip storage. Additionally, harnessing multiple accelerators for
distributed LLM inference adds complexity, particularly when dealing with intricate parallelization
schemes [23, 49, 68].
Challenge 2: Lack of standard LLM building blocks in hardware accelerators. The rapid
evolution of LLM architectures [5, 54, 70] contrasts with the comparatively slow pace of hardware
development. While a plethora of building blocks for Transformers have been proposed in the
software domain [14, 19, 38], the absence of reusable blocks for hardware accelerator design
hampers development progress. Many frameworks have been designed to automatically map deep
learning models to FPGAs [3, 20, 72, 98, 99], but they are constrained to small CNN designs and lack
support for complicated Transformer models. It is also hard to scale their designs to accommodate
large models and multi-die FPGAs.
To tackle these challenges, this paper provides a comprehensive set of hardware design considerations for LLMs and tries to answer the following question: What role can FPGA-based
spatial accelerators play in enabling efficient LLM inference? We start by conducting an in-depth
analysis of the computational and memory requirements associated with each operator within
Transformer models across two distinct stages of LLM generative inference – prefill and decode.
Subsequently, we extend our analysis to reveal the potential benefits of distributed inference using
multiple FPGAs. We believe that providing such an analysis, rather than presenting only
positive results in selectively chosen settings for an FPGA LLM accelerator, offers more
valuable insights to the community. To validate the feasibility of our analytical framework,
we implement a specific design point and demonstrate its viability. Leveraging this analytical
framework, we employ specific optimizations in HLS to craft each kernel and compose them into a
hardware accelerator that achieves the expected performance. While our primary focus is not to
propose a new LLM accelerator architecture, we demonstrate that by using the analytical model,
we can create a high-performance design that surpasses previous efforts. Our major contributions
are as follows:
• We introduce an analytical framework that presents the first in-depth analysis of both the
advantages and limitations of FPGA-based LLM spatial acceleration. This framework not
only allows us to estimate the performance of a specific accelerator configuration on a given
FPGA device but also provides guidance for designing accelerators for LLM inference.
• We create a suite of modular and reusable HLS kernels designed for building FPGA-based
spatial accelerators for different Transformer models. We plan to open-source this kernel
library1 and expect it to serve as a valuable resource for benchmarking HLS and FPGA
acceleration more broadly.
• Leveraging our kernel library, we design and implement a range of high-performance
FPGA-based LLM accelerators that achieve speedups comparable to previous GPU and
FPGA-based accelerators. Specifically, for the BERT model, we achieve a 13.4× speedup
over prior FPGA-based accelerators. For GPT generative inference, we achieve speedups
of 2.2× and 1.1× in prefill and decode stages respectively, when compared to DFX, an
FPGA-based overlay architecture. Additionally, our accelerator is 1.9× faster and 5.7× more
energy-efficient than the A100 GPU in the decode stage.
2 Background
This section provides backgrounds on Transformer models and introduces parallelization schemes
for LLM inference.
[Figure: prefill and decode passes through N Transformer layers (Linear, Softmax, LayerNorm & Add, FFN with GELU), producing the 1st and the 2nd output token, respectively.]
Fig. 2. Transformer model. Red blocks represent linear operators, and blue blocks signify non-linear operators.
generation [54, 65, 70]. We will mainly discuss decoder-only models in this paper, but since encoders
and decoders share the core building blocks with subtle architectural variances, our approach can
also be extended for encoder-only models [16, 36, 45].
As illustrated in Figure 2, generative inference of LLMs has two stages: prefill stage and decode
stage [62]. In the prefill stage, the model takes in user prompts, normally a long sequence with 𝑙 input
tokens, goes through the whole Transformer model, and generates the first token. In the decode
stage, the model takes in the previously generated token and generates 𝑙 gen new tokens one at a
time in an auto-regressive way. Since each token depends on the previously generated tokens, the
decode stage is purely sequential.
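To make the two stages concrete, the following minimal Python sketch shows a prefill pass over the whole prompt followed by strictly sequential decode steps. The model callable and the greedy sampling here are illustrative placeholders, not part of our implementation.

```python
# Minimal sketch of two-stage generative inference (illustrative only).
# `model` is a placeholder for a decoder-only Transformer that returns
# per-position logits and an updated KV cache; greedy sampling keeps
# the example short.
def generate(model, prompt_tokens, l_gen):
    # Prefill: process all l prompt tokens at once and emit the first token.
    logits, kv_cache = model(prompt_tokens, kv_cache=None)
    next_token = int(logits[-1].argmax())
    output = [next_token]
    # Decode: emit the remaining tokens one at a time, reusing the KV cache.
    for _ in range(l_gen - 1):
        logits, kv_cache = model([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output
```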
We then go through the detailed model architecture. The input tokens are first passed into an
embedding layer that maps the discrete tokens into high-dimensional continuous representations
while incorporating positional encoding for each token. Subsequently, it generates a tensor (i.e.,
hidden states) of shape (𝑙, 𝑑), where 𝑙 represents sequence length, and 𝑑 is the size of hidden
dimensions. We omit the batch dimension to simplify the analysis, focusing solely on single-batch
inference in this paper, but our approach can be easily extended to different batch sizes for LLM
serving by adding an additional batch dimension [34, 43].
The hidden states then pass through a series of 𝑁 Transformer blocks. Each Transformer block
consists of two sublayers: a multi-head attention (MHA) module and a feed-forward network (FFN).
Residual connections and layer normalization (LayerNorm) functions are applied between these
sublayers, although the specific order and application may vary across different models [91]. The
MHA module plays a crucial role in capturing token relationships within the input sequence. The
input is initially partitioned into ℎ segments, where ℎ corresponds to the number of attention heads.
To compute the attention scores for each head, the input sequence of length 𝑙 undergoes three
[Figure: a Transformer layer split across TP rank #1 and TP rank #2, with two all_reduce operations and SM, LN, and GL kernels on each device.]
Fig. 3. An example of tensor parallelism of a Transformer layer with two devices. TP rank is the unique identifier given to a device within a TP group. SM is the softmax function, LN is LayerNorm, and GL is the GeLU function.
linear projections: query, key, and value. These projections, which are trainable, yield matrices
𝑄, 𝐾, and 𝑉 respectively. Attention scores are then computed using a scaled dot-product (SDP)
operator between 𝑄, 𝐾, and 𝑉 , as specified by the formula:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \quad (1) \]
where 𝑑𝑘 is the size of the hidden dimension. This operator is computed independently for each of the ℎ heads, and the resulting ℎ outputs are subsequently concatenated and processed through an additional linear projection. In the prefill stage, the generated 𝐾 and 𝑉 tensors will be stored as the KV cache and later be concatenated before SDP during the decode stage [62].
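As a functional reference for Equation (1) and the KV cache, the NumPy sketch below computes scaled dot-product attention for a single head and shows how a decode step appends the new key/value rows to the cache. It is an unbatched illustration under simplified assumptions (no causal mask, no multi-head splitting) and does not reflect our HLS kernels.

```python
import numpy as np

def sdp_attention(Q, K, V):
    # Scaled dot-product attention, Equation (1), for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (l_q, l_kv)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V                              # (l_q, d_k)

def decode_step(q_new, k_new, v_new, kv_cache):
    # Append the new token's K/V rows to the cached K/V from earlier steps,
    # then attend from the single new query row over the full history.
    kv_cache["K"] = np.vstack([kv_cache["K"], k_new])
    kv_cache["V"] = np.vstack([kv_cache["V"], v_new])
    return sdp_attention(q_new, kv_cache["K"], kv_cache["V"]), kv_cache
```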
The FFN module comprises a linear layer followed by a non-linear activation function and
another linear layer. This module transforms the outputs of MHA into embedding matrices, which
are then further processed by subsequent Transformer layers.
Finally, the output tensor will go through a softmax function to obtain a distribution. The model
will sample a token from this distribution and feed it into the decode stage. For encoder-only
models, there is only a prefill stage involved, and the distribution will be directly used for different
downstream tasks like text classification [16, 36, 45].
In this paper, we only focus on analyzing the core Transformer blocks and accelerating them on
FPGAs. Embedding layers and output sampling [8, 25] require extensive random memory accesses, which may not be suitable for FPGA acceleration. Also, they account for only a small fraction of the overall compute and do not noticeably affect the overall latency [32], so we leave them to execute on CPUs or GPUs as usual.
Table 1. MACs of the prefill and decode stages of the linear layers in the Transformer model in Figure 2
— 𝑙 denotes input sequence length, 𝑑 denotes input feature dimension size, and 𝑑 FFN denotes FFN hidden
dimension size.
operations to ensure model correctness. Megatron-LM [68] is the first to explore tensor parallelism
for Transformer-based models, proving to be efficient in both training and inference due to relatively
low communication costs. As shown in Figure 3, tensor parallelism requires two all_reduce
operations inside a Transformer layer to ensure the results are correct. Our accelerator design also
explores tensor parallelism, as detailed in §3.4.2.
Lastly, pipeline parallelism [49, 50, 92] divides the model across network layers. Multiple layers
are grouped into a pipeline stage, and different stages are assigned to different devices. Pipeline
parallelism is typically employed across multiple nodes. Since both tensor parallelism and pipeline
parallelism handle only portions of the network model, they are collectively referred to as model
parallelism. We revisit these parallelization schemes in §3.4.2.
3.2.1 Compute Resource Constraints. The core computational element for linear operators is the
MAC unit. Let 𝑀𝑖 denote the compute power, in terms of the number of MACs per cycle allocated
to each matrix multiplication kernel, where 𝑖 ranges over 𝑞, 𝑘, 𝑣, 𝑎 1 , 𝑎 2 , 𝑝, 𝑓1 , and 𝑓2 , based on
the notation in Table 1. We quantize the matrix multiplication to integer inputs for maximum
efficiency, which has been proven to be effective by many recent studies [15, 30, 67, 81]. Quantization
enables single-cycle accumulation. As a result, one multiply-accumulator (MAC) unit can provide
a 1 MAC/cycle throughput with a properly pipelined multiplier. Therefore, the latency for the 𝑄
projection can be calculated as 𝑙𝑑 2 /𝑀𝑞 cycles, considering that the total number of MACs computed
in this operator is 𝑙𝑑 2 .
Suppose we want to deploy 𝐶 Transformer model layers on an FPGA. The total MAC units
must not exceed the capacity of the device. Since we employ a dataflow design that unfolds all the
layers on-board, the required MAC units are simply the sum of the MAC units for each layer. This
requirement can be expressed as:
\[ \sum_{i} M_i \, C < M_{\mathrm{tot}}, \quad i \in \{q, k, v, a_1, a_2, p, f_1, f_2\}, \quad (2) \]
where 𝑀tot represents the total available compute power of an FPGA in terms of MACs per cycle,
which can be obtained from the official data sheets. For FPGAs with specialized compute blocks
(e.g., AI Engine [86] and AI Tensor Blocks [37]), we can convert their compute power to match the frequency of the programmable logic, thus obtaining an overall value 𝑀tot for the entire FPGA. For example, the VCK5000 FPGA [86] has 400 AI Engines, each of which can compute 128 MACs/cycle at 1 GHz. Therefore, the equivalent compute power at 250 MHz is 128 × 400 × 1 GHz/250 MHz, which is 204800 MACs/cycle.
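A small helper that performs this normalization (illustrative, not part of our toolflow) reproduces the VCK5000 number from the text:

```python
def equivalent_macs_per_cycle(num_blocks, macs_per_cycle, block_freq_hz, fabric_freq_hz):
    # Normalize the compute power of specialized blocks (e.g., AI Engines)
    # to the clock frequency of the programmable logic.
    return num_blocks * macs_per_cycle * block_freq_hz / fabric_freq_hz

# VCK5000 example: 400 AI Engines, 128 MACs/cycle each at 1 GHz,
# referenced to a 250 MHz fabric clock.
M_tot = equivalent_macs_per_cycle(400, 128, 1e9, 250e6)   # 204800.0 MACs/cycle
```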
3.2.2 Memory Capacity Constraints. The demand for memory capacity stems from a variety of
on-chip buffers, including weight buffers for parameters, buffers for 𝐾 and 𝑉 matrices, and FIFOs
interconnecting different stages.
Parameter buffers. To optimize an FPGA-based dataflow design, we assume that all the quantized parameters can be accommodated in on-chip or off-chip memory. Suppose all the linear weights are quantized to 𝑏𝑊 bits, and the size of the linear operator 𝑖 is 𝑠𝑖 . The total size of the buffers is \( S_{\mathrm{param}} = \sum_{i \in \{q,k,v,p,f_1,f_2\}} s_i b_W = (4d^2 + 2d\, d_{\mathrm{FFN}}) b_W \) if storing on-chip. If the parameters are too large to fit in on-chip memory, we can store them in DRAM and tile the parameters with size 𝑀𝑖 on-chip; the total tiled buffer size is then \( S_{\mathrm{tile}} = \sum_{i \in \{q,k,v,p,f_1,f_2\}} M_i b_W \). To hide the memory access latency, we need to double buffer those parameters, so the final buffer size of the 𝑖-th linear operator is \( 2 S_{\mathrm{tile}} \).
KV Cache. When conducting matrix multiplication, at least one of the input matrices must be accessed repeatedly, so a buffer is required. Given that the parameters are already buffered, only the SDP requires buffering for at least one of its input matrices. In our case, we choose to buffer 𝐾 and 𝑉 , which will later be passed to the decode stage as the KV cache. We also double buffer the 𝐾 and 𝑉 matrices to improve throughput. The final buffer size is \( S_{\mathrm{KV}} = 4 l_{\max} d\, b_A \), where 𝑏𝐴 is the bitwidth of the activation and 𝑙max is the maximum sequence length supported by the model. Notice the KV cache can also be tiled on-chip, which admits a similar analysis to the one above.
FIFOs. The intermediate results between linear operators flow through FIFOs, since the linear operators access them sequentially. For the initial residual connection, we assume that the input tensors are
fetched from off-chip memory to obviate the need for additional buffering. However, for the second
residual connection related to the FFN, it is necessary to use an intermediate buffer to store the
projection’s activation 𝑋 act before the FFN. This buffer simultaneously serves as a bypass path. To
avoid deadlock, the buffer must possess sufficient capacity to store 𝑋 act . We simply create a FIFO
of size 𝑙𝑑𝑏𝐴 to store it. For other FIFO connections, we assume a FIFO depth of 𝑠 and one FIFO
connecting each layer in Figure 2, so the total FIFO size is equal to 𝑆 FIFO = 16𝑠𝑏𝐴 + 𝑙𝑑𝑏𝐴 .
In summary, the memory capacity constraint is expressed as:
\[ S_{\mathrm{param}}\, C < DRAM_{\mathrm{tot}}, \qquad \sum_{i} S_i\, C < SRAM_{\mathrm{tot}}, \quad i \in \{\mathrm{tile}, \mathrm{KV}, \mathrm{FIFO}\}, \quad (3) \]
if the parameters are stored off-chip. 𝐷𝑅𝐴𝑀tot and 𝑆𝑅𝐴𝑀tot are the total available off-chip and
on-chip memory.
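The capacity terms above can be tallied with a few lines of Python; the configuration below is an illustrative GPT2-like setting with W4A8 quantization and an assumed FIFO depth, not a measured design point.

```python
def memory_capacity_bits(d, d_ffn, l, l_max, b_w, b_a, fifo_depth):
    # S_param: on-chip weight buffers for the q, k, v, p, f1, f2 operators.
    s_param = (4 * d * d + 2 * d * d_ffn) * b_w
    # S_KV: double-buffered K and V caches for the maximum sequence length.
    s_kv = 4 * l_max * d * b_a
    # S_FIFO: 16 inter-stage FIFOs of depth `fifo_depth` plus the bypass buffer.
    s_fifo = 16 * fifo_depth * b_a + l * d * b_a
    return s_param, s_kv, s_fifo

s_param, s_kv, s_fifo = memory_capacity_bits(
    d=768, d_ffn=3072, l=128, l_max=1024, b_w=4, b_a=8, fifo_depth=64)
print(s_param / 8e6, s_kv / 8e6, s_fifo / 8e6)   # per-layer sizes in MB
```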
3.2.3 Memory Port Constraints. Besides memory capacity, we also need to consider constraints on memory ports in a highly parallel design. For matrix multiplication, if different MAC units work in parallel, they will access the weight/result buffers simultaneously, hence contending for memory ports. This issue can be addressed either by partitioning the buffer, which effectively offers more memory ports, or by packing data into wider elements, which reduces the number of memory ports required.
SRAM resources. The on-chip SRAM resources of FPGAs are typically organized as blocks.
Each block has a fixed capacity and may support configurable bitwidth. For example, on AMD
UltraScale+ FPGAs, there are two types of SRAM resources: Block RAM (BRAM) and Ultra RAM
(URAM). A BRAM block can be configured as one 36 Kb block or two 18 Kb blocks, with two read and write ports each. URAM blocks are 288 Kb with one read and one write port. The port width of the
BRAM block is flexible; it can be configured to 1, 2, 4, 9, 18, 36, or 72 (in 36 Kb mode) bits, while the
port width of the URAM block is fixed at 72 bits. Similar to BRAM and URAM, Intel FPGAs have
M20K and eSRAM with different configurable port widths.
Memory blocks needed without data packing. To begin with, we analyze the port constraints
without data packing. In this case, to eliminate the port contention, different MAC units may
need different memory ports. Consider the linear operator 𝑖 with the size of 𝑠𝑖 with 𝑀𝑖 MAC units
working in parallel, each loaded weight may feed multiple MAC units due to intrinsic data reuse
in GEMM. We use 𝑟𝑖 to represent the data reuse factor (number of MAC units sharing the loaded
weight). Therefore, the weight buffer needs to be partitioned into 𝑀𝑖 /𝑟𝑖 parts. If we store all the
weight buffers on-chip, then the number of 𝑏𝑊 -bit elements in each partition is 𝑠𝑖 /(𝑀𝑖 /𝑟𝑖 ). However,
𝑏𝑊 may not fully occupy one memory word as the memory bitwidth can only take limited options.
We introduce the effective bit width, 𝑏 𝐵𝑅𝐴𝑀 , to be the smallest memory bitwidth larger than 𝑏𝑊 .
Let 𝑆 𝐵𝑅𝐴𝑀 be the total capacity (in bits) of one memory block, we can deduce the total number of
memory blocks for one linear operator:
\[ R_i = \left\lceil \frac{s_i\, b_{BRAM}}{M_i/r_i \times S_{BRAM}} \right\rceil \times M_i/r_i. \quad (4) \]
If the parameters are loaded from off-chip memory and we only store a tile of the weights on-chip, then 𝑠𝑖 is simply 𝑀𝑖 , and 𝑅𝑖 also becomes 𝑀𝑖 as 𝑏𝐵𝑅𝐴𝑀 ≪ 𝑆𝐵𝑅𝐴𝑀 . Since we need to double buffer those parameters, the final buffer size of the 𝑖-th linear operator is 2𝑀𝑖 . Notice the 𝑘 and 𝑣 layers need to be double-buffered, so the required BRAM also doubles in these two layers. We can obtain the overall memory port constraint by summing 𝑅𝑖 over all linear operators, which must not exceed the number of memory blocks available on the device:
\[ \sum_{i} R_i\, C < R_{\mathrm{tot}}. \quad (5) \]
Memory blocks needed with data packing. Data packing can alleviate memory port contention by consolidating multiple narrow data elements into a single, wider element. This allows multiple MAC units to access data from the same memory port. We consider packing the linear weights into 𝑏pack bits, with 𝑏pack = 𝑘𝑏𝑊 . Again, we denote 𝑏𝐵𝑅𝐴𝑀 as the smallest memory bitwidth larger than 𝑏pack . The buffer feeding 𝑀𝑖/𝑟𝑖 MAC units then only needs to be partitioned into 𝑀𝑖/𝑟𝑖/𝑘 parts, and each partition holds ⌈𝑠𝑖/𝑘 × 𝑏𝐵𝑅𝐴𝑀/(𝑀𝑖/𝑟𝑖/𝑘)⌉ bits. Therefore, the total number of memory
blocks needed is:
\[ R_i = \left\lceil \frac{s_i\, b_{BRAM}}{M_i/r_i \times S_{BRAM}} \right\rceil \times \frac{M_i/r_i}{k}. \quad (6) \]
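Equations (4) and (6) can be folded into one routine (𝑘 = 1 recovers the unpacked case). The tile size, MAC count, and reuse factor below are assumptions chosen only to illustrate the effect of packing on 36 Kb BRAMs.

```python
import math

def memory_blocks(s_i, M_i, r_i, b_bram, S_bram, k=1):
    # Number of memory blocks for one linear operator, Equations (4) and (6).
    # b_bram is the effective port width (>= k * b_W); k = 1 means no packing.
    partitions = (M_i // r_i) // k
    blocks_per_partition = math.ceil(s_i * b_bram / ((M_i // r_i) * S_bram))
    return blocks_per_partition * partitions

# Illustrative tile of 768 x 64 int4 weights, 72 MACs with reuse factor 4.
# Without packing the port width is 4 bits; packing 9 int4 weights gives 36 bits.
no_pack = memory_blocks(768 * 64, M_i=72, r_i=4, b_bram=4, S_bram=36 * 1024, k=1)
packed  = memory_blocks(768 * 64, M_i=72, r_i=4, b_bram=36, S_bram=36 * 1024, k=9)
print(no_pack, packed)   # 18 vs. 6 blocks in this setting
```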
3.2.4 Memory Bandwidth Constraints. If the parameters are stored off-chip, we need to consider the impact of off-chip memory bandwidth. Similar to §3.2.3, we use 𝑟𝑖 to denote the data reuse factor of a linear operator with 𝑀𝑖 MAC units. Effectively, 𝑀𝑖/𝑟𝑖 weights must be loaded from off-chip memory per cycle to feed the MAC units, requiring a bandwidth of
\[ B_i = b_W \times M_i/r_i \times freq, \quad (7) \]
where 𝑓𝑟𝑒𝑞 is the achieved frequency of the FPGA. If the total required bandwidth, \( \sum_i C B_i \) (𝑖 ∈ {𝑞, 𝑘, 𝑣, 𝑝, 𝑓1, 𝑓2}), exceeds the maximum device bandwidth, the inference becomes bandwidth bound.
Notice this bandwidth requirement needs to be analyzed for each operator individually if the data
loading requires accessing multiple DDR or HBM channels.
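A quick sanity check of Equation (7) in Python; the operator count, MAC counts, and reuse factor are assumptions for illustration only.

```python
def operator_bandwidth_gbs(b_w, M_i, r_i, freq_hz):
    # Off-chip weight bandwidth of one linear operator, Equation (7), in GB/s.
    return b_w * (M_i / r_i) * freq_hz / 8 / 1e9

# Illustrative W4 design at 250 MHz: six linear operators with 512 MACs each,
# reuse factor 4, replicated over C = 2 on-chip layers.
per_op = operator_bandwidth_gbs(b_w=4, M_i=512, r_i=4, freq_hz=250e6)  # 16 GB/s
total = 2 * 6 * per_op                                                 # 192 GB/s
```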
[Figure: pipeline diagram over time for Layer 1 and Layer 2, with PEs processing successive input samples.]
Fig. 4. Pipeline diagram. Different colors stand for different input samples. Different blocks stand for different linear operators, which also constitute the pipeline stages. ℎ is the number of attention heads.
3.3.1 Latency Estimation. We construct the pipeline diagram as shown in Figure 4. As mentioned
in §3.2.2, since we need to store the 𝐾 and 𝑉 values after the linear operators, there is an implicit
synchronization point between the 𝑞/𝑘/𝑣 operators and the subsequent SDP and FFN parts. The computation of these two parts cannot be overlapped. Notice the 𝑞/𝑘/𝑣 operators can be performed in parallel since they do not have any dependencies on each other. After 𝑘 and 𝑣 have been fully calculated, the subsequent
computations of SDP and FFN can be greatly overlapped. This is because these operations do not
need to wait for all the results to perform the next operation. The results of the previous operation
can be directly streamed into the next operation as input. Moreover, since different Transformer
layers share the same architecture, their computation can also be overlapped without waiting for
the result of the previous layer.
Suppose the Transformer model has 𝑁 layers in total. Since we have 𝐶 layers on one FPGA, it
needs to iterate 𝑁 /𝐶 times to process the whole model. We can calculate the latency of different
stages, and the overall latency is the maximum latency of these stages (which defines the initiation
interval of the pipeline) times the number of iterations, i.e.,
\[ T_{\mathrm{prefill}} = \frac{1}{freq} \cdot \frac{N}{C} \left( \frac{l d^2}{M_k} + C \max\!\left( \frac{l d^2}{M_k}, \frac{l^2 d}{M_{a_1}}, \frac{l d\, d_{\mathrm{FFN}}}{M_{f_1}}, T_{\mathrm{mem}} \right) \right), \quad (8) \]
\[ T_{\mathrm{decode}} = \frac{1}{freq} \cdot \frac{N}{C} \left( \frac{d^2}{M_k} + C \max\!\left( \frac{d^2}{M_k}, \frac{(l_{\max}+1)\, d}{M_{a_1}}, \frac{d\, d_{\mathrm{FFN}}}{M_{f_1}}, T_{\mathrm{mem}} \right) \right), \quad (9) \]
where the first term inside the parentheses is the latency of the 𝑞/𝑘/𝑣 linear operator (i.e., 𝑡 in
Figure 4). 𝑇mem is the off-chip memory access latency, which can be calculated based on Equation (7).
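The two latency expressions translate directly into code. The sketch below mirrors Equations (8) and (9); all configuration values a caller would pass in are assumptions, and 𝑇mem is expected in cycles.

```python
def prefill_latency_s(N, C, l, d, d_ffn, M_k, M_a1, M_f1, freq_hz, T_mem=0):
    # Equation (8): q/k/v latency plus C times the slowest stage,
    # repeated for the N/C iterations over the model.
    stage = max(l * d * d / M_k, l * l * d / M_a1, l * d * d_ffn / M_f1, T_mem)
    return (N / C) * (l * d * d / M_k + C * stage) / freq_hz

def decode_latency_s(N, C, l_max, d, d_ffn, M_k, M_a1, M_f1, freq_hz, T_mem=0):
    # Equation (9): per-token latency with a single query row.
    stage = max(d * d / M_k, (l_max + 1) * d / M_a1, d * d_ffn / M_f1, T_mem)
    return (N / C) * (d * d / M_k + C * stage) / freq_hz
```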
3.3.2 Work Balancing. As the overall latency is determined by the slowest stage in the dataflow,
we can balance the execution time of each stage; hence we have
\[ \frac{l d^2}{M_{q,k,v,p}} = \frac{l^2 d / h}{M_{a_1,a_2}/h} = \frac{l d\, d_{\mathrm{FFN}}}{M_{f_1,f_2}} \quad (10) \]
\[ \implies M = M_{q,k,v,p} = \frac{d}{l} M_{a_1,a_2} = \frac{d}{d_{\mathrm{FFN}}} M_{f_1,f_2}, \quad (11) \]
where 𝑀 is defined as the global compute power in MACs/cycle. Finally, Equation (8) can be
simplified to
\[ T_{\mathrm{prefill}} = \frac{1}{freq} \cdot N \left( 1 + \frac{1}{C} \right) \frac{l d^2}{M}, \quad (12) \]
which shows the overall latency with work balancing. We can obtain the latency for the decode
stage using a similar analysis.
To derive the optimal 𝑀 for a given model, we devise a linear search algorithm to identify the
maximum available 𝑀 based on the constraints in Equations (2), (3), and (6). Notice the optimal 𝑀
represents an upper bound of the compute power. In practice, we also need to consider the routing
issue to adjust the actual achievable 𝑀 as discussed in §5.2.
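A minimal version of this search is sketched below, assuming the constraint checks of Equations (2), (3), and (6) are wrapped into a single feasibility predicate supplied by the caller; the predicate used in the example is a toy placeholder.

```python
def search_max_M(is_feasible, M_max, step=16):
    # Linear search for the largest global compute power M (MACs/cycle)
    # that satisfies all resource constraints for the target device.
    best = 0
    for M in range(step, M_max + 1, step):
        if is_feasible(M):
            best = M
    return best

# Toy predicate: a DSP cap of 4096 MACs/cycle and a BRAM budget that caps M at 1440.
best_M = search_max_M(lambda M: M <= 4096 and M <= 1440, M_max=4096)  # 1440
```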
3.4.2 Parallelization Schemes. As mentioned in §2.2, we have various parallelization schemes when
considering multiple devices. We first analyze tensor parallelism (TP). As shown in Figure 3, the
parameters of the linear operations are partitioned across different devices. For example, suppose
the weight parameters of the two FFN layers 𝑓1 and 𝑓2 are 𝐴 and 𝐵, then we can partition 𝐴 along
its column and partition 𝐵 along its row, and obtain
\[ \sigma(ZA)B = \sigma\!\left( Z \begin{bmatrix} A_1 & A_2 \end{bmatrix} \right) \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} = \sigma(ZA_1)B_1 + \sigma(ZA_2)B_2, \]
where 𝜎 is the GeLU function. Therefore, apart from partitioning 𝐴 and 𝐵, we need to insert an
all-reduce operation to aggregate the partial results on each device to ensure correctness. The
partitioned parameters will be stored on different devices. For example, 𝐴1 will be on the first FPGA,
and 𝐴2 will be on the second FPGA. A similar partition scheme can be applied for MHA, and we
refer the readers to [68] for more details.
Based on this partition scheme, TP requires two all-reduce operations within one Transformer layer. However, these collective operations are typically implemented in a blocking way. Figure 5(a) shows that the subsequent FFN module needs to wait for the completion of the all-reduce process before it can start its computation [76]. Notice that the all-reduce operation only involves fetching results
from other devices and adding the result to its local tensor. Given that the output of MHA is a
sequential stream, we can perform elementwise addition in a non-blocking manner. As soon as the
kernel receives enough data, it can initiate data transfer to other devices without waiting for the
remaining data to be computed. This leads to substantial synchronization time savings as shown in
Figure 5(b).
[Figure: (a) MHA, all_reduce, FFN, all_reduce executed back to back; (b) the proposed non-blocking all_reduce overlapped with MHA and FFN, saving time.]
Fig. 5. Blocking and non-blocking all-reduce in TP. The latency of different stages is not drawn to scale.
Since the output tensors of MHA and FFN both have size 𝑙𝑑, the communication time for one all-reduce is
\[ T_{\mathrm{comm}} = l d\, b_A / (\alpha B). \quad (13) \]
As we have already implemented dataflow inside a device, pipeline parallelism (PP) essentially
extends the dataflow to 𝑝 2 devices with a tensor of size 𝑙𝑑 communicated in between. Here, we
only split the pipeline between two Transformer layers so the results of the previous device can be
directly streamed to the next device in the same PP group. Notice TP and PP can be combined to
conduct model inference [51], and the latency of Equation (8) becomes
\[ T_{\mathrm{prefill}} = \frac{1}{freq} \cdot \frac{N}{p_2 C} \left( \frac{l d^2}{p_1 M_k} + p_2 C \max\!\left( \frac{l d^2}{p_1 M_k}, \frac{l^2 d}{p_1 M_{a_1}}, \frac{l d\, d_{\mathrm{FFN}}}{p_1 M_{f_1}}, T_{\mathrm{mem}}, T_{\mathrm{comm}} \right) \right), \quad (14) \]
where 𝑝1 and 𝑝2 are the sizes of a TP group and a PP group, respectively [68]. Additionally, the memory requirements of Equations (3) and (5) need to be divided by 𝑝1 to satisfy the constraints of multiple devices.
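Extending the single-device estimate, the sketch below follows Equations (13) and (14) for TP degree 𝑝1 and PP degree 𝑝2. We read 𝛼 as a bandwidth efficiency factor and 𝐵 as the inter-device link bandwidth, and the caller must express the communication term in cycles to place it inside the max; all numbers passed in would be assumptions.

```python
def allreduce_time_s(l, d, b_a, alpha, B):
    # Equation (13): time for one all-reduce of an (l, d) activation tensor.
    # B is the inter-device bandwidth (bits/s), alpha its achievable fraction.
    return l * d * b_a / (alpha * B)

def prefill_latency_tp_pp_s(N, C, l, d, d_ffn, M_k, M_a1, M_f1,
                            freq_hz, p1, p2, T_mem=0, T_comm_cycles=0):
    # Equation (14): TP scales each operator's work by 1/p1; PP spreads the
    # N/C iterations over p2 devices.
    stage = max(l * d * d / (p1 * M_k), l * l * d / (p1 * M_a1),
                l * d * d_ffn / (p1 * M_f1), T_mem, T_comm_cycles)
    return (N / (p2 * C)) * (l * d * d / (p1 * M_k) + p2 * C * stage) / freq_hz
```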
Notice we only discuss two basic parallelism schemes for Transformer models. Some recent
works may partition the sequence dimension and leverage reduce-scatter and all-gather to reduce
the overheads of all-reduce [33, 51]. The communication time can be analyzed similarly, and we do not discuss these schemes here. The optimal parallelism scheme across multiple devices [48, 62, 73, 82, 105] is out of the scope of this paper, and we leave it as future work.
4 Case Study
In this section, we leverage actual hardware configurations to estimate the model performance
using our analytical framework and provide insights for LLM accelerator design.
[Figure: latency (ms) of (a) BERT, (b) GPT2 prefill stage, and (c) GPT2 decode stage on RTX 2080Ti, Tesla A100, and estimates for Alveo U280, Versal VCK5000, Versal VHK158, Stratix 10-NX2100, and Agilex 7-AGM039.]
Fig. 6. Latency estimation of BERT and GPT2 on different FPGAs. GPU results are obtained from actual profiling.
[Figure: two plots of latency (ms) versus # of MACs/cycle 𝑀.]
Fig. 7. Latency estimation of LLaMA2 model. The sequence length is set as 128, and the W4A8 quantization scheme is used in this experiment. GPU results are obtained from actual profiling.
Insight I: Existing FPGAs are inferior in the compute-intensive prefill stage but can
outperform GPUs in the memory-intensive decode stage.
To further investigate what constrains the performance of FPGAs, we conduct an analysis on the
LLaMA2 model by varying different 𝑀 and observing the changes in latency. As shown in Figure 7,
the VCK5000 FPGA exhibits the smallest off-chip memory bandwidth, which leads it to reach a
latency plateau rather quickly. Conversely, the VHK158 FPGA has the largest off-chip memory
bandwidth, so it can achieve the lowest latency in both prefill and decode stages. Moreover, we
include the curve of ideal FPGA performance in Figure 7 to assess the compute power required to
attain A100-level performance. Based on this estimation, we need around 30,000 MACs/cycle in
order to achieve the A100-level performance in the prefill stage, assuming no memory bandwidth
constraints. This is achievable with AI-optimized FPGAs, which can perform a large number of MACs efficiently. In contrast, for the decode stage, once an FPGA has enough memory bandwidth, such as the U280, it can easily reach A100-level performance.
Insight II: The prefill stage requires large compute power 𝑀 to achieve the GPU-level
performance, while the decode stage only requires a small 𝑀.
4.2.2 Quantization Schemes. We then investigate the impact of different quantization schemes and
memory packing. We consider quantizing the weight parameters to 𝑥 bits and the activation to 𝑦
bits (abbreviated as W{𝑥 }A{𝑦}). As shown in Figure 8(a), the red dashed line depicts the maximum
available MACs/cycle on-board, which is calculated based on Equation (2). Different quantization
schemes may have different requirements on BRAM usage constrained by Equation (3). W4A8 is
the scheme that can almost fully utilize the compute resources. W8A8 and W16A16 require more
memory resources, resulting in lower performance since the computation is bound by the limited
BRAM resources on-board. Also, we can see that quantizing the weights gives the most benefit, while quantizing the activations gives little benefit (𝑀 does not change much under the same weight bitwidth). This is because we employ a dataflow architecture and do not require large buffers to store the intermediate tensors on-board.
Insight III: Weight quantization is necessary for reducing memory usage, while activa-
tion quantization only has limited benefit.
4.2.3 Memory Packing. Next, we further consider the impact of memory packing under the W4A8 setting. As shown in Figure 8(b), without memory packing the design cannot even satisfy the memory port constraint (Equation (5)) when 𝑀 is small (blue curve). This is because the large number of partitioned arrays requires more BRAMs, and many BRAMs are not fully utilized, causing a large waste of resources. The orange curve shows packing two int4 elements into an int8, with which we can achieve a small 𝑀 under the resource constraint since the number of partitioned arrays is reduced. The green curve packs 9×int4 elements into an int36 and achieves more than four times the 𝑀
[Figure: three plots of latency (ms) versus # of MACs/cycle 𝑀, comparing quantization schemes, memory packing options, and the maximum available BRAM.]
Fig. 8. (a) Impact of different quantization techniques on GPT2 prefilling stage on U280. The sequence length is set as 128. The cyan line shows the theoretical latency under different 𝑀 without memory bandwidth constraints. Thin dashed lines depict the maximum 𝑀 constrained by available BRAM resources. (b) Impact of memory packing in the W4A8 setting. (c) Impact of different weight quantization schemes on memory bandwidth and overall latency.
compared to the int8 packing. The purple curve packs 18×int4 elements to int72, and the curve
can almost intersect with the red line before intersecting with the blue line, which means it reaches
the maximum DSP constraint on-board (Equation (2)). This study shows that it is important to pack
the parameters to reduce on-chip memory usage.
Insight IV: Memory packing can efficiently reduce the required BRAMs to store the
tensors.
4.2.4 Memory Bandwidth. Lastly, we investigate how quantization impacts the required memory
bandwidth. As shown in Figure 8(c), the low-bit weight quantization can significantly alleviate the
demands of off-chip memory access. By reducing the volume of data needed in each cycle, it can
achieve a larger compute power 𝑀, thus leading to a better performance. In particular, quantizing
the model to a 2-bit representation yields a performance boost exceeding an order of magnitude
when compared to a 16-bit weight quantization scheme. Recent research [7, 103] has demonstrated
that 4-bit or even 2-bit quantization can be implemented without compromising model accuracy,
which makes efficient LLM deployment on FPGAs possible.
Insight V: Low-bit weight quantization can further help alleviate the demands of off-
chip memory access.
[Fig. 9: latency (ms) versus # of MACs per cycle per FPGA (𝑀) for different numbers of FPGAs, with markers for the maximum 𝑀 on U280, VHK158, VCK5000, and Stratix 10.]
For multiple devices, we use the Vicuna-13B model to estimate the performance of 2, 4, and
8 FPGAs based on our analytical model. As shown in Figure 9, the latency can scale well when
the number of devices increases. Since we employ a non-blocking communication scheme in our
dataflow design as discussed in § 3.4, communication will not be the bottleneck of the design.
Multiple FPGAs can reduce the number of required MACs on each device, but cannot increase
the number of available MACs on an FPGA, so the performance is still limited by the maximum
available resources on-board and the off-chip memory bandwidth. For the decode stage, leveraging
two FPGAs can already reduce the inference latency of the Vicuna-13B model to less than 10ms
based on the estimation.
Insight VI: Multiple FPGAs help reduce overall latency under the same 𝑀 on each
device.
5 Implementations
In this section, we describe the kernel implementation and accelerator design to show how to
efficiently achieve the design points in the analytical framework.
[Fig. 10(b) detail: the activation occupies in[7:0] of the 18-bit DSP input; the two weights occupy w[3:0] and w[16:13] of the 27-bit input; the products are read from out[11:0] and out[24:13] of the 45-bit output.]
Fig. 10. Systolic array and DSP packing. The yellow blocks in the systolic array represent output buffers.
Each MAC unit can be implemented with a single DSP block and can provide one-MAC-per-cycle
throughput. Based on the discussion in §4.2.2, we adopt the W4A8 quantization scheme for our
Fig. 11. Overall dataflow architecture of a single Transformer layer that uses post-LayerNorm scheme [65]. SDP denotes scaled dot-product. Orange nodes denote the GEMM kernels. Yellow nodes are the non-linear kernels, including softmax (SM), LayerNorm (LN), and GELU (GL). Green rectangles represent the FIFOs between kernels, and purple rectangles are the data loaders.
accelerator design, which maximizes the utilization of available resources. As a result, the matrix multiplications involve either int4-by-int8 or int8-by-int8 operations, which are costly to implement with LUTs and thus rely primarily on DSPs or specialized compute blocks (e.g., AIE [84]). In AMD
FPGAs, the DSP48E2 hard blocks can support 18-bit by 27-bit multiplication and accumulation [83],
enabling the packing of two multiplications into one slice for a W4A8 quantized model to save
DSP resources and achieve a larger 𝑀. Figure 10(b) shows the bit packing method for 4-bit by
8-bit integer multiplications. One activation is filled into the lower 8 bits of the 18-bit DSP input,
and two weights are filled into 0-to-3 and 13-to-16 bit positions of the 27-bit DSP input to avoid
overlapping results. Finally, the two multiplication results are extracted by bit-slicing the 45-bit
DSP result. Notice that since the DSP output is wide enough, we can also pack two 8-bit by 8-bit integer multiplications into one DSP slice by further offsetting the second weight and output. With DSP packing, we can easily double 𝑀 and achieve higher performance with far fewer DSPs.
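The arithmetic behind this packing can be checked with plain integers. The Python sketch below uses unsigned magnitudes and omits the sign handling a real DSP48E2 mapping needs, so it only illustrates why the two products land in out[11:0] and out[24:13].

```python
def packed_mac(a, w0, w1):
    # Pack two 4-bit weights into one 27-bit operand: w0 at bits [3:0],
    # w1 at bits [16:13]. One 8-bit activation feeds the 18-bit operand.
    assert 0 <= a < 256 and 0 <= w0 < 16 and 0 <= w1 < 16
    packed_w = w0 | (w1 << 13)
    out = a * packed_w            # single 18 x 27 multiplication
    p0 = out & 0xFFF              # out[11:0]  = a * w0 (fits in 12 bits)
    p1 = (out >> 13) & 0xFFF      # out[24:13] = a * w1
    return p0, p1

assert packed_mac(200, 7, 13) == (200 * 7, 200 * 13)
```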
Non-Linear Operators. Since quantizing non-linear operators can lead to a significant degradation
in model accuracy [67, 81], and these non-linear operators are not the bottleneck of the design, we
directly implement the floating-point version of these operators in HLS. Specifically, we buffer a row of elements for the softmax and LayerNorm functions, which require a reduction along the last dimension. Consequently, this approach eliminates the need to wait for the complete input tensor before computing these non-linear operators and effectively prevents dataflow stalling.
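As a functional reference for the row-buffering scheme (a NumPy sketch of the behavior, not our HLS code), the softmax below consumes one row at a time, so it can start as soon as a full row of the preceding GEMM's output is available.

```python
import numpy as np

def streaming_softmax(row_stream):
    # Consume the score matrix one row at a time; each row is buffered,
    # reduced (max and sum), normalized, and emitted before the next row
    # arrives, mirroring the row buffer used in the HLS kernel.
    for row in row_stream:
        row = np.asarray(row, dtype=np.float32)
        shifted = row - row.max()          # reduction 1: row max
        exp = np.exp(shifted)
        yield exp / exp.sum()              # reduction 2: row sum

scores = np.random.rand(4, 8).astype(np.float32)
probs = np.vstack(list(streaming_softmax(scores)))   # each row sums to 1
```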
placement constraints, the AMD Vitis toolchain will automatically insert AXI Register Slice IPs to pipeline SLR crossings.
We leverage the proposed analytical framework to guide our accelerator design. Since typical
Transformer models have 𝑑 FFN = 4𝑑 [16, 65] and 𝑙 < 𝑑, according to work balancing of Equation (11),
we have 𝑀𝑞,𝑘,𝑣,𝑝 = 𝑀, 𝑀𝑎1,𝑎2 < 𝑀, and 𝑀𝑓1,𝑓2 = 4𝑀. A straightforward division is to put the PEs for
𝑞, 𝑘, 𝑣, SDP, and 𝑝 on SLR0, 𝑓 1 on SLR1, and 𝑓 2 on SLR2 so that each SLR roughly contains 4𝑀 MAC
units. However, we observe that scaling up the linear operators in FFN poses significant challenges
to timing closure. Among the various systolic array configurations we tested, the maximum capacity of one SLR at 250 MHz is three 8 × 16 systolic arrays; a single 16 × 16 array fails timing. Therefore,
we only leverage 8 × 8 and 8 × 16 systolic arrays for simplicity. We also explore using LUT-based
multipliers as they provide greater flexibility for placement compared to DSPs. However, the
presence of additional inter-LUT wires results in a much lower frequency (191 MHz) compared to
the DSP-based multipliers. To minimize the number of SLR crossings, we put 𝑞, 𝑘, and 𝑣 on SLR0
and use 8 × 16 systolic arrays, which also ensures a relatively low latency for the first stage based
on Equation (8). MHA and the 𝑝 projection are on SLR1, with 𝑎1 and 𝑎2 using 8 × 8 and 𝑝 using 8 × 16 systolic arrays. The 𝑓1 and 𝑓2 operators are placed on SLR2, using 8 × 16 systolic arrays. Therefore, the design still forms a relatively balanced 3:2:2 resource utilization ratio for the linear operators.
6 Evaluation on FPGAs
In this section, we implement two design points studied in §4 to validate the feasibility of our
framework. We first describe our experimental setup and perform evaluation on a single FPGA.
Table 4. Experimental results compared with other FPGA-based accelerators. Sequence lengths are set as 512.
[Figure: four panels of latency (ms) versus input sequence length (32 to 512 tokens) and output sequence length (1 to 128 tokens).]
Fig. 13. Latency and energy efficiency of GPT2 model on different devices. The GPU results are obtained following the same setting in §4.
Figure 12 shows the final device layout of the implemented accelerator. We use OpenCL with the Xilinx RunTime (XRT) for hardware execution and the Xilinx Board Utility (xbutil) for board power measurements. The environment for the GPU experiments is listed in §4.1, and the NVIDIA system
management interface (nvidia-smi) is used for measuring GPU power. Notice the quantized models
on GPUs are slower than the FP16 models, as the quantization methods normally leverage fake
quantization and lack high-performance GPU kernels to support efficient inference. Therefore, we
directly compare our accelerator with the best FP16 GPU results. The FPGA on-board results match
the outputs from the quantized model in PyTorch and are able to achieve the same accuracy. The
latency results are the average across fifty runs.
tensor to be produced, which may take hundreds of cycles. Therefore, even if the per-layer latency
of spatial architectures is longer, the end-to-end latency can still be significantly lower than the
temporal architectures employed by FQ-BERT and TRAC. Furthermore, our analytical framework precisely predicts the performance of the accelerator, with less than 2 ms of difference, showing the practicality of our approach.
We next design an accelerator for the GPT2 model. We support importing quantized models
from different quantization frameworks [7, 81, 103]. Specifically, we export the W8A8 model from
SmoothQuant [81] and achieve 62.2% on the LAMBADA dataset [56], whereas the FP16 model
demonstrates an accuracy of 65.0%. We compare our GPT accelerator with the state-of-the-art GPT
accelerator, DFX [23], which employs a temporal architecture with an instruction set and uses the
same U280 FPGA device for on-board evaluation. On average, we are 2.16× and 1.10× faster than
DFX in the prefill and decode stage respectively. This is because our spatial architecture overlaps
the computation and largely eliminates off-chip memory access. We can also see that our estimations in §3 align closely with the actual performance, achieving a 92% prediction accuracy for the prefill stage. For the decode stage, the estimated latencies are lower than the actual results, mainly because the initial interval between two operators is not significantly smaller than the execution time of one stage, contributing to a notable increase in latency.
We also include the GPU results in §4 for a more comprehensive evaluation. As shown in
Figure 13, neither DFX nor our design performs well during the prefill stage compared to GPUs
that have more compute resources to exploit the abundant parallelism. Notably, the latency of the FPGAs in the prefill stage increases linearly with sequence length, while the GPU latency remains almost constant because the model does not fully utilize the GPU. For the decode stage, the situation is reversed. FPGA-based
accelerators are more efficient than GPUs, and our accelerator can achieve a 1.85× speedup and is
5.69× more energy efficient compared to the A100 GPU. This is because the generation of each token is fully sequential, so GPUs cannot leverage their abundant parallelism and suffer from
extensive memory access overheads. On the contrary, our dataflow accelerator eliminates most of
the off-chip memory accesses and overlaps the compute as much as possible. Thus, we can achieve
better performance compared to GPUs, aligning with our estimation results in §4. Notice the U280 FPGA uses only a 16 nm process while the A100 GPU has a more advanced 7 nm process node based on the data in Table 3, yet we still achieve a higher speedup, demonstrating the efficiency
of our spatial accelerators. It also indicates the potential of further optimizing our HLS design and
scaling it up to achieve even higher performance.
a pivotal role in this predictability. Additionally, our function offers enhanced customizability,
accommodating varying sizes and the choice of different quantization schemes.
Moreover, employing DSP packing further reduces the DSP usage, allowing one DSP to handle
two MAC operations within a single cycle, a feature not supported in AutoSA. This experiment
shows the efficiency of our kernels, facilitating the development of high-performance Transformer
accelerators.
Table 5. Latency and resource usage of our systolic array library function. Results are directly derived from the HLS report at 300 MHz. The GEMM kernel is extracted from the first FFN layer in the BERT-base model with size (512, 768) × (768, 3072). We use a 16 × 16 systolic array to compute the int8 GEMM. The theoretical peak performance without DSP packing is (512 × 768 × 3072)/(16 × 16) cycles × 3.33 ns/cycle = 15.71 ms.
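The peak-performance figure in the caption follows from a one-line calculation, repeated here only as a sanity check of the arithmetic.

```python
# Theoretical peak latency of a (512, 768) x (768, 3072) int8 GEMM on a
# 16 x 16 systolic array at 300 MHz (3.33 ns/cycle), without DSP packing.
macs = 512 * 768 * 3072
cycles = macs / (16 * 16)
latency_ms = cycles * 3.33e-9 * 1e3      # ~15.71 ms
```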
Lastly, we analyze the performance of the non-linear operators. As shown in Table 6, we observe
that the softmax operator in the MHA module incurs the highest latency, primarily due to the need
to compute the exponential function. Since these operators work elementwise or row-wise and only require a row of data to start the computation, they can be easily fused with the preceding linear operators in the pipeline, thereby not significantly impacting the overall latency. For instance, the combined latency of a GEMM kernel (10.77 ms) and the softmax operator (6.67 ms) greatly exceeds the latency of SLR1 in Table 4 (14.63 ms × 245 MHz/300 MHz ≈ 11.95 ms), indicating substantial overlap between the softmax operator and the other operators. Again, these ablation studies show that considering only the linear
operators in the analytical framework is sufficient to achieve an accurate latency estimation.
Table 6. Performance and resource usage of non-linear operators in our kernel library. Kernel sizes are set to match those of the BERT model in Table 4. Results are directly derived from the HLS report at 300 MHz.
7 Discussion
In the previous sections, we provide details of the analytical framework and show that its estimates closely match the latency of the actual implementation. However, our framework may have limitations when analyzing overlay designs or compressed models with sparsity, which would require changes to the resource and latency estimation. In this section, we delve into several unanswered questions and open challenges.
AI-Optimized FPGAs. In §4, we demonstrate the potential of leveraging FPGAs with specialized compute engines to accelerate LLMs. Although AIEs and tensor blocks provide massive compute power [37, 84], their memory layout and bandwidth requirements remain largely unexplored. Future FPGAs for AI workloads should provide enough memory bandwidth and efficient on-chip interconnect to facilitate the local data movements in a spatial architecture. Moreover, these specialized hardware blocks usually adopt a unique programming model with custom compilation flows. It is still an
open question whether existing practices for programming those hardware blocks enable efficient
execution of Transformer models.
Timing Closure on Multi-Die FPGAs. We encounter timing problems in partitioning and
scaling our design in §5.2. In general, it is hard to adequately explore the design space of multi-
die partitioning and scaling. There are automated frameworks [22, 29] to generate floorplanning
constraints, but they are currently not expressive enough to capture the various data movement
schemes (e.g., residual connection, multi-head splitting) within Transformer models. We hope
similar tools for Transformers could be derived from our analytical framework to speed up the
design closure.
Heterogeneous Deployment. Nowadays, data centers are increasingly heterogeneous, with
CPUs, GPUs, and FPGAs available at scale [6, 12, 51]. Therefore, it is possible to leverage the
advantages of different hardware to accelerate Transformer models. For example, GPUs are good
for the GPT prefill stage due to their high compute power; FPGAs can achieve low-latency decode
stage with customized spatial architecture. The key challenge is to build a distributed system
that efficiently manages hundreds of heterogeneous devices. We hope our analysis on resource
constraints, latency, and scaling could assist future deployment and evaluation of LLMs in a
heterogeneous and distributed environment.
8 Related Work
FPGA-Based Transformer Accelerators. Most of the prior works on hardware accelerators leverage a temporal or overlay architecture on a single FPGA [26, 28, 39, 40, 46, 59, 63, 64]. Their
performance usually suffers from frequent data movements of intermediate results. DFX [23]
explores using multiple FPGAs to accelerate GPT2 inference, but it is still an overlay design. Some
research has delved into software-hardware co-design to optimize the attention kernel [100]. These
endeavors often lack in-depth analysis on resource utilization and cannot be easily generalized to
other kernels.
Quantization on LLMs. Initial investigations [15, 81, 94] demonstrate lossless 8-bit quantization
for LLMs. Subsequent studies [21, 31, 44, 94, 96, 103] keep lowering the bit width; the latest
advancements reveal that 2-bit [7] and even 1-bit (binary) quantization [101] are adequate for
an accurate LLM. While these approaches offer valuable insights, our focus remains orthogonal
to quantization, as we illustrate optimization techniques and provide high-performance building
blocks for deploying quantized LLMs on FPGAs.
HLS Kernel Libraries. Despite the existence of kernel libraries for accelerating Transformer
models on GPUs [14, 38, 79], the hardware domain has seen only a handful of initiatives in this
regard. AMD provides Vitis HLS library [87, 88] that only has basic kernel-level examples without
comprehensive designs tailored for Transformer models. TRAC [61] attempts to provide an HLS-
based Transformer library, but its kernel performance is unpredictable, and it exclusively focuses on
the BERT model using a temporal architecture. Some frameworks map deep learning frameworks
to FPGAs [4, 20, 72, 98, 99], but can only handle small CNN designs and do not cater to LLMs.
More recent tools allow hardware design using Python [24, 35, 55, 80, 95], but are still general-
purpose and require hardware engineers to construct and optimize kernels from scratch. Our work
provides a Transformer kernel library designed for dataflow implementations and demonstrates
their composability in constructing high-performance hardware accelerators.
9 Conclusion
In this paper, we propose an analytical framework for large language models and point out the bottlenecks and potential optimizations across the prefill and decode stages of generative inference.
To verify the feasibility of our framework, we provide a reusable HLS kernel library to quickly
compose Transformer kernels into different LLMs that can achieve the expected performance.
Based on these proposed kernels, we design FPGA-based spatial accelerators for both BERT and
GPT models and achieve high performance and energy efficiency on par with high-end GPUs. By
offering insights into performance bottlenecks, a suite of reusable kernels, and a high-performance
accelerator, we propel the deployment of LLMs for real-world applications while pushing the
boundaries of hardware innovation.
Acknowledgments
This work was supported in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor
Research Corporation (SRC) program sponsored by DARPA and NSF Awards #2007832, #2019306,
and #2118709. We would like to thank anonymous reviewers, Keisuke Kamahori, and Zihao Ye for
providing insightful feedback. We also thank Jiajie Li, Jie Liu, and Zhanqiu Hu for their contributions
to the initial LLM modeling and benchmarking.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay
Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G.
Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating
Systems Design and Implementation, 2016.
[2] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji
Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: Enabling efficient inference
of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance
Computing, Networking, Storage and Analysis, 2022.
[3] Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, and Jason Cong. Flexcnn: An end-to-end framework
for composing cnn accelerators on fpga. ACM Trans. Reconfigurable Technol. Syst., 16(2), mar 2023.
[4] Michaela Blott, Thomas B Preußer, Nicholas J Fraser, Giulio Gambardella, Kenneth O’brien, Yaman Umuroglu, Miriam
Leeser, and Kees Vissers. Finn-r: An end-to-end deep-learning framework for fast exploration of quantized neural
networks. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 11(3):1–23, 2018.
[5] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein,
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258, 2021.
[6] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil,
Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael,
Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In 2016 49th
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13, 2016.
[7] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language
models with guarantees. arXiv preprint arXiv:2307.13304, 2023.
[8] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating
large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
[9] Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang. Slapo: A schedule language
for progressive optimization of large deep learning model training. In Proceedings of the 29th ACM International
Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS’24), 2024.
[10] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser,
Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak,
Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan
Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati,
Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[11] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang,
Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.