RM Merged Files
RM Merged Files
https://doi.org/10.1007/s11265-021-01651-5
Received: 4 May 2020 / Revised: 4 November 2020 / Accepted: 10 February 2021 / Published online: 15 March 2021
© The Author(s) 2021
Abstract
In recent years, Convolutional Neural Network CNN have been incorporated in a large number of applications, including
multimedia retrieval and image classification. However, CNN based algorithms are computationally and resource intensive
and therefore difficult to be used in embedded systems. FPGA based accelerators are becoming more and more popular in
research and industry due to their flexibility and energy efficiency. However, the available resources and the size of the on-
chip memory can limit the performance of the FPGA accelerator for CNN. This work proposes an High-Level Synthesis
HLS library for CNN algorithms. It contains seven different streaming-capable CNN (plus two conversion) functions for
creating large neural networks with deep pipelines. The different functions have many parameter settings (e.g. for resolution,
feature maps, data types, kernel size, parallelilization, accuracy, etc.), which also enable compile-time optimizations. Our
functions are integrated into the HiFlipVX library, which is an open source HLS FPGA library for image processing and
object detection. This offers the possibility to implement different types of computer vision applications with one library.
Due to the various configuration and parallelization possibilities of the library functions, it is possible to implement a
high-performance, scalable and resource-efficient system, as our evaluation of the MobileNets algorithm shows.
Our contribution consists of a generic, template-based SDSoC [26] from Xilinx or the OpenCL SDK [10] from
and open source HLS library for a fast implementation of Intel) reduces the programming hurdle and shortens the
CNNs on FPGA-based embedded or HPC systems. The development time of FPGA-based hardware accelerators.
library consists of 7 different layers, which are used in Consequently, many HLS implementations have been
common CNN algorithms. It operates on parameterizable introduced for the acceleration from CNNs, like from
fixed-point data types and floating-point data types, and has Tapiador et al. [30], Zhang et al. [37] or Venieris et al. [32],
been optimized for performance and resource efficiency. to implement energy-efficient and effective HLS-based
The different compile time parameters and data types of neural network accelerators. While some papers present
the library functions offer multiple opportunities for an approaches for cloud applications with sufficient resources
optimized design and extensive design space exploration. [3], others present designs for embedded applications
One benefit is that all functions are streaming capable to with limited resources [37]. For example, Yao et al. [3]
allow a deep pipeline. Creating streaming applications with implemented an HLS-based library for cloud systems, like
multiple nodes or layers gives FPGAs the ability to achieve the AWS. To address the resource limitation on FPGAs,
higher performance and power efficiency for computer many optimizations and implementations were carried out
vision algorithms compared to other architectures, such as to reduce the resource usage. Suda et al. [28] propose an
CPUs and GPUs, as Kalms et al. [14] or Qasaimeh et al. [24] implementation using HLS for a lighter data type (fixed-
show. Furthermore, we have researched and implemented point 16-bit) while our proposed work supports multiple
different possibilities of parallelization in order to achieve a data types (32-bit floating-point and 8-bit, 16-bit or 32-bit
high performance with an efficient use of resources, which fixed-point with an adjustable size of the fraction part).
we show in this paper using our implementation of the Guo et al. [7] proposed a flexible CNN accelerator
MobileNets algorithm [9]. Our library is integrated into with bit-width reduction using quantization, improving
the HiFlipVX library, which is an open source HLS FPGA the performance of OpenCL-based FPGA accelerators for
library for image processing [17] and object detection [15]. CNNs. Liu et al. [19] integrated the pointwise separable
This offers the possibility to design and implement different convolution, which is needed in different neural networks
kinds of computer vision applications with one library. Most like MobileNets. Some of the previous studies focused only
functions of the libraries are based on the OpenVX standard. on the acceleration of the convolution layers of CNNs. For
This simplifies the design of applications on heterogeneous example, Liu et al. [20] only used models with convolution
systems containing different types of architectures (e.g. layers without any Fully Connected layer. Therefore, it is
CPU, GPU and FPGA), due to the different existing hard to be used for accelerating different CNN algorithms.
implementations from different vendors. Memory bandwidth issues in CNNs are discussed
In the following, Section 2 provides information about by Zhang et al. [37] and Zhang et al. [39]. Guan
the related work, Section 3 describes the implementation et al. [6] proposed FP-DNN, which is an end-to-end
of the neural network library and MobileNets, Section 4 framework that automatically generates optimized FPGA-
evaluates the achieved results and Section 5 contains based implementations of deep neural networks (DNNs)
conclusion and outlook. using an RTL/HLS hybrid library. Another HLS based
library is the Caffeine FPGA engine [38] that uses an
HLS-based systolic-like architecture to implement matrix
2 Related Work multiplication kernels. It allows changing parameters such
as the number of (PEs), precision, and feature map size.
State-of-the-art CNN architectures for large-scale visual The proposed CNN library is highly parametrizable, has
recognition use a multitude of layers with millions of a rich set of functions and is therefore applicable for various
computations. FPGA designers for embedded applications algorithm designs. All functions are streaming capable and
encountered three major challenges to efficiently map can be easily connected to each other. High performance
CNNs on hardware such as a difficult programming frame- with an efficient use of resources can be achieved through
work, limited FPGA resources and memory bandwidth. the streaming approach and the various parallelization
Many implementations have been proposed to address the parameters. The integration of the proposed CNN library
above mentioned challenges on FPGAs. Wang et al. [33] into the HiFlipVX image processing library [17], which has
built an RTL library to map neural networks on FPGAs. been extended for object detection [15], increases the range
However, RTL implementations suffer from high costs and of possible applications that can be implemented in the field
time-to-market which makes RTL-based custom hardware of computer vision. Following the OpenVX standard [5]
accelerators infeasible for most cases. makes it easier to create a heterogeneous system consisting
The availability of HLS tools, using OpenCL, C or C++, of different architectures (e.g. CPU, GPU and FPGA)
from FPGA vendors (e.g. Vivado HLS [34] from Xilinx, from different vendors. Since the library does not require
J Sign Process Syst (2021) 93:513–529 515
vendor-specific or other external libraries, it can be ported library have integrated vectorization, which can be applied
to other platforms more easily. This also improves the on their Input Feature Map (IFM) and/or their Output
verification and integration process in frameworks like Feature Map (OFM). Unsigned and signed 8-bit and 16-bit
Tensorflow [1] or Caffe [13]. fixed-point and 32-bit floating-point data types are possible
for the inputs, outputs, weights and biases, to be applicable
for many hardware designs. The size of the fraction can
3 Implementation be configured as a parameter of the function. For an 8-
bit unsigned integer data type, this value can be between 0
This section first describes the architecture and implemen- and 8. If the fixed-point position is set to 5, the fractional
tation of the different Neural Network layers. The library part is 5 and the integer part is 3. Functions that need
contains 7 neural network functions, which are described in trained coefficients buffer them on first use, if configured,
Section 3.2. All functions are streaming-capable to exploit to reduce the amount of global memory access. The fixed-
the advantages of an FPGA. Section 3.3 describes how to point implementations contain policies for rounding and
create an algorithm using the library components by using overflow. If an overflow occurs data can either be truncated
MobileNets as an example. It also adds two additional or saturated to its maximum/minimum value. For fixed-
functions needed to create an efficient implementation. point arithmetic operations, the data can be rounded to zero
or the nearest number.
3.1 The HiFlipVX Library Seven different neural network layers were designed
and implemented: 3D Convolution, Depthwise Convolution,
HiFlipVX is an open source [16] HLS FPGA library for Pooling, Activation, Batch Normalization, Fully Connected
image processing [17], which has been extended for object and Softmax. The I/Os of the different layer functions are
detection [15]. The library contains 46 C++ streaming- the input vector, the output vector and, if required, the
capable functions, which are mostly based on the OpenVX weights vector and the biases vector. Since all functions
standard. OpenVX is an open, royalty-free standard for are streaming capable, we can use the simple AXI4-Stream
cross platform acceleration of computer vision applications interface for the Xilinx implementation. It is a simple
[5]. The library functions are parametrizable using C++ protocol, with ready and valid signal for handshaking
templates and highly optimized for performance and FPGA and the corresponding data signal. The HLS interface
resources. In comparison to the xfOpenCV library from axis directive in the library functions automatically
Xilinx [36], it only consumed in average 39% FFs and 32% creates this interface. For all interface parameters we
Lookup Table (LUTs) for a selected set of functions [17]. use the vx image data<DATA TYPE, VEC SIZE> of
In addition to the OpenVX standard, most functions support the HiFlipVX library. It is a vector data type, with
different vectorization options (2x, 4x, 8x) and additional two additional configurable signals for the AXI4-Stream
data types (8-, 16- or 32-bit signed/unsigned integers). The interface that can be activated by using macros. These
use of vectorization does not only increases performance, signals indicate the Start of Frame (SoF) (last signal) and
but also the energy efficiency, as shown by Akguen et al. [2]. End of Frame (EoF) (user signal) and are needed when
The functions of our proposed library were integrated connected to the DMA or Video DMA (VDMA) blocks
into the HiFlipVX library. They use the same data types and from Xilinx. The remaining library function parameters are
the function headers have a similar structure. Therefore it template parameters (e.g. input/output image size, kernel
is easy to connect the functions of the two libraries, either size, IFM, OFM, etc.).
directly or e.g. by using data-width converters. Furthermore, The library has been optimized and tested for Vivado
certain pre-processing for the CNNs can be done with the HLS [34] and SDSoC [26] 2019.1, but also works with other
functions of the HiFlipVX library, e.g. changing the image versions. Internally SDSoC uses HLS, but builds a complete
size or the image format. The integration also makes it easier system around the accelerated functions containing the
to use existing OpenVX-based frameworks, which did not hardware and software layers. To create such a system,
observe CNNs, like AFFIX [29] or JANUS [22]. SDSoC adds some restrictions that basically affect the
interfaces of the function. One of these limitations is that
3.2 Neural Network Layers only structs with more than 1 element are synthesizable.
This has been solved by automatically using native data
One goal of the library was the streaming capability of types instead of structs for these kind of interfaces. This
the library functions. Since all functions are pipelined, a is also possible, since SDSoC adds the SoF and EoF
pipeline interval of 1 was a key objective to achieve an signals to the AXI4-Stream interface by itself. Furthermore,
optimized performance. Additionally, all functions in the interface arrays need a known amount of elements. The
516 J Sign Process Syst (2021) 93:513–529
proposed library does not need vendor specific or external be a drawback of using HLS. The general equation for
libraries. For some mathematical operations or signals, calculating a 3D convolution is:
Xilinx libraries have been used for a resource efficient
implementation. By using a macro, alternatives are applied y −1 Kx −1
M−1 K
I F
if other tools are used. dsty,x,o =
i=0 n=0 m=0
3.2.1 3D Convolution
× src(y+n− Ky ),(x+m− Kx ),i · Wo,i,n,m +Bo (1)
2 2
The process of 3D convolution is the most computational
intensive layer in most feed forward networks. The main Parallelization The performance benchmark for most 3D
goal of the proposed image and loop dimension ordering convolution layers is the number of multiplications pro-
was to achieve a streaming capable function. Under this cessed per second. For a multiplication mostly internal Dig-
constraint we developed a structure that is optimized for ital Signal Processor (DSPs) are used on an FPGA. When
performance and resource usage. Therefore, the ordering of increasing the number of multiplications, the amount of data
some dimensions is different from the OpenVX standard. that is needed simultaneously and thus the required memory
Listing 1 shows the general structure of the hardware bandwidth increases. To implement an efficient streaming
implementation, which is be explained throughout this capable function, data of the input image as well as the coef-
subsection. Therefore the total latency can be derived ficients should be buffered. This can limit the maximum
from the total number of loop iterations plus the pipeline resolution of the image to be processed. These buffers are
stages. The order of the image and coefficient dimensions usually implemented using Block RAM (BRAM). However,
is: BRAM has a limited bandwidth to read and write data. To
increase the bandwidth, data can be distributed over sev-
– Input Image: BAT CH × ROWsrc × COLsrc × I F M eral BRAMs. However, this can lead to fragmentation, if
– Output Image: BAT CH ×ROWdst ×COLdst ×OF M the BRAM is not fully utilized and can therefore limit the
– Weights: OF M × I F M × Ky × Kx data to be stored. For this reason, fragmentation should be
– Biases: (0) ∨ (OF M) ∨ (BAT CH × ROWdst × kept as small as possible while increasing the amount of
COLdst × OF M) multiplications.
Various loop variables are suitable for parallelization, as
As shown, different sizes for the Bias are possible. A illustrated in Listing 1. One possibility of parallelization
stride is set, when input and output resolution differ. In would be in the direction of the (COL) as in HiFlipVX.
the proposed implementation the stride only effects the However, this type of parallelization would increase the
condition when a result is written to the output. It has no bit-width of various buffers and therefore lead to a high
effect to the latency. However, loop iterations could also fragmentation of BRAM. Additional buffers would also
be skipped in dependence of the stride. However, the used have to be introduced to restructure the input and output
HLS compiler only allows ”perfect loops”, which could data. Therefore, we have concentrated on the parallelization
of the inner loops, as shown by the parameters (Vo ) and
(Vi ) in Listing 1. Both (OF M) and (I F M) parallelization
would increase the bit-width of the coefficient buffer.
Additionally, the parallelization of (I F M) increases the bit-
width of the input buffers. In some cases (Vi ) can be raised
to a certain point without causing additional fragmentation
of the input buffer.
Figure 1 The input stage buffers the input image for the 3D convolu- steps. 1. step reads new input vector of size Vi (dashed lines) while
tion function with a Ky × Kx window/kernel size (here 3 × 3) and an computing the first Output Feature Map [o = 0] 2. step updates win-
input vector size of Vi . The input stage contains input registers on the dow registers (continuous lines). 3. step updates buffers (dotted lines).
left (white), big line buffers (dark gray), small input/window buffers 4. step sends all data from the sliding window registers to the com-
(light gray) and sliding window registers on the right (white). The pro- pute stage (dashed lines). I F M = Input Feature Map; COLsrc = Input
cess of buffering the input image can be separated into 4 pipelined Image Columns.
fragmentation. The input buffer is a line buffer that does not buffer. The other elements of the sliding window get their
have to store the entire image row. Instead, it is sufficient data from the window buffers. Additionally, the algorithm
to store the ( I FViM ) elements of the current iteration of (x). checks whether valid data should be present in the buffers.
The sliding window updates its complete elements in each Otherwise a zero is loaded into the corresponding sliding
clock cycle, because all feature maps of (x) are calculated window elements, to apply zero padding. The proposed
before the window is moved one element to the right. The implementation always applies zero padding of ( K2x ) on
K
window buffers are needed for this, since only 1 element can both sides in x-direction and of ( 2y ) on both sides in
be read from each line/input buffer in a clock cycle. Each of y-direction.
the (Ky · (Kx − 1)) window buffers has ( I FViM ) elements. 3) Update Buffers: This stage reads the data from the
The different computation steps of the 3D convolution are window and writes it to the different buffers, as shown in
described below in chronological order. Figure 1 by the dotted lines. The input buffer receives its
1) Read Input Vector: Reads vector of Vi elements from data from the bottom left element in the window. Since the
the input image, if the following condition is met: (y ≤ input data can only be read once, it must be buffered. The
ROWsrc ) ∧ (x ≤ COLsrc ) ∧ (o = 0). line buffer receives its data from the right column of the
window in the last iteration of (o = OFVo − 1). This moves
2) Update Sliding Window: In this stage, the data is read M
from the different buffers and stored in the sliding window, the data of the image one line up so that it is available again
as shown in Figure 1 by the dashed lines. Each element in at the next iteration of (x). The window buffer receives its
the sliding window of size (Ky × Kx ) contains (Vi ) vector data from the left columns of the window in the last iteration
elements. The left column of the sliding window get its of (o = OF Vo − 1). The sliding window effect results from
M
data from the line buffers and the input buffer. If (o = 0) the offset reading and writing between the window buffer
new data is read from the input image instead of the input and the window.
518 J Sign Process Syst (2021) 93:513–529
Figure 2 Computation stages of the 3D convolution implementation. gray) buffer weights/biases if configured. I F M = Input Feature Map;
The input comes from the sliding window of the input stage (Figure 1). OF M = Output Feature Map; Vi = Input Vector Size; Vo = Output
Some stages are for floating-point or fixed-point numbers only. Buffers Vector Size; Kx × Ky = Kernel Size.
for weights/biases are marked in light gray. Reader functions (dark
4) Compute Convolution: Figure 2 shows the computa- is checked for overflow and saturated if the corresponding
tion stage of the 3D convolution process. As stated, some policy is set.
of the blocks in the image are only used for fixed-point 4) Write Output Vector: Writes back a vector
or floating-point calculations. The gray blocks show the of Vo elements to the output image, if the follow-
K ROWsrc −1
data whose contents needs to be maintained between loop ing condition is met: (y − 2y ) mod ( ROW −1 ) ∧
iterations and stored in buffers. The weight and bias coeffi-
dst
COLsrc −1
cients can be buffered within the function if the user sets the (x − K2x ) mod ( COL dst −1
) ∧ i = I FViM − 1 . The
appropriate parameter. On first use, they are read from the condition includes the stride computation, expressed with
interfaces and stored in the buffers. If the same coefficients the modulus operation. The value for the stride must be an
are needed again, they can be accessed from the buffers. element of the natural numbers.
In the first step, the input data is taken from the sliding
window and multiplied by the corresponding weights. In 3.2.2 Depthwise Convolution
total (Vo × Vi ) 2D-convolutions of the size (Ky × Kx ) are
calculated. Then (Vi ) 2D-convolutions of the different (Vo ) The Depthwise Convolution can be considered as a 2D
are added together to partially calculate the 3D convolution convolution that is applied to each feature maps of a 3D
of each (Vo ). input image separately. This layer is usually used together
The operation of calculating a sum over several loop with a “pointwise” convolution of (1 × 1), as in MobileNets
iterations violated the desired pipeline interval of 1 by a [9]. This means that for a (3 × 3) convolution, a (3 × 3)
factor of 5 when using floating-point numbers with the pointwise convolution and a (1 × 1) pointwise convolution
Xilinx tools. Therefore, we convert floating-point numbers is used. The advantage of this approach is that less
for this summation to a value that is saturated to a 32- multiplications and weights are required for the convolution
bit fixed-point number. The user sets the parameter for process. Comparing it to a classic 2D convolution, it has a
the fixed-point position of this variable. In the next step similar effect to the separable filter shown in [17].
the partial 3D convolutions are added to the final 3D The amount of feature maps in the input and output
convolution until all 2D-convolutions are summed up. Then image are the same for this function. When comparing
the result is converted back if the final output should be a with the structure of Listing 1, the loop over OF M is
floating-point number. eliminated. Consequently, the total latency is reduced by
When using fixed-point numbers, the multiplication that factor and there is only one parallelization term (Vi ).
in the 2D-convolution increases the fixed-point position. The rest of the basic structure in Listing 1 remains. The total
Therefore, the value is shifted back to the fixed-point number of multiplications and weights is reduced by a factor
position, while ensuring the overflow policy. This process is of OF M compared to a pointwise convolution. Therefore
done before adding the bias, because it has the same fixed- fewer weights must be stored in the internal buffers. On
point position as the output. After adding the bias, the result the other hand, the size of biases remains unchanged. It
J Sign Process Syst (2021) 93:513–529 519
the third dimension of size I F M. It first calculates the mean Batch Normalization is very resource-efficient. The latency
(μ) of the pixel values, as shown in Eq. 4. Using the the of the hardware function is: ROW S · COLS · I FVM + P .
mean value, the variance (σ 2 ) is calculated, as shown in
Eq. 5. Using the mean, variance and a set of pre-trained 1
values (γi , βi ), the output image pixels are calculated, as dsti = γi · (srci − μi ) · ci + βi , ci = (7)
shown in Eq. 6. σi2 +
I
FM
1 3.2.6 Fully Connected
μ= · xi (4)
IFM
i=1
The Fully Connected layer is an essential component of
I
FM most CNNs. It is one of the last layers and is used for
1
σ2 = · (xi − μ)2 (5) the final classification decision. Simplified it is a 3D
IFM
i=1 convolution with a (1 × 1) kernel on an image with (1 × 1)
xi − μ pixel. However, the I F M and OF M can be very large.
yi = γi · √ + βi (6) The weights, biases and input image are buffered on first
σ2 +
use. However, since each weight/bias is read only once per
A straightforward way to compute this function would
image, it is recommended not to buffer them if the weight
be in three separate loops iterating over the 3rd dimension,
matrix becomes too large. The summation of Eq. 8 has been
nested in the loops iterating over the 1st and 2nd
implemented using fixed-point numbers for the floating-
dimensions. With this approach only one output pixel is
point implementation. Therefore, the multiplication result
generated every 3 clock cycles for a parallelization degree
is converted and saturated to a 32-bit wide number before
of zero. Therefore, we created three functions to compute
summation and converted back afterwards. Fixed-point
μ, σ 2 and yi , which are used in a pipelined manner inside
numbers were used, since a summation with floating-point
the three nested loops. As a result, the overall latency is as
numbers increased the total latency by a factor of 5. The
follows: (ROW S · COLS + 2) · I FVM + P . Two times I FVM
fixed-point position is set by a parameter. Depending on the
additional clock cycles are required, since each mini-batch
degree of parallelization, V multiplications are calculated
must pass through these three stages in a pipeline manner.
in parallel and added together. After summation, the data
The input data of a mini-batch (B) is stored in a buffer in the
must be shifted back due to the fixed-point multiplication
first stage to be used for the next two stages. Since there are
according to the rounding policy. Then the bias is added.
3 stages, 3 consecutive input vectors of size I F M must be
When using fixed-point values, the result is converted back
stored in buffers. The weight vectors γ and β are read in the
to the output format according to the overflow policy. The
third stage at the first use and buffered for the further use.
latency of the hardware function is: OF M · I FVM + P .
To calculate (μ) and (σ 2 ) a sum of values must
be computed. Like in the convolution filter, the sum is
computed using fixed-point numbers, because a floating- I
FM
point sum increases the latency by a factor of 5. Therefore, dstof m = (srcif m · weightof m,if m ) + biasof m (8)
floating-point numbers are converted and saturated to a 32- if m=1
bit wide integer value. Since the normalization ( I F1M ) is a
constant, it can be pre-computed to replace the division by a 3.2.7 Softmax
multiplication. Both calculations are easy to vectorize, since
only the sum needs to be parallelized. For the parallelization The Softmax layer normalizes an input vector into a
of the last stage which computes yi , the term √ 12 can be probability distribution and limits the output to a range
σ + between 0 and 1. It is used to determine the probability of
pre-computed once. This has a big impact on the resource several classes at once. The calculation shown in Eq. 9 is
usage when vectorizing the function, since the division and done in two parts. The first part computes the sum and stores
square root are the most resource consuming functions. Due the exponents of the inputs into a buffer. Due to the high
to the accuracy, √ 12 is calculated using floating-point range of values in this function all operations are done using
σ +
numbers. floating-point operations. However, for the same reason
There are different variants of the Batch Normalization. as in the previous functions, the summation is calculated
One of them avoids the calculation of (μ) and (σ 2 ). In using fixed-point numbers. Therefore the exponent result
this variant, the two values are passed to the function as is converted and saturated into a 32-bit fixed-point number
additional parameters, as shown in Eq. 7. Since both values before summation. For each element in the input vector,
are constants, the value of (ci ) can be pre-calculated after (V ) exponents are calculated, stored and added to the
training the neural network. As a result, this variant of the summation. The second part calculates the division of
J Sign Process Syst (2021) 93:513–529 521
Eq. 9. For fixed-point numbers, the division result must The modules 2 to 14 all have the same structure with
be shifted (multiplied) to the correct position according different parameter settings. The first three layers, which
to the rounding policy. Depending on the parallelization include a depthwise convolution, Batch Normalization and
degree, (V ) output elements are computed. The latency of activation layer, all have the same degree of parallelization
the hardware function is: 2 · I FVM + P . (VDW ). Again, the data for pointwise convolution must be
converted to (VI F M ) and then to (VOF M ). The last two
esrcif m layers have a parallelization degree of (VP W ). With these
dstif m = I F M (9)
srci ) 4 vectorization parameters (VDW , VI F M , VOF M , VP W ) the
i=1 (e
optimal configuration for the desired amount of resources
3.3 MobileNets Architecture can be found, as shown in the evaluation. Data width
converters can also be connected between the various
MobileNets [9] were presented by Google Inc. and were modules. However, they were not needed in the final
developed for mobile and embedded vision applications. configuration.
MobileNets utilizes a combination of depthwise separable Module 15 contains the last layer and its output is
convolutions and pointwise convolution to form lightweight therefore connected to a e.g. 64-bit wide DMA via a data
deep neural networks. These networks also introduced two width converter. This module only needs the parameter
global parameters for the width and resolution multiplier (VDW ) for pooling and the parameter (VI F M ) for the input
to define different sizes of the networks. The different of the Fully Connected layer. The Softmax layer is not
networks based on these parameters have different latency computationally intensive enough to become a bottleneck.
and accuracy. This allows using optimum networks to match In general, the vector parameters must be set so that no
the design requirements of the system. single layer becomes a bottleneck, since the slowest layer
MobileNets architecture is based on depthwise separable limits the speed of the others. The different modules contain
convolution as mentioned before. A standard convolution scatter engines to distribute all coefficients to the local
can be factorized into a depthwise convolution and a buffers. This allows all coefficients to be preloaded with
pointwise convolution. Depthwise separable convolution optimal utilization of the memory bandwidth. The scatter
separates filtering and combines inputs into two layers, engine also reduces the number of DMAs needed to access
on contrary to a standard convolution. This factorization memory to load new coefficients. They require data width
of the convolution layer results in a reduction in model converters, since each local buffer has a different depth of
size and computation requirements of the algorithm. This its elements depending on the degree of parallelization of
concept is used in MobileNets in order to have light- the corresponding layer. The HiFlipVX data converter can
weighted neural networks. The first layer of MobileNets also convert between widths that are not multiples of each
is a full convolution layer. Later layers are a combination other. Therefore the data for the different local buffers must
of depthwise convolution and pointwise convolutions. All be aligned to the data type of the scatter engine in global
convolutions are followed by a Batch Normalization layer memory.
and activation layer (ReLU). The final Fully Connected
layer has no non-linearity and is followed by a Softmax 3.4 High-Level Synthesis Directive Usage
layer. Before the final Fully Connected layer an average
pooling is used to reduce the spatial resolution. In total In this work we use 8 different directives (pragma
MobileNets has 28 layers. HLS): inline, interface, data pack, dataflow,
Figure 3 shows the hardware implementation of the stream, resource, pipeline, array partition.
different layers of MobileNets, which is parameterizable. All internal and callable library functions are inlined using
The different modules are interconnected, with module 1 the inline directive.
containing the first layer and module 15 the last layer, The interface directive is only needed in wrapper
creating a very deep pipeline. The input of the first layer functions, which instantiate the library functions and set the
in module 1 is connected to a data width converter and template parameters. There is an example test bench for
gets its data from the global memory. To optimize the each function of the library and the different MobileNets
memory bandwidth, it receives an e.g. 64-bit wide input layers in the main file. When using Xilinx SDSoC no
and converts it to the desired vector size (VI F M ) of the 3D interface directive is be needed. For the SDSoC tool
convolution. The output vector (VOF M ) is then converted we set the ap fifo protocol for all ports. For Xilinx
to the vectorization (VP W ) of the Batch Normalization and Vivado HLS we set the AXI4-Stream (axis) protocol
activation layers. All layers and conversion units of the as interface for the ports. It is a simple handshaking
pipeline are connected via very small FIFO buffers. They protocol most Xilinx IP-Cores use. Additionally, we
are marked by a thicker line in the figure. deactivate the control port of all IP-Cores in Vivado HLS
522 J Sign Process Syst (2021) 93:513–529
Figure 3 Block design of the MobileNets hardware implementation. buffers are marked in light gray. Data movers blocks are marked in
MobileNets has been separated into 15 modules. The modules are dark gray. Multiple scatter units can be connected to the same DMA.
directly connected to each other in the order of their numbering. Local
(ap ctrl none port=return). This port should not The dataflow directive enable task-level pipelining. It
be deactivated for SDSoC. The ( SDSCC ) macro is is needed to create the streaming applications in the three
globally set by the SDSoC tool and is used by our different MobileNets layers shown in Figure 3. To enable
library to automatically switch between the two Xilinx streaming between the different functions of the MobileNets
tools. Setting the ap fifo ports and using the C99 layers, FIFOs are needed. Therefore, the stream directive
style for arrays, our library does not need any specific is used for these FIFOs using a depth of 8. The small
SDSoC directives (pragma SDS). As mentioned in depth allows to use LUTs instead of BRAMs for the FIFOs,
Section 3.2, we use the (vx image data<DATA TYPE, since BRAM is often a limiting resource. Within all library
VEC SIZE>) data type for the function ports, to apply functions we use the pipeline directive with the goal to
vectorization and set the last and user bits of the AXI4- achieve an initiation interval of one. Since all loops below
Stream protocol if needed. To achieve the full bandwidth, the pipeline directive are unrolled automatically, there is
all callable library functions use the data pack directive no need of using the unroll directive.
for their ports. Additionally, we use the data pack Therefore, the resource directive is set to
directive for our internal buffers and FIFOs, to reduce the FIFO LUTRAM for these FIFOs. For most internal buffers,
fragmentation of the utilized (BRAMs). shown in Figures 1 and 2, we set the resource directive to
J Sign Process Syst (2021) 93:513–529 523
use LUTs (RAM 2P LUTRAM) or BRAMs (RAM 2P BRAM) images that are processed one after the other. Increasing
depending on their size. RAM 2P BRAM has been used for this parameter has the advantage that a function can read in
most weight and line buffers. RAM 2P LUTRAM has been pixels of a new image before it has finished the calculation
used for most bias, window and input buffers. The use of of the last image. Coefficients can be buffered on first use,
the resource directives should be used with caution, which does not need to be repeated for the other input
since it can also have a negative effect. In most cases it is images (batches). The batch size can be set for all functions
advisable to give the tool the choice, because then it can in the library. The input resolution can differ from the output
select according to the total resource usage, bit-widths and resolution, but has to be bigger. This is only possible for
selected frequency. The array partition directive is the two convolution functions and the pooling function to
needed if the LUT and BRAM memories do not provide implement a stride. Only the 3D convolution and the Fully
the required bandwidth. E.g. to separate the window buffers Connected layers have both (I F M) and (OF M), all other
of Figure 1. The array partition directive is also functions only have one feature map (I F M). For resolution,
used quite often to completely partition C++ arrays into batch amount and the feature map size we allow a value
registers, like for the white boxes in Figure 1. between 1 and 2028. The bias size can be (0), (OF M) or
(BAT CH ES·OF M) for the two convolution functions and
the Fully Connected layer. The kernel size can be changed
4 Evaluation for the two convolution functions and is (n × m), where (n)
and (m) can be different but must be odd numbers and must
In this section a detailed evaluation of the different be in the range of 1 and 9. It is the same for the pooling
functions of the library is made. Different parameter settings size, but the numbers can also be even. Pooling and padding
are evaluated to make general assumptions. Furthermore, sizes can only be set for the pooling function. The padding
designing larger algorithms is evaluated using MobileNets. size can be between 0 and the half of the pooling size. The
convolution functions automatically
use a padding, which is
4.1 Single Layers Ky,x
the half of the kernel size 2 .
On the right side of the table there are parameters
This part evaluates the different layers of the proposed that are more specific to the FPGA design, such as
library. We tested the design on a ZCU104 MPSoC FPGA frequency changes. The (VI F M ) parallelization is used
from Xilinx using the 2019.1 tool chain including SDSoC in all functions. The (VOF M ) parallelization is only
and Vivado HLS. To obtain the implementation results, needed for 3D convolution and Fully Connected layers,
we built a design with SDSoC and took the results of the for exploration and to further improve the performance.
single functions from the Vivado project. All functions in For both parallelization parameters we allow a value
the library have several parameters which can be changed at between 1 and 128. We allow different data types for
compile time. Table 2 shows the default configuration of the the inputs/outputs and weights of the different layers
parameters of the different layers tested in this evaluation. (int8, uin8, int16, uint16, float32). The
The table also shows the high configurability of the library. biases can have a different data type if fixed-point
On the left side of the table are the normal parameters numbers are used (int8, uin8, int16, uint16,
of a neural network, which are also needed in non-FPGA int32, uint32, float32). This approach has been
designs. Additionally we support 2 pooling types and 9 suggested by some CNN algorithm implementations. The
activation function types. In our terminology, batches are fixed point position determines the size of the fraction
and must be below the number of digits of the data type.
For signed data types, at least 1 bit is required for the
Table 2 Default configurations for the changeable parameters of the integer part. For arithmetic calculations, mainly for fixed-
different layers. point numbers, we must check for overflow and perform
batches 4 vif m 1 the wanted rounding policy. If an overflow occurs data can
input 64x64 vof m 1 either be truncated or saturated to its maximum/minimum
output 64x64 frequency 100 Mhz value. For fixed-point arithmetic operations, the data can be
IFM 32 data type uint8 rounded to zero or the nearest number. Coefficients (weights
OFM 32 bias data type uint8
and biases) can be buffered during execution within the
bias size OFM fixed point position 8
function (buffer coefficients). In Figure 3, this
kernel size 3x3 overflow saturate
is done outside the function to increase efficiency of the
coefficient reading process.
pooling size 2x2 rounding to zero
To verify the correctness of the library functions, we
padding size 1x1 buffer coefficients yes
calculate the mean absolute percentage error (MAPE) of
524 J Sign Process Syst (2021) 93:513–529
Table 3 MAPE (mean absolute percentage error) between the 32- large for FFs and LUTs. The implementation results in
bit floating point baseline software implementation and the various the table do not include the additional blocks that SDSoC
hardware implementations.
integrates into the HW design. The DSP behavior is dif-
layers uint16 int16 float32 ferent because we let the tool decide whether to use LUTs
or DSPs for the arithmetic calculation, as this can vary
3D convolution 0.3413 0.6804 0.00003 depending on the application. The Fully Connected layer
depthwise conv. 0.0127 0.0261 0.00000 usually has many coefficients and therefore requires a lot
pooling (max) 0.0000 0.0000 0.00000 of BRAM. Therefore, it may make sense not to buffer
activation (relu) 0.0000 0.0000 0.00000 the weights, since each weight is only required once per
batch normalization 0.0390 0.1012 0.00004 batch. The 3D convolution consumes more BRAM than
fully connected 0.0000 0.3421 0.00000 the depthwise convolution because it has to buffer more
softmax 0.2104 0.4245 0.00001 coefficients.
In addition, the table shows the estimated latency per
batch. As it is well known, the process of 3D convolution
our hardware implementation compared to a floating-point is the most computationally intensive part in many CNN
baseline one. Table 3 shows the results using the default algorithms and must therefore be parallelized more. The
configuration and quantized random input numbers in the Softmax function is the least computationally intensive
range between 0 and 1, where {x ∈ R|0 ≤ x < 1}. The function and could therefore be executed on a CPU in a
calculation of MAPE has a problem if the divisor is zero. HW/SW co-design, as this function is also quite resource
Therefore we do not consider results where the divisor is intensive. By adding the proposed multi-stage pipelining
less than 10−6 . The fixed-point positions for the data type approach, Batch Normalization can calculate the three
in the table are 16 (uint16), 15 (int16) and 24 (float32). The internal functions almost in the same time as the activation
MAPE of 0.68% for the 3D convolution is due to the high layer. Due to this approach and the computationally
number of multiplications and additions for each output intensive operations like division and square root, more
pixel. A similar behavior can be observed with the other resources are needed. Depthwise convolution and pooling
functions, where many variables have to be added and/or require some additional cycles due to the line buffers.
multiplied together. The float32 computation can have a Table 5 shows the resource usage of the implemented
very small error for the functions that have to calculate a design from the various activation functions using unsigned
sum over several loops, because we had to use fixed point 16-bit data types. As expected, all functions that include an
arithmetic for this summation. Of course, if numbers had to exponent, logarithm, or division in their equation consume
be saturated, the MAPE would be higher, but this was not more resources. Using exponential functions instead of
meant to be proven by this approach, since it is generally the the hyperbolic functions could reduce resource usage. For
case for fixed point numbers. the square root function, there is an option for relaxed
Table 4 shows the resource utilization of the implemented mathematical calculation to reduce resource usage by
and synthesized (grey) designs using the default configura- reducing the precision of the fraction part. The difference
tion. In this table, the Softmax and Fully Connected layers in accuracy can be seen with a MAPE of 0.37 %. Due to
have 256 I F M and 256 OF M, since the resolution for the accuracy, mainly floating point operations were used
these layers is (1 × 1). As it can be seen from the table, for the computational-intensive functions. However, due to
the difference between the estimated synthesis results from quantization, there is still a small error rate left for these
SDSoC and the implemented results from Vivado is quite functions.
Table 4 Resource utilization and latency per batch of implemented (black) and synthesized (grey) designs.
Fully Connected and Softmax layers have 256 IFM and 256 OFM
J Sign Process Syst (2021) 93:513–529 525
Table 5 MAPE (mean absolute percentage error) and resource FFs (43% on average), but also the LUTs (8% on average).
utilization of the implemented design of the various activation However, it has no effect on the BRAMs or DSPs. For
functions using unsigned 16-bit data types.
designs with higher accuracy or for fast integration and
BRAM DSP FF LUT MAPE testing, the library also supports floating point numbers.
They have no effect on the latency of the various library
logistic 0 15 1362 2146 0.00126 functions, except for additional pipeline stages, but have a
hyperbolic 0 17 1549 2370 0.02543 high impact on resource utilization: +432% LUTs, +784%
relu 0 0 26 17 0.00000 FFs and +423% DSPs. When using 16-bit fixed-point values
brelu 0 0 26 25 0.00000 to increase accuracy, there is only a small increase for
softrelu 0 28 1418 2112 0.00144 LUTs (25%), FFs (17%) and DSPs (2%). This again shows
abs 0 0 26 17 0.00000 the importance of quantization in FPGAs. The BRAM
square 0 1 28 28 0.00000 usage always scales with the bit width of the data type
sqrt 0 0 164 384 0.00139 used. Increasing the kernel size has a similar effect for 3D
sqrt (relaxed) 0 0 113 243 0.36990 and depthwise convolution. In both cases the DSPs grow
linear 0 0 26 25 0.00000 with the kernel size. The BRAM increase depends on the
coefficient size (ky × kx ) and the line buffer amount of
(ky − 1). LUTs and FFs are only increased by 85% and 65%
respectively for 2.78× the amount of weights.
Figure 4 shows the relative resource usage for various A more detailed investigation of parallelization was done,
parameter settings compared to the default configuration. because finding the right parameters is important for an
As expected, a change in frequency mainly increases the efficient and performant design. The Batch Normalization
Figure 4 Relative resource utilization for various settings compared to the default configuration. Value is not reported if it is zero. 3D convolution
has a vectorization of vif m × vof m .
526 J Sign Process Syst (2021) 93:513–529
layer scales well with parallelization because resource- configuration with (VOF M = 8), (VI F M = 8) and a fre-
intensive functions do not need to be calculated multiple quency of 200 MHz, an acceleration of 260 was achieved
times as described in Section 3.2.5. Only the increase when the convolution function was executed on the real
in DSPs approximates to a linear behavior. The DSPs system using SDSoC. The measurements were performed
of all other functions scale linearly with the degree of with the ARM processor, on which no operating system is
parallelization. The LUTs and FFs of the Pooling layer running. The consumed resources for the convolution func-
scale less than linearly with the degree of parallelization. tion are: 8858 LUTs, 7679 FFs, 576 DSPs and 66 BRAMs.
The Fully Connected layer even shows a reduction of FFs The BRAM has increased due to fragmentation and a high
and BRAMs due to fragmentation. The 3D convolution has demand of on-chip bandwidth. The execution time of the
a combined vectorization of (VI F M × VOF M ). Different hardware is 837μs, which includes the cache flushing and
combinations of (VI F M ) and (VOF M ) were tested to find an data movement between the FPGA and DMA.
optimized combination. The combined vectorization results
in a parallelization (V ) for 2 (1 × 2|2 × 1), 4 (4 × 1|4 × 4.2 MobileNets
4|2 × 2), 8 (8 × 1|1 × 8|4 × 2|2 × 4) or 16 (16 × 1|1 ×
16|8 × 2|2 × 8|4 × 4). Some assumptions can be made when Before implementing the MobileNets layers onto hardware,
comparing these combinations. The greater the imbalance the optimal parameters must be set. When creating a deep
between (VI F M ) and (VOF M ), the more resources are used pipeline, the system normally is as fast as its slowest
on average. If, for the same (V ), (VI F M ) is greater than component. Table 6 shows our offline calculations for an
(VOF M ), the average usage of LUTs and FFs increases optimal setting of the different modules containing the
slightly by 6% and 10% respectively. On the contrary, a high MobileNets layers. All parameters, which are not reported
(VI F M ) can cause more BRAM to be used if it worsens line in the table use the default configuration. The parameter
buffer fragmentation. values for the resolution and feature maps are set by the
Additionally, one 3D convolution layer has been imple- algorithm. Using the latency equations of Section 4.1, the
mented with a high parallelization to show the perfor- estimated latency can be calculated. The number of pipeline
mance improvement in comparison to a baseline imple- stages was ignored in this estimation as it has almost no
mentation, which is running on the ARM processor of the impact. The maximum latency in the right column shows the
ZCU104 MPSoC at a frequency of 1.2 GHz in release mode bottleneck of the design. In the next step, the vectorization
using the O3 optimization option. Using the same default (V ) settings discussed in Section 3.3 are adapted to improve
Table 6 Shows proposed vectorization (V ) setting for MobileNets layers of Section 3.3.
input output IFM OF M vdw vif m vof m vpw dwconv dwbn pwconv pwbn max
Latency is calculated for functions in Figure 3 separately without pipeline stages. Depthwise (dw) & pointwise (pw) latency of Batch
Normalization (bn) & convolution are reported. Maximum latency of all functions within a layer is shown on the right
J Sign Process Syst (2021) 93:513–529 527
Table 7 Final results of the three MobileNets modules shown in is some overhead for streaming multiple functions in a
Section 3.3 executed separately on the ZCU104. pipeline, data moving between the DDR and cache flushing.
module 1 module 2 module 15 This overhead is 90.8%, 76.6% and 52.9% for the modules
1, 2 and 15. When executing the layers sequentially, these
ARM (ms) 34.166 53.221 9.944 numbers would be higher. To verify the propagation of the
FPGA (ms) 0.966 0.902 0.489 error, the MAPE value was computed for 16-bit unsigned
speed-up 35.4 59.0 20.3 fixed-point numbers. It was 0.21%, 0.79% and 0.78% for
LUT 11881 16914 10579 the modules 1, 2 and 15. The resources listed in the table
FF 13265 16660 5773 contain only the modules and no DMAs. When considering
DSPs 237 140 27 the ZCU104 the resource usage is sufficient to fit all layers.
BRAM 1 20 263.5 For this case, the Ultra Rams would be needed and the Fully
Connected layer in module 15 should not buffer its weights.
The FPGA runs at 200 MHz
4.3 Comparisons to Related Work
the max latency, while keeping the available resources for Hassan et al. [8] presented a HW/SW co-design implemen-
DSPs and BRAMs into account. Since these two resources tation of AlexNet on an FPGA. They performed the first
can be easily estimated and are in most cases the limiting layer of AlexNet on hardware and achieved 2147483647
resources for CNNs. The activation layer is not taken into clock cycles, which would be approximately 10.7 ms when
account, since it has the same parallelization as the Batch considering a frequency of 0.2 GHz. For comparison,
Normalization, but a slightly lower latency. In the table: a similar convolution layer was implemented using our
(vdw ) refers to (dwconv ) and (dwbn ); (vif m ) and (vof m ) library with the same frequency, same parameters and 8-bit
refer to (pwconv ); (vpw ) refers to (pwbn ). For module 15, unsigned integer data types. The implemented convolution
(dwconv ) refers to the Pooling layer and (pwbn ) to the layer had a latency of 3.31 ms, which is a speed-up of
Softmax layer. 3.23. For the same layer, our work shows almost 73% less
Table 7 shows the final implemented design executed BRAM usage, demonstrating the proposed library’s ability
on the ZCU104 MPSoC in baremetal. A baseline software to reduce the memory consumption of large neural networks
implementation uses 32-bit floating point numbers and on FPGAs.
runs on the ARM processor at a frequency of 1.2 GHz Liu et al. [19] proposed and developed a CNN accelerator
in release mode using the O3 optimization option. Our for the Xilinx ZYNQ-7100 platform. They implemented the
proposed implementation uses 8-bit unsigned numbers and SSD-MobileNets-V1 [31] layers as test application for their
runs on the FPGA at a frequency of 0.2 GHz. The time proposed work. The proposed work is also HLS based and
measurements have been done using the ARM processor. A uses Vivado HLS 2016.4. We implemented the most time
good speed-up has been achieved for the single modules. consuming SSD-MobileNets-V1 layers and compared them
Module 2 has the highest speed-up, since it has the highest with the work of Liu et al. in Table 8. For our hardware
parallelization degree and contains most functions executed and software implementations we used the Zynq ZCU102.
in a streaming manner. For module 1 and 2 also a frequency For measurements we executed the algorithms on the board
of 300 MHz was possible. When combining all modules to and measured them from the ARM processor. We show
a very deep pipeline this speed-up would be even higher. the CPU and FPGA results of our work and of Liu et
When comparing the FPGAs computation time with the al. [19]. Both implementations run at 100 MHz, to have
estimated time of the slowest function in Table 6, there a fair comparison, but higher frequencies can be achieved
All results are in ms. VI F M : Parallelization of the Input Feature Map. VOF M : Parallelization of the Output Feature Map
528 J Sign Process Syst (2021) 93:513–529
with our implementation. The table shows the execution times for different parallelization settings for IFM and OFM of our implementation. The Proposed 2 settings should be the maximum possible in terms of available resources, if the complete algorithm is ported to the ZCU102 and the different functions stream their results between each other. If we compare the results of Layer 27 and Layer 29 of the SSD-Mobilenet-V1 network, our execution time is 11.2x and 18.7x faster. When computing the complete streaming network of SSD-Mobilenet-V1, Layers 1 and 27 would be the bottleneck. This is intended, since layers 1, 27, 29, 31 and 33 of SSD-Mobilenet-V1 are the only layers with a 3 × 3 convolution kernel. Therefore, we used these layers as the roofline, since they consume more DSPs than the other layers. For our work, we propose to use a quantization technique, like the TensorFlow post-training quantization [1]. This will allow us to use smaller parameters, thus saving resources and power. For our work, we used unsigned 8-bit integers as data types for inputs, outputs, weights and biases. Wu et al. [35] investigated the mathematical aspect of quantization parameters on different neural networks. They also present an 8-bit quantization workflow that maintains an accuracy within 1% of the floating-point baseline. Therefore, quantized parameters with smaller bit widths should be used instead of floating-point parameters, as they maintain a reasonable accuracy, achieve a higher speed-up and save resources.

5 Conclusion

In this work we have shown an HLS FPGA library for neural networks. It contains 7 different streaming-capable functions to create large neural networks with deep pipelines. Due to the high parameterization of its functions, the library is suitable for embedded and HPC systems. The integration into HiFlipVX allows the use of further image processing functions. Although the library was optimized using Xilinx HLS directives, it was implemented in a vendor-independent way. The different parameter settings and parallelization possibilities were investigated in the evaluation to draw conclusions for the user. The evaluation also shows the low error rate, high performance, scalability and resource efficiency of the library. Using the MobileNets algorithm, we show how to efficiently create and optimize larger designs. An efficient approach to transfer coefficients and a way to find the optimal vectorization parameters were shown. In the future, we plan to extend the library to a framework that uses the OpenVX graph-based approach.

Acknowledgements This work has been funded partially by the German Federal Ministry of Education and Research (BMBF) as part of the PARIS project under grant agreement number 16ES0657 and partially by the COllective Research NETworking (CORNET) project AITIA: Embedded AI Techniques for Industrial Applications. CORNET-AITIA is funded by the BMWi (Federal Ministry for Economic Affairs and Energy) under the IGF-project number 249 EBG.

Funding Open Access funding enabled and organized by Projekt DEAL.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283). https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md.
2. Akgün, G., Kalms, L., Göhringer, D. (2020). Resource efficient dynamic voltage and frequency scaling on Xilinx FPGAs. In International symposium on applied reconfigurable computing (ARC) (pp. 178–192).
3. Chen, Y., He, J., Zhang, X., Hao, C., Chen, D. (2019). Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 73–82). https://doi.org/10.1145/3289602.3293915.
4. Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., Temam, O. (2014). DaDianNao: A machine-learning supercomputer. In 47th annual IEEE/ACM international symposium on microarchitecture (pp. 609–622).
5. Giduthuri, R., & Pulli, K. (2016). OpenVX: A framework for accelerating computer vision. In SIGGRAPH ASIA 2016 Courses (pp. 14:1–14:50). https://doi.org/10.1145/2988458.2988513.
6. Guan, Y., Liang, H., Xu, N., Wang, W., Shi, S., Chen, X., Sun, G., Zhang, W., Cong, J. (2017). FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In 25th annual international symposium on field-programmable custom computing machines (FCCM) (pp. 152–159).
7. Guo, K., Sui, L., Qiu, J., Yu, J., Wang, J., Yao, S., Han, S., Wang, Y., Yang, H. (2018). Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(1), 35–47.
8. Hassan, R., & Mostafa, H. (2020). Implementation of deep neural networks on FPGA-CPU platform using Xilinx SDSoC. Analog Integrated Circuits and Signal Processing. https://doi.org/10.1007/s10470-020-01638-5.
9. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
10. Intel (2020). Intel FPGA SDK for OpenCL Pro Edition: Programming Guide 19.4.
11. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
12. Ji, S., Xu, W., Yang, M., Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on multimedia (pp. 675–678).
14. Kalms, L., & Göhringer, D. (2017). Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs. In 27th international conference on field programmable logic and applications (FPL) (pp. 1–4). https://doi.org/10.23919/FPL.2017.8056847.
15. Kalms, L., & Göhringer, D. (2020). Accelerated high-level synthesis feature detection for FPGAs using HiFlipVX, chap. 7 (pp. 115–135). New York: Springer.
16. Kalms, L., & Göhringer, D. (2020). HiFlipVX: Open source high-level synthesis FPGA library for image processing. https://github.com/TUD-ADS/HiFlipVX.
17. Kalms, L., Podlubne, A., Göhringer, D. (2019). HiFlipVX: An open source high-level synthesis FPGA library for image processing. In Applied reconfigurable computing (pp. 149–164).
18. Krizhevsky, A., Sutskever, I., Hinton, G.E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386.
19. Liu, B., Zou, D., Feng, L., Feng, S., Fu, P., Li, J. (2019). An FPGA-based CNN accelerator integrating depthwise separable convolution. Electronics, 8, 281.
20. Liu, Z., Chow, P., Xu, J., Jiang, J., Dou, Y., Zhou, J. (2019). A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs. Electronics, 8, 65.
21. Long, J., Shelhamer, E., Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3431–3440).
22. Omidian, H., & Lemieux, G.G.F. (2018). Janus: A compilation system for balancing parallelism and performance in OpenVX. Journal of Physics: Conference Series (JPCS), 1004, 012011. https://doi.org/10.1088/1742-6596/1004/1/012011.
23. Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M.P., Shyu, M.L., Chen, S.C., Iyengar, S.S. (2018). A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys, 51(5). https://doi.org/10.1145/3234150.
24. Qasaimeh, M., Denolf, K., Lo, J., Vissers, K., Zambreno, J., Jones, P.H. (2019). Comparing energy efficiency of CPU, GPU and FPGA implementations for vision kernels. In International conference on embedded software and systems (ICESS) (pp. 1–8).
25. Ren, S., He, K., Girshick, R., Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.
26. Sekar, C., & Hemasunder (2017). Tutorial T7: Designing with Xilinx SDSoC. In 30th international conference on VLSI design and 16th international conference on embedded systems (VLSID) (pp. xl–xli). https://doi.org/10.1109/VLSID.2017.97.
27. Song, L., Wang, Y., Han, Y., Zhao, X., Liu, B., Li, X. (2016). C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization. In Proceedings of the 53rd annual design automation conference (DAC). https://doi.org/10.1145/2897937.2897995.
28. Suda, N., Chandra, V., Dasika, G., Mohanty, A., Ma, Y., Vrudhula, S., Seo, J.S., Cao, Y. (2016). Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 16–25). https://doi.org/10.1145/2847263.2847276.
29. Taheri, S., Behnam, P., Bozorgzadeh, E., Veidenbaum, A., Nicolau, A. (2019). AFFIX: Automatic acceleration framework for FPGA implementation of OpenVX vision algorithms. In International symposium on field-programmable gate arrays (FPGA) (pp. 252–261). https://doi.org/10.1145/3289602.3293907.
30. Tapiador Morales, R., Rios-Navarro, A., Linares-Barranco, A., Kim, M., Kadetotad, D., Seo, J.S. (2016). Comprehensive evaluation of OpenCL-based convolutional neural network accelerators in Xilinx and Altera FPGAs. CoRR.
31. TensorFlow (2020). SSD MobileNet V1. https://tensorflow.org/lite/models/object_detection/overview.
32. Venieris, S.I., & Bouganis, C. (2017). Latency-driven design for FPGA-based convolutional neural networks. In 27th international conference on field programmable logic and applications (FPL) (pp. 1–8).
33. Wang, Y., Xu, J., Han, Y., Li, H., Li, X. (2016). DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In 53rd design automation conference (DAC) (pp. 1–6).
34. Winterstein, F., Bayliss, S., Constantinides, G.A. (2013). High-level synthesis of dynamic data structures: A case study using Vivado HLS. In International conference on field-programmable technology (FPT) (pp. 362–365). https://doi.org/10.1109/FPT.2013.6718388.
35. Wu, H., Judd, P., Zhang, X., Isaev, M., Micikevicius, P. (2020). Integer quantization for deep learning inference: Principles and empirical evaluation.
36. Xilinx (2019). xfOpenCV. https://github.com/Xilinx/xfopencv.
37. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J. (2015). Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 161–170). https://doi.org/10.1145/2684746.2689060.
38. Zhang, C., Sun, G., Fang, Z., Zhou, P., Pan, P., Cong, J. (2018). Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
39. Zhang, J., & Li, J. (2017). Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 25–34). https://doi.org/10.1145/3020078.3021698.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
arXiv:2002.05796v2 [cs.PL] 21 Jul 2020

Abstract—FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or to design proper abstractions.
In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: it relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas in the source code. In order to validate achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-the-art Domain-Specific Language (DSL) approaches for this domain.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) consist of a network of reconfigurable digital logic cells that can be configured to implement any combinatorial logic or sequential circuits. This allows the design of custom, application-tailored hardware. In particular, memory-intensive applications benefit from FPGA implementations by exploiting fast on-chip memory for high throughput. These features make FPGA implementations orders of magnitude faster or more energy-efficient than CPU implementations in these areas. However, FPGA programming poses challenges to programmers unacquainted with hardware design.
FPGAs are traditionally programmed at the Register-Transfer Level (RTL). This requires modeling digital signals, their timing, the flow between registers, as well as the operations performed on them. Hardware Description Languages (HDLs) such as Verilog or VHDL allow for the explicit description of arbitrary circuits but require significant coding effort and verification time. This makes design iterations time-consuming and error-prone, even for experts: the code needs to be rewritten for different performance or area objectives. In recent languages such as Chisel [1], VeriScala [2], and MyHDL [3], programmers can create a functional description of their design but still stick to the RTL.
High-Level Synthesis (HLS) increases the abstraction level to an untimed high-level specification similar to imperative programming languages and automatically solves low-level design issues such as clock-level timing, register allocation, and structural pipelining [4]. However, HLS code that is optimized for the synthesis of high-performance circuits is fundamentally different from a software program delivering high performance on a CPU. This is due to the significant gap between the programming paradigms. An HLS compiler has to optimize the memory hierarchy of a hardware implementation and parallelize its data paths [5].
In order to achieve good Quality of Results (QoR), HLS languages demand that programmers also specify the hardware architecture of an application instead of just its algorithm. For this reason, HLS languages offer hardware-specific pragmas. This ad-hoc mix of software and hardware features makes it difficult for programmers to optimize an application. In addition, most HLS tools rely on their own C dialect, which prevents code portability. For example, Xilinx Vivado HLS [6] uses C++ as its base language, while the Intel SDK [7] (formerly Altera) uses OpenCL C. These severe restrictions make it hard to use existing HLS languages in a portable and modular way.
In this paper, we advocate describing FPGA designs using functional abstractions and partial evaluation to generate optimized HLS code. Consider Figure 1 for an example from image processing: with a functional language, we separate the description of the sobel_x operator from its realization in hardware. The hardware realization make_local_op is a function that specifies the data path, the parallelization, and the memory architecture. Thus, the algorithm and the hardware architecture are described by a set of higher-order functions. A partial evaluator, ultimately, combines these functions to generate HLS code that delivers high-performance circuit designs when compiled with HLS tools. Since the initial descriptions are high-level, compact, and functional, they are reusable and distributable as a library. We leverage the AnyDSL compiler framework [8] to perform partial evaluation and extend it to generate input code for HLS tools targeting Intel and Xilinx FPGA devices. We claim that this approach leads to modular and portable code, in contrast to existing HLS approaches, and is able to produce highly

Figure 1. AnyHLS example: The algorithm description sobel_x is decoupled from its realization in hardware make_local_op. The hardware realization is a function that specifies important transformations for the exploitation of parallelism and the memory architecture. The function generate(vhls) selects the backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm and realization functions.
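Figure 1's hardware realization can be hard to picture without code. The following is a rough, vendor-neutral C++ sketch (our illustration, not AnyHLS output) of the streaming structure that make_local_op stands for: line buffers that cache the previous image rows, a small shift window, and the operator applied once per input pixel. The names, the 3x3 Sobel-x kernel and the simplified border handling are assumptions.

    #include <array>
    #include <cstdint>
    #include <vector>

    // Illustrative only: a 3x3 local operator over a W x H image with
    // line buffers and a sliding window, as sketched in Figure 1.
    constexpr int W = 1024, H = 1024, KW = 3, KH = 3;
    using Pixel = int16_t;

    // Assumed operator: horizontal Sobel on a 3x3 window.
    static Pixel sobel_x(const std::array<std::array<Pixel, KW>, KH>& w) {
        return (w[0][2] + 2 * w[1][2] + w[2][2]) -
               (w[0][0] + 2 * w[1][0] + w[2][0]);
    }

    void local_op(const std::vector<Pixel>& in, std::vector<Pixel>& out) {
        // Two line buffers cache the previous image rows (on-chip memory in HLS).
        std::array<std::array<Pixel, W>, KH - 1> line_buf{};
        // The sliding window holds the 3x3 neighborhood (registers in HLS).
        std::array<std::array<Pixel, KW>, KH> window{};

        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                // Shift the window one column to the left.
                for (int r = 0; r < KH; ++r)
                    for (int c = 0; c < KW - 1; ++c)
                        window[r][c] = window[r][c + 1];
                // Insert a new column: two cached pixels plus the fresh input pixel.
                Pixel p = in[y * W + x];
                window[0][KW - 1] = line_buf[0][x];
                window[1][KW - 1] = line_buf[1][x];
                window[2][KW - 1] = p;
                // Update the line buffers for the next row.
                line_buf[0][x] = line_buf[1][x];
                line_buf[1][x] = p;
                // Border handling omitted: the first rows and columns are not valid.
                out[y * W + x] = sobel_x(window);
            }
        }
    }

In AnyHLS this structure is not written by hand but assembled from the library abstractions described in Sections III and IV, and the data path would additionally be replicated v times for vectorization.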
application domain by themselves (see Section III). AnyHLS is thereby built on top of AnyDSL [8] (see Section II-C). AnyDSL offers partial evaluation to enable shallow embedding [34] without the need for modifying a compiler. This means that there is no need to change the compiler when adding support for a new application domain, since programmers can design custom control structures. Partial evaluation specializes algorithmic variants of a program at compile time. Compared to metaprogramming, partial evaluation operates in a single language and preserves the well-typedness of programs [8]. Furthermore, different combinations of static/dynamic parameters can be instantiated from the same code. Previously, we have shown how to abstract image border handling implementations for Intel FPGAs using AnyDSL [35]. In this paper, we present AnyHLS and an image processing library to synthesize FPGA designs in a modular and abstract way for both Intel and Xilinx FPGAs.

Thus, the calls

    let z = pow(x, 5);        let z = pow(3, 5);

will result in the following equivalent sequences of instructions after specialization:

    let y = x * x;            let z = 243;
    let z = x * y * y;

As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function.
FPGA implementations must be statically defined for QoR: types, loops, functions, and interfaces must be resolved at compile time [16], [18], [19]. Partial evaluation has many advantages compared to metaprogramming, as discussed in Section II-B. Hence, Impala's partial evaluation is particularly useful to optimize HLS descriptions.

2 https://anydsl.github.io
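To make the contrast with metaprogramming concrete, here is a small C++ analogue of the pow example (our illustration, not from the paper): constexpr folds the fully static call at compile time, but the mixed static/dynamic call pow_int(x, 5) remains an ordinary runtime call unless the optimizer happens to specialize it, which is exactly the case Impala's partial evaluator handles explicitly.

    #include <cstdio>

    // constexpr lets the C++ compiler fold pow_int(3, 5) to 243 at compile time.
    constexpr int pow_int(int x, int n) {
        int r = 1;
        for (int i = 0; i < n; ++i) r *= x;
        return r;
    }

    int main() {
        constexpr int z1 = pow_int(3, 5);   // evaluated at compile time: 243
        int x = 7;
        int z2 = pow_int(x, 5);             // plain runtime call; any unrolling or
                                            // strength reduction is left to the optimizer
        std::printf("%d %d\n", z1, z2);
        return 0;
    }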
Figure 2. FPGA code generation flows for Halide, Hipacc, and AnyHLS (from left to right). VHLS and AOCL are used as acronyms for Vivado HLS and Intel FPGA SDK for OpenCL, respectively. Halide and Hipacc rely on domain-specific compilers for image processing that instantiate template libraries. AnyHLS allows defining all abstractions for a domain in a language called Impala and relies on partial evaluation for code specialization. This ensures maintainability and extensibility of the provided domain-specific library, for image processing in this example.

2) Generators: Because iteration on various domains is a common pattern, Impala provides syntactic sugar for invoking certain higher-order functions. The loop

    for var1, ..., varn in iter(arg1, ..., argn) { /* ... */ }

is desugared by wrapping the loop body into an anonymous function

    |var1, ..., varn| { /* ... */ }

that is passed to iter as the last argument. We call functions that are invokable like this generators. Domain-specific libraries implemented in Impala make heavy use of these features as they allow programmers to write custom generators that take advantage of both domain knowledge and certain hardware features, as we will see in the next section.
Generators are particularly powerful in combination with partial evaluation. Consider the following functions:
    type Body = fn(int) -> ();
    fn @(?a & ?b) unroll(a: int, b: int, body: Body) -> () {
      if a < b { body(a); unroll(a+1, b, body) }
    }
    fn @ range(a: int, b: int, body: Body) -> () {
      unroll($a, b, body)
    }

Both generators iterate from a (inclusive) to b (exclusive) while invoking body each time. The filter unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site.

III. THE ANYHLS LIBRARY

Efficient and resource-friendly FPGA designs require application-specific optimizations. These optimizations and transformations are well known in the community. For example, de Fine Licht et al. [20] discuss the key transformations of HLS codes such as loop unrolling and pipelining. They describe the whole hardware design from the low-level memory layout to the operator implementations with support for low-level loop transformations throughout the design. In our setting, the programmer defines and provides these abstractions using AnyDSL for a given domain in the form of a library. We rely on partial evaluation to combine those abstractions and to remove the overhead associated with them. Ultimately, the AnyDSL compiler synthesizes optimized HLS code (C++ or OpenCL C) from a given functional description of an algorithm as shown in Figure 2. The generated code goes to the selected HLS tool. This is in contrast to other domain-specific approaches like Halide-HLS [25] or Hipacc [27], which rely on domain-specific compilers to instantiate predefined templates or macros. Hipacc makes use of two distinct libraries to synthesize algorithmic abstractions to Vivado HLS and Intel AOCL, while AnyHLS uses the same image processing library that is described in Impala.

A. HLS Code Generation

For HLS code generation, we implemented an intrinsic named vhls in AnyHLS to emit Vivado HLS code and an intrinsic named opencl to emit AOCL:

    with vhls() { body() }        with opencl() { body() }

With opencl we use a grid and block size of (1, 1, 1) to generate a single work-item kernel, as the official AOCL documentation recommends [7]. We extended AnyDSL's OpenCL runtime by the extensions of the Intel OpenCL SDK. To provide an abstraction over both HLS backends, we create a wrapper generate that expects a code generation function:

    type Backend = fn(fn() -> ()) -> ();
    fn @ generate(be: Backend, body: fn() -> ()) -> () {
      with be() { body() }
    }

Switching backends is now just a matter of passing an appropriate function to generate:

    let backend = vhls; // or opencl
    with generate(backend) { body() }

B. Building Abstractions for FPGA Designs

In the following, we present abstractions for the key transformations and design patterns that are common in FPGA design. These include (a) important loop transformations, (b) control flow and data flow descriptions such as reductions and Finite State Machines (FSMs), and (c) the explicit utilization of different memory types. Approaches like Spatial [15] expose these patterns within the language; new patterns require dedicated support from the compiler. Hence, these languages and compilers are restricted to the specialized application domain they have been designed for. In AnyHLS, Impala's functional language and partial evaluation allow us to design the abstractions needed for FPGA synthesis in the form of a library. New patterns can be added to the library without dedicated support from the compiler. This makes AnyHLS easier to extend compared to the approaches mentioned before.

1) Loop Transformations: C++ compilers usually provide certain preprocessor directives that perform particular code transformations. A common feature is to unroll loops (see the left-hand side):
    for (int i=0; i<N/W; ++i) {            for i in range(0, N/W) {
      for (int w=0; w<W; ++w) {              for w in unroll(0, W) {
        #pragma unroll                          body(i*W + w);
        body(i*W + w);                        }
      }                                      }
    }

Such pragmas are built into the compiler. The Impala version (shown on the right) uses generators that are entirely implemented as a library. Partial evaluation optimizes Impala's range and unroll abstractions as well as the input body function according to their static inputs, i.e., N and W. The residual program consists of consecutive instances of the body function according to the value of W, as shown in Figure 3. This generates concise and clean code for the target HLS compiler, which is drastically different from using a pragma.

Figure 3. Parallel processing (no unrolling, unroll inner loop, unroll outer loop, unroll inner and outer loop).

Generators, unlike C++ pragmas, are first-class citizens of the Impala language. This allows programmers to implement sophisticated loop transformations. For example, the following function tile returns a new generator. It instantiates a tiled loop nest of the specified tile size with the loops inner and outer:

    type Loop = fn(int, int, fn(int) -> ()) -> ();
    fn @ tile(size: int, inner: Loop, outer: Loop) -> Loop {
      @|beg, end, body| outer(0, (end-beg)/size,
        |i| inner(i*size + beg, (i+1)*size + end, |j| body))
    }

    let schedule = tile(W, unroll, range);
    for i in schedule(0, N) {
      body(i)
    }

Passing W for the tiling size, unroll for the inner loop, and range for the outer loop yields a generator that is identical to the loop nest at the beginning of this paragraph. With this design, we can reuse or explore iteration techniques without touching the actual body of a for loop. For example, consider the processing options for a two-dimensional loop nest as shown in Figure 3: when just passing range as inner and outer loop, the partial evaluator will keep the loop nest and, hence, not unroll body and instantiate it only once. Unrolling the inner loop replicates body and increases the bandwidth requirements accordingly. Unrolling the outer loop also replicates body, but in a way that benefits data reuse from the temporal locality of an iterative algorithm. Unrolling both loops replicates body for increased bandwidth and data reuse for temporal locality.
C/C++-based HLS solutions often use a pragma to mark a loop amenable for pipelining. This means parallel execution of the loop iterations in hardware. For example, the following code on the left uses an initiation interval (II) of 3:

    for (int i=0; i<N; ++i) {          let II = 3;
      #pragma HLS pipeline II=3        for i in pipeline(II, 0, N) {
      body(i);                           body(i)
    }                                  }

Instead of a pragma (on the left), AnyHLS uses the intrinsic generator pipeline (on the right). Unlike the above loop abstractions (e.g., unroll), Impala emits a tool-specific pragma for the pipeline abstraction. This provides portability across different HLS tools. Furthermore, it allows the programmer to invoke and pass around pipeline, just like any other generator.
2) Reductions: Reductions are useful in many contexts. The following function takes an array of values, a range within it, and an operator:

    type T = int;
    fn @(?beg & ?end) reduce(beg: int, end: int, input: &[T],
                             op: fn(T, T) -> T) -> T {
      let n = end - beg;
      if n == 1 {
        input(beg)
      } else {
        let m = (end + beg) / 2;
        let a = reduce(beg, m, input, op);
        let b = reduce(m, end, input, op);
        op(a, b)
      }
    }

In the above filter, the recursion will be completely unfolded if the range is statically known. Thus,

    reduce(0, 4, [a, b, c, d], |x, y| x + y)

yields: (a + b) + (c + d).
3) Finite State Machines: AnyHLS models computations that depend not only on the inputs but also on an internal state with an FSM. To define an FSM, programmers need to specify states and a transition function that determines when to change the current state based on the machine's input. This is especially beneficial for modeling control flow. To describe an FSM in Impala, we start by introducing types to represent the states and the machine itself:

    type State = int;
    struct FSM {
      add: fn(State, fn() -> (), fn() -> State) -> (),
      run: fn(State) -> ()
    }

An object of type FSM provides two operations: adding one state with add or running the computation. The add method takes the name of the state, an action to be performed for this state, and a transition function associated with this state. Once all states are added, the programmer runs the machine by passing the initial state as an input parameter. The following example adds 1 to every element of an array:

    let buf = /*...*/;
    let mut (idx, pixel) = (0, 0);
    let fsm = make_fsm();
    fsm.add(Read,    || pixel = buf(idx),
                     || if idx >= len { Exit } else { Compute });
    fsm.add(Compute, || pixel += 1,         || Write);
    fsm.add(Write,   || buf(idx++) = pixel, || Read);
    fsm.run(Read);

Similar to the other abstractions introduced in this section, the constructor for an FSM is not a built-in function of the compiler but a regular Impala function. In some cases, we want to execute the FSM in a pipelined way. For this scenario, we add a second method run_pipelined. As all the methods, e.g., make_fsm, add, and run, are annotated for partial evaluation (by @), input functions to these methods will be optimized according to their static inputs. Ultimately, AnyHLS will emit the states of an FSM as part of a loop according to the selected run method.
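As a rough idea of what "the states of an FSM as part of a loop" can look like once lowered to C++ for an HLS tool, consider the following sketch (our own naming and structure, not actual AnyHLS output); the bounds check is hoisted before the read for safety:

    #include <cstddef>

    enum class State { Read, Compute, Write, Exit };

    // Sketch: the three FSM states from the example above, emitted as a
    // switch inside a single loop. An HLS tool can pipeline this loop
    // when the pipelined run method is selected.
    void add_one(int* buf, std::size_t len) {
        State state = State::Read;
        std::size_t idx = 0;
        int pixel = 0;
        while (state != State::Exit) {
            switch (state) {
            case State::Read:
                if (idx >= len) { state = State::Exit; break; }
                pixel = buf[idx];
                state = State::Compute;
                break;
            case State::Compute:
                pixel += 1;
                state = State::Write;
                break;
            case State::Write:
                buf[idx++] = pixel;
                state = State::Read;
                break;
            default:
                break;
            }
        }
    }

With run_pipelined, the surrounding loop would presumably also receive the tool-specific pipeline directive discussed above.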
global memory on-chip memory register stream
4) Memory Types and Memory Abstractions: FPGAs have
different memory types of varying sizes and access properties.
Impala supports four memory types specific to hardware design Figure 4. Memory types provided for FPGA design
(see Figure 4): global memory, on-chip memory, registers, and
streams. Global memory (typically DRAM) is allocated on the OnChipArray
Regs2D StreamArray
host using our runtime and accessed through regular pointers. Regs1D
On-chip memory (e.g., BRAM or M10K/M20K) for the FPGA
is allocated using the reserve_onchip compiler intrinsic.
1D register array
Memory accesses using the pointer returned by this intrinsic 2D register array stream array
on-chip array
will map to on-chip memory. Standard variables are mapped
to registers, and a specific stream type is available to allow
for the communication between FPGA kernels. Memory-wise, Figure 5. Memory abstractions
a stream is mapped to registers or on-chip memory by the
HLS tools. These FPGA-specific memory types in Impala will
the smaller array. The generator (make_regs1d) returns an
be mapped to their corresponding tool-specific declarations in
Impala variable that can be read and written by index values
the residual program (on-chip memory will be defined as local
(regs in the following code), similar to C arrays.
memory for AOCL whereas it will be defined as an array in
let regs = make_regs1d(size);
Vivado HLS).
a) Memory partitioning: an array partitioning pragma However, it defines size number of registers in the residual
must be defined as follows to implement a C array with program instead of declaring an array and partitioning it by
hardware registers using Vivado HLS [6]: tool-specific pragmas as in Listing 1. The generated code
typedef int T; does not contain any compiler directives; hence it can be
T Regs1D[size];
#pragma HLS variable=Regs1D array_partition dim=0
used for different HLS tools (e.g., Vivado HLS, AOCL). Since
we annotated make_regs1d, read, and write for partial
Listing 1. A typical way of partitioning an array by using pragmas in existing
HLS tools. evaluation, any call to these functions will be inlined recursively.
This means that the search to find the register to read to or
Other HLS tools offer similar pragmas for the same task. write from will be performed at compile time. These registers
Instead, AnyHLS provides a more concise description of a will be optimized by the AnyDSL compiler, just like any other
register array without using any tool-specific pragma by the variables: unnecessary assignments will be avoided, and a clean
recursive declaration of registers as follows: HLS code will be generated.
type T = int; Correspondingly, AnyHLS provides generators (similar to
struct Regs1D { Listing 2) for one and two-dimensional arrays of on-chip
read: fn(int) -> T,
write: fn(int, T) -> (), memory (e.g., line buffers in Section IV), global memory, and
size: int streams (as illustrated in Figure 5) instead of using memory
}
fn @ make_regs1d(size: int) -> Regs1D { partitioning pragmas encouraged in existing HLS tools (as in
if size == 0 { Listing 1).
Regs1D {
read: @|_| 0,
write: @|_, _| (),
size: size IV. A L IBRARY FOR I MAGE P ROCESSING ON FPGA
}
} else {
AnyHLS allows for defining domain-specific abstractions
let mut reg: T; and optimizations that are used and applied prior to generating
let others = make_regs1d(size - 1);
Regs1D {
customized input to existing HLS tools. In this section, we
read: @|i| if i+1 == size { reg } introduce a library that is developed to support HLS for the
else { others.read(i) },
write: @|i, v| if i+1 == size { reg = v }
domain of image processing applications. It is based on the
else { others.write(i, v) }, fundamental abstractions introduced in Section III-B. Our low-
size: size
}
level implementation is similar to existing domain-specific
} languages targeting FPGAs [24], [27]. For this reason, we focus
}
on the interface of our abstractions as seen by the programmer.
Listing 2. Recursive description of a register array using partial evalution We design applications by decoupling their algorithmic
instead of declaring an array and partitioning it by HLS pragmas.
description from their schedule and memory operations. For
instance, typical image operators, such as the following
When the size is not zero, each recursive call to this
Sobel filter, just resort to the make_local_op generator.
function allocates a register variable named reg, and creates
Similarly, we implement a point operator for RGB-to-gray
a smaller register array with one element less named others.
color conversion as follows (Listing 3):
The read and write functions test if the index i is equal
fn sobel_edge(output: &mut [T], input: &[T]) -> () {
to the index of the current register. In the case of a match, let img = make_raw_mem2d(width, height, input);
the current register is used. Otherwise, the search continues in let dx = make_raw_mem2d(width, height, output);
let sobel_extents = extents(1, 1); // for 3x3 filter access an element of the vector. This increases data reuse and
let operator = make_local_op(4, // vector factor
sobel_operator_x, sobel_extents, mirror, mirror);
DRAM-to-on-chip memory bandwidth [42].
with generate(hls) { operator(img, dx); } 2) Stream Processing: Inter-kernel dependencies of an
}
algorithm should be accessed on-the-fly in combination with
fn rgb2gray(output: &mut [T], input: &[T]) -> () { fine-granular communication in order to pipeline the full
let img = make_raw_img(width, height, input);
let gray = make_raw_img(width, height, output); implementation with a fixed throughput. That is, as soon as a
let operator = make_point_op(@ |pix| { block produces one data, the next block consumes it. In the
let r = pix & 0xFF;
let g = (pix >> 8) & 0xFF; best case, this requires only a single register of a small buffer
let b = (pix >> 16) & 0xFF; instead of reading/writing to temporary images:
(r + g + b) / 3
});
Mem1D Mem1D Mem1D Mem1D
with generate(hls) { operator(img, gray); }
} Kernel1 Kernel2 Kernel3
Listing 3. Sobel filter and RGB-to-gray color conversion as example
applications described by using our library.
We define a stream between two kernels as follows:
The image data structure is opaque. The target platform fn make_mem_from_stream(size: int, data: stream) -> Mem1D;
mapping determines its layout. AnyHLS provides common
border handling functions as well as point and global operators 3) Line Buffers: Storing an entire image to on-chip memory
such as reductions (see Section III-B2). These operators are before execution is not feasible since on-chip memory blocks
composable to allow for more sophisticated ones. are limited in FPGAs. On the other hand, feeding the data
on demand from main memory is extremely slow. Still, it is
possible to leverage fast on-chip memory by using it as FIFO
A. Vectorization
buffers containing only the necessary lines of the input images
Image processing applications consist of loops that possess a (W pixels per line).
very high degree of spatial parallelism. This should be exploited Mem2D (1, h, v)
to reach the bandwidth speed of memory technologies. A
line buffer
resource-efficient approach, so-called vectorization or loop
coarsening, is to aggregate the input pixels to vectors and
process multiple input data at the same time to calculate line buffer
Mem1D (W, v)
multiple output pixels in parallel [39]–[41]. This replicates only
the arithmetic operations applied to data (so-called datapath) line buffers (W, h, v)
instead of the whole accelerator, similar to Single Instruction
Multiple Data (SIMD) architectures. Vectorization requires a This enables parallel reads at the output for every pixel read
control structure specialized to a considered hardware design. at the input. We model a line buffer as follows:
We support the automatic vectorization of an application by type LineBuf1D = fn(Mem1D) -> Mem1D;
a given factor v when using our image processing library. In fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D
particular, our library use the vectorization techniques proposed
in [40]. For example, the make_local_op function has Akin to Regs1D (see Section III-B4), a recursive call builds
an additional parameter to specify the desired vectorization an array of line buffers (each line buffer will be declared by a
and will propagate this information to the functions it uses separate memory component in the residual program similar
internally: make_local_op(op, v). For brevity, we omit to on-chip array in Figure 5).
the parameter for the vectorization factor for the remaining 4) Sliding Window: Registers are the most amenable re-
abstractions in this section. sources to hold data for highly parallelized access. A sliding
window of size w × h updates the constituting shift registers by
B. Memory Abstractions for Image Processing a new column of h pixels and enables parallel access to w · h
1) Memory Accessor: In order to optimize memory access pixels.
Mem2D (w, h, 1)
and encapsulate the contained memory type (on-chip memory,
etc.) into a data structure, we decouple the data transfer from Mem2D
(1, h, v)
the data use via the following memory abstractions:
struct Mem1D { struct Mem2D {
read: fn(int) -> T, read: fn(int, int) -> T,
write: fn(int, T)->(), write: fn(int, int, T)->(),
update: fn(int) -> (), update: fn(int, int) -> (), sliding window
size: int width: int, height: int
} } This provides high data reuse for temporal locality and avoids
Similar to hardware design practices, these memory abstractions waste of on-chip memory blocks that might be utilized for a sim-
require the memory address to be updated before the ilar data bandwidth. Our implementation uses make_regs2d
read/write operations. The update function transfers data for an explicit declaration of registers and supports pixel-based
from/to the encapsulated memory to/from staging registers indexing at the output. This will instantiate w · h registers in
using vector data types. Then, the read/write functions the residual program, as explained in Section III-B4.
type Swin2D = fn(Mem2D) -> Mem2D; type LocalOp = fn(Mem1D) -> Mem1D;
fn @ make_sliding_window(w: int, h: int) -> Swin2D { fn @ make_local_op(v: int, op: Op, ext: Extents,
let win = make_regs2d(w, h); bh_lower: FnBorder,
// ... bh_upper: FnBorder) -> LocalOp {
} @ |img, out| {
let mut (col, row, idx) = (0, 0, 0);
let wait = /* initial latency */
C. Loop Abstractions for Image Processing let fsm = make_fsm();
fsm.add(Read, || img.update(idx), || Compute);
1) Point Operators: Algorithms such as image scaling and fsm.add(Compute, || {
line_buffer.update(col);
color transformation calculate an output pixel for every input sliding_window.update(row);
pixel. The point operator abstraction (see Listing 4) in AnyHLS col_sel.update(col);
for i in unroll(0, v) {
yields a vectorized pipeline over the input and output image. out.write(i, op(col_sel.read(i)));
This abstraction is parametric in its vector factor v and the }
}, || if idx > wait { Write } else { Index });
desired operator function op. fsm.add(Write, || out.update(idx-wait-1), || Index);
fsm.add(Index, || {
type PointOp = fn(Mem1D) -> Mem1D;
idx++; col++;
fn @ make_point_op(v: int, op: Op) -> PointOp {
if col == img_width { col=0; row++; }
@ |img, out| {
}, || if idx < img.size { Read } else { Exit });
for idx in pipeline(1, 0, img.size) {
fsm.run_pipelined(Read, 1, 0, img.size);
img.update(idx);
}
for i in unroll(0, v) {
}
out.write(i, op(img.read(i)));
}
out.update(idx); Listing 5. Implementation of the local operator abstraction.
}
}
}
Compared to the local operator in Figure 1, we also support
Listing 4. Implementation of the point operator abstraction. boundary handling. We specify the extent of the local operator
(filter size / 2) as well as functions specifying the boundary
handling for the lower and upper bounds. Then, row and column
The total latency is
selection functions apply border handling correspondingly in x-
L = Larith + ⌈W/v⌉ · H cycles (2) and y-directions by using one-dimensional multiplexer arrays
similar to Özkan et al. [40].
where W and H are the width and height of the input image,
and Larith is the latency of the data path. V. E VALUATION AND R ESULTS
2) Local Operators: Algorithms such as Gaussian blur and In the following, we compare the Post Place and Route
Sobel edge detection calculate an output pixel by considering (PPnR) results using AnyHLS and other state-of-the-art domain-
the corresponding input pixel and a certain neighborhood of it specific approaches including Halide-HLS [25] and Hipacc [27].
in a local window. Thus, a local operator with a w × h window The generated HLS codes are compiled using Intel FPGA SDK
requires w · h pixel reads for every output. The same (w − 1) · h for OpenCL 18.1 and Xilinx Vivado HLS 2017.2 targeting a
pixels are used to calculate results at the image coordinates Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA,
(x, y) and (x + 1, y). This spatial locality is transformed into respectively.
temporal locality when input images are read in raster order for The generated hardware designs are evaluated for their
burst mode, and subsequent pixels are sequentially processed throughput, latency, and resource utilization. FPGAs possess
with a streaming pipeline implementation. The local operator two types of resources: (i) computational: LUTs and DSP
implementation in AnyHLS (shown in Listing 5) consists of blocks; (ii) memory: Flipflops (FFs) and on-chip memory
line buffers and a sliding window to hold dependency pixels (BRAM/M20K). A SLICE/ALM is comprised of look-up tables
in on-chip memory and calculates a new result for every new (LUTs) and flip flops, thus indicate the resource usage when
pixel read. considered with the DSP block and on-chip memory blocks.
Mem2D Mem2D Mem2D
The implementation results presented for Vivado HLS feature
Mem1D
(1, h, v) (w + v − 1, h, 1)(w + v − 1, h, 1)
Mem1D
only the kernel logic, while those by Intel OpenCL include
(W × H, v) line buffer op1 (W × H, v) PCIe interfaces. The execution time of an FPGA circuit (Vivado
row col
line buffer
sel sel
...
HLS implementation) equals to Tclk · latency, where Tclk is
opv
the clock period of the maximum achievable clock frequency
line buffers sliding window
(lower is better). We measured the timing results for Intel
local operator
OpenCL by executing the applications on a Cyclone V GT
This provides a throughput of v pixels per clock cycle at the 5CGTD9 FPGA. This is the case for all analyzed applications.
cost of an initial latency (v is the vectorization factor) We have no intention nor license rights [43, §4] [44, §2] to
benchmark and compare the considered FPGA technologies or
Linitial = Larith + (⌊h/2⌋ · ⌈W/v⌉ + ⌊⌈w/v⌉/2⌋) (3)
that is spent for caching neighboring pixels of the first
calculation. The final latency is thus: A. Applications
In our experimental evaluation, we consider the following
L = Linitial + (⌈W/v⌉ · H) (4)
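As a worked example of Equations (3) and (4), take values used later in the evaluation (a 1024 × 1024 image, a 5 × 5 kernel, v = 1, and Larith = 14); the exact constant may differ slightly from the cycle counts reported by the tools:

    Linitial = 14 + (⌊5/2⌋ · ⌈1024/1⌉ + ⌊⌈5/1⌉/2⌋) = 14 + (2 · 1024 + 2) = 2064
    L = 2064 + ⌈1024/1⌉ · 1024 = 2064 + 1,048,576 ≈ 1.05 · 10^6 cycles

This is in line with the roughly one-million-cycle Gauss latencies reported in Section V.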
Harris
2) Vectorization: Many FPGA implementations benefit from
FChain parallel processing in order to increase memory bandwidth.
AnyHLS implicitly parallelizes a given image pipeline by a
Harris naïve vectorization factor v. As an example, Figure 7 shows the
FChain streaming pipeline PPnR results, along with the achieved memory throughput for
0 16 35 107 different vectorization factors for the mean filter on a Cyclone V.
Execution time [ms] The memory-bound of the Cyclone V is reported by Intel’s
Figure 6. Execution time for naïve and streaming pipeline implementations Memory Bound [MB/s]
Resource Usage in %
•
On-Chip Mem Blocks Logic Resources
as a pre-processing algorithm 30
• bilateral filter (Bilateral), a 5 × 5 floating-point kernel
as an edge-preserving and noise-reducing function based 25
on exponential functions
• mean filter (MF), a 5×5 filter that determines the average 20
within a local window via 8-bit arithmetic
15
• SobelLuma, an edge detection algorithm provided as a
1 2 4 8 16 32
design example by Intel. The algorithm consists of RGB Vectorization factor (v)
to Luma color conversion, Sobel filters, and thresholding
Figure 7. PPnR results of AnyHLS’s mean filter implementation on an Intel
B. Library Optimizations Cyclone V. The memory bound of the device for our setup is 1344.80 MB/s.
latency given in Equation (4), which is L = Larith + AnyHLS 8 1646 16 1050641 801.8
Halide-HLS 16 2096 50 1060897 458.7
1.042.442 clock cycles for Gauss when v = 1. Larith = Hipacc 8 1709 16 1052693 820.1
14 for AnyHLS’ Gauss implementation as shown in
Table II.
ii) Halide-HLS pads input images according to the selected has control over code generation. Extending AnyHLS’ image
border handling mode (even when no border handling is processing library only requires adding new functions in Impala
defined). This increases the input image size from (W , (see Figure 2). Our intention to compare AnyHLS with these
H) to (W + w − 1, H + h − 1), thus the latency. DSLs is to show that we can generate equally good designs
iii) Hipacc does not pad input images, but run (H + bh/2c · without creating an entire compiler backend.
(W + bw/2c)) loop iterations for a (W × H) image 2) Experiments using Intel FPGA SDK for OpenCL (AOCL):
and (w × h) window. This is similar to the convolution Table IV presents the implementation results for an edge
example in the Vivado Design Suite User Guide [6], but detection algorithm provided as a design example by Intel. The
not optimal. algorithms consist of RGB to Luma color conversion, Sobel
The execution time of an implementation equals to Tclk · filters, and thresholding. Intel’s implementations consist of a
latency, where Tclk is the clock period of the maximum single-work item kernel that utilizes shift registers according
achievable clock frequency (lower is better). Overall, AnyHLS to the FPGA design paradigm. These types of techniques are
processes a given image faster than the other DSL implemen- recommended by Intel’s optimization guide [7] despite that
tations. the same OpenCL code performs drastically bad on other
Halide-HLS uses more on-chip memory for line buffers (see computing platforms.
Section IV-C2) compared to Hipacc and AnyHLS because of its
image padding for border handling. Let us consider the number Table IV
PP N R RESULTS OF AN EDGE DETECTION APPLICATION FOR THE I NTEL
of BRAMs utilized for the Gaussian blur: The line buffers need C YCLONE V. I MAGE SIZES ARE 1024 × 1024. N ONE OF THE
to hold 4 image lines for the 5 × 5 kernel. The image width IMPLEMENTATIONS USE DSP S .
is 1024 and the pixel size is 32 bits. Therefore, AnyHLS and
v Framework #M10K #ALM #DSP Throughput [MB/s]
Hipacc use eight 18K BRAMs as shown in Table II. However,
Halide-HLS stores 1028 integer pixels, which require 16 18K Intel’s Imp. 290 23830 0 419.5
1 AnyHLS 291 23797 0 422.5
BRAMs to buffer four image lines. This doubles the number Hipacc 318 25258 0 449.1
of BRAMs usage (see Table III). Intel’s Imp. - - 0 -
AnyHLS uses the vectorization architecture proposed in [40]. 16 AnyHLS 337 29126 0 1278.3
Hipacc 362 35079 0 1327.7
This improves the use of the registers compared to Hipacc and
Intel’s Imp. - - 0 -
Halide. 32 AnyHLS 401 38069 0 1303.8
The performance metrics and resource usage reported by Hipacc 421 44059 0 1320.0
Vivado HLS correlate with our Impala descriptions, hence we
claim that the HLS code generated from AnyHLS’ image We described Intel’s handwritten SobelLuma example using
processing library does not entail severe side effects for Hipacc and AnyHLS. Both Hipacc and AnyHLS provide a
the synthesis of Vivado HLS. Hipacc and Halide-HLS have higher throughput even without vectorization. In order to reach
dedicated compiler backends for HLS code generation. These memory-bound, we would have to rewrite Intel’s hand-tuned
can be improved to achieve similar performance to AnyHLS. design example to exploit further parallelism. AnyHLS uses
However, this is not a trivial task and prone to errors. The slightly less resource, whereas Hipacc provides slightly higher
advantage of AnyDSL’s partial evaluation is that the user throughput for all the vectorization factors. Similar to Figure 7,
REFERENCES
16 AnyHLS Table V
103 PP N R FOR THE I NTEL C YCLONE V. M ISSING NUMBERS (-) INDICATE THAT
Throughput in [MPixel/s]
NDRange
8 THE GENERATED IMPLEMENTATIONS DO NOT FIT THE BOARD .
4
App v Framework #M10K #ALM #DSP Throughput [MB/s]
2
CU4/SIMD16 16 AnyHLS 401 37509 0 1330.1
102 1 Gauss
16 Hipacc 402 35090 0 1301.2
16 AnyHLS 370 31446 0 1328.8
Jacobi
CU1/SIMD1 16 Hipacc 372 30296 0 1282.9
CU16/SIMD1
1 AnyHLS 399 79270 153 326.6
Bilat.
1 Hipacc 422 79892 159 434.7
20 30 40 50 60 70 80
16 AnyHLS 400 39266 0 1255.68
Hardware resources (logic utilization [%]) MF 16 Hipacc - - - -
8 Hipacc 351 31796 0 1275.9
8 AnyHLS 418 44807 0 1230.6
Figure 8. Design space for a 5 × 5 mean filter using an NDRange kernel FChain
8 Hipacc 645 64225 0 427.4
(using the num_compute_units / num_simd_work_items attributes)
8 AnyHLS 442 50537 96 1158.5
and AnyHLS (using the vectorization factor v) for an Intel Cyclone V. Harris
8 Hipacc 668 74246 96 187.14
Hipacc AnyHLS
Throughput in [MPixel/s]
10
VI. C ONCLUSIONS
2
In this paper, we advocate the use of modern compiler
29
technologies for high-level synthesis. We combine functional
abstractions with the power of partial evaluation to decouple a
high-level algorithm description from its hardware design that
28
implements the algorithm. This process is entirely driven by
code refinement, generating input code to HLS tools, such as
Harris Gauss Bilateral Jacobi FChain MF Vivado HLS and AOCL, from the same code base. To specify
important abstractions for hardware design, we have introduced
Figure 9. Throughput measurements for an Intel Cyclone V for the a set of basic primitives. Library developers can rely on these
implementations generated from AnyHLS and Hipacc. Resource utilization primitives to create domain-specific libraries. As an example,
for the same implementations are shown in Table V.
we have implemented an image processing library for synthesis
to both Intel and Xilinx FPGAs. Finally, we have shown that
our results are on par or even better in performance compared
both frameworks yield throughputs very close to the memory
to state-of-the-art approaches.
bound of the Intel Cyclone V.
The OpenCL NDRange kernel paradigm conveys multiple
ACKNOWLEDGMENTS
concurrent threads for data-level parallelism. OpenCL-based
HLS tools exploit this paradigm to synthesize hardware. AOCL This work is supported by the Federal Ministry of Education
provides attributes for NDRange kernels to transform its iter- and Research (BMBF) as part of the Metacca, MetaDL,
ation space. The num_compute_units attribute replicates ProThOS, and REACT projects as well as the Intel Visual
the kernel logic, whereas num_simd_work_items vector- Computing Institute (IVCI) at Saarland University. It was
3
izes the kernel implementation . Combinations of those provide also partially funded by the Deutsche Forschungsgemein-
a vast design space for the same NDRange kernel. However, as schaft (DFG, German Research Foundation) – project number
Figure 8 demonstrates, AnyHLS achieves implementations that 146371743 – TRR 89 “Invasive Computing”. Many thanks to
are orders of magnitude faster than using attributes in AOCL. our colleague Puya Amiri for his work on the pipeline support.
Finally, Table V and Figure 9 present a comparison between
AnyHLS and the AOCL backend of Hipacc [45]. As shown R EFERENCES
in Figure 2, Hipacc has an individual backend and template [1] J. Bachrach et al., “Chisel: Constructing hardware in a Scala
library written with preprocessor directives to generate high- embedded language”, in Proc. of the 49th Annual Design
Automation Conf. (DAC), IEEE, Jun. 3–7, 2012.
performance OpenCL code for FPGAs. In contrast, the ap-
[2] Y. Liu et al., “A scala based framework for developing accel-
plication and library code in AnyHLS stays the same. The eration systems with FPGAs”, Journal of Systems Architecture,
generated AOCL code consists of a loop that iterates over the input image. Compared to Hipacc, AnyHLS achieves similar performance but outperforms Hipacc for multi-kernel applications such as the Harris corner detector. This shows that AnyHLS optimizes the inter-kernel dependencies better than Hipacc (see Section IV-B2).

3 These parallelization attributes are suggested in [7] for NDRange kernels, not for the single work-item kernels using shift registers such as the edge detection application shown in Table IV.

[3] J. Decaluwe, "MyHDL: A Python-based hardware description language", Linux Journal, no. 127, 2004.
[4] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 30, no. 4, 2011.
[5] J. Cong et al., "Automated accelerator generation and optimization with composable, parallel and pipeline architecture", in Proc. of the 55th Annual Design Automation Conf. (DAC), ACM, Jun. 24–29, 2018.
[6] Xilinx, Vivado Design Suite user guide: high-level synthesis, UG902, 2017.
[7] Intel, Intel FPGA SDK for OpenCL: Best practices guide, 2017.
[8] R. Leißa et al., "AnyDSL: A partial evaluation framework for programming high-performance libraries", Proc. of the ACM on Programming Languages (PACMPL), vol. 2, no. OOPSLA, Nov. 4–9, 2018.
[9] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, 2013.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, 2015.
[11] G. Martin and G. Smith, "High-level synthesis: Past, present, and future", IEEE Design & Test of Computers, vol. 26, no. 4, 2009.
[12] D. F. Bacon et al., "FPGA programming for the masses", Communications of the ACM, vol. 56, no. 4, 2013.
[13] S. A. Edwards, "The challenges of synthesizing hardware from C-like languages", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[14] J. Sanguinetti, "A different view: Hardware synthesis from SystemC is a maturing technology", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[15] D. Koeplinger et al., "Spatial: A language and compiler for application accelerators", in Proc. of the 39th ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 18–22, 2018.
[16] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[17] J. S. da Silva et al., "Module-per-object: A human-driven methodology for C++-based high-level synthesis design", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[18] D. Richmond et al., "Synthesizable higher-order functions for C++", Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[19] M. A. Özkan et al., "A highly efficient and comprehensive image processing library for C++-based high-level synthesis", in Proc. of the 4th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2017.
[20] J. de Fine Licht et al., "Transformations of high-level synthesis codes for high-performance computing", The Computing Research Repository (CoRR), 2018. arXiv: 1805.08288 [cs.DC].
[21] G. Ofenbeck et al., "Spiral in Scala: Towards the systematic construction of generators for performance libraries", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 27–28, 2013.
[22] P. Milder et al., "Computer generation of hardware for linear digital signal processing transforms", ACM Trans. on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, 2012.
[23] J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines", ACM Trans. on Graphics (TOG), vol. 33, no. 4, 2014.
[24] J. Hegarty et al., "Rigel: Flexible multi-rate image processing hardware", ACM Trans. on Graphics (TOG), vol. 35, no. 4, 2016.
[25] J. Pu et al., "Programming heterogeneous systems from an image processing DSL", ACM Trans. on Architecture and Code Optimization (TACO), vol. 14, no. 3, 2017.
[26] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines", in Proc. of the Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 16–19, 2013.
[27] O. Reiche et al., "Generating FPGA-based image processing accelerators with Hipacc", in Proc. of the Int'l Conf. on Computer Aided Design (ICCAD), IEEE, Nov. 13–16, 2017.
[28] N. Chugh et al., "A DSL compiler for accelerating image processing pipelines on FPGAs", in Proc. of the Int'l Conf. on Parallel Architecture and Compilation Techniques (PACT), ACM, Sep. 11–15, 2016.
[29] Y. Chi et al., "SODA: Stencil with optimized dataflow architecture", in 2018 IEEE/ACM Int'l Conf. on Computer-Aided Design (ICCAD), IEEE, 2018.
[30] R. Stewart et al., "A dataflow IR for memory efficient RIPL compilation to FPGAs", in Proc. of the Int'l Conf. on Algorithms and Architectures for Parallel Processing (ICA3PP), Springer, Dec. 14–16, 2016.
[31] M. Kristien et al., "High-level synthesis of functional patterns with Lift", in Proc. of the 6th ACM SIGPLAN Int'l Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY@PLDI 2019), Phoenix, AZ, USA, Jun. 22, 2019.
[32] R. Baghdadi et al., "Tiramisu: A polyhedral compiler for expressing fast and portable code", in Proc. of the IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), IEEE, Feb. 16–20, 2019.
[33] E. Del Sozzo et al., "A unified backend for targeting FPGAs from DSLs", in Proc. of the 29th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2018.
[34] R. Leißa et al., "Shallow embedding of DSLs via online partial evaluation", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 26–27, 2015.
[35] M. A. Özkan et al., "A journey into DSL design using generative programming: FPGA mapping of image border handling through refinement", in Proc. of the 5th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2018.
[36] N. D. Jones et al., Partial evaluation and automatic program generation. Prentice Hall, 1993.
[37] Y. Futamura, "Partial computation of programs", in Proc. of the RIMS Symposia on Software Science and Engineering, 1982.
[38] C. Consel, "New insights into partial evaluation: The SCHISM experiment", in Proc. of the 2nd European Symp. on Programming (ESOP), Springer, Mar. 21–24, 1988.
[39] M. Schmid et al., "Loop coarsening in C-based high-level synthesis", in Proc. of the 26th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, 2015.
[40] M. A. Özkan et al., "Hardware design and analysis of efficient loop coarsening and border handling for image processing", in Proc. of the Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2017.
[41] G. Stitt et al., "Scalable window generation for the Intel Broadwell+Arria 10 and high-bandwidth FPGA systems", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, Feb. 25–27, 2018.
[42] Y.-k. Choi et al., "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms", in Proc. of the 53rd Annual Design Automation Conf. (DAC), ACM, Jun. 5–9, 2016.
[43] Core evaluation license agreement, version 2014.06, Xilinx, Inc., Jun. 2014. [Online]. Available: https://www.xilinx.com/products/intellectual-property/license/core-evaluation-license-agreement.html.
[44] Intel program license subscription agreement, version Rev. 10/2009, Intel Corporation, Oct. 2009. [Online]. Available: https://www.intel.com/content/www/us/en/programmable/downloads/software/license/lic-prog_lic.html.
[45] M. A. Özkan et al., "FPGA-based accelerator design from a domain-specific language", in Proc. of the 26th Int'l Conf. on Field-Programmable Logic and Applications (FPL), IEEE, Aug. 29–Sep. 2, 2016.
1 INTRODUCTION
High-performance systems-on-chip (SoCs) are increasingly based on heterogeneous architectures
that combine general-purpose processor cores and specialized hardware accelerators [4, 8, 22]. Ac-
celerators are hardware devices designed to perform specific functions. Accelerators have become
popular because they guarantee considerable gains in both performance and energy efficiency
with respect to the corresponding software executions [9–11, 20, 23, 29, 41, 48]. However, the
integration of several specialized hardware blocks into a complex accelerator is a difficult design
and verification task. In response to this challenge, we advocate the application of two key prin-
ciples. First, to cope with the increasing complexity of SoCs and accelerators, most of the design
effort should move away from the familiar register-transfer level (RTL) by embracing system-level
design (SLD) [18, 42] with high-level synthesis (HLS) [32, 39]. Second, it is necessary to create
reusable and flexible components, also known as intellectual property (IP) blocks, which can be
easily (re)used across a variety of architectures with different performance targets and cost metrics.
Fig. 1. COSMOS: a methodology to coordinate HLS and memory optimization for the DSE of hardware
accelerators.
replication), rather than in time. The application of this knob generally leads to a faster, but larger,
implementation of the initial specification.
Despite the advantages of HLS, performing this design-space exploration (DSE) is still a compli-
cated task, especially for complex hardware accelerators. First, the support for memory generation
and optimization is limited in current HLS tools. Some HLS tools still require third-party gener-
ators to provide a description of the memory organization and automatize the DSE process [36,
37]. Several studies, however, highlight the importance of private memories to sustain the parallel
datapath of accelerators: on a typical accelerator design, memory takes from 40% to 90% of the
area [16, 30]; hence, its optimization cannot be an independent task. Second, HLS tools are based
on heuristics, whose behavior is not robust and often hard to predict [24]. Small changes to the
knobs, e.g., changing the number of iterations unrolled in a loop, can cause significant and un-
expected modifications at the implementation level. This increases the DSE effort because small
changes to the knobs can take the exploration far from Pareto optimality.
1.3 Contributions
To address these limitations, we present COSMOS¹: an automatic methodology for the DSE of
complex hardware accelerators, which are composed of several components. COSMOS is based on
a compositional approach that coordinates both HLS tools and memory generators. First, thanks to
the datapath and memory co-design, COSMOS produces a large set of Pareto-optimal implemen-
tations for each component, thus increasing both performance and cost spans. These spans are
defined as the ratios between the maximum value and the minimum value for performance and
cost, respectively. Second, COSMOS leverages compositional design techniques to significantly re-
duce the number of invocations to the HLS tool and the memory generator. In this way, COSMOS
focuses on the most critical components of the accelerator and quickly converges to the desired
trade-off point between cost and performance for the entire accelerator. The COSMOS methodol-
ogy consists of two main steps (Figure 1). First, COSMOS uses an algorithm to characterize each
component of the accelerator individually by efficiently coordinating multiple runs of the HLS and
memory generator tools. This algorithm finds the regions in the design space of the components
that include the Pareto-optimal implementations (Component Characterization in Figure 1). Sec-
ond, COSMOS performs a DSE to identify the Pareto-optimal solutions for the entire accelerator
by efficiently solving a linear programming (LP) problem instance (Design-Space Exploration).
We evaluate the effectiveness and efficiency of the COSMOS methodology on a complex accel-
erator for wide-area motion imagery (WAMI) [3, 38], which consists of approximately 7000 lines
of SystemC code. While exploring the design space of WAMI, COSMOS returns an average perfor-
mance span of 4.1× and an average area span of 2.6×, as opposed to 1.7× and 1.2× when memory
1 COSMOS stands for “COordination of high-level Synthesis and Memory Optimization for hardware acceleratorS”. We also
adopt the name COSMOS for our methodology since it is the opposite of CHAOS (in the Greek creation myths). In our
analogy, CHAOS corresponds to the complexity of the DSE process.
optimization is not considered and only standard dual-port memories are used. Further, COSMOS
achieves the target data-processing throughput for the WAMI accelerator while reducing the
number of invocations to the HLS tool per component by up to 14.6×, with respect to an
exhaustive exploration approach.
1.4 Organization
The paper is organized as follows. Section 2 provides the necessary background for the rest of the
paper. Section 3 describes a few examples to show the effort required in the DSE process. Section 4
gives an overview of the COSMOS methodology, which is then detailed in Sections 5 (Component
Characterization) and 6 (Design-Space Exploration). Section 7 presents the experimental results.
Section 8 discusses the related work. Finally, Section 9 concludes the paper.
2 PRELIMINARIES
This section provides the necessary background concepts. We first describe the main characteris-
tics of the accelerators targeted by COSMOS in Section 2.1. Then, we present the computational
model we adopt for the DSE in Section 2.2.
by exchanging the data through an on-chip interconnect network that implements transaction-
level modeling (TLM) [19] channels. These channels synchronize the components by absorbing
the potential differences in their computational latencies with a latency-insensitive communica-
tion protocol [7]. This ensures that the components of an accelerator can always be replaced with
different Pareto-optimal implementations without affecting the correctness of the accelerator im-
plementation. COSMOS employs channels with a fixed bitwidth (256 bits) and does not explore
different design alternatives to implement the communication among the components. It can be
extended, however, to support this type of DSE by using, for example, the XKnobs [35] or buffer-
restructuring techniques [13]. Each component includes a datapath, which is organized in a set of
loops, to read and store input and output data and to compute the required functionality. There
are also private local memories (PLMs), or scratchpads, where data resides during the computation.
PLMs are multi-bank memory architectures that provide multiple read and write ports to allow
accelerators to perform parallel accesses. We generate optimized memories for our accelerators
by using the Mnemosyne memory generator [37]. Several analyses highlight the importance of
the PLMs in sustaining the parallel datapath of accelerators [16, 30]. PLMs play a key role in
the performance of accelerators [25], and they occupy from 40% to 90% of the entire area of the
components of a given accelerator [30].
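To make the latency-insensitive channel concrete, the following C++ sketch models it in software as a bounded FIFO with blocking read and write operations. It is only an illustrative analogy, not the SystemC/TLM implementation used by the authors, and the class and method names are ours.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// A bounded FIFO channel: Write() blocks while the channel is full and Read()
// blocks while it is empty, so producer and consumer tolerate each other's latency.
template <typename Token>
class Channel {
 public:
  explicit Channel(std::size_t depth) : depth_(depth) {}

  void Write(const Token& t) {
    std::unique_lock<std::mutex> lock(m_);
    not_full_.wait(lock, [&] { return q_.size() < depth_; });
    q_.push(t);
    not_empty_.notify_one();
  }

  Token Read() {
    std::unique_lock<std::mutex> lock(m_);
    not_empty_.wait(lock, [&] { return !q_.empty(); });
    Token t = q_.front();
    q_.pop();
    not_full_.notify_one();
    return t;
  }

 private:
  const std::size_t depth_;
  std::mutex m_;
  std::condition_variable not_empty_, not_full_;
  std::queue<Token> q_;
};

A producer writing into such a channel and a consumer reading from it can be swapped with faster or slower Pareto-optimal implementations without changing the functional behavior, which is the property exploited by COSMOS.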
behaviors, they are a practical model to analyze stream processing accelerators for many classes
of applications, e.g., image and signal processing applications. A PN is a bipartite graph defined as
a tuple $(P, T, F, w, M_0)$, where $P$ is a set of $m$ places, $T$ is a set of $n$ transitions, $F \subseteq (P \times T) \cup (T \times P)$
is a set of arcs, $w : F \to \mathbb{N}^+$ is an arc weighting function, and $M_0 \in \mathbb{N}^m$ is the initial marking, i.e.,
the number of tokens at each $p \in P$. A PN is strongly connected if for every pair of places $p_i$ and
$p_j$ there exists a sequence of transitions and places such that $p_i$ and $p_j$ are mutually reachable in
the net. A PN can be organized in a set of strongly-connected components, i.e., the maximal sets of
places that are strongly connected. A TMG is a PN such that (i) each place has exactly one input
and one output transition, and (ii) $w : F \to \{1\}$, i.e., every arc has a weight equal to 1. To measure
performance, TMGs are extended with a transition firing-delay vector $\tau \in \mathbb{R}^n$, which represents
the duration of each particular firing.
The minimum cycle time of a strongly-connected TMG is defined as $\max\{D_k / N_k \mid k \in K\}$,
where $K$ is the set of cycles of the TMG, $D_k$ is the sum of the transition firing delays in cycle
$k$, and $N_k$ is the number of tokens in cycle $k$ [40]. In this paper, we use the TMG model to formally
describe the accelerators. We use the term system to indicate a complex accelerator that is made of
multiple components. Each component of the system is represented with a transition in the TMG
whose firing delay is equal to its effective latency. The effective latency λ of a component is defined
as the product of its clock cycle count and target clock period. The maximum sustainable effective
throughput θ of the system is then the reciprocal of the minimum cycle time of its TMG, if the TMG
is strongly connected. Otherwise, it is the minimum θ among its strongly-connected components.
We use λ and θ as performance figures for the single components and the system, respectively. We
use the area α as the cost metric for both the components and the system.
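As a small worked example of these definitions, the following C++ sketch computes the minimum cycle time and the resulting maximum sustainable effective throughput of a strongly-connected TMG whose cycles have already been enumerated; the numeric values are hypothetical and not taken from WAMI.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Cycle {
  std::vector<double> firing_delays_ms;  // effective latencies of the transitions in the cycle
  int tokens;                            // number of tokens circulating in the cycle (N_k)
};

// Minimum cycle time = max over cycles of (sum of firing delays) / (tokens in the cycle).
double MinimumCycleTime(const std::vector<Cycle>& cycles) {
  double mct = 0.0;
  for (const Cycle& k : cycles) {
    double d_k = 0.0;
    for (double d : k.firing_delays_ms) d_k += d;
    mct = std::max(mct, d_k / k.tokens);
  }
  return mct;
}

int main() {
  // Hypothetical two-cycle TMG; the numbers are illustrative only.
  std::vector<Cycle> cycles = {
      {{2.0, 3.5}, 1},       // D_k = 5.5 ms, N_k = 1 -> 5.5 ms
      {{1.0, 1.0, 1.0}, 2},  // D_k = 3.0 ms, N_k = 2 -> 1.5 ms
  };
  double mct = MinimumCycleTime(cycles);
  std::printf("minimum cycle time: %.2f ms, effective throughput: %.3f 1/ms\n",
              mct, 1.0 / mct);
  return 0;
}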
3 MOTIVATIONAL EXAMPLES
Performing an accurate and as exhaustive as possible DSE for a complex hardware accelerator is
a difficult task for three main reasons: (i) HLS tools do not always support PLM generation and
optimization (Section 3.1), (ii) HLS tools are based on heuristics that make it difficult to configure
the knobs (Section 3.2), and (iii) HLS tools do not handle the simultaneous optimization of multiple
components (Section 3.3). Next, we detail these issues with some examples.
3.1 Memories
The joint optimization of the accelerator datapath and PLM architecture is critical for an effective
DSE. Figure 4 depicts the design space of Gradient, a component we designed for WAMI. The
graph reports different design points, each characterized in terms of area (mm2 ) and effective
latency (milliseconds), synthesized for an industrial 32nm ASIC technology library. The points
with the same color (shape) are obtained by partially unrolling the loops for different numbers
of iterations. The different colors (shapes) indicate different numbers of ports for the PLM2 . By
increasing the number of ports, we notice a significant impact on both latency and area. In fact,
multiple ports allow the component to read and write more data in the same clock cycle, thus
increasing the hardware parallelism. Multi-port memories, however, require much more area
since more banks may be used depending on the given memory-access pattern. Note that ignoring
the role of the PLM limits considerably the design space. By changing the number of ports of
the PLM, we obtain a latency span of 7.9× and an area span of 3.7×. By using standard dual-port
memories, we have only a latency span of 1.4× and an area span of 1.2×. This motivates the need
2 Here and in the rest of the paper, the number of ports indicates the number of read ports to the memories containing the
input data of the component and the number of write ports containing the output data of the component, i.e., the ports
that allow parallelism in the compute phase of the component.
Fig. 4. Example of application of two HLS knobs (number of ports, number of unrolls) to Gradient, a com-
ponent of WAMI. The nested graph magnifies the design points with 2 read and 2 write ports. The numbers
indicate the numbers of iterations unrolled.
of considering the optimization of PLMs in the DSE process. COSMOS takes into consideration
the PLMs by generating optimized memories with Mnemosyne [37].
3.3 Compositionality
Complex accelerators need to be partitioned into multiple components to be efficiently synthesized
by current HLS tools. This reduces the synthesis time and improves the quality of results, but sig-
nificantly increases the DSE effort. Figure 5 reports a simple example to illustrate this problem.
On the top, the figure reports two graphs representing a small subset of Pareto-optimal points for
Gradient and Grayscale, two components of WAMI. Assuming that they are executed sequen-
tially in a loop, their aggregate throughput is the reciprocal of the sum of their latencies. On the
bottom, the figure reports all the possible combinations of the design points of the two components,
differentiating the Pareto-optimal combinations from the Pareto-dominated combinations. These
design points are characterized in terms of area (mm2 ) and effective throughput (1/milliseconds).
In order to find the Pareto-optimal combinations at the system level, an exhaustive search method
Fig. 5. Example of composition for Gradient and Grayscale, two components of WAMI. The graphs on
the top report some Pareto-optimal points for the two components. The graph on the bottom shows all the
possible combinations of these components, assuming they are executed sequentially in a loop. In the graph
of the composition, the effective throughput is used as the performance metric.
would apply the following steps: (i) synthesize different points for each component by varying
the settings of the knobs, (ii) find the Pareto-optimal points for each component, and (iii) find the
Pareto-optimal combinations of the components at the system level. This approach is impractical
for complex accelerators. First, step (i) requires trying all the combinations of the knob settings (e.g.,
different numbers of ports and unrolls). Second, step (iii) requires evaluating an exponential
number of combinations at the system level to find those that are Pareto-optimal. In fact,
if we have $n$ components with $k$ Pareto-optimal points each, then the number of combinations to
check is $O(k^n)$. This example motivates the need for a smart compositional method that identifies
the most critical components of an accelerator and minimizes the invocations to the HLS tool. In
order to do that, COSMOS reduces the number of combinations of knob settings that are used for
synthesis and prioritizes the synthesis of the components depending on their level of contribution
to the effective throughput of the entire accelerator.
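A quick back-of-the-envelope computation shows how fast this blows up; the values of n and k below are purely illustrative.

#include <cmath>
#include <cstdio>

int main() {
  const int n = 12;  // number of components (hypothetical; WAMI happens to have 12)
  const int k = 12;  // Pareto-optimal points per component (hypothetical)
  // k^n system-level combinations would have to be checked by an exhaustive search.
  std::printf("combinations to check: %.2e\n", std::pow(double(k), n));  // about 8.9e12
  return 0;
}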
(1) Component Characterization (Section 5): in this step COSMOS analyzes each component
of the system individually; for each component it identifies the boundaries of the regions
that include the Pareto-optimal designs; starting from the HLS-ready implementation of
each component (in SystemC), COSMOS applies an algorithm that generates knob and
memory configurations to automatically coordinate the HLS and memory generator tools;
the algorithm takes into account the memories of the accelerators and tries to deal with
the unpredictability of HLS tools;
(2) Design-Space Exploration (Section 6): in this step COSMOS analyzes the design space of
the entire system; the system is modeled with a TMG to find the most critical components
for the system throughput; then, COSMOS:
5 COMPONENT CHARACTERIZATION
Algorithm 1 reports the pseudocode used for the component characterization. The designer pro-
vides the clock period, the maximum number of ports for the PLMs (mainly constrained by the
target technology and the memory generator) and the maximum number of loop unrolls. In order
to keep the delay of the logic for selecting the memory banks negligible, the number of ports should
be a power of two. Note that this constraint can be partially relaxed without requiring Euclidean
division for the selection logic [46]. The number of unrolls depends on the loop complexity. Loops
with few iterations can be completely unrolled, while more complex loops can be only partially
unrolled. In fact, unrolling loops replicates the hardware resources, thus making the scheduling
more complex for the HLS tool. The algorithm identifies regions in the design space of the com-
ponent. A region includes design points that have the same number of ports and is bounded
by an upper-left ($\lambda_{min}$, $\alpha_{max}$) and a lower-right ($\lambda_{max}$, $\alpha_{min}$) point. These regions represent the
design space of the component that will be used for the DSE at the system level, as explained in
Section 6.
ALGORITHM 1: Component Characterization
Input: clock, max_ports, max_unrolls
Output: set of regions (λ_max, α_min, λ_min, α_max)
 1  for ports = 1 up to max_ports do
 2      // Identification of the max-λ, min-α point
 3      (λ_max, α_min) = hls_tool(ports, ports, clock);
 4      // Identification of the min-λ, max-α point
 5      for unrolls = max_unrolls down to ports + 1 do
 6          (λ_min, α_max) = hls_tool(unrolls, ports, clock);
 7          if λ-constraint_ports(unrolls) is sat then break;
 8      // Generation of the PLM of the component
 9      α_plm = memory_generator(ports);
10      α_min += α_plm;  α_max += α_plm;
11      // Save the region of the design space
12      save(ports, unrolls, λ_max, α_min, λ_min, α_max);
Tool interfaces: hls_tool(unrolls, ports, clock); memory_generator(ports).
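The following C++ sketch mirrors the structure of Algorithm 1. Here hls_tool, memory_generator, and lambda_constraint_sat are placeholders for the actual tool invocations (Cadence C-to-Silicon and Mnemosyne in the paper), so only the coordination logic is shown, under our own naming.

#include <vector>

struct Point { double latency, area; };                    // (lambda, alpha)
struct Region { int ports, unrolls; Point slow, fast; };   // lower-right and upper-left extremes

Point hls_tool(int unrolls, int ports, double clock);      // placeholder: runs HLS with the given knobs
double memory_generator(int ports);                        // placeholder: returns the PLM area
bool lambda_constraint_sat(int ports, int unrolls);        // placeholder: checks the bound of Eq. (1)

std::vector<Region> Characterize(double clock, int max_ports, int max_unrolls) {
  std::vector<Region> regions;
  for (int ports = 1; ports <= max_ports; ports *= 2) {    // ports restricted to powers of two
    // Line 3: unrolls == ports gives the max-lambda, min-alpha (lower-right) point.
    Point slow = hls_tool(ports, ports, clock);
    // Lines 5-7: largest unroll factor whose schedule satisfies the lambda-constraint.
    Point fast = slow;
    int unrolls = ports;
    for (int u = max_unrolls; u >= ports + 1; --u) {
      Point candidate = hls_tool(u, ports, clock);
      if (lambda_constraint_sat(ports, u)) { fast = candidate; unrolls = u; break; }
    }
    // Lines 8-10: add the cost of the generated PLM to both extremes.
    double plm_area = memory_generator(ports);
    slow.area += plm_area;
    fast.area += plm_area;
    // Line 12: save the region of the design space.
    regions.push_back(Region{ports, unrolls, slow, fast});
  }
  return regions;
}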
Algorithm 1 starts by identifying the lower-right point of the region. To identify this design
point, it sets the number of unrolls equal to the current number of ports (line 3). This ensures that
all the ports of the PLM are exploited and the obtained point is not redundant. In fact, this point
cannot be obtained by using a lower number of ports. On the other hand, finding the upper-left
point is more challenging. A complete unroll (which could lead to the point with the minimum
latency) is infeasible for complex loops. Indeed, it is not always guaranteed that, by increas-
ing the number of unrolls, the HLS tool returns an implementation of the component that gives
lower latency in exchange for higher area occupation. To overcome these problems, Algorithm 1
introduces a constraint, called the λ-constraint in the rest of the paper, that defines the maximum number
of states that the HLS tool can insert in the body of a loop. This helps in constraining the behavior
of the HLS tool to be more deterministic and in removing some of the Pareto-dominated points.
Thus, Algorithm 1 uses the following function to estimate the number of states that should be
sufficient to schedule one iteration of the loop that includes read and write operations:
$$h_{\mathit{ports}}(\mathit{unrolls}) = \left\lceil \frac{\gamma_r \cdot \mathit{unrolls}}{\mathit{ports}} \right\rceil + \left\lceil \frac{\gamma_w}{\mathit{ports}} \right\rceil + \eta \qquad (1)$$
where $\gamma_r$ is the maximum number of read accesses to the same array per loop iteration, $\gamma_w$ is
the maximum number of write accesses to the same array per loop iteration, and $\eta$ accounts for
the latency required to perform the operations that do not access the PLM. These parameters are
inferred by traversing the control data flow graph (CDFG) created by the HLS tool for scheduling
the lower-right point. This function is used as an upper bound of the number of states that the
HLS tool can insert. If this upper bound is not sufficient, then the synthesis fails and the point is
discarded. A synthesis run with a lower number of unrolls is performed to find another point to
be used as the upper-left extreme (lines 5-7).
Example 1. Figure 6 shows an example of using the λ-constraint. The loop (reported on the left)
contains two read operations to two distinct arrays, i.e., γr = 1, and one write operation, i.e., γw = 1.
We assume that all the operations that are neither read nor write operations can be performed in
one clock cycle, i.e., η = 1. The two diagrams (on the right) show the results of the scheduling by
using two ports for the PLM and by unrolling two or three times the loop, respectively. In the first
case (unrolls = 2), the HLS tool can schedule all the operations in a maximum of $h_2(2) = 3$ clock
cycles. Thus, this point would be chosen by Algorithm 1 to be used as upper-left extreme. In the
second case (unrolls = 3), the HLS tool is not able to complete the schedule within $h_2(3) = 4$ clock
cycles (it needs at least 5 clock cycles). Thus, this point is discarded.
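The small program below re-checks the numbers of Example 1, assuming the per-term ceilings that the example implies for Equation (1); the helper name h() is ours.

#include <cassert>
#include <cmath>

// Estimated number of states for one loop iteration (Equation (1), with ceilings).
int h(int ports, int unrolls, int gamma_r, int gamma_w, int eta) {
  return static_cast<int>(std::ceil(double(gamma_r * unrolls) / ports)) +
         static_cast<int>(std::ceil(double(gamma_w) / ports)) + eta;
}

int main() {
  // Example 1: gamma_r = 1, gamma_w = 1, eta = 1, PLM with two ports.
  assert(h(/*ports=*/2, /*unrolls=*/2, 1, 1, 1) == 3);  // schedule fits: point kept
  assert(h(/*ports=*/2, /*unrolls=*/3, 1, 1, 1) == 4);  // schedule needs 5 states: point discarded
  return 0;
}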
Note that the λ-constraint is not guaranteed to obtain a Pareto-optimal point due to the intrinsic
variability of the HLS results. Still, this point can serve as an upper bound of the region in the
design space. Note also that the λ-constraint cannot be applied to loops that (i) require data from
sub-components through blocking interfaces or (ii) do not present memory accesses to the PLM.
In these cases, in fact, it is necessary to extend the definition of the estimation function given in
Equation (1) to handle such situations. Alternatively, COSMOS can optionally run some synthesis
in the neighbourhood of the maximum number of unrolls and use a local Pareto-optimal point as
the upper-left extreme.
6 DESIGN-SPACE EXPLORATION
After the characterization of the single components of a given accelerator, COSMOS uses an LP
formulation to find the Pareto-optimal design points at the system level. The DSE problem at the
system level can be formulated as follows:
Problem 1. Given a TMG model of the system where each component has been characterized, an
HLS tool, and a target granularity $\delta > 0$, find a Pareto curve $\alpha$ versus $\theta$ of the system, such that:
(i) given two consecutive points $d$, $d'$ on the Pareto curve, they have to satisfy $\max\{d'_\alpha/d_\alpha - 1,\ d'_\theta/d_\theta - 1\} < \delta$;
this ensures a maximum distance between two design points on the curve;
(ii) the HLS tool must be invoked as few times as possible.
This formulation is borrowed from [28], where the authors propose a solution that requires the
manual effort of the designers to characterize the components. In contrast, COSMOS solves this
problem by leveraging the automatic characterization method in Section 5 and by dividing it into
two steps: Synthesis Planning and Synthesis Mapping.
where the function $f_i$ returns the implementation cost ($\alpha$) of the $i$-th component given the firing-delay $\tau_i$ of transition $t_i$,
$\sigma \in \mathbb{R}^n$ is the transition-firing initiation-time vector, $M_0 \in \mathbb{N}^m$ is the initial
marking, $\tau^- \in \mathbb{R}^m$ is the input-transition firing-delay vector, i.e., $\tau_i^-$ is the firing-delay of the
transition $t_k$ entering place $p_i$ (note that $\tau^-_{min}$ and $\tau^-_{max}$ correspond to the extreme $\lambda_{min}$ and $\lambda_{max}$
$$A[i,j] = \begin{cases} +1 & \text{if } t_j \text{ is an output transition of } p_i, \\ -1 & \text{if } t_j \text{ is an input transition of } p_i, \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
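For reference, building the matrix A of Equation (3) is straightforward once each place is annotated with its single input and output transition (a TMG property); the data layout below is our own illustration.

#include <cstddef>
#include <vector>

struct Place {
  int input_transition;   // the transition that deposits tokens into this place
  int output_transition;  // the transition that consumes tokens from this place
};

// A has one row per place and one column per transition, filled as in Equation (3).
std::vector<std::vector<int>> IncidenceMatrix(const std::vector<Place>& places,
                                              int num_transitions) {
  std::vector<std::vector<int>> A(places.size(), std::vector<int>(num_transitions, 0));
  for (std::size_t i = 0; i < places.size(); ++i) {
    A[i][places[i].output_transition] = +1;  // t_j is an output transition of p_i
    A[i][places[i].input_transition] = -1;   // t_j is an input transition of p_i
  }
  return A;
}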
The objective function minimizes the implementation costs of the components, while satisfying
the system throughput requirements. Given the component extreme latencies λmin and λmax , it is
possible to determine the values of θmin and θmax by labeling the transitions of the TMG of the
system with such latencies. By iterating from θmin to θmax with a ratio of (1 + δ ), we can then find
the optimal values of λ for the components that solve Problem 1. This formulation guarantees that
the components that are not critical for the system throughput are selected to minimize their cost.
The cost functions $f_i$ in Equation (2) are unknown a priori, but they can be approximated with
convex piecewise-linear functions. This LP formulation can be solved in polynomial time [5], and
it can be extended to the case of non-strongly-connected TMGs.
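The θ sweep described above can be sketched as follows. Here solve_lp() stands in for building and solving the LP instance (done with GLPK in the paper), so only the geometric spacing of the throughput targets is shown.

#include <vector>

struct Plan {
  double theta;                             // target system throughput
  std::vector<double> component_latencies;  // optimal lambda chosen for each component
};

Plan solve_lp(double theta_target);         // placeholder for the LP of Equation (2)

// One LP solve per target throughput, with targets spaced by a ratio of (1 + delta).
std::vector<Plan> PlanSyntheses(double theta_min, double theta_max, double delta) {
  std::vector<Plan> plans;
  for (double theta = theta_min; theta <= theta_max; theta *= (1.0 + delta)) {
    plans.push_back(solve_lp(theta));
  }
  return plans;
}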
mapping function that returns the number of unrolls that should be applied, given a specific value
for the latency (we apply the ceiling function to get an integer value). For instance, if a point with
latency of 20 s is required, the mapping function returns 11 as the number of unrolls. Note that by
specifying the maximum latency, the function returns the minimum number of unrolls, while by
specifying the minimum latency, it returns the maximum number of unrolls.
It is possible that the mapping may fail by choosing a value for $\mu_{target}$ that does not satisfy the λ-
constraint (Section 5). In this case, COSMOS tries to increase the number of unrolls to preserve the
throughput. Further, if $\lambda_{target}$ is not included in any region, COSMOS uses the slowest point of the
next region that has a larger number of ports. This does not require a synthesis run (because that
point has been synthesized during the characterization), and it is a conservative solution because,
as in the case of failure of the λ-constraint, we are willing to trade area to preserve the throughput.
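Section 6.2 only states that the mapping from a planned latency to a number of unrolls is derived from Amdahl's Law and rounded up with a ceiling; the sketch below therefore assumes an Amdahl-style model λ(u) = λ_seq + λ_par/u fitted from the two extremes of a region, which is our interpretation rather than the paper's exact formula.

#include <cmath>

struct RegionExtremes {
  double lambda_max; int unrolls_min;  // lower-right point (unrolls == ports)
  double lambda_min; int unrolls_max;  // upper-left point
};

// Fit lambda(u) = s + p/u through the two extremes, then invert it and round up.
int MapLatencyToUnrolls(const RegionExtremes& r, double lambda_target) {
  double inv_lo = 1.0 / r.unrolls_min;
  double inv_hi = 1.0 / r.unrolls_max;
  double p = (r.lambda_max - r.lambda_min) / (inv_lo - inv_hi);  // "parallel" latency share
  double s = r.lambda_max - p * inv_lo;                          // "sequential" latency share
  return static_cast<int>(std::ceil(p / (lambda_target - s)));   // lambda_target must exceed s
}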
7 EXPERIMENTAL RESULTS
We implement the COSMOS methodology with a set of tools and scripts to automatize the DSE.
Specifically, COSMOS includes: (i) Mnemosyne [37] to generate multi-bank memory architectures
as described in Section 5, (ii) a tool to extract the information required by Mnemosyne from the
database of the HLS tool, (iii) a script to run the synthesis and the memory generator according
to Algorithm 1, (iv) a program that creates and solves the LP model by using the GLPK Library3
(Section 6.1), and (v) a tool that maps the LP solutions to the HLS knobs and runs the synthesis
(Section 6.2).
We evaluate the effectiveness and efficiency of COSMOS by considering the WAMI applica-
tion [38] as a case study. The original specification of the WAMI application is available in C in
the PERFECT Benchmark Suite [3]. Starting from this specification, we design a SystemC acceler-
ator to be synthesized with a commercial HLS tool, i.e., Cadence C-to-Silicon. We use an industrial
32nm ASIC technology as target library4 . We choose the WAMI application as our case study due
to (i) the different types of computational blocks it includes and (ii) its complexity. The hetero-
geneity of its computational blocks allows us to develop different components for each block and
show the vast applicability of COSMOS. The C specification is roughly 1000 lines of code. The
specification of our accelerator design is roughly 7000 lines of SystemC code.
                            COSMOS               No Memory
Component        Regions   λ span   α span     λ span   α span
Debayer             3      2.89×    1.99×      1.04×    1.36×
Grayscale           4      6.91×    3.41×      2.75×    1.14×
Gradient            4      7.89×    3.65×      1.39×    1.22×
Hessian             4      7.70×    7.30×      1.44×    1.30×
SD-Update           4      9.87×    2.01×      2.78×    1.79×
Matrix-Sub          4      2.75×    3.98×      1.88×    1.05×
Matrix-Add          3      1.53×    1.01×      1.26×    1.01×
Matrix-Mul          3      2.88×    3.05×      1.92×    1.14×
Matrix-Resh         1      1.02×    1.04×      1.02×    1.04×
Steep.-Descent      1      1.95×    1.46×      1.95×    1.46×
Change-Det.         1      2.21×    1.04×      2.21×    1.04×
Warp                1      1.09×    1.03×      1.09×    1.03×
Average             -      4.06×    2.58×      1.73×    1.22×
overall a richer DSE, as evidenced by the average results. For some components the algorithm ex-
tracts only one region because multiple ports can incur additional area with no latency gains.
This happens when (i) the algorithm cannot exploit multiple accesses in memory, or (ii) the data
is cached into local registers which can be accessed in parallel in the same clock cycle, e.g., for
Change-Detection. On the other hand, in most cases COSMOS provides significant gains in
terms of area and latency spans compared to a DSE that does not consider the memories.
Figure 9 shows the design space of four representative components of WAMI. The rectangles in
the figures are the regions found by Algorithm 1. For completeness, in addition to the design points
corresponding to the extreme points of the regions, the graphs show also the intermediate points
that could be selected by the mapping function. The small graphs on the right magnify the cor-
responding regions reported on the left. As in the examples discussed in Section 3, increasing the
number of ports has a significant impact on the DSE, while loop unrolling has a local effect within
each region. Another aspect that is common among many components is that the regions become
smaller as we keep increasing the number of ports. For example, for Grayscale in Figure 9(c), we
note that by increasing the number of ports, we reach a point where the gain in latency is no longer
significant. This effect, called diminishing returns [1], is the same effect that can be observed in the
parallelization of software algorithms. In some cases, changing the ports increases only the area
with no latency gains as discussed in the previous paragraph. This is highlighted in Figure 9(d),
where for Change-Detection we report two additional regions with respect to those specified
in Table 1. The diminishing-return effect can also be observed by increasing the number of unrolls
inside a region, e.g., Figure 9(b). This is why COSMOS exploits Amdahl’s Law (Section 6.2). On the
other hand, we notice some discontinuities of the Pareto-optimal points within some regions, e.g.,
the region in the bottom-right corner of Figure 9(a). Even by applying the λ-constraint (Section 5),
it is not possible to completely discard the Pareto-dominated implementations. In fact,
by further restricting the imposed constraints, i.e., by reducing the number of states that the
HLS tool can insert in each loop, we observe that also the Pareto-optimal implementations are
discarded. Thus, it is not always possible to obtain a curve composed only of Pareto-optimal points
within a certain region. Finally, the Pareto-optimal points outside the regions are not discarded by
COSMOS. They can be chosen when it is necessary to perform the mapping (Section 6.2).
The mismatch between a planned point and the corresponding mapped point is measured as
$$\sigma(d_p, d_m) = \frac{|d_m - d_p|}{d_p}$$
where $d_p$ is the area of a planned point $p$, while $d_m$ is the area of the corresponding mapped
point $m$. Each planned point in Figure 10 is labeled with its corresponding $\sigma$% value. Note that the
curve obtained with LP is a theoretical curve because the points found at the system level do not
guarantee the existence of a corresponding set of implementations for the components. The error
is mainly due to the impact of the memory, which determines a significant distance between two
consecutive regions (e.g., the points with more than 10% mismatch in Figure 10). In fact, if a point
is mapped between two regions, it must be approximated with the lower-right point of the next
region with lower effective latency. This choice almost always satisfies the throughput requirements,
but at the expense of additional area. In fact, even if Equation (2) is constrained by
the system throughput, the same throughput is not always guaranteed, because a mapped point
with exactly the same latency as a planned point may not exist.
To solve this issue, one could try to reduce the clock period and satisfy the throughput
requirements.
Fig. 11. Number of invocations of the HLS tool for an exhaustive exploration (bars on the left) and COSMOS
(on the right).
Finally, to demonstrate the efficiency of COSMOS, Figure 11 shows the number of invocations
to the HLS tool. For each component of WAMI, the right bars report the breakdown of the syn-
thesis calls performed in each phase of the algorithm. At least two invocations are necessary for
each region to characterize a component. Then, we have to consider the invocations that fail due
to the λ-constraint, and finally the invocations required at the system level for the most critical
components (mapping). Some components do not play any role in the efficiency of the system.
For example, for Matrix-Mul, there are no invocations after the characterization because only the
slowest version has been requested by Equation (2) (to save area). This component is not important
to guarantee a high throughput for the entire system. Moreover, some synthesized points belong
to multiple solutions of the LP problem, as in the case of Debayer. Therefore, COSMOS avoids
invoking the HLS tool with the same knobs more than once. On the other hand, the
left bars in Figure 11 report the number of invocations required for an exhaustive exploration. Such
an exploration requires (i) synthesizing all the possible configurations of unrolls and memory ports
for each component, (ii) finding the Pareto-optimal design points for each component, and (iii) com-
posing all the Pareto-optimal designs to find the Pareto curve at the system level (Section 3). The left
bars in Figure 11 show the number of invocations to the HLS tool required in step (i). COSMOS
reduces the total number of invocations for WAMI by 6.7× on average and up to 14.6× for the
single components, compared to the exhaustive exploration. Further, while COSMOS returns the
Pareto-optimal implementations at the system level, to find the combinations of the components
that are Pareto optimal with an exhaustive search method, one has to combine the huge number
of solutions for the single components. In the case of WAMI, the number of combinations, i.e.,
the product of the number of Pareto-optimal points of each component, is greater than $9 \times 10^{12}$.
This motivates the need for a compositional method like COSMOS for the DSE of complex
accelerators.
7.4 Summary
We report a brief summary of the achieved results:
• COSMOS guarantees a richer DSE with respect to the approaches that do not consider the
memory as an integral part of the DSE: for WAMI, COSMOS guarantees an average perfor-
mance span of 4.06× and an average area span of 2.58× as opposed to 1.73× and 1.22×,
respectively, when only standard dual-port memories are used; COSMOS obtains a richer
set of Pareto-optimal implementations thanks to memory generation and optimization;
• COSMOS guarantees a faster DSE compared to exhaustive search methods: for WAMI,
COSMOS reduces the number of invocations to the HLS tool by 6.7× on average and by up
to 14.6× for the single components; COSMOS is able to reduce the number of invocations
thanks to the compositional approach discussed in Section 6;
• COSMOS is an automatic and scalable methodology for DSE: the approach is intrinsically
compositional, and thus with larger designs the gains are expected to be at least
as good as with smaller ones, if not better. While an exhaustive method has to explore all the
alternatives, COSMOS focuses on the most critical components.
8 RELATED WORK
This section describes the most closely related methods to perform DSE. We distinguish the meth-
ods that explore single-component designs (reported in Section 8.1) from those that are composi-
tional like COSMOS (in Section 8.2).
to account for the high variability and partial unpredictability of the HLS tools. Such constraints
consider both the dependency graph of the specification and the memory references in each loop.
Thus, COSMOS identifies larger regions of Pareto-optimal implementations.
Other methods, such as Aladdin [47], perform a DSE without using HLS tools and without gener-
ating the RTL implementations, estimating the performance and costs of high-level specifications
(C code for Aladdin). COSMOS differs from these methods because it aims at generating efficient
RTL implementations by using HLS and memory generator tools. Indeed, such methods can be
used before applying COSMOS to pre-characterize the different components of an accelerator that
is not ready to be synthesized with HLS tools. Since the design of HLS-ready specifications requires
significant effort [39], this can help designers focus only on the most critical components,
i.e., those that are expected to return good performance gains over software executions. After this
pre-characterization, COSMOS can be used to perform a DSE of such components and obtain the
Pareto-optimal combinations of their RTL implementations.
9 CONCLUDING REMARKS
We presented COSMOS, an automatic methodology for compositional DSE that coordinates both
HLS and memory generator tools. COSMOS takes into account the unpredictability of the current
HLS tools and considers the PLMs of the components as an essential part of the DSE. The method-
ology of COSMOS is intrinsically compositional. First, it characterizes the components to define
the regions of the design space that contain Pareto-optimal implementations. Then, it exploits an
LP formulation to find the Pareto-optimal solutions at the system level. Finally, it identifies the
knobs for each component that can be used to obtain the corresponding implementations at RTL.
We showed the effectiveness and efficiency of COSMOS by considering the WAMI accelerator as
a case study. Compared to methods that do not consider the PLMs, COSMOS finds a larger set of
Pareto-optimal implementations. Additionally, compared to exhaustive search methods, COSMOS
reduces the number of invocations to the HLS tool by up to one order of magnitude.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their valuable comments and help-
ful suggestions that helped us improve the paper considerably. This work was supported in part
by DARPA PERFECT (C#: R0011-13-C-0003), the National Science Foundation (A#: 1527821), and
C-FAR (C#: 2013-MA-2384), one of the six centers of STARnet, a Semiconductor Research Corpo-
ration program sponsored by MARCO and DARPA.
REFERENCES
[1] G. M. Amdahl. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In
Proc. of the ACM Spring Joint Computer Conference (AFIPS).
[2] N. Baradaran and P. C. Diniz. 2008. A Compiler Approach to Managing Storage and Memory Bandwidth in Config-
urable Architectures. ACM Transaction on Design Automation of Electronic Systems (2008).
[3] K. Barker, T. Benson, D. Campbell, D. Ediger, R. Gioiosa, A. Hoisie, D. Kerbyson, J. Manzano, A. Marquez,
L. Song, N. Tallent, and A. Tumeo. 2013. PERFECT (Power Efficiency Revolution For Embedded Computing Tech-
nologies) Benchmark Suite Manual. Pacific Northwest National Laboratory and Georgia Tech Research Institute.
http://hpc.pnl.gov/PERFECT/.
[4] S. Borkar and A. Chien. 2011. The Future of Microprocessors. Communication of the ACM (2011).
[5] S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
[6] J. Campos, G. Chiola, J. M. Colom, and M. Silva. 1992. Properties and Performance Bounds for Timed Marked Graphs.
IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications (1992).
[7] L. P. Carloni. 2015. From Latency-Insensitive Design to Communication-Based System-Level Design. Proc. of the IEEE
(2015).
[8] L. P. Carloni. 2016. The Case for Embedded Scalable Platforms. In Proc. of the ACM/IEEE Design Automation Conference
(DAC). (Invited).
[9] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. 2014. DaDianNao:
A Machine-Learning Supercomputer. In Proc. of the Annual ACM/IEEE International Symposium on Microarchitecture
(MICRO).
[10] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks. IEEE Journal of Solid-State Circuits (2017).
[11] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman. 2014. Accelerator-Rich Architectures:
Opportunities and Progresses. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[12] J. Cong, P. Li, B. Xiao, and P. Zhang. 2016. An Optimal Microarchitecture for Stencil Computation Acceleration Based
on Nonuniform Partitioning of Data Reuse Buffers. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems (2016).
[13] J. Cong, P. Wei, C. H. Yu, and P. Zhou. 2017. Bandwidth Optimization Through On-Chip Memory Restructuring for
HLS. In Proc. of the Annual Design Automation Conference (DAC).
[14] J. Cong, P. Zhang, and Y. Zou. 2011. Combined Loop Transformation and Hierarchy Allocation for Data Reuse Opti-
mization. In Proc. of the ACM/IEEE International Conference on Computer-Aided Design (ICCAD).
[15] J. Cong, P. Zhang, and Y. Zou. 2012. Optimizing Memory Hierarchy Allocation with Loop Transformations for High-
Level Synthesis. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[16] E. G. Cota, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2015. An Analysis of Accelerator Coupling in Hetero-
geneous Architectures. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[17] F. Ferrandi, P. L. Lanzi, D. Loiacono, C. Pilato, and D. Sciuto. 2008. A Multi-objective Genetic Algorithm for Design
Space Exploration in High-Level Synthesis. In Proc. of the IEEE Computer Society Annual Symposium on VLSI.
[18] A. Gerstlauer, C. Haubelt, A. D. Pimentel, T. P. Stefanov, D. D. Gajski, and J. Teich. 2009. Electronic System-level
Synthesis Methodologies. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2009).
[19] F. Ghenassia. 2006. Transaction-Level Modeling with SystemC. Springer-Verlag.
[20] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A High-Performance and Energy-
Efficient Accelerator for Graph Analytics. In Proc. of the Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO).
[21] C. Haubelt and J. Teich. 2003. Accelerating Design Space Exploration Using Pareto-Front Arithmetics [SoC design].
In Proc. of the ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC).
[22] M. Horowitz. 2014. Computing’s energy problem (and what we can do about it). In Proc. of the IEEE International
Solid-State Circuits Conference (ISSCC).
[23] L. W. Kim. 2017. DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.
IEEE Transactions on Neural Networks and Learning Systems (2017).
[24] S. Kurra, N. K. Singh, and P. R. Panda. 2007. The Impact of Loop Unrolling on Controller Delay in High Level Synthesis.
In Proc. of the ACM/IEEE Conference on Design, Automation and Test in Europe (DATE).
[25] B. Li, Z. Fang, and R. Iyer. 2011. Template-based Memory Access Engine for Accelerators in SoCs. In Proc. of the
ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC).
[26] H. Y. Liu and L. P. Carloni. 2013. On Learning-Based Methods for Design-Space Exploration with High-Level Synthe-
sis. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[27] H. Y. Liu, I. Diakonikolas, M. Petracca, and L. P. Carloni. 2011. Supervised Design Space Exploration by Compositional
Approximation of Pareto Sets. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[28] H. Y. Liu, M. Petracca, and L. P. Carloni. 2012. Compositional System-Level Design Exploration with Planning of
High-Level Synthesis. In Proc. of the ACM/IEEE Conference on Design, Automation, and Test in Europe (DATE).
[29] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen. 2016. High Level Synthesis of Complex Appli-
cations: An H.264 Video Decoder. In Proc. of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA).
[30] M. J. Lyons, M. Hempstead, G. Y. Wei, and D. Brooks. 2012. The Accelerator Store: A Shared Memory Framework for
Accelerator-based Systems. ACM Transactions on Architecture and Code Optimization (2012).
[31] A. Mahapatra and B. Carrion Schafer. 2014. Machine-learning based Simulated Annealer Method for High Level
Synthesis Design Space Exploration. In Proc. of the Electronic System Level Synthesis Conference (ESLsyn).
[32] W. Meeus, K. Van Beeck, T. Goedemé, J. Meel, and D. Stroobandt. 2012. An Overview of Today’s High-Level Synthesis
Tools. Design Automation for Embedded Systems (2012).
[33] V. K. Mishra and A. Sengupta. 2014. PSDSE: Particle Swarm Driven Design Space Exploration of Architecture and
Unrolling Factors for Nested Loops in High Level Synthesis. In Proc. of the IEEE International Symposium on Electronic
System Design (ISED).
[34] T. Murata. 1989. Petri Nets: Properties, Analysis and Applications. Proc. of the IEEE (1989).
[35] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. Broadening the Exploration of the Accelerator
Design Space in Embedded Scalable Platforms. In Proc. of the IEEE High Performance Extreme Computing Conference
(HPEC).
[36] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2014. System-level Memory Optimization for High-level
Synthesis of Component-based SoCs. In Proc. of the ACM/IEEE International Conference on Hardware/Software Code-
sign and System Synthesis (CODES+ISSS).
[37] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. System-Level Optimization of Accelerator Local
Memory for Heterogeneous Systems-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems (2017).
[38] R. Porter, A. M. Fraser, and D. Hush. 2010. Wide-Area Motion Imagery. IEEE Signal Processing Magazine (2010).
[39] A. Qamar, F. B. Muslim, F. Gregoretti, L. Lavagno, and M. T. Lazarescu. 2017. High-Level Synthesis for Semi-Global
Matching: Is the Juice Worth the Squeeze? IEEE Access (2017).
[40] C. V. Ramamoorthy and G. S. Ho. 1980. Performance Evaluation of Asynchronous Concurrent Systems Using Petri
Nets. IEEE Transaction on Software Engineering (1980).
[41] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Y. Wei, and D. Brooks.
2016. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. In Proc. of the ACM/IEEE
Annual International Symposium on Computer Architecture (ISCA).
[42] A. Sangiovanni-Vincentelli. 2007. Quo Vadis, SLD? Reasoning About the Trends and Challenges of System Level
Design. Proc. of the IEEE (2007).
[43] B. Carrion Schafer. 2016. Probabilistic Multiknob High-Level Synthesis Design Space Exploration Acceleration. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems (2016).
[44] B. Carrion Schafer, T. Takenaka, and K. Wakabayashi. 2009. Adaptive Simulated Annealer for High Level Synthe-
sis Design Space Exploration. In Proc. of the IEEE International Symposium on VLSI Design, Automation and Test
(VLSI-DAT).
[45] B. Carrion Schafer and K. Wakabayashi. 2012. Machine Learning Predictive Modelling High-Level Synthesis Design
Space Exploration. IET Computers Digital Techniques (2012).
[46] A. Seznec. 2015. Bank-interleaved Cache or Memory Indexing Does Not Require Euclidean Division. In Proc. of the
Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD).
[47] Y. S. Shao, B. Reagen, G. Y. Wei, and D. Brooks. 2014. Aladdin: A Pre-RTL, Power-performance Accelerator Simulator
Enabling Large Design Space Exploration of Customized Architectures. In Proc. of the ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA).
[48] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks. In Proc. of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA).
Extending High-Level Synthesis for Task-Parallel
Programs
Yuze Chi∗ , Licheng Guo∗ , Jason Lau∗ , Young-kyu Choi∗† , Jie Wang∗ , Jason Cong∗
∗ University of California, Los Angeles, † Inha University
{chiyuze,cong}@cs.ucla.edu
Abstract—C/C++/OpenCL-based high-level synthesis (HLS) takes only a few minutes for a simple design or a component
becomes more and more popular for field-programmable gate in a modular design.
array (FPGA) accelerators in many application domains in recent
Thanks to the advances in HLS scheduling algorithms [13–
arXiv:2009.11389v2 [cs.AR] 6 May 2021
1
[…] compilation on each task and unnecessarily slows down code generation. Programmers can manually synthesize tasks separately and instantiate them in RTL, but doing so requires debugging RTL code, which is time-consuming and error-prone. We think such processes should be automated.
Limited productivity support for task-parallel programs significantly elongates the development cycle and undermines the benefits brought by HLS. One may argue that programmers should always go for data-parallel implementations when designing FPGA accelerators using HLS, but data parallelism may be inherently limited, for example, in applications involving graphs. Moreover, research shows that even for data-parallel applications such as neural networks [3] and stencil computation [9], task-parallel implementations show better scalability and higher frequency than their data-parallel counterparts due to the localized communication pattern [26]. In fact, at least 6 of the 28 research papers published at the ACM FPGA 2020 conference [11, 27–31] use task-parallel implementations with HLS, and another 3 papers [32–34] use RTL implementations that would have required a task-parallel implementation if written in HLS.
In this paper, we extend the HLS C++ language and present our framework, TAPA (task-parallel)¹, as a solution to the aforementioned limitations of HLS productivity. Our contributions include:
• Convenient programming interfaces: We show that, with peeking and transactions added to the programming interfaces, TAPA can be used to program task-parallel kernels with a 22% reduction in lines of code (LoC) on average. By unifying the interface used for the kernel and host, TAPA further reduces the LoC on the host side by 51% on average.
• Unconstrained software simulation: We demonstrate that our proposed simulator can correctly simulate task-parallel programs that existing software simulators fail to simulate. Moreover, the correctness verification cycle can be shortened by a factor of 3.2× on average.
• Hierarchical code generation: We show that by modularizing a task-parallel program and using a hierarchical approach, RTL code generation can be accelerated by a factor of 6.8× on our server with 32 hyper-threads.
• Fully automated open-source framework: TAPA is open-source at https://github.com/UCLA-VAST/tapa/.
Table I summarizes the related work. Among all general HLS tools (Section VI-A) and streaming frameworks (Section VI-B): ① none of them supports peeking in their kernel APIs; ② only Intel HLS stream and Vivado HLS axis support transactions; ③ only Merlin allows the accelerator kernel to be called from the host as if it were a C/C++ function; ④ Vivado HLS, Merlin, and both streaming frameworks (ST-Accel [36] and Fleet [37]) execute tasks sequentially for simulation, which works only for limited applications, while the others launch one thread per task instance, which does not scale well; ⑤ all general HLS tools treat a task-parallel program as a monolithic design and generate RTL code for each instance of a task separately, except that Vivado HLS axis allows programmers to manually instantiate tasks using a configuration file when running logic synthesis and implementation. To the best of our knowledge, TAPA is the only work that provides convenient programming interfaces, unconstrained software simulation, and hierarchical code generation for general task-parallel programs on FPGAs using HLS.

TABLE I: Summary of related work.
Related Work         | Peeking | Transaction | Host Iface. | Software Simulation | RTL Code Generation
Fleet [37]           | No      | No          | N/A         | Sequential          | N/A
Intel HLS (pipe)     | No      | No          | N/A         | Multi-thread        | Monolithic
Intel HLS (stream)   | No      | Yes         | N/A         | Multi-thread        | Monolithic
Intel OpenCL         | No      | No          | OpenCL      | Multi-thread        | Monolithic
LegUp [38, 39]       | No      | No          | N/A         | Multi-thread        | Monolithic
Merlin [40]          | No      | No          | C++         | Sequential          | Monolithic
ST-Accel [36]        | No      | No          | VFS         | Sequential          | Hierarchical
Vivado HLS (ap_fifo) | No      | No          | OpenCL      | Sequential          | Monolithic
Vivado HLS (axis)    | No      | Yes         | OpenCL      | Multi-thread        | Manual
Xilinx OpenCL        | No      | No          | OpenCL      | Multi-thread        | Monolithic
TAPA                 | Yes     | Yes         | C++         | Coroutine           | Hierarchical

¹ While a prior work TAPAS [35] and our work TAPA share similarity in name, our work focuses on statically mapping tasks to hardware, whereas TAPAS specializes in dynamically scheduling tasks.

II. BACKGROUND

A. Task-Parallel Program
Task-level parallelism is a form of parallelization of computer programs across multiple processors. In contrast to data parallelism, where the workload is partitioned on data and each processor executes the same program (e.g., OpenMP [41]), different processors in a task-parallel program often behave differently, while data are passed between processors. Examples of task-parallel programs include image processing pipelines [9–11], graph processing [42–45], and network switching [33]. Task-parallel programs are often described using dataflow models [46–50], where tasks are called processes. Processes communicate only through unidirectional channels. Data exchanged through channels are called tokens. In this paper, we borrow the terms channel and token, and focus on the problem of statically mapping tasks to hardware. That is, instances of tasks are synthesized to different areas in an FPGA accelerator. We plan to address dynamic scheduling [35, 39, 51] in our future work.
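To make the process/channel/token vocabulary above concrete, here is a minimal software sketch of a dataflow pipeline with bounded, unidirectional channels; the Channel class, depths, and process names are illustrative and are not part of TAPA's API.

#include <cstddef>
#include <cstdio>
#include <queue>

template <typename T>
class Channel {  // bounded, unidirectional channel carrying tokens of type T
 public:
  explicit Channel(std::size_t depth) : depth_(depth) {}
  bool full() const { return q_.size() >= depth_; }
  bool empty() const { return q_.empty(); }
  void write(const T& token) { q_.push(token); }      // caller checks full()
  T read() { T t = q_.front(); q_.pop(); return t; }  // destructive read
  const T& peek() const { return q_.front(); }        // non-destructive read
 private:
  std::size_t depth_;
  std::queue<T> q_;
};

// Two processes of a tiny pipeline; each consumes and produces tokens only
// through its channels, as in the dataflow models referenced above.
void Producer(Channel<int>& out, int n) {
  for (int i = 0; i < n && !out.full(); ++i) out.write(i);
}
void Doubler(Channel<int>& in, Channel<int>& out) {
  while (!in.empty() && !out.full()) out.write(2 * in.read());
}

int main() {
  Channel<int> a(8), b(8);
  Producer(a, 4);
  Doubler(a, b);
  while (!b.empty()) std::printf("%d\n", b.read());
}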
B. A Motivating Example
An on-chip ring network is a commonly used topology to provide all-to-all interconnection among many task-parallel processing elements (PEs) in a single FPGA accelerator, which is particularly useful in graph processing [52–58] where each vertex may be connected to any other vertex. A ring network has the advantages of simplicity and high routability, but implementing a customized ring network in HLS faces several issues that make such designs verbose to write, hard to read, and error-prone. In this section, we use a simplified real-world design to illustrate the productivity issues of implementing
such a ring network in HLS, which serves as a motivating example for our work.

Fig. 1: An accelerator with 4 PEs connected via a ring network.

Fig. 1 shows an example where PEs in an accelerator are interconnected via a ring network. In this example, network nodes form a cyclic ring, and each ring node is connected to a PE via a bidirectional link. Each PE can send packets to other PEs through its associated node, specifying the destination PE in the packet header. Each node forwards packets either to its next node or to its associated PE, based on the packet header. We assume packets are sent infrequently and channels between nodes are provisioned so that they will never be full. Furthermore, we would like to insert packets from PEs into the network as soon as possible so that PEs will not stall due to back pressure from the ring nodes. While such a ring node can be written using Vivado HLS (Listing 1), we found that the following features are missing or hard to use in the HLS tools, which significantly degrades productivity.
1) Peeking: Peeking is defined as reading a token from a channel without consuming it. Compared with a normal destructive read, peeking is non-destructive because the token may be read many times. For example, in our ring network, when Node 1 receives incoming packets from both PE 1 (via pe_in) and Node 0 (via node_in), it will forward the packet from PE 1 to Node 2 (via node_out) to prevent PE 1 from being stalled due to back pressure. In the same clock cycle, the packet from Node 0 cannot be forwarded unless the destination of that packet is PE 1 (via pe_out), because we cannot write two tokens to the same output channel (node_out) in the same clock cycle. This requires us to conditionally read tokens based on the content of the tokens. Without a peek API, one has to manually maintain a buffer for the incoming values, as shown in Lines 7–15 of Listing 1. This not only increases the programming burden, but also makes the design prone to errors in the state transitions of the buffer.
2) Transactions: A sequence of tokens may constitute a single logical communication transaction. Using the same ring network example, we consider the whole accelerator execution as a logical communication transaction and let each PE control the termination of its RingNode, as shown in Line 11 of Listing 1. Without an eot API, one has to manually add a special bit to the data structure to indicate the end of transaction (Lines 1–4 of Listing 1). Note that the Pkt struct may be used elsewhere, thus it may be infeasible to add the eot bit directly to the Pkt struct. Moreover, determining the end of transaction must be a peek operation; otherwise, the HLS compiler will be unable to schedule the exit condition in the first stage of the pipeline, leading to an initiation interval (II) greater than 1. This further complicates the HLS implementation (Listing 1).
3) System integration: To offload a computation kernel from the host CPU to PCIe-based FPGA accelerators, programmers need to write host-side code to interface the accelerator kernel with the host. FPGA vendors adopt the OpenCL standard to provide such functionality. While the standard OpenCL host-kernel interface infrastructure relieves programmers from writing their own operating system drivers and low-level libraries, it is still inconvenient and hard to use. Programmers often have to write and debug tens of lines of code just to set up the host-kernel interface. This includes manually setting up environment variables for simulation, and creating and maintaining OpenCL Context, CommandQueue, Program, Kernel, etc. data structures [59]. Task-parallel accelerators often make the situation worse because the parallel tasks are often described as distinct OpenCL kernels [24], which significantly increases the programmers' burden of managing multiple kernels in the host-kernel interface. In our experiments, more than 60 lines of host code are created just for the host-kernel integration, which constitutes more than 20 percent of the whole source code. Yet, what we want is just a single function invocation of the synthesized FPGA bitstream given proper arguments.
4) Software simulation: C does not have explicit parallel semantics by itself. Vivado HLS uses the dataflow model and allows programmers to instantiate tasks by invoking each of them sequentially [23]. While this is very concise to write (Listing 2), it leads to incorrect simulation results because the communication between a ring node and its corresponding PE is bidirectional, yet sequential execution can only send tokens from nodes to PEs because of their invocation order. This problem was also pointed out in [60]. In order to run software simulation correctly, the programmer can change the source code to run tasks in multiple threads, but doing so requires the same piece of task instantiation code to be written twice, once for synthesis and once for simulation, reducing productivity. While there exist other tools (e.g., [24]) that can run tasks in parallel threads and do not have the same correctness problem, we will show in Section V-D that such simulators do not scale well when the number of task instances increases.
5) RTL code generation: In our ring network example, the same ring node is instantiated many times. While state-of-the-art HLS compilers can recognize multiple instances of the same function and reuse HLS results for regular non-task-parallel programs, task-parallel programs are always treated as a monolithic design. This means instances of the same task in a task-parallel program are treated as if they were different, possibly in order to explore different communication interfaces for each instance. This significantly elongates the code generation time when the number of instances is large (Section V-E). We can manually perform hierarchical code generation, i.e., synthesize each task separately and connect the generated RTL code, but doing so forces us to debug RTL code and spend tens of minutes to verify correctness for each code modification, which defeats the purpose of adopting HLS.
In this paper, we present the TAPA framework and address […]
1  struct PktEoT {                  // Auxiliary struct for termination control;
2    Pkt pkt;                       // eot stands for "end of transaction".
3    bool eot;
4  };
5  void RingNode(stream<Pkt>& node_in, stream<PktEoT>& pe_in,
6                stream<Pkt>& node_out, stream<Pkt>& pe_out) {
7    Pkt node_pkt;                  // Manually maintained input buffers
8    bool node_pkt_valid = false;   // to implement non-destructive
9    PktEoT pe_pkt;                 // read (i.e., peek).
10   bool pe_pkt_valid = false;
11   while (!(pe_pkt_valid && pe_pkt.eot)) {
12     if (!pe_pkt_valid)           // Manually update
13       pe_pkt_valid = pe_in.read_nb(pe_pkt);
14     if (!node_pkt_valid) […]
Listing 1: RingNode written with Vivado HLS streams.

void RingNode(istream<Pkt>& node_in, istream<Pkt>& pe_in,
              ostream<Pkt>& node_out, ostream<Pkt>& pe_out) {
  while (!pe_in.eot()) {
    if (!pe_in.empty()) {
      node_out.write(pe_in.read());
      if (!node_in.empty() && IsForThisNode(node_in.peek()))
        pe_out.write(node_in.read());
    } else if (!node_in.empty()) {
      Pkt pkt = node_in.read();
      (IsForThisNode(pkt) ? pe_out : node_out).write(pkt);
    }
  }
}
// In the original, destructive read operations and the
// non-destructive read (peek) operations are highlighted.
Listing 3: RingNode written with the TAPA interfaces.
void Kernel(...) {
  channel<Pkt, 2> node_0_1, node_1_2, ...
  channel<Pkt, 2> from_pe_0, to_pe_0, from_pe_1, to_pe_1, ...
  // Instantiates other channels...
  task()
      .invoke(RingNode, node_0_1, node_1_2, from_pe_1, to_pe_1)
      .invoke(RingNode, node_1_2, node_2_3, from_pe_2, to_pe_2)
      // Instantiates other ring nodes and PEs...
}
Listing 4: Accelerator task instantiation in TAPA.

[…] the runtime environment properly. As a user of TAPA, the programmer can use a single function invocation in the same source code to run software simulation, hardware simulation, and on-board execution, with the only difference being the kernel binary that is specified.

IV. TAPA FRAMEWORK IMPLEMENTATION

A. Software Simulation
State-of-the-Art Approach: There are two state-of-the-art approaches to running software simulation for task-parallel applications: the sequential approach and the multi-thread approach. A sequential simulator invokes tasks sequentially in the invocation order [23]. Sequential simulators are fast, but cannot correctly simulate the capacity of channels and applications with tasks communicating bidirectionally, as discussed in Section II-B. A multi-thread simulator invokes tasks in parallel by launching a thread for each task. This enables the capacity of channels and bidirectional communication to be simulated correctly. However, such simulators may perform poorly due to the inefficient context switches handled by the operating system. The FLASH simulator [60, 61] proposed an alternative to the above, which uses HLS scheduling information to create an interleaved execution of all tasks. Note that although FLASH is also single-threaded, it is different from a sequential simulator because it interleaves tasks via source-to-source transformation while a sequential simulator does not. Compared with a sequential simulator, FLASH is on average 1.7× slower [61], due to the additional scheduling information taken into consideration for cycle-accurate modeling. Besides, generating the simulation executable becomes slower due to the need for the HLS scheduler output for cycle accuracy, which is not needed for correctness verification.
In this section, we present an alternative approach to running software simulation of task-parallel applications. Given that the inefficiency of multi-thread execution is mainly caused by the preemptive nature of operating system threads, we propose an approach that uses collaborative coroutines [62, 63] instead of preemptive threads for each task. Note that fast and/or cycle-accurate debugging in general [64] is out of the scope of this paper; we focus on the correctness and scalability issues of task-parallel programs.
Coroutine-Based Approach: Routines in programming languages are the units of execution contexts, e.g., functions in C/C++ [65]. Coroutines [66] are routines that execute collaboratively; more specifically, coroutines can be explicitly suspended and resumed. A coroutine can invoke subroutines and suspend from and resume to any subroutine [63]. A context switch between coroutines takes only 26 ns on modern CPUs [63], while a preemptive thread context switch takes 1.2~2.2 µs [67], which is two orders of magnitude slower.
TAPA leverages coroutines to perform software simulation as follows. When a task is instantiated, a coroutine is launched but suspended immediately. Once all tasks are instantiated, the simulator starts to resume the suspended coroutines. A resumed task will be suspended again if any input channel is accessed when empty or any output channel is accessed when full, which means that no progress can be made by this task. A different task will then be selected and resumed by the simulator. Moreover, the coroutines can be distributed in a thread pool. The thread pool launches one thread per CPU core and can bind each thread to the corresponding core, which prevents the threads from preempting each other. This improves simulation parallelism without introducing the high context switch overhead of the multi-thread simulators. We will show in Section V-D that the coroutine-based simulator outperforms the existing simulators by 3.2× on average. The TAPA software simulator is implemented as a C++ library, which can be compiled by any compatible C++ compiler.
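The scheduling idea behind the coroutine-based simulator can be sketched as follows. Instead of real coroutines, this simplified model gives each task a resumable step function that reports whether it made progress; the simulator keeps resuming tasks until none can proceed. All names and channel depths are illustrative.

#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

// Cooperative simulation sketch: each "task" exposes a resumable step that
// returns false when it is blocked (input empty / output full) or finished.
struct Fifo {
  std::deque<int> q;
  std::size_t depth = 2;
  bool full() const { return q.size() >= depth; }
  bool empty() const { return q.empty(); }
};

int main() {
  Fifo a, b;
  int produced = 0, consumed = 0;

  auto producer = [&]() -> bool {          // blocked when its output is full
    if (produced >= 8 || a.full()) return false;
    a.q.push_back(produced++);
    return true;
  };
  auto relay = [&]() -> bool {             // blocked when in empty or out full
    if (a.empty() || b.full()) return false;
    b.q.push_back(a.q.front() * 2);
    a.q.pop_front();
    return true;
  };
  auto consumer = [&]() -> bool {          // blocked when its input is empty
    if (b.empty()) return false;
    std::printf("token %d\n", b.q.front());
    b.q.pop_front();
    ++consumed;
    return true;
  };

  std::vector<std::function<bool()>> tasks = {producer, relay, consumer};
  bool progress = true;
  while (progress) {   // resume tasks round-robin until none can make progress
    progress = false;
    for (auto& t : tasks) progress |= t();
  }
  std::printf("consumed %d tokens\n", consumed);
}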
B. RTL Code Generation
State-of-the-Art Approach: Current HLS tools treat the whole task-parallel program as a monolithic design, treat channels as global variables, and compile different instances of tasks as if they were completely unrelated. This can lead to a significant amount of repeated work. For example, the dataflow architecture generated by the stencil accelerator compiler SODA [7, 9] is highly modularized and has many functionally identical modules. However, both the Vivado HLS and Intel FPGA OpenCL backends generate RTL code for each module separately. When the design scales out to hundreds of modules, RTL code generation can easily run for hours, taking even longer than logic synthesis and implementation. While we recognize that a programmer can manually generate RTL code for each task and glue the parts together at the RTL level, doing so defeats the purpose of using HLS for high productivity. We also recognize that fast RTL code generation in general is an interesting problem, but we focus on the inefficiency exacerbated by task-parallel programs in this paper.
Modularized Approach: Thanks to the hierarchical programming model, TAPA can keep the program hierarchy, recognize different instances of the same task, and compile each task only once. As such, the total amount of time spent on RTL code generation is reduced. Moreover, modularized compilation makes it possible to compile tasks in parallel, further reducing RTL code generation time on multi-core machines. TAPA implements this by invoking the vendor tools in parallel for each task. On average, TAPA reduces HLS compilation time by 4.9× (Section V-E).
Fig. 2 shows how RTL code is generated by TAPA, which is composed of four steps. First, TAPA extracts the HLS code for each task and the metadata information of the whole design, including the communication topology among tasks, the token types exchanged between tasks, and the capacity of each
channel. Source-to-source transformation is applied in this step to insert HLS pragmas where necessary (e.g., to generate proper RTL interfaces). Then, the vendor HLS tool is used to generate RTL code and an HLS report for each task. While TAPA uses libraries to implement kernel APIs extensively, e.g., for read, write, and the end-of-transaction bit, not all APIs, e.g., peeking, can be implemented as libraries, due to the lack of support from the HLS scheduler. To support peeking, TAPA adds a scalar argument to each istream, and connects this port to the output of the first-word-fall-through FIFO when the RTL code is assembled in the next step.
Using the metadata extracted in the first step, TAPA assembles the per-task RTL code to create the complete kernel. In this step, for each parent task, TAPA instantiates the children tasks and channels, and generates a small state machine that controls the start of the children tasks and the termination of the parent task. Finally, TAPA packages the assembled RTL code into a format that the vendor implementation tool can recognize (an xo file for Vitis).

[Figure: all steps handled automatically by TAPA — TAPA C++ code → source-to-source transformation (per task) → HLS compiler (per task) → RTL code and HLS report; extracted metadata (task info, channel info) → instantiation of tasks, channels, and their control logic → complete kernel RTL code and C++ host code with OpenCL function calls.]
Fig. 2: TAPA code generation. The host-kernel interface code is generated together with the kernel RTL code using metadata of the top-level task.
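A minimal sketch of the modularized idea: deduplicate task definitions, run one HLS job per unique task in parallel, and reuse the result for every instance. The run_hls.sh command and the task descriptors are hypothetical placeholders, not TAPA's actual driver.

#include <cstdlib>
#include <future>
#include <set>
#include <string>
#include <vector>

// Hypothetical task descriptor extracted from the program metadata.
struct TaskInst {
  std::string task_name;
  std::string instance_name;
};

int main() {
  std::vector<TaskInst> instances = {
      {"RingNode", "node0"}, {"RingNode", "node1"},
      {"RingNode", "node2"}, {"PE", "pe0"}, {"PE", "pe1"}};

  // Deduplicate: each unique task is synthesized once, not once per instance.
  std::set<std::string> unique_tasks;
  for (const auto& inst : instances) unique_tasks.insert(inst.task_name);

  // Launch one vendor-HLS job per unique task, in parallel.
  // "./run_hls.sh <task>" stands in for the real tool invocation.
  std::vector<std::future<int>> jobs;
  for (const auto& task : unique_tasks) {
    jobs.emplace_back(std::async(std::launch::async, [task] {
      return std::system(("./run_hls.sh " + task).c_str());
    }));
  }
  int failures = 0;
  for (auto& job : jobs) failures += (job.get() != 0);
  // The generated per-task RTL would then be instantiated once per instance
  // and stitched together with the channel and control logic.
  return failures == 0 ? 0 : 1;
}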
V. EVALUATION
We prototype TAPA on Xilinx devices using Vivado HLS as the backend; support for Intel devices will be added later. We compare the productivity of TAPA with two vendor tools that provide an end-to-end high-level programming experience (including host-kernel communication): the Xilinx Vitis 2019.2 suite and the Intel FPGA SDK for OpenCL Pro Edition 19.4. The experimental results are obtained on an Ubuntu 18.04 server with 2 Xeon Gold 6244 processors.

A. Benchmarks
Table II summarizes the benchmarks used in this paper. All implementations (Vivado HLS, Intel OpenCL, and TAPA) of each benchmark are written in such a way that tasks in each implementation have a one-to-one correspondence, corresponding loops are scheduled with the same initiation interval (II), and each task performs the same computation. This not only guarantees that the source codes for all tools are functionally equivalent, but also makes all tools generate consistent quality of results (QoR), which enables a fair comparison of tool run time. Note that we aim to compare the productivity of the HLS tools, not QoR (although we want to make sure there is no QoR degradation). In particular, we were unable to guarantee that the generated RTL codes have exactly the same cycle-accurate behavior without having access to the HLS compiler's scheduling algorithm. For example, the bucket sort network implemented in TAPA has a total latency of 3 cycles while the Vivado HLS implementation has a total latency of 6. This is inevitable because, using Vivado HLS, the manually maintained buffer forces an additional latency of 1 cycle at each network stage. The shallower pipeline makes TAPA use 40% fewer LUTs and 39% fewer FFs for network. For the other benchmarks, TAPA uses 0.4% fewer LUTs and 1% fewer FFs on average. This shows that the additional APIs provided by TAPA do not add resource overhead.

TABLE II: Benchmarks used in this paper. Each task may be instantiated multiple times, so the task instance count (#Inst.) and channel count (#Chan.) are greater than the task count (#Task).
Benchmark | Application                        | #Task | #Inst. | #Chan.
cannon    | Cannon's algorithm [25]            | 5     | 91     | 344
cnn       | VGG [68] convolutional network [3] | 14    | 209    | 366
gaussian  | Gaussian stencil filter [9]        | 15    | 564    | 1602
gcn       | Graph convolutional network [52]   | 5     | 12     | 25
gemm      | General matrix multiplication [3]  | 14    | 207    | 364
network   | Bucket sort w/ Omega network [69]  | 3     | 14     | 32
page_rank | PageRank citation ranking [54]     | 4     | 18     | 89

B. Lines of Kernel Code
TAPA simplifies the kernel code in two aspects. First, the TAPA communication interfaces simplify the code with built-in support for peeking and transactions. This not only simplifies the body of each task definition, but also removes the necessity for many struct definitions. Second, the TAPA instantiation interfaces simplify the code by allowing tasks to be launched concisely. Fig. 3 shows the lines-of-kernel-code comparison for each benchmark. On average, TAPA reduces the lines of kernel code by 22%. Note that only synthesizable kernel code is counted; code added for multi-thread software simulation is not counted for Vivado HLS.

[Figure] Fig. 3: LoC comparison for kernel code (Vivado HLS, Intel OpenCL, TAPA; lines of code normalized, per benchmark). Lower is better.

C. Lines of Host Code
The host code used in the benchmarks contains a minimal test bench to verify the correctness of the kernel code. The TAPA system-integration API automatically interfaces with the OpenCL host APIs and relieves the programmer from writing repetitive code just to connect the kernel to a host program. Fig. 4 shows the lines-of-host-code comparison. On average, the length of the host code is reduced by 51%.
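To illustrate the host-side savings, the sketch below folds the usual OpenCL bookkeeping into a single call whose argument list mirrors the kernel signature; the invoke wrapper shown here is an illustrative stand-in that simply runs the kernel in software, not TAPA's exact API.

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// A hand-written OpenCL host program typically performs these steps
// explicitly: discover the platform and device, create a Context and
// CommandQueue, load the bitstream and build a Program, create each Kernel,
// create Buffers and copy inputs, set kernel arguments (per kernel for
// task-parallel designs), enqueue, synchronize, copy outputs back, and
// release every object.
//
// A unified host interface folds those steps into one call. `invoke` here is
// an illustrative stand-in that runs the kernel as a plain function; a real
// implementation would dispatch to the FPGA runtime when given a hardware
// binary.
template <typename Kernel, typename... Args>
void invoke(Kernel&& kernel, const std::string& binary, Args&&... args) {
  std::printf("running with binary: %s\n", binary.c_str());
  std::forward<Kernel>(kernel)(std::forward<Args>(args)...);
}

// Kernel top function; in HLS this would also be the synthesis top.
void VecAdd(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c) {
  for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i] + b[i];
}

int main() {
  std::vector<float> a{1, 2, 3}, b{4, 5, 6}, c(3);
  // Software simulation, hardware simulation, and on-board execution would
  // differ only in which binary is passed here.
  invoke(VecAdd, "vecadd.sw_emu.xclbin", a, b, c);
  std::printf("c[0]=%g\n", c[0]);
}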
[Figure] Fig. 4: LoC comparison for host code (Vivado HLS, Intel OpenCL, TAPA; lines of code normalized, per benchmark). Lower is better.

D. Software Simulation Time
Fig. 5 shows four simulators, that is, the sequential Vivado HLS simulator, the multi-thread Vivado HLS simulator, the multi-thread Intel OpenCL simulator, and the coroutine-based TAPA simulator. Among them, the sequential simulator fails to correctly simulate benchmarks that require feedback data paths (cannon and page_rank). Due to the larger memory footprint required for storing the tokens transmitted between tasks and the lack of parallelism, the sequential simulator is outperformed by the coroutine-based simulator in all but one of the benchmarks (network). The two multi-thread simulators correctly simulate all benchmarks, except that Intel OpenCL cannot handle gaussian because its large number of task instances (564) exceeds the maximum allowed by the simulator (256). However, the multi-thread simulators perform poorly on benchmarks that are communication-intensive (e.g., network) or have more tasks than the number of available threads (e.g., gaussian). Although the coroutine-based TAPA simulator is not always the fastest simulator for all benchmarks, its worst-case slowdown is only 6%, which is not significant in comparison with the multi-thread simulators, which can be 11× slower. On average, TAPA is 3.2× faster than the other simulators.

[Figure] Fig. 5: Simulation time in log scale (Vivado HLS multi-thread vs. TAPA coroutine, per benchmark). Lower is better. The sequential simulator fails to simulate cannon and page_rank correctly. The Intel OpenCL multi-thread simulator cannot simulate gaussian due to its large number of task instances.

E. RTL Code Generation Time
Fig. 6 shows the RTL code generation time comparison. Thanks to the hierarchical programming model and modularized code generator, TAPA shortens the HLS compilation time by 6.8× on average. This is because ① TAPA runs HLS for each task only once even if it is instantiated many times, while Vivado HLS and Intel OpenCL run HLS for each task instance, and ② TAPA runs HLS in parallel on multi-core machines.

[Figure] Fig. 6: RTL code generation time in log scale (Vivado HLS, Intel OpenCL, TAPA; per benchmark). Lower is better.

VI. RELATED WORK
Two domain-specific streaming frameworks are discussed in Section VI-B. SystemC and pthread are two well-known alternative API paradigms that support task-parallel programs. We will discuss and compare them with TAPA in Section VI-C.

A. HLS Support for Task-Parallel Programs
Intel HLS supports two different inter-task communication interfaces: pipe and stream. pipe implements a simple FIFO interface with data, valid, and ready signals, while stream implements an Avalon-ST interface that supports transactions. Tasks are instantiated using launch and collect.
Intel FPGA OpenCL supports the simple FIFO interface via two sets of APIs, i.e., the standard OpenCL pipe and the Intel-specific channel. Tasks are instantiated by defining OpenCL __kernels, which forces instances of the same task to be synthesized separately as different OpenCL kernels.
Vivado (Vitis) HLS provides two different streaming interfaces: ap_fifo and axis. ap_fifo generates the simple FIFO interface. Tasks are instantiated by invoking the corresponding functions in a dataflow region (Listing 2). axis generates an AXI-Stream interface with transaction support. It requires the programmers to instantiate channels and tasks in a separate configuration file when running logic synthesis and implementation. This allows different instances of the same task to be synthesized only once, but takes longer to learn and implement compared with ap_fifo.
Xilinx OpenCL supports the standard OpenCL pipe, which generates AXI-Stream interfaces similar to Vivado HLS axis, but pipe does not provide APIs to support transactions.
LegUp supports the simple FIFO interface via FIFO. Tasks are instantiated using the pthread API (Section VI-C).
Merlin [40] allows programmers to call the FPGA kernel as a C/C++ function and provides OpenMP-like simple pragmas with automated design space exploration based on machine learning. To support task-parallel programs, Merlin leverages the programming interfaces of its backend vendor HLS tools.
Their limitations are summarized in Table I. Note that a common limitation of HLS tools (including TAPA) is that they cannot guarantee that the software description produces deterministic output sequences for task-parallel programs. For instance, the emptiness test on an input channel is prone to breaking determinism, yet it is available in all HLS tools for performance and expressiveness reasons: merging two input channels using non-blocking reads would produce
an output sequence determined by the relative arrival order of the input tokens. An implication of non-determinism is that we cannot assert that a program is deadlock-free just because its simulation succeeds. This is different from deterministic programs, e.g., Kahn process networks [47], whose successful simulation generally implies deadlock-free on-board execution. For applications that can be efficiently written without breaking determinism, e.g., streaming applications, there are dedicated frameworks developed specifically for them, which are discussed in the next section.
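As a concrete illustration of the non-determinism discussed above, the sketch below merges two input channels with non-blocking reads; the merged order depends on which tokens happen to be available when each channel is polled, so different arrival timings yield different output sequences. The queue-based channel model is illustrative only.

#include <cstdio>
#include <optional>
#include <queue>

// Non-blocking channel read: returns a token only if one is available.
template <typename T>
std::optional<T> TryRead(std::queue<T>& ch) {
  if (ch.empty()) return std::nullopt;
  T t = ch.front();
  ch.pop();
  return t;
}

// Merge with non-blocking reads: whichever input currently holds a token
// wins, so the merged order depends on relative arrival times.
void Merge(std::queue<int>& in0, std::queue<int>& in1, std::queue<int>& out,
           int expected_tokens) {
  int received = 0;
  bool turn = false;  // alternate which input is polled first
  while (received < expected_tokens) {
    auto& first = turn ? in1 : in0;
    auto& second = turn ? in0 : in1;
    if (auto t = TryRead(first)) { out.push(*t); ++received; }
    else if (auto t = TryRead(second)) { out.push(*t); ++received; }
    turn = !turn;
  }
}

int main() {
  std::queue<int> a, b, merged;
  // In hardware, tokens arrive over time; the pre-filled queues here stand in
  // for one particular arrival order among many possible ones.
  for (int i : {1, 3, 5}) a.push(i);
  for (int i : {2, 4, 6}) b.push(i);
  Merge(a, b, merged, 6);
  while (!merged.empty()) { std::printf("%d ", merged.front()); merged.pop(); }
  std::printf("\n");
}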
B. Streaming Framework
ST-Accel [36] is a high-level programming platform that features a highly efficient host-kernel communication interface exposed as a virtual file system (VFS). It uses Vivado HLS as its backend for hardware generation.
Fleet [37] is a massively parallel streaming framework for FPGAs that features highly efficient memory interfaces for massive instances of parallel processing elements. Programmers write Fleet programs in a domain-specific RTL language based on Chisel [70].
TAPA aims to support more general task-parallel applications beyond streaming.

C. Alternative APIs
SystemC is a set of C++ classes and macros that provide detailed hardware modeling and event-driven simulation. It supports both cycle-accurate and untimed simulation, and many simulator implementations are available [71, 72]. The official open-source SystemC simulator implementation uses coroutines without thread pooling. Some HLS tools support a subset of untimed SystemC as the input [23]. SystemC supports task-parallel programs natively via the SC_MODULE constructs and tlm_fifo interfaces, which support peeking. While SystemC supports peeking FIFOs and coroutine-based simulation for task-parallel programs, it is limited by its special and verbose coding style. Listing 5 shows the example discussed in Section II-B written in SystemC. Compared with other C-like HLS languages, SystemC is more verbose and less productive due to its special language constructs: for the TAPA code snippets shown in Listing 3 and Listing 4, the equivalent SystemC kernel code would be 86% longer. On the host side, SystemC generates the main function in sc_main by itself for simulation, and programmers need to spend time incorporating the SystemC test bench with other parts of their program. This is not a problem if the whole system is defined by the kernel in SystemC, e.g., as in embedded systems, but in data center applications where the FPGA accelerator is only part of the system, this introduces a non-trivial complication.

SC_MODULE(RingNode) {
  sc_port<tlm_fifo_get_if<Pkt>> node_in;
  sc_port<tlm_fifo_get_if<PktEoT>> pe_in;
  sc_port<tlm_fifo_put_if<Pkt>> node_out, pe_out;
  SC_CTOR(RingNode) { SC_THREAD(thread); }
  void thread() { while (...) {...} }
};
SC_MODULE(Kernel) {
  tlm_fifo<Pkt> node_0_1{/*depth=*/2}, node_1_2{2}, ...
  // Other channels...
  RingNode node1, node2, ...
  // Other tasks...
  SC_CTOR(Kernel) {
    node1.node_in(node_0_1);
    node1.node_out(node_1_2);
    // Other argument bindings...
  }
};
Listing 5: SystemC TLM API example.
Pthread API is a set of widely used standard APIs that can be used to implement task-parallel programs using threads. Pthread requires programmers to explicitly create and join threads, and each argument needs to be manually packed and passed. Listing 6 shows an example using the accelerator discussed in Section II-B. Compared with the invoke API used by TAPA, the pthread APIs require more effort to program: for the TAPA code snippets shown in Listing 3 and Listing 4, the equivalent pthread-based code would be 2.4× as long.

struct RingNode_Arg {
  FIFO<Pkt> *node_in, *node_out, *pe_out;
  FIFO<PktEoT>* pe_in;
};
void* RingNode(void* arg) {
  FIFO<Pkt>* node_in = ((RingNode_Arg*)arg)->node_in;
  // Unpack other arguments...
  while (...) {...}
  pthread_exit(NULL);
}
void Kernel(...) {
  FIFO<Pkt> node_0_1, node_1_2, ...
  // Instantiate other channels...
  RingNode_Arg node1_arg, node2_arg, ...
  node1_arg.node_in = &node_0_1;
  // Pack other arguments...
  pthread_t node1_pid, node2_pid, ...;
  pthread_create(&node1_pid, NULL, RingNode, &node1_arg);
  // Create other threads...
  pthread_join(node1_pid, NULL);
  // Join other threads...
}
Listing 6: Pthread API example.

In summary, while the API alternatives do exist in their own domains, they are more verbose and thus less productive compared with TAPA for task-parallel FPGA acceleration.

VII. CONCLUSION AND FUTURE WORK
In this paper, we present TAPA as an HLS C++ language extension to enhance the programming productivity of task-parallel programs on FPGAs. TAPA has multiple advantages over state-of-the-art HLS tools: on average, ① its enhanced programming interface helps to reduce the lines of kernel code by 22%, ② its unified system integration interface reduces the lines of host code by 51%, ③ its coroutine-based software simulator shortens the correctness verification development cycle by 3.2×, and ④ its modularized code generation approach shortens the QoR tuning development cycle by 6.8×. As a fully automated and open-source framework, TAPA aims to provide a highly productive development experience for task-parallel programs using HLS. For future work, we plan to extend our work to support dynamic tasks on FPGAs.
ACKNOWLEDGMENT [24] Intel, “Intel FPGA SDK for OpenCL Pro Edition: Programming Guide,”
2020.
The authors would like to thank the anonymous reviewers [25] H.-J. Lee, J. P. Robertson, and J. A. Fortes, “Generalized Cannon’s
and our labmate, Linghao Song, for their valuable comments Algorithm for Parallel Matrix Multiplication,” in ICS, 1997.
and helpful suggestions. This work is partially supported by [26] J. Cong, P. Wei, C. H. Yu, and P. Zhou, “Latte: Locality Aware
Transformation for High-Level Synthesis,” in FCCM, 2018.
a Google Faculty Award, the NSF RTML program (CCF- [27] T. Young-Schultz, L. Lilge, S. Brown, and V. Betz, “Using OpenCL
1937599), NIH Brain Initiative (U01MH117079), the Xilinx to Enable Software-like Development of an FPGA-Accelerated Biopho-
Adaptive Compute Clusters (XACC) program, and CRISP, one tonic Cancer Treatment Simulator,” in FPGA, 2020.
[28] V. Rybalkin and N. Wehn, “When Massive GPU Parallelism Ain’t
of six JUMP centers. Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network,”
in FPGA, 2020.
R EFERENCES [29] A. Sohrabizadeh, J. Wang, and J. Cong, “End-to-End Optimization of
[1] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, Deep Learning Applications,” in FPGA, 2020.
“High-Level Synthesis for FPGAs: From Prototyping to Deployment,” [30] J. De Fine Licht, G. Kwasniewski, and T. Hoefler, “Flexible Com-
TCAD, 2011. munication Avoiding Matrix Multiplication on FPGA with High-Level
[2] X. Wei, Y. Liang, and J. Cong, “Overcoming Data Transfer Bottlenecks Synthesis,” in FPGA, 2020.
in FPGA-based DNN Accelerators via Layer Conscious Memory Man- [31] J. Jiang, Z. Wang, X. Liu, J. Gómez-Luna, N. Guan, Q. Deng, W. Zhang,
agement,” in DAC, 2019. and O. Mutlu, “Boyi: A Systematic Framework for Automatically De-
[3] J. Cong and J. Wang, “PolySA: Polyhedral-Based Systolic Array Auto- ciding the Right Execution Model of OpenCL Applications on FPGAs,”
Compilation,” in ICCAD, 2018. in FPGA, 2020.
[4] Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, and [32] H. Zeng and V. Prasanna, “GraphACT: Accelerating GCN training on
Z. Zhang, “HeteroCL: A Multi-Paradigm Programming Infrastructure CPU-FPGA heterogeneous platforms,” in FPGA, 2020.
for Software-Defined Reconfigurable Computing,” in FPGA, 2019. [33] P. Papaphilippou, J. Meng, and W. Luk, “High-Performance FPGA
[5] H. R. Zohouri, A. Podobas, and S. Matsuoka, “Combined Spatial Network Switch Architecture,” in FPGA, 2020.
and Temporal Blocking for High-Performance Stencil Computation on [34] H. Chen, S. Madaminov, M. Ferdman, and P. Milder, “FPGA-
FPGAs Using OpenCL,” in FPGA, 2018. Accelerated Samplesort for Large Data Sets,” in FPGA, 2020.
[6] M. Koraei, O. Fatemi, and M. Jahre, “DCMI: A Scalable Strategy for [35] S. Margerm, A. Sharifian, A. Guha, A. Shriraman, and G. Pokam,
Accelerating Iterative Stencil Loops on FPGAs,” TACO, vol. 16, no. 4, “TAPAS: Generating Parallel Accelerators from Parallel Programs,” in
2019. MICRO, 2018.
[7] Y. Chi and J. Cong, “Exploiting Computation Reuse for Stencil Accel- [36] Z. Ruan, T. He, B. Li, P. Zhou, and J. Cong, “ST-Accel: A High-
erators,” in DAC, 2020. Level Programming Platform for Streaming Applications on FPGA,”
[8] J. de Fine Licht, A. Kuster, T. De Matteis, T. Ben-Nun, D. Hofer, and in FCCM, 2018.
T. Hoefler, “StencilFlow: Mapping Large Stencil Programs to Distributed
[37] J. Thomas, P. Hanrahan, and M. Zaharia, “Fleet: A Framework for
Spatial Computing Systems,” in CGO, 2021.
Massively Parallel Streaming on FPGAs,” in ASPLOS, 2020.
[9] Y. Chi, J. Cong, P. Wei, and P. Zhou, “SODA : Stencil with Optimized
Dataflow Architecture,” in ICCAD, 2018. [38] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson,
[10] J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, and S. Brown, and T. Czajkowski, “LegUp: High-Level Synthesis for FPGA-
M. Horowitz, “Programming Heterogeneous Systems from an Image Based Processor/Accelerator Systems,” in FPGA, 2011.
Processing DSL,” TACO, vol. 14, no. 3, 2017. [39] J. Choi, S. D. Brown, and J. H. Anderson, “From Pthreads to Multicore
[11] J. Li, Y. Chi, and J. Cong, “HeteroHalide: From Image Processing DSL Hardware Systems in LegUp High-Level Synthesis for FPGAs,” TVLSI,
to Efficient FPGA Acceleration,” in FPGA, 2020. vol. 25, no. 10, 2017.
[12] UCLA-VAST, “TAPA Sample Applications.” [Online]. Available: [40] J. Cong, M. Huang, P. Pan, D. Wu, and P. Zhang, “Software Infras-
https://github.com/UCLA-VAST/tapa/tree/master/apps tructure for Enabling FPGA-Based Accelerations in Data Centers,” in
[13] J. Cong and Z. Zhang, “An Efficient and Versatile Scheduling Algorithm ISLPED, 2016.
Based On SDC Formulation,” in DAC, 2006. [41] L. Dagum and R. Menon, “OpenMP: An Industry Standard API for
[14] J. Cheng, S. T. Fleming, Y. T. Chen, J. H. Anderson, and G. A. Shared-Memory Programming,” IEEE Computational Science and En-
Constantinides, “EASY: Efficient Arbiter SYnthesis from Multi-threaded gineering, vol. 5, no. 1, 1998.
Code,” in FPGA, 2019. [42] G. Dai, Y. Chi, Y. Wang, and H. Yang, “FPGP: Graph Processing
[15] J. Cheng, L. Josipović, G. A. Constantinides, P. Ienne, and J. Wickerson, Framework on FPGA A Case Study of Breadth-First Search,” in FPGA,
“Combining Dynamic & Static Scheduling in High-level Synthesis,” in 2016.
FPGA, 2020. [43] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang, “ForeGraph:
[16] H. Hsiao and J. Anderson, “Thread Weaving: Static Resource Scheduling Exploring Large-scale Graph Processing on Multi-FPGA Architecture,”
for Multithreaded High-Level Synthesis,” in DAC, 2019. in FPGA, 2017.
[17] A. Haj-Ali, Q. Huang, W. Moses, J. Xiang, K. Asanovic, J. Wawrzynek, [44] S. Zhou, R. Kannan, V. K. Prasanna, G. Seetharaman, and Q. Wu,
and I. Stoica, “AutoPhase: Juggling HLS Phase Orderings in Random “HitGraph: High-throughput Graph Processing Framework on FPGA,”
Forests with Deep Reinforcement Learning,” in MLSys, 2020. TPDS, 2019.
[18] Y. T. Chen, J. H. Kim, K. Li, G. Hoyes, and J. H. Anderson, “High- [45] Y. Wang, J. C. Hoe, and E. Nurvitadhi, “Processor Assisted Worklist
Level Synthesis Techniques to Generate Deeply Pipelined Circuits for Scheduling for FPGA Accelerated Graph Processing on a Shared-
FPGAs with Registered Routing,” in FPT, 2019. Memory Platform,” in FCCM, 2019.
[19] L. Guo, J. Lau, Y. Chi, J. Wang, C. H. Yu, Z. Chen, Z. Zhang, and [46] C. A. R. Hoare, “Communicating Sequential Processes,” Communica-
J. Cong, “Analysis and Optimization of the Implicit Broadcasts in FPGA tions of the ACM, vol. 21, no. 8, 1978.
HLS to Improve Maximum Frequency,” in DAC, 2020.
[47] G. Kahn, “The Semantics of a Simple Language for Parallel Program-
[20] L. Josipović, S. Sheikhha, A. Guerrieri, P. Ienne, and J. Cortadella,
ming,” in IFIP, 1974.
“Buffer Placement and Sizing for High-Performance Dataflow Circuits,”
in FPGA, 2020. [48] E. A. Lee and D. G. Messerschmitt, “Synchronous Data Flow,” IEEE,
[21] L. Guo, Y. Chi, J. Wang, J. Lau, W. Qiao, E. Ustun, Z. Zhang, vol. 75, no. 9, 1987.
and J. Cong, “AutoBridge: Coupling Coarse-Grained Floorplanning and [49] J. T. Buck, “Scheduling Dynamic Dataflow Graphs with Bounded
Pipelining for High-Frequency HLS Design on Multi-Die FPGAs,” in Memory Using the Token Flow Model,” Ph.D. dissertation, 1993.
FPGA, 2021. [50] J. L. Peterson, “Petri Nets,” ACM Computing Surveys, vol. 9, no. 3,
[22] J. Cong, P. Wei, C. H. Yu, and P. Zhang, “Automated Accelerator 1977.
Generation and Optimization with Composable, Parallel and Pipeline [51] M. Abeydeera and D. Sanchez, “Chronos: Efficient Speculative Paral-
Architecture,” in DAC, 2018. lelism for Accelerators,” in ASPLOS, 2020.
[23] Xilinx, “Vivado Design Suite User Guide: High-Level Synthesis [52] T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph
(UG902),” 2020. Convolutional Networks,” in ICLR, 2017.
[53] C. Deng, Z. Zhao, Y. Wang, Z. Zhang, and Z. Feng, “GraphZoom:
A Multi-level Spectral Approach for Accurate and Scalable Graph
Embedding,” in ICLR, 2020.
[54] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation
Ranking: Bringing Order to the Web,” Tech. Rep., 1998.
[55] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community
Structure in Large Networks: Natural Cluster Sizes and the Absence of
Large Well-Defined Clusters,” Internet Mathematics, vol. 6, no. 1, 2009.
[56] J. Mcauley, “Learning to Discover Social Circles in Ego Networks,” in
NIPS, 2012.
[57] Y. Chi, G. Dai, Y. Wang, G. Sun, G. Li, and H. Yang, “NXgraph:
An Efficient Graph Processing System on a Single Machine,” in ICDE,
2016.
[58] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and
H. Yang, “GraphH: A Processing-in-Memory Architecture for Large-
scale Graph Processing,” TCAD, 2018.
[59] Xilinx, “Vitis Accel Hello World Example.” [Online]. Available:
https://github.com/Xilinx/Vitis_Accel_Examples/blob/21bb0cf788ace59
3c6075accff7f7783588ae8b4/hello_world/src/host.cpp#L58-L115
[60] Y. Chi, Y.-k. Choi, J. Cong, and J. Wang, “Rapid Cycle-Accurate
Simulator for High-Level Synthesis,” in FPGA, 2019.
[61] Y.-k. Choi, Y. Chi, J. Wang, and J. Cong, “FLASH: Fast, ParalleL, and
Accurate Simulator for HLS,” TCAD, 2020.
[62] A. L. de Moura and R. Ierusalimschy, “Revisiting Coroutines,” TOPLAS,
vol. 31, no. 2, 2009.
[63] O. Kowalke, “Boost Library Documentation, Coroutine2,” 2014.
[Online]. Available: https://boost.org/doc/libs/1_65_0/libs/coroutine2/d
oc/html/coroutine2/intro.html
[64] A. S. Jamal, E. Cahill, J. Goeders, and S. J. E. Wilton, “Fast Turnaround
HLS Debugging using Dependency Analysis and Debug Overlays,”
TRETS, vol. 13, no. 1, 2020.
[65] D. E. Knuth, Fundamental Algorithms. The Art of Computer Program-
ming 1, 3rd ed., 1997.
[66] M. E. Conway, “Design of a Separable Transition-Diagram Compiler,”
Communications of the ACM, vol. 6, no. 7, 1963.
[67] E. Bendersky, “Measuring context switching and memory overheads
for Linux threads,” 2018. [Online]. Available: https://eli.thegreenplace.
net/2018/measuring-context-switching-and-memory-overheads-for-linu
x-threads/
[68] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” in ICLR, 2015.
[69] D. H. Lawrie, “Access and Alignment of Data in an Array Processor,”
ToC, vol. C-24, no. 12, 1975.
[70] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis,
J. Wawrzynek, and K. Asanović, “Chisel: Constructing Hardware in a
Scala Embedded Language,” in DAC, 2012.
[71] T. Schmidt, G. Liu, and R. Dömer, “Exploiting Thread and Data Level
Parallelism for Ultimate Parallel SystemC Simulation,” in DAC, 2017.
[72] M. K. Chung, J. K. Kim, and S. Ryu, “SimParallel: A High Performance
Parallel SystemC Simulator Using Hierarchical Multi-threading,” in
ISCAS, 2014.
From Software to Accelerators with LegUp High-Level Synthesis
Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort,
Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, Jason Anderson
ECE Department, University of Toronto, Toronto, ON, Canada
legup@eecg.toronto.edu
[Figure 1: LegUp design flow — program code is compiled and profiled on the µP; profiling data (execution cycles, power, cache misses) identify suggested program segments to target to hardware; high-level synthesis produces hardened program segments on the FPGA fabric, together with an altered SW binary that calls the HW accelerators.]
2.1 LegUp System Architecture
LegUp can target two Altera FPGAs: the Cyclone II on the Altera DE2 board [5], and the Stratix IV on the Altera DE4 board [6]. The target system architecture is shown in Fig. 2. The system comprises the MIPS soft processor, hardware accelerators, on-chip cache, as well as off-chip memory (8MB SDRAM on the DE2 board or 2GB DDR2-SDRAM on the DE4 board). An accelerator may have local memories for storing data that is not shared with the processor or other accelerators. These local memories are implemented in on-chip block RAMs, instantiated within a hardware accelerator. Data shared between the processor and hardware accelerators is stored in off-chip memory, which can be accessed using the on-chip cache. The components of the system communicate via the Avalon Interconnect, Altera's on-chip interface, which is generated automatically by Altera's SOPC Builder tool [7]. Avalon is a point-to-point network, which allows multiple independent transfers to occur simultaneously via memory-mapped addresses. When multiple components are connected to a single component, such as the on-chip data cache, a round-robin arbiter is generated to arbitrate among simultaneous accesses.

2.2 Multi-ported caches
When many accelerators are operating in parallel, memory bandwidth can easily become a performance bottleneck. The on-chip RAMs on current commercial FPGAs have two ports, meaning that for a given memory block, there can only be up to two memory accesses at a time. However, for systems with many accelerators that need to access memory concurrently, two ports may not be adequate and cache accesses may limit performance. A typical way to increase memory bandwidth is to use multiple coherent memory blocks, with extra circuitry to manage memory coherency between the memory blocks. However, by implementing memory coherency we add area and latency overhead. Thus, we take an alternate approach, where we implement multi-ported memories (that have more than 2 ports) using existing dual-ported memory blocks. We can then use these multi-ported memories to implement multi-ported caches suitable for many-accelerator systems.
We have investigated two types of multi-ported caches, called the LVT cache and the MP cache [10], both of which allow multiple concurrent accesses to all regions of the cache in every clock cycle. The LVT cache is based on memory replication, whereas the MP cache uses memory multi-pumping (operating the memory at a higher clock rate than the surrounding system). The main advantage of both cache architectures is that they offer higher on-chip memory bandwidth than what is typically available on the FPGA fabric, while providing a shared memory space which acts as a single piece of memory. These caches also require no cache coherency scheme, avoiding the area and latency costs for synchronization.

3. Hardware/Software Partitioning
With the LegUp design methodology, the program is partitioned into both a hardware portion and a software portion. The chosen partitioning depends on the designer's objective, which is often to reduce overall execution time. Towards this goal, the MIPS soft processor contains a hardware profiler to determine which sections of the original program are taking the most execution time. LegUp can also estimate the speedup associated with migrating a particular program segment into hardware versus leaving it in software.
3.1 Hardware Profiling
The hardware profiler in the MIPS soft processor is called LEAP, which stands for Low-overhead and Extensible Architecture for Profiling [2]. For each function in a program, the profiler can be used to quickly and accurately obtain the exact number of clock cycles spent executing the function. In software-based profiling, the program being profiled must be modified with instrumentation to gather profiling data during its execution. In contrast, our hardware-based approach allows the program to execute in its original unmodified form at full speed on the processor. The MIPS processor is augmented with additional circuitry that automatically gathers profiling data as the program executes. Such hardware profiling is superior in speed and accuracy when compared to software profiling.

[Figure: flow chart — monitor the instruction and PC; on a PC change, check "Is call?" / "Is return?"; on a call, hash the function number, store/update the Data Counter, push the function number onto a stack, and reset the Data Counter; on a return, store/update the Data Counter, pop the function number off the stack, and reset the Data Counter; otherwise increment the Data Counter.]
Figure 3. High-level flow chart for instruction-count profiling.

The high-level operation of LEAP is shown in Fig. 3. LEAP profiles the execution of the program by monitoring the processor's program counter and instruction bus. During execution, LEAP maintains a counter, called a Data Counter, that tracks the number of times an event has occurred. Two modes are available: the profiler can count dynamic instructions, or clock cycles.
LEAP organizes the collected data on a per-function basis by allocating a storage counter for each software function. LEAP identifies function boundaries by decoding (in hardware) the executing instruction to determine if it is a function call or return. If a call is detected, the Data Counter is added to any previously stored values associated with the function containing the call instruction (from previous invocations of the function). The Data Counter is then reset to 0 to begin counting the events in the called function. If a function return is detected, the Data Counter value is added to the counter associated with the current function, and once again the Data Counter is reset.
In order to determine the counter associated with a particular function, other hardware profilers, such as SnoopP [30] (a hardware profiler for FPGA-based processors), use a large number of comparators to associate program counter address ranges with individual counters. A novel aspect of LEAP is the use of perfect hashing hardware to associate function addresses with counters. A set of hashing parameters is generated during the software compilation stage (step ➀ in Fig. 1) and used to configure the profiler on the FPGA. No modifications of the hardware profiler circuit (e.g., resynthesis or reprogramming) are needed to profile a new program. The use of hashing leads to significantly less hardware overhead when compared to other hardware profilers. Specifically, relative to SnoopP, our design requires up to 18× less area [2].
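A small software model of the per-function bookkeeping described above; it uses an ordinary hash map in place of LEAP's compile-time perfect hash and counts abstract events, so it only illustrates the call/return counter updates.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Software model of LEAP-style per-function event counting.
// On a call: charge the accumulated count to the caller, push, reset.
// On a return: charge the count to the returning function, pop, reset.
struct Profiler {
  std::unordered_map<uint32_t, uint64_t> per_function;  // addr -> events
  std::vector<uint32_t> call_stack;
  uint64_t data_counter = 0;

  void OnEvent() { ++data_counter; }  // e.g., one clock cycle or instruction

  void OnCall(uint32_t callee_addr) {
    if (!call_stack.empty()) per_function[call_stack.back()] += data_counter;
    call_stack.push_back(callee_addr);
    data_counter = 0;
  }
  void OnReturn() {
    per_function[call_stack.back()] += data_counter;
    call_stack.pop_back();
    data_counter = 0;
  }
};

int main() {
  Profiler p;
  p.OnCall(0x100);                        // enter main
  for (int i = 0; i < 3; ++i) p.OnEvent();
  p.OnCall(0x200);                        // enter helper
  for (int i = 0; i < 5; ++i) p.OnEvent();
  p.OnReturn();                           // back to main
  p.OnEvent();
  p.OnReturn();
  for (auto& [addr, events] : p.per_function)
    std::printf("func 0x%x: %llu events\n", addr, (unsigned long long)events);
}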
3.2 Accelerator Speedup Prediction
By using the LEAP profiler, the user can identify time-consuming program segments. However, these compute-intensive functions may not be suitable for hardware acceleration, perhaps because they contain a sequential algorithm with minimal instruction-level parallelism or because they are too memory intensive. Ideally, we would know exactly how much execution time would be saved by synthesizing a segment of software code into a hardware accelerator.
One way we could gauge the speedup achieved by hardware acceleration is to actually convert the segments into hardware circuits and then run the program on the board to measure the results. But this approach is too time consuming if there are many alternatives to investigate, as it requires running FPGA synthesis and place-and-route tools for each alternative. Alternately, one could run an RTL simulation to measure the execution time, in cycles, of the final hybrid system. However, this method is also time-consuming and becomes infeasible in real applications.
To aid in the task of software/hardware partitioning, LegUp provides an estimate of the total number of clock cycles consumed by a function if it is accelerated in hardware, which can then be compared to the LEAP profiling results described above. This approach uses profiling (in software) to estimate the execution flow of the processor/accelerator hybrid system and then uses early high-level synthesis scheduling information to predict the number of cycles required by portions of the program after being synthesized to hardware.
When synthesizing a software program into a hybrid system, LegUp replaces the functions being accelerated with wrapper functions to enable communication between the processor and accelerators. The call to a wrapper function starts the accelerator's execution, and the return from the wrapper function indicates that the accelerator has finished its execution. Our estimation approach considers the hardware cycles spent on three operations: 1) the execution of the hardware accelerator, 2) the accelerator's initialization performed by the software wrapper function, and 3) reads and writes to the shared memory space.
We estimate the cycles taken during the accelerator's execution in two steps. First, we perform HLS scheduling for the accelerated function to determine the number of clock cycles required for each basic block in the function. A basic block is a contiguous set of instructions with a single entry point (at its beginning) and exit point (at its end). Next, we execute the program in software using representative inputs to estimate the number of times each basic block is executed. Finally, we estimate the total cycle count of the accelerated function by multiplying the estimated number of times each basic block is executed by the number of clock cycles required by the corresponding basic block in its schedule.
To estimate the time taken by the software wrapper function running on the processor, we count the number of instructions required by the wrapper. The instruction count is sufficient for wrapper function estimation, as the wrapper function is small and only contains simple operations to communicate with hardware accelerators, and we have found empirically that the instructions-per-cycle of the MIPS processor is close to one.
We estimate the cycles spent accessing shared memory in three steps. First, we run the program with a representative set of inputs using a MIPS emulator to determine the address sequence accessed by the software program (without hardware accelerators). Next, we predict the address sequence accessed by the hybrid processor/accelerator system by eliminating any addresses that are stored in local memory of the hardware accelerators. Then, we use a cache simulator to determine the number of cache hits and misses. Finally, we use the estimated cost of a cache hit or miss (in cycles) to predict the total cycles spent on shared memory accesses.
Experimental results show that our approach has an average error rate of about 7% compared to the results obtained from RTL simulation, but with 184× less run-time on average.
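The cycle-estimation arithmetic of Section 3.2 can be summarized as: estimated accelerator cycles = sum over basic blocks of (profiled execution count × scheduled cycles per execution), plus the wrapper and shared-memory estimates. The sketch below uses made-up numbers purely for illustration.

#include <cstdint>
#include <cstdio>
#include <vector>

struct BasicBlock {
  uint64_t sched_cycles;  // cycles per execution, from early HLS scheduling
  uint64_t exec_count;    // executions observed with representative inputs
};

// Estimated accelerator cycles = sum_b exec_count(b) * sched_cycles(b).
uint64_t EstimateAcceleratorCycles(const std::vector<BasicBlock>& blocks) {
  uint64_t total = 0;
  for (const auto& b : blocks) total += b.exec_count * b.sched_cycles;
  return total;
}

int main() {
  // Hypothetical function with three basic blocks.
  std::vector<BasicBlock> f = {{4, 1}, {6, 100}, {2, 100}};
  uint64_t accel = EstimateAcceleratorCycles(f);  // 4 + 600 + 200 = 804
  uint64_t wrapper = 25;      // ~instruction count of the software wrapper
  uint64_t shared_mem = 300;  // from cache-simulated hits and misses
  std::printf("estimated cycles: %llu\n",
              (unsigned long long)(accel + wrapper + shared_mem));
}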
3.3 Partitioning Example
An example of hardware/software partitioning is provided in Table 1 for four functions of the jpeg benchmark in the CHStone […]
4.5 Multi-Pumping
For applications that involve many multiplication operations, LegUp uses a new approach to resource sharing that allows multiple operations to be performed by a single multiply functional unit in one clock cycle [9]. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2×, allowing multiple computations to complete in a single system cycle. This method is particularly effective for the DSP blocks on modern FPGAs. The hardened DSP blocks in modern FPGAs can operate at speeds exceeding 500 MHz, whereas typical system speeds are less than 300 MHz. We have found that multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact on circuit performance. For a given constraint on the number of DSPs, multi-pumping can deliver considerably higher performance than resource sharing. Empirical results over digital signal processing benchmarks show that multi-pumping achieves the same DSP reduction as resource sharing, but with a lower impact on circuit performance: decreasing circuit speed by only 5% instead of 80%.
use software techniques to specify parallelism to the LegUp HLS
4.6 Bitwidth Minimization tool, with the tool then implementing the specified parallelism in a
Software programs today use standard datatypes that are 8, 16, 32, hardware circuit.
or 64-bits in length. As such, programs are over engineered in the LegUp provides support for two standard parallel programming
sense that variables are frequently represented using more bits than methodologies which software engineers are likely familiar with –
are actually required, e.g. a 32-bit int datatype may be used for a Pthreads and OpenMP. Parallelism described in the software code
loop index that is known to have a range from 0 to 100. Because is automatically synthesized into parallel hardware accelerators that
processor datapaths are of fixed widths, there is little to be gained in perform the corresponding computations concurrently. Parallel pro-
term’s a software program’s performance by optimizing bitwidths. gramming in software often requires the use of synchronization
However, in HLS, hardware quality (area, speed and power) is constructs that, for example, manage which threads may execute
impacted considerably by the bit-level representation of program a given code segment at any given moment. Recognizing this, we
variables. also provide HLS support for two key thread synchronization con-
LegUp uses two strategies to statically (i.e. at compile time) structs in the Pthreads/OpenMP library: mutexes and barriers. The
or dynamically (i.e. using run-time profiling) determine minimized approach we take is to automatically instantiate parallel hardware
representations of variables: 1) range analysis and 2) bitmask anal- for parallel threads. That is, each software thread is mapped auto-
ysis. Range analysis seeks to determine the maximum and mini- matically into a hardware accelerator. The remaining (sequential)
mum values that variables take on in a program’s execution and in portions of the program are executed in software on the MIPS soft
so doing, bound the number of bits required to represent the vari- processor.
able. Variable ranges can be deduced from constants in the source Table 4 shows a list of Pthreads and OpenMP library func-
code, and then propagated through a program’s control-dataflow tions which are currently supported by LegUp. In addition to those
graph to infer ranges for other variables. Bitmask analysis, on the listed in the table, OpenMP clauses to set the number of threads
other hand, seeks to characterize the individual bits in a variable. (num threads), the scopes of variables (e.g. public, private)
For example, assume that A and B are unknown 16-bit values and and the division of work among threads (static scheduling of any
consider the C-language statement: Z = A & (B << 2). In this chunk size) are also supported. Note that all of the OpenMP/Pthreads
case, the two right-most bits of Z are guaranteed to be logic-0 and functions in Table 4 are automatically compiled in our framework,
this property can be applied to minimize the size of hardware that requiring no manual code changes by the user. Meaning that, the
uses Z as an operand (e.g. if Z feeds into a multiplier, the two right- input C program with calls to the Pthreads/OpenMP API can be
most bits of the product are guaranteed to be logic-0). Note that compiled to a hybrid processor/accelerator system as is. The com-
while bitmask analysis guarantees that Z’s two LSBs are 0, range plete system, including the MIPS processor, on-chip cache, off-chip
analysis can infer nothing regarding Z’s min and max values. The memory controller, as well as parallel accelerators, can be created
two forms of analysis thus offer complementary information. with a single make target.
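The following short C++ example illustrates the style of coarse-grained parallel code that a Pthreads-based HLS flow such as the one described above accepts; each thread would be mapped to its own accelerator. The kernel, array sizes, and thread count are invented purely for illustration and are not taken from the LegUp distribution.

#include <pthread.h>
#include <stdint.h>

// Hypothetical example of coarse-grained parallelism expressed with Pthreads:
// each thread sums one slice of an array and could become a hardware accelerator,
// while the sequential portion runs on the soft processor.
#define N_THREADS 4
#define N 1024

static int32_t data[N];
static int64_t partial[N_THREADS];

static void *sum_slice(void *arg) {
    long id = (long)arg;                                  // thread index selects a data slice
    int64_t acc = 0;
    for (long i = id * (N / N_THREADS); i < (id + 1) * (N / N_THREADS); i++)
        acc += data[i];
    partial[id] = acc;
    pthread_exit(NULL);
}

int64_t parallel_sum(void) {
    pthread_t threads[N_THREADS];
    for (long t = 0; t < N_THREADS; t++)
        pthread_create(&threads[t], NULL, sum_slice, (void *)t);  // one worker per thread
    int64_t total = 0;
    for (long t = 0; t < N_THREADS; t++) {
        pthread_join(threads[t], NULL);                   // wait for each worker to finish
        total += partial[t];
    }
    return total;
}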
Table 3. Bitwidth minimization Cyclone II implementation results.

Benchmark | LUTs (Baseline / Bitmask+Range / Dynamic+Bitmask) | Registers (Baseline / Bitmask+Range / Dynamic+Bitmask) | FMax in MHz (Baseline / Bitmask+Range / Dynamic+Bitmask)
dhrystone | 5244 / 4120 / 3738 | 3575 / 3131 / 2438 | 117.94 / 114.09 / 115.96
fft | 2046 / 2043 / 1880 | 1048 / 1028 / 746 | 92.89 / 91.3 / 91.3
adpcm | 21695 / 18631 / 7036 | 11039 / 10020 / 4291 | 55.46 / 56.04 / 56.16
aes | 19784 / 15792 / 8871 | 11470 / 9162 / 4066 | 49.38 / 49.82 / 46.47
blowfish | 10621 / 10590 / 10296 | 7412 / 7353 / 7040 | 75.41 / 73.61 / 71.62
gsm | 9787 / 9645 / 7807 | 6612 / 6487 / 5029 | 33.2 / 32.39 / 32.98
jpeg | 33618 / 31083 / 22057 | 20688 / 19388 / 11885 | 18.02 / 17.53 / 19.15
mips | 3384 / 3358 / 2116 | 1620 / 1590 / 999 | 98.8 / 95.56 / 110.22
motion | 4054 / 4020 / 2946 | 2526 / 2526 / 1656 | 112.18 / 111.83 / 125.85
sha | 10686 / 8243 / 7612 | 7779 / 5838 / 5371 | 99.42 / 106.68 / 109.42
Geomean | 8655 / 7838 / 5711 | 5230 / 4794 / 3217 | 65.7 / 65.2 / 67.3
Ratio | 1.00 / 0.91 / 0.66 | 1.00 / 0.92 / 0.62 | 1.00 / 0.99 / 1.02
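To make the bitmask analysis of Section 4.6 (whose effect is measured in Table 3) concrete, the short sketch below propagates known-zero bits through the example statement Z = A & (B << 2). This is an illustrative stand-alone program, not LegUp's implementation; a mask bit of 1 means "this bit is guaranteed to be logic-0".

#include <cstdint>
#include <cstdio>

// Known-zero-bit propagation for Z = A & (B << 2), with A and B unknown 16-bit values.
int main() {
    uint16_t zeroA = 0x0000;                                   // A: nothing known
    uint16_t zeroB = 0x0000;                                   // B: nothing known
    uint16_t zeroShift = (uint16_t)((zeroB << 2) | 0x0003);    // B << 2: the two LSBs become 0
    uint16_t zeroZ = zeroA | zeroShift;                        // AND: a bit is 0 if it is 0 in either operand
    printf("known-zero mask of Z = 0x%04X\n", zeroZ);          // prints 0x0003: Z's two LSBs are 0
    return 0;
}

As noted in the text, range analysis would learn nothing here, since A and B are unbounded; the two analyses are complementary.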
Figure 6. (a) Schedule Gantt chart, (b) control flow graph, (c) loop pipeline schedule.

Table 4. Supported Pthreads functions/OpenMP pragmas.

Pthreads Functions | Description
pthread_create(..) | Invoke thread
pthread_join(..) | Wait for thread to finish
pthread_exit(..) | Exit from thread, can be used to return data
pthread_mutex_lock(..) | Lock mutex
pthread_mutex_unlock(..) | Unlock mutex
pthread_barrier_init(..) | Initialize barrier
pthread_barrier_wait(..) | Synchronize on barrier object

OpenMP Pragmas | Description
omp parallel | Parallel section
omp parallel for | Parallel for loop
omp master | Parallel section executed by master thread only
omp critical | Critical section
omp atomic | Atomic section
reduction(operation: var) | Reduce a var with operation

OpenMP Functions | Description
omp_get_num_threads() | Get number of threads
omp_get_thread_num() | Get thread ID

6. Visualization and Debugging
LegUp provides visualization tools for analyzing the internal HLS algorithms. For instance, we have a graphical viewer for the scheduling report file produced by LegUp that shows a Gantt chart of the scheduled instructions for the program and also can visualize loop pipeline scheduling. Fig. 6 shows three screenshots of the LegUp visualization tool for a matrix multiply kernel. Fig. 6a shows a Gantt chart for LegUp's high-level synthesis schedule. On the left side, the "Explorer" panel lists each basic block for each function, in this case the user has selected the basic block labeled "BB 1". In the "Schedule Chart" window pane the schedule viewer gives a list of all LLVM instructions inside the selected basic block. Each LLVM instruction corresponds to a hardware operation in the synthesized circuit. The user can highlight any instruction to display the data dependencies between all predecessor and successor instructions. Fig. 6b shows the control flow graph for the kernel, where each node in the graph is a basic block. Fig. 6c shows the loop pipeline schedule after the basic block has been pipelined. The pipeline initiation interval is two, which means a new loop iteration begins every two clock cycles. The area highlighted in black is the steady-state operation of the pipeline; observe that three iterations of the loop are executing in parallel.
In addition to visualization, we have been focusing recently on adding debugging capabilities to LegUp. Debugging tools are ubiquitous in the software development community because they raise productivity by providing insight into the execution state as a program executes. In contrast, most hardware designers are accustomed to using simulation waveforms to debug their digital circuits. With LegUp, we want to bridge this gap by offering users a software-like debugging platform for the hybrid hardware/software coprocessor system. LegUp's debugging platform will help developers gain insight into problems with their applications at a higher level of abstraction than traditional RTL simulation and waveform analysis.
To implement the debugger, LegUp leverages the LLVM compiler debugging meta-data, which maps each C statement to a set of one or more simple instructions in LLVM's intermediate representation (IR). Fig. 7 depicts this mapping. Next, we map the IR instructions to LegUp-synthesized hardware elements. Each LLVM
IR instruction is scheduled to run in one or more states of the finite state machine. Also, each IR instruction can be synthesized into several hardware units and signals. Some hardware signals, such as the memory controller signals, can be shared between multiple instructions, depending on the state.

Figure 7. Mapping from C statements to LLVM intermediate representation instructions.

Our goal is to have an integrated debugging system that is capable of capturing, and displaying to the user, hardware signals while the hybrid processor/accelerator system runs on the board. Fig. 8 shows a screenshot of the LegUp debugging platform, which is "work-in-progress". Currently, the debugging platform is for simulation only; that is, we communicate with the simulation tool, ModelSim, to inspect signal values and control the simulation cycle by cycle. By examining the state of the finite state machine, we can detect the current state being executed and highlight the active C statements associated with the current state. There may be more than one active C statement per state, due to the instruction-level parallelism in hardware (see Fig. 8). By clicking on a C statement, the corresponding synthesized Verilog code is highlighted. Single-stepping is supported, which runs the circuit simulation until the next C statement is reached. Note that C statements may take more than one clock cycle to complete. Developers can also step over a C statement to reach the next executing statement, or can step into a C statement to see IR-level and hardware-level details related to that statement on a cycle-by-cycle basis. Hardware signal names and current values are displayed based on the circuit's current state so that developers can track signal value changes (right panel in the figure).
The LegUp debugging platform is still under development. Supporting break-points, enabling the debugging of hybrid processor/accelerator applications and on-chip hardware debugging are all future work.

Figure 8. Screenshot of debugging platform.

7. Conclusion
LegUp is a high-level synthesis (HLS) framework that allows software methodologies to be used for the synthesis of a hybrid system comprising an embedded processor and one or more FPGA-based accelerators. Since the original LegUp release in March 2011, it has been downloaded over 600 times by researchers around the world (at the time of writing). As described in this paper, the current LegUp 3.0 release includes functionality to assist with hardware/software partitioning, multi-ported caches to ease memory bottlenecks, support for Pthreads and OpenMP, and improvements to the core HLS algorithms, including loop pipelining, multi-pumping, bitwidth optimization, and tools to select profitable compiler optimization passes to improve hardware quality. One of the few open-source frameworks of its kind, we hope the tool will be useful to the embedded systems research community as a platform to explore new design methodologies and synthesis strategies. The LegUp project website, http://legup.eecg.toronto.edu, includes documentation, tutorials on how to use and modify the tool, related publications, as well as links to download the source code.

8. Acknowledgements
The financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and Altera Corporation is gratefully acknowledged.

References
[1] The OpenCL specification, version 1.0, document revision 48, 2009.
[2] M. Aldham, J. Anderson, S. Brown, and A. Canis. Low-cost hardware profiling of run-time and energy in FPGA embedded processors. In IEEE ASAP, pages 61–68, 2011.
[3] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. Finding effective compilation sequences. In ACM LCTES, pages 231–239, 2004.
[4] Cyclone-II Data Sheet. Altera Corp., San Jose, CA, 2004.
[5] DE2 Development and Education Board. Altera Corp., San Jose, CA, 2010.
[6] DE4 Development Board. Altera Corp., San Jose, CA, 2010.
[7] SOPC Builder User Guide. Altera Corp., San Jose, CA, 2010.
[8] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski. LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In ACM/SIGDA FPGA, pages 33–36, 2011.
[9] A. Canis, J. H. Anderson, and S. D. Brown. Multi-pumping for resource reduction in FPGA high-level synthesis. In IEEE DATE, pages 194–197, 2013.
[10] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. Impact of cache architecture and interface on performance and area of FPGA-based processor/parallel-accelerator systems. In IEEE FCCM, pages 17–24, 2012.
[11] J. Cong and Z. Zhang. An efficient and versatile scheduling algorithm based on SDC formulation. In ACM DAC, volume 43, pages 433–438, 2006.
[12] J. Cong and Y. Zou. FPGA-based hardware acceleration of lithographic aerial image simulation. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2(3):1–29, 2009.
[13] P. Coussy, D. Gajski, M. Meredith, and A. Takach. An introduction to high-level synthesis. IEEE Design & Test of Computers, 26(4):8–17, July 2009.
[14] M. Gort and J. H. Anderson. Range and bitmask analysis for hardware optimization in high-level synthesis. In ASP-DAC, pages 773–779, 2013.
[15] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis. Journal of Information Processing, 17:242–254, 2009.
[16] Calypto Catapult. http://calypto.com/en/products/catapult/overview, 2013.
[17] OpenCL for Altera FPGAs. http://www.altera.com/products/software/opencl/opencl-index.html, 2013.
[18] C-to-Verilog. http://www.c-to-verilog.com, 2013.
[19] Forte Design Systems – The high level design company. http://www.forteds.com/products/cynthesizer.asp, 2013.
[20] LLVM Compiler Infrastructure Project. http://www.llvm.org, 2010.
[21] Xilinx: Vivado Design Suite. http://www.xilinx.com/products/design-tools/vivado/vivado-webpack.htm, 2013.
[22] C. Huang, Y. Che, Y. Lin, and Y. Hsu. Data path allocation based on bipartite weighted matching. In ACM/IEEE DAC, pages 499–504, 1990.
[23] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson. The effect of compiler optimizations on high-level synthesis for FPGAs. In IEEE FCCM, pages 89–96, 2013.
[24] H. Kuhn. The Hungarian method for the assignment problem. In 50 Years of Integer Programming 1958–2008, pages 29–47. Springer, 2010.
High Level Synthesis Based Hardware Accelerator Design for Processing SQL
Queries
2. ACCELERATOR IMPLEMENTATION
In order to support full and complex database analytics in hardware, we have focused on accelerating data filtering, arithmetic, logic, sorting, aggregation and equi-join operations. All the units are designed to work at 200 MHz on a Virtex-7 xc7vx690tffg1761-2 FPGA.
2.1 Data Filtering, Arithmetic and Logic Operations
Database filtering operations are relational operations that test numerical or logical relations between columns, numerical and/or boolean values. For this purpose, we designed a pipelined, parametrizable-width, n-way compute engine that takes rows as inputs, applies a filtering operation on the desired columns and produces an output bitmap. This bitmap determines the selected rows for further processing after the filtering operation. The main importance of filtering operations in an SQL query is to filter out unwanted data from further processing, thus reducing the size of the input set. The most important design choice for filtering operations is selecting the correct parallelism for the maximum utilization of memory bandwidth. Similarly, we have designed pipelined, parametrizable, n-way arithmetic and logical compute engines. The arithmetic compute engine supports integer ADD, SUB, MULT and DIV operations, whereas the logical compute engine supports the logical AND, OR and NAND operations.
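The following simplified C++ sketch is written in the spirit of the n-way filtering engine described above: it compares one column of N rows against a constant and emits a selection bitmap. The row layout, operator set, and parallelism factor are illustrative assumptions, not the authors' implementation; in an HLS flow the loop would be unrolled and pipelined.

#include <cstdint>
#include <array>

// Simplified sketch of an n-way row-filtering engine producing a selection bitmap.
constexpr int N_WAY  = 8;   // rows processed per call (parallelism factor, illustrative)
constexpr int N_COLS = 4;   // columns per row (illustrative)

using Row = std::array<int32_t, N_COLS>;

enum class CmpOp { LT, LE, EQ, GE, GT };

// Returns an N_WAY-bit bitmap: bit i set means row i satisfies the predicate.
uint32_t filter_rows(const std::array<Row, N_WAY>& rows,
                     int col, CmpOp op, int32_t constant) {
    uint32_t bitmap = 0;
    for (int i = 0; i < N_WAY; ++i) {        // unrolled/pipelined by the HLS tool
        int32_t v = rows[i][col];
        bool hit = false;
        switch (op) {
            case CmpOp::LT: hit = v <  constant; break;
            case CmpOp::LE: hit = v <= constant; break;
            case CmpOp::EQ: hit = v == constant; break;
            case CmpOp::GE: hit = v >= constant; break;
            case CmpOp::GT: hit = v >  constant; break;
        }
        bitmap |= static_cast<uint32_t>(hit) << i;
    }
    return bitmap;
}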
Figure 5: Runtimes for 3 queries (in ms, log scale)

tial and temporal planning which enables/disables compute units. The biggest difference of ASIC design compared to HLS is the flexibility. HLS can adjust to different sizes of blocks such as database columns by extending or shrinking data sizes in high-level source code. Hence, it is a more flexible solution for hardware design.
The authors in [12] discuss efficient methodologies for decoupling accelerators from their host. In contrast, our work presents in-memory database acceleration where the memory is controlled by a host system. Thus, in this manner, our accelerator can be classified as a tightly coupled accelerator.
Complementary to our work, the authors in [4] discuss implementation challenges of merge sort and join operations. They have designed and customized a merge-sort join implementation for a specific platform and have studied scaling and parallelization capabilities. Our work focuses on using HLS in the database analytics domain without any custom enhancements based on the underlying architecture.
While HLS produces hardware from high-level software, Glacier compiles VHDL code from algebraic expressions [11]. This adds additional steps in the system design because algebraic expressions must be created from SQL expressions or they are taken directly from a query planner. Our approach binds accelerators to SQL operators semantically. Then, the query plan is made accordingly. In this work, the query plans are generated manually. The query plan creation is out of the scope of this work. We plan to automate our resource allocation in the near future.
Runtime query processing has been made possible by the runtime reconfiguration capabilities of FPGAs. The authors of [7] have built a database operations library which at runtime forms the data path based on the given SQL query. They have focused on data filtering operations. The main advantage of runtime reconfiguration is to eliminate the synthesis of queries, if the available runtime operator library can execute the given query. The flexibility that HLS provides is at compile time rather than runtime. Hence, there is a possibility to combine runtime reconfiguration with HLS technology. Although our design could be enhanced by the use of dynamic reconfiguration, it does not necessarily require it, since the static design is able to work with different parameters and arguments, and it can fit in our FPGA.

5. CONCLUSIONS
As our results demonstrate while simulating an in-memory database accelerator, HLS tools can present high performance gains running complete database queries and are a promising way to address the big data explosion. Although designing most of the database operations is straightforward in HLS, certain issues such as the memory–accelerator communication synthesis might require handwritten

6. ACKNOWLEDGEMENT
Funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement No 318633, UPC project TIN2012-34557, the Turkish Ministry of Development under the TAM Project, number 2007K120610, as well as the Severo Ochoa Mobility grant program support was received.

7. REFERENCES
[1] K. E. Batcher. Sorting networks and their applications. In Proc. Spring Joint Computer Conference, pages 307–314. ACM, 1968.
[2] A. Becher, F. Bauer, D. Ziener, and J. Teich. Energy-aware SQL query acceleration through FPGA-based dynamic partial reconfiguration. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pages 1–8. IEEE, 2014.
[3] Canis et al. LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.
[4] J. Casper and K. Olukotun. Hardware acceleration of database operations. In FPGA '14, pages 151–160.
[5] Chung et al. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 97–106. ACM, 2011.
[6] T. P. P. Council. TPC-H benchmark specification. Published at http://www.tpc.org/tpch/spec/tpch2.6.0.pdf, 2008.
[7] C. Dennl, D. Ziener, and J. Teich. On-the-fly composition of FPGA-based SQL query accelerators using a partially reconfigurable module library. In Proc. FCCM, pages 45–52. IEEE, 2012.
[8] Halstead et al. Accelerating join operation for relational databases with FPGAs. In Proc. FCCM, pages 17–20, 2013.
[9] István et al. A flexible hash table design for 10 Gbps key-value stores on FPGAs. In FPL, pages 1–8, 2013.
[10] D. Koch and J. Torresen. FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In FPGA '11, pages 45–54.
[11] R. Mueller, J. Teubner, and G. Alonso. Glacier: A query-to-hardware compiler. In SIGMOD '10, pages 1159–1162.
[12] A. Parashar et al. Triggered instructions: A control paradigm for spatially-programmed architectures. SIGARCH Comput. Archit. News, pages 142–153.
[13] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross. Q100: The architecture and design of a database processing unit. In ASPLOS '14, pages 255–268.
ABSTRACT Hardware accelerators based on field programmable gate array (FPGA) and system on chip
(SoC) devices have gained attention in recent years. One of the main reasons is that these devices contain
reconfigurable logic, which makes them feasible for boosting the performance of applications. High-level
synthesis (HLS) tools facilitate the creation of FPGA code from a high level of abstraction using different
directives to obtain an optimized hardware design based on performance metrics. However, the complexity
of the design space depends on different factors such as the number of directives used in the source code,
the available resources in the device, and the clock frequency. Design space exploration (DSE) techniques
comprise the evaluation of multiple implementations with different combinations of directives to obtain
a design with a good compromise between different metrics. This paper presents a survey of models,
methodologies, and frameworks proposed for metric estimation, FPGA-based DSE, and power consumption
estimation on FPGA/SoC. The main features, limitations, and trade-offs of these approaches are described.
We also present the integration of existing models and frameworks in diverse research areas and identify the
different challenges to be addressed.
INDEX TERMS Computing models, design space exploration, field programmable gate array (FPGA),
system on chip (SoC), power consumption.
R. S. Molina et al.: High-Level Synthesis Hardware Design for FPGA-Based Accelerators: Models, Methodologies, and Frameworks
HLS tools support C/C++, SystemC, and OpenCL [12] codes to generate the final RTL code. These tools provide the designer with a detailed report for each algorithmic solution, including information about the estimation of latency, resource utilization (also known as area occupied), and throughput. The use of directives allows code optimization through parallel techniques, such as loop pipelining, loop unrolling, array partitioning, and array reshaping. For each solution, the designer can specify different combinations of directives; comparing the reports provided by these tools, the best option can be determined according to different performance metrics.
Furthermore, these tools allow a design space exploration (DSE), which involves the evaluation of multiple implementations with different combinations of user design constraints, FPGA features, and directives (also known as knobs or optimizations). Setting these optimizations to obtain a hardware design with the desired characteristics is a problem that increases exponentially as the designer applies more directives, and the program has more complex code structures. The generated hardware is directly associated with the applied directives, but sometimes applying and tuning directives requires a considerable endeavour to obtain a proper hardware implementation. An optimal DSE process grants a hardware design with a good compromise between metrics such as latency, area, throughput, and power consumption.
Over the years, parallel computing models have proven their benefits across different architectures, such as clusters of distributed processors with single cores and multicores, GPU, and cloud. These models act as a bridge between the architecture and the software developer. The current trend in parallel computer architectures demonstrates progress toward hybrid architectures combining many cores, superscalars, single instruction/multiple data (SIMD), hardware accelerators, and on-chip communication systems, among others, which require handling computations and data locality at several levels to achieve suitable performance [13].
Using computing models, and also methodologies and frameworks, to predict the performance of FPGA/SoC architectures may reduce design times and improve productivity, which are critical issues when choosing these architectures. In this survey, a model is an abstraction that represents a simplified system. A methodology describes the steps involved in the process for systematically solving a problem. A framework provides the structure needed in the form of a template or conceptual scheme to simplify the elaboration of a task.

A. CONTRIBUTION
In this paper, we present a thorough analysis of the computing models, methodologies, and frameworks proposed for reconfigurable hardware accelerators based on FPGA. We compare their main features, including the inputs, outputs, and techniques employed for their development. Then, we show how these approaches for FPGA/SoC can be applied in different research fields, exposing their benefits in improving the design process and productivity. Consequently, the reader will become more confident about the fundamental and technical aspects of the computing models, methodologies, and frameworks designed for FPGA/SoC, acquiring a clear idea of the main parameters required by each one. We highlight the importance of having simple approaches with few parameters, such as those proposed for other parallel architectures, so that they have a greater scope and can be widely used. Based on this literature review, the FPGA developer can select the approach that best suits the application, hardware architecture, and programming skills.
Some survey articles are available in the literature for FPGA-based reconfigurable hardware. Schafer and Wang [14] divide HLS DSE techniques into two main groups: synthesis-based and model-based. In addition to this classification, a third group appears, including DSE synthesis-based and supervised learning. According to [15], HLS DSE can be developed using model-based and model-free techniques. Model-based techniques are composed of tools and methodologies that use analytical models, whereas model-free techniques include approaches where the HLS tool is treated as a black box. A survey of automatic high-level code deployment for HLS tools and toolchains is presented in [16]. The authors analyze commercial HLS tools, academic HLS tools, HLS code generation tools, domain-specific language tools for HLS, dataflow HLS tools, and automatic code deployment tools (including automated DSE). Yehya et al. [17] focus on power consumption. They classify different estimation techniques as analytical, table-based, polynomial-based, and neural networks. The work in [18] analyzes different performance and power estimation models for CPU, GPU, and FPGA. Moreover, reconfigurable architectures can be categorized as coarse-grained and fine-grained according to [19], [20]. In this work, we focus on FPGA and FPGA/SoC architectures included in the last category.
To the best of our knowledge, there is no previous work that jointly:
• describes the models, methodologies, and frameworks developed for the estimation of metrics, FPGA-based DSE, and power consumption estimation on FPGA/SoC,
• shows their application in different research areas,
• analyzes the challenges to be addressed to widely use them for FPGA/SoC,
• compares them with the commonly used parallel computing models for CPU, GPU, and multicore processors.

B. METHODOLOGY
This survey is conducted by collecting the latest contributions, focusing on the models, methodologies, and frameworks for FPGA-based devices. The paper collection process has been performed mainly using models, methodologies, FPGA/SoC, parallel computing models, DSE, and Pareto-optimal design keywords in well-known scientific databases such as IEEE Xplore, Scopus, Web of Science, ScienceDirect, arXiv, and the Directory of Open Access Journals (DOAJ). The collected contributions are from the last six
C. OUTLINE
The remainder of this paper is organized as follows. Section II briefly presents the most widely used parallel computing models for CPU, GPU, and multicore processors. Section III introduces the FPGA-based reconfigurable hardware accelerator architectures, hardware/software co-design, DSE and metrics, and the techniques to improve latency, area, and power for this technology. In Section IV, we describe previous works on models, methodologies, and frameworks proposed for FPGA/SoC according to their main features: metrics estimation (IV-A), FPGA-based DSE (IV-B), and power consumption estimation (IV-C); and in Section IV-D, we present a summary and discussion. The integration of models and frameworks for FPGA-based reconfigurable hardware accelerators in different research fields is exposed in Section V. Challenges are analyzed in Section VI. Finally, conclusions are presented in Section VII.

II. PARALLEL COMPUTING MODELS FOR PERFORMANCE ESTIMATION
Computing models allow us to easily analyze algorithms by simplifying the computational world to a reduced set of parameters that define the cost of arithmetic and memory access operations and communication. These models contribute to the search for efficient algorithms for a given architecture, improving the productivity of designers, programmers, and engineers. A small amount of communication, a small number of operations, and a high degree of parallelism are key points that directly contribute to the efficiency of a parallel algorithm.
This section summarizes the characteristics of the most widely used parallel computing models for performance estimation. It is not aimed at providing a comprehensive presentation or a thorough classification of parallel models, languages, and architectures. In addition, we present some examples of their application in different architectures.

A. RANDOM ACCESS MACHINE AND PARALLEL RANDOM ACCESS MACHINE
The random access machine (RAM) model is proposed in [21] for sequential algorithms. It is composed of a memory, control unit, processor, and program. In 1978, Fortune and Wyllie proposed the parallel random access machine (PRAM) model [22] based on the RAM model. The main idea behind PRAM is that there is a shared memory m connected to several processing units with a global clock, as shown in Fig. 1. In this scenario, one processor P can execute one operation (arithmetic, memory access, or logic) within one single clock cycle. However, this model does not consider the communication or synchronization overheads.
PRAM sub-models like the exclusive read exclusive write (EREW), exclusive read concurrent write (ERCW), concurrent read exclusive write (CREW), and concurrent read concurrent write (CRCW) are introduced to handle read/write operations in a shared memory model [23].

B. BULK SYNCHRONOUS PARALLEL MODEL
The bulk synchronous parallel (BSP) model [24], proposed for distributed computing, is a bridging model between hardware and algorithms that offers a high degree of abstraction. The BSP program is divided into supersteps separated by a barrier synchronization. Each superstep comprises several blocks of computation and communication. Fig. 2 shows the workflow of the BSP model.
A BSP computer is represented by parameters P, s, L, and G, where:
• P: number of processors of the BSP computer.
• s: processor speed.
• L: cost, in steps, to complete a barrier synchronization.
• G: cost, in words, of delivering a message.
The normalized cost G is defined by Eq. 1:

G = Op_local / W_sec    (1)

where Op_local is the number of local operations executed in a processor and W_sec is the number of words communicated by the network per second. L represents the barrier synchronization cost at the end of each superstep.
The sum of G and L is the superstep cost. The former represents the number of maximum local computations executed on parallel processors. The latter represents a cost composed of the cost of the communications plus the synchronization at the end of the superstep.
The multi-BSP model [25] extends the BSP to multicore architectures by considering the architecture as a tree with d
TABLE 1. Features of the computing models PRAM, BSP, LogP, CCM, multi-BSP, and Roofline.
FIGURE 7. High-level overview of FlexCL, based on [98]. The input is the OpenCL kernel code, which is transformed to
LLVM IR through Clang. Information from the source code is extracted by a kernel analyzer, which is sent to a computation
model, a communication model, and a global memory model. The results of each model are integrated in one model to
estimate the final kernel execution time.
3) FRAMEWORKS
Pyramid, developed by Makrani et al. [101], is a machine
learning based framework to estimate timing and resource
utilization, and to overcome the differences between the
post-implementation results and intellectual property (IP)
cores created with HLS. It is developed by employing ensem-
ble machine learning techniques, such as linear regression,
artificial neural networks, support vector machines, and ran-
dom forests. As part of the framework, Minerva [102],
which is an automated hardware optimization tool based on
a heuristic model, is used to obtain a good throughput and
throughput-to-area ratio for the RTL code generated by HLS.
Wang et al. [103] present a framework based on a
performance analysis model combined with code tuning tech-
niques for OpenCL applications only on FPGAs, assuming
that an incremental development model is adopted by designers [104]. The model includes four FPGA-centric metrics to detect possible bottlenecks related to memory, parallelism, and computation.

4) SUMMARY
For metric estimation, a few contributions have considered the use of the traditional parallel computing models such as BSP and PRAM [94], [96] on FPGA. Nevertheless, the Roofline model has been widely adopted for estimating performance and bottlenecks on FPGA devices due to its intuitiveness and simplicity [48], [99], [100]. Furthermore, the differences between the metric estimation reported by HLS tools and the post-implementation results are a key point to consider when designing the estimators of performance metrics [101].

B. FPGA-BASED DESIGN SPACE EXPLORATION
Design space explorers aim to minimize HLS tool execution times, which are highly dependent on the size of the space to be analyzed. Different methodologies, models, and frameworks have been proposed based on the analysis of HLS directives, where the exploration of the design space [105], [106] is important because it increases exponentially with the use of directives. The challenge is to find a set of hardware designs, also known as Pareto-optimal designs. Considering that there is a limited number of resources (LUT, BRAM, DSP, and FF) available in the reconfigurable architecture, the hardware design cannot request more resources than those available in the FPGA.
The comparison among diverse design space explorers is useful for observing the strengths and weaknesses of each. This can be achieved using benchmarks, composed of computational kernels suitable for hardware acceleration. Some of these are MachSuite [107], CHStone (C-based) [108], S2CBench (SystemC-based) [109], Rosetta [110], and Spector (OpenCL-based) [111].
Surveys related to this topic are presented in [63] and [14]. In particular, the last one proposes a classification of HLS DSE techniques into two groups, as depicted in Fig. 8: synthesis-based and model-based. In this classification, the third category is composed of a combination of supervised learning and DSE synthesis-based techniques.

FIGURE 8. Classification of HLS DSE techniques, based on [14].

According to Sohrabizadeh et al. [15], HLS DSE can be developed using model-based and model-free techniques. Model-based techniques comprise tools and methodologies that use analytical models. They estimate the resources and performance of each point in the design space. Model-free techniques include approaches in which the HLS tool is treated as a black box, such as Bayesian optimization and reinforcement learning techniques [112], [113], [114], [115].

1) METHODOLOGIES
The Roofline model has been introduced within methodologies to explore the design space, targeting HPC applications based on HLS [116], [117], [118].
Nabi et al. [117] propose the TyTra flow that integrates performance and cost models based on Roofline analysis to obtain an optimized FPGA solution for scientific HPC applications. The methodology adopts the models defined in the OpenCL standard: platform and memory hierarchy, kernel execution, memory execution, and data pattern. The Roofline model is the base for the design space explorer and is used to assist the selection of the best instance to be downloaded into the hardware. Additionally, the authors propose an intermediate representation language (TyTra-IR). For the calculation of resource utilization to obtain scalability of the system, the authors consider a maximum utilization of the FPGA of 80%, as suggested by [119].
Siracusa et al. [118] propose a DSE methodology, presented in Fig. 9. The system input is the C/C++ source code, which is translated to an LLVM IR trace, obtaining the baseline of performance estimation and resource utilization through the synthesis process. From this base implementation, the Roofline model chart (RooflineOrig) determines memory bottlenecks. Afterward, an automated DSE estimates resources and performance, generating the optimal design points. The Roofline for the best feasible design is plotted along with the RooflineOrig chart, to compare the current design's performance and the performance of the solution derived by the DSE. The explorer includes resource sharing and HLS-specific IR optimizations during sample estimations. This work is extended in [116], with the hierarchical version of Roofline, estimating peak performance analytically and integrating a guide to reaching memory-transfer and data-locality optimizations.

FIGURE 9. A DSE methodology presented in [116], [118]. The input source code is translated to an LLVM IR trace, obtaining the baseline for performance estimation and resource utilization. Subsequently, the Roofline model chart estimates memory bottlenecks. An automated DSE phase allows resource and performance estimations, and the best feasible design is plotted along with the original Roofline chart.

Ferretti et al. [120] propose a method for inferring knowledge from past design explorations, as shown in Fig. 10. The authors introduce signature encoding for code and directives, composed of specification encoding (SE), configuration space descriptor (CSD), and the similarity metric longest common subsequence (LCS). The methodology uses signature encoding to create a string with design and configuration spaces (directives and their modes), combining CSD and SE. On the other hand, the LCS metric is used to measure the similarity between the actual and previous DSE stored in a database.

FIGURE 10. A DSE methodology presented in [120] that uses past design explorations to infer knowledge. The signature encoding is used to create a string with the design and configuration spaces. The new signature is compared with the ones obtained from previous DSE (DSE database). After the similarity evaluation, the selected signature is used as input for the inference stage, to finally obtain the optimal configuration.

COSMOS, an automatic and scalable methodology for DSE, is introduced by Piccolboni et al. [121] for complex accelerators. It generates a set of Pareto-optimal designs and reduces the number of HLS invocations. It comprises two main phases: component characterization and DSE (based on two steps: synthesis planning and mapping). The computing model used for DSE is based on timed marked graphs. COSMOS includes memory as part of the DSE process and applies synthesis constraints to reduce the variability of the HLS tools.
The adaptive threshold non-Pareto elimination strategy (ATNE) [122] focuses on inaccuracy estimation, to address the exploration of the design space on FPGA for implementations based on OpenCL. The ATNE algorithm is based on a random forest for regression. The prediction quality is obtained using two metrics: average distance from reference set (ADRS) and hypervolume error (HVE). The results are shown for matrix multiplication, Sobel filter, finite impulse response (FIR) filter, histogram, and discrete cosine transform.
Xu et al. [123] propose a methodology for performing DSE using MPSoC devices. This work presents three methods to automatically carry out the exploration: two based on simulation (cycle-accurate and fast cycle-accurate) and one based on hardware acceleration. For this purpose, the authors consider several IP cores in an FPGA. The proposed methodology is called fast explorer for behavioral systems (FEBS), and it accepts the number N of IP cores and their testbenches as input. The output is a set of dominant systems with an area vs performance trade-off. In this methodology, design space exploration is performed for each IP core. The general overview for this design space explorer is shown in Fig. 11.

FIGURE 11. MPSoC DSE, based on [123]. Different IP cores coexist in the MPSoC: some developed with HLS tools (IP1 and IP2) and others using RTL description. A design space is generated with the HLS tools. The system level exploration receives as input the number of IP cores described in ANSI-C or SystemC and their testbenches. The output is a Pareto design with a throughput-area trade-off. The system level exploration is composed of three methods: two based on simulation and one based on hardware acceleration.

2) MODELS
Lo et al. [113] propose a sequential model-based optimization, using a transfer-learning mechanism, to select directive configurations in HLS, minimizing the number of tool evaluations/executions while obtaining solutions with LUTs-latency optimal trade-offs.
Kwon et al. [124] propose the mixed-sharing multidomain model for reusing the knowledge obtained from previous HLS DSE while exploring a new target design space, showing its effectiveness when approximating quality of results (QoR) without running HLS tools.
Dai et al. [125] present a fast and accurate QoR estimation based on HLS. For this purpose, they use final HLS reports from a set of synthesized applications to identify relevant features and metrics, and construct the dataset to be used for training machine learning models (linear regression, artificial neural networks, and gradient tree boosting). To create the dataset, the authors employ the information obtained from HLS reports for different directives and targeting different FPGA platforms. In addition, a C-to-bitstream flow for different clock periods is performed to obtain features such as post-implementation resources and the worst negative slack. Finally, the authors obtain 234 features, which were reduced to 87 after an elimination process to remove irrelevant features.
Other models focused on the DSE process are presented in [126], [127], [128], and [129].

3) FRAMEWORKS
Mehrabi et al. propose the Prospector framework [114], which uses Bayesian techniques to obtain the best configurations
with fewer resources and reduced latency near Pareto-efficient designs. The HLS tool is considered as a black box (or function), which has to be modelled and optimized. Prospector is shown in Fig. 12, where the inputs are the source code, clock frequency, and directives, and the outputs are the synthesized designs. The Bayesian optimization unit (BOU) is used to explore the design space and control the selection of directives. The HLS tool is used to generate RTL from the high-level source code. At the end of the process, the framework can obtain different designs with a latency-area trade-off, which belong to the Pareto frontier.

FIGURE 12. Prospector framework, based on [114]. The inputs are the source code, clock frequency, and directives; the outputs are the synthesized designs with a trade-off between latency and area. The directives are encoded and sent to the BOU. The source code and clock frequency are the inputs for the HLS tools. Performance and cost values are obtained from the HLS tool and the Place & Route process.

Lin-Analyzer [130] is a tool that allows accurate and fast FPGA performance estimation and DSE, considering fine-grained parallelism. With this framework, runtime scales linearly while increasing the design space complexity; however, only a few optimizations are considered, mainly loop unrolling, loop pipelining, and array partitioning. Regarding resource utilization, the authors assume that DSP and BRAM are the bottlenecks in accelerator designs. The communication cost between the FPGA and global memory is not considered. The framework is divided into three main stages: instrumentation, optimization of dynamic data dependence graph (DDDG) generation, and DDDG scheduling. In the last stage, latency is used as a performance metric under resource constraints. Lina is proposed in [131] as an extension of Lin-Analyzer, and it includes non-perfect loop nests and timing analyses.
MPSeeker is proposed by Zhong et al. [132] to estimate the performance and resource utilization from a given code (C/C++), considering fine- and coarse-grained parallelism, allowing fast DSE. Because MPSeeker contemplates multi-parallelism using the loop tiling technique, a gradient boosted machine is proposed to obtain an accurate resource model for FF and LUT, while Lin-Analyzer is used for BRAM and DSP estimation. The authors also extend the features of Lin-Analyzer by including the data communication cost. The performance cost in MPSeeker is modelled as the sum of the kernel computation and data communication costs.
Choi et al. [78] present a DSE and clock cycle estimator using HLS, including code transformations in the presence of variable loop bounds. They propose a resource prediction method based on HLS reports through shareable and non-shareable operators from a loop. Using linear interpolation, non-shareable resources are obtained, whereas the resources estimated for shareable operators are computed as the maximum of all loops. An analytical model is proposed for clock cycle prediction. In this framework, the design with the best performance is the output.
COMBA [77], [133] is a framework that focuses on selecting the optimal configuration of directives in HLS, taking into account the use and availability of hardware resources, and provides an estimation of performance and resource utilization. The authors propose the metric-guided DSE II (MGDSE-II) algorithm to prune and explore the design space based on three metrics: the number of DSP, BRAM, and LUT. An overview of COMBA, which is composed of a recursive data collector, analytical models (latency and resources), and DSE, is presented in Fig. 13. In COMBA, the input is the C/C++ source code, which is transformed into an LLVM IR trace through Clang. The IR trace is the input for the recursive data collector, which extracts static and dynamic information that will be used for the analytical models. MGDSE-II then evaluates the configuration and establishes the next set of directives to be applied to the input code. This iteration is repeated until a high-performance configuration is obtained.

FIGURE 13. COMBA framework overview, based on Zhao et al. [77]. LLVM IR is extracted from the source code. This trace is the input for the recursive data collector, which will extract the parameters used by the analytical models (latency and resource). MGDSE-II evaluates the configuration and defines the next set of directives to be applied. The output of the complete flow is the high-performance configuration.

Ferretti et al. [134] present a framework for HLS DSE using a cluster-based heuristic integrally developed in MATLAB. The algorithm identifies different clusters in the DSE, reducing the number of regions to be analyzed; intra-clustering is performed, followed by inter-cluster exploration.
A lattice-traversing DSE framework [135] is proposed to explore the design space by transforming it into a lattice representation. The framework includes three stages: lattice creation and initial sampling, selection of lattice Pareto-neighbours, and synthesis and lattice labelling.
IronMan [115] is an end-to-end flexible and automated framework for DSE composed of a performance and resource predictor based on a graph neural network (GPP), a multi-objective DSE engine based on reinforcement learning (RLMD), and a code transformer (CT). One of the main features of this framework is that it retrieves the final code with the discovered optimizations, ready to generate the corresponding RTL through HLS.
Sherlock [136], introduced by Gautier et al., is a DSE framework based on multi-objective optimizations devoted to finding Pareto-optimal solutions (or the Pareto front), handling multiple conflicting optimization objectives. This framework uses active learning to exploit a surrogate design space model to find the Pareto-optimal designs as quickly as possible.
Other frameworks devoted to DSE are introduced in [15], [136], [137], and [138].

4) SUMMARY
A summary of most of the contributions devised for DSE and presented in this section is listed in Table 2, considering the following aspects:
• Reference.
• Pruning of the design space (P-DS).
• Whether it is based on the Roofline model.
• Whether it considers quality of results (QoR) in relation to the place and route estimation.
• Whether it applies transfer learning (TL).
TABLE 2. Summary of most of the contributions devised for DSE and presented in this section. The acronyms used in the table are: P-DS: pruning of the design space, QoR: quality of results in relation to the place and route estimation, TL: transfer learning, N Resource: number of estimated resources or NS (not specified).

by [113], BRAM and LUT are computed by [137]. COMBA [77], [133] estimates DSP, BRAM, and LUT. Lin-Analyzer [130] computes BRAM and DSP, whereas MPSeeker [132] estimates FF and LUT, combining Lin-Analyzer for DSP and BRAM utilization. Nevertheless, overestimating resource utilization can lead to pruning valid design points in the exploration phase. LUT, FF, DSP, and BRAM post-implementation estimation is performed by [125]. A challenge with HLS tools is efficiently predicting resource sharing for unrolling factors and array partitions when using HLS pragmas [78], [118].
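Several of the surveyed model-free DSE approaches treat the HLS tool as a black box that maps a directive configuration to a (latency, area) pair and retain only Pareto-optimal points. The C++ sketch below is a purely schematic illustration of such a loop; evaluate_design() is a hypothetical stand-in for an actual HLS and place-and-route run, and none of the names correspond to a specific surveyed framework.

#include <vector>

// Schematic model-free DSE loop: query a black-box cost function per candidate
// configuration and keep only Pareto-optimal (latency, area) design points.
struct Design { int config_id; double latency; double area; };

// Hypothetical stand-in: a real explorer would run HLS here and parse its reports.
Design evaluate_design(int config_id) {
    return { config_id, /*latency=*/1000.0 - config_id, /*area=*/10.0 * config_id };
}

static bool dominates(const Design& a, const Design& b) {
    return a.latency <= b.latency && a.area <= b.area &&
           (a.latency < b.latency || a.area < b.area);
}

std::vector<Design> explore(const std::vector<int>& candidate_configs) {
    std::vector<Design> pareto;
    for (int cfg : candidate_configs) {
        Design d = evaluate_design(cfg);                 // black-box cost query
        bool dominated = false;
        for (const Design& p : pareto)
            if (dominates(p, d)) { dominated = true; break; }
        if (dominated) continue;
        std::vector<Design> kept;                        // drop points the new design dominates
        for (const Design& p : pareto)
            if (!dominates(d, p)) kept.push_back(p);
        kept.push_back(d);
        pareto = kept;
    }
    return pareto;                                       // current Pareto front
}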
and instrumentation (implying the accumulation of events to monitor the relevant signals). A linear model is used to estimate the power contribution of the overall system by computing the power consumption of each IP core.

In the context of approximate computing, Xu et al. [144] investigate the use of linear regression and multilayer perceptron (MLP) models to generate a new approximated RTL design with a trade-off between area and power. Using this approach, the search space is extended by reducing the precision of the weights obtained for the predictive models. The proposed method is divided into three stages: kernel extraction and training data generation, model fitting and substitution, and model precision optimization with bit-width reduction.

2) MODELS

Lorandel et al. [145] propose the use of neural networks to estimate the dynamic power consumption and output signal activities for different IP cores involved in a system. In this study, two stages are considered: IP characterization and high-level system modelling. Nasser et al. [146] present a model for the characterization phase by extracting the relevant information for each component that has an impact on power.

Tripathi et al. [147] introduce an MLP architecture to calculate power consumption, using LLVM IR instructions as input, and modelling only dynamic power.

Verma et al. [148] present a power estimation model that improves Deng's model [149], and is designed using nonlinear regression techniques. For this purpose, they use the power data of different types of digital circuits (described in VHDL) after the synthesis process. The data is divided into designs with and without clock gating, and based on this separation, two power models are developed.

In [150], two techniques are proposed by Verma et al., remarking the importance of predicting the power consumption in an early stage of the accelerator design: a heuristic approach based on a backpropagation neural network and a regression based on statistics.

FlexCL is extended in [151] through the incorporation of three modes of communication for the memory model (direct, burst, and stream access patterns) and an analytical power model for dynamic and static power.
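The per-IP-core linear modelling style that recurs in the methodologies and models above can be summarized in a few lines of C++. This is only a schematic illustration written for this survey context: real approaches such as [143], [145], [146] derive their coefficients from monitored signal activities or trained neural networks, whereas the coefficients and names below are placeholders.

```cpp
#include <vector>

// Schematic per-IP-core power model (illustrative only): total dynamic power
// is approximated as a sum of per-core contributions, each a linear function
// of an activity estimate, plus a static term for the whole device.
struct IpCore {
  double activity;  // estimated switching activity (e.g., toggle rate)
  double alpha_w;   // fitted activity coefficient, in watts per activity unit
  double idle_w;    // per-core baseline when instantiated but idle
};

double estimate_power_w(const std::vector<IpCore>& cores, double static_w) {
  double total = static_w;
  for (const IpCore& c : cores)
    total += c.idle_w + c.alpha_w * c.activity;  // linear per-core model
  return total;
}
```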
3) FRAMEWORKS

HLSPredict, developed by O'Neal et al. [152], is a framework based on an ensemble of ten machine learning models to predict performance and power consumption without analytical models or HLS-in-the-loop. Two types of IP cores are considered: without directives (base IP core) or with directives (optimized IP core). Accelerators for training the models are based on a template with DMA for memory transactions, which implies that for every source code implemented through HLS, the functionality of the IP core is encapsulated and integrated within the hardware template.

HL-Pow, proposed by Lin et al. [153], is based on machine learning techniques and overcomes the gap between the HLS synthesis phase and power consumption estimation (usually performed after the RTL implementation flow). A DSE is introduced to obtain the latency vs power trade-off, with pruning to reduce the design space when finding Pareto-optimal designs. For the machine learning implementation, the training dataset is constructed by a feature construction (HLS report) and power collection (post-implementation report), with a total of 256 elements per feature. The experiments are performed with different machine learning models, including linear regression, support vector machines, tree-based models, and neural networks.

PowerGear, described by Lin et al. [154], is a graph-learning-assisted power estimator for FPGA HLS, and is composed of a graph construction flow and a power-aware graph neural network model called HEC-GNN. This study considers the impact of interconnections in the hardware design that affects the power modelling. The authors benefit from the HLS front-end and HLS back-end to recover dataflow graphs because it is possible to obtain the IR traces and finite state machine with data path information. PowerGear can be used to guide a design space explorer with a trade-off between latency and power to obtain the Pareto frontier.

Aladdin, introduced by Shao et al. [155], estimates the performance, power, and area of accelerators. It generates a dependence graph from the input code and produces a fast cycle estimate before RTL construction.

HAPE, presented by Makni et al. [156], is a framework for area-power estimation based on analytical models, and it aims to assist the DSE in reducing HLS runtime. HAPE focuses only on the main subtraces present in a source code containing the directives provided by the designer. HAPE integrates Lin-Analyzer for computation cost.

4) SUMMARY

Regarding the power consumption, there is an evident trend in estimating this metric in the early stages of design using HLS tools. Moreover, some of the presented frameworks integrate the performance, power, and area estimations with a DSE engine.

D. SUMMARY AND DISCUSSION

The studies described in this section are summarized in Table 3, including for each one:
• Reference and year of publication.
• Whether it is a model, a methodology, or a framework.
  – In the case of a model, the number of input parameters is included. For example, the model presented in [157] uses more than 10 input parameters (10+), and the model presented in [98] uses 21 parameters. The symbol (−) indicates that the number of parameters is not defined in the corresponding study.
• Whether it includes DSE.

TABLE 3. Contributions presented in the literature for metric estimation, FPGA-based DSE, and power consumption. The acronyms used in the table are: A: area, L: latency, P: power consumption, QoR: quality of result, C: communication, T: throughput, E: energy, S: speed-up, RT: reconfiguration time, S-C: SystemC, I-C: Impulse C, HDL: hardware description language, MH: meta-heuristics, Em: empirical, and PN: Petri Nets.
research area in which the model is applied. The fourth and fifth columns are the aim and type of model used, respectively, and the last one is the target platform.

We can observe that most contributions focus on CNN accelerators, and that the models are devoted to carrying out DSE and performance estimation and are mainly based on Roofline. The use of this model is based on the premise that communication and computation are two basic constraints to improve the throughput of an accelerator, especially when developing hardware for highly demanding applications.
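The roofline relation these contributions build on can be written as AP = min(PC, CI × PMB), using the acronyms defined in Appendix A (AP: attainable performance, PC: peak computation, CI: computational intensity, PMB: peak memory bandwidth). The C++ sketch below only illustrates how a designer might bound the number of useful PE replicas with this relation; the function names and the example numbers are ours, not taken from the surveyed works.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative roofline helper: attainable performance (GOP/s) of a kernel,
// given the device peak computation PC (GOP/s), the kernel computational
// intensity CI (operations per byte moved), and the peak memory bandwidth
// PMB (GB/s). The bound is AP = min(PC, CI * PMB).
double attainable_performance(double pc_gops, double ci_ops_per_byte,
                              double pmb_gbps) {
  return std::min(pc_gops, ci_ops_per_byte * pmb_gbps);
}

int main() {
  // Hypothetical numbers for a single processing element (PE) and a device.
  const double pe_gops = 2.0;      // throughput of one PE replica
  const double device_pc = 200.0;  // peak computation of the device
  const double ci = 0.5;           // operations per byte for the kernel
  const double pmb = 12.0;         // external memory bandwidth in GB/s

  double ap = attainable_performance(device_pc, ci, pmb);
  // Replicating PEs beyond the roofline bound only adds area, not speed.
  int useful_replicas = static_cast<int>(ap / pe_gops);
  std::printf("attainable %.1f GOP/s -> at most %d useful PE replicas\n",
              ap, useful_replicas);
  return 0;
}
```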
B. FRAMEWORKS

Frameworks (or toolflows) have been proposed to map ML inference and training into SoC-based architectures, integrating models to mainly estimate hardware resource utilization, latency, and throughput. An exhaustive survey is presented in [166].

Concerning training acceleration, Geng et al. [177] developed FPDeep, a toolflow for a scalable CNN training acceleration on deeply-pipelined FPGA clusters, proposing a model for operator graph partitioning and hardware resource allocation (with a distinction between small and large FPGA clusters). Roofline is used to evaluate the throughput, because of its dependency on communication and computation.

F-CNN, introduced by Zhao et al. [178], is an automatic framework for CNN training based on the reconfiguration of a streaming data path at runtime. The proposed models for resource and bandwidth estimation guide the space exploration under design constraints to obtain an optimal performance.

HP-GNN, proposed by Lin et al. [179], is a framework for training graph neural networks (GNN) on a CPU-FPGA platform. It incorporates an engine dedicated to exploring the design space through an exhaustive search using performance and resource utilization models. HP-GNN also incorporates hardware templates to implement different GNN architectures.

Regarding inference acceleration, Ghaffari et al. [180] present CNN2Gate, a framework based on OpenCL to map a CNN onto an FPGA with fixed-point arithmetic, including a hardware-aware DSE based on resource utilization. It is implemented using manual directive tuning, reinforcement learning, and hill-climbing methods.

Venieris et al. [181] propose the fpgaConvNet toolflow to map a CNN onto an FPGA, thereby optimizing the neural network workload. It includes a DSE using a multi-objective algorithm (simulated annealing), where the explorer optimizes the design according to latency, throughput, or maximum throughput with a latency constraint. Performance estimation and resource utilization models are proposed for DSE.

Cloud-DNN [182], introduced by Chen et al., is a framework for mapping DNN to cloud-FPGA, generating the corresponding HLS project to obtain the final IP core. The proposed accelerator model is based on hardware resource cost (considering DSP and BRAM) and a performance model for each layer (convolutional, max pooling, and fully connected). A greedy algorithm is employed to search for the best accelerator configuration under constraints such as the DSP, BRAM, bandwidth, and DNN layers.

FRED [183], developed by Biondi et al., is a framework for real-time applications that benefits from dynamic partial reconfiguration (DPR). It includes a hardware task model for the tasks carried out by the FPGA with partial reconfiguration enabled, a software model for the tasks executed on the processor, and a scheduling infrastructure.

Mu et al. present [184] a collaborative framework to obtain OpenCL-based hardware designs for CNN implementation. A DSE based on LoopTrees is generated and pruned to reduce the design space. Fine-grained and coarse-grained analytical models are introduced to generate the final optimized solution. The former estimates the latency and resource utilization, whereas the latter applies further optimization to the best candidate designs obtained after applying the fine-grained model.

The heterogeneous image processing acceleration framework (Hipacc), proposed by Reiche et al. [185], allows the generation of image processing accelerators. Several steps are performed by analyzing the IR trace: data dependency analysis, dependency graph restructuring, and transformations (streaming objects, memory allocation, and replication of the innermost kernel to improve throughput).

A framework named Spark-to-FPGA-Accelerator (S2FA), introduced by Yu et al. [186], transforms Scala computational kernels based on Apache Spark applications into optimized accelerator designs. For this, a learning-based DSE is employed to obtain high-performance RTL designs using an ensemble of reinforcement learning algorithms: uniform greedy mutation, differential evolution genetic algorithm, particle swarm optimization, and simulated annealing. The HLS tool is executed in the loop to verify each optimization.

AutoDNNchip [187] is proposed by Xu et al. to facilitate fast chip designs based on DNN, targeting FPGA and ASIC platforms. The main factors involved in the DNN acceleration process are bit precision, clock frequency, memory technology, PE architecture, width for data transfer, memory allocation, and DNN mapping. AutoDNNchip is composed of a chip predictor and a chip builder. The former predicts metrics such as area, latency, energy, and throughput, whereas the latter performs the DSE optimizing the chip design using the results obtained by the predictor. The chip predictor operates in two modes: (i) coarse-grained and (ii) fine-grained. In (i), analytical models are used to obtain the energy, critical path, and area for a DNN model, while in (ii), an algorithm is implemented to obtain the final latency through runtime simulations, considering the results of the coarse-grained mode. The chip builder is composed of a DSE based on two phases: early-stage architecture and IP configuration exploration, and inter-IP pipeline exploration and IP optimization. Finally, the RTL is generated and executed to validate the results.

Table 5 summarizes the frameworks used in the contributions described in this section. The first two columns are the reference and the year of publication. The third column is the research area in which the model is applied. The fourth is the name of the framework and the last is the target platform.

As we can observe, most frameworks are devoted to mapping ML-based inference into FPGA/SoC architectures. The components of these frameworks are usually expressed as pre-defined optimized templates, mainly implemented in C++ and OpenCL, where parallelism can be controlled by changing the parameters associated with the different directives.

TABLE 5. Utilization of frameworks FPGA/SoC on different research areas. PDR: Partial dynamic reconfiguration.
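Several of the toolflows above (e.g., fpgaConvNet [181] and S2FA [186]) search the directive space with stochastic meta-heuristics such as simulated annealing. The C++ sketch below is a minimal, generic illustration of that idea only; the cost function, the directive encoding, and all names are ours and are not taken from any of the cited tools, which couple the search to their own estimation models or to HLS runs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

// A candidate HLS configuration: the two directive parameters are placeholders.
struct Config {
  int unroll;     // loop unroll factor
  int partition;  // array partition factor
};

// Placeholder cost: stands in for an analytical latency/area model or an HLS
// report; real toolflows query their estimator here.
double cost(const Config& c) {
  double latency = 1024.0 / c.unroll + 64.0 / c.partition;
  double area = 4.0 * c.unroll + 2.0 * c.partition;
  return latency + 0.5 * area;  // scalarized multi-objective trade-off
}

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> step(-1, 1);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  Config cur{1, 1}, best = cur;
  double t = 100.0;  // initial temperature
  for (int iter = 0; iter < 2000; ++iter, t *= 0.995) {
    // Propose a small random move in the directive space (factors 1..16).
    Config next = cur;
    next.unroll = std::min(16, std::max(1, cur.unroll + step(rng)));
    next.partition = std::min(16, std::max(1, cur.partition + step(rng)));
    double delta = cost(next) - cost(cur);
    // Accept improvements always, worse moves with Boltzmann probability.
    if (delta < 0.0 || unif(rng) < std::exp(-delta / t)) cur = next;
    if (cost(cur) < cost(best)) best = cur;
  }
  std::printf("best: unroll=%d partition=%d cost=%.1f\n",
              best.unroll, best.partition, cost(best));
  return 0;
}
```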
VI. CHALLENGES

Nowadays, the explosive growth of accelerators promises greater computational capabilities. FPGA/SoC devices are widely used as hardware accelerators in different areas of research and development. However, the structured study we have presented in the previous sections indicates the necessity to address some challenges. Coping with them will permit a more widespread adoption of models, methodologies, and frameworks for performance estimation of HLS-based hardware designs for FPGA/SoC technology.

Even using HLS tools, reconfiguring an FPGA/SoC with an efficient hardware design is a challenging task. This is easily made apparent by some observations:
• Physical resources, such as memory bandwidth, reconfigurable hardware (LUTs, CLBs, and slices), and static hardware (DSPs and BRAMs) are limited in FPGA/SoC devices. Thus, the available physical resources should be used skilfully, considering techniques to improve the latency, area, and power, as introduced in Section III-C.
• Code restructuring techniques aid creating efficient FPGA implementations using HLS tools, modifying the original source code of the application according to the FPGA architecture. Suggestions for this topic are presented in [82].
• The number of PE replicas in a hardware design, and consequently the level of coarse-grain parallelism that can be obtained, is limited to the available physical resources. Therefore, different strategies should be implemented to exploit the architecture so as to increase the scalability of the system.
• There is a trade-off between the different metrics to be optimized, as was presented in Section III-B. As an example, the area occupied is likely to increase if the latency is reduced, and vice versa. Thus, the FPGA designer should choose a good compromise between the metrics in terms of resources, computing operations, throughput, among others.
• The hardware generated through HLS tools is directly associated with the applied directives, but sometimes applying and tuning directives require a considerable endeavour to obtain a proper FPGA implementation. Moreover, generating a solution for each directive combination is associated with the synthesis time, reducing productivity.
• The exploration of the design space is linked to the human effort of performing combinations of directives, user design constraints, FPGA features, and code restructuring, among others.

We can cope with the above considerations through models, methodologies, and frameworks to reduce design time, as follows:
• The level of coarse-grain parallelism can be obtained by means of a model such as Roofline, identifying the computation-to-communication ratio, exposing the relationship between communication bottlenecks, computations, and number of replicas, as was presented in Section II-E and demonstrated in contributions such as [48], [118].
• Design space explorers aim to identify the optimal combination of directives to obtain an HLS-based hardware design with the best trade-off among different metrics, generating the Pareto-optimal set of designs. Reducing the design space and avoiding HLS in the exploration process can improve the design time, as was described in Section IV-B.
• Models integrated within a methodology or framework can automatically estimate the performance of HLS-based hardware designs without executing HLS tools, as presented in Section IV.
• Some frameworks and methodologies including DSE provide automatic directive-insertion optimizations and code transformation insights, as in contributions such as [115], [116], [118].

Nevertheless, the literature review shows that a number of challenges still have to be addressed in order to make optimal use of models, methodologies, and frameworks, such as:
• Recent HLS tools generate more comprehensive reports with more accurate information on total resource availability, latency, clock frequency, and resource utilization. These reports can be integrated with models, methodologies, and frameworks to estimate metrics and provide an initial value for the replication factor of a single PE. However, the report generation is linked to the synthesis time of the FPGA implementation. Reducing the design time is an important factor when using FPGA/SoC without losing hardware quality to reconfigure the platform. Thus, if the HLS tool is in the loop for performance estimation using reports, it can lead to an increased design time. One way to overcome this is to use approaches such as [113], [121], [124], [152], [156], which avoid running HLS in the loop or reduce its invocation.
• The performance metrics reported by HLS tools make them suitable to be combined with a parallel computation model to reduce the time required to obtain the necessary statistics for each implementation for a specific application. However, there is a gap between the HLS report and the real hardware implementation [101] that can be addressed with a performance model that includes the results obtained from the sourceCode-to-bitstream flow using the values related to final hardware utilization, power consumption, and timing reports.
• Computing models for FPGA-based reconfigurable hardware accelerators have to consider that the inherent hardware is not fixed. Rather, it is defined by how the application is described. Therefore, a higher number of parameters have to be included in the model, such as hardware resources (DSP, BRAM, LUT, and FF), programmable logic clock, latency, byte-operations (Bops), scalability in the number of PE, and power consumption (a minimal sketch of such a parameter set is given after this list). This contrasts with the computing models proposed for other parallel platforms, such as PRAM or BSP, that use a few parameters. Nevertheless, including more parameters in the model increases the analysis accuracy, but affects the complexity of the model analysis. Therefore, the trade-off between these two features has to be addressed. In addition, the parameters should be adjusted according to the particular combination of directives applied to the source code.
• The compatibility among different versions of HLS tools is not granted by models, methodologies, and frameworks. As a consequence, calibration techniques can help maintain compatibility between high-level tools, thereby avoiding being tied to one version of HLS tool in particular [14].
• Methodologies and frameworks are typically linked to a tool [77], [130], [131], [136]. However, most such tools are not easily available or do not have user support. This is a critical point in the adoption of methodologies and frameworks for performance estimation, which makes [...] plays an important role and different strategies provided by commercial tools can be used in this phase, adding another factor to be analyzed.
• It is fundamental to consider the application of HLS-specific compiler optimizations, due to the impact that they have on the hardware quality, in terms of latency, area, and power consumption [190].
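As a concrete illustration of the parameter count involved, the following C++ structure lists the kind of quantities such an FPGA-oriented computing model would have to carry per design point. The structure and field names are ours, chosen only to mirror the parameters enumerated in the bullet above; they are not a definition taken from any surveyed model.

```cpp
#include <cstdint>

// Illustrative (not from the surveyed works): one design point of an
// FPGA-oriented performance model carries far more parameters than classic
// parallel models such as PRAM or BSP.
struct FpgaDesignPoint {
  // Hardware resources consumed by the generated accelerator.
  std::uint32_t dsp;
  std::uint32_t bram;
  std::uint32_t lut;
  std::uint32_t ff;

  double pl_clock_mhz;          // programmable logic clock
  std::uint64_t latency_cycles; // estimated or reported latency
  double byte_operations;       // Bops: operations per byte moved
  std::uint32_t pe_replicas;    // scalability in the number of PEs
  double power_w;               // estimated power consumption

  // Directive combination that produced this point (placeholders).
  std::uint8_t unroll_factor;
  std::uint8_t array_partition_factor;
  bool pipeline_enabled;
};
```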
Fig. 18 summarizes the main aspects presented in this section, considering those to create efficient hardware to reconfigure the FPGA, how some of these aspects may be coped with through models, methodologies, and frameworks, and the challenges that need to be considered to bridge the gap between designers and FPGA-based reconfigurable hardware accelerators.

VII. CONCLUSION

In this survey, different models, methodologies, and frameworks proposed for metrics estimation, FPGA-based design space exploration, and power consumption estimation on FPGA/SoC have been described. The main features and limitations, as well as trade-offs of these approaches, have been presented, and different challenges to be addressed have been identified.

The integration of models and frameworks in different research areas has also been described, indicating a growing tendency to apply them in the field of machine learning accelerators for diverse applications.

Based on our literature review, it can be observed that existing models, methodologies, and frameworks are very difficult to compare against one another. One reason is the lack of standards limiting their evaluation on different hardware and applications, together with the fact that the different approaches do not analyze the same performance metrics. In addition, it can be affirmed that the inherent hardware reconfigurability of FPGA/SoC affects the complexity of the associated models. Indeed, the models for FPGA/SoC usually have a higher complexity than those commonly used for CPU, GPU, multicore processors, among other architectures.

We believe this survey can help readers understand the benefits of integrating models, methodologies, and frameworks for FPGA-based hardware accelerators into the design flow. Therefore, the FPGA designer can select the approach that best suits the application, hardware architecture, and programming skills.

The literature review shows that several challenges still have to be addressed to make optimal integration of models, methodologies, and frameworks in the design flow. By highlighting these challenges, this survey reveals what has to be considered to bridge the gap between the FPGA designer and hardware accelerators based on FPGA.

APPENDIX A. LIST OF ACRONYMS
A Area.
ADRS Average distance from reference set.
AP Attainable performance.
ASIC Application specific integrated circuit.
BRAM Block RAM.
BSP Bulk synchronous parallel.
CCM Collective computing model.
CDFG Control data flow graph.
CFD Computational fluid dynamics.
CI Computational intensity.
CLB Configurable logic block.
CNN Convolutional neural network.
CRCW Concurrent read concurrent write.
CREW Concurrent read exclusive write.
CUDA Compute Unified Device Architecture.
D Design space.
DDDG Dynamic data dependence graph.
DMA Direct memory access.
DNN Deep neural network.
DSE Design space exploration.
DSP Digital signal processor.
ERCW Exclusive read concurrent write.
EREW Exclusive read exclusive write.
ERT Empirical Roofline toolkit.
FF Flip-flop.
FIR Finite impulse response filter.
FPGA Field programmable gate array.
GNN Graph neural network.
HDL Hardware description language.
HLS High-level synthesis.
HPC High-performance computing.
HPM Hierarchical model for parallel computations.
HVE Hypervolume error.
I/O Input/Output.
IoT Internet of things.
IP Intellectual property.
IR Intermediate representation.
L Latency.
L1 Level-1 cache memory.
L2 Level-2 cache memory.
LLVM IR Low-level virtual machine intermediate representation.
LUT Lookup table.
ML Machine learning.
MLP Multi-layer perceptron.
MOOA Multi-objective optimization algorithms.
MPSoC Multiprocessor system on chip.
PC Peak computation.
PE Processing element.
PF Pareto-optimal frontier.
PMB Peak memory bandwidth.
PRAM Parallel random access machine.
QoR Quality of results.
RAM Random access machine.
RTL Register transfer level.
SIMD Single instruction/multiple data.
SoC System on chip.
SPMD Single program multiple data.
[44] J. L. Roda, F. Sande, C. Leon, J. A. Gonzalez, and C. Rodriguez, "The collective computing model," in Proc. Euromicro Workshop Parallel Distrib., 1999, pp. 19–26.
[45] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
[46] C. Yang, T. Kurth, and S. Williams, "Hierarchical roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 perlmutter system," Concurrency Comput., Pract. Exper., vol. 32, no. 20, p. e5547, Oct. 2020.
[47] C. Yang, Y. Wang, T. Kurth, S. Farrell, and S. Williams, "Hierarchical roofline performance analysis for deep learning applications," in Intelligent Computing (Lecture Notes in Networks and Systems), vol. 284, K. Arai, Ed. Cham, Switzerland: Springer, 2021, doi: 10.1007/978-3-030-80126-7_35.
[48] B. da Silva, A. Braeken, E. H. D'Hollander, and A. Touhafi, "Performance modeling for FPGAs: Extending the roofline model with high-level synthesis tools," Int. J. Reconfigurable Comput., vol. 2013, Jan. 2013, Art. no. 428078.
[49] Y. J. Lo, S. Williams, B. Van Straalen, T. J. Ligocki, M. J. Cordery, N. J. Wright, M. W. Hall, and L. Oliker, "Roofline model toolkit: A practical tool for architectural and program analysis," in High Perform. Comput. Syst. Perform. Modeling, Benchmarking, Simul., 2015, pp. 129–148.
[50] Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams, "Time-based roofline for deep learning performance analysis," in Proc. Workshop Deep Learn. Supercomputers, Atlanta, GA, USA, Nov. 2020, pp. 10–19.
[51] Y. Zhang, G. Chen, G. Sun, and Q. Miao, "Models of parallel computation: A survey and classification," Frontiers Comput. Sci. China, vol. 1, no. 2, pp. 156–165, May 2007.
[52] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir, "A model for hierarchical memory," in Proc. Symp. Theory Comput., 1987, pp. 305–314.
[53] B. Alpern, L. Carter, E. Feig, and T. Selker, "The uniform memory hierarchy model of computation," Algorithmica, vol. 12, nos. 2–3, pp. 72–109, Sep. 1994.
[54] Z. Li, P. Mills, and J. H. Reif, "Models and resource metrics for parallel and distributed computation," Parallel Algorithms Appl., vol. 8, no. 1, pp. 35–59, 1996.
[55] X. Qiao, S. Chen, and L. T. Yang, "HPM: A hierarchical model for parallel computations," Int. J. High Perform. Comput. Netw., vol. 1, no. 3, pp. 117–127, 2004.
[56] S. Pllana, I. Brandic, and S. Benkner, "Performance modeling and prediction of parallel and distributed computing systems: A survey of the state of the art," in Proc. 1st Int. Conf. Complex, Intell. Softw. Intensive Syst. (CISIS), Apr. 2007, pp. 279–284.
[57] A. Riahi, A. Savadi, and M. Naghibzadeh, "Comparison of analytical and ML-based models for predicting CPU–GPU data transfer time," Computing, vol. 102, no. 9, pp. 2099–2116, Sep. 2020.
[58] O. Bringmann, W. Ecker, I. Feldner, A. Frischknecht, C. Gerum, T. Hämäläinen, M. A. Hanif, M. J. Klaiber, D. Mueller-Gritschneder, P. P. Bernardo, S. Prebeck, and M. Shafique, "Automated HW/SW co-design for edge AI: State, challenges and steps ahead: Special session paper," in Proc. Int. Conf. Hardw./Softw. Codesign Syst. Synthesis (CODES+ISSS), 2021, pp. 11–20, doi: 10.1145/3478684.3479261.
[59] C. Pham-Quoc, X.-Q. Nguyen, and T. N. Thinh, "Towards an FPGA-targeted hardware/software co-design framework for CNN-based edge computing," Mobile Netw. Appl., vol. 174, pp. 1–12, May 2022.
[60] Q. Xiao, S. Zheng, B. Wu, P. Xu, X. Qian, and Y. Liang, "HASCO: Towards agile HArdware and software CO-design for tensor computation," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 1055–1068.
[61] Y. Li, R. Chen, B. Sensale-Rodriguez, W. Gao, and C. Yu, "Real-time multi-task diffractive deep neural networks via hardware-software co-design," Sci. Rep., vol. 11, no. 1, pp. 1–9, Dec. 2021.
[62] N. Talati, K. May, A. Behroozi, Y. Yang, K. Kaszyk, C. Vasiladiotis, T. Verma, L. Li, B. Nguyen, J. Sun, and J. M. Morton, "Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design," in Proc. Int. Symp. High-Performance Comput. Archit. (HPCA), 2021, pp. 654–667.
[63] D. R. F. de Bulnes, Y. Maldonado, and L. Trujillo, "Development of multiobjective high-level synthesis for FPGAs," Sci. Program., vol. 2020, Jun. 2020, Art. no. 7095048.
[64] Z. Zeng, R. Sedaghat, and A. Sengupta, "A novel framework of optimizing modular computing architecture for multi objective VLSI designs," in Proc. Int. Conf. Microelectron. (ICM), Dec. 2009, pp. 328–331.
[65] Y. Ma, S. Roy, J. Miao, J. Chen, and B. Yu, "Cross-layer optimization for high speed adders: A Pareto driven machine learning approach," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 12, pp. 2298–2311, Dec. 2018.
[66] D. Roy and A. Sengupta, "Low overhead symmetrical protection of reusable IP core using robust fingerprinting and watermarking during high level synthesis," Future Gener. Comput. Syst., vol. 71, pp. 89–101, Jun. 2017.
[67] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, "Broadening the exploration of the accelerator design space in embedded scalable platforms," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2017, pp. 1–7.
[68] R. Resmi and B. B. T. Sundari, "Allocation of optimal reconfigurable array using graph merging technique," in Proc. Int. Conf. Embedded Syst. (ICES), Jul. 2014, pp. 49–54.
[69] D. S. H. Ram, M. C. Bhuvaneswari, and S. M. Logesh, "A novel evolutionary technique for multi-objective power, area and delay optimization in high level synthesis of datapaths," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Jul. 2011, pp. 290–295.
[70] A. Sengupta, R. Sedaghat, and P. Sarkar, "A multi structure genetic algorithm for integrated design space exploration of scheduling and allocation in high level synthesis for DSP kernels," Swarm Evol. Comput., vol. 7, pp. 35–46, Dec. 2012.
[71] A. Sengupta, R. Sedaghat, and P. Sarkar, "Rapid exploration of integrated scheduling and module selection in high level synthesis for application specific processor design," Microprocessors Microsyst., vol. 36, no. 4, pp. 303–314, Jun. 2012.
[72] B. C. Schafer and K. Wakabayashi, "Design space exploration acceleration through operation clustering," IEEE Trans. Comput.-Aided Design Integr., vol. 29, no. 1, pp. 153–157, Jan. 2009.
[73] B. C. Schafer, T. Takenaka, and K. Wakabayashi, "Adaptive simulated annealer for high level synthesis design space exploration," in Proc. Int. Symp. VLSI Design, Autom. Test, Apr. 2009, pp. 106–109.
[74] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proc. Int. Symp. Code Gener. Optim. (CGO), 2004, pp. 75–86.
[75] LLVM Developer Group. Clang. Accessed: Feb. 1, 2022. [Online]. Available: https://clang.llvm.org
[76] L. Huang, D.-L. Li, K.-P. Wang, T. Gao, and A. Tavares, "A survey on performance optimization of high-level synthesis tools," J. Comput. Sci. Technol., vol. 35, no. 3, pp. 697–720, May 2020.
[77] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), 2017, pp. 430–437, doi: 10.1109/ICCAD.2017.8203809.
[78] Y.-K. Choi and J. Cong, "HLS-based optimization and design space exploration for applications with variable loop bounds," in Proc. Int. Conf. Comput.-Aided Design (ICCAD), 2018, pp. 1–8.
[79] J. S. Monson and B. L. Hutchings, "Using source-level transformations to improve high-level synthesis debug and validation on FPGAs," in Proc. Int. Symp. Field-Program. Gate Arrays, 2015, pp. 5–8.
[80] C. Li, Y. Bi, Y. Benezeth, D. Ginhac, and F. Yang, "High-level synthesis for FPGAs: Code optimization strategies for real-time image processing," J. Real-Time Image Process., vol. 14, no. 3, pp. 701–712, Mar. 2018.
[81] R. Campos and J. M. Cardoso, "On data parallelism code restructuring for HLS targeting FPGAs," in Proc. Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), 2021, pp. 144–151.
[82] J. de Fine Licht, M. Besta, S. Meierhans, and T. Hoefler, "Transformations of high-level synthesis codes for high-performance computing," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 5, pp. 1014–1029, May 2021, doi: 10.1109/TPDS.2020.3039409.
[83] A. C. Ferreira and J. M. Cardoso, "Graph-based code restructuring targeting HLS for FPGAs," in Proc. Int. Symp. Appl. Reconfigurable Comput., 2019, pp. 230–244.
[84] M. Q. Hoang, P. L. Nguyen, H. V. Tran, H. Q. Nguyen, V. T. Nguyen, and C. Vo-Le, "FPGA oriented compression of DNN using layer-targeted weights and activations quantization," in Proc. IEEE 8th Int. Conf. Commun. Electron. (ICCE), Jan. 2021, pp. 157–162.
[85] Q. Zhang, J. Cao, Y. Zhang, S. Zhang, Q. Zhang, and D. Yu, "FPGA implementation of quantized convolutional neural networks," in Proc. Int. Conf. Commun. Technol. (ICCT), 2019, pp. 1605–1610.
[86] P. Bacchus, R. Stewart, and E. Komendantskaya, "Accuracy, training time and hardware efficiency trade-offs for quantized neural networks on FPGAs," in Proc. Int. Symp. Appl. Reconfigurable Comput., 2020, pp. 121–135.
[87] X. Xu, Q. Lu, T. Wang, Y. Hu, C. Zhuo, J. Liu, and Y. Shi, "Efficient hardware implementation of cellular neural networks with incremental quantization and early exit," ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 4, pp. 1–20, Oct. 2018.
[88] N. Grover and M. Soni, "Reduction of power consumption in FPGAs—An overview," Inf. Eng. Electron. Bus., vol. 4, no. 5, p. 50, 2012.
[89] M. Ibro and G. Marinova, "Review on low-power consumption techniques for FPGA-based designs in IoT technology," in Proc. 16th Int. Conf. Telecommun. (ConTEL), Jun. 2021, pp. 110–114.
[90] B. Khaleghi, S. Salamat, M. Imani, and T. Rosing, "FPGA energy efficiency by leveraging thermal margin," in Proc. IEEE 37th Int. Conf. Comput. Design (ICCD), Nov. 2019, pp. 376–384.
[91] H. Kim and K. Choi, "Low power FPGA-SoC design techniques for CNN-based object detection accelerator," in Proc. IEEE 10th Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), Oct. 2019, pp. 1130–1134.
[92] Y. Choi and J. Cong, "HLScope: High-level performance debugging for FPGA designs," in Proc. Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2017, pp. 125–128.
[93] Y. Choi, P. Zhang, P. Li, and J. Cong, "HLScope+: Fast and accurate performance estimation for FPGA HLS," in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), 2017, pp. 691–698.
[94] N. Kapre and H. Patel, "Applying models of computation to OpenCL pipes for FPGA computing," in Int. Workshop OpenC (IWOCL), 2017, pp. 1–9.
[95] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," Proc. IEEE, vol. 75, no. 9, pp. 1235–1245, Sep. 1987.
[96] M. Hora, V. Končický, and J. Tětek, "Theoretical model of computation and algorithms for FPGA-based hardware accelerators," 2018, arXiv:1807.03611.
[97] K. Papadimitriou, A. Dollas, and S. Hauck, "Performance of partial reconfiguration in FPGA systems: A survey and a cost model," ACM Trans. Reconfigurable Technol. Syst., vol. 4, no. 4, pp. 1–36, 2011.
[98] S. Wang, Y. Liang, and W. Zhang, "FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs," in Proc. 54th Annu. Design Automat. Conf., Jun. 2017, p. 27.
[99] E. Calore and S. F. Schifano, "Performance assessment of FPGAs as HPC accelerators using the FPGA empirical roofline," in Proc. 31st Int. Conf. Field-Program. Log. Appl. (FPL), Aug. 2021, pp. 83–90.
[100] T. Nguyen, S. Williams, M. Siracusa, C. MacLean, D. Doerfler, and N. J. Wright, "The performance and energy efficiency potential of FPGAs in scientific computing," in Proc. Perform. Modeling, Benchmarking Simul. High Perform. Comput. Syst. (PMBS), 2020, pp. 8–19.
[101] H. M. Makrani, F. Farahmand, H. Sayadi, S. Bondi, S. M. P. Dinakarrao, H. Homayoun, and S. Rafatirad, "Pyramid: Machine learning framework to estimate the optimal timing and resource usage of a high-level synthesis design," in Proc. 29th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2019, pp. 397–403.
[102] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, "Minerva: Automated hardware optimization tool," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2017, pp. 1–8.
[103] Z. Wang, B. He, W. Zhang, and S. Jiang, "A performance analysis framework for optimizing OpenCL applications on FPGAs," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Mar. 2016, pp. 114–125.
[104] C. Larman and V. R. Basili, "Iterative and incremental developments. A brief history," Computer, vol. 36, no. 6, pp. 47–56, 2003.
[105] J. Cong, W. Jiang, B. Liu, and Y. Zou, "Automatic memory partitioning and scheduling for throughput and power optimization," ACM Trans. Des. Automat. Electron. Syst., vol. 16, no. 2, pp. 1–15, 2011.
[106] N. K. Pham, A. K. Singh, A. Kumar, and M. M. A. Khin, "Exploiting loop-array dependencies to accelerate the design space exploration with high level synthesis," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2015, pp. 157–162.
[107] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks, "MachSuite: Benchmarks for accelerator design and customized architectures," in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Oct. 2014, pp. 110–119.
[108] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, "CHStone: A benchmark program suite for practical C-based high-level synthesis," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 1192–1195.
[109] B. C. Schafer and A. Mahapatra, "S2CBench: Synthesizable SystemC benchmark suite for high-level synthesis," IEEE Embedded Syst. Lett., vol. 6, no. 3, pp. 53–56, Sep. 2014.
[110] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y.-H. Lai, G. Liu, G. A. Velasquez, W. Wang, and Z. Zhang, "Rosetta: A realistic high-level synthesis benchmark suite for software-programmable FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2018, pp. 269–278.
[111] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, "Spector: An OpenCL FPGA benchmark suite," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 141–148.
[112] B. Reagen, J. M. Hernández-Lobato, R. Adolf, M. Gelbart, P. Whatmough, G.-Y. Wei, and D. Brooks, "A case for efficient accelerator design space exploration via Bayesian optimization," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2017, pp. 1–6.
[113] C. Lo and P. Chow, "Model-based optimization of high level synthesis directives," in Proc. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–10.
[114] A. Mehrabi, A. Manocha, B. C. Lee, and D. J. Sorin, "Bayesian optimization for efficient accelerator synthesis," ACM Trans. Archit. Code Optim., vol. 18, no. 1, pp. 1–25, Mar. 2021.
[115] N. Wu, Y. Xie, and C. Hao, "IronMan: GNN-assisted design space exploration in high-level synthesis via reinforcement learning," in Proc. Great Lakes Symp. VLSI (GLSVLSI), Jun. 2021, pp. 39–44.
[116] M. Siracusa, E. Del Sozzo, M. Rabozzi, L. Di Tucci, S. Williams, D. Sciuto, and M. D. Santambrogio, "A comprehensive methodology to optimize FPGA designs via the roofline model," IEEE Trans. Comput., vol. 71, no. 8, pp. 1903–1915, Aug. 2021.
[117] S. W. Nabi and W. Vanderbauwhede, "FPGA design space exploration for scientific HPC applications using a fast and accurate cost model based on roofline analysis," J. Parallel Distrib. Comput., vol. 133, pp. 407–419, Nov. 2019.
[118] M. Siracusa, L. Di Tucci, M. Rabozzi, S. Williams, E. D. Sozzo, and M. D. Santambrogio, "A CAD-based methodology to optimize HLS code via the roofline model," in Proc. 39th Int. Conf. Comput.-Aided Design, Nov. 2020, pp. 1–9.
[119] R. Tessier and H. Giza, "Balancing logic utilization and area efficiency in FPGAs," in Proc. 10th Int. Workshop Field Program. Logic Appl., vol. 1896, 2000, pp. 535–544.
[120] L. Ferretti, J. Kwon, G. Ansaloni, G. D. Guglielmo, L. P. Carloni, and L. Pozzi, "Leveraging prior knowledge for effective design-space exploration in high-level synthesis," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 11, pp. 3736–3747, Nov. 2020.
[121] L. Piccolboni, P. Mantovani, G. D. Guglielmo, and L. P. Carloni, "COSMOS: Coordination of high-level synthesis and memory optimization for hardware accelerators," ACM Trans. Embedded Comput. Syst., vol. 16, no. 5s, pp. 1–22, Oct. 2017.
[122] P. Meng, A. Althoff, Q. Gautier, and R. Kastner, "Adaptive threshold non-Pareto elimination: Re-thinking machine learning for system level design space exploration on FPGAs," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), 2016, pp. 918–923.
[123] S. Xu, S. Liu, Y. Liu, A. Mahapatra, M. Villaverde, F. Moreno, and B. Carrion Schafer, "Design space exploration of heterogeneous MPSoCs with variable number of hardware accelerators," Microprocessors Microsyst., vol. 65, pp. 169–179, Mar. 2019.
[124] J. Kwon and L. P. Carloni, "Transfer learning for design-space exploration with high-level synthesis," in Proc. Workshop Mach. Learn. (CAD), 2020, pp. 163–168.
[125] S. Dai, Y. Zhou, H. Zhang, E. Ustun, E. F. Y. Young, and Z. Zhang, "Fast and accurate estimation of quality of results in high-level synthesis with machine learning," in Proc. IEEE 26th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), Apr. 2018, pp. 129–132.
[126] S. Liu, F. C. Lau, and B. C. Schafer, "Accelerating FPGA prototyping through predictive model-based HLS design space exploration," in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, p. 97.
[127] A. S. B. Lopes and M. M. Pereira, "A machine learning approach to accelerating DSE of reconfigurable accelerator systems," in Proc. 33rd Symp. Integr. Circuits Syst. Design (SBCCI), Aug. 2020, pp. 1–6.
[128] E. Ustun, C. Deng, D. Pal, Z. Li, and Z. Zhang, "Accurate operation delay prediction for FPGA HLS using graph neural networks," in Proc. 39th Int. Conf. Comput.-Aided Design, Nov. 2020, p. 87.
[129] M. Manuel, A. Kreddig, S. Conrady, N. A. Vu Doan, and W. Stechele, "Model-based design space exploration for approximate image processing on FPGA," in Proc. IEEE Nordic Circuits Syst. Conf. (NorCAS), Oct. 2020, pp. 1–7.
[130] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar, "Lin-analyzer: A high-level performance analysis tool for FPGA-based accelerators," in Proc. 53rd Annu. Design Automat. Conf., Austin, TX, USA, Jun. 2016, p. 136.
[131] A. B. Perina, J. Becker, and V. Bonato, "Lina: Timing-constrained high-level synthesis performance estimator for fast DSE," in Proc. Int. Conf. Field-Program. Technol. (ICFPT), 2019, pp. 343–346.
[132] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar, "Design space exploration of FPGA-based accelerators with multi-level parallelism," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 1141–1146.
[133] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "Performance modeling and directives optimization for high-level synthesis on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 7, pp. 1428–1441, Jul. 2019.
[134] L. Ferretti, G. Ansaloni, and L. Pozzi, "Cluster-based heuristic for high level synthesis design space exploration," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 1, pp. 35–43, Jan. 2021.
[135] L. Ferretti, G. Ansaloni, and L. Pozzi, "Lattice-traversing design space exploration for high level synthesis," in Proc. IEEE 36th Int. Conf. Comput. Design (ICCD), Oct. 2018, pp. 210–217.
[136] Q. Gautier, A. Althoff, C. L. Crutchfield, and R. Kastner, "Sherlock: A multi-objective design space exploration framework," ACM Trans. Design Autom. Electron. Syst., vol. 27, no. 4, pp. 1–20, Jul. 2022.
[137] D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K. Olukotun, "Automatic generation of efficient accelerators for reconfigurable hardware," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 115–127.
[138] H. M. Makrani, H. Sayadi, T. Mohsenin, S. Rafatirad, A. Sasan, and H. Homayoun, "XPPE: Cross-platform performance estimation of hardware accelerators using machine learning," in Proc. 24th Asia South Pacific Design Automat. Conf., Jan. 2019, pp. 727–732.
[139] PowerTool MAXPOWERTOOL002 Quick Start Guide, Maxim Integr., San Jose, CA, USA, 2014. Accessed: Feb. 19, 2022. [Online]. Available: https://pdfserv.maximintegrated.com/en/an/UG5981.pdf
[140] USB Interface Adapter Evaluation Module. User's Guide, Texas Instrum., Dallas, TX, USA, 2006. Accessed: Feb. 19, 2022.
[141] Xilinx Power Estimator User Guide. UG-440 (v2021.2), Xilinx, San Jose, CA, USA, 2021. Accessed: Feb. 19, 2022. [Online]. Available: https://china.xilinx.com/content/dam/xilinx/support/documents/sw_manuals/xilinx2021_2/ug440-xilinx-power-estimator.pdf
[142] Intel FPGA Power and Thermal Calculator User Guide, Intel, San Jose, CA, USA, 2021. Accessed: Feb. 19, 2022. [Online]. Available: https://www.intel.com/content/www/us/en/docs/programmable/683445/21-4/overview-of-the.html
[143] J. J. Davis, E. Hung, J. M. Levine, E. A. Stott, P. Y. K. Cheung, and G. A. Constantinides, "KAPow: High-accuracy, low-overhead online per-module power estimation for FPGA designs," ACM Trans. Reconfigurable Technol. Syst., vol. 11, no. 1, pp. 1–22, Mar. 2018.
[144] S. Xu and B. C. Schafer, "Approximating behavioral HW accelerators through selective partial extractions onto synthesizable predictive models," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1–8.
[145] J. Lorandel, J.-C. Prévotet, and M. Hélard, "Efficient modelling of FPGA-based IP blocks using neural networks," in Proc. Int. Symp. Wireless Commun. Syst. (ISWCS), 2016, pp. 571–575.
[146] Y. Nasser, J. Prévotet, and M. Hélard, "Power modeling on FPGA: A neural model for RT-level power estimation," in Proc. Int. Conf. Comput. Frontiers (CF), 2018, pp. 309–313.
[147] A. N. Tripathi and A. Rajawat, "An accurate and quick ANN-based system-level dynamic power estimation model using LLVM IR profiling for FPGA designs," IEEE Embedded Syst. Lett., vol. 12, no. 2, pp. 58–61, Jun. 2020.
[148] G. Verma, V. Khare, and M. Kumar, "More precise FPGA power estimation and validation tool (FPEV_Tool) for low power applications," Wireless Pers. Commun., vol. 106, no. 4, pp. 2237–2246, Jun. 2019.
[149] L. Deng, K. Sobti, and C. Chakrabarti, "Accurate models for estimating area and power of FPGA implementations," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2008, pp. 1417–1420.
[150] G. Verma, T. Singhal, R. Kumar, S. Chauhan, S. Shekhar, B. Pandey, and D. M. Akbar Hussain, "Heuristic and statistical power estimation model for FPGA based wireless systems," Wireless Pers. Commun., vol. 106, no. 4, pp. 2087–2098, Jun. 2019.
[151] Y. Liang, S. Wang, and W. Zhang, "FlexCL: A model of performance and power for OpenCL workloads on FPGAs," IEEE Trans. Comput., vol. 67, no. 12, pp. 1750–1764, Dec. 2018.
[152] K. O'Neal, M. Liu, H. Tang, A. Kalantar, K. DeRenard, and P. Brisk, "HLSPredict: Cross platform performance prediction for FPGA high-level synthesis," in Proc. Int. Conf. Comput.-Aided Design, Nov. 2018, pp. 1–8.
[153] Z. Lin, J. Zhao, S. Sinha, and W. Zhang, "HL-Pow: A learning-based power modeling framework for high-level synthesis," in Proc. Asia South Pacific Design Autom. Conf. (ASP-DAC), 2020, pp. 574–580.
[154] Z. Lin, Z. Yuan, J. Zhao, W. Zhang, H. Wang, and Y. Tian, "PowerGear: Early-stage power estimation in FPGA HLS via heterogeneous edge-centric GNNs," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), 2022, pp. 1341–1346.
[155] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Proc. ACM/IEEE 41st Int. Symp. Comput. Archit. (ISCA), Jun. 2014, pp. 97–108.
[156] M. Makni, S. Niar, M. Baklouti, and M. Abid, "HAPE: A high-level area-power estimation framework for FPGA-based accelerators," Microprocessors Microsyst., vol. 63, pp. 11–27, Nov. 2018.
[157] K. K. W. Poon, A. Yan, and S. J. E. Wilton, "A flexible power model for FPGAs," in Proc. 12th Int. Conf. Field Program. Logic Appl. (FPL), vol. 2438, 2002, pp. 312–321.
[158] C. Du and Y. Yamaguchi, "High-level synthesis design for stencil computations on FPGA with high bandwidth memory," Electronics, vol. 9, no. 8, p. 1275, Aug. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/8/1275
[159] M. Karp, A. Podobas, N. Jansson, T. Kenter, C. Plessl, P. Schlatter, and S. Markidis, "High-performance spectral element methods on field-programmable gate arrays: Implementation, evaluation, and future projection," in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), May 2021, pp. 1077–1086.
[160] K. Nagasu, K. Sano, F. Kono, and N. Nakasato, "FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis," J. Parallel Distrib. Comput., vol. 106, pp. 153–169, Aug. 2017.
[161] C. Du, I. Firmansyah, and Y. Yamaguchi, "FPGA-based computational fluid dynamics simulation architecture via high-level synthesis design method," in Proc. Int. Symp. Appl. Reconfigurable Comput., vol. 12083, 2020, pp. 232–246.
[162] E. Reggiani, G. Natale, C. Moroni, and M. D. Santambrogio, "An FPGA-based acceleration methodology and performance model for iterative stencils," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2018, pp. 115–122.
[163] M. Feickert and B. Nachman, "A living review of machine learning for particle physics," 2021, arXiv:2102.02770.
[164] A. M. C. Deiana, N. Tran, J. Agar, M. Blott, G. Di Guglielmo, J. Duarte, P. Harris, S. Hauck, M. Liu, M. S. Neubauer, and J. Ngadiuba, "Applications and techniques for fast machine learning in science," 2021, arXiv:2110.13041.
[165] S. L. Brunton, B. R. Noack, and P. Koumoutsakos, "Machine learning for fluid mechanics," Annu. Rev. Fluid Mech., vol. 52, no. 1, pp. 477–508, Jan. 2020.
[166] S. I. Venieris, A. Kouris, and C.-S. Bouganis, "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions," ACM Comput. Surv., vol. 51, no. 3, pp. 1–56, Jun. 2018.
[167] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2019, pp. 1–9.
[168] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey of machine learning accelerators," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2020, pp. 1–12.
[169] E. Reggiani, M. Rabozzi, A. M. Nestorov, A. Scolari, L. Stornaiuolo, and M. Santambrogio, "Pareto optimal design space exploration for accelerated CNN on FPGA," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2019, pp. 107–114.
[170] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, "Design space exploration of FPGA-based deep convolutional neural networks," in Proc. 21st Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2016, pp. 575–580.
[171] J. Xu, Z. Liu, J. Jiang, D. Yong, and S. Li, "CaFPGA: An automatic generation model for CNN accelerator," Microprocess. Microsyst., vol. 60, pp. 196–206, Jul. 2018.
[172] J. Shan, M. T. Lazarescu, J. Cortadella, L. Lavagno, and M. R. Casu, "CNN-on-AWS: Efficient allocation of multikernel applications on multi-FPGA platforms," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 2, pp. 301–314, Feb. 2021.
[173] S. O. Ayat, M. Khalil-Hani, and A. A.-H.-A. Rahman, "Optimizing FPGA-based CNN accelerator for energy efficiency with an extended roofline model," Turkish J. Electr. Eng. Comput. Sci., vol. 26, no. 2, pp. 919–935, Mar. 2018.
[174] L. Xie, X. Fan, W. Cao, and L. Wang, "High throughput CNN accelerator design based on FPGA," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 274–277.
[175] C. Park, S. Park, and C. S. Park, "Roofline-model-based design space exploration for dataflow techniques of CNN accelerators," IEEE Access, vol. 8, pp. 172509–172523, 2020.
[176] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Performance modeling for CNN inference accelerators on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 4, pp. 843–856, Apr. 2020.
[177] T. Geng, T. Wang, A. Li, X. Jin, and M. Herbordt, "FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters," 2019, arXiv:1901.01007.
[178] W. Zhao, H. Fu, W. Luk, T. Yu, S. Wang, B. Feng, Y. Ma, and G. Yang, "F-CNN: An FPGA-based framework for training convolutional neural networks," in Proc. IEEE 27th Int. Conf. Appl.-Specific Syst., Architectures Processors (ASAP), Jul. 2016, pp. 107–114.
[179] Y.-C. Lin, B. Zhang, and V. Prasanna, "HP-GNN: Generating high throughput GNN training implementation on CPU-FPGA heterogeneous platform," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2022, pp. 123–133.
[180] A. Ghaffari and Y. Savaria, "CNN2Gate: An implementation of convolutional neural networks inference on FPGAs with automated design space exploration," Electronics, vol. 9, no. 12, p. 2200, Dec. 2020.
[181] S. I. Venieris and C.-S. Bouganis, "FpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 291–292.
[182] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, "Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs," in Proc. Int. Symp. Field-Program. Gate Arrays, 2019, pp. 73–82.
[183] A. Biondi, A. Balsini, M. Pagani, E. Rossi, M. Marinoni, and G. Buttazzo, "A framework for supporting real-time applications on dynamic reconfigurable FPGAs," in Proc. IEEE Real-Time Syst. Symp. (RTSS), Nov. 2016, pp. 1–12.
[184] J. Mu, W. Zhang, H. Liang, and S. Sinha, "A collaborative framework for FPGA-based CNN design modeling and optimization," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), 2018, pp. 139–1397.
[185] O. Reiche, M. A. Özkan, R. Membarth, J. Teich, and F. Hannig, "Generating FPGA-based image processing accelerators with Hipacc," in Proc. Int. Conf. Computer-Aided Design (ICCAD), Nov. 2017, pp. 1026–1033.
[186] C. H. Yu, P. Wei, M. Grossman, P. Zhang, V. Sarker, and J. Cong, "S2FA: An accelerator automation framework for heterogeneous computing in datacenters," in Proc. Design Autom. Conf. (DAC), 2018, p. 153.
[187] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and Y. Lin, "AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2020, pp. 40–50.
[188] L. Ferretti, A. Cini, G. Zacharopoulos, C. Alippi, and L. Pozzi, "A graph deep learning framework for high-level synthesis design space exploration," 2021, arXiv:2111.14767.
[189] Q. Xu, T. Mytkowicz, and N. Kim, "Approximate computing: A survey," IEEE Des. Test, vol. 33, no. 1, pp. 8–22, Jan. 2016.
[190] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson, "The effect of compiler optimizations on high-level synthesis for FPGAs," in Proc. Annu. Int. Symp. Field-Program. Custom Comput. Mach., 2013, pp. 89–96.

ROMINA SOLEDAD MOLINA (Student Member, IEEE) received the master's (Master in Computer Science) degree from the Universidad Nacional de San Luis, Argentina. She is currently pursuing the Ph.D. degree in industrial and information engineering with the Università degli Studi di Trieste, under a Joint-Supervision Program with the Universidad Nacional de San Luis. Her main research interests include digital signal processing, digital control, image analysis, high-performance computing, machine learning, parallel computing models, FPGA, and SOC.

VERONICA GIL-COSTA is currently a Former Researcher at Yahoo! Labs Santiago hosted by the University of Chile. She is also an Associate Professor at the University of San Luis, a Researcher at the National Research Council (CONICET) of Argentina, and a Researcher at the CITIAPS, Chile. Her research work is on parallel computing and distributed systems, with applications in query processing and capacity planning for large scale systems.

MARÍA LIZ CRESPO is currently a Research Officer at The Abdus Salam International Centre for Theoretical Physics (ICTP) and an Associate Researcher of the Italian National Institute of Nuclear Physics (INFN), Trieste, Italy. She is also coordinating the research and training program of the Multidisciplinary Laboratory (MLab), ICTP. She has organized several international schools and workshops on fully programmable systems on chip for nuclear and scientific instrumentation. She is the coauthor of more than 100 scientific publications in prestigious peer-reviewed journals. Her main research interests include advanced scientific instrumentation for particle physics experiments and experimental multidisciplinary research.

GIOVANNI RAMPONI (Life Senior Member, IEEE) was born in 1956. Since 2000, he has been a Full Professor of electronics at the Department of Engineering and Architecture, University of Trieste, Italy. He is the co-inventor of international patents, and has published more than 200 papers in international journals, conference proceedings, and book chapters. His research interests include nonlinear digital signal processing, enhancement and feature extraction in images and image sequences, image visualization, image quality evaluation, and deep learning techniques for image processing. More information can be found at: www.units.it/ramponi.
Abstract—Large language models (LLMs) have catalyzed an upsurge in automatic code generation, garnering significant attention for register transfer level (RTL) code generation. Despite the potential of RTL code generation with natural language, it remains error-prone and limited to relatively small modules because of the substantial semantic gap between natural language expressions and hardware design intent. In response to these limitations, we propose a methodology that reduces the semantic gap by utilizing C/C++ for generating hardware designs via High-Level Synthesis (HLS) tools. Basically, we build a set of C-to-HLS optimization strategies catering to various code patterns, such as nested loops and local arrays. Then, we apply these strategies to sequential C/C++ code through in-context learning, which provides the LLMs with exemplary C/C++-to-HLS prompts. With this approach, HLS designs can be generated effectively. Since LLMs still face problems in determining the optimized pragma parameters precisely, we integrate a design space exploration (DSE) tool for pragma parameter tuning. Furthermore, we also employ profiling tools to pinpoint the performance bottlenecks within a program and selectively convert bottleneck components to HLS code for hardware acceleration. By combining the LLM-based profiling, C/C++-to-HLS translation, and DSE, we have established HLSPilot, the first LLM-enabled high-level synthesis framework, which can fully automate high-level application acceleration on hybrid CPU-FPGA architectures. According to our experiments on real-world application benchmarks, HLSPilot achieves comparable performance in general and can even outperform manually crafted counterparts, thereby underscoring the substantial promise of LLM-assisted hardware designs.

Index Terms—large language model, high-level synthesis, C-to-HLS, code generation.

This work is supported by the National Key R&D Program of China under Grant (2022YFB4500405), and the National Natural Science Foundation of China under Grant 62174162.

I. INTRODUCTION

Hardware designing is a demanding task requiring a high level of expertise. Traditional hardware design involves coding with a register transfer level (RTL) language. However, as the complexity of hardware increases continuously with the computing requirements of applications, RTL coding becomes exceedingly time-consuming and labor-intensive. The emergence of High-Level Synthesis (HLS) enables hardware design at higher abstraction levels [1]. HLS typically employs high-level languages like C/C++ for hardware description, allowing software engineers to also engage in hardware development, which significantly lowers the expertise barrier in hardware design. Designers can focus more on the applications and algorithms rather than the details of low-level hardware implementations. HLS tools automate design tasks such as concurrency analysis of algorithms, interface design, logic unit mapping, and data management, thereby substantially shortening the hardware design cycle.

While HLS offers numerous advantages such as higher development efficiency and lower design barriers [1] [2], there are still some issues in the real-world HLS-based hardware acceleration workflow [3]. Firstly, the overall analysis of the program is of great importance: determining the performance bottlenecks of the program and the co-design between CPU and FPGA remains a challenging issue. Besides, designs based on HLS still encounter a few major performance issues [4] [5]. Foremost, it still requires substantial optimization experience to craft high-quality HLS code and achieve the desired performance in practical development processes [6] [7]. In addition, HLS code often struggles to reach optimality due to the large design space of the various pragma parameters. Some design space exploration (DSE) tools have been proposed [8] [9] [10] [11] to automate the parameter tuning, but these tools do not fundamentally optimize the hardware design. High-quality HLS design turns out to be the major performance challenge from the perspective of general software designers. Some researchers have attempted to address this challenge by using pre-built templates for specific domain applications [12] [13] [14]. For example, ThunderGP [13] has designed a set of HLS-based templates for optimized graph processing accelerator generation, allowing designers to implement various graph algorithms by filling in the templates. However, it demands a comprehensive understanding of both the domain knowledge and the HLS development experience from designers, and there is still a lack of a well-established universal solution to obtain optimized HLS code. Bridging the gap between C/C++ and HLS remains a formidable challenge requiring further efforts.

Large Language Models (LLMs) have recently exhibited remarkable capabilities in various generative tasks, including text generation, machine translation, and code generation, underscoring their advanced learning and imitation skills. These advancements have opened up possibilities for addressing hardware design challenges. Researchers have begun applying LLMs to various hardware design tasks, including general-purpose processor designs, domain-specific accelerator designs, and arbitrary RTL code generation. Among these applications, it can be observed that neural network accelerator generation utilizing a predefined template, as reported in [15], reaches an almost 100% success rate. In contrast, generating register transfer level (RTL) code from natural language
descriptions, such as design specifications, experiences a considerably higher failure rate [16] [17]. This disparity is largely due to the semantic gap between the inputs and the anticipated outputs. Despite the imperfections, these works have demonstrated the great potential of exploring LLMs for hardware designing.

Inspired by prior works, we introduce HLSPilot, an automated framework that utilizes LLMs to generate and optimize HLS code from sequential C/C++ code. Instead of generating RTL code from natural language directly, HLSPilot mainly leverages LLMs to generate the C-like HLS code from C/C++ with a much narrower semantic gap, and eventually outputs RTL code using established HLS tools. Essentially, HLSPilot accomplishes RTL code generation from C/C++ without imposing hardware design tasks with a broad semantic gap on LLMs. Specifically, HLSPilot initiates the process with runtime profiling to pinpoint the code segments that are the performance bottleneck and require optimization. Subsequently, HLSPilot extracts the kernel code segments and applies appropriate HLS optimization strategies to the computing kernels to generate optimized HLS code. Then, HLSPilot employs a design space exploration (DSE) tool to fine-tune the parameters of the generated HLS design. Finally, HLSPilot leverages Xilinx OpenCL APIs to offload the compute kernels to the FPGA, facilitating the deployment of the entire algorithm on a hybrid CPU-FPGA architecture. In summary, LLMs are utilized throughout the entire hardware acceleration workflow, ranging from profiling, HW/SW partitioning, HLS code generation, and HLS code optimization to tool usage, thereby achieving a high degree of design automation.

The major contributions of this work are summarized as follows:
• We propose HLSPilot, the first automatic HLS code generation and optimization framework from sequential C/C++ code using LLMs. This framework investigates the use of LLMs for HLS design strategy learning and tool learning, and builds a complete hardware acceleration workflow ranging from runtime profiling, kernel identification, automatic HLS code generation, and design space exploration to HW/SW co-design on a hybrid CPU-FPGA computing architecture. The framework is open sourced on Github (https://github.com/xcw-1010/HLSPilot).
• We propose a retrieval-based approach to learn the HLS optimization techniques and examples from the Xilinx user manual and utilize an in-context learning approach to apply the learned HLS optimizations on serial C/C++ code and generate optimized HLS code with LLMs for various computing kernels.
• According to our experiments on an HLS benchmark, HLSPilot can generate optimized HLS code from sequential C/C++ code, and the resulting designs can outperform manual optimizations with the assistance of DSE tools in most cases. In addition, we also demonstrate the successful use of HLSPilot as a complete hardware acceleration workflow on a hybrid CPU-FPGA architecture with a case study.

II. RELATED WORK

A. LLM for Hardware Design

Recent works have begun to utilize LLMs to assist hardware designing from different angles [15], [16], [18]–[25]. Generating RTL code with natural language is a typical approach of hardware design with LLMs. For instance, VGen [18] leverages an open-source LLM, CodeGen [26], fine-tuned with a Verilog code corpus to generate Verilog code. Similarly, VerilogEval [19] enhances the LLM's capability to generate Verilog by constructing a supervised fine-tuning dataset; it also establishes a benchmark for evaluating LLM performance. ChipChat [24] achieves an 8-bit accumulator-based microprocessor design through multi-round natural language conversation. ChipGPT [16] proposes a four-stage zero-code logic design framework based on GPT for hardware design. These studies have successfully applied LLMs to practical hardware designing. However, these methods are mostly limited to small functional modules, and the success rate drops substantially when the hardware design gets larger. GPT4AIGchip proposed in [15] can also leverage LLMs to generate efficient AI accelerators based on a hardware template, but it relies on a pre-built hardware library that requires an intensive understanding of both the domain knowledge and the hardware design techniques, which can hinder its use by software developers. Recently, a domain-specific LLM for chip design, ChipNeMo [17], was proposed. ChipNeMo employs a series of domain-adaptive techniques to train an LLM capable of generating RTL code, writing EDA tool scripts, and summarizing bugs. While powerful, domain-specific LLMs face challenges such as high training costs and difficulties in data collection.

B. LLM for Code Generation

Code generation is one of the key applications of LLMs. A number of domain-specific LLMs such as CodeGen [26], CodeX [27], and CodeT5 [28] have been proposed to address the programming of popular languages such as C/C++, Python, and Java, which have large corpora available for pre-training and fine-tuning. In contrast, it can be challenging to collect sufficient corpora for less popular languages. VGen [18] collected and filtered a Verilog corpus from Github and textbooks, obtaining only hundreds of MB of corpus. Hence, prompt engineering in combination with in-context learning provides an attractive approach to leverage LLMs to generate code for domain-specific languages. For instance, the authors in [29] augment code generation by providing the language's Backus–Naur form (BNF) grammar within prompts.

III. HLSPILOT FRAMEWORK

The remarkable achievements of LLMs across a wide domain of applications inspire us to create an LLM-driven automatic hardware acceleration design framework tailored for a hybrid CPU-FPGA architecture. Unlike previous efforts that primarily focused on code generation, our objective is to harness the potential of LLMs to emulate the role of an expert engineer in hardware acceleration. Given that hardware acceleration on a hybrid CPU-FPGA architecture demands a set of different design tasks such as runtime profiling, compute kernel identification, compute kernel acceleration, design space exploration, and CPU-FPGA co-design, LLMs must understand the design guidelines and manipulate the relevant design tools to achieve the desired design objectives, akin to an engineer. Fortunately, LLMs have exhibited powerful capabilities in document comprehension, in-context learning, tool learning, and code generation, all of which align perfectly with the hardware acceleration design requirements. The intended design framework eventually provides an end-to-end high-level synthesis of sequential C/C++ code on a hybrid CPU-FPGA architecture, and is thus named HLSPilot; it will be elaborated in the rest of this section.

A. HLSPilot Overview

HLSPilot, as presented in Fig. 1, takes sequential C/C++ code as design input and mainly includes five major processing stages to generate an optimized hardware acceleration solution on a hybrid CPU-FPGA architecture.

[Fig. 1: HLSPilot overview. Stage 1 profiles the software code and produces a profiling report and the kernel to be optimized; stage 2 (sub-stages 2-1 and 2-2) refactors the kernel; stage 3 covers automated optimization strategy learning (3-1) and strategy retrieval and applying (3-2), where a retrieved strategy's introduction, application scenes, parameter description, and demos are assembled into a prompt together with a system prompt such as "You are an expert in FPGA...".]

Firstly, HLSPilot conducts runtime profiling on the high-level application code to identify the most time-consuming computing kernels, which will be the focus of subsequent optimization. In this work, we profile the target algorithm and analyze the execution time with gprof on a CPU system. Then, a detailed performance report will be generated as needed. With the report, we can conveniently obtain performance information such as the execution time distribution across the algorithm and the number of function calls. Since LLMs are capable of understanding and summarizing the textual reports, the time-consuming functions can be identified conveniently. HLSPilot extracts the computing kernels to be optimized in the next stage based on this profiling information.

Secondly, the computing kernels are organized as dependent tasks and pipelined accordingly. The dependent tasks can be implemented efficiently with the data flow mechanism supported by Xilinx HLS. While the compute kernels can be irregular, we propose a program-tree-based strategy to refactor the program structure of the compute kernels and generate an optimized task flow graph while ensuring equivalent code functionality. Details of the automatic task pipelining will be illustrated in Section III-B.

Thirdly, we start to optimize each task with HLS independently. While there are many distinct HLS optimization strategies applicable to different high-level code patterns, we create a set of HLS optimization strategies based on the Xilinx HLS user guide and leverage LLMs to select and apply the appropriate optimization strategies automatically based on the code patterns in each task. Details of the LLM-based automatic HLS optimization will be presented in Section III-C.

Fourthly, after the code refactoring and the application of various HLS pragmas, the HLS code can be obtained, but parameters such as the initiation interval (II) for pipelining, the factors of loop unrolling, and the size of array partitioning in the HLS code still need to be tuned to produce accelerators with higher performance. However, it remains rather challenging for LLMs to decide design parameters of
a complex design precisely. To address this issue, HLSPilot utilizes external tools to conduct the design space exploration and decides on the optimized solution automatically. According to recent research [30], LLMs are capable of learning and utilizing external APIs and tools efficiently. Hence, HLSPilot leverages LLMs to extract the parameters from the HLS code and invoke the DSE tool proposed in [31] by generating the corresponding execution scripts.

Finally, when the compute kernels are optimized with HLS, they can be compiled and deployed on FPGAs for hardware acceleration. Nonetheless, these accelerators must be integrated with a host processor to provide a holistic hardware acceleration solution. The acceleration system has both host code and device code, which will be executed on the CPU side and the FPGA side respectively. HLSPilot leverages LLMs to learn the APIs provided by the Xilinx runtime (XRT) to manage the FPGA-based accelerators and perform the data transfer between host memory and FPGA device memory. Then, it generates the host code mostly based on the original algorithm code and replaces the compute kernels with the compute APIs that will invoke the FPGA accelerators and the data movement APIs. The device code is mainly the HLS code generated in the prior steps. With both the host code and device code, the entire algorithm can be deployed on the hybrid CPU-FPGA architecture.

B. Program-Tree-based Task Pipelining

While the compute kernel can be quite complex, it needs to be split into multiple tasks for the sake of potential pipelining or parallel processing, which is critical to the performance of the generated accelerator. However, it is difficult to split the compute kernel appropriately, because inappropriate splitting may lead to imbalanced pipelining and low performance. In addition, the splitting usually causes code refactoring, which may produce code with inconsistent functionality and further complicate the problem. To address this problem, we propose a program-tree-based strategy to guide the LLM to produce fine-grained task splitting and pipelining.

The proposed program-tree-based task pipelining strategy is detailed in Algorithm 1. According to the strategy, the LLM iteratively decomposes the compute kernel into smaller tasks and eventually forms a tree structure. An input compute kernel C is denoted as the root node of the tree; hence, the initial node set of the tree is T = {C}. Then, the LLM decides whether each task in T can be further decomposed based on the complexity of the task code. If a decomposition is confirmed for task_i, the LLM will perform the code decomposition. The decomposition of non-loop tasks and loop tasks differs, and both are detailed later in this subsection. If the task cannot be further decomposed, task_i is added to T_new directly.

Algorithm 1: Program-tree-based Pipelining Strategy
Input: Top-level function code C
Output: Task collection T = {task_1, task_2, . . . , task_n}
1  T <- {C}
2  while T has a task that can be further split do
3      T_new <- {}
4      for task_i in T do
5          if the LLM decides to further split task_i then
6              1. For non-loop blocks: split the code based on the functionality of the statement execution
7              2. For loop blocks: split the code based on the minimum parallelizable loop granularity
8              Add the refactored code to T_new
9          else
10             Add task_i to T_new
11         end
12     end
13     T <- T_new
14 end

The major challenge of the program-tree-based task pipelining strategy is the task decomposition metric, which depends on the code structures and can vary substantially. As a result, the metric can be difficult to quantify. Instead of using a predetermined quantitative metric, we leverage LLMs to perform the task decomposition with natural language rules and typical decomposition examples. Specifically, for non-loop code, we have the LLM analyze the semantics of the code statements, recognize the purpose of these statements, and group statements performing the same function into a single task. For loop code, the decomposition is primarily based on the smallest loop granularity that can be executed in parallel. We take advantage of the in-context learning capabilities of LLMs and present a few representative decomposition examples to guide the task decomposition for general scenarios. These examples are detailed as follows.

1) Each iteration of the loop is considered as a task: In the original merge sort loop, each iteration processes all intervals of the same width. Therefore, each iteration can be regarded as a task. For example, task_i merges all intervals with a width equal to 2^i.

// before:
for (int width = 1; width < SIZE; width = 2 * width) {
    for (int i1 = 0; i1 < SIZE; i1 = i1 + 2 * width) {
        int i2 = i1 + width;
        int i3 = i1 + 2 * width;
        if (i2 >= SIZE) i2 = SIZE;
        if (i3 >= SIZE) i3 = SIZE;
        merge(A, i1, i2, i3, temp);
    }
}

// after:
for (int stage = 1; stage < STAGES - 1; stage++) {
    // merge all equally wide intervals
    merge_intervals(temp[stage - 1], width, temp[stage]);
    width *= 2;
}

2) The first and second halves of a loop's traversal are each considered as a task: In histogram statistics, since the
first and second halves of the loop can be executed in parallel, they are considered as two tasks.

// before:
for (int i = 0; i < INPUT_SIZE; i++) {
    val = in[i];
    hist[val] = hist[val] + 1;
}

// after:
for (int i = 0; i < INPUT_SIZE / 2; i++) {
    val = in1[i];
    hist1[val] = hist1[val] + 1;
}
for (int i = 0; i < INPUT_SIZE / 2; i++) {
    val = in2[i];
    hist2[val] = hist2[val] + 1;
}
histogram_reduce(hist1, hist2, hist);

3) Each level of a loop is considered as a task: In the BFS algorithm, there are two loops: the first loop is used to find the frontier vertex and read the corresponding rpao data, and the second loop is used to traverse the neighbors of the frontier vertex, so the kernel can be divided into two tasks accordingly.

// before:
loop1: for (int i = 0; i < vertex_num; i++) {
    char d = depth[i];
    if (d == level) {
        start = rpao[i];
        end = rpao[i + 1];
        loop2: for (int j = start; j < end; j++) {
            ngb_vidx = ciao[j];
            ngb_depth = depth[ngb_vidx];
            if (ngb_depth == -1) {
                depth[ngb_vidx] = level_plus1;
            }
        }
    }
}

// after:
void read_frontier_vertex(int *depth, int vertex_num, int level, int *rpao, ...) {
    ...
    for (int i = 0; i < vertex_num; i++) {
        if (depth[i] == level) {
            int start = rpao[i];
            int end = rpao[i + 1];
            start_stream << start;
            end_stream << end;
        }
    }
}

void traverse(hls::stream<int>& start_stream, hls::stream<int>& end_stream, ...) {
    ...
    while (!start_stream.empty() && !end_stream.empty()) {
        int start = start_stream.read();
        int end = end_stream.read();
        for (int j = start; j < end; j++) {
            ngb_vidx = ciao[j];
            ngb_depth = depth[ngb_vidx];
            if (ngb_depth == -1) {
                depth[ngb_vidx] = level_plus1;
            }
        }
    }
}

4) Multiple levels of loops are considered as a task: In video frame image convolution, there are a total of four levels of loops, where loop1 and loop2 are considered as the tasks for reading the pixels, and loop3 and loop4 are the tasks for calculating the convolution.

// before:
loop1: for (int line = 0; line < img_h; ++line) {
    loop2: for (int pixel = 0; pixel < img_w; ++pixel) {
        float sum_r = 0, sum_g = 0, sum_b = 0;
        loop3: for (int m = 0; m < coeff_size; ++m) {
            loop4: for (int n = 0; n < coeff_size; ++n) {
                int ii = line + m - center;
                int jj = pixel + n - center;
                if (ii >= 0 && ii < img_h && jj >= 0 && jj < img_w) {
                    sum_r += in[(ii * img_w) + jj].r * coeff[(m * coeff_size) + n];
                    sum_g += in[(ii * img_w) + jj].g * coeff[(m * coeff_size) + n];
                    sum_b += in[(ii * img_w) + jj].b * coeff[(m * coeff_size) + n];
                }
                ...
            }

// after:
void read_dataflow(hls::stream<RGBPixel>& read_stream, const RGBPixel *in,
                   int img_w, int elements, int half) {
    int pixel = 0;
    while (elements--) {
        read_stream << in[pixel++];
    }
    ...
}

void compute_dataflow(hls::stream<RGBPixel>& write_stream, hls::stream<RGBPixel>& read_stream,
                      const float* coefficient, int img_width, int elements, int center) {
    static RGBPixel window_mem[COEFFICIENT_SIZE][MAX_WIDTH];
    static fixed coef[COEFFICIENT_SIZE * COEFFICIENT_SIZE];
    for (int i = 0; i < COEFFICIENT_SIZE * COEFFICIENT_SIZE; i++) {
        coef[i] = coefficient[i];
    }
    ...
}

In order to demonstrate the proposed task decomposition strategy, we take BFS, with its relatively complex nested loop, as an example and present the generated program tree in Fig. 2. It shows that the nested loops in BFS are effectively identified and extracted as dependent tasks correctly.

When the tasks are decomposed, the corresponding code segments will be packed into a function and the code needs to be refactored accordingly. Before proceeding to the HLS acceleration, HLSPilot needs to check the correctness of the refactored code. Specifically, we compare the refactored code to the original code by testing the execution results to ensure the computing results are consistent. We follow a bottom-up testing strategy and start from the leaf nodes of the program tree. If an error occurs, it can be traced back to the erroneous leaf node and checked from its parent node. If errors persist
[Fig. 2: Program tree generated for the BFS kernel. The top-level bfs_kernel loop (traverse node, find frontier, process neighbors of the frontier) is decomposed into staged tasks such as read_frontier_vertex (stage 1: traverse nodes and find the frontier), load_depth (stage 1-1: load node depths into depth_inspect_stream), and load_frontier (stage 1-2: load the frontier according to depth).]

Upon retrieving a suitable optimization strategy, the strategy's parameter description information and optimization example information are integrated into the prompt, utilizing the LLM's in-context learning capabilities to generate optimized code.

[Figure: Strategy retrieval prompt. The selected strategies are inserted into the prompt together with the instruction "Please apply these strategies in appropriate places based on their descriptions and examples."]

IV. EXPERIMENT

A. Experiment Setting

In this section, we demonstrate the effectiveness of the HLSPilot framework for automatically generating and optimizing
It is generally accepted that a custom hardware implementation of a set of computations will provide supe-
rior speed and energy-efficiency relative to a software implementation. However, the cost and difficulty of
hardware design is often prohibitive, and consequently, a software approach is used for most applications.
In this paper, we introduce a new high-level synthesis tool called LegUp that allows software techniques to
be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the
program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware
accelerators that communicate through a standard bus interface. In the hybrid processor/accelerator archi-
tecture, program segments that are unsuitable for hardware implementation can execute in software on the
processor. LegUp can synthesize most of the C language to hardware, including fixed-sized multi-dimensional
arrays, structs, global variables and pointer arithmetic. Results show that the tool produces hardware so-
lutions of comparable quality to a commercial high-level synthesis tool. We also give results demonstrating
the ability of the tool to explore the hardware/software co-design space by varying the amount of a program
that runs in software vs. hardware. LegUp, along with a set of benchmark C programs, is open source and
freely downloadable, providing a powerful platform that can be leveraged for new research on a wide range
of high-level synthesis topics.
Categories and Subject Descriptors: B.7 [Integrated Circuits]: Design Aids
General Terms: Design, Algorithms
Additional Key Words and Phrases: High-level synthesis, field-programmable gate arrays, FPGAs, synthesis,
performance, power, hardware/software co-design
1. INTRODUCTION
Two approaches are possible for implementing computations: software (running on a stan-
dard processor) or hardware (custom circuits). A hardware implementation can provide
a significant improvement in speed and energy-efficiency versus a software implementa-
tion (e.g. [Cong and Zou 2009; Luu et al. 2009]). However, hardware design requires writing
complex RTL code, which is error prone and can be notoriously difficult to debug. Software
design, on the other hand, is comparatively straightforward, and mature debugging and
analysis tools are freely accessible. Despite the apparent energy and performance benefits,
hardware design is simply too difficult and costly for most applications, and a software
approach is preferred.
This work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada,
and Altera Corporation.
The authors are with the Dept. of Electrical and Computer Engineering, University of Toronto, Toronto,
ON M5S 3G4 CANADA. T. Czajkowski is with the Altera Toronto Technology Centre, Toronto, ON M5S
1S4 CANADA. E-mail: legup@eecg.toronto.edu
In this paper, we propose LegUp – an open source high-level synthesis (HLS) framework
we have developed that aims to provide the performance and energy benefits of hardware,
while retaining the ease-of-use associated with software. LegUp automatically compiles a
standard C program to target a hybrid FPGA-based software/hardware system-on-chip,
where some program segments execute on an FPGA-based 32-bit MIPS soft processor and
other program segments are automatically synthesized into FPGA circuits – hardware ac-
celerators – that communicate and work in tandem with the soft processor. Since the first
FPGAs appeared in the mid-1980s, access to the technology has been restricted to those
with hardware design skills. However, according to labor statistics, software engineers out-
number hardware engineers by more than 10X in the U.S. [United States Bureau of Labor
Statistics 2010]. An overarching goal of LegUp is to broaden the FPGA user base to include
software engineers, thereby expanding the scope of FPGA applications and growing the size
of the programmable hardware market – a goal we believe will keenly interest commercial
FPGA vendors and the embedded systems community.
The decision to include a soft processor in the target system is based on the notion that
not all C program code is appropriate for hardware implementation. Inherently sequential
computations are well-suited for software (e.g. traversing a linked list); whereas, other com-
putations are ideally suited for hardware (e.g. addition of integer arrays). Incorporating
a processor into the target platform also offers the advantage of increased high-level lan-
guage coverage – program segments that use restricted C language constructs can execute
on the processor (e.g. calls to malloc/free). We note that most prior work on high-level
hardware synthesis has focused on pure hardware implementations of C programs, not a
hybrid software/hardware system.
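To make the intended hardware/software split concrete, the following fragment is our own illustrative C (not taken from the LegUp distribution): pointer-chasing over a linked list is inherently sequential and suits the soft processor, while element-wise addition of integer arrays exposes independent iterations that a synthesis tool can unroll or pipeline.

#include <stddef.h>

struct node { int value; struct node *next; };

/* Well-suited to software on the MIPS soft processor:
 * each step depends on the pointer loaded in the previous step. */
int list_sum(const struct node *head) {
    int sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Well-suited to a hardware accelerator: independent iterations. */
void array_add(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}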
LegUp is written in modular C++ to permit easy experimentation with new HLS algo-
rithms. We leverage the state-of-the-art LLVM (low-level virtual machine) compiler frame-
work for high-level language parsing and its standard compiler optimizations [LLVM 2010],
and we implement hardware synthesis as new back-end compiler passes within LLVM. The
LegUp distribution includes a set of benchmark C programs [Hara et al. 2009] that the user
can compile to pure software, pure hardware, or a combined hardware/software system. For
the hardware portions, LegUp produces RTL code that can be synthesized using standard
commercial synthesis tools. In this paper, we present an experimental study demonstrat-
ing that LegUp produces hardware implementations of comparable quality to a commercial
tool [Y Explorations (XYI) 2010]. We also give results illustrating LegUp’s ability to effec-
tively explore the design space between a pure software implementation and pure hardware
implementation of a given program.
While the promise of high-level hardware synthesis has been touted for decades (consider
that Synopsys introduced its Behavioral Compiler tool in 1996), the technology has yet to
be embraced broadly by industry. We believe its widespread adoption has been impeded by
a number of factors, including a lack of comprehensive C/C++ language support, and, in
some cases, the use of non-standard languages (e.g., [Huang et al. 2008]). While a number
of research groups have developed high-level hardware synthesis tools, few have gained
sustained traction in the research community and the tools have been kept proprietary in
many cases. The open source nature of LegUp is a key differentiator relative to prior work.
Prior high-quality open source EDA projects have had a tremendous impact in spurring
new research advances. As an example, the VPR system has enabled countless studies on
FPGA architecture, packing, placement, and routing [Betz and Rose 1997]. Similarly, the
ABC logic synthesis system has reinvigorated low-level logic synthesis research [Mishchenko
et al. 2006]. High-level hardware synthesis and application-specific processor design can
likewise benefit from the availability of a robust publicly-accessible framework such as LegUp
– a framework used and contributed to by researchers around the world. In fact, at the time
of acceptance, the tool has been downloaded over 350 times by research groups around the
world (since March 2011).
A key usage scenario for the LegUp tool is in the area of FPGA-based embedded systems
design, which frequently include a soft processor [Wayne Marx 2008]. LegUp can improve
computational throughput and energy-efficiency of such systems by allowing computations
to be migrated from the processor to custom hardware. In addition, since LegUp can also
synthesize a program (or a subset of its constituent functions) to pure hardware, it can be
applied to implement the hardware accelerators in a “server style” processor/accelerator
platform, where a high-end processor communicates with FPGA-based accelerators over a
PCIe bus. While the server scenario is certainly possible, it is the embedded systems usage
model that is explored more heavily in this paper.
A preliminary version of a portion of this work appears in [Canis et al. 2011]. In this
extended journal version, we elaborate on all aspects of the proposed framework, including
background on the intermediate representation (IR) within the LLVM compiler, and how
programs represented in the IR are synthesized to hardware circuits. We describe the pro-
cessor/accelerator interconnection approach in further detail, as well as provide additional
information on the benchmark suite and debugging capabilities. Circuit-by-circuit experi-
mental results for speed, area and power are also included (whereas, only average data was
included in the 4-page conference version). We also describe how LegUp can be modified to
support different FPGA architectures, implement a new scheduling algorithm, and support
parallel accelerators.
The remainder of this paper is organized as follows: Section 2 presents related work.
Section 3 introduces the target hardware architecture and outlines the high-level design
flow. The details of the high-level synthesis tool and software/hardware partitioning are
described in Section 4. An experimental evaluation appears in Section 5. Section 6 presents
three case studies that serve to demonstrate the extensibility of the LegUp tool: 1) to
target an alternate FPGA device, 2) to evaluate a different scheduling algorithm, and 3) to
support concurrently running accelerators. Conclusions and suggestions for future work are
given in Section 7.
2. RELATED WORK
2.1. High-Level Synthesis
Automatic compilation of a high-level language program to silicon has been a decades-long
quest in the EDA field, with early seminal work done in the 1980s. We highlight several
recent efforts, with emphasis on tools that target FPGAs.
Several HLS tools have been developed for targeting specific applications. GAUT is a
high-level synthesis tool that is designed for DSP applications [Coussy et al. 2010]. GAUT
synthesizes a C program into an architecture with a processing unit, a memory unit, and
a communication unit, and requires that the user supply specific constraints, such as the
pipeline initiation interval.
ROCCC is an open source high level synthesis tool that can create hardware accelerators
from C [Villarreal et al. 2010]. ROCCC is designed to accelerate critical kernels that perform
repeated computation on streams of data, for instance DSP applications such as FIR filters.
ROCCC does not support several commonly-used aspects of the C language, such as generic
pointers, shifting by a variable amount, non-for loops, and the ternary operator. ROCCC
has a bottom-up development process that involves partitioning one’s application into mod-
ules and systems. Modules are C functions that are converted into computational datapaths
with no FSM, with loops fully unrolled. These modules cannot access memory but have data
pushed to them and output scalar values. Systems are C functions that instantiate modules
to repeat computation on a stream of data or a window of memory, and usually consist of
a loop nest with special function parameters for streams. ROCCC supports advanced op-
timizations such as systolic array generation, temporal common subexpression elimination,
and it can generate Xilinx PCore modules to be used with a Xilinx MicroBlaze proces-
sor. However, ROCCC’s strict subset of C is insufficient for compiling any of the CHStone
benchmarks used in this study and described in Section 4.5. Broadly speaking, ROCCC
works and excels for a specific class of applications (streaming-oriented applications), but it
is not a general C-to-hardware compiler. By supporting the CHStone benchmarks, LegUp
provides researchers with the opportunity to compile larger C programs than is possible
with ROCCC.
General (application-agnostic) tools have also been proposed in recent years. CHiMPS
is a tool developed by Xilinx and the University of Washington that synthesizes programs
into a many cache architecture, taking advantage of the abundant small block RAMs avail-
able throughout the FPGA fabric [Putnam et al. 2008]. LiquidMetal is a compiler being
developed at IBM Research comprising a HLS compiler and a new (non-standard) language,
LIME, that incorporates hardware-specific constructs, such as bitwidth specification on in-
tegers [Huang et al. 2008]. xPilot is a tool that was developed at UCLA [Cong et al. 2006]
and used successfully for a number of HLS studies (e.g., [Chen and Cong 2004]). Trident is
a tool developed at Los Alamos National Labs, with a focus on supporting floating point
operations [Tripp et al. 2007]. xPilot and Trident have not been under active development
for several years and are no longer maintained.
Among prior academic work, the Warp Processor proposed by Vahid, Stitt and Lysecky
bears the most similarity to our framework [Vahid et al. 2008]. In a Warp Processor, soft-
ware running on a processor is profiled during its execution. The profiling results guide the
selection of program segments to be synthesized to hardware. Such segments are disassem-
bled from the software binary to a higher-level representation, which is then synthesized to
hardware [Stitt and Vahid 2007]. The software binary running on the processor is altered
automatically to leverage the generated hardware. We take a somewhat similar approach,
with the key differences being that we compile hardware from the high-level language source
code (not from a disassembled binary) and our tool is open source.
With regard to commercial tools, there has been considerable activity in recent years,
both in start-ups and major EDA vendors. Current offerings include AutoPilot from Au-
toESL [AutoESL ] (a commercial version of xPilot, recently acquired by Xilinx, Inc.), Cata-
pult C from Mentor Graphics [Mentor Graphics 2010], C2R from CebaTech [CebaTech 2010],
eXCite from Y Explorations [Y Explorations (XYI) 2010], CoDeveloper from Impulse Ac-
celerated Technologies [Impulse 2010], Cynthesizer from Forte [Forte 2010], and C-to-Silicon
from Cadence [Cadence 2010]. In our experience, attaining a binary executable for evalu-
ation has not been possible for most tools.
Also on the commercial front is Altera’s C2H tool [Altera, Corp. 2009]. C2H allows a
user to partition a C program’s functions into a hardware set and a software set, where
the software-designated functions execute on a Nios II soft processor, and the hardware-
designated functions are synthesized into custom hardware accelerators that connect to the
Nios II through an Avalon interface (Altera’s on-chip interconnect standard). The C2H
target system architecture closely resembles that targeted by our tool.
Table I shows the release status of each non-commercial tool surveyed above, indicating
whether each is: 1) open source, 2) binary only (i.e., only the binary is publicly available),
or 3) no source or binary available. Tools in category #2 cannot be modified by the research
community to explore new HLS algorithms or new processor/accelerator design styles. Re-
sults produced by tools in category #3 cannot be independently replicated. In the open
source category, the Trident tool was based on an early version of LLVM; however, it has not been actively maintained for several years, and it targeted pure hardware and not a hybrid hardware/processor architecture. ROCCC is actively being worked on; however, it targets a feed-forward pipeline hardware architecture model. To our knowledge, there is currently no open source HLS tool that compiles a standard C program to a hybrid processor/accelerator system architecture, where the synthesized hardware follows a general datapath/state machine model. By supporting nearly all of the commonly-used aspects of the C language, as evidenced by the CHStone benchmark programs [Hara et al. 2009], LegUp provides researchers with the infrastructure needed to compile larger and more general C programs than those supported by ROCCC. Section 6 describes case studies that demonstrate the tool's extensibility.

[Fig. 1: LegUp design flow. The C program is compiled by the MIPS C compiler to run on a self-profiling MIPS processor; profiling data (execution cycles, power, cache misses) suggest program segments to target to hardware; LegUp high-level synthesis hardens those program segments for the FPGA fabric, and the software binary is altered to call the hardware accelerators.]
3. LEGUP OVERVIEW
In this section, we provide a high-level overview of the LegUp design flow and its target
architecture. Algorithmic and implementation details follow in Section 4.
[Fig. 2: LegUp target system architecture. Within the FPGA, a MIPS processor and hardware accelerators are connected through the Avalon interconnect to an on-chip cache and a memory controller that accesses off-chip memory.]
The architecture depicted in Figure 2 represents the target system most natural for an
initial release of the tool. We expect the shared memory to become a bottleneck if many
processors and accelerators are included in the system. The architecture of processor/ac-
celerator systems is an important direction for future research – research enabled by a
framework such as LegUp – with key questions being the investigation of the best on-chip
connectivity and memory architecture. Moreover, in our initial release, the processor and
accelerators share a single clock signal. Multi-clock domain processor/accelerator systems-on-chip are an important avenue to explore.
4.1.1. Low-Level Virtual Machine (LLVM). LegUp leverages the low-level virtual machine
(LLVM) compiler framework – the same framework used by Apple for iPhone/iPad ap-
plication development. At the core of LLVM is an intermediate representation (IR), which
is essentially machine-independent assembly language. C code is translated into LLVM’s
IR then analyzed and modified by a series of compiler optimization passes. Current re-
sults show that LLVM produces code of comparable quality to gcc for x86-based processor
architectures.
Consider an 8-tap finite impulse response (FIR) filter whose output, y[n], is a weighted
sum of the current input sample, x[n] and seven previous input samples. The C code for
calculating the FIR response is given in Figure 3. The unoptimized LLVM IR corresponding
to this C code is given in Figure 4. We highlight a few key elements of the IR here. The LLVM
IR is in single static assignment (SSA) form, which prohibits variable re-use, guaranteeing
a 1-to-1 correspondence between an instruction and its destination register. Register names
in the IR are prefixed by %. Types are explicit in the IR. For example, i32 specifies a 32-bit
integer type and i32* specifies a pointer to a 32-bit integer.
y[n] = 0;
for(i = 0; i < 8; i++) {
y[n] += coeff[i] * x[n - i];
}
In the example IR for the FIR filter in Figure 4, line 1 marks the beginning of a basic
block called entry. A basic block is a contiguous set of instructions with a single entry (at
its beginning) and exit point (at its end). Lines 2 and 3 initialize y[n] to 0. Line 4 is an
unconditional branch to a basic block called bb1 that begins on line 5. phi instructions
are needed to handle control flow-dependent variables in SSA form. For example, the phi
instruction on line 6 assigns loop index register %i to 0 if the previous basic block was
entry; otherwise, %i is assigned to register %i.new, which contains the incremented %i from
the previous loop iteration. Line 7 initializes a pointer to the coefficient array. Lines 8 and
9 initialize a pointer to the sample array x. Lines 10-12 load the sum y[n], sample and
coefficient into registers. Lines 13 and 14 perform the multiply-accumulate. The result is
stored in line 15. Line 16 increments the loop index %i. Lines 17 and 18 compare %i with
loop limit (8) and branch accordingly.
Observe that LLVM instructions are simple enough to directly correspond to hardware
operations (e.g., a load from memory, or an arithmetic computation). Our HLS tool operates
directly with the LLVM IR, scheduling the instructions into specific clock cycles (described
below).
Scheduling operations in hardware requires knowing data dependencies between opera-
tions. Fortunately, the SSA form of the LLVM IR makes this easy. For example, the multiply
instruction (mul) on line 13 of Figure 4 depends on the results of two load instructions on
lines 11 and 12. Memory data dependencies are more problematic to discern; however, LLVM
includes alias analysis – a compiler technique for determining which memory locations a
pointer can reference. In Figure 4, the store on line 15 has a write-after-read dependency
with the load on line 10, but has no memory dependencies with the loads on lines 11 and
12. Alias analysis can determine that these instructions are independent and can therefore
be performed in parallel.
Transformations and optimizations in the LLVM framework are structured as a series
of compiler passes. Passes include optimizations such as dead code elimination, analysis
passes such as alias analysis, and back-end passes that produce assembly for a particular
target machine (e.g. MIPS or ARM). The infrastructure is flexible, allowing passes to be
reordered, substituted with alternatives, and disabled. LegUp HLS algorithms have been
implemented as LLVM passes that fit into the existing framework. Implementing the HLS
steps as distinct passes also allows easy experimentation with different HLS algorithms. For
example, one could modify LegUp to “plug in” a new scheduling algorithm and study its
impact on quality of results.
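As a rough illustration of this "plug in a pass" workflow, the skeleton below shows a minimal legacy-pass-manager FunctionPass in the style of the LLVM documentation. It is our own sketch, not code from LegUp: the header paths and registration follow recent LLVM releases rather than the LLVM version LegUp builds against, and HelloScheduler is a hypothetical name.

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// Hypothetical pass: a place where an alternative HLS step
// (e.g. a new scheduler) could inspect each function's IR.
struct HelloScheduler : public FunctionPass {
  static char ID;
  HelloScheduler() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    errs() << "scheduling function: " << F.getName() << "\n";
    // ... analyze basic blocks, assign operations to states ...
    return false; // the IR is not modified in this sketch
  }
};
} // namespace

char HelloScheduler::ID = 0;
static RegisterPass<HelloScheduler>
    X("hello-scheduler", "Illustrative HLS scheduling pass skeleton");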
4.1.2. Device Characterization. For a given FPGA family, LegUp includes scripts to pre-
characterize the hardware operation corresponding to each LLVM instruction for all sup-
ported bitwidths (typically, 8, 16, 32, 64). The scripts synthesize each operation in isolation
for the target FPGA family to determine the propagation delay, required number of logic
elements, registers, multiplier blocks, and power consumption. This characterization data
allows LegUp to make early predictions of circuit speed and area for the hardware acceler-
ators and also to aid scheduling and binding.
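One plausible shape for the pre-characterization data is a table keyed by operation and bitwidth that the scheduler and binder consult for delay and area estimates. The sketch below is our own illustration, not LegUp's actual data structures, and the numbers are placeholders rather than measured values.

// Placeholder characterization record for one operation at one bitwidth.
struct OpCharacterization {
    double delay_ns;      // propagation delay from isolated synthesis
    int    logic_elements;
    int    registers;
    int    multiplier_blocks;
    double power_mw;      // estimated power
};

enum OpKind { OP_ADD, OP_MUL, OP_DIV, OP_LOAD, NUM_OPS };
enum Width  { W8, W16, W32, W64, NUM_WIDTHS };

// Example table with made-up entries; a real flow would fill this
// from the per-device characterization scripts.
static const OpCharacterization kCharTable[NUM_OPS][NUM_WIDTHS] = {
    /* OP_ADD */ {{1.2, 8, 8, 0, 0.1}, {1.6, 16, 16, 0, 0.2},
                  {2.1, 32, 32, 0, 0.4}, {2.9, 64, 64, 0, 0.8}},
    // ... remaining operations elided ...
};

inline double estimated_delay(OpKind op, Width w) {
    return kCharTable[op][w].delay_ns;
}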
4.1.3. Allocation. The purpose of allocation is to determine the amount of hardware that
may be used to implement the circuit. LegUp reads allocation information from a configura-
tion Tcl file, which specifies the target FPGA device and the resource limits for the device,
1: entry:
2: %y.addr = getelementptr i32* %y, i32 %n
3: store i32 0, i32* %y.addr
4: br label %bb1
5: bb1:
6: %i = phi i32 [ 0, %entry ], [ %i.new, %bb1 ]
7: %coeff.addr = getelementptr [8 x i32]* %coeff,
i32 0, i32 %i
8: %x.ind = sub i32 %n, %i
9: %x.addr = getelementptr i32* %x, i32 %x.ind
10: %0 = load i32* %y.addr
11: %1 = load i32* %coeff.addr
12: %2 = load i32* %x.addr
13: %3 = mul i32 %1, %2
14: %4 = add i32 %0, %3
15: store i32 %4, i32* %y.addr
16: %i.new = add i32 %i, 1
17: %exitcond = icmp eq i32 %i.new, 8
18: br i1 %exitcond, label %return, label %bb1
19:return:
e.g. the number of available multiplier blocks. In general, LegUp HLS operates as though
an unlimited amount of resources are available in the target FPGA. The reason for this is
that resource sharing (i.e. using a single hardware unit to implement multiple operations
within the program being synthesized) requires adding multiplexers to the input ports of a
shared hardware unit, and multiplexers are costly to implement in FPGAs. For example, a
32-bit adder can be implemented using 32 4-input LUTs (and associated carry logic), and
32 2-to-1 multiplexers also require 32 4-input LUTs – the same number of LUTs as the
adder itself! Thus, for the allocation step, LegUp does the following:
— Multiply: Hard multiplier blocks in the FPGA fabric are used. Sharing multipliers is only
done when the benchmark being synthesized requires more multipliers than are available
in the FPGA.
— Divide/Modulus: These operations are implemented with LUTs, and consume significant
area. Therefore, we set the number of divide/remainder units to be the maximum number
used in any cycle of the schedule. Multiplexers are added to the input ports of the unit(s)
to facilitate the resource sharing (described below in the binding section).
4.1.4. Scheduling. Scheduling is the task of assigning operations to clock cycles and building
a finite state machine (FSM). A control flow graph (CFG) of a program is a directed graph
where basic blocks are represented by vertices and branches are represented by edges. For
example, given two basic blocks, b1 and b2 , b1 has an edge to b2 in the CFG if b1 can
branch to b2 . We can think of a CFG as a coarse representation of the FSM needed to
control the hardware being synthesized – the nodes and edges are analogous to those of a
state diagram. What is not represented in this coarse FSM are data dependencies between
operations within a basic block and the latencies of operations (e.g., a memory access may
take more than a single cycle).
Having constructed the coarse FSM from the CFG, LegUp then schedules each basic block
individually, which amounts to splitting each node in the CFG into multiple nodes, each
corresponding to one FSM state (clock cycle). The initial release of LegUp uses as-soon-as-
possible (ASAP) scheduling [Gajski et al. 1992], which assigns an instruction to
the first state after all of its dependencies have been computed. Traversing basic blocks, and
visiting the instructions within each basic block in order, the operands for each instruction
are either: 1) from this basic block and therefore guaranteed to have already been assigned
a state, or 2) from outside this basic block, in which case we can safely assume they will be
available before control reaches this basic block. Note that our scheduler properly handles
instructions with multi-cycle latencies, such as pipelined divides or memory accesses.
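The following is a simplified sketch of ASAP scheduling in the spirit described above. It is our own illustration over a toy dependence representation, not LegUp's implementation, and it ignores chaining and resource constraints such as the single-ported memory controller.

#include <vector>
#include <algorithm>

// Toy instruction: indices of operand-producing instructions in the same
// basic block, plus a fixed latency in cycles (e.g. 2 for loads).
struct Instr {
    std::vector<int> deps;
    int latency;
};

// ASAP: each instruction starts in the first state after all of its
// in-block dependencies have finished; operands defined outside the
// block are assumed available at state 0.
std::vector<int> asap_schedule(const std::vector<Instr> &block) {
    std::vector<int> state(block.size(), 0);
    for (size_t i = 0; i < block.size(); ++i) {
        for (int d : block[i].deps)
            state[i] = std::max(state[i], state[d] + block[d].latency);
    }
    return state; // state[i] = FSM state (clock cycle) of instruction i
}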
In some cases, we can schedule an instruction into the same state as one of its operands.
This is called operation chaining. We perform chaining in cases where the estimated delay of
the chained operations (from allocation) does not exceed the estimated clock period for the
design. Chaining can reduce hardware latency (# of cycles for execution) and save registers
without impacting the final clock period.
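A minimal sketch of the chaining test implied above (our own illustration; the delay and clock-period figures would come from the characterization data): two dependent operations may share a state only if their combined combinational delay fits within the estimated clock period.

// Returns true if operation B may be chained into the same FSM state as
// its producer A without stretching the estimated clock period.
inline bool can_chain(double delay_a_ns, double delay_b_ns,
                      double clock_period_ns) {
    return delay_a_ns + delay_b_ns <= clock_period_ns;
}

// Example: a 2.1 ns add chained with a 1.6 ns compare under a 5 ns clock
// is allowed, since 2.1 + 1.6 <= 5.0.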
Fig. 5 is a Gantt chart showing the ASAP schedule of the FIR filter instructions shown
in Fig. 4. The chart shows the same LLVM instructions, now organized into nine states.
Data dependencies between operations are shown; in this case we do not allow operation
chaining (for clarity). Load instructions have a two cycle latency, allowing us to pipeline
our memory controller for higher speed performance. Once a load has been issued, a new
load can be issued on the next cycle. Because our memory controller is single ported, only
one load can be performed every cycle.
4.1.5. Binding. Binding comprises two tasks: assigning operators from the program being
synthesized to specific hardware units (operation assignment), and assigning program vari-
ables to registers (register allocation). When multiple operators are assigned to the same
hardware unit, or when multiple variables are bound to the same register, multiplexers are
required to facilitate the sharing. We make two FPGA-specific observations in our approach
to binding. First, multiplexers are relatively expensive to implement in FPGAs using LUTs.
Consequently, there is little advantage to sharing all but the largest functional units, namely,
multipliers and dividers. Likewise, the FPGA fabric is register rich – each logic element in
the fabric has a LUT and a register. Therefore, sharing registers is rarely justified.
We have three goals when binding operations to shared functional units. First, we would
like to balance the sizes of the multiplexers across functional units to keep circuit perfor-
mance high. Multiplexers with more inputs have higher delay, so it is desirable to avoid
having a functional unit with a disproportionately large multiplexer on its input. Second,
we want to recognize cases where we have shared inputs between operations, letting us
save a multiplexer if the operations are assigned to the same functional unit. Lastly, during
binding if we can assign two operations that have non-overlapping lifetime intervals to the
same functional unit, we can use a single output register for both operations. In this case
we save a register, without needing a multiplexer. We use the LLVM live variable analysis
pass to check for the lifetime intervals.
To account for these goals we use the following cost function to measure the benefit of
assigning operation op to function unit fu:
Cost(op, fu) = φ · existingMuxInputs(fu) + β · newMuxInputs(op, fu)
               − θ · outputRegisterSharable(op, fu)                    (1)
where φ = 0.1, β = 1, and θ = 0.5 to give priority to saving new multiplexer inputs, then
output registers, and finally balancing the multiplexers. Notice that sharing the output
register reduces the cost, while the other factors increase it.
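Equation (1) translates directly into code. The sketch below is our own rendering of it with the stated weights; the three inputs would be computed from the current partial binding, and the function name is illustrative rather than LegUp's.

// Cost of assigning operation `op` to functional unit `fu`, following Eq. (1).
// existing_mux_inputs: inputs already multiplexed onto fu
// new_mux_inputs:      additional mux inputs this assignment would add
// output_reg_sharable: 1 if op can reuse fu's output register
//                      (non-overlapping lifetimes), 0 otherwise
double binding_cost(int existing_mux_inputs, int new_mux_inputs,
                    int output_reg_sharable) {
    const double phi = 0.1, beta = 1.0, theta = 0.5;
    return phi * existing_mux_inputs
         + beta * new_mux_inputs
         - theta * output_reg_sharable;
}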
The initial release of LegUp uses a weighted bipartite matching heuristic to solve the
binding problem [Huang et al. ]. The binding problem is represented using a bipartite graph
with two vertex sets. The first vertex set corresponds to the operations being bound (i.e.
LLVM instructions). The second vertex set corresponds to the available functional units.
A weighted edge is introduced from a vertex in the first set to a vertex in the second set
if the corresponding operation is a candidate to be bound to the corresponding functional
unit. We set the cost (edge weight) of assigning an operation to a functional unit using (1).
Weighted bipartite matching can be solved optimally in polynomial time using the well-
known Hungarian method [Kuhn 2010]. We formulate and solve the matching problem one
clock cycle at a time until the operations in all clock cycles (states) have been bound.
LegUp automatically generates the multiplexing logic to interpret the tags and steer memory
requests. Tag 000000000 is reserved for the NULL pointer, and tag 000000001 indicates that
the memory access should be steered to the shared memory. The remaining 510 different
tags can be used to differentiate between up to 510 local accelerator memories. Using 9 bits
for the tag implies that 23 bits are available for encoding the address. The decision to use
9-bit tags in the initial release of LegUp was taken because the Altera DE2 board contains
an 8 MB SDRAM which is fully addressable using 23 bits. It is straightforward to change
LegUp to use a different tag width if desired.
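A 32-bit pointer under this scheme can be pictured as a 9-bit tag concatenated with a 23-bit offset. The helper functions below show the corresponding encode/decode arithmetic; the names are illustrative and the constants simply restate the widths described above.

    // Illustrative encode/decode of tagged pointers as described above:
    // bits [31:23] = 9-bit tag, bits [22:0] = 23-bit offset (covers the 8 MB SDRAM).
    #include <cstdint>

    constexpr uint32_t TAG_BITS    = 9;
    constexpr uint32_t OFFSET_BITS = 23;
    constexpr uint32_t OFFSET_MASK = (1u << OFFSET_BITS) - 1;   // 0x007FFFFF

    constexpr uint32_t TAG_NULL   = 0;   // reserved for the NULL pointer
    constexpr uint32_t TAG_SHARED = 1;   // steer the access to the shared memory

    uint32_t makeTaggedPointer(uint32_t tag, uint32_t offset) {
      return (tag << OFFSET_BITS) | (offset & OFFSET_MASK);
    }

    uint32_t tagOf(uint32_t ptr)    { return ptr >> OFFSET_BITS; }
    uint32_t offsetOf(uint32_t ptr) { return ptr & OFFSET_MASK; }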
6.7% overhead on the MIPS processor area when configured to track up to 32 functions
using 32-bit counters. Complete details on the profiler, including how it can be extended
to profile energy consumption, are omitted for lack of space, but can be found in [Aldham
et al. 2011].
Supported Unsupported
Functions Dynamic Memory
Arrays, Structs Floating Point
Global Variables Recursion
Pointer Arithmetic
Unlike many HLS tools, synthesis of fixed-size multi-dimensional arrays, structs, global vari-
ables, and pointer arithmetic are supported by LegUp. Regarding structs, LegUp supports
structs with arrays, arrays of structs, and structs containing pointers. LegUp stores structs
in memory using the ANSI C alignment standards. Functions that return a struct, dynamic
memory allocation, recursion and floating point arithmetic are unsupported in the initial
release of the tool.
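The fragment below illustrates the kind of constructs listed as supported (a global array, a struct containing both a fixed-size array and a pointer, and pointer arithmetic). It is a generic example written for this text, not a program taken from the LegUp distribution.

    /* Generic example of constructs in the supported C subset: global
     * variables, structs with arrays, structs containing pointers, and
     * pointer arithmetic. No dynamic memory, recursion, or floating point. */
    typedef struct {
        int coeffs[4];     /* struct with a fixed-size array */
        int *data;         /* struct containing a pointer    */
    } Filter;

    int samples[8] = {1, 2, 3, 4, 5, 6, 7, 8};     /* global array           */
    Filter f = { {1, 2, 2, 1}, samples };          /* global struct variable */

    int weighted_sum(void) {
        int acc = 0;
        int *p = f.data;                           /* pointer arithmetic below */
        for (int i = 0; i < 4; i++)
            acc += f.coeffs[i] * *(p + i);
        return acc;
    }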
With the LegUp distribution, we include 13 benchmark C programs, summarized in
Table III. Included are all 12 programs in the CHStone high-level synthesis benchmark
suite [Hara et al. 2009], as well as Dhrystone – a standard integer benchmark. The pro-
grams represent a diverse set of computations falling into several categories: arithmetic,
encryption, media, processing and general. They range in size from 232-1692 lines of C
code. The arithmetic benchmarks implement 64-bit double-precision floating-point opera-
tions in software using integer types. Notice that the CHStone suite contains a benchmark
which is a software model of a MIPS processor (which we can then run on a MIPS processor).
A key characteristic of the benchmarks is that inputs and expected outputs are included
in the programs themselves. The presence of the inputs and golden outputs for each pro-
gram gives us assurance regarding the correctness of our synthesis results. Each benchmark
program performs computations whose results are then checked against golden values. This
is analogous to built-in self test in design-for-test methodology. No inputs (e.g. from the
keyboard or a file) are required to run the programs.
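The self-checking structure can be illustrated with a small generic program of the same shape (compute, compare against an embedded golden value, report pass or fail). This is just the pattern, not one of the CHStone benchmarks.

    // Generic self-checking benchmark skeleton: inputs and the expected
    // (golden) result are embedded in the program itself.
    #include <cstdio>

    static const int input[8]   = {3, 1, 4, 1, 5, 9, 2, 6};
    static const int golden_sum = 31;                 // known-correct result

    int compute(void) {
      int sum = 0;
      for (int i = 0; i < 8; i++) sum += input[i];
      return sum;
    }

    int main(void) {
      int result = compute();
      std::printf(result == golden_sum ? "PASS\n" : "FAIL\n");
      return result == golden_sum ? 0 : 1;
    }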
4.6. Debugging
The initial release of LegUp includes a basic debugging capability which consists of auto-
matically adding print statements into the LLVM IR to dump variable values at the end of
each basic block’s execution. When the IR is synthesized to hardware, the Verilog can be
simulated using ModelSim producing a log of variable value changes that can be directly
compared with an analogous log from a strictly software execution of a benchmark. We
found even this limited capability to be quite useful, as it allows one to pinpoint the first
LLVM instruction where computed values differ in hardware vs. software, aiding problem
diagnosis and debugging.
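The comparison step itself can be as simple as a line-by-line diff of the two value dumps. The sketch below finds the first line where a hardware-simulation log and a software-execution log disagree; the log format and default file names are hypothetical and are not those produced by LegUp's scripts.

    // Find the first divergence between a hardware (ModelSim) value dump and
    // the corresponding software dump, assuming one value per line in each log.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char** argv) {
      std::ifstream hw(argc > 1 ? argv[1] : "hw_modelsim.log");
      std::ifstream sw(argc > 2 ? argv[2] : "sw_native.log");

      std::string hwLine, swLine;
      for (long lineNo = 1; ; ++lineNo) {
        bool hwOk = static_cast<bool>(std::getline(hw, hwLine));
        bool swOk = static_cast<bool>(std::getline(sw, swLine));
        if (!hwOk || !swOk) {
          std::cout << "Logs match for " << lineNo - 1 << " lines\n";
          return 0;
        }
        if (hwLine != swLine) {
          std::cout << "First mismatch at line " << lineNo << ":\n"
                    << "  HW: " << hwLine << "\n  SW: " << swLine << "\n";
          return 1;
        }
      }
    }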
5. EXPERIMENTS
The goals of our experimental study are three-fold: 1) to demonstrate that the quality of
results (speed, area, power) produced by LegUp HLS is comparable to that produced by
a commercial HLS tool, eXCite [Y Explorations (XYI) 2010], 2) to demonstrate LegUp’s
ability to effectively explore the hardware/software co-design space, and 3) to compare the
quality of hardware vs. software implementations of the benchmark programs. We chose
eXCite because it was the only commercial tool we had access to that could compile the
benchmark programs. With the above goals in mind, we map each benchmark program
using 5 different flows, representing implementations with successively increasing amounts
of computation happening in hardware vs. software. The flows are as follows (labels appear
in parentheses):
(1) A software-only implementation running on the MIPS soft processor (MIPS-SW).
(2) A hybrid software/hardware implementation where the second most compute-intensive function (and its descendants) in the benchmark is implemented as a hardware accelerator, with the balance of the benchmark running in software on the MIPS processor (LegUp-Hybrid2).
(3) A hybrid software/hardware implementation where the most compute-intensive function (and its descendants) is implemented as a hardware accelerator, with the balance in software (LegUp-Hybrid1).
(4) A pure hardware implementation produced by LegUp (LegUp-HW).
(5) A pure hardware implementation produced by eXCite (eXCite-HW).
The two hybrid flows correspond to a system that includes the MIPS processor and a
single accelerator, where the accelerator implements a C function and all of its descendant
functions.
For the back-end of the flow, we use Quartus II ver. 9.1 SP2 to target the Cyclone II
FPGA. Quartus II was executed in timing-driven mode with all physical synthesis optimizations turned on. The correctness of the LegUp implementations was verified using
post-routed ModelSim simulations and also in hardware using the Altera DE2 board.
Table IV. Speed performance results for the five flows (Cycles: latency in clock cycles; Freq.: post-routed clock frequency in MHz; Time: total execution time in µs).
Benchmark | MIPS-SW: Cycles Freq. Time | LegUp-Hybrid2: Cycles Freq. Time | LegUp-Hybrid1: Cycles Freq. Time | LegUp-HW: Cycles Freq. Time | eXCite-HW: Cycles Freq. Time
adpcm 193607 74.26 2607 159883 61.61 2595 96948 57.19 1695 36795 45.79 804 21992 28.88 761
aes 73777 74.26 993 55014 54.97 1001 26878 49.52 543 14022 60.72 231 55679 50.96 1093
blowfish 954563 74.26 12854 680343 63.21 10763 319931 63.7 5022 209866 65.41 3208 209614 35.86 5845
dfadd 16496 74.26 222 14672 75.01 196 5649 77.41 73 2330 124.05 19 370 24.54 15
dfdiv 71507 74.26 963 15973 77.92 205 4538 65.92 69 2144 74.72 29 2029 43.95 46
dfmul 6796 74.26 92 10784 75.58 143 2471 79.14 31 347 85.62 4 223 49.17 5
dfsin 2993369 74.26 40309 293031 65.66 4463 80678 68.23 1182 67466 62.64 1077 49709 40.06 1241
gsm 39108 74.26 527 29500 61.46 480 18505 61.14 303 6656 58.93 113 5739 41.82 137
jpeg 29802639 74.26 401328 16072954 51.2 313925 15978127 46.65 342511 5861516 47.09 124475 3248488 22.66 143358
mips 43384 74.26 584 6463 75.51 86 6463 75.51 86 6443 90.09 72 4344 76.25 57
motion 36753 74.26 495 34859 73.34 475 17017 79.67 214 8578 91.79 93 2268 42.87 53
sha 1209523 74.26 16288 358405 77.40 4631 265221 75.76 3508 247738 86.93 2850 238009 62.48 3809
dhrystone 28855 74.26 389 25599 77.64 330 25509 76.99 331 10202 85.38 119 - - -
Geomean: 173332.0 74.26 2334.1 86258.3 67.10 1285.9 42700.5 65.65 650.3 20853.8 71.56 291.7 14594.4 40.87 357.1
Ratio: 1 1 1 0.50 0.90 0.55 0.25 0.88 0.28 0.12 0.96 0.12 0.08 0.55 0.15
Three metrics are employed to gauge quality of result: 1) circuit speed, 2) area, and
3) energy consumption. For circuit speed, we consider the cycle latency, clock frequency
and total execution time. Cycle latency refers to the number of clock cycles required for a
complete execution of a benchmark. Clock frequency refers to the reciprocal of the post-
routed critical path delay reported by Altera timing analysis. Total execution time is simply
the cycle latency multiplied by the clock period. For area, we consider the number of used
Cyclone II logic elements (LEs), memory bits, and 9x9 multipliers.
Energy is a key cost metric, as it directly impacts electricity costs, as well as influences
battery life in mobile settings. To measure energy, we use Altera’s PowerPlay power analyzer
tool, applied to the routed design. We gather switching activity data for each benchmark
through a post-route full delay simulation with Mentor Graphics’ ModelSim. ModelSim
produces a VCD (value change dump) file containing activity data for each design signal.
PowerPlay reads the VCD to produce a power estimate for each design. To compute the
total energy consumed by a benchmark for its computational work, we multiply the average
core dynamic power reported by PowerPlay with the benchmark’s total execution time.
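Stated as a formula, with hypothetical numbers used purely for illustration:

    E_bench = P_core,dyn × t_exec,   e.g. 100 mW × 2 ms = 200 µJ.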
5.1. Results
Table IV presents speed performance results for all circuits and flows. Three data columns
are given for each flow: Cycles contains the latency in number of clock cycles; Freq presents the post-routed clock frequency in MHz; Time gives the total execution time in µs (Cycles/Freq). The flows are presented in the order specified above, from pure software on
the left, to pure hardware on the right. The second last row of the table contains geometric
mean results for each column. The dhrystone benchmark was excluded from the geomean
calculations, as eXCite was not able to compile this benchmark. The last row of the table
presents the ratio of the geomean relative to the software flow (MIPS-SW ).
Beginning with the MIPS-SW flow, the data in Table IV indicates that the processor runs
at 74 MHz on the Cyclone II and the benchmarks take between 6.7K-29M cycles to complete
their execution. In terms of program execution time, this corresponds to a range of 92-401K
µS4 . In the LegUp-Hybrid2 flow, where the second most compute-intensive function (and
its descendants) is implemented as a hardware accelerator, the number of cycles needed for
execution is reduced by 50% compared with software, on average. The Hybrid2 circuits run
at 10% lower frequency than the processor, on average. Overall, LegUp-Hybrid2 provides
a 45% (1.8×) speed-up in program execution time vs. software (MIPS-SW ). Moving onto
the LegUp-Hybrid1 flow, which represents additional computations in hardware, Table IV
4 As a comparison, we also ran the benchmarks on the Altera NIOS II/f (fast) soft processor and found the
NIOS II performance to be about twice as fast as Tiger MIPS. Note, however, that NIOS II is not open
source, has a 6-stage pipeline and is specially tuned for Altera devices, whereas, Tiger MIPS has a 5-stage
pipeline and is not optimized for any particular FPGA device architecture.
shows that cycle latency is 75% lower than software alone. However, clock speed is 12%
worse for this flow, which when combined with latency, results in a 72% reduction in program
execution time vs. software (a 3.6× speed-up over software). Looking broadly at the data for
MIPS-SW, LegUp-Hybrid1 and LegUp-Hybrid2, we observe a trend: execution time decreases
substantially as more computations are mapped to hardware. Note that the MIPS processor
would certainly run at a higher clock speed on a 40/45 nm FPGA, e.g. Stratix IV, however
the accelerators would also speed-up commensurately.
The two right-most flows in Table IV correspond to pure hardware implementations. Ob-
serve that benchmark programs mapped using the LegUp-HW flow require just 12% of the
clock cycles of the software implementations, on average, yet they run at about the same
speed in MHz. When benchmarks are mapped using eXCite-HW, even fewer clock cycles are
required to complete their execution – just 8% of that required for software implementations.
However, implementations produced by eXCite run at 45% lower clock frequency than the
MIPS processor, on average. LegUp produces heavily pipelined hardware implementations,
whereas we believe eXCite does more operation chaining, resulting in fewer computation cycles yet longer critical path delays. Considering total execution time of a benchmark, LegUp
and eXCite offer similar results. LegUp-HW provides an 88% execution time improvement
vs. software (8× speed-up); eXCite-HW provides an 85% improvement (6.7× speed-up).
Both of the pure hardware implementations are a significant win over software. The most
favorable LegUp results were for the dfdiv and dfsin benchmarks, for which the speed-up
over pure software was over 30×. The benchmark execution times of LegUp implementa-
tions relative to eXCite are comparable, which bodes well for our framework and gives us
assurance that it produces implementations of reasonable quality.
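The percentage improvements and the quoted speed-ups are two views of the same ratio: a fractional reduction r in execution time corresponds to a 1/(1 − r) speed-up,

    speed-up = t_SW / t_HW = 1 / (1 − r),   1/(1 − 0.88) ≈ 8×,   1/(1 − 0.85) ≈ 6.7×,

which matches the LegUp-HW and eXCite-HW figures quoted above.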
Observe that neither of the hybrid scenarios provide a performance win over pure hard-
ware for these particular benchmark circuits. Moreover, none of the benchmarks use C
language constructs that are unsupported by LegUp. Nevertheless, the hybrid scenarios
do serve to demonstrate LegUp’s ability to synthesize working systems that contain both
hardware and software aspects.
It is worth highlighting a few anomalous results in Table IV. Comparing LegUp-HW
with eXCite-HW for the benchmark aes, LegUp’s implementation provides a nearly 5×
improvement over eXCite in terms of execution time. Conversely, for the motion benchmark,
LegUp’s implementation requires nearly 4× more cycles than eXCite’s implementation. We
believe such differences lie in the extent of pipelining used by LegUp vs. eXCite, especially
for arithmetic operations such as division. In LegUp, we pipeline arithmetic units to the
maximum extent possible, leading to higher cycle latencies, and improved clock periods.
Area results are provided for each circuit in Table V. For each flow, three columns provide
the number of Cyclone II logic elements (LEs), the number of memory bits used (# bits),
as well as the number of 9x9 multipliers (Mults). As in the performance data above, the
geometric mean and ratios relative to MIPS software alone are given in the last two rows
of Table V. Observe that some columns contain a 0 for one or more circuits, invalidating
the geomean calculation. To calculate the geomean for such columns, the 0’s were taken to
be 1’s5 .
Beginning with the area of the MIPS processor, the data in Table V shows it requires
12.2K LEs, 226K memory bits, and 16 multipliers. The hybrid flows include both the MIPS
processor, as well as custom hardware, and consequently, they consume considerably more
area. When the LegUp-Hybrid2 flow is used, the number of LEs, memory bits, and multi-
pliers increase by 2.23×, 1.14×, and 2.68×, respectively, in Hybrid2 vs. the MIPS processor
alone, on average. The LegUp-Hybrid1 flow requires even more area: 2.75× LEs, 1.16×
memory bits, and 3.18× multipliers vs. MIPS. Note that link time optimization in LLVM
was disabled for the hybrid flows, as was necessary to preserve the integrity of the function
5 This convention is used in life sciences studies.
Table V. Area results for the five flows (LEs: Cyclone II logic elements; # bits: memory bits; Mults: 9x9 multipliers).
Benchmark | MIPS-SW: LEs # bits Mults | LegUp-Hybrid2: LEs # bits Mults | LegUp-Hybrid1: LEs # bits Mults | LegUp-HW: LEs # bits Mults | eXCite-HW: LEs # bits Mults
adpcm 12243 226009 16 25628 242944 152 46301 242944 300 22605 29120 300 16654 6572 28
aes 12243 226009 16 56042 244800 32 68031 245824 40 28490 38336 0 46562 18688 0
blowfish 12243 226009 16 25030 341888 16 31020 342752 16 15064 150816 0 31045 33944 0
dfadd 12243 226009 16 22544 233664 16 26148 233472 16 8881 17120 0 9416 0 0
dfdiv 12243 226009 16 28583 226009 46 36946 233472 78 20159 12416 62 9482 0 32
dfmul 12243 226009 16 16149 226009 48 20284 233472 48 4861 12032 32 4536 0 26
dfsin 12243 226009 16 34695 233472 78 54450 233632 116 38933 12864 100 22274 0 38
gsm 12243 226009 16 25148 232576 114 30808 233296 142 19131 11168 70 6114 3280 2
jpeg 12243 226009 16 46432 338096 252 64441 354544 254 46224 253936 172 30420 105278 20
mips 12243 226009 16 18857 230304 24 18857 230304 24 4479 4480 8 2260 3072 8
motion 12243 226009 16 28761 243104 16 18013 242880 16 13238 34752 0 20476 16384 0
sha 12243 226009 16 20382 359136 16 29754 359136 16 12483 134368 0 13684 3072 0
dhrystone 12243 226009 16 15220 226009 16 16310 226009 16 4985 82008 0 - - -
Geomean: 12243 226009 16 27248 258526 43 33629 261260 51 15646 28822 12 13101 496 5
Ratio: 1 1 1 2.23 1.14 2.68 2.75 1.16 3.18 1.28 0.13 0.72 1.07 0.00 0.32
Fig. 9. Speed and area summary for each flow: geometric mean execution time (left axis, µs) and number of LEs (right axis).
boundaries. However, link time optimization was enabled for the MIPS-SW and LegUp-
HW flows, permitting greater compiler optimization for such flows, possibly improving area
and speed.
Turning to the pure hardware flows in Table V, the LegUp-HW flow implementations
require 28% more LEs than the MIPS processor on average; the eXCite-HW implementa-
tions require 7% more LEs than the processor. In other words, on the key area metric of the
number of LEs, LegUp implementations require 19% more LEs than eXCite, on average.
We consider the results to be quite encouraging, given that this is the initial release of an
open source academic HLS tool. In terms of memory bits, both the LegUp-HW flow and
the eXCite-HW flow require much fewer memory bits than the MIPS processor alone. For
the benchmarks that require embedded multipliers, the LegUp-HW implementations use
more multipliers than the eXCite-HW implementations, which we believe is due to more
extensive multiplier sharing in the binding phase of eXCite.
Figure 9 summarizes the speed and area results. The left vertical axis represents geometric
mean execution time; the right axis represents area (number of LEs). Observe that execution
time drops as more computations are implemented in hardware. While the data shows
that pure hardware implementations offer superior speed performance to pure software or
hybrid implementations, the plot demonstrates LegUp’s usefulness as a tool for exploring
the hardware/software co-design space. One can multiply the delay and area values to
produce an area-delay product. On such a metric, LegUp-HW and eXCite-HW are nearly
identical (∼4.6M µS-LEs vs. ∼4.7M µS-LEs) – LegUp-HW requires more LEs vs. eXCite-
HW, however, it offers better speed, producing a roughly equivalent area-delay product.
The area-delay product parity with eXCite gives us further confidence that the HLS results
produced by LegUp are competitive with commercial tools.
Figure 10 presents the geometric mean energy results for each flow. The energy results
bear similarity to the trends observed for execution time, though the trends here are even
more pronounced. Energy is reduced drastically as computations are increasingly imple-
mented in hardware vs. software. The LegUp-Hybrid2 and LegUp-Hybrid1 flows use 47%
and 76% less energy than the MIPS-SW flow, respectively, representing 1.9× and 4.2× en-
ergy reductions. The pure hardware flows are even more promising from the energy stand-
point. With LegUp-HW, the benchmarks use 94% less energy than if they are implemented
with the MIPS-SW flow (an 18× reduction). The eXCite results are similar. Pure hardware
benchmark implementations produced by eXCite use over 95% less energy than software im-
plementations (a 22× reduction). The energy results are promising, especially since energy
was not a specific focus of our initial release.
Porting LegUp to an alternative FPGA device for pure hardware HLS is straightforward,
however, supporting the hybrid processor/accelerator scenario on a non-Altera device is
more involved. In particular, the Tiger MIPS processor makes use of Altera megafunctions
for memory, division and multiplication. The megafunctions would need to be changed to ref-
erence the corresponding modules for the alternate FPGA vendor. Moreover, as described in
Section 3.2, the LegUp hybrid platform uses the Altera Avalon interface for processor/ac-
celerator communication. If a Xilinx FPGA were targeted, processor/accelerator system
generation and communication would need to be modified to use the Xilinx EDK tool and
PLB bus [PLB 2011]. The PLB and Avalon interfaces are quite similar however, as both are
memory-mapped master/slave bus interfaces. We therefore see no significant barriers that
would prevent LegUp from targeting a Xilinx device.
Fig. 12. SDC scheduling results for Cyclone II with various clock period constraints (bars represent performance in MHz; the line represents latency in clock cycles).
The LegUp implementation has a scheduler DAG object that, essentially, annotates each
LLVM instruction with data relevant to its scheduling: its combinational delay, as charac-
terized in the target FPGA, and the instructions on which it depends. The scheduler DAG
object can be viewed as an overlay on the dataflow graph with scheduling-specific infor-
mation. The object contains all of the information needed for us to generate the SDC LP
formulation. After solving the LP, we deposit the cycle assignment for each instruction into
another LegUp data structure called the scheduler mapping. For each LLVM instruction, the
mapping holds the scheduled cycle number. Following scheduling, FSM generation accesses
the mapping object to construct the FSM.
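As a rough illustration of what the LP hand-off looks like, the sketch below walks a list of scheduling dependences and emits SDC difference constraints of the form s_succ − s_pred ≥ c. The data structures are invented for this example; the generated constraints would then be passed to an LP solver (lp_solve appears in the reference list). The clock-period constraint follows the standard SDC formulation of limiting chained combinational delay to the target period, simplified here to individual edges.

    // Build SDC difference constraints (s_v - s_u >= c) from a dependence list.
    // Simplified stand-in for the scheduler DAG described above; the resulting
    // constraints would be handed to an LP solver such as lp_solve.
    #include <cmath>
    #include <vector>

    struct Dep {
      int pred, succ;        // instruction indices
      int latencyCycles;     // cycles the predecessor needs before its result is ready
      double pathDelayNs;    // combinational delay if the two ops were chained
    };

    struct Constraint { int u, v; int minDiff; };  // s_v - s_u >= minDiff

    std::vector<Constraint> buildSdcConstraints(const std::vector<Dep>& deps,
                                                double clockPeriodNs) {
      std::vector<Constraint> cons;
      for (const Dep& d : deps) {
        // Data dependence: the successor starts after the predecessor's latency.
        cons.push_back({d.pred, d.succ, d.latencyCycles});

        // Clock-period constraint: if chaining the pair would exceed the target
        // period P, force extra cycles between their start times.
        int extra = static_cast<int>(std::ceil(d.pathDelayNs / clockPeriodNs)) - 1;
        if (extra > 0)
          cons.push_back({d.pred, d.succ, extra});
      }
      return cons;
    }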
Fig. 12 shows SDC scheduling results for Cyclone II, demonstrating the impact of running
SDC with different clock period constraints. The left axis (bar) gives the geometric mean
post-routed clock frequency across the 12 CHStone circuits and dhrystone; the right axis
(line) gives the geometric mean latency (# of clock cycles to execute). The four datapoints
show SDC scheduling results for clock period constraints of 20, 15, 10, and 7.5 ns, respec-
tively. Observe that circuit clock frequency increases as P is decreased, which demonstrates
the effectiveness of SDC, as well as provides confidence in our operator speed characteriza-
tion. Note that P is a minimum clock period constraint – no effort is made to actually slow
circuits down. Hence, for the P = 20 ns datapoint, the circuits run considerably faster than
50 MHz. As P is decreased, the circuits are more heavily pipelined and take larger numbers
of cycles to execute.
SDC scheduling will be made LegUp’s default scheduling algorithm in a subsequent re-
lease.
6.3. Parallel Accelerators
As a last case study, we demonstrate the capability of LegUp to synthesize multi-accelerator
systems. As a proof-of-concept application, we use array addition for four 1000-element
arrays. Three parallelization scenarios were evaluated: 1) pure software with the MIPS
processor performing all of the work, 2) a single accelerator, called by the processor, per-
forming each of the four array additions sequentially, and, 3) four accelerators, operating
in parallel, with each accelerator performing the addition for one of the four arrays. In the
multi-accelerator case, the processor signals each accelerator to start its work and polls until
all four have completed. We found that a single accelerator doing all of the work sequentially
provides a 5.2× speedup over the pure software case. Using four parallel accelerators yields
a 3.7× speedup vs. using a single accelerator. While this is a simple application, with no po-
tential cache coherency issues, it serves to illustrate that concurrently running accelerators
are feasible with LegUp – a topic we plan to explore further in future work.
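The start-and-poll protocol can be sketched as writes and reads of memory-mapped accelerator registers. The register addresses and layout below are made up for illustration only and do not correspond to the actual Avalon address map used in the case study.

    /* Hypothetical memory-mapped control for four parallel accelerators:
     * a START register and a DONE register per accelerator. All addresses
     * are invented for illustration. */
    #include <stdint.h>

    #define NUM_ACCELS 4
    static volatile uint32_t *const ACCEL_START[NUM_ACCELS] = {
        (uint32_t *)0xC8000000, (uint32_t *)0xC8000020,
        (uint32_t *)0xC8000040, (uint32_t *)0xC8000060};
    static volatile uint32_t *const ACCEL_DONE[NUM_ACCELS] = {
        (uint32_t *)0xC8000010, (uint32_t *)0xC8000030,
        (uint32_t *)0xC8000050, (uint32_t *)0xC8000070};

    void run_parallel_accels(void) {
      /* Kick off all four accelerators, one per array. */
      for (int i = 0; i < NUM_ACCELS; i++)
        *ACCEL_START[i] = 1;

      /* Poll until every accelerator reports completion. */
      for (int i = 0; i < NUM_ACCELS; i++)
        while (*ACCEL_DONE[i] == 0)
          ; /* busy-wait */
    }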
Acknowledgements
The authors thank Dr. Tedd Hadley from Y Explorations for providing the eXCite tool
used in the experimental study. The authors gratefully acknowledge the comments of the
anonymous reviewers that have significantly improved the manuscript.
REFERENCES
2011. CoreConnect, Xilinx, Inc. http://www.xilinx.com/support/documentation/ipembedprocess coreconnect.htm.
2011. lp solve linear programming solver. http://lpsolve.sourceforge.net/5.5/ .
2011. VTR – the Verilog-to-routing project for FPGAs. http://www.eecg.toronto.edu/vtr/ .
Aldham, M., Anderson, J., Brown, S., and Canis, A. 2011. Low-cost hardware profiling of run-time
and energy in FPGA embedded processors. In IEEE Int’l Conference on Application-specific Systems,
Architecture and Processors (ASAP). 61–68.
Altera, Corp. 2009. Nios II C2H Compiler User Guide. Altera, Corp., San Jose, CA.
Altera, Corp. 2010. Avalon Interface Specification. Altera, Corp., San Jose, CA.
Altera, Corp. 2011. Stratix IV FPGA Family Data Sheet. Altera, Corp., San Jose, CA.
AutoESL. AutoESL Design Technologies, Inc. (http://www.autoesl.com). AutoESL.
Betz, V. and Rose, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. In Int’l
Workshop on Field Programmable Logic and Applications. 213–222.
Tripp, J., Gokhale, M., and Peterson, K. 2007. Trident: From high-level language to hardware circuitry.
IEEE Computer 40, 3, 28–37.
United States Bureau of Labor Statistics 2010. Occupational Outlook Handbook 2010-2011 Edition. United
States Bureau of Labor Statistics.
University of Cambridge 2010. The Tiger MIPS processor (http://www.cl.cam.ac.uk/teaching/
0910/ECAD+Arch/mips.html). University of Cambridge.
Vahid, F., Stitt, G., and Lysecky, R. 2008. Warp processing: Dynamic translation of binaries to FPGA circuits.
IEEE Computer 41, 7, 40–46.
Villarreal, J., Park, A., Najjar, W., and Halstead, R. 2010. Designing modular hardware accelerators
in C with ROCCC 2.0. In IEEE Int’l Symposium on Field-Programmable Custom Computing Machines.
127–134.
Wayne Marx, V. A. 2008. FPGAs Are Everywhere - In Design, Test & Control. RTC Magazine.
Y Explorations (XYI) 2010. eXCite C to RTL Behavioral Synthesis 4.1(a). Y Explorations (XYI), San
Jose, CA.
LegUp: High-Level Synthesis for FPGA-Based
Processor/Accelerator Systems
Andrew Canis1 , Jongsok Choi1 , Mark Aldham1 , Victor Zhang1 , Ahmed Kammoona1 ,
Jason Anderson1 , Stephen Brown1 , and Tomasz Czajkowski‡
1 ECE Department, University of Toronto, Toronto, ON, Canada
‡ Altera Toronto Technology Centre, Toronto, ON, Canada
legup@eecg.toronto.edu
can explore the hardware/software design space, where some portions of a program run on a processor, and others as custom hardware circuits. LegUp, along with its suite of benchmark C programs, is a powerful open source platform for HLS research that we expect will enable a variety of research advances in hardware synthesis, as well as in hardware/software co-design. LegUp is available for download at: http://www.legup.org.
*{yz882,zhiruz}@cornell.edu
ABSTRACT
Modern high-level synthesis (HLS) tools greatly reduce the turnaround time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.

ACM Reference Format:
Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, Zhiru Zhang. 2018. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs. In FPGA '18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 25–27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3174243.3174255

⋆ Udit, Gustavo, and Wenping conducted this research when they were affiliated with or visiting Cornell.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
FPGA '18, February 25–27, 2018, Monterey, CA, USA. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5614-5/18/02...$15.00. https://doi.org/10.1145/3174243.3174255

1 INTRODUCTION
Field-programmable gate arrays (FPGAs) have become an attractive option for realizing specialized accelerators thanks to their reconfigurability, massive fine-grained parallelism, and performance per watt advantage. With the extreme-scale integration of modern system-on-chip (SoC) and escalating design complexity of emerging applications, designing at a higher level of abstraction has become crucial to achieving high productivity. To address this challenge, high-level synthesis (HLS) tools have emerged to allow application developers to describe the hardware accelerator using common software programming languages like C/C++ by automatically generating RTL from behavioral descriptions [7, 14]. With the recent advances on HLS techniques and algorithms, modern HLS tools enable designers to explore optimization opportunities that are infeasible at the register-transfer level.
Programming FPGAs with HLS tools is drastically different from writing traditional software code. HLS users typically need to apply many optimization pragmas/directives to meet design constraints. The success of such manual optimization often requires nontrivial hardware design knowledge. For example, in image/video processing, the right combination of SRAM-based line buffers and shift registers is needed to achieve the ideal throughput and resource usage for pipelining the stencil code in hardware. With a more complex dataflow structure, the user needs to further calculate and specify the right FIFO depth to obtain the best pipeline rate without causing too much area overhead. However, these advanced HLS optimizations are rarely used or even required in the existing HLS benchmark suites (e.g., [11], [23]), which primarily include relatively small kernels that are designed to test some of the basic capabilities of an HLS tool such as the synthesis support of high-level language constructs. In addition, for HLS tool developers and the HLS research community at large, there is also a growing demand for a common set of realistic and complex designs to evaluate the efficacy of new synthesis techniques.
To this end, we introduce Rosetta1 — a suite of realistic HLS benchmarks for software programmable FPGAs. Rosetta includes popular machine learning workloads such as logistic regression and neural network inference, as well as real-time video processing applications including image rendering and face detection. Unlike previous efforts, Rosetta presents fully developed applications instead of small kernel programs, and specifies realistic design constraints for each

1 Rosetta gets the name following the convention of a plethora of "stone" benchmark suites. It also symbolizes that our benchmarks are specified in multiple languages (i.e., C++, OpenCL) and useful for evaluating HLS across different tools and platforms.
application. These design constraints are satisfied by applying advanced optimizations of state-of-the-art HLS tools, which are not exercised by existing benchmark suites. With these features, Rosetta is not only a set of practical benchmarks for the HLS community, but also a design tutorial on how to build specialized FPGA accelerators with advanced HLS optimizations. More concretely, our main contributions are threefold:
• We design and present Rosetta, which couples a range of realistic applications with real-world design constraints under different programming models. Current Rosetta designs are written in C++ and OpenCL. The synthesized hardware accelerators are tested on both embedded and cloud FPGA platforms.
• Rosetta demonstrates how to effectively apply advanced optimizations provided by modern HLS tools to meet the design constraints and achieve high quality of results. Examples of these optimizations include fixed-point optimization, dataflow pipelining, and data reuse through customized memory.
• The proposed benchmark suite is freely available in open-source format. We plan to continuously improve Rosetta by strengthening current cases and adding new applications from other domains.
The rest of this paper is organized as follows: in Section 2, we introduce related work on HLS benchmarking and optimizations; Section 3 outlines the Rosetta applications and key HLS optimization techniques leveraged by them; details of each benchmark are described in Section 4; we show our experimental results in Section 5, and conclude this work in Section 6.

In particular, state-of-the-art HLS tools provide many advanced features for achieving high design quality. Examples include arbitrary-precision datatypes, parameterized hardware data structures (e.g., line buffers), and hierarchical dataflow pipelining. These features are often used in combination with other common HLS optimizations such as unrolling, loop pipelining [9, 15, 37], and array partitioning [30, 41]. Moreover, they are typically applied across multiple kernels exhibiting different characteristics to meet the stringent application-level design constraints.
We believe that a new set of full-application benchmarks is desirable to enable more realistic performance reporting of HLS tools and FPGA-based acceleration. Along this line, Liu et al. [16] conducted a comprehensive case study on an H.264 decoder, and they have open sourced their HLS implementation. Rosetta goes one step further by providing a suite of application benchmarks that can be used to (1) facilitate comparisons between HLS tools, (2) evaluate new synthesis techniques, and (3) establish meaningful baselines to track progress of the HLS and FPGA technologies. Each application in Rosetta includes a set of enforceable application-level design constraints based on real-world specifications. These constraints model the realistic use cases for FPGA-based hardware accelerators, which helps standardize the evaluation of future advancements in HLS tools. Furthermore, the applications in Rosetta leverage advanced features of HLS tools to achieve high quality of results (QoRs) across a distinct set of hardware designs. Hence these benchmarks can also serve as useful design tutorials for FPGA programmers to build high-performance hardware accelerators using HLS.
Table 1: The current set of the Rosetta applications — Rosetta contains both compute-bound and memory-bound applications with
different workloads. Kernels in each application expose different sources of parallelism: SLP = subword-level parallelism; DLP = data-level
parallelism; ILP = instruction-level parallelism. Different types of parallelism available in each compute kernel are listed in parentheses.
Application | Categorization | Major Compute Kernels | Major HLS Optimizations
3D Rendering | Video processing; Compute bound; Integer operation intensive | Integer arithmetics (ILP) | Dataflow pipelining; Communication customization
Digit Recognition | Machine learning; Compute bound; Bitwise operation intensive | Hamming distance (SLP, DLP, ILP); KNN voting (ILP) | Loop unrolling; Loop pipelining
Spam Filtering | Machine learning; Memory bound; Fixed-point arithmetic intensive | Dot product (DLP, ILP); Scalar multiplication (DLP, ILP); Vector addition (DLP, ILP); Sigmoid function (ILP) | Dataflow pipelining; Datatype customization; Communication customization
Optical Flow | Video processing; Memory bound; Floating-point arithmetic intensive | 1D convolution (DLP, ILP); Outer product (DLP, ILP) | Dataflow pipelining; Memory customization; Communication customization
Binarized Neural Network (BNN) [39] | Machine learning; Compute bound; Bitwise operation intensive | Binarized 2D convolution (SLP, DLP, ILP); Binarized dot product (SLP, DLP, ILP) | Memory customization; Datatype customization; Communication customization
Face Detection [25] | Video processing; Compute bound; Integer arithmetic intensive | Image scaling (DLP, ILP); Cascaded classifiers (DLP, ILP) | Memory customization; Datatype customization
• Compute customization – Compute customization improves the latency and/or throughput of the design through parallelization and pipelining. Loop unrolling, loop pipelining, and dataflow pipelining fall into this category.
• Memory customization – FPGA accelerators typically demand very high on-chip memory bandwidth to enable highly distributed control and computation. Therefore, it is critical to set up customized memory hierarchy to provide the required bandwidth through data reuse and memory banking.
• Communication customization – The limited data bandwidth between off-chip memories and the FPGA accelerators often becomes the performance bottleneck for memory-bound applications. Hence it is crucial to customize the communication channel and protocol used by the hardware accelerator to fully utilize off-chip memory bandwidth through proper data packing and careful

1  TRIANGLES: for (int i = 0; i < NUM_3D_TRI; i++) {
2  #pragma HLS dataflow
3    // five stages for processing each 3D triangle
4    projection(triangle_3ds, &triangle_2ds, angle);
5    flag = rasterization1(triangle_2ds, max_min,
6                          &triangle_2ds_same, max_index);
7    size = rasterization2(flag, max_min, max_index,
8                          triangle_2ds_same, fragment);
9    size_pixels = zculling(i, fragment, size, pixels);
10   coloringFB(i, size_pixels, pixels, frame_buffer);
11 }

Figure 1: Main loop for 3D Rendering. One triangle is processed by five image processing stages in each iteration.
arithmetic, while rasterization1 and zculling are heavy on integer comparisons. Each triangle requires a large amount of computation relative to its memory size. Therefore, the application is categorized as compute-bound.
3D rendering is a prime example of dataflow optimization, which is applied in the HLS code on line 2 of Figure 1. Dataflow optimization exploits task-level parallelism by overlapping different stages of the image processing pipeline, as shown in Figure 2. Although the latency of processing each triangle is not reduced, dataflow optimization improves throughput and ensures no hardware module in the pipeline is idle in the steady state.
Design parameters. We provide a switch in the source code to enable/disable dataflow optimization.

4.2 Digit Recognition
Digit recognition classifies hand-written digits using the K-nearest-neighbor (KNN) algorithm. The application works on a downsampled subset of the MNIST database [13], with 18000 training samples and 2000 test samples evenly split amongst the ten digit classes. Each MNIST image is downsampled to 14x14 and each pixel is represented as a single bit; thus, each image can be stored as a 196-bit unsigned integer. The KNN algorithm computes the Hamming distance between a test input and each training sample, stores the labels of the training samples with the K shortest distances, and votes among the K labels to decide the label of the test sample. The design objective for digit recognition is to minimize the total latency of classifying the 2000 test samples.
Digit recognition includes two major compute kernels: Hamming distance calculation and KNN voting. The Hamming distance kernel computes the Manhattan distance between two samples; as each sample is comprised of 1-bit pixels, this is done via bitwise XOR on the inputs, followed by computing a population count of the result. The kernel is therefore rich in bitwise logic. The Hamming distance must be calculated between a test input and every training sample. As a result, Hamming distance calculation is the dominant workload of digit recognition. The KNN voting kernel examines the list of Hamming distances to find the K nearest training samples, and outputs the classification result as the most frequent label amongst them. The main workload in this kernel is integer comparison and sorting.
These two kernels have very different characteristics: while we can easily exploit the bit-level and data-level parallelism in the Hamming distance kernel, the KNN voting kernel is harder to parallelize.
Digit recognition has a high compute to communication ratio. For each test instance, Hamming distance calculation requires 100s-1000s of cycles depending on the parallelization factor, and KNN voting requires 10s-100s of cycles depending on K and the parallelization factor. The training samples and their labels are stored on-chip and reused for all test instances. As a result, digit recognition is a compute-bound application.

1  __local WholeDigitType training_set[NUM_TRAINING]
2    __attribute__((xcl_array_partition(block,PAR_FACTOR,1)));
3
4  __attribute__((xcl_pipeline_loop))
5  TRAINING_LOOP:
6  for (int i = 0; i < NUM_TRAINING / PAR_FACTOR; i ++) {
7    __attribute__((opencl_unroll_hint))
8    LANES:
9    for (int j = 0; j < PAR_FACTOR; j ++) {
10     // Read a new instance from the training set
11     int train_id = j * NUM_TRAINING / PAR_FACTOR + i;
12     WholeDigitType training_instance;
13     training_instance = training_set[train_id];
14     // Update the KNN set
15     update_knn(test_instance, training_instance,
16                &knn_set[j*K_CONST]);
17   }
18 }

Figure 3: Main compute loop nest for KNN calculation in OpenCL.

Figure 3 shows the main compute loop nest for KNN calculation, alongside key HLS optimizations. TRAINING_LOOP iterates over training samples, while the inner loop, LANES, instantiates different Hamming distance units. In addition to compute optimizations in the form of loop pipelining and unrolling (lines 4 and 7 of Figure 3), memory optimization is needed since the default implementation of on-chip array training_set only has two memory ports, it cannot supply PAR_FACTOR training instances per cycle. The training_set array is partitioned in line 2. With these optimizations, we can exploit the data-level parallelism between training instances.
Design parameters. The user can tune the following knobs:
• K: number of nearest neighbors.
• PAR_FACTOR: number of parallel Hamming distance units.
These two parameters present an interesting trade-off between classification accuracy, latency, and resource utilization. Increasing PAR_FACTOR reduces the latency of the Hamming distance kernel, but complicates the KNN voting kernel. Parallelization also causes frequency to drop. Furthermore, the complexity of both kernels increases with K. Additional results and analysis on the design space are presented in Section 5.

4.3 Spam Filtering
The spam filtering application uses stochastic gradient descent (SGD) to train a logistic regression (LR) model for spam email classification [19]. The input is a dataset containing 5000 emails, 4500 for training and 500 for testing [26]. Each email is represented as a 1024-dimensional vector whose elements are relative word frequencies stored as 16-bit fixed-point numbers. The SGD training process produces a vector of 32-bit fixed-point parameters for the LR model. We use five training epochs and a minibatch size of one; each epoch processes every training sample once and updates the parameters after each sample.
The performance target of spam filtering is to minimize training latency. Critical resource constraints are the number of hardened DSP blocks and the size of on-chip storage, which limits the level of compute parallelization and the amount of data stored on the FPGA. The SGD algorithm contains kernels commonly found in machine learning applications, including dot product, vector addition, and sigmoid.
Our spam filtering design exploits datatype customization and approximation of complex arithmetic operations on the FPGA. Figure 4 shows the optimized sigmoid function. Lines 1-3 show the customized datatypes used to avoid expensive floating-point arithmetic. We also eliminate most of the compute by taking advantage of the properties of the sigmoid function. Sigmoid asymptotically approaches one when the input is large and zero when the input is small (i.e. large negative). Sigmoid values when the input is between minus four and four are hardcoded in a look-up table.
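A hypothetical rendering of this saturate-plus-lookup scheme is sketched below. It is not the Rosetta Figure 4 code; the real design uses HLS fixed-point datatypes, whereas plain C++ floats are used here to keep the sketch self-contained.

    // LUT-based sigmoid with saturation, in the spirit of the optimization
    // just described (illustrative only).
    #include <array>
    #include <cmath>

    constexpr int LUT_SIZE = 2048;            // table resolution over [-4, 4)
    std::array<float, LUT_SIZE> sigmoid_lut;  // filled once at initialization

    void init_sigmoid_lut() {
      for (int i = 0; i < LUT_SIZE; ++i) {
        float x = -4.0f + 8.0f * i / LUT_SIZE;        // map table index to [-4, 4)
        sigmoid_lut[i] = 1.0f / (1.0f + std::exp(-x));
      }
    }

    float sigmoid_approx(float x) {
      if (x >= 4.0f)  return 1.0f;   // saturate: sigmoid approaches 1 for large inputs
      if (x <= -4.0f) return 0.0f;   // ... and 0 for large negative inputs
      int idx = static_cast<int>((x + 4.0f) * (LUT_SIZE / 8.0f));
      return sigmoid_lut[idx];
    }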
window buffer. Figure 7 gives a pictorial illustration of a 2-row line buffer and a 3x3 window buffer. The line buffer reads in one pixel per cycle and stores pixels in recently visited rows. The window buffer is completely partitioned into registers for parallel data access, and
Zhao et al. implement the BNN model described in [8], which operates on the CIFAR-10 dataset [12]. It contains six convolutional layers, three pooling layers, and three fully-connected layers. Figure 9 shows the hardware diagram of the BNN accelerator, which uses a configurable number of convolvers to exploit data-level parallelism in a scalable manner. The authors target a small FPGA device with limited on-chip storage. As a result, the BNN weights cannot fit on-chip and the accelerator must be invoked multiple times to classify an image; each time new weights are loaded from off-chip memory.
There are two major kernels in BNN: binarized convolution and binarized dot product. Both kernels are intensive of bitwise logic operations. Binarized convolution comprise the majority of operations in classifying an image, and is heavily parallelized as a result. In contrast, the binarized fully-connected layers, which use the dot product kernel, are limited by off-chip memory-bandwidth. We categorize BNN as compute-bound since latency improvement mostly comes from accelerating compute in the convolutional layers.
Since 2D convolutional layers have a sliding window access pattern, line buffers are used to exploit data locality. In particular, a variable-width line buffer (VWLB) is designed to keep the hardware convolvers fully utilized despite the varying sizes of the feature maps. Figure 10 shows how the VWLB works for different input widths. For input feature map with a width of 32, the VWLB operates identically to a conventional line buffer. For a smaller feature map with a width of 8, each row in the VWLB stores multiple rows of the input. The rows are carefully arranged in the VWLB so that the convolutional filter can slide through and produce correct results.

Figure 10: Example usage of variable-width line buffer for 8-wide and 32-wide feature maps (figure adapted from [39]).

Design parameters. The BNN benchmark allows users to tune the number of convolvers in the accelerator. Other parameters such as the size of buffers are automatically scaled.

4.6 Face Detection
The face detection application is adopted from [25]. It uses the Viola-Jones algorithm [28] to detect human faces in a given image. More specifically, the accelerator takes an 320x240 greyscale image as input, which is scaled to construct an image pyramid; afterwards, an integral image is constructed from each image in the image pyramid, and a set of cascaded classifiers are applied to a fixed-size window which scans through the integral image; eventually, the positions and sizes of the human faces are returned.
As mentioned in [25], the throughput target for face detection is 30 frames per second. In addition, the application is subject to hardware constraints including limited on-chip storage and routing resources. The two major compute kernels in face detection are image scaling and cascaded classifiers. Image scaling is a common kernel in feature extraction applications such as SIFT [17], as well as the pooling layers of CNNs. The cascaded classifiers are the dominant workload for the face detection application. The authors of [25] parallelize the first three classifier stages and pipeline the rest of the stages to exploit data-level parallelism. This kernel also exposes an irregular memory access pattern — each classifier accesses either eight or twelve pixels, and the classifiers have different access patterns. This feature itself makes the kernel interesting for HLS memory optimization techniques. Customized memory partitioning is applied to improve kernel frequency and reduce routing effort [41].
The cascaded classifiers operate on a sliding window of the integral image. As a result, face detection can also benefit from the line buffer and window buffer optimization introduced in Section 4.4. However, constructing the whole integral image before applying the classifiers would require a significant amount of on-chip storage and incur performance loss. Therefore, the authors of [25] modified the window buffer to construct the integral image efficiently. The operation of this buffer is depicted in Figure 11, where the modified image window buffer accumulates pixels on the diagonal to compute the pixel values in the integral image.

Figure 11: Specialized line buffer and window buffer for face detection [25] — Here we show a 3x3 example, but the actual implementation uses 25x25 windows. Solid arrows refer to normal register shifting, while dashed arrows refer to addition. The image window buffer accumulates the incoming pixels and construct the integral image on the fly. The integral image window buffer accesses the image window buffer for new data.

5 EXPERIMENTAL RESULTS
We have synthesized the Rosetta benchmarks targeting an embedded FPGA as well as a cloud FPGA instance. We use Xilinx ZC706 for the embedded platform, which contains a Kintex-7 FPGA with a

Table 2: Device capacity of the two FPGA platforms and the resource utilization of the platform logic (shell) on AWS F1 — The last row reports the average resource utilization of the shell, with the standard deviation in parentheses.
              | # LUTs          | # FFs           | # BRAMs   | # DSPs
AWS F1 Total  | 1181768         | 2363536         | 2160      | 6840
ZC706 Total   | 218600          | 437200          | 545       | 900
AWS F1 Shell  | 293209 (±3693)  | 381853 (±5138)  | 545 (±0)  | 12 (±0)
Table 3: Rosetta results on Xilinx ZC706 Platform — The Runtime column shows overall
execution time. Resource numbers show the total resource usage of the designs, including
both kernel function and shell logic. Bitstreams are generated by Xilinx SDSoC 2017.1.
Benchmark # LUTs # FFs # BRAMs # DSPs Runtime (ms) Throughput
3D Rendering 8893 12471 48 11 4.7 213 frames/s
Digit Recognition1 41238 26468 338 1 10.6 189k digits/s
Spam Filtering2 12678 22134 49 160 78.9 285k samples/s
Optical Flow 42878 61078 54 454 24.3 41.2 frames/s
Binarized Neural Network3 46899 46760 102 4 4995.2 200 images/s
Face Detection 62688 83804 121 79 33.0 30.3 frames/s
1. K = 3, PAR_FACTOR = 40. 2. Five epochs, PAR_FACTOR = 32, VDWIDTH = 512.
3. Eight convolvers, 1000 test images.
Table 4: Rosetta results on AWS F1 Platform — Kernel: execution time on the FPGA; Comm.: time of data transfer between
host and global memory; Runtime: overall execution time. Performance-Cost Ratio is calculated based on the hourly rate (in
US Dollar/$) of the AWS f1.2xlarge instance [1]. Resource numbers are for kernel functions only. Bitstreams are generated by
Xilinx SDAccel 2017.1.
Benchmark # LUTs # FFs # BRAMs # DSPs Kernel (ms) Comm. (ms) Runtime (ms) Throughput Performance-Cost Ratio
3D Rendering 6763 7916 36 11 3.6 0.19 4.4 227 frames/s 496k frames/$
Digit Recognition1 39971 33853 207 0 9.9 0.55 11.1 180k digits/s 393M digits/$
Spam Filtering2 7207 17434 90 224 25.1 4.8 30.9 728k samples/s 1.6G samples/$
Optical Flow 38094 63438 55 484 2.6 4.8 8.4 119 frames/s 260k frames/$
Face Detection 48217 54206 92 72 20.2 0.47 21.5 46.5 frames/s 101k frames/$
1. K = 3, PAR_FACTOR = 40. 2. Five epochs, PAR_FACTOR = 32, VDWIDTH = 512.
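As a sanity check on the Performance-Cost Ratio column, take the 3D Rendering row and assume an f1.2xlarge on-demand rate of about $1.65 per hour (an assumed figure; the paper only cites the AWS pricing page [1]):

    227 frames/s × 3600 s/hour ÷ $1.65/hour ≈ 4.95 × 10^5 frames/$,

which is consistent with the reported 496k frames/$.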
Figure 12: Digit recognition design space, results are for AWS F1 platform — (a) Kernel time vs. K value. Difference in kernel time is caused by variance in latency and kernel frequency. (b) LUT usage vs. K value.

Figure 13: Spam filtering design space, results are for AWS F1 platform — Off-chip memory bandwidth is controlled by VDWIDTH. This parameter strictly limits the performance of the hardware kernel, showing that spam filtering is a memory-bound application.
on ZC706 and 227 frames per second on F1. While the throughput measured with our test input is much higher than the target, both kernel time and communication time increase with more triangles in the input. Communication latency is not significant on F1, but the software API calls in the OpenCL runtime incur a 0.6 ms overhead, which is not negligible for this specific application. These API calls initiate data transfers, enqueue the kernel function, and set the kernel arguments.

Table 5 shows the resource utilization and kernel time of a baseline design where dataflow optimization is not applied. Compared with the first row of Table 4, enabling dataflow optimization improves the kernel time by around 30% without significant resource overhead. This result demonstrates the efficacy of dataflow optimization in image processing pipelines.

Digit Recognition. In contrast to the other benchmarks, the performance of digit recognition is currently slightly worse on F1 than on ZC706. The overall throughput is 189k digits per second on ZC706 and 180k digits per second on F1. Although F1 has a shorter kernel time of 9.9 ms, the latency of communication and other overhead in the OpenCL runtime seem to have offset this advantage. According to our analysis, this is likely due to a missing feature in the specific version of the tool we are using, where async_group_copy is not pipelined to the full extent. Hence we expect to achieve higher performance on F1 in the near future, once this issue is resolved.

As mentioned in Section 4.2, digit recognition has a complex design space. Table 6 shows the classification accuracy for different K values. Figure 12 shows the kernel time and resource utilization of different design points. We only show kernel time in Figure 12a, because host-global memory communication time is not affected by the kernel implementation. In Figure 12b, only the most critical resource, LUTs, is shown. As we can see from Table 6 and Figure 12, the two design parameters expose interesting trade-offs. Increasing the K value improves classification accuracy at the cost of a significant increase in kernel time, caused by the frequency drop and the worsened latency of the KNN voting kernel. Additionally, the benefit of increasing PAR_FACTOR diminishes when PAR_FACTOR is already large: when the Hamming distance kernel is highly parallelized, the KNN voting kernel, which is highly sequential, becomes the performance bottleneck. The performance can be further improved by optimizing the KNN voting kernel and finding an optimal combination of the K value and PAR_FACTOR.
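To make the PAR_FACTOR knob concrete, the following sketch illustrates the Hamming-distance stage described above. It is our own illustration, not the Rosetta source; NUM_TRAINING, DigitType, and the 256-bit digit width are placeholders, and in practice the training array would also need to be partitioned to sustain PAR_FACTOR reads per cycle.

#include <ap_int.h>                    // Xilinx HLS arbitrary-precision types

const int PAR_FACTOR   = 40;           // parallelization knob, as in Table 4 (K = 3, PAR_FACTOR = 40)
const int NUM_TRAINING = 18000;        // illustrative training-set size (multiple of PAR_FACTOR here)
typedef ap_uint<256> DigitType;        // assumption: one binarized digit packed into 256 bits

// Hamming-distance stage: PAR_FACTOR distances are produced per loop iteration.
// The sequential KNN voting stage (not shown) consumes dist[] afterwards and
// becomes the bottleneck once this stage is highly parallelized.
void ComputeDistances(const DigitType training[NUM_TRAINING],
                      const DigitType &test,
                      int dist[NUM_TRAINING]) {
  for (int i = 0; i < NUM_TRAINING; i += PAR_FACTOR) {
#pragma HLS PIPELINE II=1
    for (int p = 0; p < PAR_FACTOR; ++p) {
#pragma HLS UNROLL
      DigitType diff = training[i + p] ^ test;   // XOR marks differing bits
      int d = 0;
      for (int b = 0; b < 256; ++b)              // popcount, unrolled by the tool
        d += diff[b];
      dist[i + p] = d;
    }
  }
}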
Spam Filtering. The performance of spam filtering differs significantly between the two platforms. The kernel time on F1 is 3.1x shorter than on ZC706, and the total execution time on F1 is 2.6x shorter, despite the additional 4.8 ms latency for host-global memory communication. In addition to the frequency improvement, this performance gap is mainly caused by the difference in off-chip memory bandwidth. Since we apply dataflow optimization to overlap communication and compute, the overall latency of the design is determined by the maximum of the compute and communication latencies. Because the compute kernels are highly parallel, the low communication bandwidth on ZC706 results in a much longer latency of the dataflow pipeline.

Figure 13 shows the kernel time on AWS F1 with different combinations of PAR_FACTOR and VDWIDTH. Here PAR_FACTOR specifies the degree of parallelism in the vector kernels, and VDWIDTH controls the off-chip communication bandwidth. With the same off-chip bandwidth, increasing PAR_FACTOR beyond 64 does not result in much performance gain, since the communication latency already dominates the compute latency. When the off-chip bandwidth is reduced, communication latency increases further, and kernel time degrades for all PAR_FACTOR values we tested. The best achievable performance improves with higher off-chip memory bandwidth. These results confirm that spam filtering is a memory-bound application.

Optical Flow. The total execution time of optical flow is 8.4 ms on F1 and 24.3 ms on ZC706. Both implementations satisfy the throughput constraint. On the AWS F1 platform, host-global memory communication takes up approximately 60% of the total execution time due to the large input/output data size. If we only consider kernel time, it is 9.3x shorter on F1 than on ZC706. Similar to spam filtering, this behavior is caused by the difference in off-chip memory bandwidth. The optical flow accelerator reads from and writes to the off-chip memory at the same time due to the streaming dataflow optimization. The F1 platform has multiple off-chip DDR banks to handle concurrent read and write requests. On ZC706, however, these concurrent requests cause contention on the off-chip memory, and the accelerator is often stalled due to the lack of input data.
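Both spam filtering and optical flow rely on this overlap of data movement and compute. A generic, hedged sketch of such a dataflow pipeline (our own illustration in Vivado HLS style, with placeholder names; Process() stands in for the benchmark kernel) looks as follows:

#include <hls_stream.h>

static int Process(int x) { return 2 * x; }   // stand-in for the benchmark's compute kernel

void ReadInput(const int *mem, hls::stream<int> &s, int n) {
  for (int i = 0; i < n; ++i) s.write(mem[i]);          // stream in from global memory
}
void Compute(hls::stream<int> &in, hls::stream<int> &out, int n) {
  for (int i = 0; i < n; ++i) out.write(Process(in.read()));
}
void WriteOutput(int *mem, hls::stream<int> &s, int n) {
  for (int i = 0; i < n; ++i) mem[i] = s.read();        // stream out to global memory
}
void Top(const int *in, int *out, int n) {
#pragma HLS DATAFLOW
  hls::stream<int> a, b;
  ReadInput(in, a, n);      // the three stages run concurrently, so total latency is
  Compute(a, b, n);         // governed by the slowest stage (compute- or memory-bound)
  WriteOutput(out, b, n);   // rather than by the sum of the stages
}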
6 CONCLUSIONS AND FUTURE WORK

We have presented Rosetta, an open-source, realistic benchmark suite for high-level synthesis targeting modern FPGA platforms. Rosetta is designed to be a collection of real applications which are optimized for performance and resource constraints. All Rosetta applications are ready to be executed on the supported embedded and cloud platforms. We believe that Rosetta can serve as a useful benchmark suite for HLS algorithms and tools, as well as a set of design tutorials for application developers interested in FPGA-based accelerated computing.

Rosetta will be continuously improved in the future. We will extend Rosetta to include more realistic applications from emerging domains. For the existing benchmarks, we plan to provide both C++ and OpenCL implementations for every benchmark to embrace different programming models commonly supported by HLS tools. The benchmarks will also be further optimized to achieve higher performance and resource efficiency.

ACKNOWLEDGEMENTS

This research was supported in part by a DARPA Young Faculty Award, NSF Awards #1337240 and #1453378, and a research gift from Xilinx, Inc. We thank Dr. Sumit Roy from Xilinx for providing helpful feedback on the Rosetta designs. We also thank Ackerley Tng, Edgar Munoz, Wendian Jiang, Lin Wang, Yun Qing, Nithya Subramanian, Nikita Patil, Surabhi Singh, Judy Stephen, and Ian Thompson for their contributions to the baseline designs of digit recognition, 3D rendering, spam filtering, and optical flow.

REFERENCES
[1] Amazon Web Services. AWS FPGA Developer AMI. https://aws.amazon.com/marketplace/pp/B06VVYBLZZ, Dec 2017.
[2] Amazon Web Services. AWS Shell Interface Specification. https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md, Dec 2017.
[3] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu. An OpenCL Deep Learning Accelerator on Arria 10. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A Naturalistic Open Source Movie for Optical Flow Evaluation. European Conference on Computer Vision (ECCV), Oct 2012.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. Int'l Symp. on Workload Characterization (IISWC), Oct 2009.
[6] P. Colangelo, R. Huang, E. Luebbers, M. Margala, and K. Nealis. Fine-Grained Acceleration of Binary Neural Networks Using Intel Xeon Processor with Integrated FPGA. Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), Apr/May 2017.
[7] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4):473–491, 2011.
[8] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830, Mar 2016.
[9] S. Dai, R. Zhao, G. Liu, S. Srinath, U. Gupta, C. Batten, and Z. Zhang. Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[10] Q. Gautier, A. Althoff, P. Meng, and R. Kastner. Spector: An OpenCL FPGA Benchmark Suite. Int'l Conf. on Field Programmable Technology (FPT), Dec 2016.
[11] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-Based High-Level Synthesis. Journal of Information Processing, Vol. 17, pages 242–254, Oct 2008.
[12] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, Apr 2009.
[13] Y. LeCun. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/, Dec 2017.
[14] Y. Liang, K. Rupnow, Y. Li, D. Min, M. N. Do, and D. Chen. High-Level Synthesis: Productivity, Performance, and Software Constraints. Journal of Electrical and Computer Engineering, 2012:1:1–1:1, Jan 2012.
[15] G. Liu, M. Tan, S. Dai, R. Zhao, and Z. Zhang. Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2017.
[16] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen. High Level Synthesis of Complex Applications: An H.264 Video Decoder. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2016.
[17] D. G. Lowe. Object Recognition from Local Scale-Invariant Features. Int'l Conf. on Computer Vision (ICCV), Oct 1999.
[18] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[19] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[20] J. Pineda. A Parallel Algorithm for Polygon Rasterization. ACM SIGGRAPH Computer Graphics, 22(4):17–20, 1988.
[21] L.-N. Pouchet. Polybench: The Polyhedral Benchmark Suite. http://www.cs.ucla.edu/pouchet/software/polybench, Dec 2017.
[22] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. Polyhedral-Based Data Reuse Optimization for Configurable Computing. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2013.
[23] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. Int'l Symp. on Workload Characterization (IISWC), Oct 2014.
[24] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Int'l Symp. on Computer Architecture (ISCA), Jun 2014.
[25] N. K. Srivastava, S. Dai, R. Manohar, and Z. Zhang. Accelerating Face Detection on Programmable SoC Using C-Based Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[26] The Apache Software Foundation. Public Corpus. http://spamassassin.apache.org/old/publiccorpus/, Apr 2017.
[27] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[28] P. Viola, M. J. Jones, and D. Snow. Detecting Pedestrians Using Patterns of Motion and Appearance. International Journal of Computer Vision, 63(2):153–161, Jul 2005.
[29] S. Wang, Y. Liang, and W. Zhang. FlexCL: An Analytical Performance Model for OpenCL Workloads on Flexible FPGAs. Design Automation Conf. (DAC), Jun 2017.
[30] Y. Wang, P. Li, and J. Cong. Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2014.
[31] Z. Wang, B. He, W. Zhang, and S. Jiang. A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs. Int'l Symp. on High Performance Computer Architecture (HPCA), Mar 2016.
[32] Z. Wei, L. Dah-Jye, and B. E. Nelson. FPGA-Based Real-Time Optical Flow Algorithm Design and Implementation. Journal of Multimedia, 2:38–45, Sep 2007.
[33] H. Yonekawa and H. Nakahara. On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA. Int'l Parallel and Distributed Processing Symp. Workshops (IPDPSW), May 2017.
[34] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2015.
[35] C. Zhang and V. K. Prasanna. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[36] J. Zhang and J. Li. Improving the Performance of OpenCL-Based FPGA Accelerator for Convolutional Neural Network. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[37] Z. Zhang and B. Liu. SDC-Based Modulo Scheduling for Pipeline Synthesis. Int'l Conf. on Computer-Aided Design (ICCAD), Nov 2013.
[38] J. Zhao, L. Feng, S. Sharad, W. Zhang, Y. Liang, and B. He. COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications. Int'l Conf. on Computer-Aided Design (ICCAD), Nov 2017.
[39] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. B. Srivastava, R. Gupta, and Z. Zhang. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[40] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar. Lin-Analyzer: A High-Level Performance Analysis Tool for FPGA-Based Accelerators. Design Automation Conf. (DAC), Jun 2016.
[41] Y. Zhou, K. M. Al-Hawaj, and Z. Zhang. A New Approach to Automatic Memory Banking Using Trace-Based Address Mining. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.
[42] W. Zuo, P. Li, D. Chen, L.-N. Pouchet, S. Zhong, and J. Cong. Improving Polyhedral Code Generation for High-Level Synthesis. Proc. of the 8th Int'l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Sep/Oct 2013.
IEEE EMBEDDED SYSTEMS LETTERS, VOL. XX, NO. X, DECEMBER 2016 1
in industry to create specialized System-on-Chip (SoC) architectures. Increasing the level of security of these heterogeneous architectures is becoming critical. However, state-of-the-art security countermeasures are still applied only to the code executing on the processor cores or manually implemented into the generated components, leading to suboptimal and sometimes even insecure designs. This paper discusses extensions to HLS tools for creating secure heterogeneous architectures.

Index Terms—High-Level Synthesis, Hardware Security.
microarchitecture of an accelerator, which is composed of the controller and the datapath. The execution of the function is controlled by a finite-state machine (the controller) that, based on a set of conditions, determines which operations are executed by the arithmetic resources (the datapath) in any given clock cycle. These resources elaborate input data, provided through parameters or stored in memories – either in local, directly accessible scratchpads or in external memory accessed through memory controllers – with the possibility of computing on memory addresses (e.g., pointer arithmetic) [9]. This is achieved by daisy-chaining all memory components (i.e., the local scratchpads and the controller for the external memory). In this way, accelerators can automatically identify the memory location accessed by a memory operation based on the dynamic value of the address, broadening the range of applications that can leverage such heterogeneous building blocks.

Since heterogeneous architectures leverage hardware accelerators to provide energy-efficient, high-performance computation, such components are an attractive target for attacks. Current protection mechanisms target software execution on processors [10], [11], are manually implemented [12], and introduce significant overheads [13]. This approach is neither efficient nor scalable when applied to accelerators, requiring the design process to be revisited.

We discuss hardware vulnerabilities listed in the CWE list¹, focusing on how to exploit design errors and alter the accelerator behavior. First, vulnerabilities in hardware accelerators can be exploited to launch software-based attacks. Even if it is not possible to implement a different functionality as is done by exploiting buffer overflows (CWE-121) and code injection (CWE-94), one can manipulate input values (either configuration parameters or memory values) to exploit design errors. For example, attackers may exploit vulnerabilities in the accelerator controller to launch a wide range of attacks (CWE-691: Insufficient Control Flow Management) [14]. Attackers can also exploit vulnerabilities in the SoC architecture. For example, the attacker may tamper with the system bus to insert malicious operations that trigger unauthorized execution of the accelerators (CWE-284: Improper Access Control) [15]. If the system is not adequately protected, the resulting execution may be compromised. One can access internal and sensitive data through the output port or via the memory space shared with the attacker (CWE-485: Insufficient Encapsulation and CWE-922: Insecure Storage of Sensitive Information). Hence, the execution and the outcome of an accelerator are not secure if not adequately verified and protected (see Fig. 2(b)).

Even when the specification of the accelerator is secure, its implementation can be compromised through physical attacks, where the adversary exploits the weaknesses of the implementation (CWE-693: Protection Mechanism Failure). Side-channel attacks can be used to extract secret data from embedded devices and high-end cloud servers. A paradigmatic example is the Advanced Encryption Standard (AES). The algorithm is mathematically secure, but its physical implementations have been attacked using power and timing attacks (CWE-326: Inadequate Encryption Strength). Accelerators can help mitigate side-channel attacks, for instance by ensuring constant execution time and thus making timing attacks infeasible. For example, Intel recently added the AES-NI instructions [11]. Accelerators must be protected from a variety of other attacks, including fault-based attacks and side channels [16]. If not adequately protected, a circuit separated from the rest of the processor can be localized easily, becoming the target of precise power side-channel attacks, ultimately leading to easy key recovery (see Fig. 2(c)).

¹ The Common Weakness Enumeration List (CWE), http://cwe.mitre.org
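To make the constant-execution-time argument concrete (our own illustration, not code from the letter), a data-dependent branch on a secret value can be replaced in the HLS source by a branch-free select, so that the synthesized datapath does the same work in every cycle regardless of the secret:

// Branch-free (constant-time) selection: both operands are always available and the
// result is chosen through a mask derived from the secret bit, so the cycle count of
// the synthesized circuit does not depend on the secret value.
static inline unsigned ct_select(unsigned secret_bit, unsigned a, unsigned b) {
  unsigned mask = 0u - (secret_bit & 1u);   // all ones if the bit is set, zero otherwise
  return (a & mask) | (b & ~mask);
}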
Besides these hardware vulnerabilities for the end user, secure accelerators should be protected from reverse engineering, insertion of hardware Trojans (CWE-912: Hidden Functionality), and unauthorized copying. Otherwise, the technological advantage of the IP provider can be undermined, creating billions of dollars of economic damage [17]. The hardware description of the accelerator depends not only on the initial high-level specification but also on the optimizations selected by the designer and performed by the design tools. Reverse engineering would make all these assets available to unauthorized parties (see Fig. 2(d)).

In a nutshell, since designers are integrating hardware accelerators into their designs, we expect securing these components to become increasingly relevant in the coming years.

III. SECURING HARDWARE ACCELERATORS

The proliferation of third-party applications for embedded systems (e.g., in the Apple App Store or Google Play) is becoming a serious threat to the user's privacy, since such systems can leak personal information without authorization [10].
Received August 13, 2011; revised November 20, 2011; accepted November 30, 2011
ABSTRACT
This paper addresses the challenges of System-on-Chip designs using High-Level Synthesis (HLS). HLS tools convert
algorithms designed in C into hardware modules. This approach is a practical choice for developing complex applica-
tions. Nevertheless, certain hardware considerations are required when writing C applications for HLS tools. Hence, in
order to demonstrate the fundamental hardware design concepts, a case study is presented. Fast Fourier Transform (FFT)
implementation in ANSI C is examined in order to explore the important design issues such as concurrency, data recur-
rences and memory accesses that need to be resolved before generating the hardware using HLS tools. There are addi-
tional language constraints that need to be addressed including use of pointers, recursion and floating point types.
Keywords: System Level Design; High Level Synthesis; Field Programmable Gate Arrays; Fourier Transform
been accomplished using Hardware Description Languages (HDLs) such as VHDL or Verilog. Each expression in HDL represents a group of gates that operate in parallel, as opposed to machine instructions executed sequentially. This concept of instruction-level parallelism is one of the first major hurdles when introducing hardware concepts. Once an RTL module is designed, it can be compiled and simulated. The simulation is done by creating a series of pre-defined inputs, known as a testbench, and recording the outputs. If a module passes the simulation, then a low-level implementation can be created. This low-level implementation then enters the verification process to ensure that all timing dependencies are met. In practice, simulating and verifying an implementation can take 50%-60% of the development time, increasing the time-to-market (TTM) [14]. By automating the simulation and verification process, it is possible to greatly reduce the development time.

Integration of HLS tools into the FPGA or ASIC design flow, as shown in Figure 1, allows software designers to build hardware modules and speed up the TTM significantly. During the generation of an RTL module from a software implementation, simulation and verification are done automatically by using a formal proof provided during the initial steps. Subsequently, by using synthesis tools, the RTL module is implemented and timing verification is done. An independent evaluation of HLS tools for Xilinx FPGAs has been done by Berkeley Design Technology [15]. It shows that using HLS tools with FPGAs can improve the performance of an application by an order of magnitude compared to DSPs. Moreover, this study shows that for a given application, HLS tools achieve results similar to hand-written HDL code with a shorter development time.

The HLS software-based approach to simulation and verification is made possible by SystemC, a language developed by Synopsys, the University of California Irvine, Frontier Design, and IMEC. SystemC is an extension of C++ that provides additional libraries to design an embedded system. The first version was released in 1999, and in 2005 it became IEEE-standardized as IEEE 1666-2005 [16,17]. These additional libraries make it possible to specify the hardware and software components of an embedded system using one unified paradigm and to generate testbenches.

Focusing further on HLS, the design flow is shown in Figure 2. Each module of a system is implemented using high-level languages such as C, C++, Java, or Matlab [2,18], which can then be tested automatically with testbenches provided by the user. After verification of the complete system, the user can specify in the HLS tool which modules will be converted into hardware accelerators in order to speed up the application. This is one of the core elements of hardware/software co-design that software developers need to understand. There are inherent restrictions in the HDLs that are mirrored in the HLS tool. Therefore, the emphasis when teaching HDL to software developers is on its constraints and how they affect the HLS tools.

After generation of the hardware modules along with testbenches, the system is verified and can be implemented using synthesis tools.

This paper, as mentioned earlier, focuses on designing a Fast Fourier Transform. The concept of HLS is presented by using PICO (Program-In Chip-Out) Extreme from Synfora [10,19,20] to generate the RTL code of an FFT. To be specific, PICO takes a C-based description of an algorithm and generates performance-driven, device-dependent synthesizable RTL code, testbench files, application drivers, simulation scripts, as well as SystemC-based Transaction Level Models (TLM) [3,17,18,21]. The PICO design flow is shown in Figure 3.

Figure 1. FPGA high-level synthesis block diagram.
Figure 2. High-level synthesis (HLS) design flow.
With the integration of the PICO design tools into their FPGA flow, designers can create complex hardware sub-systems [20] from sequential, untimed C algorithms. It allows designers to explore programmability, performance, power, area, and clock frequency. This is achieved by providing a comprehensive and robust verification and validation environment. PICO is designed to explore different types of parallelism and will choose the optimal one transparently. Results in terms of throughput and area are given along with detailed reports that help the user with code optimization. When the synthesized performance is satisfactory, RTL code is generated and can be implemented on the targeted platform. Because the testing is done in C, the verification time of the RTL module can be significantly reduced [20].

3. Fast Fourier Transform

In most cases, the first step when using an HLS tool is to create a reference implementation, which is used to verify the synthesized product. The reference code itself can be compiled using any C compiler, and is purely software based. This means that no new concepts have to be taught, making the reference implementation a logical starting point when using HLS.

When creating the reference code for the FFT, there are a few issues that need to be addressed when using HLS tools. The first issue is that arithmetic operations such as division can significantly decrease the performance of the design, and therefore should be avoided whenever possible. Nevertheless, division by a power of two is treated as a bit-shift operation and hence can be used at no cost. The second, and more fundamental, issue is that pointers and recursion are not supported by current HLS tools, because those concepts are purely software constructs and cannot be applied to hardware designs. Finally, HLS tools may not have the capability to synthesize software functions such as cosine and sine. As a result of these constraints, the reference code included in this section does not use divisions, is completely iterative, and has no pointer variables. However, before going into the details of the implementation, the mathematical background of the FFT is presented.

3.1. FFT Algorithm

The Fourier transform takes a signal x in time t and transforms it into a function X in frequency ω:

X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-2 j \pi \omega t}\, dt \qquad (1)

The transform can be computed using the Discrete Fourier Transform (DFT):

X_k = \sum_{n=0}^{N-1} x_n\, e^{-\frac{2 j \pi k n}{N}}, \qquad k = 0, \dots, N-1 \qquad (2)

The direct realization of the DFT requires O(N^2) computational time. To make this computation faster, an entire class of Fast Fourier Transforms (FFT) was developed [8]. In this paper, a radix-2 FFT decimated in time is implemented. This algorithm divides the original DFT into two DFTs of half the length (i.e., decimation). The first step of the decimation is shown below:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2 \pi j (2m) k}{N}} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2 \pi j (2m+1) k}{N}} \qquad (3)

Then the algorithm is recursively applied to each term until the length of each DFT is 1. This recursive deconstruction of the DFT reduces the computational time to O(N log(N)) [8].
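For completeness, Eq. (3) can be rearranged into the butterfly form that Eq. (4) below relies on; this is standard DFT algebra rather than text reproduced from the paper:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2\pi j m k}{N/2}}
    + W_N^k \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2\pi j m k}{N/2}}
    = E_k + W_N^k O_k,
\qquad
X_{k+N/2} = E_k - W_N^k O_k,
\qquad
W_N^k = e^{-2\pi j k / N},

where E_k and O_k denote the N/2-point DFTs of the even- and odd-indexed samples, respectively, for k = 0, ..., N/2-1.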
3.2. Software Implementation of the FFT

In Figure 4, a 16-point radix-2 FFT is shown. A signal is input into the FFT in bit-reversed order and then goes through log2(N) passes, where each pass has N/2 "butterfly" operations. These butterfly operations are defined as:

F = f + W_N^k\, g, \qquad G = f - W_N^k\, g, \qquad W_N^k = e^{-2\pi j k / N} \text{ (called the twiddle factor)} \qquad (4)

Figure 4. 16-point radix-2 FFT. [Butterfly diagram not reproduced in this copy.]

The butterfly operation requires complex-number additions and multiplications. Because of the programming constraints placed on the reference code, most complex number libraries are not useable. Hence, this reference code uses its own complex number representation, shown below:

typedef struct {
    float x;
    float y;
} s_complex;

Moreover, in order to perform the butterfly operation, the W_N^k terms need to be calculated. Since we assume that the HLS library does not support cosine and sine functions, the twiddle factors are pre-computed and stored in a table using the code below:

#include "fft.h"
#include <math.h>
#define pi2 (double)6.28318530717958647692528676655901

extern s_complex fix_float[N / 2];

void table_setup(void)
{
    double a = 0.0;
    double e = pi2 / N;
    float cos_val, sin_val;
    int i;
    for (i = 0; i < N / 2; i++) {
        cos_val = cos(a);
        sin_val = sin(a);
        fix_float[i].x = cos_val;
        fix_float[i].y = sin_val;
        a = a + e;
    }
}

The particular implementation chosen for this reference FFT was provided by [22]. The exact code used is shown in Figure 5. N represents the length of the FFT and must be a power of 2. Before using the function fft_ref, the function table_setup must be executed in order to compute the twiddle factors and store them in the array fix_float. The FFT of an input z can then be executed. The first phase is the bit-reverse operation, where the input data are rearranged as shown in Figure 4. Then, for each pass, the butterfly operations are performed until the FFT is completed. In the next section this code will be made fully synthesizable by applying four modifications to it.

Figure 5. FFT reference C code (fft_ref). [The listing is not legible in this copy; its annotated phases are the bit-reverse operation, obtaining the cosine and sine values for the butterfly operation, and the butterfly calculation.]
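Because the reference listing in Figure 5 is not legible in this copy, the following is a reconstruction sketch of an iterative radix-2 decimation-in-time FFT with the same structure (bit-reverse pass followed by log2(N) butterfly passes over the pre-computed fix_float table). It is our own code, not the exact listing from [22], and the twiddle-factor sign handling is an assumption based on table_setup above.

#include "fft.h"    /* assumed, as in the listings above, to define N and s_complex */

extern s_complex fix_float[N / 2];   /* twiddle table filled by table_setup() */

/* In-place iterative radix-2 decimation-in-time FFT over z[0..N-1]. */
void fft_sketch(s_complex z[N])
{
    /* Bit-reverse reordering of the input samples */
    unsigned int i, j = 0;
    for (i = 1; i < N; i++) {
        unsigned int bit = N >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            s_complex tmp = z[i];
            z[i] = z[j];
            z[j] = tmp;
        }
    }

    /* log2(N) butterfly passes */
    unsigned int span;
    for (span = 1; span < N; span <<= 1) {
        unsigned int stride = N / (2 * span);        /* twiddle index stride */
        unsigned int start, k;
        for (start = 0; start < N; start += 2 * span) {
            for (k = 0; k < span; k++) {
                float wr =  fix_float[k * stride].x;  /* cos term from the table */
                float wi = -fix_float[k * stride].y;  /* assumed sign for the forward transform */
                s_complex f = z[start + k];
                s_complex g = z[start + k + span];
                float tr = wr * g.x - wi * g.y;       /* W_N^k * g, real part */
                float ti = wr * g.y + wi * g.x;       /* W_N^k * g, imaginary part */
                z[start + k].x        = f.x + tr;     /* F = f + W_N^k g */
                z[start + k].y        = f.y + ti;
                z[start + k + span].x = f.x - tr;     /* G = f - W_N^k g */
                z[start + k + span].y = f.y - ti;
            }
        }
    }
}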
4. Code Modification for HLS

The objective of this section is to generate the hardware of an FFT block based on the reference C code using HLS tools. Multiple modifications are needed in order to generate optimal hardware in terms of resource usage and throughput. As an example, we generate an 8-bit 1024-point radix-2 FFT. The output is on 18 bits and will be available in natural order. The data width inside the FFT has been chosen so that the HLS FFT gives the same results as the Xilinx FFT core [23].

4.1. Floating Point to Fixed Point Implementation

Since the reference C code uses floating-point numbers, a fixed-point library is needed. For example, PICO, the HLS tool used in this demonstration, provides such a library. The PICO fixed-point arithmetic library derives its semantics from the SystemC fixed-point library, and it supports signed and unsigned arithmetic operations. Hence, the previous floating-point complex structure must be modified as follows:

typedef pico::s_fixed<22, 18, pico::S_RND, pico::S_SAT, 0> floatP;

typedef struct {
    floatP x;
    floatP y;
} s_complexP;

The FFT is computed using a 22-bit data width, with 18 bits for the integer part and 4 bits for the fractional part. Rounding and saturation are used. The effect of the number of bits allocated to the fractional part on the precision and resource usage of the HLS FFT is presented in Section 5. The twiddle factors are pre-calculated with a precision of 16 bits and stored in an array, eliminating the need for trigonometric functions.

4.2. Input Array to Stream of Input Data

In the reference C code, the input data are passed to the function as an array. This will be translated into memory accesses by the HLS tool, which is not optimal for a hardware implementation. Hence, a stream of input data is used. PICO supports two types of streams: external and internal. External streams are used to stream data from/to global memory and/or other blocks in the system. Internal streams are used to stream data between loops within a multi-loop accelerator designed by PICO. In PICO, streams are specified using explicit procedure calls that transmit a scalar value to an output stream or receive a scalar value from an input stream. These procedures are converted into special opcodes that receive (transmit) data from (to) actual streams. For the FFT application, four streams are needed: input/output streams for the real and imaginary parts:

char pico_stream_input_xin();
char pico_stream_input_yin();
void pico_stream_output_xout(int);
void pico_stream_output_yout(int);

PICO synthesizes a FIFO (within the RTL) for each internal and external stream in the code. Different parameters, such as the length of the FIFO, can be configured using pragmas. The first step of the FFT is the loading phase, where input data are stored into a RAM called z, as shown below:

for (h = 0; h < N; h++) {
    z[h].x = (floatP) pico_stream_input_xin();
    z[h].y = (floatP) pico_stream_input_yin();
}

Finally, after the FFT is computed, the unloading phase is performed:

for (p = 0; p < N; p++) {
    pico_stream_output_xout(z[p].x);
    pico_stream_output_yout(z[p].y);
}
4.3. Bit-Reverse Operation

If we look at the reference C code, the next step would be the bit-reverse stage; this operation takes 1024 cycles. However, it can be integrated into the radix-2 FFT block, hence reducing the total number of cycles required to perform the calculations. This can be done using the bit_swap function:

unsigned short bit_swap(unsigned short in, unsigned short bits) {
    unsigned short out = 0;
    unsigned short k;
    #pragma unroll
    for (k = 0; k < bits; k++) {
        out = (out << 1) | (in & 0x1);
        in = in >> 1;
    }
    return out;
}
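One possible way to fold the reordering into the loading phase of Section 4.2 (a hypothetical sketch, not the paper's code) is to store each streamed sample directly at its bit-reversed address:

/* Hypothetical integration of bit_swap() into the loading phase: each streamed
 * sample is written directly to its bit-reversed address, so the separate
 * 1024-cycle reordering pass is no longer needed. */
for (h = 0; h < N; h++) {
    unsigned short r = bit_swap((unsigned short) h, 10);   /* log2(1024) = 10 bits */
    z[r].x = (floatP) pico_stream_input_xin();
    z[r].y = (floatP) pico_stream_input_yin();
}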
this section, increasing the frequency will increase the resources of the hardware generated by the HLS tool. The throughput (the number of FFTs that can be computed in one second) can also be specified. In order to achieve a high throughput, the HLS tool will parallelize tasks, hence increasing the hardware resources. Finally, the user can specify whether arrays are implemented using block RAMs or look-up tables (LUTs). Hardware implementation results are obtained using the Xilinx ISE 12.1 software with either speed or area optimization for a Virtex-5 FPGA. The twiddle factors have been implemented using LUTs but can also be implemented using RAMs; doing so reduces the total number of slice LUTs but increases the number of block RAM/FIFO. Table 1 shows the hardware usage of the HLS implementation of the FFT with 22-bit data width for different targeted frequencies.

One can see a significant increase in logic slices at 150 MHz operational frequency. This is due to the fact that we have selected optimization for speed in ISE in order to achieve the desired operational frequency after place and route. For frequencies lower than 150 MHz, optimization for area has been selected. For frequencies from 50 MHz to 150 MHz, the total number of clock cycles required by PICO to perform the 1024-point FFT is 7168, but for 175 MHz it increases to 12288 clock cycles. 7168 clock cycles is the minimum latency that can be obtained and is calculated as follows:

latency = loading + FFT + unloading = N + \frac{N}{2} \log_2(N) + N \qquad (5)
latency = 1024 + 512 \cdot 10 + 1024 = 7168 clock cycles

For frequencies higher than 150 MHz, PICO reduces the task parallelism of the FFT in order to achieve the desired frequency. This results in an increase of the latency and a reduction of the hardware resources. The maximum frequency that can be obtained by PICO is around 270 MHz, with a total of 17,408 clock cycles (1024 + 3 \cdot 512 \cdot 10 + 1024) to compute the FFT. Nevertheless, after place and route, the maximum frequency obtained is 180 MHz due to the targeted FPGA.

Area reduction in terms of slices and DSP48E blocks can be achieved by increasing the number of clock cycles required to perform the FFT. Hence, for equivalent throughput, it is better to choose a higher operational frequency and a higher number of clock cycles required to perform the FFT. Table 2 shows the hardware usage of the FFT for a targeted frequency of 150 MHz with different throughputs. For example, from Table 1, at a frequency of 75 MHz the throughput is 10,463. Nevertheless, with a frequency of 150 MHz, a better throughput can be obtained using fewer DSP48E blocks (see Table 2, second row).

Figure 7 shows the error variation with respect to the width of the fractional part, compared to the reference code shown in Figure 5. The relative error of the FFT is given by the formula below:

error = \frac{1}{100} \cdot \frac{1}{1024} \sum_{n=0}^{99} \sum_{k=0}^{1023} \left( \frac{\left| X_{ref}[n][k] - X_{HLS}[n][k] \right|}{\left| X_{ref}[n][k] \right|} + \frac{\left| Y_{ref}[n][k] - Y_{HLS}[n][k] \right|}{\left| Y_{ref}[n][k] \right|} \right) \qquad (6)

where X and Y are the real and imaginary parts, respectively. The relative error is calculated for 100 random input signals of 1024 samples each. Figure 7 shows that the relative error decreases linearly as the number of bits for the fractional part increases. For the implementation of the FFT, -40 dB is achieved, giving the same results as the Xilinx FFT core. Nevertheless, the user can increase the precision at the expense of hardware usage. For 13 bits, the relative error achieved is -73 dB compared to the reference C code based on double-precision floating-point operations.

Table 3 shows the hardware usage with respect to the width of the fractional part for a desired operational frequency of 100 MHz. As expected, the resource usage increases with the number of bits for the fractional part. Nevertheless, the number of block RAM/FIFO used is the same. This is due to the architecture of the selected Virtex-5 FPGA.
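Eq. (6) translates directly into a host-side check. The sketch below is ours, not code from the paper; the array names, the fixed dimensions (100 signals of 1024 bins), and the 20*log10 dB convention are assumptions.

#include <math.h>

/* Host-side check corresponding to Eq. (6): average relative error of the HLS FFT
 * against the double-precision reference, reported in dB. */
double relative_error_db(const double Xref[100][1024], const double Xhls[100][1024],
                         const double Yref[100][1024], const double Yhls[100][1024])
{
    double err = 0.0;
    for (int n = 0; n < 100; n++)
        for (int k = 0; k < 1024; k++)
            err += fabs(Xref[n][k] - Xhls[n][k]) / fabs(Xref[n][k])
                 + fabs(Yref[n][k] - Yhls[n][k]) / fabs(Yref[n][k]);
    err /= 100.0 * 1024.0;
    return 20.0 * log10(err);   /* assumption: dB defined as 20*log10(error) */
}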
Table 1. FFT hardware usage (22-bit data width) for different targeted frequencies.

Targeted frequency   Slice Registers   Slice LUTs   Block RAM/FIFO   DSP48E   Achieved frequency
50 MHz                          749         1700                2        4              50 MHz
75 MHz                          765         1769                2        4              75 MHz
100 MHz                         926         1967                2        4             100 MHz
125 MHz                        1042         1714                2        4             125 MHz
150 MHz                        1546         2004                2        4             150 MHz
175 MHz                        1380         1849                2        2             165 MHz
270 MHz                        1457         1989                2        2             180 MHz
Table 2. FFT hardware usage for different throughputs.

Targeted throughput   Slice Registers   Slice LUTs   Block RAM/FIFO   DSP48E
20926                            1546         2004                2        4
12207                            1351         1693                2        2
8616                             1186         1418                2        2
6658                             1161         1404                2        1

Figure 7. Relative error (in dB) for different bit sizes of the fractional part. [Plot not reproduced in this copy.]

code. Results of the generated FFT for a Virtex-5 FPGA have been presented. The FFT has a broad range of applications in digital signal processing and multimedia. It is a key component that determines most of the design metrics in many signal processing and communication applications. HLS tools facilitate complex algorithms to be realized at a higher level. They can reduce the design cycle significantly while generating results very close to hand-written HDL designs.

7. Acknowledgements

The authors would like to thank Xilinx, Inc. (www.xilinx.com) and Synopsys (www.synopsys.com) for their valuable support.
REFERENCES
[14] [Entry truncated in this copy] ...creases Productivity—A Case Study," IEEE International Symposium on VLSI Design, Automation and Test, Hsinchu, 28-30 April 2009, pp. 96-101. doi:10.1109/VDAT.2009.5158104
[15] Berkeley Design Technology, "An Independent Evaluation of High-Level Synthesis Tools for Xilinx FPGAs," http://www.bdti.com
[16] K. L. Man, "An Overview of SystemCFL," PhD Research in Microelectronics and Electronics, Vol. 1, 2005, pp. 145-148.
[17] P. Schumacher, M. Mattavelli, A. Chirila-Rus and R. Turney, "A Software/Hardware Platform for Rapid Prototyping of Video and Multimedia Designs," Proceedings of the Fifth International Workshop on System-on-Chip for Real-Time Applications, 20-24 July 2005, pp. 30-33. doi:10.1109/IWSOC.2005.27
[18] W. Chen (Ed.), "The VLSI Handbook," 2nd Edition, Chapter 86, CRC Press LLC, Boca Raton, 2007.
[19] S. Van Haastregt and B. Kienhuis, "Automated Synthesis of Streaming C Applications to Process Networks in Hardware," Proceedings of the Conference on Design Automation & Test in Europe, April 2009, pp. 890-893.
[20] P. Coussy and A. Morawiec, "High-Level Synthesis: From Algorithm to Digital Circuits," Springer Science + Business Media, Chapters 1, 4, Berlin, 2008.
[21] N. Hatami, A. Ghofrani, P. Prinetto and Z. Navabi, "TLM 2.0 Simple Sockets Synthesis to RTL," International Conference on Design & Technology of Integrated Systems in Nanoscale Era, Vol. 1, 2000, pp. 232-235.
[22] D. L. Jones, "FFT Reference C Code," University of Illinois at Urbana-Champaign, 1992.
[23] Xilinx Inc., "CoreGen," http://www.xilinx.com
deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing
transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We
systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS
code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation
can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining,
on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various
transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim
to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential
offered by spatial computing architectures using HLS.
In addition to identifying previous work that applies one or more of the transformations defined here, we describe and publish a set of end-to-end "hands-on" examples, optimized from naive HLS codes into high-performance implementations. This includes a stencil code, matrix multiplication, and the N-body problem, all available on github. The optimized codes exhibit dramatic cumulative speedups of up to 29,950x relative to their respective naive starting points, showing the crucial necessity of hardware-aware transformations, which are not performed automatically by today's HLS compilers. As FPGAs are currently the only platforms commonly targeted by HLS tools in the HPC domain, transformations are discussed and evaluated in this context. Evaluating FPGA performance in comparison to other platforms is out of scope of this work. Our work provides a set of guidelines and a cheat sheet for optimizing high-performance codes for reconfigurable architectures, guiding both performance engineers and compiler developers to efficiently exploit these devices.

1.1 From Imperative Code to Hardware

Before diving into transformations, it is useful to form an intuition of the major stages of the source-to-hardware stack, to understand how they are influenced by the HLS code: ❶ High-level synthesis converts a pragma-assisted procedural description (C++, OpenCL) to a functionally equivalent behavioral description (Verilog, VHDL). This requires mapping variables and operations to corresponding constructs, then scheduling operations according to their interdependencies. The dependency analysis is concerned with creating a hardware mapping such that the throughput requirements are satisfied, which for pipelined sections requires the circuit to accept a new input every cycle. Coarse-grained control flow is implemented with state machines, [...] registers, including the logic between them (i.e., the critical path of the circuit), will determine the maximum obtainable frequency. ❹ Bitstream generation translates the final circuit description to a binary format used to configure the device.

Most effort invested by an HLS programmer lies in guiding the scheduling process in ❶ to implement deep, efficient pipelines, but ❷ is considered when choosing data types and buffer sizes, and ❸ can ultimately bottleneck applications once the desired parallelism has been achieved, requiring the developer to adapt their code to aid this process.

1.2 Key Transformations for High-Level Synthesis

This work identifies a set of optimizing transformations that are essential to designing scalable and efficient hardware kernels in HLS. An overview is given in Tab. 1. We divide the transformations into three major classes: pipelining transformations, which enable or improve the potential for pipelining computations; scaling transformations, which increase or expose additional parallelism; and memory enhancing transformations, which increase memory utilization and efficiency. Each transformation is further classified according to a number of characteristic effects on the HLS source code and on the resulting hardware architecture (central columns). To serve as a cheat sheet, the table furthermore lists common objectives targeted by HLS programmers, and maps them to relevant HLS transformations (rightmost columns). Characteristics and objectives are discussed in detail in the relevant transformation sections.

TABLE 1: Overview of transformations, the characteristics of their effect on the HLS code and the resulting hardware, and the objectives that they can target. [Only the last three rows of the table body are legible in this copy: Mem. buffering (§4.2), Mem. striping (§4.3), and Type demotion (§4.4); the symbol legend indicating no/positive/very positive/situational/negative effects is also not reproduced legibly.] The center group of columns marks the following transformation characteristics: (PL) enables pipelining; (RE) increases data reuse, i.e., increases the arithmetic intensity of the code; (PA) increases or exposes more parallelism; (ME) optimizes memory accesses; (RS) does not significantly increase resource consumption; (RT) does not significantly impair routing, i.e., does not potentially reduce maximum frequency or prevent the design from being routed altogether; (SC) does not change the schedule of loop nests, e.g., by introducing more loops; and (CC) does not significantly increase code complexity. The right group of columns marks the following objectives that can be targeted by transformations: (LD) resolve loop-carried dependencies, due to inter-iteration dependencies or resource contention; (RE) increase data reuse; (CU) increase parallelism; (BW) increase memory bandwidth utilization; (PL) reduce pipelining overhead; (RT) improve routing results; (RS) reduce resource utilization.

1.3 The Importance of Pipelining

Pipelining is essential to efficient hardware architectures, as expensive instruction decoding and data movement between memory, caches, and registers can be avoided by sending data directly from one computational unit to the next. We attribute two primary characteristics to pipelines:
• Latency (L): the number of cycles it takes for an input to propagate through the pipeline and arrive at the exit, i.e., the number of pipeline stages.
• Initiation interval or gap (I): the number of cycles that must pass before a new input can be accepted to the pipeline. A perfect pipeline has I=1 cycle, as this is required to keep all pipeline stages busy. Consequently, the initiation interval can often be considered the inverse throughput of the pipeline; e.g., I=2 cycles implies that the pipeline stalls every second cycle, reducing the throughput of all pipeline stages by a factor of 1/2.
To quantify the importance of pipelining in HLS, we consider the number of cycles C it takes to execute a pipeline with latency L (both in [cycles]), taking N inputs, with an initiation interval of I [cycles]. Assuming a reliable producer and consumer at either end, we have:

C = L + I \cdot (N - 1) \; [\text{cycles}]. \qquad (1)

This is shown in Fig. 1. The time to execute all N iterations of this pipeline with clock rate f [cycles/s] is thus C/f.
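As a quick sanity check of Eq. (1), the following small helper (ours, not from the paper) evaluates the cycle count at compile time; for N = 1024 inputs and L = 100 cycles, I = 1 gives 1123 cycles, while I = 2 nearly doubles this for large N:

// Cycle count of a pipeline according to Eq. (1): C = L + I * (N - 1).
constexpr long PipelineCycles(long L, long I, long N) {
  return L + I * (N - 1);
}
static_assert(PipelineCycles(100, 1, 1024) == 1123, "I=1: a new input every cycle");
static_assert(PipelineCycles(100, 2, 1024) == 2146, "I=2: throughput halved for large N");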
previous iteration, which takes multiple cycles to complete (i.e., has multiple internal pipeline stages). If the latency of the operations producing this result is L, the minimum initiation interval of the pipeline will be L. This is a common scenario when accumulating into a single register (see Fig. 2), in cases where the accumulation operation takes Lacc > 1 cycles.
2) Interface contention (intra-iteration): a hardware resource with limited ports is accessed multiple times in the same iteration of the loop. This could be a FIFO queue or RAM that only allows a single read and write per cycle, or an interface to external memory, which only
1 for (int n = 0; n < N; ++n)
2   for (int m = 0; m < M; ++m) {
3     double acc = C[n][m];
4     #pragma PIPELINE
5     for (int k = 0; k < K; ++k)
6       acc += A[n][k] * B[k][m];
7     C[n][m] = acc; }
(a) Naive implementation of general matrix multiplication C = AB + C.

1 for (int n = 0; n < N; ++n) {
2   double acc[M]; // Uninitialized
3   for (int k = 0; k < K; ++k) {
4     double a = A[n][k]; // Only read once
5     #pragma PIPELINE
6     for (int m = 0; m < M; ++m) {
7       double prev = (k == 0) ? C[n][m]
8                              : acc[m];
9       acc[m] = prev + a * B[k][m]; } }
10  for (int m = 0; m < M; ++m) // Write
11    C[n][m] = acc[m]; } // out
(b) Transposed iteration space, same location written every M cycles.

1 for (int n = 0; n < N; ++n)
2   for (int m = 0; m < M/T; ++m) {
3     double acc[T]; // Tiles of size T
4     for (int k = 0; k < K; ++k) {
5       double a = A[n][k]; // M/T reads
6       #pragma PIPELINE
7       for (int t = 0; t < T; ++t) {
8         double prev = (k == 0) ?
9           C[n][m*T+t] : acc[t];
10        acc[t] = prev + a * B[k][m*T+t]; } }
11    for (int t = 0; t < T; ++t) // Write
12      C[n][m*T+t] = acc[t]; } // out
(c) Tiled iteration space, same location written every T cycles.
Listing 1: Interleave accumulations to remove loop-carried dependency.

• The loop-carried dependency is resolved: each location is only updated every M cycles (with M ≥ Lacc in Fig. 3).
• A, B, and C are all read in a contiguous fashion, achieving perfect spatial locality (we assume row-major memory layout; for column-major we would interchange the K-loop and N-loop).
• Each element of A is read exactly once.

The modified code is shown in Lst. 1b. We leave the accumulation buffer defined on line 2 uninitialized, and implicitly reset it on line 8, avoiding M extra cycles to reset (this is a form of pipelined loop fusion, covered in Sec. 2.4).

2.1.2 Tiled Accumulation Interleaving
For accumulations done in a nested loop, it can be sufficient to interleave across a tile of an outer loop to resolve a loop-carried dependency, using a limited size buffer to store intermediate results. This tile only needs to be of size ≥ Lacc, where Lacc is the latency of the accumulation operation. This is shown in Lst. 1c for the transposed matrix multiplication example from Lst. 1b, where the accumulation array has been reduced to tiles of size T (which should be ≥ Lacc, see Fig. 3), by adding an additional inner loop over the tile and cutting the outer loop by a factor of B.

2.1.3 Single-Loop Accumulation Interleaving
If no outer loop is present, we have to perform the accumulation in two separate stages, at the cost of extra resources. For the first stage, we perform a transformation similar to the nested accumulation interleaving, but strip-mine the inner (and only) loop into blocks of size K ≥ Lacc, accumulating partial results into a buffer of size K. Once all incoming values have been accumulated into the partial result buffers, the second phase collapses the partial results into the final output. This is shown in Lst. 2 for K=16. Optionally, the two stages can be implemented to run in a coarse-grained pipelined fashion, such that the first stage begins computing new partial results while the second stage is collapsing the previous results (by exploiting dataflow between modules, see Sec. 3.3).

2   double t[16];
3   #pragma PIPELINE
4   for (int i = 0; i < N; ++i) { // P0
5     auto prev = (i < 16) ? 0 : t[i%16];
6     t[i%16] = prev + arr[i]; }
7   double res = 0;
8   for (int i = 0; i < 16; ++i) // P1
9     res += t[i]; // Not pipelined
10  return res; }
Listing 2: Two stages required for single loop accumulation.

2.1.4 Batched Accumulation Interleaving
For algorithms with loop-carried dependencies that cannot be solved by either method above (e.g., due to a non-commutative accumulation operator), we can still pipeline across an additional loop nested in the accumulation loop. This procedure is similar to Sec. 2.1.2, but only applies to programs where it is relevant to compute the accumulation for multiple data streams, and requires altering the interface and data movement of the program to interleave inputs in batches.
The code in Lst. 3a shows an iterative solver code with an inherent loop-carried dependency on state, with a minimum initiation interval corresponding to the latency LStep of the (inlined) function Step. There are no loops to interchange, and we cannot change the order of loop iterations. While there is no way to improve the latency of producing a single result, we can improve the overall throughput by a factor of LStep by pipelining across N ≥ LStep different inputs (e.g., overlap solving for different starting conditions). We effectively inject another loop over inputs, then perform transposition or tiled accumulation interleaving with this loop. The result of this transformation is shown in Lst. 3b, for a variable number of interleaved inputs N.

1 Vec<double> IterSolver(Vec<double> state, int T) {
2   #pragma PIPELINE // Will fail to pipeline with I=1
3   for (int t = 0; t < T; ++t)
4     state = Step(state);
5   return state; }
(a) Solver executed for T steps with a loop-carried dependency on state.

1 template <int N>
2 void MultiSolver(Vec<double> *in,
3                  Vec<double> *out, int T) {
4   Vec<double> b[N]; // Partial results
5   for (int t = 0; t < T; ++t)
6     #pragma PIPELINE
7     for (int i = 0; i < N; ++i) {
8       auto read = (t == 0) ? in[i] : b[i];
9       auto next = Step(read);
10      if (t < T-1) b[i] = next;
11      else out[i] = next; }} // Write out
(b) Pipeline across N ≥ LStep inputs to achieve I=1 cycle.
Listing 3: Pipeline across multiple inputs to avoid loop-carried dependency.

2.2 Delay Buffering
When iterating over regular domains in a pipelined fashion, it is often sufficient to express buffering using delay buffers, expressed either with cyclically indexed arrays, or with
constant offset delay buffers, also known from the Intel ecosystem as shift registers. These buffers are only accessed in a FIFO manner, with the additional constraint that elements are only popped once they have fully traversed the depth of the buffer (or when they pass compile-time fixed access points, called "taps", in Intel OpenCL). Despite the "shift register" name, these buffers do not need to be implemented in registers, and are frequently implemented in on-chip RAM when large capacity is needed, where values are not physically shifted.

A common set of applications that adhere to the delay buffer pattern are stencil applications such as partial differential equation solvers [27], [28], [29], image processing pipelines [30], [31], and convolutions in deep neural networks [32], [33], [34], [35], [36], all of which are typically traversed using a sliding window buffer, implemented in terms of multiple delay buffers (or, in Intel terminology, a shift register with multiple taps). These applications have been shown to be a good fit for spatial computing architectures [37], [38], [39], [40], [41], [42], [43], as delay buffering is cheap to implement in hardware, either as shift registers in general purpose logic, or in RAM blocks.

Lst. 4b demonstrates the shift register pattern used to express the stencil buffering scheme, which is supported by the Intel OpenCL toolflow. Rather than creating each individual delay buffer required to propagate values, a single array is used, which is "shifted" every cycle using unrolling (lines 6-7). The computation accesses elements of this array using constant indices only (line 10), relying on the tool to infer the partitioning into individual buffers (akin to loop idiom recognition [25]) that we did explicitly in Lst. 4a. The implicit nature of this pattern requires the tool to specifically support it. For more detail on buffering stencil codes we refer to other works on the subject [44], [39].

Opportunities for delay buffering often arise naturally in pipelined programs. If we consider the transposed matrix multiplication code in Lst. 1b, we notice that the read from acc on line 8 and the write on line 9 are both sequential and cyclical, with a period of M cycles. We could therefore also use the shift register abstraction for this array. The same is true for the accumulation code in Lst. 3b.

[Listing 4 is not fully legible in this copy; only the sub-caption "(a) Delay buffering using cyclically indexed line buffers." and a few lines survive, e.g. "14 west = center; center = east; } } // Propagate registers".]

[Unplaced listing fragment from this page: "int bin = CalculateBin(memory[i]); hist[bin] += 1; // Single cycle access ... // write result out to memory".]
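Because Listing 4 survives only in fragments above, the following minimal sketch (ours, with illustrative names and coefficients) shows the delay-buffer idea for a 1D 3-point stencil; the 2D sliding-window buffers discussed above chain one such buffer per row:

// 3-point 1D stencil using a small delay buffer: in[] is streamed exactly once,
// and the last three values are kept in registers instead of being re-read.
void Stencil1D(const float in[], float out[], int N) {
  float window[3] = {0.0f, 0.0f, 0.0f};
  #pragma PIPELINE                       // pragma spelling follows the paper's listings
  for (int i = 0; i < N; ++i) {
    window[0] = window[1];               // shift the window (constant indices only)
    window[1] = window[2];
    window[2] = in[i];
    if (i >= 2)                          // window is valid from the third element on
      out[i - 1] = 0.25f * window[0] + 0.5f * window[1] + 0.25f * window[2];
  }
}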
1 // Pipelined loops executed sequentially
2 for (int i = 0; i < N0; ++i) Foo(i, /*...*/);
3 for (int i = 0; i < N1; ++i) Bar(i, /*...*/);
(a) (L0 + I0(N0−1)) + (L1 + I1(N1−1)) cycles.

1 for (int i = 0; i < N0+N1; ++i) {
2   if (i < N0) Foo(i, /*...*/);
3   else Bar(i - N0, /*...*/); }
(b) L2 + I(N0 + N1 − 1) cycles.

1 for (int i = 0; i < max(N0, N1); ++i) {
2   if (i < N0) Foo(i, /*...*/); // Omit ifs
3   if (i < N1) Bar(i, /*...*/); } // for N0==N1
(c) L3 + I · (max(N0, N1) − 1) cycles.

Listing 5: Two subsequent pipelined loops fused sequentially (Lst. 5b) or concurrently (Lst. 5c). Assume that all loops are pipelined (pragmas omitted for brevity).
For two consecutive loops with latencies/bounds/initiation intervals {L0, N0, I0} and {L1, N1, I1} (Lst. 5a), respectively, the total runtime according to Eq. 1 is (L0 + I0(N0−1)) + (L1 + I1(N1−1)). Depending on which condition(s) are met, we can distinguish between three levels of pipelined loop fusion, with increasing performance benefits:
1) I = I0 = I1 (true in most cases): Loops can be fused by summing the loop bounds, using loop guards to sequentialize them within the same pipeline (Lst. 5b).
2) Condition 1 is met, and only fine-grained or no dependencies exist between the two loops: Loops can be fused by iterating to the maximum loop bound, and loop guards are placed as necessary to predicate each section (Lst. 5c).
3) Conditions 1 and 2 are met, and N = N0 = N1 (same loop bounds): Loop bodies can be trivially fused (Lst. 5c, but with no loop guards necessary).
An alternative way of performing pipeline fusion is to instantiate each stage as a separate processing element, and stream fine-grained dependencies between them (Sec. 3.3).
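As a rough illustration with assumed numbers, take L0 = L1 = 100, I0 = I1 = 1, and N0 = N1 = 1000. Executing the loops back-to-back (Lst. 5a) costs (100 + 999) + (100 + 999) = 2198 cycles. Fusing by summing the bounds (Lst. 5b) costs about 100 + 1999 = 2099 cycles, saving one fill/drain phase, while fusing to the maximum bound (Lst. 5c) costs about 100 + 999 = 1099 cycles, nearly halving the runtime (assuming the fused pipeline latencies remain comparable to L0 and L1).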
2.5 Pipelined Loop Switching

The benefits of pipelined loop fusion can be extended to coarse-grained control flow by using loop switching (as opposed to loop unswitching, which is a common transformation [25] on load/store architectures). Whereas instruction-based architectures attempt to only execute one branch of a conditional jump (via branch prediction on out-of-order processors), a conditional in a pipelined scenario will result in both branches being instantiated in hardware, regardless of whether or how often each is executed. The transformation of coarse-grained control flow into fine-grained control flow is implemented by the HLS tool by introducing predication to the pipeline, at no significant runtime penalty.

Lst. 7 shows a simple example of how the transformation fuses two pipelined loops in different branches into a single loop switching pipeline. The transformation applies to any pipelined code in either branch, following the principles described for pipelined loop fusion (§2.4 and Lst. 5).

(a) Coarse-grained control flow:
1 if (condition)
2   #pragma HLS PIPELINE
3   for (int i = 0; i < N0; ++i)
4     y[i] = Foo(x[i]);
5 else
6   #pragma HLS PIPELINE
7   for (int i = 0; i < N1; ++i)
8     y[i] = Bar(x[i]);

(b) Control flow absorbed into pipeline:
1 auto N = condition ? N0 : N1;
2 #pragma HLS PIPELINE
3 for (int i = 0; i < N; ++i) {
4   if (condition)
5     y[i] = Foo(x[i]);
6   else
7     y[i] = Bar(x[i]); }

Listing 7: Pipelined loop switching absorbs coarse-grained control flow.

The implications of pipelined loop switching are more subtle than the pure fusion examples in Lst. 5, as the total number of loop iterations is not affected (assuming the fused loop bound is set according to the condition, see line 1 in Lst. 7b). There can be a (tool-dependent) benefit from saving overhead logic by only implementing the orchestration and interfaces of a single pipeline, at the (typically minor) cost of the corresponding predication logic. More importantly, eliminating the coarse-grained control can enable other transformations that significantly benefit performance, such as fusion [§2.4] with adjacent pipelined loops, flattening nested loops [§2.6], and on-chip dataflow [§3.3].

2.6 Pipelined Loop Flattening/Coalescing

To minimize the number of cycles spent in filling/draining pipelines (where the circuit is not streaming at full throughput), we can flatten nested loops to move the fill/drain phases to the outermost loop, fusing/absorbing code that is not in the innermost loop if necessary.

Lst. 8a shows a code with two nested loops, and gives the total number of cycles required to execute the program. The latency of the drain phase of the inner loop and the latency of Bar outside the inner loop must be paid at every iteration of the outer loop. If N0 ≫ L0, the cycle count becomes just L1 + N0·N1, but for applications where N0 is comparable to L0, draining the inner pipeline can significantly impact the runtime (even if N1 is large). By transforming the code such that all loops are perfectly nested (see Lst. 8b), the HLS tool can effectively coalesce the loops into a single pipeline, where the next iteration of the outer loop can be executed immediately after the previous one finishes.

(a) L1 + N1·(L0 + N0 − 1) cycles:
1 for (int i = 0; i < N1; ++i) {
2   #pragma PIPELINE
3   for (int j = 0; j < N0; ++j)
4     Foo(i, j);
5   Bar(i); }

(b) L2 + N0·N1 − 1 cycles:
1 for (int i = 0; i < N1; ++i) {
2   #pragma PIPELINE
3   for (int j = 0; j < N0; ++j) {
4     Foo(i, j);
5     if (j == N0 - 1) Bar(i); } }

Listing 8: Before and after coalescing the loop nest to avoid inner pipeline drains.

[Diagram: before coalescing, the nest executes as an outer state and inner states 0 and 1; after coalescing, a single state remains.]

To perform the transformation in Lst. 8, we had to absorb Bar into the inner loop, adding a loop guard (line 5 in Lst. 8b), analogous to pipelined loop fusion (§2.4), where the second pipelined "loop" consists of a single iteration. This contrasts with the loop peeling transformation, which is used by CPU compilers to regularize loops to avoid branch mispredictions and to increase amenability to vectorization. While loop peeling can also be beneficial in hardware, e.g., to avoid deep conditional logic in a pipeline, small inner loops can see a significant performance improvement from eliminating the draining phase.

2.7 Inlining

In order to successfully pipeline a scope, all function calls within the code section must be pipelineable. This typically requires the called functions to be inlined, with additional resources consumed for every additional callsite after the first. This replication is done automatically by HLS compilers on demand, but an additional inline pragma can be specified to directly "paste" the function body into the callsite during preprocessing, removing the function boundary during optimization and scheduling.
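As a small illustration (a sketch under Vivado HLS conventions; the function names are placeholders), the inline pragma removes the function boundary so that the surrounding loop can be scheduled as a single pipeline:

float MultiplyAdd(float a, float x, float y) {
  #pragma HLS INLINE   // Paste the body into every callsite before scheduling
  return a * x + y;
}

void Saxpy(const float x[], const float y[], float out[], int N, float alpha) {
  #pragma PIPELINE
  for (int i = 0; i < N; ++i)
    out[i] = MultiplyAdd(alpha, x[i], y[i]); // Pipelined as if written inline
}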
3 SCALABILITY TRANSFORMATIONS

Parallelism in HLS revolves around the folding of loops, achieved through unrolling. In Sec. 2.1 we used strip-mining and reordering to avoid loop-carried dependencies by changing the schedule of computations in the pipelined loop nest. In this section, we similarly strip-mine and reorder loops, but with additional unrolling of the strip-mined chunks. Pipelined loops constitute the iteration space, the size of which determines the number of cycles it takes to execute the program. Unrolled loops, in a pipelined program, correspond to the degree of parallelism in the architecture, as every expression in an unrolled statement is required to exist as hardware. Parallelizing a code thus means turning sequential/pipelined loops fully or partially into parallel/unrolled loops. This corresponds to folding the sequential iteration space, as the number of cycles taken to execute the program is effectively reduced by the inverse of the unrolling factor.

(a) Before. (b) Horizontal unroll. (c) Vertical unroll. (d) Dataflow.
Fig. 5: Horizontal unrolling, vertical unrolling, and dataflow, as means to increase parallelism. Rectangles represent buffer space, such as registers or on-chip RAM. Horizontal: four independent inputs processed in parallel. Vertical: one input is combined with multiple buffered values. Dataflow: similar to vertical, but input or partial results are streamed through a pipeline rather than broadcast.

3.1 Horizontal Unrolling (Vectorization)

We implement vectorization-style parallelism with HLS by "horizontally" unrolling loops in pipelined sections, or by introducing vector types, folding the sequential iteration space accordingly. This is the most straightforward way of adding parallelism, as it can often be applied directly to an inner loop without further reordering or drastic changes to the nested loop structure. Vectorization is more powerful in HLS than SIMD operations on load/store architectures, as the unrolled compute units are not required to be homogeneous, and the number of units is not constrained to fixed sizes. Horizontal unrolling increases bandwidth utilization by explicitly exploiting spatial locality, allowing more efficient accesses to off-chip memory such as DRAM.

Lst. 9 shows two functionally equivalent ways of vectorizing a loop over N elements by a horizontal unrolling factor of W. Lst. 9a strip-mines a loop into chunks of W and unrolls the inner loop fully, while Lst. 9b uses partial unrolling by specifying an unroll factor in the pragma. As a third option, explicit vector types can be used, such as those built into OpenCL (e.g., float4 or int16), or custom vector classes [48]. These provide less flexibility, but are more concise and are sufficient for most applications.

[Listing 9: Two variants of vectorization by factor W using loop unrolling. (a) Using strip-mining: C[i*W + w] = A[i*W + w]*B[i*W + w]; (b) Using partial unrolling: C[i] = A[i] * B[i];]
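The two variants of Lst. 9 can be sketched as follows (our own reconstruction in the style of the paper's listings, not the original code; W, the function names, and the spelling of the factor clause are assumptions):

constexpr int W = 4; // Vectorization factor (assumed compile-time constant)

// (a) Strip-mine into chunks of W and fully unroll the inner loop.
void VectorizedA(const float A[], const float B[], float C[], int N) {
  #pragma PIPELINE
  for (int i = 0; i < N / W; ++i)
    #pragma UNROLL
    for (int w = 0; w < W; ++w)
      C[i*W + w] = A[i*W + w] * B[i*W + w];
}

// (b) Partially unroll the original loop by a factor of W.
void VectorizedB(const float A[], const float B[], float C[], int N) {
  #pragma PIPELINE
  #pragma UNROLL FACTOR=W // Spelling of the factor clause is tool-dependent
  for (int i = 0; i < N; ++i)
    C[i] = A[i] * B[i];
}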
In practice, the unrolling factor W [operand/cycle] is constrained by the bandwidth B [Byte/s] available to the compute logic (e.g., from off-chip memory), according to Wmax = ⌊B/(f·S)⌋, where f [cycle/s] is the clock frequency of the unrolled logic, and S [Byte/operand] is the operand size in bytes. Horizontal unrolling is usually not sufficient to achieve high logic utilization on large chips, where the available memory bandwidth is low compared to the available amount of compute logic. Furthermore, because the energy cost of I/O is orders of magnitude higher than moving data on the chip, it is desirable to exploit on-chip memory and pipeline parallelism instead (this follows in Sec. 3.2 and 3.3).
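For a rough illustration with assumed numbers: a single DDR4-2400 bank provides B ≈ 19.2 GByte/s, so at f = 300 MHz and S = 4 Byte (e.g., single precision floating point), Wmax = ⌊19.2·10^9 / (300·10^6 · 4)⌋ = 16 operands per cycle, and unrolling wider than 16 would only leave compute units starved for data.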
3.2 Vertical Unrolling

We can achieve scalable parallelism in HLS without relying on external memory bandwidth by exploiting data reuse, distributing input elements to multiple computational units replicated "vertically" through unrolling [49], [38], [50]. This is the most potent source of parallelism on hardware architectures, as it can conceptually scale indefinitely with the available silicon when enough reuse is possible. Viewed from the paradigm of cached architectures, the opportunity for this transformation arises from temporal locality in loops. Vertical unrolling draws on bandwidth from on-chip fast memory by storing more elements temporally, combining them with new data streamed in from external memory to increase parallelism, allowing more computational units to run in parallel at the expense of buffer space. In comparison, horizontal unrolling requires us to widen the data path that passes through the processing elements (compare Fig. 5b and 5c).

When attempting to parallelize a new algorithm, identifying a source of temporal parallelism to feed vertical unrolling is essential to whether the design will scale. Programmers should consider this carefully before designing the hardware architecture. From a reference software code, the programmer can identify scenarios where reuse occurs, then extract and explicitly express the temporal access pattern in hardware, using a delay buffering [§2.2] or random-access [§2.3] buffering scheme. Then, if additional reuse is possible, vertically unroll the circuit to scale up performance.

1 for (int n = 0; n < N / P; ++n) { // Folded by unrolling factor P
2   for (int m = 0; m < M / T; ++m) { // Tiling
3     double acc[T][P]; // Is now 2D
4     // ...initialize acc from C...
5     for (int k = 0; k < K; ++k) {
6       double a_buffer[P]; // Buffer multiple elements to combine with
7       #pragma PIPELINE   // incoming values of B in parallel
8       for (int p = 0; p < P; ++p)
9         a_buffer[p] = A[n*P + p][k];
10      #pragma PIPELINE
11      for (int t = 0; t < T; ++t) // Stream tile of B
12        #pragma UNROLL
13        for (int p = 0; p < P; ++p) // P-fold vertical unrolling
14          acc[t][p] += a_buffer[p] * B[k][m*T + t];
15    } /* ...write back 2D tile of C... */ } }

Listing 10: P-fold vertical unrolling of matrix multiplication.

As an example, we return to the matrix multiplication code from Lst. 1c. In Sec. 2.1.2, we saw that strip-mining
and reordering loops allowed us to move reads from matrix A out of the inner loop, re-using the loaded value across T different entries of matrix B streamed in, while keeping the element of A in a register. Since every loaded value of B eventually needs to be combined with all N rows of A, we realize that we can perform more computations in parallel by keeping multiple values of A in local registers. The result of this transformation is shown in Lst. 10. By buffering P elements (where P was 1 in Lst. 1c) of A prior to streaming in the tile of the B-matrix (lines 8-9), we can fold the outer loop over rows by a factor of P, using unrolling to multiply parallelism (as well as the buffer space required for the partial sums) by a factor of P (lines 12-14).

3.3 Dataflow

For complex codes it is common to partition functionality into multiple modules, or processing elements (PEs), streaming data between them through explicit interfaces. In contrast to conventional pipelining, PEs arranged in a dataflow architecture are scheduled separately when synthesized by the HLS tool. There are multiple benefits to this:
• Different functionality runs at different schedules. For example, issuing memory requests, servicing memory requests, and receiving requested memory can all require different pipelines, state machines, and even clock rates.
• Smaller components are more modular, making them easier to reuse, debug and verify.
• The effort required by the HLS tool to schedule code sections increases dramatically with the number of operations that need to be considered for the dependency and pipelining analysis. Scheduling logic in smaller chunks is thus beneficial for compilation time.
• Large fan-out/fan-in is challenging to route on real hardware (i.e., 1-to-N or N-to-1 connections for large N). This is mitigated by partitioning components into smaller parts and adding more pipeline stages.
• The fan-in and fan-out of control signals (i.e., stall, reset) within each module is reduced, reducing the risk of these signals constraining the maximum achievable frequency.

To move data between PEs, communication channels with a handshake mechanism are used. These channels double as synchronization points, as they imply a consensus on the program state. In practice, channels are always FIFO interfaces, and support the standard queue operations Push and Pop, and sometimes Empty, Full, and Size operations. They occupy the same register or block memory resources as other buffers (Sec. 2.2/Sec. 2.3).
The mapping from source code to PEs differs between HLS tools, but is manifested when functions are connected using channels. In the following example, we will use the syntax from Xilinx Vivado HLS to instantiate PEs, where each non-inlined function corresponds to a PE, and these are connected by channels that are passed as arguments to the functions from a top-level entry function. Note that this functionally diverges from C++ semantics without additional abstraction [48], as each function in the dataflow scope is executed in parallel in hardware, rather than in the sequence specified in the imperative code. In Intel OpenCL, dataflow semantics are instead expressed with multiple kernel functions, each defining a PE, which are connected by global channel objects prefixed with the channel keyword.

To see how streaming can be an important tool to express scalable hardware, we apply it in conjunction with vertical unrolling (Sec. 3.2) to implement an iterative version of the stencil example from Lst. 4. Unlike the matrix multiplication code, the stencil code has no scalable source of parallelism in the spatial dimension. Instead, we can achieve reuse by folding the outer time-loop to treat P consecutive timesteps in a pipeline parallel fashion, each computed by a distinct PE, connected in a chain via channels [37], [51], [38]. We replace the memory interfaces to the PE with channels, such that the memory read and write become Pop and Push operations, respectively. The resulting code is shown in Lst. 11a. We then vertically unroll to generate P instances of the PE (shown in Lst. 11b), effectively increasing the throughput of the kernel by a factor of P, and consequently reducing the runtime by folding the outermost loop by a factor of P (line 3 in Lst. 11a). Such architectures are sometimes referred to as systolic arrays [52], [53].

For architectures/HLS tools where large fan-out is an issue for compilation or routing, an already replicated design can be transformed to a dataflow architecture. For example, in the matrix multiplication example in Lst. 10, we can move the P-fold unroll out of the inner loop, and replicate the entire PE instead, replacing reads and writes with channel accesses [50]. B is then streamed into the first PE, and passed downstream every cycle. A and C should no longer be accessed by every PE, but rather be handed downstream similar to B, requiring a careful implementation of the start and drain phases, where the behavior of each PE will vary slightly according to its depth in the sequence.

1 void PE(FIFO<float> &in, FIFO<float> &out, int T) {
2   // ...initialization...
3   for (int t = 0; t < T / P; ++t) // Fold timesteps T by factor P
4     #pragma PIPELINE
5     for (/* loops over spatial dimensions */) {
6       auto south = in.Pop(); // Value for t-1 from previous PE
7       // ...load values from delay buffers...
8       auto next = 0.25*(north + west + east + south);
9       out.Push(next); }} // Value for t sent to PE computing t+1

(a) Processing element for a single timestep. Will be replicated P times.

1 #pragma DATAFLOW // Schedule nested functions as parallel modules
2 void SystolicStencil(const float in[], float out[], int T) {
3   FIFO<float> pipes[P + 1]; // Assume P is given at compile time
4   ReadMemory(in, pipes[0]); // Head
5   #pragma UNROLL // Replicate PEs
6   for (int p = 0; p < P; ++p)
7     PE(pipes[p], pipes[p + 1], T); // Forms a chain
8   WriteMemory(pipes[P], out); } // Tail

(b) Instantiate and connect P consecutive and parallel PEs.

Listing 11: Dataflow between replicated PEs to compute P timesteps in parallel.

3.4 Tiling

Loop tiling in HLS is commonly used to fold large problem sizes into manageable chunks that fit into fast on-chip memory, in an already pipelined program [38]. Rather than making the program faster, this lets the already fast architecture support arbitrarily large problem sizes. This is in contrast to loop tiling on CPU and GPU, where tiling is used to increase performance. Common to both paradigms is that they fundamentally aim to meet fast memory constraints. As with horizontal and vertical unrolling, tiling relies on strip-mining loops to alter the iteration space.
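As a minimal sketch of this pattern (our own example in the style of the paper's listings; the tile size and the elementwise computation are assumptions), a problem of arbitrary size N is processed through a fixed-size on-chip tile:

constexpr int T = 1024; // Tile size chosen to fit in on-chip memory

void ProcessTiled(const float in[], float out[], int N) { // Assumes N is a multiple of T
  float tile[T]; // On-chip buffer whose size is independent of N
  for (int t = 0; t < N / T; ++t) {
    #pragma PIPELINE
    for (int i = 0; i < T; ++i)  // Load one tile into fast memory
      tile[i] = in[t*T + i];
    #pragma PIPELINE
    for (int i = 0; i < T; ++i)  // Compute on the tile (which could be reused many times)
      out[t*T + i] = 0.5f * tile[i] + 1.0f;
  }
}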
Tiling was already shown in Sec. 2.1.2, when the accumulation buffer in Lst. 1b was reduced to a tile buffer in
Lst. 1c, such that the required buffer space used for partial results became a constant, rather than being dependent on the input size. This transformation is also relevant to the stencil codes in Lst. 4, where it can be used similarly to restrict the size of the line buffers or shift register, so they are no longer proportional to the problem size.

4 MEMORY ACCESS TRANSFORMATIONS

When an HLS design has been pipelined, scheduled, and unrolled as desired, the memory access pattern has been established. In the following, we describe transformations that optimize the efficiency of off-chip memory accesses in the HLS code. For memory bound codes in particular, this is critical for performance after the design has been pipelined.

4.1 Memory Access Extraction

By extracting accesses to external memory from the computational logic, we enable compute and memory accesses to be pipelined and optimized separately. Accessing the same interface multiple times within the same pipelined section is a common cause for poor memory bandwidth utilization and increased initiation interval due to interface contention, since the interface can only service a single request per cycle. In the Intel OpenCL flow, memory extraction is done automatically by the tool, but since this process must be conservative due to limited information, it is often still beneficial to do the extraction explicitly in the code [54]. In many cases, such as for independent reads, this is not an inherent memory bandwidth or latency constraint, but arises from the tool scheduling iterations according to program order. This can be relaxed when allowed by inter-iteration dependencies (which can in many cases be determined automatically, e.g., using polyhedral analysis [55]).

In Lst. 12a, the same memory (i.e., hardware memory interface) is accessed twice in the inner loop. In the worst case, the program will issue two 4 Byte memory requests every iteration, resulting in poor memory performance, and preventing pipelining of the loop. In software, this problem is typically mitigated by caches, always fetching at least one cache line. If we instead read the two sections of A sequentially (or in larger chunks), the HLS tool can infer two burst accesses to A of length N/2, shown in Lst. 12c. Since the schedules of memory and computational modules are independent, ReadA can run ahead of PE, ensuring that memory is always read at the maximum bandwidth of the interface (Sec. 4.2 and Sec. 4.3 will cover how to increase this bandwidth). From the point of view of the computational PE, both A0 and A1 are read in parallel, as shown on line 5 in Lst. 12b, hiding initialization time and inconsistent memory producers in the synchronization implied by the data streams.

[Listing 12 (only fragments recovered): (a) void PE(const int A[N], int B[N/2]) { #pragma PIPELINE // Achieves I=2 ...; the accompanying figure contrasts single-element accesses with burst accesses between DRAM and the compute module.]
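A hedged sketch of this extraction (our own, using Vivado HLS hls::stream objects rather than the paper's exact Listing 12; the chunk size, dataflow region, and function names are assumptions): a dedicated module reads the two halves of A in long sequential bursts and forwards them through FIFOs, so the computational module sees two parallel streams and never touches the memory interface directly.

#include <hls_stream.h>

constexpr int kChunk = 64; // Burst length; the FIFO depth must cover one chunk

void ReadA(const int A[], hls::stream<int> &a0, hls::stream<int> &a1, int N) {
  for (int c = 0; c < N / 2; c += kChunk) { // Assumes N is a multiple of 2*kChunk
    #pragma PIPELINE
    for (int i = 0; i < kChunk; ++i)   // One long burst from the first half of A
      a0.write(A[c + i]);
    #pragma PIPELINE
    for (int i = 0; i < kChunk; ++i)   // One long burst from the second half of A
      a1.write(A[N / 2 + c + i]);
  }
}

void PE(hls::stream<int> &a0, hls::stream<int> &a1, int B[], int N) {
  #pragma PIPELINE
  for (int i = 0; i < N / 2; ++i)      // The PE sees both halves as parallel streams
    B[i] = a0.read() + a1.read();
}

void Top(const int A[], int B[], int N) {
  #pragma HLS DATAFLOW                 // ReadA and PE are scheduled as separate modules
  hls::stream<int> a0, a1;
  #pragma HLS STREAM variable=a0 depth=64
  #pragma HLS STREAM variable=a1 depth=64
  ReadA(A, a0, a1, N);
  PE(a0, a1, B, N);
}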
An important use case of memory extraction appears in the stencil code in Lst. 11, where it is necessary to separate the memory accesses such that the PEs are agnostic of whether data is produced/consumed by a neighboring PE or by a memory module. Memory access extraction is also useful for performing data layout transformations in fast on-chip memory. For example, we can change the schedule of reads from A in Lst. 10 to a more efficient scheme by buffering values in on-chip memory, while streaming them to the kernel according to the original schedule.

4.2 Memory Buffering

When dealing with memory interfaces with an inconsistent data rate, such as DRAM, it can be beneficial to request and buffer accesses earlier and/or at a more aggressive pace than what is consumed or produced by the computational elements. For memory reads, this can be done by reading ahead of the kernel into a deep buffer instantiated between memory and computations, by either 1) accessing wider vectors from memory than required by the kernel, narrowing or widening data paths (a.k.a. "gearboxing") when piping to or from computational elements, respectively, or 2) increasing the clock rate of modules accessing memory with respect to the computational elements.

The memory access function in Lst. 12c allows long bursts to the interface of A, but receives the data on a narrow bus at W·Sint = (1·4) Byte/cycle. In general, this limits the bandwidth consumption to f·W·Sint at frequency f, which is likely to be less than what the external memory can provide.
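A hedged sketch of option 1) above (our own, assuming Vivado HLS ap_uint and hls::stream types; the 512-bit bus width, the element type, and the FIFO depth are assumptions): a reader fetches wide words from memory into a deep FIFO, and a separate module unpacks ("gearboxes") them into the narrow element stream consumed by the kernel.

#include <ap_int.h>
#include <hls_stream.h>

constexpr int kBusBits = 512;            // Width of the memory interface (assumed)
constexpr int kPerBus = kBusBits / 32;   // 32-bit elements packed per bus word

void ReadWide(const ap_uint<kBusBits> mem[], hls::stream<ap_uint<kBusBits> > &wide,
              int numWords) {
  #pragma PIPELINE
  for (int i = 0; i < numWords; ++i)     // Long bursts at the full bus width
    wide.write(mem[i]);
}

void Gearbox(hls::stream<ap_uint<kBusBits> > &wide, hls::stream<ap_uint<32> > &narrow,
             int numWords) {
  ap_uint<kBusBits> buffer;
  #pragma PIPELINE
  for (int i = 0; i < numWords * kPerBus; ++i) {
    const int j = i % kPerBus;
    if (j == 0) buffer = wide.read();    // One wide word feeds kPerBus narrow cycles
    ap_uint<32> element = buffer.range(32 * j + 31, 32 * j);
    narrow.write(element);               // Narrow stream consumed by the kernel
  }
}

void ReadBuffered(const ap_uint<kBusBits> mem[], hls::stream<ap_uint<32> > &toKernel,
                  int numWords) {
  #pragma HLS DATAFLOW
  hls::stream<ap_uint<kBusBits> > wide;
  #pragma HLS STREAM variable=wide depth=512 // Deep FIFO absorbs the inconsistent DRAM rate
  ReadWide(mem, wide, numWords);
  Gearbox(wide, toKernel, numWords);
}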
To better exploit available bandwidth, we can either read [...]
[...] respective interfaces, pushing to FIFO buffers that are read in parallel and combined by another module (for writing: in reverse), exposing a single data stream to the computational kernel. This is illustrated in Fig. 6, where the unlabeled [...]

[Fig. 6 (only fragments recovered): memory accesses striped across the DDR banks DDR0-DDR3.]

[...] moved to a type that is natively supported by the target architecture, such as single precision floating point on Intel's Arria 10 and Stratix 10 devices [56]. Reducing the size of the data type benefits:
• Bandwidth bound architectures, where performance can be improved by up to the same factor that the size of the data type can be reduced by.
• Latency bound architectures, where the data type can be reduced to a lower latency operation, e.g., from floating point to integer.
In the most extreme case, it has been shown that collapsing the data type of weights and activations in deep neural networks to binary [34] can provide sufficient speedup for inference that the increased number of weights makes up for the loss of precision per weight.

5 SOFTWARE TRANSFORMATIONS IN HLS

In addition to the transformations described in the sections above, we include an overview of how well-known CPU-oriented transformations apply to HLS, based on the compiler transformations compiled by Bacon et al. [25]. These transformations are included in Tab. 2, and are partitioned into three categories:
• Transformations directly relevant to the HLS transformations already presented here.
• Transformations that are the same or similar to their software counterparts.
• Transformations with little or no relevance to HLS.

TABLE 2: The relation of traditional CPU-oriented transformations to HLS codes.
- Loop interchange [57], [47] is used to resolve loop-carried dependencies [§2].
- Strip-mining [58], loop tiling [59], [47], and cycle shrinking [60] are central components of many HLS transformations [§2.1, §3.1, §3.2, §2.1.2].
- Loop distribution and loop fission [61], [47] are used to separate differently scheduled function calls.
- I/O format compilation: no I/O is supported directly in HLS.
- Supercompiling is infeasible for HLS due to long synthesis times.
- Loop pushing/embedding: inlining completely is favored to allow pipelining.
- Automatic decomposition and alignment, scalar privatization, array privatization, cache alignment, and false sharing are not relevant for HLS, as there is no (implicit) cache coherency protocol in hardware.
- Procedure call parallelization and split do not apply, as there are no forks in hardware.
- Graph partitioning only applies to explicit dataflow languages.
- There are no instruction sets in hardware, so VLIW transformations do not apply.

It is interesting to note that the majority of well-known transformations from software apply to HLS. This implies that we can leverage much of decades of research into high-performance computing transformations to also optimize hardware programs, including many that can be applied directly (i.e., without further adaptation to HLS) to the imperative source code or intermediate representation before synthesizing for hardware. We stress the importance of support for these pre-hardware generation transformations in HLS compilers, as they lay the foundation for the hardware-specific transformations proposed here.

6 END-TO-END EXAMPLES

To showcase the transformations presented in this work and provide a "hands-on" opportunity for seeing HLS optimizations applied in practice, we will describe the optimization process on a sample set of classical HPC kernels, available as open source repositories on GitHub (https://github.com/spcl?q=hls). These kernels are
written in C++ for Xilinx Vivado HLS [12] with hlslib [48] extensions, and are built and run using the Xilinx Vitis environment. For each example, we will describe the sequence of transformations applied, and give the resulting performance at each major stage.

The included benchmarks were run on an Alveo U250 board, which houses a Xilinx UltraScale+ XCU250-FIGD2104-2L-E FPGA and four 2400 MT/s DDR4 banks (we utilize 1-2 banks for the examples here). The chip consists of four almost identical chiplets with limited interconnect between them, where each chiplet is connected to one of the DDR4 pinouts. This multi-chiplet design allows more resources (1728K LUTs and 12,288 DSPs), but poses challenges for the routing process, which impedes the achievable clock rate and resource utilization for a monolithic kernel attempting to span the full chip. Kernels were compiled for the xilinx_u250_xdma_201830_2 shell with Vitis 2019.2 and executed with version 2.3.1301 of the Xilinx Runtime (XRT). All benchmarks are included in Fig. 7, and the [...]

Fig. 7: Performance progression of kernels when applying transformations. Parentheses show speedup over previous version, and cumulative speedup. (Bar chart in GOp/s for the Stencil, Matrix Multiplication, and N-Body kernels across the Naive, Pipelined, Vectorized, and Systolic variants.)

Fig. 8: Resource usage of kernels from Fig. 7 as fractions of available resources. The maxima are taken as 1728K LUTs, 12,288 DSPs, and 2688 BRAM.
[...] into P parallel processing elements arranged in a systolic array. Each element holds T resident particles, and particles are streamed [§3.3] through the PEs. The second stage gains a factor of 4× corresponding to the latency of the interleaved accumulation, followed by a factor of 42× from unrolling units across the chip. T ≥ L+ can be used to regulate the arithmetic intensity of the kernel. The bandwidth requirements can be reduced further by storing more resident particles on the chip, scaling up to the full fast memory usage of the FPGA. The tiled accumulation interleaving transformation thus enables not just pipelining of the compute, but also minimization of I/O. The optimized implementation is available on GitHub (https://github.com/spcl/nbody_hls).

These examples demonstrate the impact of different transformations on a reconfigurable hardware platform. In particular, enabling pipelining, regularizing memory accesses, and vertical unrolling are shown to be central components of scalable hardware architectures. The dramatic speedups over naive codes also emphasize that HLS tools do not yield competitive performance out of the box, making it critical to perform further transformations. For additional examples of optimizing HLS codes, we refer to the numerous works applying HLS optimizations listed below.

7 RELATED WORK

Optimized applications. Much work has been done in optimizing C/C++/OpenCL HLS codes for FPGA, such as stencils [38], [39], [40], [74], [75], deep neural networks [76], [77], [35], [36], [34], matrix multiplication [78], [75], [50], [79], graph processing [80], [81], networking [82], light propagation for cancer treatment [46], and protein sequencing [49], [83]. These works optimize the respective applications using transformations described here, such as delay buffering, random access buffering, vectorization, vertical unrolling, tiling for on-chip memory, and dataflow.

Transformations. Zohouri et al. [84] use the Rodinia benchmark to evaluate the performance of OpenCL codes targeting FPGAs, employing optimizations such as SIMD vectorization, sliding-window buffering, accumulation interleaving, and compute unit replication across multiple kernels. We present a generalized description of a superset of these transformations, along with concrete code examples that show how they are applied in practice. The DaCe framework [85] exploits information on explicit dataflow and control flow to perform a wide range of transformations, and code generates efficient HLS code using vendor-specific pragmas and primitives. Kastner et al. [86] go through the implementation of many HLS codes in Vivado HLS, focusing on algorithmic optimizations. da Silva et al. [87] explore using modern C++ features to capture HLS concepts in a high-level fashion. Lloyd et al. [88] describe optimizations specific to Intel OpenCL, and include a variant of memory access extraction, as well as the single-loop variant of accumulation interleaving.

Directive-based frameworks. High-level, directive-based frameworks such as OpenMP and OpenACC have been proposed as alternative abstractions for generating FPGA kernels. Leow et al. [89] implement an FPGA code generator from OpenMP pragmas, primarily focusing on correctness in implementing a range of OpenMP pragmas. Lee et al. [90] present an OpenACC to OpenCL compiler, using Intel OpenCL as a backend. The authors implement horizontal and vertical unrolling, pipelining and dataflow by introducing new OpenACC clauses. Papakonstantinou et al. [91] generate HLS code for FPGA from directive-annotated CUDA code.

Optimizing HLS compilers. Mainstream HLS compilers automatically apply many of the well-known software transformations in Tab. 2 [22], [92], [93], but can also employ more advanced FPGA transformations. Intel OpenCL [19] performs memory access extraction into "load store units" (LSUs), does memory striping between DRAM banks, and detects and auto-resolves some buffering and accumulation patterns. The proprietary Merlin Compiler [94] uses high-level acceleration directives to automatically perform some of the transformations described here, as source-to-source transformations to underlying HLS code. Polyhedral compilation is a popular framework for optimizing CPU and GPU loop nests [55], and has also been applied to HLS for FPGA for optimizing data reuse [95]. Such techniques may prove valuable in automating, e.g., memory extraction and tiling transformations. While most HLS compilers rely strictly on static scheduling, Dynamatic [68] considers dynamically scheduling state machines and pipelines to allow reducing the number of stages executed at runtime.

Domain-specific frameworks. Implementing programs in domain specific languages (DSLs) can make it easier to detect and exploit opportunities for advanced transformations. Darkroom [30] generates optimized HDL for image processing codes, and the popular image processing framework Halide [31] has been extended to support FPGAs [96], [97]. Luzhou et al. [53] and StencilFlow [44] propose frameworks for generating stencil codes for FPGAs. These frameworks rely on optimizations such as delay buffering, dataflow, and vertical unrolling, which we cover here. Using DSLs to compile to structured HLS code can be a viable approach to automating a wide range of transformations, as proposed by Koeplinger et al. [98], and the FROST [99] DSL framework.

Other approaches. There are other approaches than C/C++/OpenCL-based HLS languages to addressing the productivity issues of hardware design: Chisel/FIRRTL [100], [101] maintains the paradigm of behavioral programming known from RTL, but provides modern language and compiler features. This caters to developers who are already familiar with hardware design, but wish to use a more expressive language. In the Maxeler ecosystem [102], kernels are described using a Java-based language, but rather than transforming imperative code into a behavioral equivalent, the language provides a DSL of hardware concepts that are instantiated using object-oriented interfaces. By constraining the input, this encourages developers to write code that maps well to hardware, but requires learning a new language exclusive to the Maxeler ecosystem.

8 TOOLFLOW OF XILINX VS. INTEL

When choosing a toolflow to start designing hardware with HLS, it is useful to understand the two distinct approaches taken by the two major vendors: Intel OpenCL wishes to enable writing accelerators using software, making an effort to abstract away low-level details about the hardware, and
present a high-level view to the programmer; whereas Xilinx' Vivado HLS provides a more productive way of writing hardware, by means of a familiar software language. Xilinx uses OpenCL as a vehicle to interface between FPGA and host, but implements the OpenCL compiler itself as a thin wrapper around the C++ compiler, whereas Intel embraces the OpenCL paradigm as their frontend (although they encourage writing single work item kernels [103], effectively preventing reuse of OpenCL kernels written for GPU).

Vivado HLS has a stronger coupling between the HLS source code and the generated hardware. This requires the programmer to write more annotations and boilerplate code, but can also give them a stronger feeling of control. Conversely, the Intel OpenCL compiler presents convenient abstracted views, saves boilerplate code (e.g., by automatically pipelining sections), and implements efficient substitutions by detecting common patterns in the source code (e.g., to automatically perform memory extraction [§4.1]). The downside is that developers end up struggling to write or generate code in a way that is recognized by the tool's "black magic", in order to achieve the desired result. Finally, Xilinx' choice to allow C++ gives Vivado HLS an edge in expressibility, as (non-virtual) objects and templating turn out to be a useful tool for abstracting and extending the language [48]. Intel offers a C++-based HLS compiler, but does not (as of writing) support direct interoperability with the OpenCL-driven accelerator flow.

9 CONCLUSION

The transformations known from software are insufficient to optimize HPC kernels targeting spatial computing systems. We have proposed a new set of optimizing transformations that enable efficient and scalable hardware architectures, and can be applied directly to the source code by a performance engineer, or automatically by an optimizing compiler. Performance and compiler engineers can benefit from these guidelines, transformations, and the presented cheat sheet as a common toolbox for developing high performance hardware using HLS.

ACKNOWLEDGEMENTS

This work was supported by the European Research Council under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880). The authors wish to thank Xilinx and Intel for helpful discussions; Xilinx for generous donations of software, hardware, and access to the Xilinx Adaptive Compute Cluster (XACC) at ETH Zurich; the Swiss National Supercomputing Center (CSCS) for providing computing infrastructure; and Tal Ben-Nun for valuable feedback on iterations of this manuscript.

REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," SIGARCH, 1995.
[2] M. Horowitz, "Computing's energy problem (and what we can do about it)," in ISSCC, 2014.
[3] D. D. Gajski et al., "A second opinion on data flow machines and languages," Computer, 1982.
[4] S. Sirowy and A. Forin, "Where's the beef? why FPGAs are so fast," MS Research, 2008.
[5] A. R. Brodtkorb et al., "State-of-the-art in heterogeneous computing," Scientific Programming, 2010.
[6] D. B. Thomas et al., "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," in FPGA, 2009.
[7] D. Bacon et al., "FPGA programming for the masses," CACM, 2013.
[8] G. Martin and G. Smith, "High-level synthesis: Past, present, and future," D&T, 2009.
[9] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment," TCAD, 2011.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools," TCAD, 2016.
[11] W. Meeus et al., "An overview of today's high-level synthesis tools," DAEM, 2012.
[12] Z. Zhang et al., "AutoPilot: A platform-based ESL synthesis system," in High-Level Synthesis, 2008.
[13] Intel High-Level Synthesis (HLS) Compiler. https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html. Accessed May 15, 2020.
[14] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA, 2011.
[15] Mentor Graphics. Catapult high-level synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls. Accessed May 15, 2020.
[16] C. Pilato et al., "Bambu: A modular framework for the high level synthesis of memory-intensive applications," in FPL, 2013.
[17] R. Nane et al., "DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler," in FPL, 2012.
[18] M. Owaida et al., "Synthesis of platform architectures from OpenCL programs," in FCCM, 2011.
[19] T. Czajkowski et al., "From OpenCL to high-performance hardware on FPGAs," in FPL, 2012.
[20] R. Nikhil, "Bluespec System Verilog: efficient, correct RTL from high level specifications," in MEMOCODE, 2004.
[21] J. Auerbach et al., "Lime: A Java-compatible and synthesizable language for heterogeneous architectures," in OOPSLA, 2010.
[22] ——, "A compiler and runtime for heterogeneous computing," in DAC, 2012.
[23] J. Hammarberg and S. Nadjm-Tehrani, "Development of safety-critical reconfigurable hardware with Esterel," FMICS, 2003.
[24] M. B. Gokhale et al., "Stream-oriented FPGA computing in the Streams-C high level language," in FCCM, 2000.
[25] D. F. Bacon et al., "Compiler transformations for high-performance computing," CSUR, 1994.
[26] S. Ryoo et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in PPoPP, 2008.
[27] G. D. Smith, Numerical solution of partial differential equations: finite difference methods, 1985.
[28] A. Taflove and S. C. Hagness, "Computational electrodynamics: The finite-difference time-domain method," 1995.
[29] C. A. Fletcher, Computational Techniques for Fluid Dynamics 2, 1988.
[30] J. Hegarty et al., "Darkroom: compiling high-level image processing code into hardware pipelines," TOG, 2014.
[31] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in PLDI, 2013.
[32] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," CSUR, 2019.
[33] G. Lacey et al., "Deep learning on FPGAs: Past, present, and future," arXiv:1602.04283, 2016.
[34] M. Courbariaux et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv:1602.02830, 2016.
[35] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," in FPGA, 2017.
[36] M. Blott et al., "FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks," TRETS, 2018.
[37] H. Fu and R. G. Clapp, "Eliminating the memory bottleneck: An FPGA-based solution for 3D reverse time migration," in FPGA, 2011.
[38] H. R. Zohouri et al., "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in FPGA, 2018.
[39] H. M. Waidyasooriya et al., "OpenCL-based FPGA-platform for stencil computation and its optimization methodology," TPDS, May 2017.
[40] Q. Jia and H. Zhou, "Tuning stencil codes in OpenCL for FPGAs," in ICCD, 2016.
[41] X. Niu et al., "Exploiting run-time reconfiguration in stencil computation," in FPL, 2012.
[42] ——, "Dynamic stencil: Effective exploitation of run-time resources in reconfigurable clusters," in FPT, 2013.
[43] J. Fowers et al., "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in FPGA, 2012.
[44] J. de Fine Licht et al., "StencilFlow: Mapping large stencil programs to distributed spatial computing systems," in CGO, 2021.
[45] X. Chen et al., "On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs," in FPL, 2019.
[46] T. Young-Schultz et al., "Using OpenCL to enable software-like development of an FPGA-accelerated biophotonic cancer treatment simulator," in FPGA, 2020.
[47] D. J. Kuck et al., "Dependence graphs and compiler optimizations," in POPL, 1981.
[48] J. de Fine Licht and T. Hoefler, "hlslib: Software engineering for hardware design," arXiv:1910.04436, 2019.
[49] S. O. Settle, "High-performance dynamic programming on FPGAs with OpenCL," in HPEC, 2013.
[50] J. de Fine Licht et al., "Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis," in FPGA, 2020.
[51] K. Sano et al., "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," TPDS, 2014.
[52] H. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Sparse Matrix Proceedings, 1978.
[53] W. Luzhou et al., "Domain-specific language and compiler for stencil computation on FPGA-based systolic computational-memory array," in ARC, 2012.
[54] T. Kenter et al., "OpenCL-based FPGA design to accelerate the nodal discontinuous Galerkin method for unstructured meshes," in FCCM, 2018.
[55] T. Grosser et al., "Polly – performing polyhedral optimizations on a low-level intermediate representation," PPL, 2012.
[56] U. Sinha, "Enabling impactful DSP designs on FPGAs with hardened floating-point implementation," Altera White Paper, 2014.
[57] J. R. Allen and K. Kennedy, "Automatic loop interchange," in SIGPLAN, 1984.
[58] M. Weiss, "Strip mining on SIMD architectures," in ICS, 1991.
[59] M. D. Lam et al., "The cache performance and optimizations of blocked algorithms," 1991.
[60] C. D. Polychronopoulos, "Advanced loop optimizations for parallel computers," in ICS, 1988.
[61] D. J. Kuck, "A survey of parallel machine organization and programming," CSUR, Mar. 1977.
[62] A. P. Yershov, "ALPHA – an automatic programming system of high efficiency," J. ACM, 1966.
[63] M. J. Wolfe, "Optimizing supercompilers for supercomputers," Ph.D. dissertation, 1982.
[64] J. J. Dongarra and A. R. Hinds, "Unrolling loops in Fortran," Software: Practice and Experience, 1979.
[65] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in PLDI, 1988.
[66] C. D. Polychronopoulos, "Loop coalescing: A compiler transformation for parallel machines," Tech. Rep., 1987.
[67] F. E. Allen and J. Cocke, A catalogue of optimizing transformations, 1971.
[68] L. Josipović et al., "Dynamically scheduled high-level synthesis," in FPGA, 2018.
[69] J. Cocke and K. Kennedy, "An algorithm for reduction of operator strength," CACM, 1977.
[70] R. Bernstein, "Multiplication by integer constants," Softw. Pract. Exper., 1986.
[71] G. L. Steele, "Arithmetic shifting considered harmful," ACM SIGPLAN Notices, 1977.
[72] A. V. Aho et al., "Compilers, principles, techniques," Addison Wesley, 1986.
[73] T. De Matteis et al., "Streaming message interface: High-performance distributed memory programming on reconfigurable hardware," in SC, 2019.
[74] D. Weller et al., "Energy efficient scientific computing on FPGAs using OpenCL," in FPGA, 2017.
[75] A. Verma et al., "Accelerating workloads on FPGAs via OpenCL: A case study with OpenDwarfs," Tech. Rep., 2016.
[76] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
[77] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in FPGA, 2017.
[78] E. H. D'Hollander, "High-level synthesis optimization for blocked floating-point matrix multiplication," SIGARCH, 2017.
[79] P. Gorlani et al., "OpenCL implementation of Cannon's matrix multiplication algorithm on Intel Stratix 10 FPGAs," in ICFPT, 2019.
[80] M. Besta et al., "Graph processing on FPGAs: Taxonomy, survey, challenges," arXiv:1903.06697, 2019.
[81] ——, "Substream-centric maximum matchings on FPGA," in FPGA, 2019.
[82] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines," in FCCM, 2019.
[83] E. Rucci et al., "Smith-Waterman protein search with OpenCL on an FPGA," in Trustcom/BigDataSE/ISPA, 2015.
[84] H. R. Zohouri et al., "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in SC, 2016.
[85] T. Ben-Nun et al., "Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures," in SC, 2019.
[86] R. Kastner et al., "Parallel programming for FPGAs," arXiv:1805.03648, 2018.
[87] J. S. da Silva et al., "Module-per-Object: a human-driven methodology for C++-based high-level synthesis design," in FCCM, 2019.
[88] T. Lloyd et al., "A case for better integration of host and target compilation when using OpenCL for FPGAs," in FSP, 2017.
[89] Y. Y. Leow et al., "Generating hardware from OpenMP programs," in FPT, 2006.
[90] S. Lee et al., "OpenACC to FPGA: A framework for directive-based high-performance reconfigurable computing," in IPDPS, 2016.
[91] A. Papakonstantinou et al., "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in SASP, 2009.
[92] S. Gupta et al., "SPARK: a high-level synthesis framework for applying parallelizing compiler transformations," in VLSID, 2003.
[93] ——, "Coordinated parallelizing compiler optimizations and high-level synthesis," TODAES, 2004.
[94] J. Cong et al., "Source-to-source optimization for HLS," in FPGAs for Software Programmers, 2016.
[95] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing," in FPGA, 2013.
[96] J. Pu et al., "Programming heterogeneous systems from an image processing DSL," TACO, 2017.
[97] J. Li et al., "HeteroHalide: From image processing DSL to efficient FPGA acceleration," in FPGA, 2020.
[98] D. Koeplinger et al., "Automatic generation of efficient accelerators for reconfigurable hardware," in ISCA, 2016.
[99] E. D. Sozzo et al., "A common backend for hardware acceleration on FPGA," in ICCD, 2017.
[100] J. Bachrach et al., "Chisel: constructing hardware in a Scala embedded language," in DAC, 2012.
[101] A. Izraelevitz et al., "Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations," in ICCAD, 2017.
[102] Maxeler Technologies, "Programming MPC systems (white paper)," 2013.
[103] Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide, UG-OCL003, revision 2020.04.1. Accessed May 15, 2020.

Johannes de Fine Licht is a PhD student at ETH Zurich. His research topics revolve around spatial computing systems in HPC, and include programming models, applications, libraries, and enhancing programmer productivity.

Maciej Besta is a PhD student at ETH Zurich. His research focuses on understanding and accelerating large-scale irregular graph processing in any type of setting and workload.

Simon Meierhans is studying for his MSc degree at ETH Zurich. His interests include randomized and deterministic algorithm and data structure design.

Torsten Hoefler is a professor at ETH Zurich, where he leads the Scalable Parallel Computing Lab. His research aims at understanding performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms.
Understanding the Potential of FPGA-Based Spatial
Acceleration for Large Language Model Inference
HONGZHENG CHEN, Cornell University, USA
JIAHAO ZHANG∗ , Tsinghua University, China
YIXIAO DU, SHAOJIE XIANG, and ZICHAO YUE, Cornell University, USA
NIANSONG ZHANG, YAOHUI CAI, and ZHIRU ZHANG, Cornell University, USA
Recent advancements in large language models (LLMs) boasting billions of parameters have generated
a significant demand for efficient deployment in inference workloads. While hardware accelerators for
Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal
architectures that reuse hardware units for different network layers and operators. However, these methods
often encounter challenges in achieving low latency due to considerable memory access overhead.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference
on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers,
facilitating direct communication between them through a dataflow architecture while minimizing off-chip
memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial
LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA.
This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can
identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine
the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.
To enable more productive implementations of an LLM model on FPGAs, we further provide a library of
high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as
open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented
BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach
can achieve up to 13.4× speedup when compared to previous FPGA-based accelerators for the BERT model.
For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill
stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA
A100 GPU in the decode stage.
CCS Concepts: • Hardware → Hardware-software codesign; • Computing methodologies → Neural
networks.
Additional Key Words and Phrases: FPGA, high-level synthesis, large language models, hardware acceleration
1 Introduction
The rapid advancement of Transformer-based large language models (LLMs) [5, 74] has sparked a
revolution across a wide range of natural language processing tasks, such as conversational AI [13,
54, 104] and code generation [10, 42, 52]. Recent research has brought to light the phenomenon of
“emergence” in LLMs, where advanced capabilities become evident as the models scale up to billions
of parameters [77, 78]. However, supporting this unprecedented scale poses significant challenges,
particularly in terms of computational and memory resources. At the same time, the increasing use
of LLMs in interactive applications like voice assistants and autonomous systems requires hardware
accelerators capable of providing both low latency and high energy efficiency [17, 54, 62].
Recent efforts have primarily focused on improving the performance of LLM inference on
GPUs [2, 53], although GPUs are known for their high power consumption and are less suitable
for latency-sensitive workloads [32, 62]. There is also an active body of research dedicated to
developing specialized hardware accelerators tailored for Transformer models, with several of these
efforts using FPGAs as the target platforms [23, 26, 39, 46, 59, 63].
∗ Work was done when interning at Cornell.
Fig. 1. Temporal and spatial architectures — PE stands for processing engine; f1-f4 represent different operators in the model.
FPGA-based LLM accelerators can be broadly categorized into two architectural paradigms:
temporal architecture and spatial architecture. In a temporal architecture, a processing engine
(PE) capable of performing various tasks is constructed and reused across different layers and
models, as shown in Figure 1(a). For flexibility, these accelerators typically employ an overlay
approach [23, 28, 39], where a virtual hardware architecture that executes instructions is “laid” on
top of the physical FPGA fabric. Overlays provide a more restricted configuration space, allowing
for quicker compilation with bitstream reuse across multiple models. However, the use of such
temporal architecture requires more frequent off-chip memory access, as intermediate results must
be written back to memory. This incurs a cost in terms of both latency and energy consumption
that is significantly higher than direct on-chip memory access. Additionally, one could argue that
an FPGA overlay will inherently be less efficient than its hardened ASIC counterpart.
In contrast, an FPGA-based spatial architecture typically involves the specialization of distinct
PEs for specific operators or layers, facilitating direct communication between them using streaming
buffers (e.g., FIFOs or multi-buffers) [60, 72, 75, 80], as depicted in Figure 1(b-c). This dataflow-style
execution substantially reduces off-chip memory accesses and enables the concurrent processing of
multiple PEs in a pipelined manner. Moreover, the fine-grained programmability of FPGAs allows
efficient support of model-specific spatial architectures, which can further leverage efficiency
optimizations such as low-bitwidth quantization, custom numerical types, and sparsity [58, 69, 93,
102]. These capabilities can potentially enable highly efficient LLM inference implementations that
surpass GPUs, especially in small-batch low-latency scenarios.
However, implementing a spatial architecture for LLM inference presents significant challenges.
Challenge 1: Navigating diverse parallelism in LLMs. The generative inference process of
LLMs typically consists of two distinct stages: (1) simultaneously processing user prompts and
(2) sequentially generating new tokens in an autoregressive manner. These two stages exhibit
significantly different computational and memory characteristics (detailed in §3), making it nec-
essary to tailor hardware accelerators for their specific needs. This challenge cannot be directly
addressed by leveraging techniques from the traditional convolutional neural network (CNN)
designs [32, 97]. The large number of parameters and intermediate tensors further complicates the
choice between on-chip and off-chip storage. Additionally, harnessing multiple accelerators for
distributed LLM inference adds complexity, particularly when dealing with intricate parallelization
schemes [23, 49, 68].
Challenge 2: Lack of standard LLM building blocks in hardware accelerators. The rapid
evolution of LLM architectures [5, 54, 70] contrasts with the comparatively slow pace of hardware
development. While a plethora of building blocks for Transformers have been proposed in the
software domain [14, 19, 38], the absence of reusable blocks for hardware accelerator design
hampers development progress. Many frameworks have been designed to automatically map deep
learning models to FPGAs [3, 20, 72, 98, 99], but they are constrained to small CNN designs and lack
support for complicated Transformer models. It is also hard to scale their designs to accommodate
large models and multi-die FPGAs.
To tackle these challenges, this paper provides a comprehensive set of hardware design considerations for LLMs and tries to answer the following question: What role can FPGA-based
spatial accelerators play in enabling efficient LLM inference? We start by conducting an in-depth
analysis of the computational and memory requirements associated with each operator within
Transformer models across two distinct stages of LLM generative inference – prefill and decode.
Subsequently, we extend our analysis to reveal the potential benefits of distributed inference using
multiple FPGAs. We believe that providing such an analysis, rather than presenting only
positive results in selectively chosen settings for an FPGA LLM accelerator, offers more
valuable insights to the community. To validate the feasibility of our analytical framework,
we implement a specific design point and demonstrate its viability. Leveraging this analytical
framework, we employ specific optimizations in HLS to craft each kernel and compose them into a
hardware accelerator that achieves the expected performance. While our primary focus is not to
propose a new LLM accelerator architecture, we demonstrate that by using the analytical model,
we can create a high-performance design that surpasses previous efforts. Our major contributions
are as follows:
• We introduce an analytical framework that presents the first in-depth analysis of both the
advantages and limitations of FPGA-based LLM spatial acceleration. This framework not
only allows us to estimate the performance of a specific accelerator configuration on a given
FPGA device but also provides guidance for designing accelerators for LLM inference.
• We create a suite of modular and reusable HLS kernels designed for building FPGA-based
spatial accelerators for different Transformer models. We plan to open-source this kernel
library1 and expect it to serve as a valuable resource for benchmarking HLS and FPGA
acceleration more broadly.
• Leveraging our kernel library, we design and implement a range of high-performance
FPGA-based LLM accelerators that achieve speedups comparable to previous GPU and
FPGA-based accelerators. Specifically, for the BERT model, we achieve a 13.4× speedup
over prior FPGA-based accelerators. For GPT generative inference, we achieve speedups
of 2.2× and 1.1× in prefill and decode stages respectively, when compared to DFX, an
FPGA-based overlay architecture. Additionally, our accelerator is 1.9× faster and 5.7× more
energy-efficient than the A100 GPU in the decode stage.
2 Background
This section provides backgrounds on Transformer models and introduces parallelization schemes
for LLM inference.
[Figure: prefill and decode passes through N Transformer layers (Linear, Softmax, LayerNorm & Add, FFN with GELU), producing the 1st and the 2nd output token, respectively.]
Fig. 2. Transformer model. Red blocks represent linear operators, and blue blocks signify non-linear operators.
generation [54, 65, 70]. We will mainly discuss decoder-only models in this paper, but since encoders
and decoders share the core building blocks with subtle architectural variances, our approach can
also be extended for encoder-only models [16, 36, 45].
As illustrated in Figure 2, generative inference of LLMs has two stages: prefill stage and decode
stage [62]. In the prefill stage, the model takes in user prompts, normally a long sequence with 𝑙 input
tokens, goes through the whole Transformer model, and generates the first token. In the decode
stage, the model takes in the previously generated token and generates 𝑙 gen new tokens one at a
time in an auto-regressive way. Since each token depends on the previously generated tokens, the
decode stage is purely sequential.
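To make the two stages concrete, the following minimal Python sketch shows a prefill pass over the whole prompt followed by strictly sequential decode steps. The model callable and the greedy sampling here are illustrative placeholders, not part of our implementation.

```python
# Minimal sketch of two-stage generative inference (illustrative only).
# `model` is a placeholder for a decoder-only Transformer that returns
# per-position logits and an updated KV cache; greedy sampling keeps
# the example short.
def generate(model, prompt_tokens, l_gen):
    # Prefill: process all l prompt tokens at once and emit the first token.
    logits, kv_cache = model(prompt_tokens, kv_cache=None)
    next_token = int(logits[-1].argmax())
    output = [next_token]
    # Decode: emit the remaining tokens one at a time, reusing the KV cache.
    for _ in range(l_gen - 1):
        logits, kv_cache = model([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output
```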
We then go through the detailed model architecture. The input tokens are first passed into an
embedding layer that maps the discrete tokens into high-dimensional continuous representations
while incorporating positional encoding for each token. Subsequently, it generates a tensor (i.e.,
hidden states) of shape (𝑙, 𝑑), where 𝑙 represents sequence length, and 𝑑 is the size of hidden
dimensions. We omit the batch dimension to simplify the analysis, focusing solely on single-batch
inference in this paper, but our approach can be easily extended to different batch sizes for LLM
serving by adding an additional batch dimension [34, 43].
The hidden states then pass through a series of 𝑁 Transformer blocks. Each Transformer block
consists of two sublayers: a multi-head attention (MHA) module and a feed-forward network (FFN).
Residual connections and layer normalization (LayerNorm) functions are applied between these
sublayers, although the specific order and application may vary across different models [91]. The
MHA module plays a crucial role in capturing token relationships within the input sequence. The
input is initially partitioned into ℎ segments, where ℎ corresponds to the number of attention heads.
To compute the attention scores for each head, the input sequence of length 𝑙 undergoes three
[Figure: a Transformer layer split across TP rank #1 and TP rank #2, with two all_reduce operations and SM, LN, and GL kernels on each device.]
Fig. 3. An example of tensor parallelism of a Transformer layer with two devices. TP rank is the unique identifier given to a device within a TP group. SM is the softmax function, LN is LayerNorm, and GL is the GeLU function.
linear projections: query, key, and value. These projections, which are trainable, yield matrices
𝑄, 𝐾, and 𝑉 respectively. Attention scores are then computed using a scaled dot-product (SDP)
operator between 𝑄, 𝐾, and 𝑉 , as specified by the formula:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \quad (1) \]
where 𝑑𝑘 is the size of the hidden dimension. This operator is computed independently for each of the ℎ heads, and the resulting ℎ outputs are subsequently concatenated and processed through an additional linear projection. In the prefill stage, the generated 𝐾 and 𝑉 tensors will be stored as the KV cache and later be concatenated before SDP during the decode stage [62].
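As a functional reference for Equation (1) and the KV cache, the NumPy sketch below computes scaled dot-product attention for a single head and shows how a decode step appends the new key/value rows to the cache. It is an unbatched illustration under simplified assumptions (no causal mask, no multi-head splitting) and does not reflect our HLS kernels.

```python
import numpy as np

def sdp_attention(Q, K, V):
    # Scaled dot-product attention, Equation (1), for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (l_q, l_kv)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V                              # (l_q, d_k)

def decode_step(q_new, k_new, v_new, kv_cache):
    # Append the new token's K/V rows to the cached K/V from earlier steps,
    # then attend from the single new query row over the full history.
    kv_cache["K"] = np.vstack([kv_cache["K"], k_new])
    kv_cache["V"] = np.vstack([kv_cache["V"], v_new])
    return sdp_attention(q_new, kv_cache["K"], kv_cache["V"]), kv_cache
```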
The FFN module comprises a linear layer followed by a non-linear activation function and
another linear layer. This module transforms the outputs of MHA into embedding matrices, which
are then further processed by subsequent Transformer layers.
Finally, the output tensor will go through a softmax function to obtain a distribution. The model
will sample a token from this distribution and feed it into the decode stage. For encoder-only
models, there is only a prefill stage involved, and the distribution will be directly used for different
downstream tasks like text classification [16, 36, 45].
In this paper, we only focus on analyzing the core Transformer blocks and accelerating them on
FPGAs. Embedding layers and output sampling [8, 25] require extensive random memory accesses, which may not be suitable for FPGA acceleration. Also, they account for only a small fraction of the overall compute and do not noticeably affect the overall latency [32], so we leave them to execute on CPUs or GPUs as usual.
Table 1. MACs of the prefill and decode stages of the linear layers in the Transformer model in Figure 2
— 𝑙 denotes input sequence length, 𝑑 denotes input feature dimension size, and 𝑑 FFN denotes FFN hidden
dimension size.
operations to ensure model correctness. Megatron-LM [68] is the first to explore tensor parallelism
for Transformer-based models, proving to be efficient in both training and inference due to relatively
low communication costs. As shown in Figure 3, tensor parallelism requires two all_reduce
operations inside a Transformer layer to ensure the results are correct. Our accelerator design also
explores tensor parallelism, as detailed in §3.4.2.
Lastly, pipeline parallelism [49, 50, 92] divides the model across network layers. Multiple layers
are grouped into a pipeline stage, and different stages are assigned to different devices. Pipeline
parallelism is typically employed across multiple nodes. Since both tensor parallelism and pipeline
parallelism handle only portions of the network model, they are collectively referred to as model
parallelism. We revisit these parallelization schemes in §3.4.2.
3.2.1 Compute Resource Constraints. The core computational element for linear operators is the
MAC unit. Let 𝑀𝑖 denote the compute power, in terms of the number of MACs per cycle allocated
to each matrix multiplication kernel, where 𝑖 ranges over 𝑞, 𝑘, 𝑣, 𝑎 1 , 𝑎 2 , 𝑝, 𝑓1 , and 𝑓2 , based on
the notation in Table 1. We quantize the matrix multiplication to integer inputs for maximum
efficiency, which has been proven to be effective by many recent studies [15, 30, 67, 81]. Quantization
enables single-cycle accumulation. As a result, one multiply-accumulator (MAC) unit can provide
a 1 MAC/cycle throughput with a properly pipelined multiplier. Therefore, the latency for the 𝑄
projection can be calculated as 𝑙𝑑 2 /𝑀𝑞 cycles, considering that the total number of MACs computed
in this operator is 𝑙𝑑 2 .
Suppose we want to deploy 𝐶 Transformer model layers on an FPGA. The total MAC units
must not exceed the capacity of the device. Since we employ a dataflow design that unfolds all the
layers on-board, the required MAC units are simply the sum of the MAC units for each layer. This
requirement can be expressed as:
\[ \sum_{i} M_i \, C < M_{\mathrm{tot}}, \quad i \in \{q, k, v, a_1, a_2, p, f_1, f_2\}, \quad (2) \]
where 𝑀tot represents the total available compute power of an FPGA in terms of MACs per cycle,
which can be obtained from the official data sheets. For FPGAs with specialized compute blocks
(e.g., AI Engine [86] and AI Tensor Blocks [37]), we can convert their compute power to match the frequency of the programmable logic, thus obtaining an overall value 𝑀tot for the entire FPGA. For example, the VCK5000 FPGA [86] has 400 AI Engines, each of which can compute 128 MACs/cycle at 1 GHz. Therefore, the equivalent compute power at 250 MHz is 128 × 400 × 1 GHz/250 MHz, which is 204800 MACs/cycle.
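A small helper that performs this normalization (illustrative, not part of our toolflow) reproduces the VCK5000 number from the text:

```python
def equivalent_macs_per_cycle(num_blocks, macs_per_cycle, block_freq_hz, fabric_freq_hz):
    # Normalize the compute power of specialized blocks (e.g., AI Engines)
    # to the clock frequency of the programmable logic.
    return num_blocks * macs_per_cycle * block_freq_hz / fabric_freq_hz

# VCK5000 example: 400 AI Engines, 128 MACs/cycle each at 1 GHz,
# referenced to a 250 MHz fabric clock.
M_tot = equivalent_macs_per_cycle(400, 128, 1e9, 250e6)   # 204800.0 MACs/cycle
```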
3.2.2 Memory Capacity Constraints. The demand for memory capacity stems from a variety of
on-chip buffers, including weight buffers for parameters, buffers for 𝐾 and 𝑉 matrices, and FIFOs
interconnecting different stages.
Parameter buffers. To optimize an FPGA-based dataflow design, we assume that all the quantized parameters can be accommodated in on-chip or off-chip memory. Suppose all the linear weights are quantized to 𝑏𝑊 bits, and the size of the linear operator 𝑖 is 𝑠𝑖 . The total size of the buffers is \( S_{\mathrm{param}} = \sum_{i \in \{q,k,v,p,f_1,f_2\}} s_i b_W = (4d^2 + 2d\, d_{\mathrm{FFN}}) b_W \) if storing on-chip. If the parameters are too large to fit in on-chip memory, we can store them in DRAM and tile the parameters with size 𝑀𝑖 on-chip; the total tiled buffer size is then \( S_{\mathrm{tile}} = \sum_{i \in \{q,k,v,p,f_1,f_2\}} M_i b_W \). To hide the memory access latency, we need to double buffer those parameters, so the final buffer size of the 𝑖-th linear operator is \( 2 S_{\mathrm{tile}} \).
KV Cache. When conducting matrix multiplication, at least one of the input matrices must be accessed repeatedly, so a buffer is required. Given that the parameters are already buffered, only the SDP requires buffering for at least one of its input matrices. In our case, we choose to buffer 𝐾 and 𝑉 , which will later be passed to the decode stage as the KV cache. We also double buffer the 𝐾 and 𝑉 matrices to improve throughput. The final buffer size is \( S_{\mathrm{KV}} = 4 l_{\max} d\, b_A \), where 𝑏𝐴 is the bitwidth of the activation and 𝑙max is the maximum sequence length supported by the model. Notice the KV cache can also be tiled on-chip, which admits a similar analysis to the one above.
FIFOs. The intermediate results between linear operators flow through FIFOs, since the linear operators access them sequentially. For the initial residual connection, we assume that the input tensors are
fetched from off-chip memory to obviate the need for additional buffering. However, for the second
residual connection related to the FFN, it is necessary to use an intermediate buffer to store the
projection’s activation 𝑋 act before the FFN. This buffer simultaneously serves as a bypass path. To
avoid deadlock, the buffer must possess sufficient capacity to store 𝑋 act . We simply create a FIFO
of size 𝑙𝑑𝑏𝐴 to store it. For other FIFO connections, we assume a FIFO depth of 𝑠 and one FIFO
connecting each layer in Figure 2, so the total FIFO size is equal to 𝑆 FIFO = 16𝑠𝑏𝐴 + 𝑙𝑑𝑏𝐴 .
In summary, the memory capacity constraint is expressed as:
\[ S_{\mathrm{param}}\, C < DRAM_{\mathrm{tot}}, \qquad \sum_{i} S_i\, C < SRAM_{\mathrm{tot}}, \quad i \in \{\mathrm{tile}, \mathrm{KV}, \mathrm{FIFO}\}, \quad (3) \]
if the parameters are stored off-chip. 𝐷𝑅𝐴𝑀tot and 𝑆𝑅𝐴𝑀tot are the total available off-chip and
on-chip memory.
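The capacity terms above can be tallied with a few lines of Python; the configuration below is an illustrative GPT2-like setting with W4A8 quantization and an assumed FIFO depth, not a measured design point.

```python
def memory_capacity_bits(d, d_ffn, l, l_max, b_w, b_a, fifo_depth):
    # S_param: on-chip weight buffers for the q, k, v, p, f1, f2 operators.
    s_param = (4 * d * d + 2 * d * d_ffn) * b_w
    # S_KV: double-buffered K and V caches for the maximum sequence length.
    s_kv = 4 * l_max * d * b_a
    # S_FIFO: 16 inter-stage FIFOs of depth `fifo_depth` plus the bypass buffer.
    s_fifo = 16 * fifo_depth * b_a + l * d * b_a
    return s_param, s_kv, s_fifo

s_param, s_kv, s_fifo = memory_capacity_bits(
    d=768, d_ffn=3072, l=128, l_max=1024, b_w=4, b_a=8, fifo_depth=64)
print(s_param / 8e6, s_kv / 8e6, s_fifo / 8e6)   # per-layer sizes in MB
```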
3.2.3 Memory Port Constraints. Besides memory capacity, we also need to consider constraints on memory ports in a highly parallel design. For matrix multiplication, if different MAC units work in parallel, they will access the weight/result buffers simultaneously, hence contending for memory ports. This issue can be addressed either by partitioning the buffer, which effectively offers more memory ports, or by packing data into wider elements, which reduces the number of memory ports required.
SRAM resources. The on-chip SRAM resources of FPGAs are typically organized as blocks.
Each block has a fixed capacity and may support configurable bitwidth. For example, on AMD
UltraScale+ FPGAs, there are two types of SRAM resources: Block RAM (BRAM) and Ultra RAM
(URAM). A BRAM block can be configured as one 36 Kb block or two 18 Kb blocks, with two read and write ports each. URAM blocks are 288 Kb with one read and one write port. The port width of the
BRAM block is flexible; it can be configured to 1, 2, 4, 9, 18, 36, or 72 (in 36 Kb mode) bits, while the
port width of the URAM block is fixed at 72 bits. Similar to BRAM and URAM, Intel FPGAs have
M20K and eSRAM with different configurable port widths.
Memory blocks needed without data packing. To begin with, we analyze the port constraints
without data packing. In this case, to eliminate the port contention, different MAC units may
need different memory ports. Consider the linear operator 𝑖 with the size of 𝑠𝑖 with 𝑀𝑖 MAC units
working in parallel, each loaded weight may feed multiple MAC units due to intrinsic data reuse
in GEMM. We use 𝑟𝑖 to represent the data reuse factor (number of MAC units sharing the loaded
weight). Therefore, the weight buffer needs to be partitioned into 𝑀𝑖 /𝑟𝑖 parts. If we store all the
weight buffers on-chip, then the number of 𝑏𝑊 -bit elements in each partition is 𝑠𝑖 /(𝑀𝑖 /𝑟𝑖 ). However,
𝑏𝑊 may not fully occupy one memory word as the memory bitwidth can only take limited options.
We introduce the effective bit width, 𝑏 𝐵𝑅𝐴𝑀 , to be the smallest memory bitwidth larger than 𝑏𝑊 .
Let 𝑆 𝐵𝑅𝐴𝑀 be the total capacity (in bits) of one memory block, we can deduce the total number of
memory blocks for one linear operator:
\[ R_i = \left\lceil \frac{s_i\, b_{BRAM}}{M_i/r_i \times S_{BRAM}} \right\rceil \times M_i/r_i. \quad (4) \]
If the parameters are loaded from off-chip memory and we only store a tile of the weights on-chip, then 𝑠𝑖 is simply 𝑀𝑖 , and 𝑅𝑖 also becomes 𝑀𝑖 as 𝑏𝐵𝑅𝐴𝑀 ≪ 𝑆𝐵𝑅𝐴𝑀 . Since we need to double buffer those parameters, the final buffer size of the 𝑖-th linear operator is 2𝑀𝑖 . Notice the 𝑘 and 𝑣 layers need to be double-buffered, so the required BRAM also doubles in these two layers. We can obtain the overall memory port constraint by summing 𝑅𝑖 over all linear operators, which must not exceed the number of memory blocks available on the device:
\[ \sum_{i} R_i\, C < R_{\mathrm{tot}}. \quad (5) \]
Memory blocks needed with data packing. Data packing can alleviate memory port contention by consolidating multiple narrow data elements into a single, wider element. This allows multiple MAC units to access data from the same memory port. We consider packing the linear weights into 𝑏pack bits, with 𝑏pack = 𝑘𝑏𝑊 . Again, we denote 𝑏𝐵𝑅𝐴𝑀 as the smallest memory bitwidth larger than 𝑏pack . The buffer feeding 𝑀𝑖/𝑟𝑖 MAC units then only needs to be partitioned into 𝑀𝑖/𝑟𝑖/𝑘 parts, and each partition holds ⌈𝑠𝑖/𝑘 × 𝑏𝐵𝑅𝐴𝑀/(𝑀𝑖/𝑟𝑖/𝑘)⌉ bits. Therefore, the total number of memory
blocks needed is:
\[ R_i = \left\lceil \frac{s_i\, b_{BRAM}}{M_i/r_i \times S_{BRAM}} \right\rceil \times \frac{M_i/r_i}{k}. \quad (6) \]
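Equations (4) and (6) can be folded into one routine (𝑘 = 1 recovers the unpacked case). The tile size, MAC count, and reuse factor below are assumptions chosen only to illustrate the effect of packing on 36 Kb BRAMs.

```python
import math

def memory_blocks(s_i, M_i, r_i, b_bram, S_bram, k=1):
    # Number of memory blocks for one linear operator, Equations (4) and (6).
    # b_bram is the effective port width (>= k * b_W); k = 1 means no packing.
    partitions = (M_i // r_i) // k
    blocks_per_partition = math.ceil(s_i * b_bram / ((M_i // r_i) * S_bram))
    return blocks_per_partition * partitions

# Illustrative tile of 768 x 64 int4 weights, 72 MACs with reuse factor 4.
# Without packing the port width is 4 bits; packing 9 int4 weights gives 36 bits.
no_pack = memory_blocks(768 * 64, M_i=72, r_i=4, b_bram=4, S_bram=36 * 1024, k=1)
packed  = memory_blocks(768 * 64, M_i=72, r_i=4, b_bram=36, S_bram=36 * 1024, k=9)
print(no_pack, packed)   # 18 vs. 6 blocks in this setting
```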
3.2.4 Memory Bandwidth Constraints. If the parameters are stored off-chip, we need to consider the impact of off-chip memory bandwidth. Similar to §3.2.3, we use 𝑟𝑖 to denote the data reuse factor of a linear operator with 𝑀𝑖 MAC units. Effectively, 𝑀𝑖/𝑟𝑖 weights must be loaded from off-chip memory per cycle to feed the MAC units, requiring a bandwidth of
\[ B_i = b_W \times M_i/r_i \times freq, \quad (7) \]
where 𝑓𝑟𝑒𝑞 is the achieved frequency of the FPGA. If the total required bandwidth, \( \sum_i C B_i \) (𝑖 ∈ {𝑞, 𝑘, 𝑣, 𝑝, 𝑓1, 𝑓2}), exceeds the maximum device bandwidth, the inference becomes bandwidth bound.
Notice this bandwidth requirement needs to be analyzed for each operator individually if the data
loading requires accessing multiple DDR or HBM channels.
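A quick sanity check of Equation (7) in Python; the operator count, MAC counts, and reuse factor are assumptions for illustration only.

```python
def operator_bandwidth_gbs(b_w, M_i, r_i, freq_hz):
    # Off-chip weight bandwidth of one linear operator, Equation (7), in GB/s.
    return b_w * (M_i / r_i) * freq_hz / 8 / 1e9

# Illustrative W4 design at 250 MHz: six linear operators with 512 MACs each,
# reuse factor 4, replicated over C = 2 on-chip layers.
per_op = operator_bandwidth_gbs(b_w=4, M_i=512, r_i=4, freq_hz=250e6)  # 16 GB/s
total = 2 * 6 * per_op                                                 # 192 GB/s
```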
[Figure: pipeline diagram over time for Layer 1 and Layer 2, with PEs processing successive input samples.]
Fig. 4. Pipeline diagram. Different colors stand for different input samples. Different blocks stand for different linear operators, which also constitute the pipeline stages. ℎ is the number of attention heads.
3.3.1 Latency Estimation. We construct the pipeline diagram as shown in Figure 4. As mentioned
in §3.2.2, since we need to store the 𝐾 and 𝑉 values after the linear operators, there is an implicit
synchronization point between the 𝑞/𝑘/𝑣 operators and the subsequent SDP and FFN parts. The computation of these two parts cannot be overlapped. Notice the 𝑞/𝑘/𝑣 operators can be performed in parallel since they do not have any dependencies on each other. After 𝑘 and 𝑣 have been fully calculated, the subsequent
computations of SDP and FFN can be greatly overlapped. This is because these operations do not
need to wait for all the results to perform the next operation. The results of the previous operation
can be directly streamed into the next operation as input. Moreover, since different Transformer
layers share the same architecture, their computation can also be overlapped without waiting for
the result of the previous layer.
Suppose the Transformer model has 𝑁 layers in total. Since we have 𝐶 layers on one FPGA, it
needs to iterate 𝑁 /𝐶 times to process the whole model. We can calculate the latency of different
stages, and the overall latency is the maximum latency of these stages (which defines the initiation
interval of the pipeline) times the number of iterations, i.e.,
\[ T_{\mathrm{prefill}} = \frac{1}{freq} \cdot \frac{N}{C} \left( \frac{l d^2}{M_k} + C \max\!\left( \frac{l d^2}{M_k}, \frac{l^2 d}{M_{a_1}}, \frac{l d\, d_{\mathrm{FFN}}}{M_{f_1}}, T_{\mathrm{mem}} \right) \right), \quad (8) \]
\[ T_{\mathrm{decode}} = \frac{1}{freq} \cdot \frac{N}{C} \left( \frac{d^2}{M_k} + C \max\!\left( \frac{d^2}{M_k}, \frac{(l_{\max}+1)\, d}{M_{a_1}}, \frac{d\, d_{\mathrm{FFN}}}{M_{f_1}}, T_{\mathrm{mem}} \right) \right), \quad (9) \]
where the first term inside the parentheses is the latency of the 𝑞/𝑘/𝑣 linear operator (i.e., 𝑡 in
Figure 4). 𝑇mem is the off-chip memory access latency, which can be calculated based on Equation (7).
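The two latency expressions translate directly into code. The sketch below mirrors Equations (8) and (9); all configuration values a caller would pass in are assumptions, and 𝑇mem is expected in cycles.

```python
def prefill_latency_s(N, C, l, d, d_ffn, M_k, M_a1, M_f1, freq_hz, T_mem=0):
    # Equation (8): q/k/v latency plus C times the slowest stage,
    # repeated for the N/C iterations over the model.
    stage = max(l * d * d / M_k, l * l * d / M_a1, l * d * d_ffn / M_f1, T_mem)
    return (N / C) * (l * d * d / M_k + C * stage) / freq_hz

def decode_latency_s(N, C, l_max, d, d_ffn, M_k, M_a1, M_f1, freq_hz, T_mem=0):
    # Equation (9): per-token latency with a single query row.
    stage = max(d * d / M_k, (l_max + 1) * d / M_a1, d * d_ffn / M_f1, T_mem)
    return (N / C) * (d * d / M_k + C * stage) / freq_hz
```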
3.3.2 Work Balancing. As the overall latency is determined by the slowest stage in the dataflow,
we can balance the execution time of each stage; hence we have
\[ \frac{l d^2}{M_{q,k,v,p}} = \frac{l^2 d / h}{M_{a_1,a_2}/h} = \frac{l d\, d_{\mathrm{FFN}}}{M_{f_1,f_2}} \quad (10) \]
\[ \implies M = M_{q,k,v,p} = \frac{d}{l} M_{a_1,a_2} = \frac{d}{d_{\mathrm{FFN}}} M_{f_1,f_2}, \quad (11) \]
where 𝑀 is defined as the global compute power in MACs/cycle. Finally, Equation (8) can be
simplified to
\[ T_{\mathrm{prefill}} = \frac{1}{freq} \cdot N \left( 1 + \frac{1}{C} \right) \frac{l d^2}{M}, \quad (12) \]
which shows the overall latency with work balancing. We can obtain the latency for the decode
stage using a similar analysis.
To derive the optimal 𝑀 for a given model, we devise a linear search algorithm to identify the
maximum available 𝑀 based on the constraints in Equations (2), (3), and (6). Notice the optimal 𝑀
represents an upper bound of the compute power. In practice, we also need to consider the routing
issue to adjust the actual achievable 𝑀 as discussed in §5.2.
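A minimal version of this search is sketched below, assuming the constraint checks of Equations (2), (3), and (6) are wrapped into a single feasibility predicate supplied by the caller; the predicate used in the example is a toy placeholder.

```python
def search_max_M(is_feasible, M_max, step=16):
    # Linear search for the largest global compute power M (MACs/cycle)
    # that satisfies all resource constraints for the target device.
    best = 0
    for M in range(step, M_max + 1, step):
        if is_feasible(M):
            best = M
    return best

# Toy predicate: a DSP cap of 4096 MACs/cycle and a BRAM budget that caps M at 1440.
best_M = search_max_M(lambda M: M <= 4096 and M <= 1440, M_max=4096)  # 1440
```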
3.4.2 Parallelization Schemes. As mentioned in §2.2, we have various parallelization schemes when
considering multiple devices. We first analyze tensor parallelism (TP). As shown in Figure 3, the
parameters of the linear operations are partitioned across different devices. For example, suppose
the weight parameters of the two FFN layers 𝑓1 and 𝑓2 are 𝐴 and 𝐵, then we can partition 𝐴 along
its column and partition 𝐵 along its row, and obtain
\[ \sigma(ZA)B = \sigma\!\left( Z \begin{bmatrix} A_1 & A_2 \end{bmatrix} \right) \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} = \sigma(ZA_1)B_1 + \sigma(ZA_2)B_2, \]
where 𝜎 is the GeLU function. Therefore, apart from partitioning 𝐴 and 𝐵, we need to insert an
all-reduce operation to aggregate the partial results on each device to ensure correctness. The
partitioned parameters will be stored on different devices. For example, 𝐴1 will be on the first FPGA,
and 𝐴2 will be on the second FPGA. A similar partition scheme can be applied for MHA, and we
refer the readers to [68] for more details.
Based on this partition scheme, TP requires two all-reduce operations within one Transformer layer. However, these collective operations are typically implemented in a blocking way. Figure 5(a) shows that the subsequent FFN module needs to wait for the completion of the all-reduce process before it can start its computation [76]. Notice that the all-reduce operation only involves fetching results
from other devices and adding the result to its local tensor. Given that the output of MHA is a
sequential stream, we can perform elementwise addition in a non-blocking manner. As soon as the
kernel receives enough data, it can initiate data transfer to other devices without waiting for the
remaining data to be computed. This leads to substantial synchronization time savings as shown in
Figure 5(b).
[Figure: (a) MHA, all_reduce, FFN, all_reduce executed back to back; (b) the proposed non-blocking all_reduce overlapped with MHA and FFN, saving time.]
Fig. 5. Blocking and non-blocking all-reduce in TP. The latency of different stages is not drawn to scale.
Since the output tensors of MHA and FFN both have size 𝑙𝑑, the communication time for one all-reduce is
\[ T_{\mathrm{comm}} = l d\, b_A / (\alpha B). \quad (13) \]
As we have already implemented dataflow inside a device, pipeline parallelism (PP) essentially
extends the dataflow to 𝑝 2 devices with a tensor of size 𝑙𝑑 communicated in between. Here, we
only split the pipeline between two Transformer layers so the results of the previous device can be
directly streamed to the next device in the same PP group. Notice TP and PP can be combined to
conduct model inference [51], and the latency of Equation (8) becomes
\[ T_{\mathrm{prefill}} = \frac{1}{freq} \cdot \frac{N}{p_2 C} \left( \frac{l d^2}{p_1 M_k} + p_2 C \max\!\left( \frac{l d^2}{p_1 M_k}, \frac{l^2 d}{p_1 M_{a_1}}, \frac{l d\, d_{\mathrm{FFN}}}{p_1 M_{f_1}}, T_{\mathrm{mem}}, T_{\mathrm{comm}} \right) \right), \quad (14) \]
where 𝑝1 and 𝑝2 are the sizes of a TP group and a PP group, respectively [68]. Additionally, the memory requirements of Equations (3) and (5) need to be divided by 𝑝1 to satisfy the constraints of multiple devices.
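Extending the single-device estimate, the sketch below follows Equations (13) and (14) for TP degree 𝑝1 and PP degree 𝑝2. We read 𝛼 as a bandwidth efficiency factor and 𝐵 as the inter-device link bandwidth, and the caller must express the communication term in cycles to place it inside the max; all numbers passed in would be assumptions.

```python
def allreduce_time_s(l, d, b_a, alpha, B):
    # Equation (13): time for one all-reduce of an (l, d) activation tensor.
    # B is the inter-device bandwidth (bits/s), alpha its achievable fraction.
    return l * d * b_a / (alpha * B)

def prefill_latency_tp_pp_s(N, C, l, d, d_ffn, M_k, M_a1, M_f1,
                            freq_hz, p1, p2, T_mem=0, T_comm_cycles=0):
    # Equation (14): TP scales each operator's work by 1/p1; PP spreads the
    # N/C iterations over p2 devices.
    stage = max(l * d * d / (p1 * M_k), l * l * d / (p1 * M_a1),
                l * d * d_ffn / (p1 * M_f1), T_mem, T_comm_cycles)
    return (N / (p2 * C)) * (l * d * d / (p1 * M_k) + p2 * C * stage) / freq_hz
```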
Notice we only discuss two basic parallelism schemes for Transformer models. Some recent
works may partition the sequence dimension and leverage reduce-scatter and all-gather to reduce
the overheads of all-reduce [33, 51]. The communication time can be analyzed similarly, and we do not discuss these schemes here. The optimal parallelism scheme across multiple devices [48, 62, 73, 82, 105] is out of the scope of this paper, and we leave it as future work.
4 Case Study
In this section, we leverage actual hardware configurations to estimate the model performance
using our analytical framework and provide insights for LLM accelerator design.
[Figure: latency (ms) of (a) BERT, (b) GPT2 prefill stage, and (c) GPT2 decode stage on RTX 2080Ti, Tesla A100, and estimates for Alveo U280, Versal VCK5000, Versal VHK158, Stratix 10-NX2100, and Agilex 7-AGM039.]
Fig. 6. Latency estimation of BERT and GPT2 on different FPGAs. GPU results are obtained from actual profiling.
[Figure: two plots of latency (ms) versus # of MACs/cycle 𝑀.]
Fig. 7. Latency estimation of LLaMA2 model. The sequence length is set as 128, and the W4A8 quantization scheme is used in this experiment. GPU results are obtained from actual profiling.
Insight I: Existing FPGAs are inferior in the compute-intensive prefill stage but can
outperform GPUs in the memory-intensive decode stage.
To further investigate what constrains the performance of FPGAs, we conduct an analysis on the
LLaMA2 model by varying different 𝑀 and observing the changes in latency. As shown in Figure 7,
the VCK5000 FPGA exhibits the smallest off-chip memory bandwidth, which leads it to reach a
latency plateau rather quickly. Conversely, the VHK158 FPGA has the largest off-chip memory
bandwidth, so it can achieve the lowest latency in both prefill and decode stages. Moreover, we
include the curve of ideal FPGA performance in Figure 7 to assess the compute power required to
attain A100-level performance. Based on this estimation, we need around 30,000 MACs/cycle in
order to achieve the A100-level performance in the prefill stage, assuming no memory bandwidth
constraints. This is achievable with AI-optimized FPGAs, which can perform a large number of MACs efficiently. In contrast, for the decode stage, once an FPGA has enough memory bandwidth, such as the U280, it can easily reach A100-level performance.
Insight II: The prefill stage requires large compute power 𝑀 to achieve the GPU-level
performance, while the decode stage only requires a small 𝑀.
4.2.2 Quantization Schemes. We then investigate the impact of different quantization schemes and
memory packing. We consider quantizing the weight parameters to 𝑥 bits and the activation to 𝑦
bits (abbreviated as W{𝑥 }A{𝑦}). As shown in Figure 8(a), the red dashed line depicts the maximum
available MACs/cycle on-board, which is calculated based on Equation (2). Different quantization
schemes may have different requirements on BRAM usage constrained by Equation (3). W4A8 is
the scheme that can almost fully utilize the compute resources. W8A8 and W16A16 require more
memory resources, resulting in lower performance since the computation is bound by the limited
BRAM resources on-board. Also, we can see that quantizing the weights gives the most benefit, while quantizing the activations gives little benefit (𝑀 does not change much under the same weight bitwidth). This is because we employ a dataflow architecture and do not require large buffers to store the intermediate tensors on-board.
Insight III: Weight quantization is necessary for reducing memory usage, while activa-
tion quantization only has limited benefit.
4.2.3 Memory Packing. Next, we further consider the impact of memory packing under the W4A8 setting. As shown in Figure 8(b), without memory packing the design cannot even satisfy the memory port constraint (Equation (5)) when 𝑀 is small (blue curve). This is because the large number of partitioned arrays requires more BRAMs, and many BRAMs are not fully utilized, causing a large waste of resources. The orange curve shows packing two int4 elements into an int8, with which we can achieve a small 𝑀 under the resource constraint since the number of partitioned arrays is reduced. The green curve packs 9×int4 elements into an int36 and achieves more than four times the 𝑀
[Figure: three plots of latency (ms) versus # of MACs/cycle 𝑀, comparing quantization schemes, memory packing options, and the maximum available BRAM.]
Fig. 8. (a) Impact of different quantization techniques on GPT2 prefilling stage on U280. The sequence length is set as 128. The cyan line shows the theoretical latency under different 𝑀 without memory bandwidth constraints. Thin dashed lines depict the maximum 𝑀 constrained by available BRAM resources. (b) Impact of memory packing in the W4A8 setting. (c) Impact of different weight quantization schemes on memory bandwidth and overall latency.
compared to the int8 packing. The purple curve packs 18×int4 elements to int72, and the curve
can almost intersect with the red line before intersecting with the blue line, which means it reaches
the maximum DSP constraint on-board (Equation (2)). This study shows that it is important to pack
the parameters to reduce on-chip memory usage.
Insight IV: Memory packing can efficiently reduce the required BRAMs to store the
tensors.
4.2.4 Memory Bandwidth. Lastly, we investigate how quantization impacts the required memory
bandwidth. As shown in Figure 8(c), the low-bit weight quantization can significantly alleviate the
demands of off-chip memory access. By reducing the volume of data needed in each cycle, it can
achieve a larger compute power 𝑀, thus leading to a better performance. In particular, quantizing
the model to a 2-bit representation yields a performance boost exceeding an order of magnitude
when compared to a 16-bit weight quantization scheme. Recent research [7, 103] has demonstrated
that 4-bit or even 2-bit quantization can be implemented without compromising model accuracy,
which makes efficient LLM deployment on FPGAs possible.
Insight V: Low-bit weight quantization can further help alleviate the demands of off-
chip memory access.
[Fig. 9: latency (ms) versus # of MACs per cycle per FPGA (𝑀) for different numbers of FPGAs, with markers for the maximum 𝑀 on U280, VHK158, VCK5000, and Stratix 10.]
For multiple devices, we use the Vicuna-13B model to estimate the performance of 2, 4, and
8 FPGAs based on our analytical model. As shown in Figure 9, the latency can scale well when
the number of devices increases. Since we employ a non-blocking communication scheme in our
dataflow design as discussed in § 3.4, communication will not be the bottleneck of the design.
Multiple FPGAs can reduce the number of required MACs on each device, but cannot increase
the number of available MACs on an FPGA, so the performance is still limited by the maximum
available resources on-board and the off-chip memory bandwidth. For the decode stage, leveraging
two FPGAs can already reduce the inference latency of the Vicuna-13B model to less than 10ms
based on the estimation.
Insight VI: Multiple FPGAs help reduce overall latency under the same 𝑀 on each
device.
5 Implementations
In this section, we describe the kernel implementation and accelerator design to show how to
efficiently achieve the design points in the analytical framework.
[Fig. 10(b) detail: the activation occupies in[7:0] of the 18-bit DSP input; the two weights occupy w[3:0] and w[16:13] of the 27-bit input; the products are read from out[11:0] and out[24:13] of the 45-bit output.]
Fig. 10. Systolic array and DSP packing. The yellow blocks in the systolic array represent output buffers.
Each MAC unit can be implemented with a single DSP block and can provide one-MAC-per-cycle
throughput. Based on the discussion in §4.2.2, we adopt the W4A8 quantization scheme for our
Fig. 11. Overall dataflow architecture of a single Transformer layer that uses post-LayerNorm scheme [65]. SDP denotes scaled dot-product. Orange nodes denote the GEMM kernels. Yellow nodes are the non-linear kernels, including softmax (SM), LayerNorm (LN), and GELU (GL). Green rectangles represent the FIFOs between kernels, and purple rectangles are the data loaders.
accelerator design, which maximizes the utilization of available resources. As a result, the matrix multiplications involve either int4-by-int8 or int8-by-int8 operations, which are costly to implement with LUTs and thus rely primarily on DSPs or specialized compute blocks (e.g., AIE [84]). In AMD
FPGAs, the DSP48E2 hard blocks can support 18-bit by 27-bit multiplication and accumulation [83],
enabling the packing of two multiplications into one slice for a W4A8 quantized model to save
DSP resources and achieve a larger 𝑀. Figure 10(b) shows the bit packing method for 4-bit by
8-bit integer multiplications. One activation is filled into the lower 8 bits of the 18-bit DSP input,
and two weights are filled into 0-to-3 and 13-to-16 bit positions of the 27-bit DSP input to avoid
overlapping results. Finally, the two multiplication results are extracted by bit-slicing the 45-bit
DSP result. Notice that since the DSP output is wide enough, we can also pack two 8-bit by 8-bit integer multiplications into one DSP slice by further offsetting the second weight and output. With DSP packing, we can easily double 𝑀 and achieve higher performance with far fewer DSPs.
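The arithmetic behind this packing can be checked with plain integers. The Python sketch below uses unsigned magnitudes and omits the sign handling a real DSP48E2 mapping needs, so it only illustrates why the two products land in out[11:0] and out[24:13].

```python
def packed_mac(a, w0, w1):
    # Pack two 4-bit weights into one 27-bit operand: w0 at bits [3:0],
    # w1 at bits [16:13]. One 8-bit activation feeds the 18-bit operand.
    assert 0 <= a < 256 and 0 <= w0 < 16 and 0 <= w1 < 16
    packed_w = w0 | (w1 << 13)
    out = a * packed_w            # single 18 x 27 multiplication
    p0 = out & 0xFFF              # out[11:0]  = a * w0 (fits in 12 bits)
    p1 = (out >> 13) & 0xFFF      # out[24:13] = a * w1
    return p0, p1

assert packed_mac(200, 7, 13) == (200 * 7, 200 * 13)
```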
Non-Linear Operators. Since quantizing non-linear operators can lead to a significant degradation
in model accuracy [67, 81], and these non-linear operators are not the bottleneck of the design, we
directly implement the floating-point version of these operators in HLS. Specifically, we buffer a row of elements for the softmax and LayerNorm functions, which require a reduction along the last dimension. Consequently, this approach eliminates the need to wait for the complete input tensor before computing these non-linear operators and effectively prevents dataflow stalling.
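As a functional reference for the row-buffering scheme (a NumPy sketch of the behavior, not our HLS code), the softmax below consumes one row at a time, so it can start as soon as a full row of the preceding GEMM's output is available.

```python
import numpy as np

def streaming_softmax(row_stream):
    # Consume the score matrix one row at a time; each row is buffered,
    # reduced (max and sum), normalized, and emitted before the next row
    # arrives, mirroring the row buffer used in the HLS kernel.
    for row in row_stream:
        row = np.asarray(row, dtype=np.float32)
        shifted = row - row.max()          # reduction 1: row max
        exp = np.exp(shifted)
        yield exp / exp.sum()              # reduction 2: row sum

scores = np.random.rand(4, 8).astype(np.float32)
probs = np.vstack(list(streaming_softmax(scores)))   # each row sums to 1
```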
placement constraints, the AMD Vitis toolchain will automatically insert AXI Register Slice IPs to pipeline SLR crossings.
We leverage the proposed analytical framework to guide our accelerator design. Since typical
Transformer models have 𝑑 FFN = 4𝑑 [16, 65] and 𝑙 < 𝑑, according to work balancing of Equation (11),
we have 𝑀𝑞,𝑘,𝑣,𝑝 = 𝑀, 𝑀𝑎1,𝑎2 < 𝑀, and 𝑀𝑓1,𝑓2 = 4𝑀. A straightforward division is to put the PEs for
𝑞, 𝑘, 𝑣, SDP, and 𝑝 on SLR0, 𝑓 1 on SLR1, and 𝑓 2 on SLR2 so that each SLR roughly contains 4𝑀 MAC
units. However, we observe that scaling up the linear operators in FFN poses significant challenges
to timing closure. Among the various systolic array configurations we tested, the maximum capacity of one SLR at 250 MHz is three 8 × 16 systolic arrays; a single 16 × 16 array fails timing. Therefore,
we only leverage 8 × 8 and 8 × 16 systolic arrays for simplicity. We also explore using LUT-based
multipliers as they provide greater flexibility for placement compared to DSPs. However, the
presence of additional inter-LUT wires results in a much lower frequency (191 MHz) compared to
the DSP-based multipliers. To minimize the number of SLR crossings, we put 𝑞, 𝑘, and 𝑣 on SLR0
and use 8 × 16 systolic arrays, which also ensures a relatively low latency for the first stage based
on Equation (8). MHA and the 𝑝 projection are on SLR1, with 𝑎1 and 𝑎2 using 8 × 8 and 𝑝 using 8 × 16 systolic arrays. The 𝑓1 and 𝑓2 operators are placed on SLR2, using 8 × 16 systolic arrays. Therefore, the design still forms a relatively balanced 3:2:2 resource utilization ratio for the linear operators.
6 Evaluation on FPGAs
In this section, we implement two design points studied in §4 to validate the feasibility of our
framework. We first describe our experimental setup and perform evaluation on a single FPGA.
Table 4. Experimental results compared with other FPGA-based accelerators. Sequence lengths are set as 512.
[Figure: four panels of latency (ms) versus input sequence length (32 to 512 tokens) and output sequence length (1 to 128 tokens).]
Fig. 13. Latency and energy efficiency of GPT2 model on different devices. The GPU results are obtained following the same setting in §4.
Figure 12 shows the final device layout of the implemented accelerator. We use OpenCL with the Xilinx RunTime (XRT) for hardware execution and the Xilinx Board Utility (xbutil) for board power measurements. The environment for the GPU experiments is listed in §4.1, and the NVIDIA system
management interface (nvidia-smi) is used for measuring GPU power. Notice the quantized models
on GPUs are slower than the FP16 models, as the quantization methods normally leverage fake
quantization and lack high-performance GPU kernels to support efficient inference. Therefore, we
directly compare our accelerator with the best FP16 GPU results. The FPGA on-board results match
the outputs from the quantized model in PyTorch and are able to achieve the same accuracy. The
latency results are the average across fifty runs.
tensor to be produced, which may take hundreds of cycles. Therefore, even if the per-layer latency
of spatial architectures is longer, the end-to-end latency can still be significantly lower than the
temporal architectures employed by FQ-BERT and TRAC. Furthermore, our analytical framework precisely predicts the performance of the accelerator, with less than 2 ms of difference, showing the practicality of our approach.
We next design an accelerator for the GPT2 model. We support importing quantized models
from different quantization frameworks [7, 81, 103]. Specifically, we export the W8A8 model from
SmoothQuant [81] and achieve 62.2% on the LAMBADA dataset [56], whereas the FP16 model
demonstrates an accuracy of 65.0%. We compare our GPT accelerator with the state-of-the-art GPT
accelerator, DFX [23], which employs a temporal architecture with an instruction set and uses the
same U280 FPGA device for on-board evaluation. On average, we are 2.16× and 1.10× faster than
DFX in the prefill and decode stage respectively. This is because our spatial architecture overlaps
the computation and largely eliminates off-chip memory access. We can also see that our estimations in §3 align closely with the actual performance, achieving a 92% prediction accuracy for the prefill stage. For the decode stage, the estimated latencies are lower than the actual results, mainly because the initial interval between two operators is not significantly smaller than the execution time of one stage, contributing to a notable increase in latency.
We also include the GPU results in §4 for a more comprehensive evaluation. As shown in
Figure 13, neither DFX nor our design performs well during the prefill stage compared to GPUs
that have more compute resources to exploit the abundant parallelism. Notably, the latency of the FPGAs in the prefill stage increases linearly with sequence length, while the GPU latency remains almost constant because the model does not fully utilize the GPU. For the decode stage, the situation is reversed. FPGA-based
accelerators are more efficient than GPUs, and our accelerator can achieve a 1.85× speedup and is
5.69× more energy efficient compared to the A100 GPU. This is because the generation of each token is fully sequential, so GPUs cannot leverage their abundant parallelism and suffer from
extensive memory access overheads. On the contrary, our dataflow accelerator eliminates most of
the off-chip memory accesses and overlaps the compute as much as possible. Thus, we can achieve
better performance compared to GPUs, aligning with our estimation results in §4. Notice the U280 FPGA uses only a 16 nm process while the A100 GPU has a more advanced 7 nm process node based on the data in Table 3, yet we still achieve a higher speedup, demonstrating the efficiency
of our spatial accelerators. It also indicates the potential of further optimizing our HLS design and
scaling it up to achieve even higher performance.
a pivotal role in this predictability. Additionally, our function offers enhanced customizability,
accommodating varying sizes and the choice of different quantization schemes.
Moreover, employing DSP packing further reduces the DSP usage, allowing one DSP to handle
two MAC operations within a single cycle, a feature not supported in AutoSA. This experiment
shows the efficiency of our kernels, facilitating the development of high-performance Transformer
accelerators.
Table 5. Latency and resource usage of our systolic array library function. Results are directly derived from the HLS report at 300 MHz. The GEMM kernel is extracted from the first FFN layer in the BERT-base model with size (512, 768) × (768, 3072). We use a 16 × 16 systolic array to compute the int8 GEMM. The theoretical peak performance without DSP packing is (512 × 768 × 3072)/(16 × 16) cycles × 3.33 ns/cycle = 15.71 ms.
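The peak-performance figure in the caption follows from a one-line calculation, repeated here only as a sanity check of the arithmetic.

```python
# Theoretical peak latency of a (512, 768) x (768, 3072) int8 GEMM on a
# 16 x 16 systolic array at 300 MHz (3.33 ns/cycle), without DSP packing.
macs = 512 * 768 * 3072
cycles = macs / (16 * 16)
latency_ms = cycles * 3.33e-9 * 1e3      # ~15.71 ms
```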
Lastly, we analyze the performance of the non-linear operators. As shown in Table 6, we observe
that the softmax operator in the MHA module incurs the highest latency, primarily due to the need
to compute the exponential function. Since these operators work elementwise or row-wise and only require a row of data to start the computation, they can be easily fused with the preceding linear operators in the pipeline, thereby not significantly impacting the overall latency. For instance, the combined latency of a GEMM kernel (10.77 ms) and the softmax operator (6.67 ms) greatly exceeds the latency of SLR1 in Table 4 (14.63 ms × 245 MHz/300 MHz ≈ 11.95 ms), indicating substantial overlap between the softmax operator and the other operators. Again, these ablation studies show that considering only the linear
operators in the analytical framework is sufficient to achieve an accurate latency estimation.
Table 6. Performance and resource usage of non-linear operators in our kernel library. Kernel sizes are set to match those of the BERT model in Table 4. Results are directly derived from the HLS report at 300 MHz.
7 Discussion
In the previous sections, we provide details of the analytical framework and show that its estimates closely match the latency of the actual implementation. However, our framework may have limitations when analyzing overlay designs or compressed models with sparsity, which would require changes to the resource and latency estimation. In this section, we delve into several unanswered questions and open challenges.
AI-Optimized FPGAs. In §4, we demonstrate the potential of leveraging FPGAs with specialized compute engines to accelerate LLMs. Although AIEs and tensor blocks provide massive compute power [37, 84], their memory layout and bandwidth requirements remain largely unexplored. Future FPGAs for AI workloads should provide enough memory bandwidth and efficient on-chip interconnect to facilitate the local data movements in a spatial architecture. Moreover, these specialized hardware blocks usually adopt a unique programming model with custom compilation flows. It is still an
open question whether existing practices for programming those hardware blocks enable efficient
execution of Transformer models.
Timing Closure on Multi-Die FPGAs. We encounter timing problems in partitioning and
scaling our design in §5.2. In general, it is hard to adequately explore the design space of multi-
die partitioning and scaling. There are automated frameworks [22, 29] to generate floorplanning
constraints, but they are currently not expressive enough to capture the various data movement
schemes (e.g., residual connection, multi-head splitting) within Transformer models. We hope
similar tools for Transformers could be derived from our analytical framework to speed up the
design closure.
Heterogeneous Deployment. Nowadays, data centers are increasingly heterogeneous, with
CPUs, GPUs, and FPGAs available at scale [6, 12, 51]. Therefore, it is possible to leverage the
advantages of different hardware to accelerate Transformer models. For example, GPUs are good
for the GPT prefill stage due to their high compute power; FPGAs can achieve low-latency decode
stage with customized spatial architecture. The key challenge is to build a distributed system
that efficiently manages hundreds of heterogeneous devices. We hope our analysis on resource
constraints, latency, and scaling could assist future deployment and evaluation of LLMs in a
heterogeneous and distributed environment.
8 Related Work
FPGA-Based Transformer Accelerators. Most of the prior works on hardware accelerators leverage a temporal or overlay architecture on a single FPGA [26, 28, 39, 40, 46, 59, 63, 64]. Their
performance usually suffers from frequent data movements of intermediate results. DFX [23]
explores using multiple FPGAs to accelerate GPT2 inference, but it is still an overlay design. Some
research has delved into software-hardware co-design to optimize the attention kernel [100]. These
endeavors often lack in-depth analysis on resource utilization and cannot be easily generalized to
other kernels.
Quantization on LLMs. Initial investigations [15, 81, 94] demonstrate lossless 8-bit quantization
for LLMs. Subsequent studies [21, 31, 44, 94, 96, 103] keep lowering the bit width; the latest
advancements reveal that 2-bit [7] and even 1-bit (binary) quantization [101] are adequate for
an accurate LLM. While these approaches offer valuable insights, our focus remains orthogonal
to quantization, as we illustrate optimization techniques and provide high-performance building
blocks for deploying quantized LLMs on FPGAs.
HLS Kernel Libraries. Despite the existence of kernel libraries for accelerating Transformer
models on GPUs [14, 38, 79], the hardware domain has seen only a handful of initiatives in this
regard. AMD provides Vitis HLS library [87, 88] that only has basic kernel-level examples without
comprehensive designs tailored for Transformer models. TRAC [61] attempts to provide an HLS-
based Transformer library, but its kernel performance is unpredictable, and it exclusively focuses on
the BERT model using a temporal architecture. Some frameworks map deep learning frameworks
to FPGAs [4, 20, 72, 98, 99], but can only handle small CNN designs and do not cater to LLMs.
More recent tools allow hardware design using Python [24, 35, 55, 80, 95], but are still general-
purpose and require hardware engineers to construct and optimize kernels from scratch. Our work
provides a Transformer kernel library designed for dataflow implementations and demonstrates
their composability in constructing high-performance hardware accelerators.
9 Conclusion
In this paper, we propose an analytical framework for large language models and point out the bottlenecks and potential optimizations across the prefill and decode stages of generative inference.
To verify the feasibility of our framework, we provide a reusable HLS kernel library to quickly
compose Transformer kernels into different LLMs that can achieve the expected performance.
Based on these proposed kernels, we design FPGA-based spatial accelerators for both BERT and
GPT models and achieve high performance and energy efficiency on par with high-end GPUs. By
offering insights into performance bottlenecks, a suite of reusable kernels, and a high-performance
accelerator, we propel the deployment of LLMs for real-world applications while pushing the
boundaries of hardware innovation.
Acknowledgments
This work was supported in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor
Research Corporation (SRC) program sponsored by DARPA and NSF Awards #2007832, #2019306,
and #2118709. We would like to thank anonymous reviewers, Keisuke Kamahori, and Zihao Ye for
providing insightful feedback. We also thank Jiajie Li, Jie Liu, and Zhanqiu Hu for their contributions
to the initial LLM modeling and benchmarking.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay
Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G.
Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating
Systems Design and Implementation, 2016.
[2] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji
Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: Enabling efficient inference
of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance
Computing, Networking, Storage and Analysis, 2022.
[3] Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, and Jason Cong. Flexcnn: An end-to-end framework
for composing cnn accelerators on fpga. ACM Trans. Reconfigurable Technol. Syst., 16(2), mar 2023.
[4] Michaela Blott, Thomas B Preußer, Nicholas J Fraser, Giulio Gambardella, Kenneth O’brien, Yaman Umuroglu, Miriam
Leeser, and Kees Vissers. Finn-r: An end-to-end deep-learning framework for fast exploration of quantized neural
networks. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 11(3):1–23, 2018.
[5] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein,
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258, 2021.
[6] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil,
Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael,
Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In 2016 49th
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13, 2016.
[7] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language
models with guarantees. arXiv preprint arXiv:2307.13304, 2023.
[8] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating
large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
[9] Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang. Slapo: A schedule language
for progressive optimization of large deep learning model training. In Proceedings of the 29th ACM International
Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS’24), 2024.
[10] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser,
Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak,
Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan
Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati,
Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[11] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang,
Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.