Systolic Array
Abstract—Artificial Intelligence (AI) has become the fastest growing application area for FPGAs. Two types of numerics are needed. Training typically uses floating point arithmetic (which is now widely available as embedded functions in current FPGAs). Inference is typically calculated with lower precision integer numbers, which can be implemented with embedded functions, soft logic, or a combination of the two. INT8 performance is therefore used as a typical benchmarking metric for current FPGAs. Recent publications based on Xilinx devices show the extraction of two INT8 multipliers from a 24x18 multiplier. A paper from Intel describes how to obtain two INT8 multipliers from an 18x18 multiplier, with the help of a small amount of soft logic. In this paper we introduce a number of new INT8 multiplier techniques, starting with the Intel 18x18 multiplier approach. Using both memory and logic resources - for a more balanced use of the FPGA features - we improve the INT8 density, and also show a signed-magnitude (SM) 1.7 construct that is even smaller. To demonstrate the usability of these new multipliers, we develop a scalable systolic array that contains up to 32,768 SM1.7 multipliers, or 28,800 INT8 multipliers, fitted into an Intel Stratix 10 2800 device. Finally, we implement a system architecture that includes input and output flow buffering and control, which can be instantiated directly into a larger AI design, or can enable the FPGA to be used as a standalone accelerator. This system exceeds 400 MHz for the largest array on a mid-speed device (26 TOPS INT8), and can operate up to 600 MHz for smaller array sizes.

I. INTRODUCTION

In deep learning (DL), multiplier density and performance - whether TOPs or TFLOPs - sets the performance expectation of the implementation. Dot product or matrix-vector arrays are the most common structures for these [1]. One advantage of the FPGA is flexibility; dataflow can be configured for each application, and non-linear activation functions like tanh and sigmoid [2] can be inserted anywhere. Support of simpler functions such as RELU is trivial. Additional optimizations like precision or numerical representation scaling can be varied from design to design. But the FPGA has less raw performance potential compared to an ASIC at similar process nodes. Newer FPGA devices are addressing this gap with more embedded features, especially for lower precision floating point and integer numbers. Xilinx has introduced the ACAP concept [3], with embedded processors supporting IEEE754 FP32, INT16, and INT8 representations. The Intel Agilex FPGA [4] now has IEEE754 FP16, BFLOAT16, as well as INT9 support in the DSP Blocks. These two device families have just been introduced, which means that the mainstream devices generally available still lack the performance required for this important application area.

We also considered different types of numerical representation for this work: signed magnitude is potentially more efficient, and aligns with the precision component of the IEEE754 floating point numerics. In [5], custom floating point numbers with a variable SM mantissa were used, so we have precedence and results on FPGA for this approach. A recent work [6] using a 9 bit custom floating point format on a Xilinx VUP device also uses a signed magnitude mantissa representation, with two mantissa multipliers mapped to a single DSP48.

Although not using signed magnitude, [7] points to floating point processing, particularly with shared exponents, as an increasing use model for FPGA. Our goal in this work was not to benchmark another deep learning application, but to improve the performance density of the mainstream FPGA considerably, in order to maintain competitiveness with GPUs and the emerging ASSPs. We took some inspiration from the Microsoft Brainwave work [8][9], where a Stratix 10 2800 device was packed to 92% logic density. According to [10], 80% of the device logic was used by the datapaths, packed to 97%, and the remaining 20% of the logic was control and data distribution, packed to 80% efficiency. Another important point from [9] was that around two thirds of the memory blocks were used - presumably largely for data distribution, i.e. feeding the dot products. In this work, we will aim for an 80% device logic resource utilization for our datapaths, while at the same time incorporating some of the free memory into the arithmetic calculations. We will still use all of the DSP blocks that we are able to access. Our contributions are as follows:

1) We develop an improvement to published INT8 results to support SM1.7, using a combination of DSP blocks, embedded memory blocks, and soft logic.
2) We describe dot structures for SM1.7 that support both high density and high packing efficiency.
3) The SM1.7 logic and memory implementation methods are mapped back to the INT8 datapaths, with similar improvements in multiplier density.
4) We demonstrate a scalable 2D systolic array that fits to a deterministic area and geometry automatically - without floorplanning - from a small grid to a full chip design.
5) We provide a lightweight, low cost, scalable interface which can source and sink any size of the 2D systolic
arrays to the pins, or a user design, while maintaining the performance of the systolic array.

II. PREVIOUS WORK

The production Brainwave deployment [8] used 91% of the logic, 69% of the M20Ks, and 91% of the DSPs on the target Stratix 10 2800 device. Much of the M20K resource was used to store activations and weights locally. We will attempt to employ some of the 30% of unused M20Ks to perform arithmetic operations in the dot product structure. The clock rate of the Brainwave design was limited to 250 MHz [8], although they earlier reported 300 MHz for their prototypes [9]. Brainwave uses 96K multipliers with a block floating point {1.5.2} format per device. We are instead using a SM1.7 format for the mantissa portion. A SM1.7 multiplier is much larger arithmetically than the SM1.2 multiplier in Brainwave - by over 10x, as multiplier complexity grows roughly with the square of the magnitude width. The number of SM1.7 multipliers that can be implemented in the FPGA will therefore be significantly less, but we will show that our new techniques will be able to dramatically close this gap in operations.

The Xilinx whitepaper WP486 [11] shows how to extract two INT8 multipliers from a DSP48 if one of the inputs is shared. A reference point in this paper will be the Xilinx VU9P with 6840 DSP48 blocks, giving a maximum possible of 13680 INT8 multipliers using their method. Xilinx does not show any system level results in their whitepaper, although a number of other works based on this have been published recently.
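To make this style of extraction concrete, the following Python sketch (our illustration, not the exact WP486 wiring; the 18-bit guard spacing and the field recovery steps are assumptions) models how two INT8 products sharing one operand can be recovered from a single wide multiplication:

    def shared_input_int8_pair(a: int, b: int, c: int):
        # One wide multiply computes a*b and a*c, with b placed far
        # enough above c that the two products cannot overlap.
        GUARD = 18
        p = a * ((b << GUARD) + c)       # single wide multiplication
        ac = p % (1 << GUARD)            # low field, modulo 2^18
        ab = p >> GUARD                  # high field (floor division)
        if ac >= (1 << (GUARD - 1)):     # low product was negative:
            ac -= 1 << GUARD             #   sign-correct it, and
            ab += 1                      #   undo the borrow from the high field
        return ab, ac                    # equals (a*b, a*c) for INT8 inputs

In hardware, the borrow correction at the field boundary corresponds to a small amount of extra arithmetic around the wide multiplier.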
In [12], the Xilinx Supertile is used at double the clock frequency of the surrounding logic (720 MHz/360 MHz), but only 55% of the DSP Blocks are used. Although not explicitly stated, this is because the application implementation is memory bandwidth limited, as all of the large memory blocks (URAMs) are used (only 40% of the smaller 36Kb memories are used, however). The earlier Xilinx Supertile paper [13] describes the implementation of a large 96x16 (1536 DSP48 blocks) processing array, albeit on the smaller VU3P device. This design takes up 67% of the available DSP48 blocks, and runs at the 775 MHz-891 MHz (speed grade dependent) datasheet speed. In [12] a speed degradation to 92% of this rate is observed in the larger VU9P (which is essentially three VU3P die stitched together). According to [13], 100% DSP use for the Supertile uses only 25% of the device logic, although this 100% is only an extrapolated number based on the realized smaller design. The Supertile paper also uses INT16, while [12] implements INT8 via the technique from [11]. Extrapolating to a maximum systolic INT8 array use on a VU9P (100% DSP and 25% logic utilization) gives 19.7 TOPs. The performance based on the demonstrated 96x16 array replicated thrice in the VU9P is 13.3 TOPs. This array architecture is based on groups of small cascaded DSP Blocks.

In [14], an alternate assembly of the DSP48 Blocks is presented. As opposed to the Supertile, the memory and DSP Blocks operate at the same clock rate. Logic use is reduced - presumably as the data rate matching is not required - but the clock rate is slightly reduced to 650 MHz, mapped to a large VU37P device. This design incorporates a more balanced use of the embedded resources, and uses 100% of the URAM, 95% of the DSP, and 40% of the BRAM. The INT8 extraction method [11] is again used with the DSP Blocks. Several DL applications are benchmarked; although significantly more DSP Blocks are utilized, throughput is reduced by 30% over [12]; however, latency is reduced by 7X. Assuming that this design scales to the somewhat smaller VU9P, and applying the utilization ratios and clock frequency, we calculate a scaled performance of 16.9 TOPs.

In [15] a Stratix 10 2800 device is filled with dot products, arranged as tree structures. A new type of INT8 extraction to obtain two INT8 multipliers per 18x18 DSP Block multiplier (with the use of some soft logic) is described, with an average cost of about 8 ALMs per INT8 multiplier. The INT8 multipliers are grouped in 32 element dot products, and a full chip dot product array is implemented. Fractal Synthesis [10] is used to pack the logic more efficiently, leaving the unused logic and routing available in large contiguous blocks, where it can presumably easily be accessed for any application level functionality. The dot products are individually instantiated, so that all inputs and outputs must be individually sourced and terminated. This is done by using virtual pins. The presented design contains 22,400 INT8 multipliers (700 dot products); with shared inputs this amounts to 283,500 virtual pins - this alone requires 141,500 ALMs of the placed and routed design. The logic for these virtual pins would also be freed up if the connections were from a system level design - as these virtual pins represent 15% of the available logic on the device, this is almost enough to implement our entire Brainwave inspired design, which requires about 20% of the 2800 logic. (The connections to the dot products would largely come from the M20K memories, and not through logic, for this design.) The Stratix 10 2800 (933K ALMs, 5760 DSP Blocks) device is slightly smaller than the VU9P device (1182K 6LUTs, 6840 DSP48s).

Comparing the two Virtex Ultrascale (VUP) designs, we can see that the dataflow architecture has a significant impact on metrics such as throughput and latency, and that the performance level of the final design cannot be independently determined simply by looking at memory size, bandwidth, and number of TOPs. The performance, however, cannot exceed the maximum number of TOPs. The Xilinx targeted designs also rely on the hard cascade feature of the DSP48 blocks, which will introduce a linear latency to any vector operation. In contrast, Brainwave uses a dot product implementation, with a combination of DSP Blocks and soft logic. We note that Microsoft Brainwave is the highest profile production design currently in use, with tens of thousands of nodes deployed [8] on a relatively large device. A recent paper from another commercial organization [16] used smaller Xilinx devices, utilizing only 2070 DSP48 blocks.

A brief survey of other recent works using our targeted Stratix 10 2800 device gives some additional datapoints. In [17], up to 100% of the DSPs with 76% of the ALMs were used, showing that large designs are possible on this device with a variety of approaches, although this example only achieved
[Fig. 1. DSP Block based unsigned 7x7 multiplier. (a) The DSP is used as an 18x7 multiplier, with the two 7-bit operands packed around a zero nibble; a 3x3 LSB subset multiplier and a subtractor on o[24:11] separate the two products. (b) Bit alignments of the packed operands and the two partially overlapping 14-bit results in the 25-bit output o[24:0].]
240 MHz. Another design [18] had a larger ratio of DSP Blocks (71%) to ALMs (15%), but despite the relative sparseness of the utilization, was slower at 200 MHz.

Power was not reported for the VUP designs. Microsoft reports 125W for a full chip Stratix 10 2800, which uses the majority of resources at a lower frequency. The VUP designs use a significant amount of memory and are DSP rich, and have a relatively low logic utilization, but run the DSP blocks (and in [14], the memory as well) at very high speed to achieve throughput. While we cannot compare any Xilinx power numbers, we can see that the power envelope for a full device Stratix 10 2800 at 400 MHz should still be coolable in a production environment. We therefore set ourselves a target of a design that uses around 80% of the device resources (but, if possible, a lower percentage of memory blocks), capable of running in the range of 400 MHz on a previous generation FPGA. In [6], a relatively small VUP based CNN accelerator, using only 1,106 of the 6,840 DSP48 blocks on the device (although 79% of the logic), consumed 75W at a 200 MHz clock. This was apparently measured at the board level, so may not accurately reflect the FPGA power consumption. Power is reported for [17], but these are simulated numbers from the EPE tool, and not actually measured.

We will build a number of dot product structures, balancing DSP based implementations with multipliers constructed from different ratios of M20K blocks and soft logic, in order to provide a wider choice in resource balancing. We will also arrange the dot products in a 2D systolic architecture so that the memory bandwidth, or at least the I/O bandwidth, is reduced. This may increase the stress on inter-PE routing, but will also give a worst-case routing congestion datapoint. If a 2D systolic structure is not suitable, the dot products could still be individually instantiated. A key point of our designs over [12][14] is that the dot product latency varies logarithmically rather than linearly with the dot size, reducing system latency, and giving more flexibility to the matrix and vector decompositions.

III. METHOD

We started by first creating some efficient SM1.7 operators, based on both DSP Blocks and M20K memory blocks. In both cases, a small amount of soft logic is required. Many of our dot products will also require pure soft logic multipliers, and we will use the Quartus based IP for these.

A. DSP Block Based Multipliers

In [15] several new DSP block based methods for INT8 multipliers were introduced. We will modify these to implement unsigned INT7 (UINT7) multipliers instead.

Both [15] and [11] use a signed, or twos complement, format to represent input values. There are some FPGA implementation efficiencies if we use the SM format instead. The dynamic range is almost identical, with the only difference that the number -128 cannot be represented in SM1.7. The sign is simply calculated as (sign_a) XOR (sign_b). Although the first level of the adder tree is arithmetically more complex, as the signs can result in subtractions as well as additions, this can be supported in the logic associated with the FPGA carry chain, and is therefore free at the implementation level.
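A small Python model of this SM1.7 arithmetic (the function names and the toy dot product loop are our illustration, under the encoding just described):

    def sm17_mul(x: int, y: int):
        # An 8-bit SM1.7 value is {sign, 7-bit magnitude}; -128 has
        # no encoding. The product is {sign, 14-bit magnitude}.
        sx, mx = (x >> 7) & 1, x & 0x7F   # unpack sign and magnitude
        sy, my = (y >> 7) & 1, y & 0x7F
        return sx ^ sy, mx * my           # XOR of signs, UINT7 multiply

    def sm17_dot(pairs):
        # First adder level: add or subtract each magnitude product
        # according to its sign (absorbed by the carry chain logic).
        acc = 0
        for x, y in pairs:
            s, m = sm17_mul(x, y)
            acc += -m if s else m
        return acc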
The magnitude bits require a UINT7 multiplier. We use a modified technique of [15], with each INT18 multiplier in the DSP block acting as an unsigned 18x7 multiplier. The two 7 bit multiplicands are input with a zero nibble between them. The output of the multiplier contains the sum of two partially overlapping 14 bit results. A correction factor of 3 bits can be calculated by a subset multiplier, which only outputs the 3 LSBs of a 3x3 bit multiplication. The description in [15] does
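A behavioral Python sketch of this packed 18x7 arrangement (cf. Fig. 1; the variable names and the exact recovery steps shown here are our illustration of the described technique):

    def packed_18x7(a: int, b: int, c: int):
        # One 18x7 DSP multiply returns both a*b and a*c,
        # for 7-bit unsigned a, b, c (0 <= a, b, c < 128).
        p = a * ((c << 11) | b)           # zero nibble between c and b
        z = ((a & 7) * (c & 7)) & 7       # 3x3 LSB subset multiplier
        t = ((p >> 11) - z) & 7           # top 3 bits of a*b that overlap
        ab = (t << 11) | (p & 0x7FF)      # reassemble a*b (14 bits)
        ac = (p >> 11) - t                # subtract the overlap to get a*c
        return ab, ac

The subtraction on the upper output bits corresponds to the subtractor on o[24:11] shown in Fig. 1.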
[Figure: M20K used as a dual-port ROM - Port A: ROM inputs, Port B: ROM inputs.]

Fig. 4. INT8 M20K Mapping
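The surviving figure fragments show the M20K used as a dual-port ROM. As a hedged sketch of this class of memory-based multiplier (the operand split and table widths below are our assumptions for illustration, not necessarily the exact mapping of Fig. 4), each port looks up a precomputed partial product and soft logic combines them:

    # Precomputed ROM contents: partial products of a 7-bit operand
    # against the low and high slices of the other operand.
    ROM_LO = [[a * b for b in range(16)] for a in range(128)]  # a * b[3:0]
    ROM_HI = [[a * b for b in range(8)] for a in range(128)]   # a * b[6:4]

    def m20k_mul7x7(a: int, b: int) -> int:
        # Two ROM lookups (one per port) plus one soft-logic add.
        lo = ROM_LO[a][b & 0xF]
        hi = ROM_HI[a][(b >> 4) & 0x7]
        return (hi << 4) + lo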
have seen in the earlier [8][12][14] cases that there is a large variation in the memory depth and bandwidth to multiplier relationship. Further implementation flexibility is possible by mixing grids with different multiplier composition ratios.

Because of the more restrictive memory based approach in the INT8 case, we only composed one version of the 16 element dot product, split into 4 and 12 elements of the M20K and DSP Block based versions, respectively. Table II reports the resources in the same manner as Table I.

For all of our designs, we targeted a 1SG280LN2F43E2VG, using Quartus 19.1. It is important to note that this is a mid-speed grade device.

[Fig. 7. 2D PE Array - a 32 x 64 grid of PEs.]

TABLE II
RESOURCE REQUIREMENTS FOR AN EXAMPLE INT8 2x2 GRID

M20K INT8 mults | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs  | M20Ks | DSPs
4               | 1606           | 54          | 730           | 4124 | 16    | 12

IV. RESULTS - SYSTOLIC ARRAYS

We created a number of chip filling designs (to approximately 80% logic usage, as discussed earlier in our goals for this work), with different types of PE. In these reported cases, all PE compositions were uniform across the design, but this does not have to be the case. We first built these full chip designs with three variants of the SM1.7 dot products: 5, 3, and 2 M20K multipliers per PE. The results are listed in Table III. In every case, the number of PEs was 2048, arranged as a 32x64 grid of PEs (a 16x32 grid of our 2x2 grids). In all cases, these designs were placed and routed using a push button flow. No floorplanning was used, even for the very large arrays.

We can achieve a successful fit of 32K multipliers, with consistent results over different multiplier component mixes, while leaving enough logic left over to support a system design in the Brainwave style. Our performance consistently exceeds 400 MHz, so a performance level of 25 TOPs to 30 TOPs is a reasonable target for this device, and is likely scalable within this device family.
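As a sanity check on this figure (counting a multiply-accumulate as two operations):

    mults = 32 * 64 * 16      # 32x64 PEs, 16-element dot product each
    f_clk = 400e6             # demonstrated clock on the mid-speed device
    tops = 2 * mults * f_clk / 1e12
    print(tops)               # about 26.2, within the quoted 25-30 TOPs range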
The INT8 versions gave us both slightly lower performance and lower density. We initially tried a near maximum DSP usage for a 30x60 grid of PEs (28,800 INT8 multipliers), but the speed dropped below 400 MHz. We attempted to improve the speed by pipelining the critical path, which was from the output of the DSP Blocks to the immediately following adders, but this only increased area, with a minimal impact on performance. Reducing the grid size was more successful - although the 25,088 multipliers in this grid is only a 12% density improvement on the previously reported Intel results, a large number of DSP Blocks have been freed up for other tasks, such as activation function implementation. About half of the M20Ks remain for the system design. In all of our SM1.7 and INT8 designs, the dot product composition is homogeneous throughout the grid. By mixing and matching, e.g. having some PEs implemented only with DSP Blocks and some with a mixture of DSP Blocks and memory, and possibly logic, different amounts and distributions of resources can remain free.

Our example systolic arrays used virtual pins, as in [15]. The number of virtual pins in our design was considerably less, because the 2D systolic structure includes PE to PE communication paths. There are a total of 51,200 virtual pins, consisting of 8K left side input bits, 4K bits in from the top, and 38,912 output bits. These virtual pins occupy 25,600 ALMs.

A floorplan and routing heatmap are shown in Figure 8. This particular example was for SM1.7, with 5 M20K based multipliers per dot product. The heatmap for all of the SM1.7 and INT8 grids is almost identical, with the highest stress (colored in pink) concentrated around the discontinuities on the die. These include the embedded processor subsystem on the lower right edge, and the IO regions running vertically through the die. This suggests that a system using a collection of smaller dot product arrays, or cases where only a smaller portion of the FPGA is used for the deep learning application, may be able to increase the clock frequency of the dot product array significantly (we found that small grids often exceeded 600 MHz).

Figure 9 shows an intermediate sized array (20x40 dot products), which is about 40% of the maximum size limit, based on both our resource goal and the number of DSP Blocks in the device. The fractal synthesis clustering and packing algorithms direct Quartus to create an almost perfectly rectangular layout, leaving the majority of the device completely untouched.

The Brainwave design contained 96K multipliers with a 1.2 magnitude format. Our work shows that 32K multipliers with a much larger 1.7 magnitude format is readily achievable - this translates to about a 3x ISO density improvement, based on arithmetic complexity (which varies as the square of the relative precision).
TABLE III
LARGE 2D SYSTOLIC ARRAY (SM1.7) RESULTS

TABLE IV
LARGE 2D SYSTOLIC ARRAY (INT8) RESULTS
[Fig. 8. Floorplan (a) and routing heatmap (b) of a 32K multiplier 2D systolic array.]
V. RESULTS - SYSTEM DESIGNS

After validating our approach with the independent 2D systolic arrays, we implemented a more representative system design. In this case, all I/Os are mapped to pins. First, each PE in the systolic array is terminated in a 32b integer accumulator. The PE routing of Figure 6 is modified to support systolic propagation of the accumulator results to the core periphery. Finally, small (512 word) input and output RAMs are attached to the periphery for flow control. Muxing is used to select the current output stream.
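A minimal behavioral model of such a 2D array with per-PE accumulators (a toy sketch under our own simplifications - single-cycle output-stationary PEs and ideal flow control - not the paper's implementation):

    import numpy as np

    def systolic_matmul(A, B):
        # A is MxK and streams in from the left; B is KxN and streams
        # in from the top. PE (i,j) sees a[i,k] and b[k,j] at cycle
        # i+j+k, multiplies, and accumulates into its local register.
        M, K = A.shape
        _, N = B.shape
        acc = np.zeros((M, N), dtype=np.int64)  # per-PE accumulators
        for t in range(M + N + K - 2):          # global cycles
            for i in range(M):
                for j in range(N):
                    k = t - i - j               # skewed wavefront index
                    if 0 <= k < K:
                        acc[i, j] += int(A[i, k]) * int(B[k, j])
        return acc                              # drained to the periphery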
For any given array size, the number of DSP Blocks remains the same, while there are small (typically 5%) increases in the soft logic and memory sizes. There is virtually no impact on speed. We can see that even for very large (30x60) PE arrays, there is still some logic margin available, and for all array sizes, the memory availability is greater than the logic availability. All of this bodes well for incorporating these structures into a more complex deep learning system. Alternately, the system design can be used as an accelerator as is.

We noticed an initial speed degradation of about 10% compared to the array core, and decided to introduce some minimal floorplanning. Pinning the corners of the arrays achieved an immediate result, with the system designs running about 10% faster than the push button arrays.

Figure 10 shows a block diagram of the system array. The flow control memories can be seen in the periphery. The routing heatmap of Fig. 11 following our minimal floorplanning intervention shows less stress than the systolic array core alone.
TABLE V
SYSTOLIC ARRAY SYSTEM (INT8) RESULTS
[Fig. 10. Systolic Array Block Diagram - peripheral RAM banks with an address decoder; DATA A/ADDR A and DATA B/ADDR B input ports; ADDR OUT/DATA OUT output ports selected through output muxes.]

VI. CONCLUSIONS

Our goal was to develop a method to create large, scalable, and high performance systolic arrays for DL at an 8 bit precision. We showed new methods for INT8 multiplier based dot products, and introduced SM1.7 as an even more efficient format for implementing these arithmetic constructs with this precision. The resultant designs achieve their goals with consistent and deterministic speed, as well as low latency. Our throughput is 10% to 50% higher than other designs implemented on the same or slightly larger devices, by utilizing a balanced implementation mix (the ratio of multipliers and soft logic to DSP blocks). We were able to realize these results even using a single monolithic chip scale 2D systolic array.

We can scale these arrays from small to large, while ensuring an efficiently packed, placed and routed result, without floorplanning, simplifying the design process greatly. We then extended the arrays to include flow control so that they can be used as an accelerator, either as a stand alone device, or as part of a more complex deep learning design on the FPGA. Using a small amount of floorplanning, performance increased, delivering over 400 MHz on a mid-speed grade device, for even a very full design.
REFERENCES

[1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080246
[2] B. Pasca and M. Langhammer, "Activation function architectures for FPGAs," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), Aug 2018, pp. 43–437.
[3] I. Swarbrick, D. Gaitonde, S. Ahmad, B. Gaide, and Y. Arbel, "Network-on-chip programmable platform in Versal ACAP architecture," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: ACM, 2019, pp. 212–221. [Online]. Available: http://doi.acm.org/10.1145/3289602.3293908
[4] Intel Agilex Variable Precision DSP Blocks User Guide, 2019, https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/agilex/ug-ag-dsp.pdf.
[5] G. R. Chiu, A. C. Ling, D. Capalija, A. Bitar, and M. S. Abdelfattah, "Flexibility: FPGAs and CAD in deep learning acceleration," in Proceedings of the 2018 International Symposium on Physical Design, ser. ISPD '18. New York, NY, USA: ACM, 2018, pp. 34–41. [Online]. Available: http://doi.acm.org/10.1145/3177540.3177561
[6] H. Nakahara, Y. Sada, M. Shimoda, K. Sayama, A. Jinguji, and S. Sato, "FPGA-based training accelerator utilizing sparseness of convolutional neural network," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Sep 2019.
[7] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, "High-performance FPGA-based CNN accelerator with block-floating-point arithmetic," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1874–1885, Aug 2019.
[8] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, "A configurable cloud-scale DNN processor for real-time AI," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), June 2018, pp. 1–14.
[9] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman et al., "Accelerating persistent neural networks at datacenter scale," in Hot Chips, vol. 27, 2017.
[10] M. Langhammer, G. Baeckler, and S. Gribok, "Fractal Synthesis: Invited tutorial," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: ACM, 2019, pp. 202–211. [Online]. Available: http://doi.acm.org/10.1145/3289602.3293927
[11] Deep Learning with INT8 Optimization on Xilinx Devices, 2017, https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf.
[12] E. Wu, X. Zhang, D. Berman, I. Cho, and J. Thendean, "Compute-efficient neural-network acceleration," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: ACM, 2019, pp. 191–200. [Online]. Available: http://doi.acm.org/10.1145/3289602.3293925
[13] E. Wu, X. Zhang, D. Berman, and I. Cho, "A high-throughput reconfigurable processing array for neural networks," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Sep. 2017, pp. 1–4.
[14] A. Samajdar, T. Garg, T. Krishna, and N. Kapre, "Scaling the Cascades: Interconnect-aware mapping strategies for FPGA implementation of machine learning problems," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Sep 2019.
[15] M. Langhammer, B. Pasca, G. Baeckler, and S. Gribok, "Extracting INT8 multipliers from INT18 multipliers," in International Conference on Field Programmable Logic and Applications (FPL). Barcelona, Spain: IEEE, Sep 2019.
[16] D. Wu, Y. Zhang, X. Jia, L. Tian, T. Li, L. Sui, D. Xie, and Y. Shan, "A high-performance CNN processor based on FPGA for MobileNets," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Sep 2019.
[17] S. K. Venkataramanaiah, Y. Ma, S. Yin, E. Nurvitadhi, A. Dasu, Y. Cao, and J.-S. Seo, "Automatic compiler based FPGA accelerator for CNN training," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Sep 2019.
[18] R. Rajat, H. Zeng, and V. Prasanna, "A flexible design automation tool for accelerating quantized spectral CNNs," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Sep 2019.
[19] M. Langhammer and G. Baeckler, "High density and performance multiplication for FPGA," in 25th IEEE Symposium on Computer Arithmetic, ARITH 2018, Amherst, MA, USA, June 25–27, 2018, pp. 5–12. [Online]. Available: https://doi.org/10.1109/ARITH.2018.8464695
[20] X. Yu, Y. Wang, J. Miao, E. Wu, H. Zhang, Y. Meng, B. Zhang, B. Min, D. Chen, and J. Gao, "A data-center FPGA acceleration platform for convolutional neural networks," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Sep 2019.