
2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

High Density 8-bit Multiplier Systolic Arrays for FPGA

Martin Langhammer, Sergey Gribok, Gregg Baeckler
Intel Corporation
Martin.Langhammer@intel.com, Sergey.Gribok@intel.com, Gregg.Baeckler@intel.com

Abstract—Artificial Intelligence (AI) has become the fastest growing application area for FPGAs. Two types of numerics are needed. Training typically uses floating point arithmetic (which is now widely available as embedded functions in current FPGAs). Inference is typically calculated with lower precision integer numbers, which can be implemented with embedded functions, soft logic, or a combination of the two. INT8 performance is therefore used as a typical benchmarking metric for current FPGAs. Recent publications based on Xilinx devices show the extraction of two INT8 multipliers from a 24x18 multiplier. A paper from Intel describes how to obtain two INT8 multipliers from an 18x18 multiplier, with the help of a small amount of soft logic. In this paper we introduce a number of new INT8 multiplier techniques, starting with the Intel 18x18 multiplier approach. Using both memory and logic resources - for a more balanced use of the FPGA features - we improve the INT8 density, and also show a signed-magnitude (SM) 1.7 construct that is even smaller. To demonstrate the usability of these new multipliers, we develop a scalable systolic array that contains up to 32,768 SM1.7 multipliers, or 28,800 INT8 multipliers, fit into an Intel Stratix 10 2800 device. Finally, we implement a system architecture that includes input and output flow buffering and control, which can be instantiated directly into a larger AI design, or can enable the FPGA to be used as a standalone accelerator. This system exceeds 400 MHz for the largest array on a mid-speed device (26 TOPS INT8), and can operate up to 600 MHz for smaller array sizes.

I. INTRODUCTION

In deep learning (DL), multiplier density and performance - whether TOPs or TFLOPs - sets the performance expectation of the implementation. Dot product or matrix-vector arrays are the most common structure for these [1]. One advantage of the FPGA is flexibility; dataflow can be configured for each application, and non-linear activation functions like tanh and sigmoid [2] can be inserted anywhere. Support of simpler functions such as RELU is trivial. Additional optimizations like precision or numerical representation scaling can be varied from design to design. But the FPGA has less raw performance potential compared to ASIC at similar process nodes. Newer FPGA devices are addressing this gap with more embedded features, especially for lower precision floating point and integer numbers. Xilinx has introduced the ACAP concept [3], with embedded processors supporting IEEE754 FP32, INT16, and INT8 representations. The Intel Agilex FPGA [4] now has IEEE754 FP16, BFLOAT16, as well as INT9 support in the DSP Blocks. These two device families have just been introduced, which means that the mainstream devices generally available still lack the performance required for this important application area.

We also considered different types of numerical representation for this work: signed magnitude is potentially more efficient, and aligns with the precision component of the IEEE754 floating point numerics. In [5], custom floating point numbers with a variable SM mantissa were used, so we have precedence and results on FPGA for this approach. A recent work [6] using a 9 bit custom floating point format on a Xilinx VUP device also uses a signed magnitude mantissa representation, with two mantissa multipliers mapped to a single DSP48. Although not using signed magnitude, [7] points to floating point processing, particularly with shared exponents, as an increasing use model for FPGA. Our goal in this work was not to benchmark another deep learning application, but to improve the performance density of the mainstream FPGA considerably, in order to maintain competitiveness with GPUs and the emerging ASSPs. We took some inspiration from the Microsoft Brainwave work [8][9], where a Stratix 10 2800 device was packed to 92% logic density. According to [10], 80% of the device logic was used by the datapaths, packed to 97%, and the remaining 20% of the logic was control and data distribution, packed to 80% efficiency. Another important point from [9] was that around 2/3rds of the memory blocks were used - presumably largely for data distribution, i.e. feeding the dot products. In this work, we will aim for an 80% device logic resource utilization for our datapaths, while at the same time incorporating some of the free memory into the arithmetic calculations. We will still use all of the DSP blocks that we are able to access. Our contributions are as follows:

1) We develop an improvement to published INT8 results to support SM1.7, using a combination of DSP blocks, embedded memory blocks, and soft logic.
2) We describe dot structures for SM1.7 that support both high density and high packing efficiency.
3) The SM1.7 logic and memory implementation methods are mapped back to the INT8 datapaths, with similar improvements in multiplier density.
4) We demonstrate a scalable 2D systolic array that fits to deterministic area and geometry automatically - without floorplanning - from a small grid to a full chip design.
5) We provide a lightweight, low cost, scalable interface which can source and sink any size of the 2D systolic

2576-2621/20/$31.00 ©2020 IEEE
DOI 10.1109/FCCM48280.2020.00021
arrays to the pins, or a user design, while maintaining the performance of the systolic array.

II. PREVIOUS WORK

The production Brainwave deployment [8] used 91% of the logic, 69% of the M20Ks, and 91% of the DSP on the target Stratix 10 2800 device. Much of the M20K resource was used to store activations and weights locally. We will attempt to employ some of the 30% of the unused M20K to perform arithmetic operations in the dot product structure. The clock rate of the Brainwave design was limited to 250 MHz [8], although they earlier reported 300 MHz for their prototypes [9]. Brainwave uses 96K multipliers with a block floating point {1.5.2} format per device. We are instead using an SM1.7 format for the mantissa portion. An SM1.7 multiplier is much larger arithmetically than the SM1.2 multiplier in Brainwave - by over 10x. The number of SM1.7 multipliers that can be implemented in the FPGA will therefore be significantly less, but we will show that our new techniques will be able to dramatically close this gap in operations.

The Xilinx whitepaper WP486 [11] shows how to extract two INT8 multipliers from a DSP48 if one of the inputs is shared. A reference point in this paper will be the Xilinx VU9P with 6840 DSP48 blocks, giving a maximum possible of 13680 INT8 multipliers using their method. Xilinx does not show any system level results in their whitepaper, although a number of other works based on this have been published recently.

In [12], the Xilinx Supertile is used at double the clock frequency of the surrounding logic (720 MHz/360 MHz), but only 55% of the DSP Blocks are used. Although not explicitly stated, this is because the application implementation is memory bandwidth limited, as all of the large memory blocks (URAMs) are used (only 40% of the smaller 36kb memories are used, however). The earlier Xilinx Supertile paper [13] describes the implementation of a large 96x16 (1536 DSP48 blocks) processing array, albeit on the smaller VU3P device. This design takes up 67% of the available DSP48 blocks, and runs at the 775MHz-891MHz (speed grade dependent) datasheet speed. In [12] speed degradation to 92% of this rate is observed in the larger VU9P (which is essentially three VU3P die stitched together). According to [13], 100% DSP use for the Supertile uses only 25% of the device logic, although this 100% is only an extrapolated number based on the realized smaller design. The Supertile paper also uses INT16, while [12] implements INT8 via the technique from [11]. Extrapolating to a maximum systolic INT8 array use on a VU9P (100% DSP and 25% logic utilization) gives 19.7 TOPs. The performance based on the demonstrated 96x16 array replicated thrice in the VU9P is 13.3 TOPs. This array architecture is based on groups of small cascaded DSP Blocks.

In [14], an alternate assembly of the DSP48 Blocks is presented. As opposed to the Supertile, the memory and DSP Blocks operate at the same clock rate. Logic use is reduced - presumably as the data rate matching is not required - but the clock rate is slightly reduced to 650MHz, mapped to a large VU37P device. This design incorporates a more balanced use of the embedded resources, and uses 100% URAM, 95% DSP, and 40% BRAM. The INT8 extraction method [11] is again used with the DSP Blocks. Several DL applications are benchmarked; although significantly more DSP Blocks are utilized, throughput is reduced by 30% over [12]; however, latency is reduced by 7X. Assuming that this design scales to the somewhat smaller VU9P, and applying the utilization ratios and clock frequency, we calculate a scaled performance of 16.9 TOPs.

In [15] a Stratix 10 2800 device is filled with dot products, arranged as tree structures. A new type of INT8 extraction to obtain two INT8 multipliers per 18x18 DSP Block multiplier (with the use of some soft logic) is described, with an average cost of about 8 ALMs per INT8 multiplier. The INT8 multipliers are grouped in 32 element dot products, and a full chip dot product array is implemented. Fractal Synthesis [10] is used to pack the logic more efficiently, leaving the unused logic and routing available in large contiguous blocks, where it can presumably easily be accessed for any application level functionality. The dot products are individually instantiated, so that all inputs and outputs must be individually sourced and terminated. This is done by using virtual pins. The presented design contains 22,400 INT8 multipliers (700 dot products); with shared inputs this amounts to 283,500 virtual pins - this alone requires 141,500 ALMs of the placed and routed design. The logic for these virtual pins would also be freed up if the connections were from a system level design - as these virtual pins represent 15% of the available logic on the device, this is almost enough to implement our entire Brainwave inspired design, which requires about 20% of the 2800 logic. (The connections to the dot products would largely come from the M20K memories, and not through logic, for this design.) The Stratix 10 2800 (933K ALMs, 5760 DSP Blocks) device is slightly smaller than the VU9P device (1182K 6LUTs, 6840 DSP48s).

Comparing the two Virtex Ultrascale (VUP) designs, we can see that the dataflow architecture has a significant impact on metrics such as throughput and latency, and that the performance level of the final design cannot be independently determined simply by looking at memory size, bandwidth, and number of TOPs. The performance, however, cannot exceed the maximum number of TOPs. The Xilinx targeted designs also rely on the hard cascade feature of the DSP48 blocks, which will introduce a linear latency to any vector operation. In contrast, Brainwave uses a dot product implementation, with a combination of DSP Blocks and soft logic. We note that Microsoft Brainwave is the highest profile production design currently in use, with tens of thousands of nodes deployed [8] on a relatively large device. A recent paper from another commercial organization [16] used smaller Xilinx devices, utilizing only 2070 DSP48 blocks.

A brief survey of other recent works using our targeted Stratix 10 2800 device gives some additional datapoints. In [17], up to 100% DSP with 76% of the ALMs were used, showing that large designs are possible on this device with a variety of approaches, although this example only achieved

Fig. 1. DSP Block based unsigned 7x7 multiplier: (a) Architecture ({a,0,0,0,0,b} packing, 18x7 DSP multiplier, 3x3 LSB multiplier, subtractor on o[24:11] producing {z[13:3],y[13:11]} and z[2:0]); (b) Bit Alignments
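The arithmetic behind Fig. 1 can be checked numerically. The sketch below is an illustrative model, not the paper's RTL: following our reading of the figure's labels, two 7-bit multiplicands `a` and `b` are packed with a zero nibble between them, multiplied by a shared 7-bit operand `c` in a single wide multiply, and both products are then recovered using only a 3-bit LSB-only multiplier and a subtraction on the upper output field.

```python
def dual_mult_7x7(a: int, b: int, c: int):
    """Recover a*c and b*c from one 18x7 multiply plus a 3x3 LSB-only
    multiplier, modeling the structure shown in Fig. 1 (illustrative)."""
    assert 0 <= a < 128 and 0 <= b < 128 and 0 <= c < 128
    packed = (a << 11) | b            # {a, 0000, b}: two multiplicands, a zero nibble apart
    o = packed * c                    # the single wide DSP multiplication
    y = o & 0x7FF                     # o[10:0]: lower 11 bits of b*c, output directly
    z_low = ((a & 7) * (c & 7)) & 7   # 3x3 LSB-only multiplier: lower 3 bits of a*c
    corrected = (o >> 11) - z_low     # subtractor applied to o[24:11]
    ac = ((corrected >> 3) << 3) | z_low   # upper bits of a*c rejoined with z[2:0]
    bc = ((corrected & 7) << 11) | y       # upper 3 bits of b*c rejoined with y[10:0]
    return ac, bc

# spot-check a few operand combinations (an exhaustive 7-bit sweep is also cheap)
for a, b, c in [(0, 0, 0), (1, 1, 1), (85, 39, 101), (127, 127, 127)]:
    ac, bc = dual_mult_7x7(a, b, c)
    assert ac == a * c and bc == b * c
```

The subtraction works because o[24:11] equals a*c plus the 3-bit overflow of b*c; removing the (independently computed) low 3 bits of a*c leaves those two quantities in disjoint bit fields.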

240 MHz. Another design [18] had a larger ratio of DSP Blocks (71%) to ALMs (15%), but despite the relative sparseness of the utilization, was slower at 200 MHz.

Power was not reported for the VUP designs. Microsoft reports 125W for a full chip Stratix 10 2800, which uses the majority of resources at a lower frequency. The VUP designs use a significant amount of memory and are DSP rich, and have a relatively low logic utilization, but run the DSP blocks (and in [14], the memory as well) at very high speed to achieve throughput. While we cannot compare any Xilinx power numbers, we can see that the power envelope for a full device Stratix 10 2800 @400MHz should still be coolable in a production environment. We therefore set ourselves a target of a design that uses around 80% of the device resources (but, if possible, a lower percentage of memory blocks), capable of running in the range of 400 MHz on a previous generation FPGA. In [6], a relatively small VUP based CNN accelerator, using only 1,106 of the 6,840 DSP48 blocks on the device (although 79% of the logic), consumed 75W at a 200 MHz clock. This was apparently measured at the board level, so may not accurately reflect the FPGA power consumption. Power is reported for [17], but these are simulated numbers from the EPE tool, and not actually measured.

We will build a number of dot product structures, balancing DSP based implementations with multipliers constructed from different ratios of M20K blocks and soft logic, in order to provide a wider choice in resource balancing. We will also arrange the dot products in a 2D systolic architecture so that memory bandwidth, or at least the I/O bandwidth, is reduced. This may increase the stress on inter PE routing, but will also give a worst-case routing congestion datapoint. If a 2D systolic structure is not suitable, the dot products could still be individually instantiated. A key point of our designs over [12][14] is that the dot product latency varies logarithmically rather than linearly with the dot size, reducing system latency, and giving more flexibility to the matrix and vector decompositions.

III. METHOD

We started by first creating some efficient SM1.7 operators, based on both DSP Blocks and M20K memory blocks. In both cases, a small amount of soft logic is required. Many of our dot products will also require pure soft logic multipliers, and we will use the Quartus based IP for these.

A. DSP Block Based Multipliers

In [15] several new DSP block based methods for INT8 multipliers were introduced. We will modify these to implement unsigned INT7 (UINT7) multipliers instead.

Both [15] and [11] use a signed or twos complement format to represent input values. There are some FPGA implementation efficiencies if we use SM format instead. The dynamic range is almost identical, with the only difference being that the number -128 cannot be represented in SM1.7. The sign is simply calculated as (sign_a) XOR (sign_b). Although the first level of the adder tree is arithmetically more complex, as the signs can result in subtractions as well as additions, this can be supported in the logic associated with the FPGA carry chain, and is therefore free at the implementation level.

The magnitude bits require a UINT7 multiplier. We use a modified technique of [15], with each INT18 multiplier in the DSP block acting as an unsigned 18x7 multiplier. The two 7 bit multiplicands are input with a zero nibble between them. The output of the multiplier contains the sum of two partially overlapping 14 bit results. A correction factor of 3 bits can be calculated by a subset multiplier, which only outputs the 3 LSBs of a 3x3 bit multiplication. The description in [15] does

Fig. 2. M20K Dual Port ROM - Pair of 7x4 Multiplications (Ports A and B: 11 bit ROM address inputs addr[10:0], 10 bit data outputs dout[9:0])
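One ROM port of Fig. 2 can be modeled directly: an 11-bit address formed from the 7-bit and 4-bit operands looks up the product with its LSB dropped (only 10 output bits are available), and the LSB is recomputed as a 2-input logic function. This is a sketch consistent with the description in the text; the table layout is an assumption.

```python
# Model of one M20K ROM port as a 7x4 multiplier: 2^11 addresses, 10-bit data.
# The 11-bit product a*b is stored without its LSB; that LSB is just
# a[0] AND b[0], recomputed in soft logic.
ROM = [((addr >> 4) * (addr & 0xF)) >> 1 for addr in range(1 << 11)]

def mult_7x4(a: int, b: int) -> int:
    addr = (a << 4) | b          # {a[6:0], b[3:0]} forms the ROM address
    lsb = a & b & 1              # 2 bit logic function for the missing LSB
    return (ROM[addr] << 1) | lsb

# exhaustive check over all 7x4 operand pairs
assert all(mult_7x4(a, b) == a * b for a in range(128) for b in range(16))
```

The 7x3 port works the same way after treating the 3-bit operand as a 4-bit value and right shifting the result, as the text explains.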

not describe the construction of this multiplier, but the small size means that it can be implemented relatively efficiently by table look up, especially if 6LUTs are natively supported. We used the techniques in [19] to map this multiplier to the ALM structure in the Stratix 10 device.

Fig. 3. Arithmetic Equivalence of M20K Mapping

We simply subtract the value z[2:0] from o[24:11] to obtain the upper 11 bits of a*c, or z[13:3], and the 3 upper bits of a*b, or y[13:11]. The lower 3 bits of a*c are output from the LSB-only multiplier, and the lower 11 bits of a*b are output directly from the DSP Block. Both the LSB multiplier and the subtractor can be implemented in 7.5 ALMs, or just under 4 ALMs per UINT7 multiplier on average. Depending on the performance required, additional logic may be needed to introduce pipelining, although this may be amortized over larger structures like dot products.

We re-used the techniques of [15] for the INT8 dot products.

B. M20K Based Multipliers

The features and organization of the M20K memory block lend themselves well to 7 bit multipliers. This 20Kb block can be configured as a dual port ROM, with an 11 bit address space, and therefore a 10 bit output. This supports two independent 7x4 multiplications, as shown in Figure 2.

One set of inputs multiplies A[6:0]*B[3:0], and the other A[6:0]*B[6:4] - in other words, we only use the 7x3 capability of the second input. As we only have 10 bits of the required 11 bit output available, we do not support the LSB, which can be inexpensively calculated in a 2 bit logic function. This is not an issue for the 7x3 case, as we treat the 3 bit input as a 4 bit value with a LSB, and right shift the output by 1 bit. From the dot diagram in Figure 2, the least significant partial product disappears, and we end up with 7x4 and 7x3 products, respectively. We then sum the appropriate segments to get our result, and append the separately calculated LSB. The total cost is 6.5 ALMs: 5 ALMs for the adder, 1/2 an ALM for the LSB, and 1.5 ALMs to latency balance the LSBs of the 7x4 multiplier. The latter can again be amortized across a dot product for additional savings.

We then extended the memory based multiplier method to INT8. Using the M20K here is more difficult. First of all, both multiplier halves in the UINT7 case were identical - the multiplicand was the same, and both multiplier subsets were unsigned, even though the upper one only used 3 of the 4 bit dataset. For INT8, the upper multiplier half is signed, while the lower half is unsigned (and identical to the UINT7 case). This means that the two multiplier nibbles need separate tables. To support this efficiently, a pair of INT8 multipliers needs to be constructed together, with both upper and lower multiplier nibbles addressing their own dual port ROMs. The granularity of the INT8 M20K based method is therefore two.

Figure 4 shows a DOT representation of this. The blue and orange input boxes (and the same colored dots) represent the unsigned 7 bit multiplicand and signed 4 bit multipliers, and the red and green input boxes the unsigned 7 bit multiplicand and the unsigned 4 bit multiplier. The yellow dots on the LSBs of the partial products are calculated in soft logic, just like the LSB for the first partial product in the UINT7 case. The missing information is the product of the MSB of the multiplicand and the multiplier, also shown as yellow dots. Like the LSBs, these are calculated in soft logic. The output stage now has three rows, but we do not need to add them all here. The four MSBs of the third row can be moved to the empty positions of the first row; the existing partial product adder remains the same size. The remaining 4 bits in the third row can be added with the same 4 bits from any other multiplier in the DOT, and only folded into the final result at the output.

The INT8 version requires an additional 4.5 ALMs over the

Fig. 4. Int8 M20K Mapping
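The INT8 mapping of Fig. 4 rests on splitting the twos complement multiplier into a signed upper nibble and an unsigned lower nibble, which is why the two halves need separate ROM tables. A sketch verifying that decomposition (the helper name `split_int8` is hypothetical; the widths follow the surrounding text):

```python
# An INT8 (twos complement) operand splits into a signed upper nibble and an
# unsigned lower nibble: y = 16*yh + yl, with yh in [-8, 7] and yl in [0, 15].
# The product against a multiplicand x then becomes two narrow multiplies.
def split_int8(y: int):
    yh = y >> 4      # arithmetic shift: the signed upper nibble
    yl = y & 0xF     # the unsigned lower nibble
    return yh, yl

for x in range(-128, 128):
    for y in range(-128, 128):
        yh, yl = split_int8(y)
        assert -8 <= yh <= 7 and 0 <= yl <= 15
        assert x * y == 16 * (x * yh) + x * yl
```

In the UINT7 case both nibble products were unsigned, so one table served both halves; here the signed upper-half product forces the pair-of-multipliers granularity described above.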

Fig. 5. Mixed Multiplier Construct - 16 Element dot product (M20K + ALM, DSP + ALM, and ALM multiplier blocks)

UINT7 counterpart. The total logic cost - around 10 ALMs - is considerably less than the 30 ALMs required for the pure soft logic IP provided by Quartus. The die cost of the M20K is likely considerably larger than the difference of 20 ALMs, but it does allow us additional choices in balancing resource use on the FPGA. As we are now attempting to fit multiple tens of thousands of multipliers on an FPGA, a small decrease in soft logic per multiplier can have a marked impact.

We can now assemble dot products. We use 16 element dot products to more closely correlate inter-PE routing resources with the density of long vertical wires in the Stratix 10 devices. (We originally tried 32 element dot products like [15], but found that the 2D systolic array architecture suffered from excessive routing congestion, affecting Fmax.) We will first implement SM1.7 based cores. Our 16 element dot products contain 11 DSP based SM1.7 multipliers, with a mix of M20K and logic based multipliers. Figure 5 shows a block diagram of a dot product with 5 M20K based multipliers, annotated with implementation resource per function (there are no purely logic based multipliers in this design).

We can then start creating systolic arrays of these dot products. Each PE consists of a 16 element dot product, and two directions of activation and weight routing, vertical and horizontal. Each PE also has a separate output. Our building block is a 2x2 PE grid, which we then replicate to make a larger systolic array. Figure 6 shows the connections per individual PE, and Figure 7 shows the logical structure of an array of PEs.

Fig. 6. PE Dot Product and Ingress/Egress Ports

TABLE I
RESOURCE REQUIREMENTS FOR DIFFERENT SM1.7 PE COMPONENT MIXES

M20Ks SM1.7 | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs
0 | 1573 | 53 | 482 | 3694 | 0 | 11
1 | 1493 | 57 | 471 | 3512 | 4 | 11
2 | 1411 | 54 | 448 | 3330 | 8 | 11
3 | 1328 | 58 | 455 | 3148 | 12 | 11
4 | 1245 | 51 | 409 | 2966 | 16 | 11
5 | 1164 | 50 | 366 | 2784 | 20 | 11

Table 1 shows the size of our base 2x2 PE grid, with varying numbers of M20K based SM1.7 multipliers per PE. In each case, 11 of the 16 multipliers in each PE are DSP based multipliers, with the remaining 5 being a combination of multipliers constructed from memory or soft logic. There is a significant cost to the logic based multipliers, but these may be required in cases where the system design requires a large number of memory ports to feed the dot products. We
have seen in the earlier [8][12][14] cases that there is a large variation in the memory depth and bandwidth to multiplier relationship. Further implementation flexibility is possible by mixing grids with different multiplier composition ratios.

Because of the more restrictive memory based approach in the INT8 case, we only composed one version of the 16 element dot product, split into 4 and 12 elements of the M20K and DSP Block based versions, respectively. Table 2 reports the resources in the same manner as Table 1.

Fig. 7. 2D PE Array

TABLE II
RESOURCE REQUIREMENTS FOR AN EXAMPLE INT8 2X2 GRID

M20Ks INT8 | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs
4 | 1606 | 54 | 730 | 4124 | 16 | 12

For all of our designs, we targeted a 1SG280LN2F43E2VG, using Quartus 19.1. It is important to note that this is a mid-speed grade device.

IV. RESULTS - SYSTOLIC ARRAYS

We created a number of chip filling designs (to approximately 80% logic usage, as discussed earlier in our goals for this work), with different types of PE. In these reported cases, all PE compositions were uniform across the design, but this does not have to be the case. We first built these full chip designs with three variants of the SM1.7 dot products: 5, 3, and 2 M20K multipliers per PE. The results are listed in Table 3. In every case, the number of PEs was 2048, arranged as a 32x64 grid of PEs (16x32 grid of our 2x2 grids). In all cases, these designs were placed and routed using a push button flow. No floorplanning was used, even for the very large arrays.

We can achieve a successful fit of 32K multipliers, with consistent results over different multiplier component mixes, while leaving enough logic left over to support a system design in the Brainwave style. Our performance consistently exceeds 400MHz, so a performance level of 25TOPs to 30TOPs is a reasonable target for this device, and is likely scalable within this device family.

The INT8 versions gave us both slightly lower performance and lower density. We initially tried a near maximum DSP usage for a 30x60 grid of PEs (28,800 INT8 multipliers), but the speed dropped below 400 MHz. We attempted to improve the speed by pipelining the critical path, which was from the output of the DSP Blocks to the immediately following adders, but this only increased area, with a minimal impact on performance. Reducing the grid size was more successful - although the 25,088 multipliers in this grid are only a 12% density improvement on the previously reported Intel results, a large number of DSP Blocks have been freed up for other tasks, such as activation function implementation. About half the M20Ks remain for the system design. In all of our SM1.7 and INT8 designs, the dot product composition is homogeneous throughout the grid. By mixing and matching, e.g. having some PEs implemented only with DSP Blocks and some with a mixture of DSP Blocks and memory, and possibly logic, different amounts and distributions of resources can remain free.

Our example systolic arrays used virtual pins, as in [15]. The number of virtual pins in our design was considerably less, because the 2D systolic structure includes PE to PE communication paths. There are a total of 51,200 virtual pins, consisting of 8K left side input bits, 4K bits in from the top, and 38,912 output bits. These virtual pins occupy 25600 ALMs.

A floorplan and routing heatmap are shown in Figure 8. This particular example was for SM1.7, with 5 M20K based multipliers per dot product. The heatmap for all of the SM1.7 and INT8 grids is almost identical, with the highest stress (colored in pink) concentrated around the discontinuities on the die. These include the embedded processor subsystem on the lower right edge, and the IO regions running vertically through the die. This suggests that a system using a collection of smaller dot product arrays, or cases where only a smaller portion of the FPGA is used for the deep learning application, may be able to increase the clock frequency of the dot product array significantly (we found that small grids often exceeded 600MHz).

Figure 9 shows an intermediate sized array (20x40 dot products), which is about 40% of the maximum size limit, based on both our resource goal, and the number of DSP Blocks in the device. The fractal synthesis clustering and packing algorithms direct Quartus to create an almost perfectly rectangular layout, leaving the majority of the device completely untouched.

The Brainwave design contained 96K multipliers with a 1.2 magnitude format. Our work shows that 32K multipliers for a much larger 1.7 magnitude format is readily achievable - this translates to about a 3x ISO density improvement, based on arithmetic complexity (which varies as the square of relative precision).
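The grid bookkeeping quoted in this section is easy to re-derive; a small sketch using only numbers stated in the text (the helper name `grid_stats` is an assumption):

```python
# Re-derive the PE, multiplier, and virtual pin counts from the grid
# dimensions: each PE holds one 16 element dot product.
def grid_stats(rows: int, cols: int, mults_per_pe: int = 16):
    pes = rows * cols
    return pes, pes * mults_per_pe

assert grid_stats(32, 64) == (2048, 32768)   # full SM1.7 array: 2048 PEs, 32K multipliers
assert grid_stats(30, 60)[1] == 28800        # largest INT8 attempt
assert grid_stats(28, 56)[1] == 25088        # reduced INT8 grid
# virtual pins: 8K left side inputs + 4K top inputs + 38,912 outputs
assert 8 * 1024 + 4 * 1024 + 38912 == 51200
```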

TABLE III
LARGE 2D SYSTOLIC ARRAY (SM1.7) RESULTS

M20Ks SM1.7 | ALM (total) | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs | FMax (MHz)
5 | 654K (70%) | 429K | 24K | 278K | 1676K | 10240 (87%) | 5632 (98%) | 432
3 | 735K (79%) | 499K | 23K | 268K | 1852K | 6144 (52%) | 5632 (98%) | 445
2 | 774K (83%) | 534K | 22K | 251K | 1940K | 4096 (35%) | 5632 (98%) | 413

TABLE IV
LARGE 2D SYSTOLIC ARRAY (INT8) RESULTS

Array Size | ALM (total) | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs | FMax (MHz)
30x60 | 727K (78%) | 468K | 20K | 298K | 1888K | 7200 (61%) | 5400 (94%) | 378
28x56 | 693K (74%) | 407K | 18K | 345K | 1888K | 6272 (54%) | 4704 (82%) | 404
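The table rows translate directly to peak throughput using the usual convention of two operations (multiply and accumulate) per multiplier per cycle; that convention is an assumption here, but it reproduces the 25-30 TOPs range claimed for the SM1.7 arrays.

```python
# Peak throughput in TOPs: 2 ops (multiply + accumulate) per multiplier per cycle.
def tops(multipliers: int, fmax_mhz: float) -> float:
    return 2 * multipliers * fmax_mhz * 1e6 / 1e12

# SM1.7 full array (Table III): 32,768 multipliers at 432-445 MHz
assert 28.0 < tops(32768, 432) < 30.0
# INT8 28x56 array (Table IV): 25,088 multipliers at 404 MHz
assert 20.0 < tops(25088, 404) < 21.0
```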

Fig. 8. Floorplan (a) and Heatmap (b) of a 32K multiplier 2D Systolic Array

V. RESULTS - SYSTEM DESIGNS

After validating our approach with the independent 2D systolic arrays, we implemented a more representative system design. In this case, all I/Os are mapped to pins. First, each PE in the systolic array is terminated in a 32b integer accumulator. The PE routing of figure 6 is modified to support systolic propagation of the accumulator results to the core periphery. Finally, small (512 word) input and output RAMs are attached to the periphery for flow control. Muxing is used to select the current output stream.

For any given array size, the number of DSP Blocks remains the same, while there are small (typically 5%) increases in soft logic and memory sizes. There is virtually no impact on speed. We can see that for even very large (30x60) PE arrays, there is still some logic margin available, and for all array sizes, the memory availability is greater than the logic availability. All of this bodes well for incorporating these structures into a more complex deep learning system. Alternately, the system design can be used as an accelerator as is.

We noticed an initial speed degradation of about 10% compared to the array core, and decided to introduce some minimal floorplanning. Pinning the corners of the arrays achieved an immediate result, with the system designs running about 10% faster than the push button arrays.

Figure 10 shows a block diagram of the system array. The flow control memories can be seen in the periphery. The routing heatmap of Fig. 11 following our minimal floorplan intervention shows less stress than the systolic array core alone.
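The PE structure described above, where each element multiplies, accumulates into a 32-bit register, and forwards its operand systolically, can be sketched as a minimal behavioral model. This is an illustrative sketch of ours, not the paper's RTL; the time-skew of operands in a real systolic array is omitted for brevity.

```python
# Behavioral sketch of one systolic PE chain (illustrative assumption):
# each PE holds a weight, accumulates weight * activation into a 32-bit
# register (masked to model wraparound), and forwards the activation.
MASK32 = (1 << 32) - 1

class PE:
    def __init__(self, weight: int):
        self.weight = weight
        self.acc = 0

    def step(self, activation: int) -> int:
        """Consume one activation, update the accumulator, pass it on."""
        self.acc = (self.acc + self.weight * activation) & MASK32
        return activation  # systolic forwarding to the next PE

def run_chain(weights, activations):
    pes = [PE(w) for w in weights]
    for a in activations:
        x = a
        for pe in pes:
            x = pe.step(x)
    return [pe.acc for pe in pes]

# Each PE accumulates weight * sum(activations).
print(run_chain([1, 2, 3], [4, 5, 6]))  # [15, 30, 45]
```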

TABLE V
SYSTOLIC ARRAY SYSTEM (INT8) RESULTS

Array Size  ALM         Pins       FFs    M20K        DSPs        FMax (MHz)
24x48       503K (54%)  357 (39%)  1300K  4944 (42%)  3456 (60%)  464
26x52       588K (63%)  357 (39%)  1538K  5772 (49%)  4056 (70%)  458
28x56       681K (73%)  357 (39%)  1775K  6664 (57%)  4704 (82%)  439
30x60       778K (83%)  357 (39%)  2027K  7620 (65%)  5400 (94%)  402
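The DSP counts in Table V are exactly three DSP blocks per PE for every array size, an observation derived from the tabulated data rather than a claim stated in the text:

```python
# Cross-check of Table V: reported DSP usage equals 3 DSP blocks per PE.
table_v_dsps = {  # (rows, cols) -> reported DSP count
    (24, 48): 3456,
    (26, 52): 4056,
    (28, 56): 4704,
    (30, 60): 5400,
}
for (rows, cols), dsps in table_v_dsps.items():
    assert dsps == 3 * rows * cols
print("all Table V rows consistent with 3 DSPs per PE")
```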

Fig. 11. Routing Heatmap


Fig. 9. 20x40 PE 2D Systolic Array

Fig. 10. Systolic Array Block Diagram (input RAM banks on the DATA A/ADDR A and DATA B/ADDR B ports behind an address decoder; output RAM banks and output muxes on the ADDR OUT/DATA OUT ports)

VI. CONCLUSIONS

Our goal was to develop a method to create large, scalable, and high performance systolic arrays for DL at an 8 bit precision. We showed new methods for INT8 multiplier based dot products, and introduced SM1.7 as an even more efficient format for implementing these arithmetic constructs with this precision. The resultant designs achieve their goals with consistent and deterministic speed, as well as low latency. Our throughput is 10% to 50% higher than other designs implemented on the same or slightly larger devices, by utilizing a balanced implementation mix (the ratio of multipliers and soft logic to DSP blocks). We were able to realize these results even using a single monolithic chip-scale 2D systolic array.

We can scale these arrays from small to large, while ensuring an efficiently packed, placed and routed result, without floorplanning, simplifying the design process greatly. We then extended the arrays to include flow control so that they can be used as an accelerator, either as a stand-alone device, or as part of a more complex deep learning design on the FPGA. Using a small amount of floorplanning, performance increased, delivering over 400MHz on a mid-speed grade device, for even a very full design.
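SM1.7 is a signed-magnitude format (one sign bit, seven magnitude bits). One reason such a format suits dense multiplier arrays is that the product sign reduces to an XOR while the magnitudes multiply unsigned; the sketch below illustrates this, with the encoding range taken as an assumption from the format name rather than from this excerpt:

```python
# Signed-magnitude multiply sketch (assumption: SM1.7 encodes one sign
# bit plus a 7-bit magnitude, i.e. values in [-127, +127]).
def to_sm(x: int) -> tuple[int, int]:
    """Decompose an integer into (sign bit, magnitude)."""
    assert -127 <= x <= 127
    return (1 if x < 0 else 0, abs(x))

def sm_mul(a: int, b: int) -> int:
    sa, ma = to_sm(a)
    sb, mb = to_sm(b)
    mag = ma * mb      # unsigned 7x7 -> up to 14-bit product
    sign = sa ^ sb     # product sign is just an XOR of the sign bits
    return -mag if sign else mag

print(sm_mul(-5, 7), sm_mul(-5, -7))  # -35 35
```

Because no sign extension of partial products is needed, the unsigned magnitude multipliers pack more densely than two's-complement ones, which is consistent with the efficiency claim above.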


