
2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

High Density 8-bit Multiplier Systolic Arrays for FPGA

Martin Langhammer, Sergey Gribok, Gregg Baeckler
Intel Corporation
Martin.Langhammer@intel.com, Sergey.Gribok@intel.com, Gregg.Baeckler@intel.com

Abstract—Artificial Intelligence (AI) has become the fastest growing application area for FPGAs. Two types of numerics are needed. Training typically uses floating point arithmetic (which is now widely available as embedded functions in current FPGAs). Inference is typically calculated with lower precision integer numbers, which can be implemented with embedded functions, soft logic, or a combination of the two. INT8 performance is therefore used as a typical benchmarking metric for current FPGAs. Recent publications based on Xilinx devices show the extraction of two INT8 multipliers from a 24x18 multiplier. A paper from Intel describes how to obtain two INT8 multipliers from an 18x18 multiplier, with the help of a small amount of soft logic. In this paper we introduce a number of new INT8 multiplier techniques, starting with the Intel 18x18 multiplier approach. Using both memory and logic resources - for a more balanced use of the FPGA features - we improve the INT8 density, and also show a signed-magnitude (SM) 1.7 construct that is even smaller. To demonstrate the usability of these new multipliers, we develop a scalable systolic array that contains up to 32,768 SM1.7 multipliers, or 28,800 INT8 multipliers, fit into an Intel Stratix 10 2800 device. Finally, we implement a system architecture that includes input and output flow buffering and control, which can be instantiated directly into a larger AI design, or can enable the FPGA to be used as a standalone accelerator. This system exceeds 400 MHz for the largest array on a mid-speed device (26 TOPS INT8), and can operate up to 600 MHz for smaller array sizes.

I. INTRODUCTION

In deep learning (DL), multiplier density and performance - whether TOPs or TFLOPs - sets the performance expectation of the implementation. Dot product or matrix-vector arrays are the most common structure for these [1]. One advantage of the FPGA is flexibility; dataflow can be configured for each application, and non-linear activation functions like tanh and sigmoid [2] can be inserted anywhere. Support of simpler functions such as RELU is trivial. Additional optimizations like precision or numerical representation scaling can be varied from design to design. But the FPGA has less raw performance potential compared to ASIC at similar process nodes. Newer FPGA devices are addressing this gap with more embedded features, especially for lower precision floating point and integer numbers. Xilinx has introduced the ACAP concept [3], with embedded processors supporting IEEE754 FP32, INT16, and INT8 representations. The Intel Agilex FPGA [4] now has IEEE754 FP16, BFLOAT16, as well as INT9 support in the DSP Blocks. These two device families have just been introduced, which means that the mainstream devices generally available still lack the performance required for this important application area.

We also considered different types of numerical representation for this work: signed magnitude is potentially more efficient, and aligns with the precision component of the IEEE754 floating point numerics. In [5], custom floating point numbers with a variable SM mantissa were used, so we have precedence and results on FPGA for this approach. A recent work [6] using a 9 bit custom floating point format on a Xilinx VUP device also uses a signed magnitude mantissa representation, with two mantissa multipliers mapped to a single DSP48. Although not using signed magnitude, [7] points to floating point processing, particularly with shared exponents, as an increasing use model for FPGA. Our goal in this work was not to benchmark another deep learning application, but to improve the performance density of the mainstream FPGA considerably, in order to maintain competitiveness with GPUs and the emerging ASSPs. We took some inspiration from the Microsoft Brainwave work [8][9], where a Stratix 10 2800 device was packed to 92% logic density. According to [10], 80% of the device logic was used by the datapaths, packed to 97%, and the remaining 20% of the logic was control and data distribution, packed to 80% efficiency. Another important point from [9] was that around 2/3rds of the memory blocks were used - presumably largely for data distribution, i.e. feeding the dot products. In this work, we will aim for an 80% device logic resource utilization for our datapaths, while at the same time incorporating some of the free memory into the arithmetic calculations. We will still use all of the DSP blocks that we are able to access. Our contributions are as follows:

1) We develop an improvement to published INT8 results to support SM1.7, using a combination of DSP blocks, embedded memory blocks, and soft logic.
2) We describe dot structures for SM1.7 that support both high density and high packing efficiency.
3) The SM1.7 logic and memory implementation methods are mapped back to the INT8 datapaths, with similar improvements in multiplier density.
4) We demonstrate a scalable 2D systolic array that fits to deterministic area and geometry automatically - without floorplanning - from a small grid to a full chip design.
5) We provide a lightweight, low cost, scalable interface which can source and sink any size of the 2D systolic

2576-2621/20/$31.00 ©2020 IEEE
DOI 10.1109/FCCM48280.2020.00021
arrays to the pins, or a user design, while maintaining the performance of the systolic array.

II. PREVIOUS WORK

The production Brainwave deployment [8] used 91% of the logic, 69% of the M20Ks, and 91% of the DSP on the target Stratix 10 2800 device. Much of the M20K resource was used to store activations and weights locally. We will attempt to employ some of the 30% of the unused M20K to perform arithmetic operations in the dot product structure. The clock rate of the Brainwave design was limited to 250 MHz [8], although they earlier reported 300 MHz for their prototypes [9]. Brainwave uses 96K multipliers with a block floating point {1.5.2} format per device. We are instead using an SM1.7 format for the mantissa portion. An SM1.7 multiplier is much larger arithmetically than the SM1.2 multiplier in Brainwave - by over 10x. The number of SM1.7 multipliers that can be implemented in the FPGA will therefore be significantly less, but we will show that our new techniques will be able to dramatically close this gap in operations.

The Xilinx whitepaper WP486 [11] shows how to extract two INT8 multipliers from a DSP48 if one of the inputs is shared. A reference point in this paper will be the Xilinx VU9P with 6840 DSP48 blocks, giving a maximum possible of 13680 INT8 multipliers using their method. Xilinx does not show any system level results in their whitepaper, although a number of other works based on this have been published recently.

In [12], the Xilinx Supertile is used at double the clock frequency of the surrounding logic (720 MHz/360 MHz), but only 55% of the DSP Blocks are used. Although not explicitly stated, this is because the application implementation is memory bandwidth limited, as all of the large memory blocks (URAMs) are used (only 40% of the smaller 36kb memories are used, however). The earlier Xilinx Supertile paper [13] describes the implementation of a large 96x16 (1536 DSP48 blocks) processing array, albeit on the smaller VU3P device. This design takes up 67% of the available DSP48 blocks, and runs at the 775MHz-891MHz (speed grade dependent) datasheet speed. In [12] speed degradation to 92% of this rate is observed in the larger VU9P (which is essentially three VU3P die stitched together). According to [13], 100% DSP use for the Supertile uses only 25% of the device logic, although this 100% is only an extrapolated number based on the realized smaller design. The Supertile paper also uses INT16, while [12] implements INT8 via the technique from [11]. Extrapolating to a maximum systolic INT8 array use on a VU9P (100% DSP and 25% logic utilization) gives 19.7 TOPs. The performance based on the demonstrated 96x16 array replicated thrice in the VU9P is 13.3 TOPs. This array architecture is based on groups of small cascaded DSP Blocks.

In [14], an alternate assembly of the DSP48 Blocks is presented. As opposed to the Supertile, the memory and DSP Blocks operate at the same clock rate. Logic use is reduced - presumably as the data rate matching is not required - but the clock rate is slightly reduced to 650MHz, mapped to a large VU37P device. This design incorporates a more balanced use of the embedded resources, and uses 100% URAM, 95% DSP, and 40% BRAM. The INT8 extraction method [11] is again used with the DSP Blocks. Several DL applications are benchmarked; although significantly more DSP Blocks are utilized, throughput is reduced by 30% over [12]; however, latency is reduced by 7X. Assuming that this design scales to the somewhat smaller VU9P, and applying the utilization ratios and clock frequency, we calculate a scaled performance of 16.9 TOPs.

In [15] a Stratix 10 2800 device is filled with dot products, arranged as tree structures. A new type of INT8 extraction to obtain two INT8 multipliers per 18x18 DSP Block multiplier (with the use of some soft logic) is described, with an average cost of about 8 ALMs per INT8 multiplier. The INT8 multipliers are grouped in 32 element dot products, and a full chip dot product array is implemented. Fractal Synthesis [10] is used to pack the logic more efficiently, leaving the unused logic and routing available in large contiguous blocks, where it can presumably easily be accessed for any application level functionality. The dot products are individually instantiated, so that all inputs and outputs must be individually sourced and terminated. This is done by using virtual pins. The presented design contains 22,400 INT8 multipliers (700 dot products); with shared inputs this amounts to 283,500 virtual pins - this alone requires 141,500 ALMs of the placed and routed design. The logic for these virtual pins would also be freed up if the connections were from a system level design - as these virtual pins represent 15% of the available logic on the device, this is almost enough to implement our entire Brainwave inspired design, which requires about 20% of the 2800 logic. (The connections to the dot products would largely come from the M20K memories, and not through logic, for this design.) The Stratix 10 2800 (933K ALMs, 5760 DSP Blocks) device is slightly smaller than the VU9P device (1182K 6LUTs, 6840 DSP48s).

Comparing the two Virtex Ultrascale (VUP) designs, we can see that the dataflow architecture has a significant impact on metrics such as throughput and latency, and that the performance level of the final design cannot be independently determined simply by looking at memory size, bandwidth, and number of TOPs. The performance, however, cannot exceed the maximum number of TOPs. The Xilinx targeted designs also rely on the hard cascade feature of the DSP48 blocks, which will introduce a linear latency to any vector operation. In contrast, Brainwave uses a dot product implementation, with a combination of DSP Blocks and soft logic. We note that Microsoft Brainwave is the highest profile production design currently in use, with tens of thousands of nodes deployed [8] on a relatively large device. A recent paper from another commercial organization [16] used smaller Xilinx devices, utilizing only 2070 DSP48 blocks.

A brief survey of other recent works using our targeted Stratix 10 2800 device gives some additional datapoints. In [17], up to 100% DSP with 76% of the ALMs were used, showing that large designs are possible on this device with a variety of approaches, although this example only achieved

Fig. 1. DSP Block based unsigned 7x7 multiplier: (a) Architecture ({a,0,0,0,0,b} packing, 18x7 DSP multiplier, 3x3 LSB multiplier, subtractor on o[24:11] producing {z[13:3],y[13:11]} and z[2:0]); (b) Bit Alignments
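The arithmetic behind Fig. 1 can be checked numerically. The sketch below is an illustrative model, not the paper's RTL: following our reading of the figure's labels, two 7-bit multiplicands `a` and `b` are packed with a zero nibble between them, multiplied by a shared 7-bit operand `c` in a single wide multiply, and both products are then recovered using only a 3-bit LSB-only multiplier and a subtraction on the upper output field.

```python
def dual_mult_7x7(a: int, b: int, c: int):
    """Recover a*c and b*c from one 18x7 multiply plus a 3x3 LSB-only
    multiplier, modeling the structure shown in Fig. 1 (illustrative)."""
    assert 0 <= a < 128 and 0 <= b < 128 and 0 <= c < 128
    packed = (a << 11) | b            # {a, 0000, b}: two multiplicands, a zero nibble apart
    o = packed * c                    # the single wide DSP multiplication
    y = o & 0x7FF                     # o[10:0]: lower 11 bits of b*c, output directly
    z_low = ((a & 7) * (c & 7)) & 7   # 3x3 LSB-only multiplier: lower 3 bits of a*c
    corrected = (o >> 11) - z_low     # subtractor applied to o[24:11]
    ac = ((corrected >> 3) << 3) | z_low   # upper bits of a*c rejoined with z[2:0]
    bc = ((corrected & 7) << 11) | y       # upper 3 bits of b*c rejoined with y[10:0]
    return ac, bc

# spot-check a few operand combinations (an exhaustive 7-bit sweep is also cheap)
for a, b, c in [(0, 0, 0), (1, 1, 1), (85, 39, 101), (127, 127, 127)]:
    ac, bc = dual_mult_7x7(a, b, c)
    assert ac == a * c and bc == b * c
```

The subtraction works because o[24:11] equals a*c plus the 3-bit overflow of b*c; removing the (independently computed) low 3 bits of a*c leaves those two quantities in disjoint bit fields.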

240 MHz. Another design [18] had a larger ratio of DSP Blocks (71%) to ALMs (15%), but despite the relative sparseness of the utilization, was slower at 200 MHz.

Power was not reported for the VUP designs. Microsoft reports 125W for a full chip Stratix 10 2800, which uses the majority of resources at a lower frequency. The VUP designs use a significant amount of memory and are DSP rich, and have a relatively low logic utilization, but run the DSP blocks (and in [14], the memory as well) at very high speed to achieve throughput. While we cannot compare any Xilinx power numbers, we can see that the power envelope for a full device Stratix 10 2800 @400MHz should still be coolable in a production environment. We therefore set ourselves a target of a design that uses around 80% of the device resources (but, if possible, a lower percentage of memory blocks), capable of running in the range of 400 MHz on a previous generation FPGA. In [6], a relatively small VUP based CNN accelerator, using only 1,106 of the 6,840 DSP48 blocks on the device (although 79% of the logic), consumed 75W at a 200 MHz clock. This was apparently measured at the board level, so may not accurately reflect the FPGA power consumption. Power is reported for [17], but these are simulated numbers from the EPE tool, and not actually measured.

We will build a number of dot product structures, balancing DSP based implementations with multipliers constructed from different ratios of M20K blocks and soft logic, in order to provide a wider choice in resource balancing. We will also arrange the dot products in a 2D systolic architecture so that memory bandwidth, or at least the I/O bandwidth, is reduced. This may increase the stress on inter PE routing, but will also give a worst-case routing congestion datapoint. If a 2D systolic structure is not suitable, the dot products could still be individually instantiated. A key point of our designs over [12][14] is that the dot product latency varies logarithmically rather than linearly with the dot size, reducing system latency, and giving more flexibility to the matrix and vector decompositions.

III. METHOD

We started by first creating some efficient SM1.7 operators, based on both DSP Blocks and M20K memory blocks. In both cases, a small amount of soft logic is required. Many of our dot products will also require pure soft logic multipliers, and we will use the Quartus based IP for these.

A. DSP Block Based Multipliers

In [15] several new DSP block based methods for INT8 multipliers were introduced. We will modify these to implement unsigned INT7 (UINT7) multipliers instead.

Both [15] and [11] use a signed or twos complement format to represent input values. There are some FPGA implementation efficiencies if we use SM format instead. The dynamic range is almost identical, with the only difference being that the number -128 cannot be represented in SM1.7. The sign is simply calculated as (sign_a) XOR (sign_b). Although the first level of the adder tree is arithmetically more complex, as the signs can result in subtractions as well as additions, this can be supported in the logic associated with the FPGA carry chain, and is therefore free at the implementation level.

The magnitude bits require a UINT7 multiplier. We use a modified technique of [15], with each INT18 multiplier in the DSP block acting as an unsigned 18x7 multiplier. The two 7 bit multiplicands are input with a zero nibble between them. The output of the multiplier contains the sum of two partially overlapping 14 bit results. A correction factor of 3 bits can be calculated by a subset multiplier, which only outputs the 3 LSBs of a 3x3 bit multiplication. The description in [15] does

Fig. 2. M20K Dual Port ROM - Pair of 7x4 Multiplications (Ports A and B: 11 bit ROM address inputs addr[10:0], 10 bit data outputs dout[9:0])
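One ROM port of Fig. 2 can be modeled directly: an 11-bit address formed from the 7-bit and 4-bit operands looks up the product with its LSB dropped (only 10 output bits are available), and the LSB is recomputed as a 2-input logic function. This is a sketch consistent with the description in the text; the table layout is an assumption.

```python
# Model of one M20K ROM port as a 7x4 multiplier: 2^11 addresses, 10-bit data.
# The 11-bit product a*b is stored without its LSB; that LSB is just
# a[0] AND b[0], recomputed in soft logic.
ROM = [((addr >> 4) * (addr & 0xF)) >> 1 for addr in range(1 << 11)]

def mult_7x4(a: int, b: int) -> int:
    addr = (a << 4) | b          # {a[6:0], b[3:0]} forms the ROM address
    lsb = a & b & 1              # 2 bit logic function for the missing LSB
    return (ROM[addr] << 1) | lsb

# exhaustive check over all 7x4 operand pairs
assert all(mult_7x4(a, b) == a * b for a in range(128) for b in range(16))
```

The 7x3 port works the same way after treating the 3-bit operand as a 4-bit value and right shifting the result, as the text explains.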

not describe the construction of this multiplier, but the small size means that it can be implemented relatively efficiently by table look up, especially if 6LUTs are natively supported. We used the techniques in [19] to map this multiplier to the ALM structure in the Stratix 10 device.

Fig. 3. Arithmetic Equivalence of M20K Mapping

We simply subtract the value z[2:0] from o[24:11] to obtain the upper 11 bits of a*c, or z[13:3], and the 3 upper bits of a*b, or y[13:11]. The lower 3 bits of a*c are output from the LSB-only multiplier, and the lower 11 bits of a*b are output directly from the DSP Block. Both the LSB multiplier and the subtractor can be implemented in 7.5 ALMs, or just under 4 ALMs per UINT7 multiplier on average. Depending on the performance required, additional logic may be needed to introduce pipelining, although this may be amortized over larger structures like dot products.

We re-used the techniques of [15] for the INT8 dot products.

B. M20K Based Multipliers

The features and organization of the M20K memory block lend themselves well to 7 bit multipliers. This 20Kb block can be configured as a dual port ROM, with an 11 bit address space, and therefore a 10 bit output. This supports two independent 7x4 multiplications, as shown in Figure 2.

One set of inputs multiplies A[6:0]*B[3:0], and the other A[6:0]*B[6:4] - in other words, we only use the 7x3 capability of the second input. As we only have 10 bits of the required 11 bit output available, we do not support the LSB, which can be inexpensively calculated in a 2 bit logic function. This is not an issue for the 7x3 case, as we treat the 3 bit input as a 4 bit value with a LSB, and right shift the output by 1 bit. From the dot diagram in Figure 2, the least significant partial product disappears, and we end up with 7x4 and 7x3 products, respectively. We then sum the appropriate segments to get our result, and append the separately calculated LSB. The total cost is 6.5 ALMs: 5 ALMs for the adder, 1/2 an ALM for the LSB, and 1.5 ALMs to latency balance the LSBs of the 7x4 multiplier. The latter can again be amortized across a dot product for additional savings.

We then extended the memory based multiplier method to INT8. Using the M20K here is more difficult. First of all, both multiplier halves in the UINT7 case were identical - the multiplicand was the same, and both multiplier subsets were unsigned, even though the upper one only used 3 of the 4 bit dataset. For INT8, the upper multiplier half is signed, while the lower half is unsigned (and identical to the UINT7 case). This means that the two multiplier nibbles need separate tables. To support this efficiently, a pair of INT8 multipliers needs to be constructed together, with both upper and lower multiplier nibbles addressing their own dual port ROMs. The granularity of the INT8 M20K based method is therefore two.

Figure 4 shows a DOT representation of this. The blue and orange input boxes (and the same colored dots) represent the unsigned 7 bit multiplicand and signed 4 bit multipliers, and the red and green input boxes the unsigned 7 bit multiplicand and the unsigned 4 bit multiplier. The yellow dots on the LSBs of the partial products are calculated in soft logic, just like the LSB for the first partial product in the UINT7 case. The missing information is the product of the MSB of the multiplicand and the multiplier, also shown as yellow dots. Like the LSBs, these are calculated in soft logic. The output stage now has three rows, but we do not need to add them all here. The four MSBs of the third row can be moved to the empty positions of the first row; the existing partial product adder remains the same size. The remaining 4 bits in the third row can be added with the same 4 bits from any other multiplier in the DOT, and only folded into the final result at the output.

The INT8 version requires an additional 4.5 ALMs over the

Fig. 4. Int8 M20K Mapping
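The INT8 mapping of Fig. 4 rests on splitting the twos complement multiplier into a signed upper nibble and an unsigned lower nibble, which is why the two halves need separate ROM tables. A sketch verifying that decomposition (the helper name `split_int8` is hypothetical; the widths follow the surrounding text):

```python
# An INT8 (twos complement) operand splits into a signed upper nibble and an
# unsigned lower nibble: y = 16*yh + yl, with yh in [-8, 7] and yl in [0, 15].
# The product against a multiplicand x then becomes two narrow multiplies.
def split_int8(y: int):
    yh = y >> 4      # arithmetic shift: the signed upper nibble
    yl = y & 0xF     # the unsigned lower nibble
    return yh, yl

for x in range(-128, 128):
    for y in range(-128, 128):
        yh, yl = split_int8(y)
        assert -8 <= yh <= 7 and 0 <= yl <= 15
        assert x * y == 16 * (x * yh) + x * yl
```

In the UINT7 case both nibble products were unsigned, so one table served both halves; here the signed upper-half product forces the pair-of-multipliers granularity described above.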

Fig. 5. Mixed Multiplier Construct - 16 Element dot product (M20K + ALM, DSP + ALM, and ALM multiplier blocks)

UINT7 counterpart. The total logic cost - around 10 ALMs - is considerably less than the 30 ALMs required for the pure soft logic IP provided by Quartus. The die cost of the M20K is likely considerably larger than the difference of 20 ALMs, but it does allow us additional choices in balancing resource use on the FPGA. As we are now attempting to fit multiple tens of thousands of multipliers on an FPGA, a small decrease in soft logic per multiplier can have a marked impact.

We can now assemble dot products. We use 16 element dot products to more closely correlate inter-PE routing resources with the density of long vertical wires in the Stratix 10 devices. (We originally tried 32 element dot products like [15], but found that the 2D systolic array architecture suffered from excessive routing congestion, affecting Fmax.) We will first implement SM1.7 based cores. Our 16 element dot products contain 11 DSP based SM1.7 multipliers, with a mix of M20K and logic based multipliers. Figure 5 shows a block diagram of a dot product with 5 M20K based multipliers, annotated with implementation resource per function (there are no purely logic based multipliers in this design).

We can then start creating systolic arrays of these dot products. Each PE consists of a 16 element dot product, and two directions of activation and weight routing, vertical and horizontal. Each PE also has a separate output. Our building block is a 2x2 PE grid, which we then replicate to make a larger systolic array. Figure 6 shows the connections per individual PE, and Figure 7 shows the logical structure of an array of PEs.

Fig. 6. PE Dot Product and Ingress/Egress Ports

TABLE I
RESOURCE REQUIREMENTS FOR DIFFERENT SM1.7 PE COMPONENT MIXES

M20Ks SM1.7 | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs
0 | 1573 | 53 | 482 | 3694 | 0 | 11
1 | 1493 | 57 | 471 | 3512 | 4 | 11
2 | 1411 | 54 | 448 | 3330 | 8 | 11
3 | 1328 | 58 | 455 | 3148 | 12 | 11
4 | 1245 | 51 | 409 | 2966 | 16 | 11
5 | 1164 | 50 | 366 | 2784 | 20 | 11

Table 1 shows the size of our base 2x2 PE grid, with varying numbers of M20K based SM1.7 multipliers per PE. In each case, 11 of the 16 multipliers in each PE are DSP based multipliers, with the remaining 5 being a combination of multipliers constructed from memory or soft logic. There is a significant cost to the logic based multipliers, but these may be required in cases where the system design requires a large number of memory ports to feed the dot products. We
have seen in the earlier [8][12][14] cases that there is a large variation in the memory depth and bandwidth to multiplier relationship. Further implementation flexibility is possible by mixing grids with different multiplier composition ratios.

Because of the more restrictive memory based approach in the INT8 case, we only composed one version of the 16 element dot product, split into 4 and 12 elements of the M20K and DSP Block based versions, respectively. Table 2 reports the resources in the same manner as Table 1.

Fig. 7. 2D PE Array

TABLE II
RESOURCE REQUIREMENTS FOR AN EXAMPLE INT8 2X2 GRID

M20Ks INT8 | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs
4 | 1606 | 54 | 730 | 4124 | 16 | 12

For all of our designs, we targeted a 1SG280LN2F43E2VG, using Quartus 19.1. It is important to note that this is a mid-speed grade device.

IV. RESULTS - SYSTOLIC ARRAYS

We created a number of chip filling designs (to approximately 80% logic usage, as discussed earlier in our goals for this work), with different types of PE. In these reported cases, all PE compositions were uniform across the design, but this does not have to be the case. We first built these full chip designs with three variants of the SM1.7 dot products: 5, 3, and 2 M20K multipliers per PE. The results are listed in Table 3. In every case, the number of PEs was 2048, arranged as a 32x64 grid of PEs (16x32 grid of our 2x2 grids). In all cases, these designs were placed and routed using a push button flow. No floorplanning was used, even for the very large arrays.

We can achieve a successful fit of 32K multipliers, with consistent results over different multiplier component mixes, while leaving enough logic left over to support a system design in the Brainwave style. Our performance consistently exceeds 400MHz, so a performance level of 25TOPs to 30TOPs is a reasonable target for this device, and is likely scalable within this device family.

The INT8 versions gave us both slightly lower performance and lower density. We initially tried a near maximum DSP usage for a 30x60 grid of PEs (28,800 INT8 multipliers), but the speed dropped below 400 MHz. We attempted to improve the speed by pipelining the critical path, which was from the output of the DSP Blocks to the immediately following adders, but this only increased area, with a minimal impact on performance. Reducing the grid size was more successful - although the 25,088 multipliers in this grid are only a 12% density improvement on the previously reported Intel results, a large number of DSP Blocks have been freed up for other tasks, such as activation function implementation. About half the M20Ks remain for the system design. In all of our SM1.7 and INT8 designs, the dot product composition is homogeneous throughout the grid. By mixing and matching, e.g. having some PEs implemented only with DSP Blocks and some with a mixture of DSP Blocks and memory, and possibly logic, different amounts and distributions of resources can remain free.

Our example systolic arrays used virtual pins, as in [15]. The number of virtual pins in our design was considerably less, because the 2D systolic structure includes PE to PE communication paths. There are a total of 51,200 virtual pins, consisting of 8K left side input bits, 4K bits in from the top, and 38,912 output bits. These virtual pins occupy 25600 ALMs.

A floorplan and routing heatmap are shown in Figure 8. This particular example was for SM1.7, with 5 M20K based multipliers per dot product. The heatmap for all of the SM1.7 and INT8 grids is almost identical, with the highest stress (colored in pink) concentrated around the discontinuities on the die. These include the embedded processor subsystem on the lower right edge, and the IO regions running vertically through the die. This suggests that a system using a collection of smaller dot product arrays, or cases where only a smaller portion of the FPGA is used for the deep learning application, may be able to increase the clock frequency of the dot product array significantly (we found that small grids often exceeded 600MHz).

Figure 9 shows an intermediate sized array (20x40 dot products), which is about 40% of the maximum size limit, based on both our resource goal, and the number of DSP Blocks in the device. The fractal synthesis clustering and packing algorithms direct Quartus to create an almost perfectly rectangular layout, leaving the majority of the device completely untouched.

The Brainwave design contained 96K multipliers with a 1.2 magnitude format. Our work shows that 32K multipliers for a much larger 1.7 magnitude format is readily achievable - this translates to about a 3x ISO density improvement, based on arithmetic complexity (which varies as the square of relative precision).
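The grid bookkeeping quoted in this section is easy to re-derive; a small sketch using only numbers stated in the text (the helper name `grid_stats` is an assumption):

```python
# Re-derive the PE, multiplier, and virtual pin counts from the grid
# dimensions: each PE holds one 16 element dot product.
def grid_stats(rows: int, cols: int, mults_per_pe: int = 16):
    pes = rows * cols
    return pes, pes * mults_per_pe

assert grid_stats(32, 64) == (2048, 32768)   # full SM1.7 array: 2048 PEs, 32K multipliers
assert grid_stats(30, 60)[1] == 28800        # largest INT8 attempt
assert grid_stats(28, 56)[1] == 25088        # reduced INT8 grid
# virtual pins: 8K left side inputs + 4K top inputs + 38,912 outputs
assert 8 * 1024 + 4 * 1024 + 38912 == 51200
```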

TABLE III
LARGE 2D SYSTOLIC ARRAY (SM1.7) RESULTS

M20Ks SM1.7 | ALM (total) | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs | FMax (MHz)
5 | 654K (70%) | 429K | 24K | 278K | 1676K | 10240 (87%) | 5632 (98%) | 432
3 | 735K (79%) | 499K | 23K | 268K | 1852K | 6144 (52%) | 5632 (98%) | 445
2 | 774K (83%) | 534K | 22K | 251K | 1940K | 4096 (35%) | 5632 (98%) | 413

TABLE IV
LARGE 2D SYSTOLIC ARRAY (INT8) RESULTS

Array Size | ALM (total) | ALM (logic+FF) | ALM (logic) | ALM (FF only) | FFs | M20K | DSPs | FMax (MHz)
30x60 | 727K (78%) | 468K | 20K | 298K | 1888K | 7200 (61%) | 5400 (94%) | 378
28x56 | 693K (74%) | 407K | 18K | 345K | 1888K | 6272 (54%) | 4704 (82%) | 404
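The table rows translate directly to peak throughput using the usual convention of two operations (multiply and accumulate) per multiplier per cycle; that convention is an assumption here, but it reproduces the 25-30 TOPs range claimed for the SM1.7 arrays.

```python
# Peak throughput in TOPs: 2 ops (multiply + accumulate) per multiplier per cycle.
def tops(multipliers: int, fmax_mhz: float) -> float:
    return 2 * multipliers * fmax_mhz * 1e6 / 1e12

# SM1.7 full array (Table III): 32,768 multipliers at 432-445 MHz
assert 28.0 < tops(32768, 432) < 30.0
# INT8 28x56 array (Table IV): 25,088 multipliers at 404 MHz
assert 20.0 < tops(25088, 404) < 21.0
```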

Fig. 8. Floorplan (a) and Heatmap (b) of a 32K multiplier 2D Systolic Array

V. RESULTS - SYSTEM DESIGNS

After validating our approach with the independent 2D systolic arrays, we implemented a more representative system design. In this case, all I/Os are mapped to pins. First, each PE in the systolic array is terminated in a 32b integer accumulator. The PE routing of figure 6 is modified to support systolic propagation of the accumulator results to the core periphery. Finally, small (512 word) input and output RAMs are attached to the periphery for flow control. Muxing is used to select the current output stream.

For any given array size, the number of DSP Blocks remains the same, while there are small (typically 5%) increases in soft logic and memory sizes. There is virtually no impact on speed. We can see that for even very large (30x60) PE arrays, there is still some logic margin available, and for all array sizes, the memory availability is greater than the logic availability. All of this bodes well for incorporating these structures into a more complex deep learning system. Alternately, the system design can be used as an accelerator as is.

We noticed an initial speed degradation of about 10% compared to the array core, and decided to introduce some minimal floorplanning. Pinning the corners of the arrays achieved an immediate result, with the system designs running about 10% faster than the push button arrays.

Figure 10 shows a block diagram of the system array. The flow control memories can be seen in the periphery. The routing heatmap of Fig. 11 following our minimal floorplan intervention shows less stress than the systolic array core alone.
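The PE structure described above, where each element multiplies, accumulates into a 32-bit register, and forwards its operand systolically, can be sketched as a minimal behavioral model. This is an illustrative sketch of ours, not the paper's RTL; the time-skew of operands in a real systolic array is omitted for brevity.

```python
# Behavioral sketch of one systolic PE chain (illustrative assumption):
# each PE holds a weight, accumulates weight * activation into a 32-bit
# register (masked to model wraparound), and forwards the activation.
MASK32 = (1 << 32) - 1

class PE:
    def __init__(self, weight: int):
        self.weight = weight
        self.acc = 0

    def step(self, activation: int) -> int:
        """Consume one activation, update the accumulator, pass it on."""
        self.acc = (self.acc + self.weight * activation) & MASK32
        return activation  # systolic forwarding to the next PE

def run_chain(weights, activations):
    pes = [PE(w) for w in weights]
    for a in activations:
        x = a
        for pe in pes:
            x = pe.step(x)
    return [pe.acc for pe in pes]

# Each PE accumulates weight * sum(activations).
print(run_chain([1, 2, 3], [4, 5, 6]))  # [15, 30, 45]
```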

TABLE V
SYSTOLIC ARRAY SYSTEM (INT8) RESULTS

Array Size  ALM         Pins       FFs    M20K        DSPs        FMax (MHz)
24x48       503K (54%)  357 (39%)  1300K  4944 (42%)  3456 (60%)  464
26x52       588K (63%)  357 (39%)  1538K  5772 (49%)  4056 (70%)  458
28x56       681K (73%)  357 (39%)  1775K  6664 (57%)  4704 (82%)  439
30x60       778K (83%)  357 (39%)  2027K  7620 (65%)  5400 (94%)  402
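The DSP counts in Table V are exactly three DSP blocks per PE for every array size, an observation derived from the tabulated data rather than a claim stated in the text:

```python
# Cross-check of Table V: reported DSP usage equals 3 DSP blocks per PE.
table_v_dsps = {  # (rows, cols) -> reported DSP count
    (24, 48): 3456,
    (26, 52): 4056,
    (28, 56): 4704,
    (30, 60): 5400,
}
for (rows, cols), dsps in table_v_dsps.items():
    assert dsps == 3 * rows * cols
print("all Table V rows consistent with 3 DSPs per PE")
```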

Fig. 11. Routing Heatmap


Fig. 9. 20x40 PE 2D Systolic Array

Fig. 10. Systolic Array Block Diagram (input RAM banks on the DATA A/ADDR A and DATA B/ADDR B ports behind an address decoder; output RAM banks and output muxes on the ADDR OUT/DATA OUT ports)

VI. CONCLUSIONS

Our goal was to develop a method to create large, scalable, and high performance systolic arrays for DL at an 8 bit precision. We showed new methods for INT8 multiplier based dot products, and introduced SM1.7 as an even more efficient format for implementing these arithmetic constructs with this precision. The resultant designs achieve their goals with consistent and deterministic speed, as well as low latency. Our throughput is 10% to 50% higher than other designs implemented on the same or slightly larger devices, by utilizing a balanced implementation mix (the ratio of multipliers and soft logic to DSP blocks). We were able to realize these results even using a single monolithic chip-scale 2D systolic array.

We can scale these arrays from small to large, while ensuring an efficiently packed, placed and routed result, without floorplanning, simplifying the design process greatly. We then extended the arrays to include flow control so that they can be used as an accelerator, either as a stand-alone device, or as part of a more complex deep learning design on the FPGA. Using a small amount of floorplanning, performance increased, delivering over 400MHz on a mid-speed grade device, for even a very full design.
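SM1.7 is a signed-magnitude format (one sign bit, seven magnitude bits). One reason such a format suits dense multiplier arrays is that the product sign reduces to an XOR while the magnitudes multiply unsigned; the sketch below illustrates this, with the encoding range taken as an assumption from the format name rather than from this excerpt:

```python
# Signed-magnitude multiply sketch (assumption: SM1.7 encodes one sign
# bit plus a 7-bit magnitude, i.e. values in [-127, +127]).
def to_sm(x: int) -> tuple[int, int]:
    """Decompose an integer into (sign bit, magnitude)."""
    assert -127 <= x <= 127
    return (1 if x < 0 else 0, abs(x))

def sm_mul(a: int, b: int) -> int:
    sa, ma = to_sm(a)
    sb, mb = to_sm(b)
    mag = ma * mb      # unsigned 7x7 -> up to 14-bit product
    sign = sa ^ sb     # product sign is just an XOR of the sign bits
    return -mag if sign else mag

print(sm_mul(-5, 7), sm_mul(-5, -7))  # -35 35
```

Because no sign extension of partial products is needed, the unsigned magnitude multipliers pack more densely than two's-complement ones, which is consistent with the efficiency claim above.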


