
Design Space Exploration for Softmax Implementations
Zhigang Wei, Aman Arora, Pragenesh Patel, Lizy John
The Laboratory for Computer Architecture, Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, United States
zw5259@utexas.edu, aman.kbm@utexas.edu, f20160183@goa.bits-pilani.ac.in, ljohn@ece.utexas.edu

Abstract—Deep Neural Networks (DNN) are crucial components of machine learning in the big data era. Significant effort has been put into the hardware acceleration of the convolution and fully-connected layers of neural networks, while not much attention has been paid to the Softmax layer. Softmax is used in terminal classification layers in networks like ResNet, and is also used in intermediate layers in networks like the Transformer. As the speed of other DNN layers keeps improving, efficient and flexible designs for Softmax are required. With several ways to implement Softmax in hardware, we evaluate various softmax hardware designs and the trade-offs between them. In order to make the design space exploration more efficient, we also develop a parameterized generator which can produce softmax designs by varying multiple aspects of a base architecture. The aspects, or knobs, are parallelism, accuracy, storage and precision. The goal of the generator is to enable evaluation of trade-offs between area, delay, power and accuracy in the architecture of a softmax unit. We simulate and synthesize the generated designs and present results comparing them with the existing state-of-the-art. Our exploration reveals that the design with parallelism of 16 provides the best area-delay product among designs with parallelism ranging from 1 to 32. It is also observed that look-up table based approximate LOG and EXP units can be used to yield almost the same accuracy as the full LOG and EXP units, while providing area and energy benefits. Additionally, providing local registers for intermediate values is seen to provide energy savings.

Index Terms—Softmax, DNN, Machine Learning, Design Space Exploration

I. INTRODUCTION

Deep Neural Networks (DNN) have become one of the most important technologies for machine learning. There has been rapid development of hardware for accelerating the inference and training of DNNs. While most architectures focus on speeding up the convolution and fully-connected layers, only a few researchers have proposed hardware optimizations for the softmax layer, which serves as a key component in DNNs. Therefore, more research is required to explore efficient architectures for softmax.

Softmax is usually used for multi-category classification as the last layer in neural networks like ResNet or MobileNet. It is also used as an activation layer in intermediate layers of some networks, for example in the Transformer and the Capsule network.

The major challenge of softmax hardware is to implement efficient exponential and division units. The naive implementation is not very hardware friendly because it easily causes overflow, requires large amounts of storage, and includes divider and exponential units which are generally costly. Some researchers have proposed architectures for softmax [6] [4] [8] [5] [12]. However, most of these designs can only support a fixed number of inputs, the hardware required increases proportionally to the number of inputs, and they generally support only one precision. Hence, these designs are not flexible. The focus of many prior designs is on providing efficient implementations of the exponent unit, e.g. LUT based [6] or FSM based [5]. Geng et al. [4] use bit-shifts for division. The design in [12] is not pipelined. Li et al. [8] use FIFOs to store all input values, increasing the area significantly. Not all designs support both fixed point and floating point data types, limiting their application to either training or inference.

Although existing designs may perform well with one particular accuracy or parallelism in one scenario, the performance may not carry over when architects want to tune the design. Additionally, tuning an existing hardware design can be time-consuming and requires a lot of extra work. There are several limitations in the existing softmax hardware designs:
• The support for different parallelism values is poor, which means the performance of these designs does not scale well with increasing input data sizes.
• They do not support various precisions, which limits their design to either machine learning training or inference.
• Designs may consume large area while the trade-offs between area and accuracy are not clear.

There exist significant trade-offs in the aspects mentioned above. Different DNNs have different numbers of inputs for the softmax layer. Different accelerators have different budgets for the area and delay of the softmax layer. Different applications have different tolerances for classification accuracy. A one-size-fits-all softmax architecture cannot satisfy all the requirements in a space with such diversity. Ad hoc methods of exploration can leave out efficient architectures, leading to inefficient accelerators. So, we believe a tunable generator that can generate multiple designs with different architectures can be very valuable for performing design space exploration. To the best of our knowledge, no such tool exists in the open source community. Our contributions in this paper are summarized as follows:
Fig. 1: (a) Naive softmax architecture, as shown in [4] (left); (b) softmax architecture proposed in [13] with the required EXP units (right)

• We propose a base architecture that is amenable to adjustment based on various parameters such as parallelism, accuracy and precision. This architecture can support any number of input values and can work for the forward pass of softmax for both inference and training.
• We develop a generator called SoftGen that generates softmax designs. This generator is controlled by various knobs - parallelism, accuracy, storage, precision - that can take multiple values. Based on the values of the knobs, the generator dumps a design and a testbench.
• We perform design space exploration using our developed generator and proposed base architecture. We evaluate the various generated softmax designs and discuss the trade-offs between area, delay, power and accuracy.

Our design space exploration reveals observations from three aspects. For parallelism ranging from 1 to 32, the energy-delay product of the design decreases with increasing parallelism until the parallelism reaches 16. It is also observed that the approximate LUT-based LOG and EXP units yield almost the same accuracy but are more energy and area efficient compared to the fully accurate implementations. Additionally, local registers used to store intermediate results can reduce the number of memory accesses and therefore provide energy savings; however, they require extra area.

The rest of the paper is organized as follows: Section II gives background information on softmax hardware design. In Section III, we describe the details of the base architecture used by our generator. Section IV introduces how we automate the design generation and the tools used for evaluation. Section V presents the various exploration experiments we conducted along with the results observed from these experiments. Section VI concludes the paper and points out future work.

II. BACKGROUND

The formula to calculate the M-th neuron in a softmax layer is described below:

P(M\text{-th category}) = \frac{e^{X_M}}{\sum_{L=1}^{N} e^{X_L}}    (1)

where X_L is the output of the L-th neuron and N is the number of categories. The most straightforward way to implement it is shown in Figure 1(a). However, several problems exist in such a design: first of all, this kind of design requires a significant amount of storage to store the exponential results, because the number of classification outputs can be in the thousands or millions. Secondly, such a design includes expensive division units. A normal division unit consumes large area and requires a significant amount of time to execute, which increases the power consumption and makes it difficult to further pipeline the design. Thirdly, it cannot leverage the parallelism that exists in the softmax calculation; therefore, it performs poorly when the number of inputs increases.

Researchers have tried several approaches to tackle the above problems. Kouretas et al. [6] use a LUT to approximate the exponential calculation. Hu et al. [5] leverage stochastic computing to perform the softmax execution.

Yuan [13] introduced an efficient hardware architecture for the softmax layer, as shown in Figure 1(b). Equation 1 is adapted into a more hardware friendly form:

P(M\text{-th category}) = e^{(X_M - X_{max}) - \ln\left(\sum_{L=1}^{N} e^{(X_L - X_{max})}\right)}    (2)

This avoids the large silicon area consumption and accuracy loss caused by division units, and a down-scaling technique is applied to the exponential units to overcome the potential overflow problem. However, several problems remain. For DNNs with a large number of categories, this design requires a large number of exponential units, which is not realistic. Additionally, the architecture modifies the meaning of the outputs; it generates the magnitude of each classification rather than the probability (the last exponential stage is missing). Lastly, no quantitative evaluation of the design is provided in the paper.

Du et al. [3] optimized Yuan's architecture [13] to process the data serially so that it can perform classification over an unbounded number of categories. But the total number of cycles to accomplish the softmax operation increases exponentially with the number of input values. FIFOs are utilized to store intermediate data. The depth of these FIFOs increases proportionally to the input size, indicating a large area requirement. The authors take advantage of the distribution of inputs in softmax layers to avoid some calculations for input values that are out of range, but that does not work in all cases, e.g. training.

We create our base architecture based on Yuan's design due to its scalability and pipelining features.
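To make the reformulation concrete, the following Python sketch (our own minimal reference model, not an artifact of the paper) evaluates Equation 1 directly and via the max-subtracted, log-sum-exp form of Equation 2, confirming that the two agree while the latter keeps every exponent argument at or below zero and replaces the division with a subtraction in the exponent.

```python
import math

def softmax_naive(x):
    """Equation 1: direct exponentiation followed by a division."""
    exps = [math.exp(v) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_hw_friendly(x):
    """Equation 2: subtract the max, take ln of the sum, exponentiate once.

    All EXP inputs are <= 0, which avoids overflow, and the division is
    replaced by a subtraction inside the exponent.
    """
    x_max = max(x)
    shifted = [v - x_max for v in x]                          # X_L - X_max
    log_sum = math.log(sum(math.exp(v) for v in shifted))     # ln(sum e^(X_L - X_max))
    return [math.exp(v - log_sum) for v in shifted]           # e^(X_M - X_max - XLOG)

if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5, 7.0]
    a = softmax_naive(logits)
    b = softmax_hw_friendly(logits)
    assert all(abs(p - q) < 1e-9 for p, q in zip(a, b))
    print(b)
```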
Fig. 2: Baseline architecture used for design exploration. The softmax generator allows the knobs PA: Parallelism, PR: Precision, AC: Accuracy, ST: Storage. The diagram shows the hardware with PA=4; the value of ST controls the presence of the buffer and the dotted path. (The block diagram also shows the generator scripts invoked by SoftGen(PA,PR,AC,ST): max_tree_gen(PA,PR), sub_gen(PA,PR), exp_gen(PA,PR,AC), adder_tree_gen(PA,PR), ln_gen(AC,PR) and top_gen(PA,ST).)

III. ARCHITECTURE

Our baseline architecture allows the generator SoftGen to create various designs based on the knobs. This architecture is shown in Fig. 2. The architecture is logically divided into 3 stages and physically divided into 7 blocks:
• Stage 1: This stage includes block 1 (max). It finds the largest value X_max among all the input values.
• Stage 2: This stage includes blocks 2, 3, 4.
  – Block 2 (subtraction) finds the difference between each input value and the max value: X_L - X_max.
  – Block 3 (exponent) generates the exponential of the results from Block 2: e^(X_L - X_max).
  – Block 4 (adder tree) adds all the exponential values up: \sum_{L=1}^{N} e^(X_L - X_max).
• Stage 3: This stage includes blocks 5, 6, 7.
  – Block 5 (log) calculates the natural logarithm of the result from Block 4: ln(\sum_{L=1}^{N} e^(X_L - X_max)). Let's call this XLOG.
  – Block 6 is composed of two sets of subtractors (called presub and logsub) to calculate X_M - X_max - XLOG.
  – Block 7 calculates the final result: e^(X_M - X_max - XLOG).

Stage 2 can only be triggered once the max value has been found by Stage 1. Stage 3 can be triggered only when Stage 2 is finished (i.e. the adder tree has finished adding all values). The timeline for a design generated by SoftGen can be seen in Fig. 3. Within each stage, operations are pipelined to reduce the latency significantly. In other words, blocks within a stage start before the previous block has finished (e.g. ADD in stage 2 starts before SUB in stage 2 is finished). There are latches after each block in the design. Some blocks, such as the adder tree, the max block and the exponential unit, are pipelined internally as well.

Fig. 3: Timeline for the architecture (Parallelism=4, Number of input values=64, Storage=mem)

Note that the number of inputs for the softmax operation can be different from the amount of parallelism in the design. For example, a design could have a parallelism of 4 (4 values read and processed together in the design, 4 subtractors in block 2, 4 exponential units in block 3, etc.), but still process a tensor with, say, 512 input values. In these cases, the control unit orchestrates the data movement such that all 512 values are processed in groups, with 4 values entering the design at a time.

The following sections provide details of how the designs of the various blocks in the architecture are modified to enable the knobs of the generator:

A. Max block (block 1) and Adder tree (block 4)

The Parallelism and Precision knobs affect the architecture of the Max block. Based on the precision, floating point or fixed point comparators are instantiated in this block. In a fully serial implementation, the Max block only requires 1 comparator. For Parallelism >= 2, the generated Max block is composed of a comparator tree. The number of levels of comparators in the Max block is log2(N) + 1, where N is the value of the Parallelism knob. The +1 is required to handle the cases where the number of input values is larger than the parallelism of the design. For the Parallelism=4 and 512-input case mentioned above, the max value from 4 input values is stored in a buffer and is compared with the max value from the next 4 values by this additional comparator. The comparator tree is pipelined. Based on the delays of the various blocks in the library we used [10] (more details can be found in Section IV), we add pipeline registers after every 3 comparator levels. The Adder tree is similar to the comparator tree, except that we add pipeline registers in this tree after every adder.
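As an illustration only (the grouping and function names below are ours, not the generator's), the following sketch models how a design with Parallelism P finds the maximum of a longer input tensor: each cycle a group of P values passes through a comparator tree of log2(P) levels, and one extra comparator merges the group result with the running maximum held in a buffer.

```python
import math

def group_max(group):
    """Model of the comparator tree: log2(P) pairwise-compare levels."""
    level = list(group)
    while len(level) > 1:
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def streaming_max(values, parallelism=4):
    """P values enter per 'cycle'; an extra comparator (the +1 level)
    merges each group maximum with the running maximum."""
    assert parallelism & (parallelism - 1) == 0, "parallelism must be a power of 2"
    running = -math.inf
    for i in range(0, len(values), parallelism):
        group = values[i:i + parallelism]
        group += [-math.inf] * (parallelism - len(group))  # pad a short last group
        running = max(running, group_max(group))
    return running

if __name__ == "__main__":
    data = [0.3 * k - 17.0 for k in range(512)]   # 512 inputs, Parallelism = 4
    assert streaming_max(data, parallelism=4) == max(data)
    print("comparator levels per group:", int(math.log2(4)), "+ 1 merge comparator")
```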
B. Subtractors (block 2 and block 6)

The value of the Parallelism knob governs the number of subtractors needed in these blocks. The type of the subtractors (floating point or fixed point) depends on the value of the Precision knob. Block 6 is divided into two parts: logsub and presub. For each input, the value X_L - X_max is calculated by block 2 and is required again by block 6 logsub (to calculate X_M - X_max - XLOG). In [3], the authors save the temporary values X_L - X_max in FIFOs in the design. In our architecture, we instead add additional subtractors (block 6 presub) to calculate the difference again. This saves significant area (for the FIFOs), but adds 1 cycle of latency and requires additional subtractor(s).

C. Exponential units (block 3 and block 7)

There are multiple ways to design hardware that computes the exponent [14][4][12][5][8][3]. In our generator, we provide the Accuracy knob to choose between two implementations of the exponential unit. The first one is the exponential unit provided by the DesignWare library ([10]). We provide a second reduced-area, low-accuracy option that uses the LUT-based Piecewise Linear Function (PLF) approach from [4]. The architecture utilizing the PLF technique for the 16-bit floating point EXP unit is shown in Fig. 4. A user may also choose a fixed-point data format using the Precision knob. For that, LUT-based fixed-point EXP units have been implemented as well; they follow a similar architecture with LUTs storing fixed point values, and exclude the floating to fixed point converter.

Fig. 4: Architecture of the float16 EXP unit used by the generator

PLF is generally used to approximate non-linear functions with a small number of linear pieces [2]. The PLF technique approximates the computation of e^x by using a linear equation in each of N continuous intervals uniformly defined over a finite range of x \in [x_m^1, x_p^N], with each interval having a slope a_n:

f^n(x) = a_n \times (x - x_m^n) + y_m^n = a_n \times x + (y_m^n - a_n \times x_m^n)    (3)

where x \in [x_m^n, x_p^n], y_m^n = e^{x_m^n}, n \in [1, N].

Following the implementation in [3], input data in the range of [-8, 0] is considered valid, and data less than -8 is mapped to the last entry in the look up table (LUT-PLF). There are two reasons for this. 1) Values input to the EXP unit will always be either zero or negative, because the maximum input value is subtracted from each input in block 2. 2) Since e^0 / e^{-8} \approx 2980.958 and e^{-8} = 0.000335, it is deemed safe to ignore this small value.

For the 16-bit floating point exponential unit, the LUT-PLF is built to store the 16-bit floating point value of the slope a_n and the pre-computed (y_m^n - a_n \times x_m^n) for 64 equally divided intervals in the data range of [-8, 0]. An input x is converted from floating point to a fixed point format that is used to select the LUT-PLF entry closest to x. That is followed by a multiplication (a_n \times x) and an addition to compute f^n(x) as per Equation 3.
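The sketch below is a behavioural model of the LUT-PLF idea, under the stated assumptions: 64 uniform intervals over [-8, 0], each entry holding a slope a_n and the pre-computed offset y_m^n - a_n*x_m^n. The secant (chord) slopes and plain Python floats are our simplifications; the actual unit operates on float16/fixed-point data.

```python
import math

LO, HI, DEPTH = -8.0, 0.0, 64
STEP = (HI - LO) / DEPTH

# Each LUT entry stores (slope a_n, offset y_m^n - a_n * x_m^n) for one interval,
# so evaluation is a single multiply and add: f(x) = a_n * x + offset_n (Eq. 3).
LUT_PLF = []
for n in range(DEPTH):
    x_m, x_p = LO + n * STEP, LO + (n + 1) * STEP
    a_n = (math.exp(x_p) - math.exp(x_m)) / (x_p - x_m)   # chord slope over the interval
    LUT_PLF.append((a_n, math.exp(x_m) - a_n * x_m))

def exp_plf(x):
    """Approximate e^x for x <= 0; values at or below -8 fall back to the
    interval at -8 (roughly e^-8 = 0.000335, treated as negligible)."""
    if x <= LO:
        a_n, off = LUT_PLF[0]
        return a_n * LO + off
    idx = min(int((x - LO) / STEP), DEPTH - 1)
    a_n, off = LUT_PLF[idx]
    return a_n * x + off

if __name__ == "__main__":
    worst = max(abs(exp_plf(-8 * k / 1000) - math.exp(-8 * k / 1000)) for k in range(1001))
    print("max absolute error on [-8, 0]:", worst)
```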
D. Natural logarithm unit (block 5)

There are multiple ways to design hardware that computes the natural log [12][3][11]. For the natural logarithm (LOG) unit, we provide the Accuracy knob to choose between two implementations from our generator. The first one is the LOG unit provided by the DesignWare library ([10]). We also provide a second reduced-area, low-accuracy option that follows the ICSILog algorithm mentioned in [11]. For the 16-bit floating point LOG unit, we use a modified LUT-based architecture of the ICSILog algorithm implementation in [1]. A user may also choose a fixed-point data format using the Precision knob. For that, fixed-point LOG units have been implemented as well; they follow a similar architecture with LUTs storing fixed point values, and include a floating to fixed point converter. Since the real-valued logarithm is only defined for positive numbers, a positive floating point number can be represented as:

val = 2^{exp} \times (1.mantissa)

Using the multiplicative property of the logarithm function, we get:

\ln(val) = \ln(2) \times exp + \ln(1.mantissa)    (4)

Fig. 5: Architecture of the float16 LOG unit used by the generator

In Fig. 5 we provide the block diagram of the LOG unit based on Eq. 4. All the exponent bits and the first 6 mantissa bits are used to select the \ln(2) \times exp and \ln(1.mantissa) 16-bit floating point values from the look up tables LUT-EXP and LUT-MANT respectively. The outputs from the look up tables are added to obtain the final output \ln(x).
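As a software analogue of Eq. 4 (an illustration only, not the generator's float16 datapath: we use Python floats and math.frexp instead of the 5-bit exponent / 6-bit mantissa LUT indices described above):

```python
import math

def ln_via_decomposition(val):
    """ln(val) = ln(2) * exp + ln(1.mantissa) for a positive float (Eq. 4).

    math.frexp returns val = m * 2**e with m in [0.5, 1); rescaling m to
    [1, 2) gives the '1.mantissa' significand used by the hardware unit.
    """
    assert val > 0.0, "real-valued logarithm is only defined for positive numbers"
    m, e = math.frexp(val)
    mantissa, exponent = m * 2.0, e - 1          # val = (1.mantissa) * 2**exponent
    # In hardware both terms come from small look-up tables (LUT-EXP, LUT-MANT)
    # indexed by the exponent bits and the leading mantissa bits; here we just
    # compute them directly.
    return math.log(2.0) * exponent + math.log(mantissa)

if __name__ == "__main__":
    for v in (0.000335, 0.75, 1.0, 2980.958):
        assert abs(ln_via_decomposition(v) - math.log(v)) < 1e-12
    print("decomposition matches math.log")
```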
E. Comparison with existing architectures

Table I compares various attributes of our architecture with existing designs. Our architecture overcomes many limitations that are present in other architectures and, through the generator, we provide exploration of various attributes to allow a DNN hardware architect to make informed decisions within the constraints of an application.
Feature | Ours | [3] | [13] | [6] | [4] | [12] | [5] | [8]
Support for any number of input values | Y | Y | N | N | Y | N | N | Y
Hardware increases proportional to input size | G | N | Y | Y | N | Y | Y | N
Needs costly/accurate division unit | N | N | N | N | N | N | N | Y
Uses LOG based modified softmax formula | Y | Y | Y | N | N | Y | Y | N
Uses LUT based EXP or LOG units | G | Y | Y | Y | Y | N | Y | Y
Uses internal storage to store input values for reuse | G | Y | N | N | Y | N | N | Y
Supports fixed and floating point values | G | N | N | N | N | Y | N | N
Is completely serial and hence has high latency | G | Y | N | N | Y | N | N | Y
Is completely parallel and hence has high area | G | N | Y | Y | N | N | Y | N
Redoes subtraction instead of storing temp results | Y | N | N | N | N | N | N | N
Down scaling for EXP ("max - val") | Y | Y | Y | Y | N | Y | N | N
Adder tree used for additions | Y | N | Y | N | N | Y | N | N
Uses stochastic computing methods | N | N | N | N | N | N | Y | N
Applicable to both training and inference | Y | N | Y | N | N | Y | N | N

TABLE I: Comparing the features of various softmax architectures (Y=Yes, N=No, G=Provided through generator for trade-off analysis)

IV. EXPERIMENTAL METHODOLOGY

In this section, we discuss the tools we used to conduct the experiments. The flow of the experiments can be summarized in the following steps:
• prepare the blocks of basic arithmetic
• synthesize, simulate and verify the blocks
• use the generator SoftGen to generate softmax designs
• synthesize, simulate and verify the softmax models

The circuit designs of the first step were already described in Section III. We used Synopsys tools for synthesis. All synthesis is performed with 45nm technology using the FreePDK45 academic library [9]. The area values in our results are post-synthesis and pre-placement/pre-routing areas. We used CACTI [7] to analyze the energy consumption of memory accesses. A single port on-chip memory is assumed to contain the input values required by softmax. Each memory location is wide enough to store the input values required in one memory read, based on the Parallelism knob. We also assume that the read/write latency of the on-chip memory is 1 clock cycle.

A. SoftGen

Figure 6 provides an overview of the flow and architecture of the generator. The inputs to the generator are the values of the various knobs that control different aspects of the softmax architecture described in Section III. The outputs of the generator are a set of Verilog design files, including module definitions of each block and the top-level module. The top-level module puts all the blocks together, along with the control logic. The generator also produces a simple testbench that can be used to verify the sanity of the design. The generator code is available on GitHub at %link hidden for anonymity for double-blind review%. The Makefile available with the generator dumps the design and the testbench, compiles and simulates the code, and generates a CSV file that lists the observed output values from the softmax Verilog design, the expected output values from a Python based CPU model, and the difference between the two.

Fig. 6: Flow and architecture of the softmax generator

The generation is composed of two components: Verilog templates and Python scripts. The Verilog templates contain the skeleton design and testbench corresponding to our architecture, with tags present at various locations to customize the design. The Python scripts process the templates, replace the tags with Verilog code based on the knobs specified when running the generator, and dump Verilog files during the process.

The Python scripts are organized hierarchically to make the generator modular and easily changeable. There are separate generator scripts for the adder tree and the max block. Utility scripts generate the inputs for simulation, the expected outputs, and the CSV containing the difference.
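To illustrate the template-plus-tags mechanism, here is a toy example; the tag names and the skeleton text are ours and are not SoftGen's actual templates.

```python
# Toy illustration of the SoftGen flow: a Verilog skeleton with placeholder
# tags is specialised according to the knob values.
TEMPLATE = """\
// Generated by a toy SoftGen-like script -- illustrative only.
module softmax_top #(parameter WIDTH = <<WIDTH>>, parameter PARALLELISM = <<PARALLELISM>>) (
  input                           clk,
  input  [WIDTH*PARALLELISM-1:0]  in_flat,
  output [WIDTH*PARALLELISM-1:0]  out_flat
);
  // EXP/LOG accuracy: <<ACCURACY>>, intermediate storage: <<STORAGE>>
endmodule
"""

def generate(parallelism=4, precision="float16", accuracy="LUT", storage="NOREG"):
    width = {"int8": 8, "int32": 32, "float16": 16, "float32": 32}[precision]
    tags = {
        "<<WIDTH>>": str(width),
        "<<PARALLELISM>>": str(parallelism),
        "<<ACCURACY>>": accuracy,
        "<<STORAGE>>": storage,
    }
    text = TEMPLATE
    for tag, value in tags.items():     # replace each tag with knob-derived text
        text = text.replace(tag, value)
    return text

if __name__ == "__main__":
    print(generate(parallelism=8, precision="float16"))
```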
B. Design spaces of softmax

With the knob support of SoftGen, we explore the hardware implementation trade-offs of softmax in 4 different aspects:

1) Parallelism: This knob controls the amount of parallelism in the generated design. Currently, this knob can take a value of any power of 2 (including 2^0 = 1). A value of 1 implies a fully serial design. Such a design has one compute unit in each block and consumes the least amount of area, but it takes the largest number of clock cycles. As the value of this knob increases, the design's parallelism increases. That means more compute blocks are added, increasing the area and power consumption. As an example, a value of 4 will generate a design which has more area but smaller latency. This knob is useful to study the trade-off between area, power and delay.
2) Accuracy: This knob controls which EXP and LOG implementations are used in the softmax design. All the blocks in the design, except the EXP and LOG blocks, have full accuracy. For the EXP and LOG blocks, we support choosing between a highly accurate implementation from the Synopsys DesignWare [10] library, or a less accurate implementation using LUTs (as described in Sections III-C and III-D). The LUT based implementations are more area efficient. This knob can be used to study the trade-off between accuracy of results and the area of the design.
3) Precision: This knob controls the precision (data type) of all the compute units used by the design. We currently support 4 data types: int8, int32, float16, float32. This knob is driven by system requirements. For example, it has been shown that for inference, int8 is sufficient, but float16 is more suitable for training. This knob mainly changes the compute blocks in the design; the control logic remains the same. So, the area of the design and the clock frequency are affected by this knob, but not the latency in clock cycles.
4) Storage: As can be seen from the architecture described in Section III, the input values stored in the on-chip memory are required 3 times during the softmax operation - for calculating the max value, for calculating the difference of each input from the max value, and for finally calculating the probabilities. These values can either be read from the on-chip memory whenever required (consuming SRAM access delay and energy every time), or they can be read once from the on-chip memory, stored in registers internal to the softmax unit (consuming area and static power), and used directly. This knob is used to select between these two choices (NOREG or REG), to study the trade-off between delay, area, energy and power. Internal storage is used by the design in [3].
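Taken together, the four knobs span a small design space. The sketch below shows how it could be enumerated; the knob values are taken from the descriptions above, but the enumeration code itself is our illustration, not part of SoftGen.

```python
from itertools import product

# Knob values as described in the text; the enumeration is illustrative only.
KNOBS = {
    "parallelism": [1, 2, 4, 8, 16, 32],
    "accuracy":    ["DW", "LUT"],
    "precision":   ["int8", "int32", "float16", "float32"],
    "storage":     ["NOREG", "REG"],
}

def design_points():
    names = list(KNOBS)
    for combo in product(*(KNOBS[n] for n in names)):
        yield dict(zip(names, combo))

if __name__ == "__main__":
    points = list(design_points())
    print(len(points), "candidate designs, e.g.:", points[0])
```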

C. Implementation of the baseline design

We chose the design from Du et al. [3] as our baseline since, to our knowledge, their design gives the best implementation of [13]. However, we cannot directly compare our designs with the results in their papers because they use a different technology node (65 nm) and a different design library. We instead use their architecture with our design blocks and library to estimate various metrics for their design. An approximation of the baseline design can be generated by our generator with the settings Parallelism=1, Accuracy=LUT, Precision=fixed32, Storage=REG, except for one main difference. The authors of [3] take advantage of the distribution of inputs in softmax layers to avoid some calculations for input values that are out of range, but that limits their circuit's use for training. Instead, we support the full range of input values.

V. RESULTS AND DISCUSSION

In this section, we discuss the observations from the experiments in which we sweep the configurations of parallelism, accuracy and storage, in the first three subsections. We also compare our generated designs with the state-of-the-art architecture; that discussion is in the last subsection.

A. Exploration with the Parallelism knob

For this experiment, we varied the value of the Parallelism knob across 1, 2, 4, 8, 16 and 32. The other knobs were kept fixed (Accuracy=LUT, Storage=mem, Precision=float16). The number of input values used in this experiment was fixed at 1024. The normalized post-synthesis area and the number of cycles consumed by each generated design are plotted in Figure 7, along with the area-delay product. As expected, with increasing parallelism, the number of cycles reduces but the area increases. We see that the area-delay product first reduces and then starts to increase, implying that the design with Parallelism=16 is the best. However, designs with Parallelism values of 8 and 16 have very similar values of area-delay product and hence both are good choices. For larger values of Parallelism, the power consumption of the design can also be expected to increase, because more compute units are working in parallel.

Fig. 7: Trade-off between area and number of cycles with varying values of the Parallelism knob (1024 input values, Accuracy=LUT, Storage=mem, Precision=float16)
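The selection criterion itself is simple bookkeeping; the sketch below reproduces it with placeholder numbers. The (area, cycles) pairs are made up to mirror the qualitative trend described above and are not the measured values behind Figure 7.

```python
# Hypothetical normalized (area, cycles) pairs per Parallelism value --
# stand-ins shaped like the trend in Fig. 7, not the synthesis results.
sweep = {
    1:  (0.06, 1.000),
    2:  (0.08, 0.510),
    4:  (0.12, 0.270),
    8:  (0.20, 0.150),
    16: (0.33, 0.085),
    32: (0.70, 0.055),
}

def area_delay_product(area, cycles):
    return area * cycles

if __name__ == "__main__":
    adp = {p: area_delay_product(a, c) for p, (a, c) in sweep.items()}
    for p in sorted(adp):
        print(f"Parallelism={p:2d}  area-delay product={adp[p]:.4f}")
    print("best parallelism under this metric:", min(adp, key=adp.get))
```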
B. Exploration with the Accuracy knob

To see the effect of the Accuracy knob, we generated two designs - one with LUT based implementations of the EXP and LOG blocks, and another with DesignWare [10] implementations of these blocks. The other knobs were kept fixed (Parallelism=4, Storage=NOREG, Precision=float16). We chose various ranges of inputs, generated random values in those ranges, and fed them to the two designs. We then compared the results against the results obtained from a simple Python based CPU model. Table II shows the comparison. LUT based implementations are less accurate, but this is generally acceptable for DNNs.

Range | Max error with DW | Max error with LUT | Avg error with DW | Avg error with LUT
-0.1 to 0.1 | 8.80E-06 | 5.04E-05 | 7.21E-06 | 3.55E-05
-1 to 1 | 2.40E-06 | 2.90E-04 | 5.31E-07 | 8.38E-05
-10 to 5 | 5.70E-06 | 4.31E-03 | 3.11E-07 | 1.69E-03
5 to 10 | 1.22E-03 | 1.23E-03 | 2.45E-04 | 4.93E-04
-8 to -4 | 5.70E-06 | 7.60E-04 | 6.69E-07 | 2.29E-04
-8 to 8 | 3.77E-03 | 4.65E-03 | 2.45E-04 | 2.05E-03

TABLE II: Accuracy evaluation for DesignWare and LUT-based implementations (512 input values, Parallelism=8, Storage=NOREG, Precision=float16). LUT based implementations are less accurate but generally still acceptable for DNNs.

Table III shows the variation of the area and delay of the whole softmax design with these two Accuracy options. We can see from the first two rows of the table that the LUT based design has a smaller area, while the delay is higher with the design using DesignWare blocks because the DesignWare blocks are not pipelined (our LUT based EXP unit has a pipeline stage in it), so that design can only run at a reduced clock frequency. Since they are available as IP blocks, we could not modify them. We also synthesized the design using LUTs at the maximum frequency at which the design using DesignWare could be synthesized. The area reduced significantly with this optimization, and the power reduced as well.

Design | Cycles | Delay (us) | Power (mW) | Energy (nJ) | Area (um2)
Design with LUT, max freq (294MHz) | 201 | 0.67 | 10.19 | 6.82 | 279711
Design with DW, max freq (250MHz) | 199 | 0.79 | 8.38 | 6.67 | 283300
Design with LUT, iso freq (250MHz) | 201 | 0.80 | 6.87 | 5.52 | 220178

TABLE III: Trade-off between various metrics with different values of the Accuracy knob (512 input values, Parallelism=8, Storage=NOREG, Precision=float16). Power/Energy numbers are from Synopsys Design Vision.

C. Exploration with the Storage knob

There are two values of the Storage knob - NOREG and REG - as described in Section IV. For this experiment, we fix the Parallelism knob to 4, the Accuracy knob to LUT and the Precision knob to float16. We vary the number of inputs from 32 to 1024, and generate two designs for each case - one that re-reads inputs from the on-chip memory whenever required, and another that stores the input values in registers after reading them once. The resulting chart is shown in Figure 8. Registers cause the area of the design to increase significantly with an increasing number of input values, whereas for the design with on-chip memory re-reads the area does not change as we increase the number of input values. We calculated the energy consumed by the additional registers in the design with Storage=REG and the energy consumed by the additional on-chip memory re-reads in the design with Storage=NOREG. The energy consumed for re-reading inputs from the on-chip memory is higher than the energy consumed for reading/writing from/to the internal storage registers. Since the on-chip memory re-read latency can be hidden behind other operations, the delay for the design with on-chip memory re-reads is not different from the delay for the design with registers. For the design with registers, the number of registers required to store the inputs is equal to the number of input values, so the design becomes less flexible. So, there is a trade-off between area, energy and flexibility.

Fig. 8: Area and energy evaluation with different values of the Storage knob with various numbers of input values (Parallelism=4, Accuracy=LUT, Precision=float16)
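A back-of-the-envelope model of that trade-off is sketched below; the energy-per-access numbers are made-up placeholders, not the CACTI values used in the paper, and only the bookkeeping (three uses per input, one SRAM read versus repeated re-reads) follows the text.

```python
def storage_knob_energy(n_inputs, uses_per_input=3,
                        e_sram_read_pj=5.0, e_reg_access_pj=0.5):
    """Compare NOREG (re-read the SRAM every time a value is needed) with
    REG (read the SRAM once, then serve the remaining uses from registers).
    Energy-per-access figures are illustrative placeholders."""
    noreg = n_inputs * uses_per_input * e_sram_read_pj
    reg = n_inputs * (e_sram_read_pj                       # single SRAM read
                      + e_reg_access_pj                     # write into the register
                      + (uses_per_input - 1) * e_reg_access_pj)  # later reuses
    return noreg, reg

if __name__ == "__main__":
    for n in (32, 256, 1024):
        noreg, reg = storage_knob_energy(n)
        print(f"N={n:5d}  NOREG={noreg:8.1f} pJ  REG={reg:8.1f} pJ")
```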
D. Comparison with the state-of-the-art

Table IV compares various metrics of the design from [3] with some variations of the designs generated by our generator. The "Add. energy" column refers to the additional energy consumed because of internal storage registers in the designs with Storage=REG, and the additional energy consumed because of memory re-reads in the designs with Storage=NOREG. We can see that a design with Parallelism=1, Storage=NOREG (second row in the table) is much more area efficient, but consumes more energy. Changing to Parallelism=2 and Storage=NOREG (fourth row) results in a faster design, but with more area consumption.

Design | Area (mm2) | Cycles | Add. energy (pJ)
Design in [3] | 0.807 | 1542 | 830.24
Design with PA=1, ST=NOREG | 0.059 | 1542 | 4351.48
Design with PA=2, ST=REG | 0.828 | 775 | 830.24
Design with PA=2, ST=NOREG | 0.085 | 775 | 4351.48
Design with PA=4, ST=REG | 0.835 | 392 | 830.24
Design with PA=4, ST=NOREG | 0.138 | 392 | 4351.48

TABLE IV: Comparing various metrics for some designs generated by the generator with the design in [3]. PA=Parallelism, ST=Storage, PR=Precision, AC=Accuracy. All designs were synthesized for a clock frequency of 250 MHz, processed 512 input values, have the same precision (fixed32) and have the same accuracy (LUT).

One of the important issues mentioned in [3] is that, in their design, as the number of input values increases, the total computing time increases exponentially, and the time taken by the Max block dominates the total computing time. Figure 9 shows the results from a similar study we conducted using various designs generated by our generator. In this case, the other knobs were Storage=NOREG, Accuracy=LUT, Precision=float32. We can see that these designs can easily be pipelined to handle multiple data sets during training, since we can keep each stage busy at the same time. For larger input sizes, the designs are very balanced: we spend almost equal time in each stage. For smaller input sizes, stage 2 does consume relatively more time, especially with high values of Parallelism, but these scenarios are not very common.

Fig. 9: Cycle consumption in each stage of the generated design with various values of the Parallelism knob (panels: Parallelism = 1, 4, 8, 16; x-axis: number of input values from 16 to 4096, y-axis: percentage of cycles consumed). Stages are defined in Section III.

VI. CONCLUSION

There are many trade-offs in the design of softmax, the multi-category classification layer in neural networks. In this paper, we perform a design trade-off evaluation of softmax using SoftGen, an open-source tool¹ that we created that generates softmax designs by controlling the values of parallelism, accuracy, precision and storage. The architecture used by our generator eliminates shortcomings of existing designs such as limited parallelism, limited precision options, etc. We show the results of trade-off analysis using these knobs in the paper. In terms of parallelism, it is found that the architecture with a parallelism of 16 provides the best area-delay product among all the parallelism values ranging from 1 to 32. It is also observed that LUT-based EXP and LOG units can help make the design more energy and area efficient with almost the same accuracy. Additionally, providing local registers to store the intermediate results is seen to yield energy savings.

This work can be extended in many ways. Currently, we only support input sizes that are a power of 2 (including 2^0 = 1). We plan to add support for other knobs and other values of the existing knobs. While variations of the LOG and EXP units, and bfloat16 or other precision settings, can be added to the framework, this paper presents several important insights on softmax designs and demonstrates a methodology for parameterizable design generation and design space exploration of softmax.

¹ The tool is available at https://github.com/georgewzg95/softmax

VII. ACKNOWLEDGEMENT

We thank all the anonymous reviewers for their detailed comments on the paper. This work was supported in part by the National Science Foundation grant 1763848. Any opinions, findings, conclusions, or recommendations are those of the authors and do not necessarily reflect the views of these funding agencies.

REFERENCES

[1] N. Alachiotis and A. Stamatakis, “Efficient floating-point logarithm unit for FPGAs,” 05 2010, pp. 1–8.
[2] H. Amin, K. M. Curtis, and B. R. Hayes-Gill, “Piecewise linear approximation applied to nonlinear function of a neural network,” IEE Proceedings - Circuits, Devices and Systems, vol. 144, no. 6, pp. 313–317, Dec 1997.
[3] G. Du, C. Tian, Z. Li, D. Zhang, Y. Yin, and Y. Ouyang, “Efficient softmax hardware architecture for deep neural networks,” in Proceedings of the 2019 Great Lakes Symposium on VLSI, ser. GLSVLSI ’19. New York, NY, USA: ACM, 2019, pp. 75–80. [Online]. Available: http://doi.acm.org/10.1145/3299874.3317988
[4] X. Geng, J. Lin, B. Zhao, A. Kong, M. M. S. Aly, and V. Chandrasekhar, “Hardware-aware softmax approximation for deep neural networks,” in Computer Vision – ACCV 2018, C. Jawahar, H. Li, G. Mori, and K. Schindler, Eds. Cham: Springer International Publishing, 2019, pp. 107–122.
[5] R. Hu, B. Tian, S. Yin, and S. Wei, “Efficient hardware architecture of softmax layer in deep neural network,” 11 2018, pp. 1–5.
[6] I. Kouretas and V. Paliouras, “Simplified hardware implementation of the softmax activation function,” 05 2019, pp. 1–4.
[7] HP Labs. (2008) CACTI - an integrated cache and memory access time, cycle time, area, leakage, and dynamic power model. [Online]. Available: https://www.hpl.hp.com/research/cacti/
[8] Z. Li, H. Li, X. Jiang, B. Chen, Y. Zhang, and G. Du, “Efficient FPGA implementation of softmax function for DNN applications,” 11 2018, pp. 212–216.
[9] NCSU. (2018) Freepdk45. [Online]. Available: https://www.eda.ncsu.
edu/wiki/FreePDK45:Contents
[10] Synopsys. (2018) Designware library - datapath and building block ip.
[Online]. Available: https://www.synopsys.com/dw/buildingblock.php
[11] O. Vinyals and G. Friedland, “A hardware-independent fast logarithm
approximation with adjustable accuracy,” in 2008 Tenth IEEE Interna-
tional Symposium on Multimedia, Dec 2008, pp. 61–65.
[12] M. Wang, S. Lu, D. Zhu, J. Lin, and Z. Wang, “A high-speed and low-
complexity architecture for softmax function in deep learning,” 10 2018,
pp. 223–226.
[13] B. Yuan, “Efficient hardware architecture of softmax layer in deep neural
network,” in 2016 29th IEEE International System-on-Chip Conference
(SOCC), Sep. 2016, pp. 323–326.
[14] W. Yuan and Z. Xu, “Fpga based implementation of low-latency floating-
point exponential function,” vol. 2013, 01 2013, pp. 237–240.
