Design Space Exploration for Softmax Implementations
Zhigang Wei, Aman Arora, Pragnesh Patel, Lizy John
The Laboratory for Computer Architecture, Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, United States
zw5259@utexas.edu, aman.kbm@utexas.edu, f20160183@goa.bits-pilani.ac.in, ljohn@ece.utexas.edu
Abstract—Deep Neural Networks (DNN) are crucial components of machine learning in the big data era. Significant effort has been put into the hardware acceleration of convolution and fully-connected layers of neural networks, while not much attention has been paid to the Softmax layer. Softmax is used in terminal classification layers in networks like ResNet, and is also used in intermediate layers in networks like the Transformer. As the speed of other DNN layers keeps improving, efficient and flexible designs for Softmax are required. Given that there are several ways to implement Softmax in hardware, we evaluate various softmax hardware designs and the trade-offs between them. To make the design space exploration more efficient, we also develop a parameterized generator which can produce softmax designs by varying multiple aspects of a base architecture. The aspects, or knobs, are parallelism, accuracy, storage and precision. The goal of the generator is to enable evaluation of trade-offs between area, delay, power and accuracy in the architecture of a softmax unit. We simulate and synthesize the generated designs and present results comparing them with the existing state-of-the-art. Our exploration reveals that the design with parallelism of 16 provides the best area-delay product among designs with parallelism ranging from 1 to 32. It is also observed that look-up table based approximate LOG and EXP units can yield almost the same accuracy as the full LOG and EXP units, while providing area and energy benefits. Additionally, providing local registers for intermediate values is seen to provide energy savings.

Index Terms—Softmax, DNN, Machine Learning, Design Space Exploration

I. INTRODUCTION

Deep Neural Networks (DNN) have become one of the most important technologies for machine learning. There has been rapid development of hardware for accelerating the inference and training processes of DNNs. While most architectures focus on speeding up the convolution and fully-connected layers, only a few researchers have proposed hardware optimizations for the softmax layer, which serves as a key component in DNNs. Therefore, more research is required to explore efficient architectures for softmax.

Softmax is usually used for multi-category classification as the last layer in neural networks like ResNet or MobileNet. It is also used as an activation layer in intermediate layers of some networks, for example the Transformer and the Capsule network.

The major challenge of softmax hardware is to implement efficient exponential units and division units. The naive implementation is not very hardware friendly because it easily causes overflow, requires large amounts of storage, and includes divider and exponential units which are generally costly.
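To make the overflow problem concrete, the sketch below contrasts the naive form of softmax with the max-subtraction form that hardware designs (including our baseline, which subtracts the maximum in block 2) rely on, and with the division-free log-domain form used by architectures like [13]. This is a minimal Python reference model for illustration, not the generator's Verilog.

```python
import numpy as np

def softmax_naive(x):
    """Direct form e^x_i / sum_j e^x_j: exp() overflows for large inputs."""
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    """Max-subtraction form: every exponent argument becomes <= 0,
    so each e^z lies in (0, 1] and cannot overflow."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_logdomain(x):
    """Division-free variant (cf. [13]): the divide is replaced by a
    subtraction of ln(sum) in the log domain, needing only EXP/LOG units."""
    z = x - x.max()
    return np.exp(z - np.log(np.exp(z).sum()))

x = np.array([800.0, 801.0, 802.0])
print(softmax_naive(x))      # [nan nan nan] -- exp(800) overflows to inf
print(softmax_stable(x))     # [0.090 0.245 0.665], approximately
print(softmax_logdomain(x))  # same result, without an explicit divider
```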
Some researchers have proposed architectures for softmax [6] [4] [8] [5] [12]. However, most of these designs can only support a fixed number of inputs, the hardware required increases in proportion to the number of inputs, and they generally support only one precision. Hence, these designs are not flexible. The focus of many prior designs is on providing efficient implementations of the exponent unit, e.g. LUT based [6] or FSM based [5]. Geng et al. [4] use bit-shifts for division. The design in [12] is not pipelined. Li et al. [8] use FIFOs to store all input values, increasing the area significantly. Not all designs support both fixed-point and floating-point data types, limiting their application to either training or inference.

Although an existing design may perform well at one particular accuracy or parallelism in one scenario, that performance may not carry over when architects want to tune the design. Additionally, tuning an existing hardware design may be time-consuming and require a lot of extra work. There are several limitations in the existing softmax hardware designs:

• The support for different parallelism values is poor, which means the performance of these designs does not scale well with increasing input data sizes.
• They do not support various precisions, which limits their designs to either machine learning training or inference.
• Designs may consume large area, while the trade-offs between area and accuracy are not clear.

There exist significant trade-offs in the aspects mentioned above. Different DNNs have different numbers of inputs for the softmax layer. Different accelerators have different budgets for the area and delay of the softmax layer. Different applications have different tolerances for classification accuracy. A one-size-fits-all softmax architecture cannot satisfy all the requirements in a space with such diversity. Ad hoc methods of exploration can leave out efficient architectures, leading to inefficient accelerators. So, we believe a tunable generator that can generate multiple designs with different architectures can be very valuable for performing design space exploration. To the best of our knowledge, no such tool exists in the open-source community. Our contributions in this paper are summarized as follows.
Fig. 1: (a) Naive softmax architecture (as shown in [4]), on the left. (b) Softmax architecture proposed in [13] with the required EXP units, on the right.

Fig. 2: Baseline architecture used for design exploration. The softmax generator allows knobs: PA: Parallelism, PR: Precision, AC: Accuracy, ST: Storage. The diagram shows the hardware with PA=4. The value of ST controls the presence of the buffer and the dotted path.
Fig. 4: Architecture of the float16 EXP unit used by the generator

PLF is generally used to approximate non-linear functions with a small number of linear pieces [2]. The PLF technique approximates the computation of e^x by using a linear equation in each of N continuous intervals uniformly defined over a finite range of x ∈ [x_1^m, x_N^p], with each interval having a slope a_n:

f_n(x) = a_n × (x − x_n^m) + y_n^m = a_n × x + (y_n^m − a_n × x_n^m)    (3)

where x ∈ [x_n^m, x_n^p], y_n^m = e^(x_n^m), and n ∈ [1, N].
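The scheme of Eq. (3) can be sketched in a few lines of Python. The interval count N below is illustrative, the range [−8, 0] matches the one adopted from [3] (discussed next), and the secant slope is one reasonable choice for a_n; none of these are claimed to be the generator's exact table contents.

```python
import numpy as np

N = 32                                # number of linear pieces (illustrative)
x_lo, x_hi = -8.0, 0.0                # valid input range, following [3]
width = (x_hi - x_lo) / N
x_m = x_lo + width * np.arange(N)     # left endpoints x_n^m of the intervals
a = (np.exp(x_m + width) - np.exp(x_m)) / width   # secant slope a_n
c = np.exp(x_m) - a * x_m             # constant term y_n^m - a_n * x_n^m

def exp_plf(x):
    """Piecewise-linear e^x over [-8, 0]; out-of-range inputs are clamped."""
    x = min(max(x, x_lo), x_hi)
    n = min(int((x - x_lo) / width), N - 1)   # interval index
    return a[n] * x + c[n]

print(exp_plf(-1.0), np.exp(-1.0))    # both approximately 0.3679
```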
Following the implementation in [3], input data in the range [−8, 0] is considered valid, and data less than −8 is mapped to the last entry in the look-up table (LUT-PLF). There are two reasons for this. 1) Values input to the EXP unit will always be either zero or negative, because the maximum input value is subtracted from each input in block 2. 2) Since e^0/e^(−8) ≈ 2980.958 and e^(−8) ≈ 0.000335, it is deemed safe to ignore this small value.

Fig. 5: Architecture of the float16 LOG unit used by the generator

In Fig. 5 we provide the block diagram of the LOG unit based on Eq. 4. All the exponent bits and the first 6 mantissa bits are used to select the ln(2)×exponent and ln(1.mantissa) 16-bit floating point values from the look-up tables LUT-EXP (depth 32) and LUT-MANT respectively. The outputs from the look-up tables are added to obtain the final output ln(x).
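The decomposition ln(x) = exponent × ln(2) + ln(1.mantissa) behind Fig. 5 can be checked with a short Python sketch. The LUT depths follow the figure (32 exponent entries, 64 mantissa entries from the top 6 mantissa bits); the bit-slicing code and the restriction to positive, normal float16 inputs are simplifying assumptions.

```python
import numpy as np

LN2 = float(np.log(2.0))
# LUT-EXP: ln(2) * (unbiased exponent) for the 32 float16 exponent codes
lut_exp = [(e - 15) * LN2 for e in range(32)]
# LUT-MANT: ln(1.m) indexed by the top 6 of the 10 mantissa bits (depth 64)
lut_mant = [float(np.log(1.0 + m / 64.0)) for m in range(64)]

def log_fp16(x):
    """ln(x) via the Fig. 5 decomposition; assumes a positive, normal float16."""
    bits = int(np.float16(x).view(np.uint16))
    e = (bits >> 10) & 0x1F   # 5 exponent bits select the LUT-EXP entry
    m = (bits >> 4) & 0x3F    # first 6 mantissa bits select the LUT-MANT entry
    return lut_exp[e] + lut_mant[m]

print(log_fp16(2.5), float(np.log(2.5)))  # both approximately 0.9163
```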
E. Comparison with existing architectures

Table I compares various attributes of our architecture with existing designs. Our architecture overcomes many limitations that are present in other architectures, and through the generator we provide exploration of various attributes to allow a DNN hardware architect to make informed decisions within the constraints of an application.

TABLE I: Comparing the features of various softmax architectures (Y=Yes, N=No, G=Provided through generator for trade-off analysis)

Feature | Ours | [3] | [13] | [6] | [4] | [12] | [5] | [8]
IV. EXPERIMENTAL METHODOLOGY

In this section, we discuss the tools we used to conduct the experiments. The experimental flow can be summarized in the following steps:

• prepare the blocks of basic arithmetic
• synthesize, simulate and verify the blocks
• use the generator SoftGen to generate softmax designs
• synthesize, simulate and verify the softmax models

The circuit designs of the first step were already described in Section III. We used Synopsys tools for synthesis. All synthesis is performed for 45nm technology with the FreePDK45 academic library [9]. The area values in our results are post-synthesis and pre-placement/pre-routing areas. We used CACTI [7] to analyze the energy consumption of memory accesses. A single-port on-chip memory is assumed to contain the input values required by softmax. Each memory location is wide enough to store the input values required in one memory read, based on the parallelism knob (for example, with Parallelism=4 and float16 data, a location would be 4 × 16 = 64 bits wide). We also assume that the latency for a read/write from/to the on-chip memory is 1 clock cycle.
A. SoftGen

Figure 6 provides an overview of the flow and architecture of the generator. The inputs to the generator are the values of various knobs that control different aspects of the softmax architecture described in Section III. The outputs of the generator are a set of Verilog design files, including module definitions for each block and the top-level module. The top-level module puts all the blocks together, along with the control logic. The generator also produces a simple testbench that can be used to verify the sanity of the design. The generator code is available on GitHub at %link hidden for anonymity for double-blind review%. The Makefile available with the generator dumps the design and the testbench, compiles and simulates the code, and generates a CSV file that lists the observed output values from the softmax Verilog design, the expected output values from a Python-based CPU model, and the difference between the two.

Fig. 6: Flow and architecture of the softmax generator

The generator is composed of two components: Verilog templates and Python scripts. The Verilog templates contain the skeleton design and testbench corresponding to our architecture, with tags present at various locations to customize the design. The Python scripts process the templates, replace the tags with Verilog code based on the knobs specified when running the generator, and dump the Verilog files during the process.

The Python scripts are organized hierarchically to make the generator modular and easily changeable. There are separate generator scripts for the adder tree and the max block. The utility scripts generate inputs for the simulation, the expected outputs, and the CSV containing the difference.
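The tag-replacement step can be pictured with a minimal Python sketch. The %TAG% syntax, the tag names, and the module skeleton below are hypothetical stand-ins, not SoftGen's actual template format.

```python
# Hypothetical template with placeholder tags; SoftGen's real tags differ.
TEMPLATE = """module softmax_top #(parameter WIDTH = %DATA_WIDTH%) (
    input  clk,
    input  [WIDTH*%PARALLELISM%-1:0] in_bus,
    output [WIDTH*%PARALLELISM%-1:0] out_bus
);
%EXP_UNIT_INSTANCES%
endmodule
"""

def expand(template, knobs):
    """Replace each tag with Verilog text derived from the knob values."""
    exp_units = "\n".join(
        f"    exp_unit u_exp{i} (.clk(clk) /* ... */);"
        for i in range(knobs["PARALLELISM"]))
    tags = {
        "%DATA_WIDTH%": str(knobs["DATA_WIDTH"]),
        "%PARALLELISM%": str(knobs["PARALLELISM"]),
        "%EXP_UNIT_INSTANCES%": exp_units,
    }
    for tag, verilog in tags.items():
        template = template.replace(tag, verilog)
    return template

print(expand(TEMPLATE, {"DATA_WIDTH": 16, "PARALLELISM": 4}))
```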
B. Design spaces of softmax

With the knob support of SoftGen, we explore the hardware implementation trade-offs of softmax in 4 different aspects:

1) Parallelism: This knob controls the amount of parallelism in the generated design. Currently, this knob can take a value of any power of 2 (including 2^0 = 1). A value of 1 implies a fully serial design. Such a design has one compute unit in each block and consumes the least amount of area, but it takes the largest number of clock cycles. As the value of this knob increases, the design's parallelism increases: more compute blocks are added, increasing the area and power consumption. As an example, a value of 4 will generate a design which has more area but smaller latency. This knob is useful to study the trade-off between area, power and delay.

2) Accuracy: This knob controls which EXP and LOG implementations are used in the softmax design. All the blocks in the design, except the EXP and LOG blocks, have full accuracy. For the EXP and LOG blocks, we support choosing between a highly accurate implementation from the Synopsys DesignWare [10] library, or a less accurate implementation using LUTs (as described in Sections III-C and III-D). The LUT based implementations are more area efficient. This knob can be used to study the trade-off between the accuracy of results and the area of the design.

3) Precision: This knob controls the precision (data type) of all the compute units used by the design. We currently support 4 data types: int8, int32, float16, float32. This knob is driven by system requirements. For example, it has been shown that for inference, int8 is sufficient, but float16 is more optimal for training. This knob mainly changes the compute blocks in the design; the control logic remains the same. So, the area of the design and the clock frequency are affected by this knob, but not the latency in clock cycles.

4) Storage: As mentioned in Section III, input values stored in the on-chip memory are required 3 times during the softmax operation: for calculating the max value, for calculating the difference of each input from the max value, and for finally calculating the probabilities. These values can either be read from the on-chip memory whenever required (consuming SRAM access delay and energy every time), or they can be read once from the on-chip memory and stored in registers internal to the softmax unit (consuming area and static power) and used directly. This knob is used to select between these two choices (NOREG or REG), to study the trade-off between delay, area, energy and power. Internal storage is used by the design in [3].

The design in [3] is the state-of-the-art design, to the best of our knowledge. However, we cannot directly compare our designs with the results in their paper because they use a different technology node (65 nm) and a different design library. We instead use their architecture with our design blocks and library to estimate various metrics for their design. An approximation of the baseline design can be generated by our generator with the settings Parallelism=1, Accuracy=LUT, Precision=fixed32, Storage=REG, except for one main difference. The authors of [3] take advantage of the distribution of inputs in softmax layers to avoid some calculations for input values that are out of range, but that limits their circuit's use for training. Instead, we support the full range of input values.

V. RESULTS AND DISCUSSION

In this section, we discuss the observations from the experiments in which we sweep the configurations of parallelism, accuracy and storage, covered in the first three subsections. We also compare our generated designs with the state-of-the-art architecture in the last subsection.

A. Exploration with the Parallelism knob

For this experiment, we varied the value of the Parallelism knob across 1, 2, 4, 8, 16 and 32. The other knobs were kept fixed (Accuracy=LUT, Storage=mem, Precision=float16). The number of input values used in this experiment was fixed at 1024. The normalized post-synthesis area and the number of cycles consumed by each generated design are plotted in Figure 7, along with the area-delay product. As expected, with increasing parallelism the number of cycles reduces, but the area increases. We see the area-delay product reduce and then start to increase, implying that the design with Parallelism=16 is the best. However, the designs with Parallelism values of 8 and 16 have very similar area-delay products, and hence both are good choices. For larger values of Parallelism, the power consumption of the design can also be expected to increase, because more compute units are working in parallel.

Fig. 7: Trade-off between area and number of cycles with varying values of the Parallelism knob (1024 input values, Accuracy=LUT, Storage=mem, Precision=float16). The plot shows normalized area, delay and the area-delay product for Parallelism values 1 to 32.
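The shape of the curves in Figure 7 can be reasoned about with a first-order cycle model: the inputs are processed in three passes (max, exponentiate/accumulate, normalize), each streaming the N values through PA parallel lanes. The sketch below is a rough approximation for building intuition, with an assumed fixed pipeline overhead; it is not the generator's exact cycle count.

```python
from math import ceil

def cycles_estimate(n_inputs, pa, pipeline_overhead=10):
    """First-order model: three passes over the data, each moving
    ceil(N/PA) words, plus an assumed fixed fill/drain overhead."""
    return 3 * ceil(n_inputs / pa) + pipeline_overhead

for pa in (1, 2, 4, 8, 16, 32):
    print(pa, cycles_estimate(1024, pa))
```

Under such a model, cycles shrink roughly as 1/PA while area grows roughly linearly with PA, so the area-delay product flattens once fixed overheads start to dominate; this is consistent with the minimum observed around Parallelism=16.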
B. Exploration with the Accuracy knob

Table III shows the variation of the area and delay of the whole softmax design with the two Accuracy options (LUT-based and DesignWare). We can see from the first two rows of the table that the LUT based design has a smaller area, while the delay is higher for the design using DesignWare blocks: the DesignWare blocks are not pipelined (our LUT based EXP unit has a pipeline stage in it), so that design could only run at a reduced clock frequency. Since they are available as IP blocks, we could not modify them. We also synthesized the design using LUTs at the maximum frequency at which the design using DesignWare could be synthesized. The area reduced significantly with this optimization, and the power reduced as well.

TABLE III: Trade-off between various metrics with different values of the Accuracy knob (512 input values, Parallelism = 8, Storage=NOREG, Precision=float16). Power/Energy numbers are from Synopsys Design Vision.

Design                              | Cycles | Delay (us) | Power (mW) | Energy (nJ) | Area (um2)
Design with LUT, max freq (294 MHz) | 201    | 0.67       | 10.19      | 6.82        | 279711
Design with DW, max freq (250 MHz)  | 199    | 0.79       | 8.38       | 6.67        | 283300
Design with LUT, iso freq (250 MHz) | 201    | 0.80       | 6.87       | 5.52        | 220178

C. Exploration with the Storage knob

There are two values of the Storage knob, NOREG and REG, as described in Section IV. For this experiment, we fix the Parallelism knob to 4, the Accuracy knob to LUT and the Precision knob to float16. We vary the number of inputs from 32 to 1024, and generate two designs for each case: one with Storage=REG and one with Storage=NOREG. The resulting area and energy are shown in Figure 8.

Fig. 8: Area and energy evaluation with different values of the Storage knob and various numbers of input values (Parallelism = 4, Accuracy=LUT, Precision=float16)
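The intuition behind this knob can be captured in a back-of-the-envelope energy model: with NOREG, each input is re-read from the SRAM for the second and third passes, while with REG it is read once and then served from internal registers. The per-access energies below are placeholders for illustration; the paper's actual numbers come from CACTI [7].

```python
# Rough model of the Storage knob's energy trade-off; the per-access
# energies are hypothetical placeholders, not CACTI [7] outputs.
E_SRAM_READ = 5.0    # pJ per on-chip SRAM read  (assumed)
E_REG_ACCESS = 0.2   # pJ per internal register read/write (assumed)

def extra_energy_pj(n_inputs, storage):
    """Energy beyond the mandatory first SRAM read of each input value."""
    if storage == "NOREG":
        return n_inputs * 2 * E_SRAM_READ   # two SRAM re-reads per value
    return n_inputs * 3 * E_REG_ACCESS      # one register write + two reads

for st in ("NOREG", "REG"):
    print(st, extra_energy_pj(512, st), "pJ")  # NOREG pays far more energy
```

This mirrors the "Add. energy" column of Table IV below, where the NOREG designs pay a large re-read energy and the REG designs pay a much smaller register-access energy (plus register area and static power).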
D. Comparison with the state-of-the-art

Table IV compares various metrics of the design from [3] with some variations of the designs generated by our generator. The "Add. energy" column refers to the additional energy consumed because of the internal storage registers in the designs with Storage=REG, and the additional energy consumed because of memory re-reads in the designs with Storage=NOREG. We can see that the design with Parallelism=1, Storage=NOREG (second row in the table) is much more area efficient, but consumes more energy. Changing to Parallelism=2 and Storage=NOREG (fourth row) results in a faster design, but with more area consumption.

TABLE IV: Comparing various metrics for some designs generated by the generator with the design in [3]. PA=Parallelism, ST=Storage, PR=Precision, AC=Accuracy. All designs were synthesized for a clock frequency of 250 MHz, processed 512 input values, have the same precision (fixed32) and have the same accuracy (LUT).

Design                     | Area (mm2) | Cycles | Add. energy (pJ)
Design in [3]              | 0.807      | 1542   | 830.24
Design with PA=1, ST=NOREG | 0.059      | 1542   | 4351.48
Design with PA=2, ST=REG   | 0.828      | 775    | 830.24
Design with PA=2, ST=NOREG | 0.085      | 775    | 4351.48
Design with PA=4, ST=REG   | 0.835      | 392    | 830.24
Design with PA=4, ST=NOREG | 0.138      | 392    | 4351.48

One of the important issues mentioned in [3] is that in their design, as the number of input values increases, the total computing time increases exponentially, and the time taken by the Max block dominates the total computing time. Figure 9 shows the results from a similar study we conducted using various designs generated by our generator. In this case, the other knobs were Storage=NOREG, Accuracy=LUT, Precision=float32. We can see that these designs are easily pipelineable to handle multiple data sets during training, since we can keep each stage busy at the same time. For larger input sizes, the designs are very balanced: we spend almost equal time in each stage. For smaller input sizes, stage 2 does consume relatively more time, especially with high values of Parallelism, but these scenarios are not very common.

Fig. 9: Cycle consumption in each stage of the generated design for various values of the Parallelism knob (panels: Parallelism = 1, 4, 8 and 16; x-axis: number of input values, from 16 to 4096; y-axis: percentage of cycles consumed in Stages 1-3). Stages are defined in Section III-E.
VI. CONCLUSION

There are many trade-offs in the design of softmax, the multi-category classification layer in neural networks. In this paper, we perform a design trade-off evaluation of softmax using SoftGen, an open-source tool¹ that we created, which generates softmax designs by controlling the values of parallelism, accuracy, precision and storage. The architecture used by our generator eliminates shortcomings of existing designs such as limited parallelism and limited precision options. We show the results of trade-off analysis using these knobs in the paper. In terms of parallelism, it is found that the architecture with parallelism of 16 provides the best area-delay product among parallelism values ranging from 1 to 32. It is also observed that LUT-based EXP and LOG units can help to make the design more energy and area efficient with almost the same accuracy. Additionally, providing local registers to store the intermediate results is seen to yield energy savings.

This work can be extended in many ways. Currently, we only support input sizes that are a power of 2 (including 2^0 = 1). We plan to add support for other knobs and other values of the existing knobs. While variations of LOG and EXP units, and bfloat16 or other precision settings, can be added to the framework, this paper presents several important insights on softmax designs and demonstrates a methodology for parameterizable design generation and design space exploration of softmax.

¹The tool is available at https://github.com/georgewzg95/softmax

VII. ACKNOWLEDGEMENT

We thank all the anonymous reviewers for their detailed comments on the paper. This work was supported in part by National Science Foundation grant 1763848. Any opinions, findings, conclusions, or recommendations are those of the authors and do not necessarily reflect the views of these funding agencies.
REFERENCES

[1] N. Alachiotis and A. Stamatakis, "Efficient floating-point logarithm unit for FPGAs," May 2010, pp. 1–8.
[2] H. Amin, K. M. Curtis, and B. R. Hayes-Gill, "Piecewise linear approximation applied to nonlinear function of a neural network," IEE Proceedings - Circuits, Devices and Systems, vol. 144, no. 6, pp. 313–317, Dec 1997.
[3] G. Du, C. Tian, Z. Li, D. Zhang, Y. Yin, and Y. Ouyang, "Efficient softmax hardware architecture for deep neural networks," in Proceedings of the 2019 Great Lakes Symposium on VLSI, ser. GLSVLSI '19. New York, NY, USA: ACM, 2019, pp. 75–80. [Online]. Available: http://doi.acm.org/10.1145/3299874.3317988
[4] X. Geng, J. Lin, B. Zhao, A. Kong, M. M. S. Aly, and V. Chandrasekhar, "Hardware-aware softmax approximation for deep neural networks," in Computer Vision – ACCV 2018, C. Jawahar, H. Li, G. Mori, and K. Schindler, Eds. Cham: Springer International Publishing, 2019, pp. 107–122.
[5] R. Hu, B. Tian, S. Yin, and S. Wei, "Efficient hardware architecture of softmax layer in deep neural network," Nov 2018, pp. 1–5.
[6] I. Kouretas and V. Paliouras, "Simplified hardware implementation of the softmax activation function," May 2019, pp. 1–4.
[7] HP Labs. (2008) CACTI - an integrated cache and memory access time, cycle time, area, leakage, and dynamic power model. [Online]. Available: https://www.hpl.hp.com/research/cacti/
[8] Z. Li, H. Li, X. Jiang, B. Chen, Y. Zhang, and G. Du, "Efficient FPGA implementation of softmax function for DNN applications," Nov 2018, pp. 212–216.
[9] NCSU. (2018) FreePDK45. [Online]. Available: https://www.eda.ncsu.edu/wiki/FreePDK45:Contents
[10] Synopsys. (2018) DesignWare library - datapath and building block IP. [Online]. Available: https://www.synopsys.com/dw/buildingblock.php
[11] O. Vinyals and G. Friedland, "A hardware-independent fast logarithm approximation with adjustable accuracy," in 2008 Tenth IEEE International Symposium on Multimedia, Dec 2008, pp. 61–65.
[12] M. Wang, S. Lu, D. Zhu, J. Lin, and Z. Wang, "A high-speed and low-complexity architecture for softmax function in deep learning," Oct 2018, pp. 223–226.
[13] B. Yuan, "Efficient hardware architecture of softmax layer in deep neural network," in 2016 29th IEEE International System-on-Chip Conference (SOCC), Sep 2016, pp. 323–326.
[14] W. Yuan and Z. Xu, "FPGA based implementation of low-latency floating-point exponential function," vol. 2013, Jan 2013, pp. 237–240.