Convolution Optimization For DNN
Abstract— As convolution contributes most operations in convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.

Index Terms— Accelerator architectures, convolutional neural networks (CNNs), field-programmable gate array (FPGA), neural network hardware.

I. INTRODUCTION

…when compared to software implementations on multicore processors with GPUs [10], [12], [13], [17]. This is due to the fact that modern FPGAs allow customization of the architecture and can exploit the availability of hundreds to thousands of on-chip DSP blocks. However, significant challenges remain in mapping CNNs onto FPGAs. The state-of-the-art CNNs require a large number (>1 billion) of computationally intensive tasks (e.g., matrix multiplications on large numbers), involving a very large number of weights (>50 million) [4], [5]. Deep CNN algorithms have tens to hundreds of layers, with significant differences between layers in terms of sizes and configurations. The limited computational resources and storage capacity on FPGAs make the task of optimal mapping of CNNs (e.g., minimizing latency subject to energy constraints or vice versa) a complex and multidimensional optimization problem. The high cost of off-chip communication is another major impediment to achieving higher performance and lower energy. In fact, the energy cost associated with the large amount of data movement and memory accesses often exceeds the energy consumption of the computations [8], [20]. For these reasons, energy-efficient hardware acceleration of CNNs on an FPGA requires simultaneous maximization of resource utilization and data reuse, and minimization of data communication.

More than 90% of the operations in a CNN involve convolution…
Fig. 1. Four levels of convolution loops, where L denotes the index of convolution layer and S denotes the sliding stride [15].
TABLE I
CONVOLUTION LOOP DIMENSIONS AND HARDWARE DESIGN VARIABLES
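For reference alongside Fig. 1 and Table I, the sketch below spells out the four convolution loop levels in plain Python using the paper's notation (Nif input and Nof output feature maps, Nkx × Nky kernel windows, Nox × Noy output pixels, sliding stride S). The nesting order and array shapes here are illustrative assumptions, not the proposed dataflow.

    import numpy as np

    def conv_layer(pixels, weights, S=1):
        # pixels:  (Nif, Nix, Niy) input feature maps
        # weights: (Nof, Nif, Nkx, Nky) kernel maps
        Nif, Nix, Niy = pixels.shape
        Nof, _, Nkx, Nky = weights.shape
        Nox, Noy = (Nix - Nkx) // S + 1, (Niy - Nky) // S + 1
        out = np.zeros((Nof, Nox, Noy))
        for of in range(Nof):                    # Loop-4: across output feature maps
            for ox in range(Nox):                # Loop-3: scan within one feature map
                for oy in range(Noy):
                    for fi in range(Nif):        # Loop-2: across input feature maps
                        for kx in range(Nkx):    # Loop-1: MAC within one kernel window
                            for ky in range(Nky):
                                out[of, ox, oy] += (weights[of, fi, kx, ky]
                                                    * pixels[fi, S * ox + kx, S * oy + ky])
        return out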
…the Pix × Piy parallel multiplications contribute to independent Pix × Piy output pixels; Pix × Piy accumulators are used to serially accumulate the multiplier outputs, and no adder tree is needed.

d) Loop-4 unrolling (Fig. 7): In every cycle, one pixel is multiplied by Pof weights at the same (x, y) location but from Pof different kernel maps, and this pixel is reused Pof times. The computing structure is identical to unrolling Loop-3 using Pof multipliers and accumulators without an adder tree.

The unrolling variable values of the four convolution loops collectively determine the total number of parallel MAC operations as well as the number of required multipliers (Pm)

Pm = Pkx × Pky × Pif × Pix × Piy × Pof. (4)

2) Loop Tiling: On-chip memory of FPGAs is not always large enough to store the entire data of deep CNN algorithms. Therefore, it is reasonable to use denser external DRAMs to store the weights and the intermediate pixel results of all layers.

Loop tiling is used to divide the entire data into multiple blocks, which can be accommodated in the on-chip buffers, as illustrated in Fig. 8. With proper assignments of the loop tiling size, the locality of data can be increased to reduce the number of DRAM accesses, which incur long latency and high power consumption. The loop tiling sets the lower bound on the required on-chip buffer size. The required size of the input pixel buffer is Tix × Tiy × Tif × (pixel_datawidth). The size of the weight buffer is Tkx × Tky × Tif × Tof × (weight_datawidth). The size of the output pixel buffer is Tox × Toy × Tof × (pixel_datawidth).

Fig. 8. Loop tiling determines the size of data stored in on-chip buffers.

A. Computing Latency

The number of multiplication operations per layer (Nm) is

Nm = Nif × Nkx × Nky × Nof × Nox × Noy. (5)

Ideally, the number of computing cycles per layer should be Nm/Pm, where Pm is the number of multipliers. However, for different loop unrolling and tiling sizes, the multipliers cannot necessarily be fully utilized for every convolution dimension. The number of actual computing cycles per layer is

#_cycles = #inter_tile_cycles × #intra_tile_cycles (6)

where

#inter_tile_cycles = ⌈Nif/Tif⌉ × ⌈Nkx/Tkx⌉ × ⌈Nky/Tky⌉ × ⌈Nof/Tof⌉ × ⌈Nox/Tox⌉ × ⌈Noy/Toy⌉ (7)

#intra_tile_cycles = ⌈Tif/Pif⌉ × ⌈Tkx/Pkx⌉ × ⌈Tky/Pky⌉ × ⌈Tof/Pof⌉ × ⌈Tox/Pox⌉ × ⌈Toy/Poy⌉. (8)

Here, we assume that the multipliers receive input data continuously without idle cycles. If the ratio of N∗ to T∗ or T∗ to P∗ is not an integer, the multipliers or the external memory transactions are not fully utilized. In addition to the computing latency, the memory transfer delay must also be considered for the overall system latency.

B. Partial Sum Storage

A partial sum (psum) is the intermediate result of the inner-product operation that needs to be accumulated over several cycles to obtain one final output data. Therefore, partial sums need to be stored in memory for the next few cycles and sometimes have to be moved between PEs. An efficient acceleration strategy has to minimize the number of partial sums and process them locally as soon as possible to reduce data movements.

The flowchart to calculate the number of partial sums stored in memory (#psum) is shown in Fig. 9. To obtain one final output pixel, we need to finish Loop-1 and Loop-2. Therefore, if both Loop-1 and Loop-2 are fully unrolled, the final output pixel can be obtained right after the inner-product operations with minimal #psum. If the loop tile size can cover all pixels and weights in Loop-1 (Tkx = Nkx and Tky = Nky) and Loop-2 (Tif = Nif), then the partial sums can be consumed…
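As a worked illustration of the latency model in (5)-(8) above, the helper below computes the ideal cycle count Nm/Pm and the actual count from inter-tile and intra-tile cycles. The ceiling-division form is our reading of the tiling ratios, and Pox/Poy here play the role of Pix/Piy in (4); when an N∗/T∗ or T∗/P∗ ratio is not an integer, the actual count exceeds the ideal one, reflecting idle multipliers or partially filled memory transactions.

    from math import ceil, prod

    LOOP_DIMS = ('if', 'kx', 'ky', 'of', 'ox', 'oy')

    def conv_cycles(N, T, P):
        """N, T, P: dicts of loop dimensions, tiling sizes, and unrolling sizes,
        keyed by the dimension names in LOOP_DIMS."""
        Nm = prod(N[k] for k in LOOP_DIMS)                  # (5) multiplications per layer
        Pm = prod(P[k] for k in LOOP_DIMS)                  # (4) parallel multipliers
        inter = prod(ceil(N[k] / T[k]) for k in LOOP_DIMS)  # (7) inter-tile cycles
        intra = prod(ceil(T[k] / P[k]) for k in LOOP_DIMS)  # (8) intra-tile cycles
        return Nm / Pm, inter * intra                       # ideal vs. actual (6)

For example, a Toy that does not divide Noy wastes part of the last tile, so the second returned value exceeds the first.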
…level of parallelism can be achieved even for the largest FPGA available, with ∼3600 DSP slices. By this means, a uniform configuration and structure of PEs can be applied for all the convolution layers.

Loop tiling has been used in prior hardware CNN accelerators to fit the large-scale CNN models into limited on-chip buffers. However, only a few prior works [13], [18] have shown the tiling configurations that determine their on-chip buffer sizes, and the tradeoff between the loop tiling size and the number of external memory accesses has not been explored. The impact of loop interchange has not been rigorously studied in prior works, although it can greatly affect the number of partial sums as well as the resulting data movements and memory accesses.

V. PROPOSED ACCELERATION SCHEME

The optimization process of our proposed acceleration scheme is presented in this section, which includes the appropriate selection of the convolution loop design variables.

A. Minimizing Computing Latency

We set the variables P∗ to be common factors of T∗ for all the convolution layers to fully utilize the PEs, and T∗ to be common factors of N∗ to make full use of external memory transactions. For CNN models with only small common factors, it is recommended to set ⌈N∗/T∗⌉ − N∗/T∗ and ⌈T∗/P∗⌉ − T∗/P∗ as small as possible to minimize the inefficiency caused by the differences in sizes across CNN models.

B. Minimizing Partial Sum Storage

To reduce the number and movements of partial sums, both Loop-1 and Loop-2 should be computed as early as possible or unrolled as much as possible. To avoid the drawbacks of unrolling Loop-1 discussed in Section IV and to maximize the data reuse discussed in Section III-C, we decide to unroll Loop-3 (Pox > 1 or Poy > 1) and Loop-4 (Pof > 1). By this means, we cannot attain the minimum partial sum storage, as in (9.1) inside Fig. 9.

Constrained by 1 ≤ P∗ ≤ T∗ ≤ N∗, the second smallest partial sum storage is achieved by (9.2) among (9.2)–(9.9) inside Fig. 9. To satisfy the condition for (9.2), we serially compute Loop-1 and Loop-2 first and ensure that the required data of Loop-1 and Loop-2 are buffered, i.e., Tkx = Nkx, Tky = Nky, and Tif = Nif. Therefore, we only need to store Pof × Pox × Poy partial sums, which can be retained in local registers with minimum data movements.

C. Minimizing Access of On-Chip Buffer

The number of on-chip buffer accesses is minimized by unrolling Loop-3 to reuse weights as shown in (12) and unrolling Loop-4 to reuse pixels as shown in (14). As our partial sums are kept in local registers, they do not add overhead to the buffer access and storage.

D. Minimizing Access of External Memory

As we first compute Loop-1 and Loop-2 to reduce partial sums, we cannot achieve the minimum number of DRAM accesses described in (10.1) and (10.3) inside Fig. 10, where neither the pixels nor the weights are fully buffered for one convolution layer. Therefore, we can only attain the minimum DRAM access by assigning a sufficient buffer size for either all pixels or all weights of each layer, as in (10.8) inside Fig. 10. Then, the optimization of minimizing the on-chip buffer size while having minimum DRAM access is formulated as

min bits_BUF_px_wt
s.t. #Tile_px_L = 1 or #Tile_wt_L = 1, ∀L ∈ [1, #CONVs] (20)

where #Tile_px_L and #Tile_wt_L denote the number of tiling blocks for input pixels and weights of layer L, respectively, and #CONVs is the number of convolution layers. bits_BUF_px_wt is the sum of the pixel buffer size (bits_BUF_px) and the weight buffer size (bits_BUF_wt), which are given by

bits_BUF_px_wt = bits_BUF_px + bits_BUF_wt. (21)

Both pixel and weight buffers need to be large enough to cover the data in one tiling block for all the convolution layers. This is expressed as

bits_BUF_px = MAX(words_px_L) × pixel_datawidth, with L ∈ [1, #CONVs] (22)
bits_BUF_wt = MAX(words_wt_L) × weight_datawidth, with L ∈ [1, #CONVs] (23)

where words_px_L and words_wt_L denote the number of pixels and weights of one tiling block in layer L, respectively. These are expressed in terms of the loop tiling variables as follows:

words_px_L = Tix_L × Tiy_L × Tif_L + Tox_L × Toy_L × Tof_L (24)
words_wt_L = Tof_L × Tif_L × Tkx_L × Tky_L (25)

where words_px_L comprises both input and output pixels. The number of tiles in (20) is also determined by the T∗ variables

#Tile_px_L = ⌈Nif_L/Tif_L⌉ × ⌈Nox_L/Tox_L⌉ × ⌈Noy_L/Toy_L⌉ (26)
#Tile_wt_L = ⌈Nkx_L/Tkx_L⌉ × ⌈Nky_L/Tky_L⌉ × ⌈Nif_L/Tif_L⌉ × ⌈Nof_L/Tof_L⌉. (27)

By solving (20), we can find an optimal configuration of the T∗ variables that results in minimum DRAM access and on-chip buffer size. However, since we have already set Tkx = Nkx, Tky = Nky, and Tif = Nif as in Section V-B, we can only achieve a suboptimal solution by tuning Tox, Toy, and Tof, resulting in a larger buffer size requirement. If the available on-chip memory is sufficient, we set Tox = Nox so that an entire row can be buffered to benefit the direct memory access (DMA) transactions with continuous data.
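A minimal sketch of the bookkeeping behind (20)-(27): for a candidate tiling of each layer it computes the pixel and weight words of one tile and the number of tiles, checks the per-layer constraint of (20), and returns the total buffer requirement of (21)-(23). The class and field names are our own shorthand, ceiling division is assumed for the tile counts, and the 16-bit datawidths are placeholders only.

    from math import ceil
    from dataclasses import dataclass

    @dataclass
    class LayerTiling:
        # loop dimensions N* and tiling sizes T* of one convolution layer
        Nif: int; Nof: int; Nox: int; Noy: int; Nkx: int; Nky: int
        Tif: int; Tof: int; Tix: int; Tiy: int; Tox: int; Toy: int; Tkx: int; Tky: int

    def words_and_tiles(t):
        words_px = t.Tix * t.Tiy * t.Tif + t.Tox * t.Toy * t.Tof                    # (24)
        words_wt = t.Tof * t.Tif * t.Tkx * t.Tky                                    # (25)
        tiles_px = ceil(t.Nif / t.Tif) * ceil(t.Nox / t.Tox) * ceil(t.Noy / t.Toy)  # (26)
        tiles_wt = (ceil(t.Nkx / t.Tkx) * ceil(t.Nky / t.Tky)
                    * ceil(t.Nif / t.Tif) * ceil(t.Nof / t.Tof))                    # (27)
        return words_px, words_wt, tiles_px, tiles_wt

    def bits_BUF_px_wt(layers, pixel_datawidth=16, weight_datawidth=16):
        stats = [words_and_tiles(t) for t in layers]
        # constraint of (20): every layer fully buffers either its pixels or its weights
        assert all(tp == 1 or tw == 1 for _, _, tp, tw in stats)
        return (max(s[0] for s in stats) * pixel_datawidth       # (22)
                + max(s[1] for s in stats) * weight_datawidth)   # (23), summed as in (21)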
Fig. 11. To guarantee minimum DRAM accesses, either all pixels (blue bars) are covered by pixel buffers (blue dashed lines) or all weights are covered by weight buffers in one layer. Then, we try to lower the total buffer sizes/lines. (a) Pixels and weights distribution of convolution layers in VGG-16. (b) Pixels and weights distribution of convolution layers in ResNet-50.

Finally, we have to solve (20) by searching Toy and Tof, because it has a nonlinear objective function and constraints with integer variables. Since Toy and Tof in VGG-16 consist of 2 × #CONVs = 26 variables and each variable can have about four candidate values constrained by T∗/P∗ = integer and N∗/T∗ = integer, the total number of Toy and Tof configurations is about 4^26 = 4.5 × 10^15, which becomes an enormous solution space. In ResNet-50/ResNet-152, #CONVs increases to 53 and 155, respectively, which makes the solution space even larger, about 4^106 = 6.6 × 10^63 and 4^310 = 4.4 × 10^186, respectively. Therefore, it is impossible to enumerate all the candidate solutions.

In this paper, we propose to empirically find a satisfactory solution for a given on-chip memory capacity that takes advantage of a property of CNNs. CNNs normally have a large pixel data volume and small weight sizes in the beginning few layers. As we proceed into deeper layers, the pixel sizes become smaller with extracted features, and the weight sizes become larger with more channels. This trend is illustrated in Fig. 11, where the bars denote data sizes in each convolution layer. To benefit from this data distribution property across layers, we only need to make the pixel buffers fully cover the last few layers and the weight buffers fully cover the beginning few layers. Then, the middle layers with both relatively large pixel and weight sizes become the constraints of the buffer sizes, and we only need to take care of these bounding layers, which significantly shrinks the solution space. The dashed lines in Fig. 11 are the minimal buffer sizes we found while guaranteeing minimum DRAM accesses, and the bounding layers are pointed out by arrows. If this buffer size still cannot fit into the FPGA on-chip memory, then we need to either change the tiling strategy or decrease the buffer sizes at the cost of more DRAM accesses, as discussed in [15].

Fig. 12. Optimized loop unrolling and tiling strategy. The parallelism is within one feature map (Pox × Poy) and across multiple kernels (Pof). The tiling variables Tiy, Toy, and Tof can be tuned to decide the buffer sizes.

E. Optimized Loop Design Variables

According to the aforementioned optimization process, we propose a convolution acceleration scheme for a high-performance and low-communication CNN accelerator, which is visualized in Fig. 12.

1) Loop Unrolling: For all the convolution layers, Loop-1 and Loop-2 are not unrolled, which means Pkx = 1, Pky = 1, and Pif = 1. According to (7) and (8), Pox, Poy, and Pof are set to be common factors of the feature map dimensions (Nox, Noy) and the output channels (Nof), respectively, to fully utilize the multipliers. The configurations of Pox, Poy, and Pof of different CNNs on different FPGAs are listed in Table II, which are largely constrained by the available computing resources. By setting P∗ to be constant across all the convolution layers, a uniform structure and mapping of PEs can be realized to reduce the architecture complexity.

2) Loop Tiling: For loop tiling, we set Tkx = Nkx, Tky = Nky, and Tif = Nif as described in Section V-B and shown in Fig. 12, so that the data used in Loop-1 and Loop-2 are all buffered, and Tox = Nox to benefit DMA transfer. Details of Toy and Tof are described in Section V-D.

3) Loop Interchange: For loop interchange, we first serially compute Loop-1 and then Loop-2 as described in Section V-B. Finally, we compute Loop-3 and Loop-4, where the exact computation order of these two loops does not have a pronounced impact on the cost, based on our P∗ and T∗ choices.

VI. PROPOSED CNN ACCELERATOR

To implement the optimized convolution acceleration scheme in Section V-E, a data router is proposed with high flexibility for different convolution sliding settings, e.g., strides and zero paddings, using variant data buses. A corresponding hardware PE architecture is also designed that minimizes on/off-chip memory accesses and data movements.

A. Data Bus From Buffer to PE (BUF2PE)

In [15] and [16], a register array architecture is designed to rearrange and direct the pixel stream from buffers into PEs. This method takes advantage of the convolution stride being…
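Referring back to the solution-space estimate in Section V-D, the configuration counts quoted there follow directly from about four candidate values per variable and 2 × #CONVs tunable variables (Toy and Tof per layer); a quick check in Python (layer counts as stated in the text):

    # ~4 candidates per variable, 2 variables (Toy, Tof) per convolution layer
    for name, n_convs in [("VGG-16", 13), ("ResNet-50", 53), ("ResNet-152", 155)]:
        print(f"{name}: 4^{2 * n_convs} = {4.0 ** (2 * n_convs):.1e} configurations")
    # VGG-16: 4^26 = 4.5e+15,  ResNet-50: 4^106 = 6.6e+63,  ResNet-152: 4^310 = 4.4e+186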
TABLE II
OUR IMPLEMENTATION OF DIFFERENT CNNS ON DIFFERENT FPGAS

Fig. 15. Convolution acceleration architecture with Pox × Poy × Pof MAC units.

TABLE III
PREVIOUS CNN FPGA IMPLEMENTATIONS
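As a behavioral companion to Fig. 15, the sketch below mimics one cycle of the Pox × Poy × Pof MAC array implied by the proposed unrolling: one weight per kernel is shared by the Pox × Poy window positions (Loop-3 reuse), each pixel is shared by the Pof kernels (Loop-4 reuse), and partial sums stay in local accumulator registers. Data marshaling, strides, and the BUF2PE bus are omitted; this is an assumption-laden sketch, not the RTL.

    import numpy as np

    def mac_array_cycle(acc, pixel_window, weights):
        """One cycle of the Pof x Pox x Poy MAC array of Fig. 15.
        acc:          (Pof, Pox, Poy) local accumulators holding partial sums
        pixel_window: (Pox, Poy) pixels broadcast to all Pof kernels (Loop-4 reuse)
        weights:      (Pof,) one weight per kernel, shared across the window (Loop-3 reuse)
        """
        acc += weights[:, None, None] * pixel_window[None, :, :]  # Pof*Pox*Poy parallel MACs
        return acc

    # Iterating this over Nkx*Nky*Nif cycles (serial Loop-1 and Loop-2) turns each
    # accumulator into one final output pixel, so only Pof*Pox*Poy partial sums exist.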
…us to unroll Loop-3 and Loop-4, which can also achieve high DSP utilization. In [10] and [17], the layer-by-layer computation is pipelined using different parts of the resources of one or multiple FPGAs to improve hardware utilization and thus throughput. However, with the rapidly increasing number of convolution layers [5], it becomes very difficult to map different layers onto different resources and balance the computation among all the pipeline stages. In addition, pipelining can increase the throughput but does not necessarily reduce the latency. Batch computing with multiple input images is applied in [8], [10], [12], [17], and [23]. The biggest advantage of this technique is to share the weights transferred from off-chip DRAM among multiple images and thus increase the throughput, at the cost of increased latency per image and external memory storage of multiple images. Benefiting from batch computing and using 2144 DSP slices, which enables a high degree of parallelism, Li et al. [17] also achieve a high throughput of 565.94 GOPS for AlexNet. In [12], an OpenCL-based CNN accelerator is implemented on an Arria 10 FPGA, where the Intel FPGA SDK for OpenCL provides a pregenerated platform that ensures timing closure at a higher frequency than our RTL design. The Winograd transform is applied for the convolution layers, which reduces the multiplication operations by 2× or improves the throughput by 2× using the same number of DSPs. The 16-b floating-point data format is used with a shared exponent, which allows directly using fixed-point 18-bit × 18-bit multipliers for floating-point operations. Wei et al. [24] proposed an OpenCL-based systolic array architecture to implement convolution on Arria 10, which reduces the global PE interconnect fan-out to achieve high frequency and resource utilization. The VGG-16 throughput of [24] is higher than ours mainly due to: 1) higher frequency; 2) lower precision of weights; and 3) a dual buffer scheme to hide DRAM latency. Guan et al. [23] proposed an RTL–HLS hybrid framework to automatically generate FPGA hardware, which implements convolution and FC layers as matrix multiplication. Although the Stratix-V GSMD5 (with 1590 DSP blocks) used in [23] has 6.2× more DSP blocks than our Stratix-V GXA7, our accelerator on Stratix V can realize 1.2× higher throughput for ResNet-152 through higher hardware (DSP and logic) utilization, obtained by the proposed loop optimization technique and by exploiting logic elements to implement multipliers in addition to DSPs.

VIII. CONCLUSION

In this paper, we present an in-depth analysis of the convolution loop acceleration strategy by numerically characterizing the loop optimization techniques. The relationship between accelerator objectives and design variables is quantitatively investigated. A corresponding new dataflow and architecture is proposed to minimize data communication and enhance throughput. Our CNN accelerator implements end-to-end NiN, VGG-16, and ResNet-50/ResNet-152 CNN models on Stratix V and Arria 10 FPGAs, achieving overall throughputs of 348 GOPS and 715 GOPS, respectively.

REFERENCES

[1] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] M. Lin, Q. Chen, and S. Yan. (Mar. 2014). "Network in network." [Online]. Available: https://arxiv.org/abs/1312.4400
[4] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[6] D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler transformations for high-performance computing," ACM Comput. Surv., vol. 26, no. 4, pp. 345–420, Dec. 1994.
[7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 127–138, Jan. 2017.
[8] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 367–379.
[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2015, pp. 161–170.
[10] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, "Energy-efficient CNN implementation on a deeply pipelined FPGA cluster," in Proc. ACM Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2016, pp. 326–331.
[11] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2016, pp. 16–25.
[12] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL deep learning accelerator on Arria 10," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2017, pp. 55–64.
[13] K. Guo et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[14] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. IEEE Int. Conf. Field-Program. Logic Appl. (FPL), Aug./Sep. 2016, pp. 1–8.
[15] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2017, pp. 45–54.
[16] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks," in Proc. IEEE Int. Conf. Field-Program. Logic Appl. (FPL), Sep. 2017, pp. 1–8.
[17] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. IEEE Int. Conf. Field-Program. Logic Appl. (FPL), Aug. 2016, pp. 1–9.
[18] A. Rahman, J. Lee, and K. Choi, "Efficient FPGA acceleration of convolutional neural networks using logical-3D compute array," in Proc. IEEE Design, Autom. Test Eur. Conf. (DATE), Mar. 2016, pp. 1393–1398.
[19] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, "Design space exploration of FPGA-based deep convolutional neural networks," in Proc. IEEE Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2016, pp. 575–580.
[20] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.
[21] L. Du et al., "A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
[22] B. Bosi, G. Bois, and Y. Savaria, "Reconfigurable pipelined 2-D convolvers for fast digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 299–308, Sep. 1999.
[23] Y. Guan et al., "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates," in Proc. IEEE Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), Apr./May 2017, pp. 152–159.
[24] X. Wei et al., "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proc. 54th Annu. Design Autom. Conf. (DAC), Jun. 2017, pp. 1–6.
Yufei Ma (S'16) received the B.S. degree in information engineering from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011 and the M.S.E. degree in electrical engineering from the University of Pennsylvania, Philadelphia, PA, USA, in 2013. He is currently working toward the Ph.D. degree at Arizona State University, Tempe, AZ, USA.
His current research interests include the high-performance hardware acceleration of deep learning algorithms on digital application-specific integrated circuits and field-programmable gate arrays.

Yu Cao (S'99–M'02–SM'09–F'17) received the B.S. degree in physics from Peking University, Beijing, China, in 1996 and the M.A. degree in biophysics and the Ph.D. degree in electrical engineering from the University of California, Berkeley, CA, USA, in 1999 and 2002, respectively.
He was a Summer Intern at Hewlett-Packard Labs, Palo Alto, CA, USA, in 2000, and at the IBM Microelectronics Division, East Fishkill, NY, USA, in 2001. He was a Postdoctoral Researcher at the Berkeley Wireless Research Center, University of California. He is currently a Professor of Electrical Engineering at Arizona State University, Tempe, AZ, USA. He has authored or coauthored numerous articles and two books on Nano-CMOS Modeling and Physical Design. His current research interests include physical modeling of nanoscale technologies, design solutions for variability and reliability, reliable integration of postsilicon technologies, and hardware designs for on-chip learning.
Dr. Cao was a recipient of the 2012 Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI; the 2010, 2012, 2013, 2015, and 2016 Top 5% Teaching Award, Schools of Engineering, Arizona State University; the 2009 ACM SIGDA Outstanding New Faculty Award; the 2009 Promotion and Tenure Faculty Exemplar, Arizona State University; the 2009 Distinguished Lecturer of the IEEE Circuits and Systems Society; the 2008 Chunhui Award for outstanding overseas Chinese scholars; the 2007 Best Paper Award at the International Symposium on Low Power Electronics and Design; the 2006 NSF CAREER Award; the 2006 and 2007 IBM Faculty Award; the 2004 Best Paper Award at the International Symposium on Quality Electronic Design; and the 2000 Beatrice Winner Award at the International Solid-State Circuits Conference. He was an Associate Editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. He served on the technical program committees of many conferences.

Sarma Vrudhula (M'85–SM'02–F'16) received the B.Math. degree from the University of Waterloo, Waterloo, ON, Canada, and the M.S.E.E. and Ph.D. degrees in electrical and computer engineering from the University of Southern California, Los Angeles, CA, USA.
He was a Professor at the ECE Department, University of Arizona, Tucson, AZ, USA, and was on the faculty of the EE-Systems Department at the University of Southern California. He was also the Founding Director of the NSF Center for Low Power Electronics at the University of Arizona. He is currently a Professor of Computer Science and Engineering with Arizona State University, Tempe, AZ, USA, and the Director of the NSF I/UCRC Center for Embedded Systems. His current research interests include design automation and computer-aided design for digital integrated circuits and systems; low-power circuit design; energy management of circuits and systems; energy optimization of battery-powered computing systems, including smartphones, wireless sensor networks, and Internet of Things systems that rely on energy harvesting; system-level dynamic power and thermal management of multicore processors and systems-on-chip; statistical methods for the analysis of process variations; statistical optimization of performance, power, and leakage; new circuit architectures of threshold logic circuits for the design of application-specific integrated circuits and field-programmable gate arrays; nonconventional methods for implementing logic, including technology mapping with threshold logic circuits; the implementation of threshold logic using resistive memory devices; and the design and optimization of nonvolatile logic.

Jae-sun Seo (S'04–M'10–SM'17) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 2001 and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2006 and 2010, respectively.
From 2010 to 2013, he was with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, focused on cognitive computing chips under the DARPA SyNAPSE Project and energy-efficient integrated circuits for high-performance processors. In 2014, he joined the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA, as an Assistant Professor. In 2015, he was with the Intel Circuits Research Laboratory as a Visiting Faculty. His current research interests include efficient hardware design of machine learning and neuromorphic algorithms and integrated power management.
Dr. Seo was a recipient of the Samsung Scholarship from 2004 to 2009, the IBM Outstanding Technical Achievement Award in 2012, and the NSF CAREER Award in 2017.