Hardware For Deep Learning Acceleration
MAC OPs (based on a single instruction) applied to massive data, graphics processing units (GPUs) for single-instruction multiple-data and single-instruction multiple-thread (SIMD/SIMT) execution with high-bandwidth memory are a very suitable type of general-purpose hardware for DL acceleration.[31–41] SIMD/SIMT GPUs can significantly accelerate DNN operations (i.e., MAC OPs) irrespective of DNN topology and thus serve as the mainstream hardware for DL acceleration. However, their high power consumption, as they consume a few hundred watts,[42] is a challenge to their application to AI at the edge (also known as edge AI and on-device AI).

As alternatives, various types of application-specific integrated circuit (ASIC)-based accelerators have been introduced, including neural processing units (NPUs),[43–46] computing-in-memory (CIM) units,[47,48] and neuromorphic processors.[49–52] Note that neuromorphic processors (also known as event processors) accelerate SNNs, while the others accelerate DNNs. These alternatives to GPUs aim to boost operational efficiency at the cost of some loss in versatility and flexibility. That is, a given alternative leverages its operational efficiency for a particular class of DNN. Generally, DNNs are classified as compute- and memory-bound models with regard to the factor that dictates their overall operation: 1) arithmetic operation rate and 2) memory bandwidth, respectively. This classification will be elaborated in Section 3. Note that the aforementioned accelerators for DNN-based DL are responsible for MAC OPs only, so that they need host CPUs to complete the whole computation in DNNs.

Neuromorphic processors (event processors) are standalone devices without host CPU and main memory, unlike the aforementioned accelerators. In this review, we refer to neuromorphic processors as event-based inference accelerators without a learning engine unless otherwise stated. They can be standalone given several crucial features of SNNs: 1) a feature map (event map) is remarkably sparse, i.e., the feature map at a given timestep includes only a few '1's and mostly '0's, and 2) the data required for inference are local to each spiking neuron. These features allow the spiking neurons to be distributed over multiple cores in a neuromorphic processor and to send their events to their fan-out neurons in an ad hoc manner, significantly boosting power efficiency.

Despite proposals of various DL acceleration platforms to date, there exists no perfect acceleration platform with high performance in all key performance metrics, e.g., operational throughput, power efficiency, versatility, flexibility, and so forth. Each of these platforms has relative pros and cons in terms of these performance metrics, so that appropriate platforms need to be chosen to accelerate the key workloads of a given model. To begin, one should understand such key workloads for various models. In this review, we aim to provide a comprehensive review of the workloads of DNNs and SNNs of different topologies and of the various hardware platforms that accelerate their major operations. The primary contributions of this review are as follows: 1) We review generic properties of DNNs and SNNs and classify them with regard to the bottleneck in their computations. 2) We overview various acceleration platforms for DNNs, including CPUs, GPUs, NPUs, and CIM units, and compare them with regard to the key performance metrics. 3) We overview neuromorphic processors for SNNs and comprehensively analyze the features that distinguish them from DNN accelerators.

The rest of the article is organized as follows. Section 2 overviews key computations (major workloads) in DNNs classified as compute- and memory-bound models, software frameworks for autodifferentiation, and the DNN topologies suitable for such autodifferentiation frameworks. Additionally, this section overviews SNNs and their major workload in comparison with DNNs. Section 3 addresses various processors for DL acceleration, including GPUs, NPUs, and CIM units, and overviews several recent designs for each type of accelerator. Section 4 is dedicated to neuromorphic processors that accelerate computations in SNNs. In this section, we overview various neuromorphic processor designs and explain how the concerns in neuromorphic processor design differ from those in DNN-based DL accelerator design. Section 5 concludes this review with concluding remarks and an outlook.

2. Computation in NNs

We address generic features of NNs with regard to computation in Section 2.1, which hold for both DNNs and SNNs. Computational features of DNNs and SNNs are addressed in Sections 2.2 and 2.3, respectively.

2.1. Generic Properties of Operations in NNs

2.1.1. Elementary Layers and Their Computation

Although NNs differ in topology for different tasks, they are commonly of layer-wise topology, and neighboring layers are joined by unidirectional connections as illustrated in Figure 1a. Despite the diversity in NN topology, the major types of elementary layers include the dense layer and the convolutional layer (conv layer for short), illustrated in Figure 1b,c, respectively. Note that we assume a mini-batch size of one and the FP32 data format, i.e., real-valued data, hereafter unless otherwise stated. For a dense layer of N neurons with a fan-in layer of M neurons, its weight matrix w is of dimension N × M.

Figure 1. Schematics of a a) feedforward neural network, b) dense layer, and c) conv layer. The dense and conv layers calculate y = wx and y = k ∗ x, respectively.
The total number of weights and the memory dedicated to them are NM and 32NM bits, respectively. The feature map z for this dense layer is calculated by a linear equation y = wx (x denotes the feature map of the fan-in layer) and a subsequent nonlinear equation z = f(y), where f denotes a nonlinear activation function. The linear equation (matrix–vector multiplication) involves two nested for-loops, and thus NM MAC floating-point operations (FLOPs) in total, as shown in Algorithm 1. Therefore, the time complexity of the linear equation is O(n²). The nonlinear equation z = f(y) involves N FLOPs, i.e., O(n). Therefore, the major workload arises from the MAC FLOPs of O(n²) complexity. Given that the major workload involves NM weights and NM FLOPs, the ratio of the number of FLOPs to the number of weights (FWR) is one:

FWR = 1 for dense layers    (1)

That is, one loaded weight is used for one FLOP, and thus, for each FLOP, one weight value needs to be loaded. In this case, memory-access latency and memory bandwidth (rather than the throughput of the arithmetic operations) likely dictate the overall operational throughput. Consequently, employing a high-bandwidth memory is an appropriate strategy to accelerate the dense layer computation.

Algorithm 1. Naive matrix–vector multiplication: multiplication of an input vector x ∈ ℝ^M by a weight matrix w ∈ ℝ^(N×M), yielding an output vector y ∈ ℝ^N.

Initialize y;
for i = 1 to N do
    s ← 0;
    for j = 1 to M do
        /* Single MAC OP */
        s ← s + w[i, j] · x[j];
    end
    y[i] ← s;
end
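For readers who prefer executable code, the following Python sketch restates Algorithm 1 directly (as a naive reference, not an optimized implementation) and counts the MAC FLOPs, confirming that a dense layer performs NM MACs over NM weights, i.e., FWR = 1. The layer sizes are arbitrary.

```python
import numpy as np

def dense_forward(w: np.ndarray, x: np.ndarray):
    """Naive matrix-vector multiplication (Algorithm 1) with a MAC counter."""
    n, m = w.shape
    y = np.zeros(n, dtype=w.dtype)
    macs = 0
    for i in range(n):
        s = 0.0
        for j in range(m):
            s += w[i, j] * x[j]   # single MAC OP
            macs += 1
        y[i] = s
    return y, macs

w = np.random.randn(512, 1024).astype(np.float32)   # N = 512, M = 1024
x = np.random.randn(1024).astype(np.float32)
y, macs = dense_forward(w, x)
print(macs, w.size, macs / w.size)   # 524288 MACs, 524288 weights, FWR = 1.0
```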
A feature map for a conv layer is of dimension c_out × h′ × w′, where c_out, h′, and w′ denote the channel, height, and width of a rank-3 tensor (see Figure 1c). The feature map is calculated by convolution (a linear operation) of a fan-in feature map x of dimension c_in × h × w with a rank-4 kernel k of dimension c_out × c_in × k_h × k_w, i.e., y = k ∗ x, and a subsequent nonlinear equation z = f(y). The convolution (y = k ∗ x) involves six nested for-loops around a unit MAC OP, as shown in Algorithm 2, so that its time complexity is O(n⁶). The number of FLOPs involved in the convolution is therefore c_out h′ w′ c_in k_h k_w. The nonlinear equation z = f(y) involves c_out h′ w′ FLOPs, i.e., its time complexity is O(n³), and thus the major workload for conv layers arises from the convolution operation of O(n⁶) complexity. FWR for this conv layer is therefore given by

FWR = c_out h′ w′ c_in k_h k_w / (c_out c_in k_h k_w) = h′ w′ for conv layers    (2)

That is, one loaded weight is reused h′w′ times for FLOPs, which distinguishes conv layers from dense layers with FWR = 1. In this case, the arithmetic operational throughput (rather than memory bandwidth) likely dictates the overall operational throughput, so that an appropriate strategy to accelerate the conv layer computation is to increase the arithmetic operational throughput by employing multiple ALUs that work in parallel.

Algorithm 2. Convolution of an input tensor x ∈ ℝ^(c_in×h×w) with a kernel k ∈ ℝ^(c_out×c_in×k_h×k_w), yielding an output tensor y ∈ ℝ^(c_out×h′×w′).

Initialize y;
for i = 1 to c_out do
    for j = 1 to h′ do
        for k = 1 to w′ do
            s ← 0;
            for l = 1 to c_in do
                for m = 1 to k_h do
                    for o = 1 to k_w do
                        /* Single MAC OP */
                        s ← s + k[i, l, m, o] · x[l, j + m − 1, k + o − 1];
                    end
                end
            end
            y[i, j, k] ← s;
        end
    end
end
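The following Python sketch evaluates Equation (2) for an assumed conv layer; the layer dimensions are illustrative only and are not taken from a specific network.

```python
# FLOP and weight counts of a conv layer (Equation (2)); sizes are illustrative.
c_in, h, w = 64, 56, 56          # fan-in feature map
c_out, k_h, k_w = 128, 3, 3      # kernel
h_out, w_out = 56, 56            # output feature map (stride 1, 'same' padding)

flops   = c_out * h_out * w_out * c_in * k_h * k_w   # MAC FLOPs of the convolution
weights = c_out * c_in * k_h * k_w                   # kernel parameters
fwr     = flops / weights                            # equals h_out * w_out

print(f"FLOPs = {flops:,}, weights = {weights:,}, FWR = {fwr:.0f}")
# FWR = h'w' = 3136: each weight is reused 3136 times once loaded.
```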
2.1.2. Data Formats

A commonly used data format is FP32 (single-precision floating point): 1b sign, 8b exponent, and 23b mantissa (Figure 2a).

Figure 2. a) Schematics of FP32, BFLOAT16, and FP16 formats. b) Architecture of a FP multiplier.
Numbers in FP32 are considered as real values. Given the aforementioned trend in NN evolution, the space complexity (memory usage) increases prohibitively, motivating the use of lower-precision FP formats such as FP16 (sign/exponent/mantissa: 1b/5b/10b) and BFLOAT16 (sign/exponent/mantissa: 1b/8b/7b). These formats are illustrated in Figure 2a. These low-precision FP formats reduce memory usage by 50% and reduce memory-loading latency for a given memory bandwidth. Particularly, BFLOAT16 allows a number range similar to FP32 because of its eight exponent bits but largely reduces the power and area overheads of FP multipliers. The representative architecture of an FP multiplier is shown in Figure 2b. The multiplication of two FP numbers involves the addition of the two integer exponents and the multiplication of the two integer mantissa parts; an integer adder and an integer multiplier are therefore included in Figure 2b. Given that the integer multiplier mainly dictates the power and area overheads of the FP multiplier, BFLOAT16 with only seven mantissa bits can significantly reduce the power and area overheads compared with FP32 and even with FP16. Mixed precision is often used, as for Tensor Processing Units (TPUs) (BFLOAT16 for multiplication and FP32 for accumulation).[44] In this case, BFLOAT16 numbers are easily converted to FP32 by simply attaching 16 null bits to the right-hand side of the LSB of the BFLOAT16 number.
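The following Python sketch illustrates this BFLOAT16-to-FP32 widening (and the reverse truncation) on raw bit patterns using NumPy; it is a minimal illustration, and real hardware typically rounds to nearest even rather than truncating when narrowing.

```python
import numpy as np

def bf16_bits_to_fp32(bf16_bits: np.ndarray) -> np.ndarray:
    """Widen BFLOAT16 (stored as uint16 bit patterns) to FP32 by appending
    16 zero bits below the LSB; the sign and exponent fields are unchanged."""
    return (bf16_bits.astype(np.uint32) << 16).view(np.float32)

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Narrow FP32 to BFLOAT16 bit patterns by dropping the low 16 mantissa bits
    (truncation; hardware usually applies round-to-nearest-even instead)."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

x = np.array([3.14159, -0.001, 1024.5], dtype=np.float32)
bf = fp32_to_bf16_bits(x)
print(bf16_bits_to_fp32(bf))   # matches x to ~2-3 decimal digits (7-bit mantissa)
```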
There exist efforts to use low-precision integer formats (e.g., INT8 and INT4) in place of floating-point formats to further reduce the memory usage, memory-loading latency, and power and area overheads of ALUs. To this end, additional algorithms for weight and activation quantization have to be applied to NNs, which cause some inevitable loss in NN performance.[53] As such, an integer multiplier is much lighter than an FP multiplier, as it is merely a component of an FP multiplier as shown in Figure 2b, which reduces the overhead of the multiplier logic. The extreme cases include binary weights and activations and power-of-two weights, which replace multipliers by simple XNOR logic and shift registers, respectively.

GPUs have excellent flexibility as they support various data formats, e.g., FP64/32/16, BFLOAT16, INT8/4, and even binary. ASIC-based accelerators are frequently designed for particular data formats to boost their performance at the cost of a loss in flexibility. An extreme case is the CIM units based on analog MAC OPs, which limit the data format to integer only. Generally, the performance is significantly dictated by the data format used, such that the lower the data precision, the higher the performance, insomuch as the performance for INT4 is ≈64× that for FP32 on an NVIDIA A100.

Note that the term operation (OP) indicates a format-unspecific operation, so that it is a rather general term that includes FLOP. Hereafter, we use the general terms OP and OPs to refer to operations in various data formats.

2.1.3. Directed Acyclic Graphs

Directed acyclic graphs (DAGs) define the sequence of successive computations and the consequent data flow. They are acyclic such that the data processed by a given node (function) are disallowed from being directed back to that node through any path, i.e., no self-association is allowed. NNs are mapped onto DAGs to train them by using GPUs with autodifferentiation frameworks like PyTorch[54] and TensorFlow.[55] The backpropagation of error algorithm (backprop for short) uses the backward pass of the error (evaluated at the output nodes), which is based on the chain rule applied to the gradient tensor computed for each node. Autodifferentiation frameworks compute the gradient tensors in a user-friendly manner. The chain rule should apply to DAGs only because it yields wrong results for graphs with self-association like feedback connections.[56] Feedforward NNs can directly be mapped onto DAGs given that no inherent self-association is included. Recurrent NNs (RNNs) include in-layer connections, which inevitably include closed data paths within a layer. To use autodifferentiation frameworks for such RNNs, they are unrolled (duplicated) over time on DAGs in that the same node at different timesteps is considered as different nodes, which removes self-association. There exist NNs with feedback paths at the same timestep, e.g., SNNs with spiking neurons whose state variables (membrane potentials) are reset upon their spiking. In this case, the reset signals are delayed for one timestep to avoid self-association within the same timestep. Otherwise, particular mathematical techniques based on the implicit function theorem are required, as for EXODUS.[56]
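As a minimal illustration of this unrolling, the following PyTorch sketch applies a recurrent update for a few timesteps; each timestep adds new nodes to the autograd graph, so the unrolled computation forms a DAG and backpropagation proceeds by the chain rule. The layer sizes and loss are arbitrary.

```python
import torch

# A recurrent (in-layer) update unrolled over T timesteps: each step creates
# new nodes in the autograd graph, so the unrolled graph has no self-association.
torch.manual_seed(0)
w_in  = torch.randn(8, 4, requires_grad=True)   # input weights
w_rec = torch.randn(8, 8, requires_grad=True)   # recurrent weights

h = torch.zeros(8)
for t in range(5):                               # unroll T = 5 timesteps
    x_t = torch.randn(4)
    h = torch.tanh(w_in @ x_t + w_rec @ h)       # h[t] depends on h[t-1]

loss = h.pow(2).sum()
loss.backward()                                  # chain rule over the unrolled DAG
print(w_rec.grad.shape)                          # gradients flow through all timesteps
```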
2.2. Computation in DNNs

CNNs are broadly used for computer vision; they include AlexNet,[28] VGG,[29] ResNet,[57] DenseNet,[58] GoogLeNet,[30] and so on. Most layers in CNNs are conv layers for feature extraction, with a few dense layers for classification. Therefore, the major operation type is the convolution elaborated in the pseudocode in Algorithm 2. A pooling layer generally follows a given conv layer, which reduces the dimension of the conv layer. There exist various types of pooling layers such as max pooling, average pooling, and adaptive max pooling. Max pooling over a 2D kernel (k_h × k_w in size) involves a simple max function that outputs the largest feature among the features within the 2D kernel. Average pooling outputs the feature value averaged over the feature map within the kernel, requiring addition and division operations. Batch normalization (BN)[59] is commonly deployed after pooling. BN normalizes the feature map after pooling using the mean and standard deviation of the features for a given channel over the samples in a mini-batch. The normalized feature map undergoes scale and shift using two trainable parameters, which require multiply and addition operations such that the normalized feature map is multiplied by the scale parameter, and the shift parameter is subsequently added. The nonlinear function finally applies to the result to compute the activation. Although various activation functions are available, ReLU and its variants are often chosen as activation functions for feedforward CNNs.

There are several lightweight DNNs that use depth-wise separable convolution in place of the aforementioned normal convolution to reduce their time and space complexities, e.g., MobileNets,[60,61] ShuffleNets,[62] and EfficientNets.[63] For a c_out × h′ × w′ feature map with a c_in × h × w fan-in feature map, depth-wise separable convolution requires the following total numbers of OPs (time complexity) and parameters (space complexity)
# OPs = c_in h′ w′ k_h k_w + c_out c_in h′ w′
# parameters = c_in k_h k_w + c_out c_in    (3)

where, in each sum, the first term corresponds to the depth-wise convolution and the second to the separable convolution.

These highlight significant reductions in time and space complexities compared with the normal convolution, whose time and space complexities are c_out h′ w′ c_in k_h k_w and c_out c_in k_h k_w, respectively. However, FWR for a depth-wise separable conv layer is the same as for the normal convolution:

FWR = (c_in h′ w′ k_h k_w + c_out c_in h′ w′) / (c_in k_h k_w + c_out c_in) = h′ w′    (4)
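The following Python sketch evaluates Equations (3) and (4) against the normal convolution for an assumed layer; the dimensions are illustrative only.

```python
# OP and parameter counts of Equation (3) versus a normal convolution;
# the layer dimensions below are illustrative only.
c_in, c_out  = 32, 64
h_out, w_out = 112, 112
k_h, k_w     = 3, 3

# Depth-wise separable convolution (depth-wise term + separable term)
ops_dws    = c_in * h_out * w_out * k_h * k_w + c_out * c_in * h_out * w_out
params_dws = c_in * k_h * k_w + c_out * c_in

# Normal convolution
ops_conv    = c_out * h_out * w_out * c_in * k_h * k_w
params_conv = c_out * c_in * k_h * k_w

print(f"OPs:    {ops_dws:,} vs {ops_conv:,}  ({ops_conv / ops_dws:.1f}x reduction)")
print(f"params: {params_dws:,} vs {params_conv:,}")
print("FWR (separable) =", ops_dws / params_dws, "= h'w' =", h_out * w_out)
```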
Natural language processing depends on dense layer-based DNNs such as 1) RNNs, e.g., the bidirectional RNN,[64] long short-term memory,[65] and gated recurrent units,[66] and 2) the transformer[67] and its variants, e.g., bidirectional encoder representations from transformers[68] and the generative pretrained transformer.[26,69] These models use dense layers and/or dense layer-like operations, i.e., matrix–vector or matrix–matrix multiplications, so that FWR for the major operation is unity. Additionally, these models include exponential function-based nonlinear activation functions, e.g., sigmoid, hyperbolic tangent,

ε(t) = e^(−t/τ_m) H(t) for LIF, and ε(t) = H(t) for IF    (6)

where H denotes the Heaviside step function. Note that the temporal kernel for the IF model is considered as a special case of the LIF model with τ_m → ∞.

Equation (5) is expressed as a form of convolution integral as follows

u_i(t) = Σ_j w_ij ∫₀ᵗ ε(τ) s_j(t − τ) dτ    (7)

With the temporal kernel in Equation (6), Equation (7) is equivalent to the following differential equation

du_i/dt = −u_i/τ_m + Σ_j w_ij s_j    (8)

This equation can be expressed in a discrete form (with Δt = 1) using the Euler method (explicit method), and we have the following recursive form

u_i[t] = e^(−1/τ_m) u_i[t − 1] + Σ_j w_ij s_j[t]    (9)
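The following Python sketch implements the recursive update of Equation (9) for a small population of LIF neurons; the threshold-and-reset step appended here follows the usual LIF convention and is an assumption, since the full neuron model is defined in the text that precedes Equation (5).

```python
import numpy as np

def lif_step(u, s_in, w, tau_m=10.0, theta=1.0):
    """One discrete LIF update (Equation (9)) followed by a simple
    threshold/reset; the reset-to-zero convention is an assumption here."""
    u = np.exp(-1.0 / tau_m) * u + w @ s_in     # leak + weighted sum of input spikes
    spiked = u >= theta                          # binary output spikes
    u = np.where(spiked, 0.0, u)                 # reset spiking neurons
    return u, spiked.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.5, size=(16, 64)).astype(np.float32)   # 64 fan-in, 16 neurons
u = np.zeros(16, dtype=np.float32)
for t in range(20):
    s_in = (rng.random(64) < 0.05).astype(np.float32)        # sparse input events
    u, s_out = lif_step(u, s_in, w)
```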
PEs merged into the memory domain. High-bandwidth DRAM-based CIM units have been prototyped, for instance, an HBM-based digital CIM unit (FIMDRAM)[47,79] and GDDR6-based CIM units (AiM).[80] These examples use high-bandwidth DRAMs as embedded memories in conjunction with PEs in the vicinity of the memories. There exist diverse SRAM-based digital CIM units that fully utilize the advantages of SRAM such as fast operation, high bandwidth, and perfect compatibility with complementary metal–oxide–semiconductor (CMOS) logic circuits.[81–86]

Analog CIM is based on PEs merged into the memory domain, where full or partial MAC OPs are performed in an analog manner. Consequently, memory access and a full or partial MAC OP are performed in a single cycle, so that the latency for load and store in digital CIM can be avoided. A front-runner is SRAM-based CIM, in which bitwise multiplication is simply implemented by using an AND gate (digital) and accumulation by using a bitline capacitor (analog). Resistance-based nonvolatile memory (e.g., RRAM and MRAM)-based analog CIM has also been successfully demonstrated.[88–96] Particularly, RRAM-based analog CIM is based on Kirchhoff's current law in a nonvolatile resistor array, inherently realizing bit-wise multiplication and accumulation of the currents through RRAM bitcells that share the same bitline. Both digital and analog CIM units will be addressed in detail in Section 3.5.

3.3. CPU-Based Accelerators

CPU-based accelerators proposed to date mostly aim to perform operations in caches. As shown in Figure 6, a cache hierarchy is used to efficiently retrieve data in a CPU for various applications.[97–99] CPU-based DL accelerators aim to perform operations in the cache to minimize data movement, and thus latency and power consumption. However, the limited number of cores (and thus cache memories) in a CPU limits the operational parallelism, so that its performance is hardly comparable to the other accelerators when using deep and large DNNs. Nevertheless, lightweight DNNs with high operational sparsity can be handled solely by CPU-based accelerators without the cost of data movement between the CPU and other accelerators.

Compute near last level cache (CNC) realizes cache-level DL acceleration by integrating an auxiliary MAC unit in the 512 KB last level cache (LLC) of an eight-core 64b RISC-V CPU.[100] Instead of loading data into the core for MAC OPs, performing the operations within the LLC can boost the performance and energy efficiency. The CNC MAC unit multiplies an 8 × 8 INT8 matrix by an 8-long INT8 vector and accumulates the product vector in INT32. Custom instructions are added to the RV64GC instruction set architecture (ISA) for the MAC OP in the LLC. The processor (fabricated using the Intel 4 CMOS process) consumes 510 mW (at 0.85 V and 1.15 GHz) and 73 mW (at 0.55 V and 350 MHz). The CNC achieved 46× and 27× the performance of the scalar ISA for dense and conv layers, respectively.

Compute cache allows logical operators, e.g., AND, NOR, and XOR, in the cache without a large area overhead by designing an additional decoder for activating multiple word lines in parallel and a single-ended sense amplifier for each bitline.[101] Additionally, compound operations, e.g., compare, search, copy, and carryless multiplication, are supported. Compute cache enhances operational performance by 1.9× and reduces energy consumption by 2.4× compared to an Intel eight-core Sandy Bridge processor with a three-level cache hierarchy.

Neural cache[102] is an extension of compute cache, which supports in-cache integer arithmetic operations in addition to the logical and supplementary operations supported in compute cache. For the in-cache arithmetic operations, data are transposed and mapped onto the SRAM array, i.e., each n-bit element is stored over n word lines. Two operands sharing a single bitline are iteratively computed bit by bit by simultaneously activating the two word lines. Neural cache realizes integer addition, multiplication, and division in n + 1, n² + 5n − 2, and 1.5n² + 5.5n cycles, respectively. For Inception-v3,[103] neural cache reduces inference latency by 18.3× and 7.7× compared with a baseline Xeon E5 CPU and a Titan Xp GPU, respectively. Further, its energy efficiency increases by 37.1× and 16.6× compared to the CPU and GPU, respectively.

Duality cache is an in-cache computation architecture that supports various in-cache arithmetic operations in integer, fixed-point, and floating-point formats.[104] For addition in FP (which is of large complexity as explained in Section 2.1.2), duality cache is equipped with a new FP addition algorithm (referred to as bit-serial) of low time complexity, which applies to multiple data in parallel. For INT multiplication and division, zero-skipping algorithms are adopted to skip redundant arithmetic operations. The CORDIC algorithm[105] is used for transcendental functions in FP. Duality cache implemented in a two-socket Xeon server by using the entire cache can support 150× the number of threads supported by an NVIDIA Titan Xp GPU, achieving average speedups of ≈3.6× and ≈4× compared with the GPU for the Rodinia[106] and OpenACC benchmarks, respectively, at the cost of an increase in area overhead by merely 3.5% and in TDP by 3 W.

Intel has recently announced DL acceleration using scalable Xeon CPUs with a built-in AI accelerator referred to as Intel advanced matrix extensions (Intel AMX).[107] The instruction set for Intel AMX is an extended version of the x86 ISA to optimize DL training and inference tasks. To minimize data movement, maximize parallelism, and reduce the need for additional hardware design, the Intel AMX architecture is based on 1) eight 1 KB 2D registers (tiles) for data storage and 2) a tile matrix multiplication unit attached to the tiles. Intel AMX supports BFLOAT16 and INT8 formats, offering respective speed boosts of 16× and 8× compared to previous-generation Xeon CPUs without Intel AMX.
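The following NumPy sketch mimics the tile-based INT8 matrix multiplication with INT32 accumulation that such a tile unit performs; it does not use the actual AMX intrinsics, and the tile size is illustrative.

```python
import numpy as np

def tiled_int8_matmul(a: np.ndarray, b: np.ndarray, tile: int = 16) -> np.ndarray:
    """INT8 x INT8 -> INT32 matrix multiplication computed tile by tile,
    mimicking a tile-register matmul unit; the tile size is illustrative."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=np.int32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile].astype(np.int32)
                    @ b[p:p+tile, j:j+tile].astype(np.int32)
                )
    return c

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
b = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
assert np.array_equal(tiled_int8_matmul(a, b),
                      a.astype(np.int32) @ b.astype(np.int32))
```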
3.4. NPU

The term NPU was coined by Esmaeilzadeh et al. to refer to an ASIC unit that supports parallel arithmetic operations for NNs.[43] The NPU proposed by Esmaeilzadeh et al. is equipped with eight PEs to accelerate neural programs and interfaces with a CPU. Each PE comprises a multiply–add unit, accumulator registers, and a sigmoid unit to compute a multilayer perceptron (MLP). The NPU includes three first-in first-out (FIFO) modules (a configuration FIFO and input and output FIFOs) to interface with the CPU in use. The configuration FIFO is used to send and retrieve the MLP parameters, and the input and output FIFOs to send and retrieve the input and output data in the execute phase of the CPU cores, respectively. The programmer chooses the program segments subject to acceleration in the NPU with regard to the following conditions: execution frequency, approximability, and well-defined input and output. Subsequently, the program segments are converted into NPU instructions through a compiler and processed in the NPU. The performance of the CPU–NPU pair was identified using the MARSSx86 cycle-accurate x86-64 simulator.[108] With Intel's Penryn microarchitecture, the NPU achieves up to a 2.3× speedup and 3.0× energy saving.

Frequently, NPUs are designed to minimize data movement from their off-chip memory (and thus power consumption and latency) by employing systolic arrays of PEs,[78] which realize a sequence of MAC OPs based on the partial sums buffered in registers, each of which is embedded in a PE (see Figure 9). Systolic array-based NPUs are classified as weight-stationary (storing a weight in each PE), e.g., Origami[109] and the TPU,[44] and row-stationary (storing more data of an input activation and weight in each PE), e.g., Eyeriss.[110] The latter is intended to improve data reuse and to reduce energy consumption by minimizing data movement from the off-chip memory, but at the cost of a larger local memory in each PE. Schematics of systolic arrays for row-stationary and weight-stationary dataflows are illustrated in Figure 11a,b, respectively.

Figure 11. Schematic of a systolic array for a) row stationary and b) weight stationary.
Eyeriss is an NPU based on systolic arrays for row-stationary dataflow, which was fabricated using a 65 nm CMOS process in a 16 mm² die.[110,111] 2D kernel and feature map data in 16b fixed-point are distributed over the 12 × 14 PEs, which are stationary. Each PE computes the dot product of a given row of the kernel and a feature map row of the same size and buffers the result in its scratchpad. Each kernel row and activation row is horizontally and diagonally distributed, respectively; partial sums are vertically accumulated (Figure 11a). Eyeriss attains 33.6 GOPS at a 200 MHz core clock and 1 V supply voltage.

Origami is an NPU based on weight-stationary systolic arrays (Figure 9b), designed using a 65 nm CMOS process in a 3.09 mm² die.[109] This NPU comprises four processing channels, each of which is given k_h × k_w multipliers for parallel multiplications of 12b data and an adder-tree for accumulation. The results in the processing channel are truncated to 12-b data. In each processing channel, c_in k_h k_w kernel data are registered for weight reuse. Origami attains a maximum performance and power efficiency of 274 GOPS and 369 GOPS/W, respectively.

Another NPU based on weight-stationary systolic arrays is the TPU.[44,45] TPUv1 consists of a matrix multiply unit, a unified buffer for local activations, accumulators, a control unit, and an interface unit. The matrix multiply unit contains 256 × 256 MAC PEs for INT8 data. 64 KB weights are loaded into the matrix multiply unit from an 8 GB off-chip DRAM. The 16b multiplication results are accumulated in a 4 MB accumulator for 32b data. The large on-chip memory of 28 MB in total supports the memory-hungry weight-stationary scheme, yielding a peak performance of 92 TOPS at 700 MHz. TPUv1 is 15–30× faster than a K80 GPU and a Haswell E5-2699 v3 CPU, and its power efficiency is 30–80× higher. TPUv4 supports BFLOAT16 for multiplication and FP32 for accumulation.[45] Its peak performance attains 275 TOPS at a TDP of 170 W at 1 GHz.

The aforementioned stationary-based NPUs focus on minimizing data movement but require additional registers (scratchpad memory for Eyeriss) for inter-PE data movement. This is because the full operation is split into a number of unit operations (MACs), and the results of the unit operations need to be buffered in each PE. An alternative strategy is to perform the full operation using a multiply and adder-tree structure through which the data flow without data buffers. An example is DianNao, developed to accelerate the computation of large-scale NNs with a tiling method to alleviate memory bandwidth requirements.[112] The architecture comprises three stages of neural functional units (NFUs), three split buffers, and a control processor. In the NFU, input activations and weights (all in 16b fixed-point) are multiplied in the first stage, and the products are added through an adder-tree in the second stage. The final stage computes the activation function for the results from the adder-tree. The three buffers store input activations, weights, and output activations. DianNao was designed using a 65 nm CMOS process and simulated to identify its performance, yielding 452 GOPS at 0.98 GHz. Subsequently, Luo et al. introduced an extended version of DianNao, referred to as DaDianNao.[113] DaDianNao (designed using a 22 nm CMOS process) consists of 16 computing tiles in total and can perform 16 operations per tile (i.e., 256 operations in total). Moreover, a four-bank embedded DRAM (eDRAM) in each tile accommodates a number of weights on chip.
3.5.1. Digital CIM

Function-in-memory (FIM) DRAM is a digital CIM unit with programmable computing units (PCUs) supporting FP16 data and HBM2 as an embedded memory.[47] FIMDRAM includes a 16-wide SIMD engine in the memory banks to achieve bank-level parallelism. The physical dimension of HBM2 was maintained by replacing half of the memory array with PCUs. Multibank operations are supported using the FIM mode, while general memory operations use the normal mode. The PCU comprises a register group, an execution unit, and an interface unit, and is controlled using conventional memory commands (CMDs) from the host without modifying the conventional memory controller. FIMDRAM was fabricated using a 20 nm DRAM process and achieves 1.2 TFLOPS in FP16 at 300 MHz.

Another example of eDRAM-based digital CIM units is AiM, which is based on 4 Gb GDDR6.[80] In AiM, one processing unit (PU) is dedicated to each of the 16 banks and placed in the vicinity of the bank. Notably, a set of new CMDs is introduced for bank activation, compute, and data movement, unlike FIMDRAM. This new CMD set allows swift switches between the memory mode and the DL operation mode without commands from the host. Each PU is equipped with 16 multipliers and a four-stage adder-tree for BFLOAT16 data. For parallel MAC OPs, the PU receives 1) 256b weights from its bank and 256b activations from the global buffer, or 2) 256b weights from its bank and activations of the same size from the paired bank. Additionally, AiM supports various activation functions, e.g., sigmoid, hyperbolic tangent, GELU, ReLU, and leaky ReLU, based on lookup tables (LUTs), each of which stores uniform inputs and their function values for each activation function. AiM attains 1 TFLOPS for BFLOAT16 at 1 GHz.

SRAM is frequently used as an embedded memory in digital CIM units.[81–86] Mostly, weights are stored in the SRAM, other than a few cases in which activations are stored in the SRAM, e.g., Z-PIM.[81] Most designs separate a processing domain from a memory domain,[82–85,114] other than a few cases[81,86] as follows. Chih et al. proposed a 6 T-SRAM-based CIM macro design with bit-wise multipliers (4 T NOR gates) integrated in the SRAM domain.[86] For massive computing parallelism, the architecture employs bit-serial multipliers (in the memory domain) and parallel adder-trees (in the processing domain). This design supports programmable precision of input activations (binary to INT8) and weights (INT4/8/12/16). The memory domain includes a 256 × 4 × 4b SRAM array in which a 4 T NOR gate for bit-wise multiplication is dedicated to each bit cell, supporting 256 bit-wise multiplications of 256 activations and 256 weights in parallel. The 256 products are summed through the adder-tree. A single macro was fabricated using a 22 nm CMOS process in a 0.202 mm² die, yielding a performance of 3.3 TOPS (for 4b activations and 4b weights) at 0.72 V supply voltage.

Z-PIM also uses bit-wise multipliers integrated in the SRAM domain.[81] Notably, the SRAM array stores input activations instead of weights, unlike the CIM designs explained above. The key feature of this CIM macro is a zero-skipping scheme that avoids MAC OPs for zero weights.

Analog CIM demonstrates high operational efficiency by reducing the area and power overheads for multipliers and adders and by eliminating on-chip data loading. However, analog CIM units support integer MAC OPs only, and their operational reliability is unlikely to be comparable to digital CIM. There exist a number of analog CIM designs using various memories such as mainstream memories, e.g., SRAM[115–119] and DRAM,[120,121] and emerging nonvolatile memories, e.g., resistive RAM (RRAM)[87–95] and magnetic RAM (MRAM).[96] We introduce a few examples of each class of analog CIM as follows.

Dong et al. introduced an 8 T SRAM-based analog CIM macro fabricated using a 7 nm CMOS process.[118] A sufficient noise margin allows the 8 T SRAM array of 64 × 64 in size to remain stable even when multiple words are activated for parallel MAC OPs. The macro computes multiply-average (MAV) operations for 4b activations and 4b weights. Multibit inputs are realized in a bit-serial manner by a 4b digital counter. Multibit weights are realized by using multiple capacitors of power-of-two relative capacitance (1:2:4:8) at the end of each bitline. MAV operations for 64 4b inputs and 16 4b weights are computed using a flash analog-to-digital converter (ADC). The macro attains 455.1 GOPS and 321 TOPS/W for INT4 MAV operations at 1 V.

Fully analog CIM is prone to operational errors due to the limited signal-to-noise ratio (SNR) of ADCs. Mixed-signal CIM may be a noise-robust alternative. In this regard, Su et al. introduced a 384 Kb 6 T SRAM-based CIM macro.[115] The proposed macro consists of SRAM subarrays, ADCs, and a digital shifter and adder (DSaA). In each subarray, 32 6 T-SRAMs are connected to a local bitline pair (LBL/LBLB), and 16 subarrays are connected to a global bitline pair (GBL/GBLB). In a subarray, voltage-scaled 2b activations and 1b weights (stored in SRAM) are multiplied using LBL/LBLB, and the products are averaged using GBL/GBLB. The ADC converts the averaged product into a 5b digital value. These values from multiple ADCs are combined in the DSaA to obtain a 20b output. The macro achieves a peak performance of 22.75 TOPS/W for INT8 data at 0.85 V supply voltage.

DynaPlasia is a system-level CIM unit with a reconfigurable 3T2C eDRAM of 9.6 Mb.[121] Each bitcell is reconfigurable such that one of the memory, in-memory computing, and ADC modes can be chosen. Particularly, a bitcell capacitor serves as a unit capacitor of a successive-approximation ADC in the ADC mode, significantly reducing the area overhead for ADCs. DynaPlasia was fabricated using a 28 nm CMOS technology in a 20.25 mm² die. It attains 56 TOPS/W peak performance for INT4 activations and INT5 weights at 250 MHz and 1 V supply voltage.

RRAM is a resistance-based nonvolatile memory[122] that was considered as a storage-class memory. Following Kirchhoff's current law, the 1T1R bitcells sharing the same bitline inherently realize bit-wise multiplications and accumulation of the currents through the bitcells, equivalent to parallel MAC OPs. ISAAC is an analog dot-product machine that leverages this inherent property of RRAM crossbar arrays.[92] ISAAC comprises many tiles, each of which includes eDRAM to store input activations, output registers, in situ multiply-accumulate (IMA) units, shift-and-add,
sigmoid, and max pooling units. Each IMA includes RRAM crossbar arrays, digital-to-analog converters (DACs) for the input activations, and ADCs for the outputs. The authors proposed three performance metrics: computational efficiency (CE in GOPS mm⁻²), power efficiency (PE in GOPS/W), and storage efficiency (SE in MB/mm²). These metrics were maximized by searching for the optimal size of each RRAM crossbar array, the numbers of crossbar arrays and ADCs in each IMA, and the number of IMAs in each tile. The system-level simulation results highlight improvements in throughput, power efficiency, and computational density of 14.8×, 5.5×, and 7.5×, respectively, compared to DaDianNao.
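The following NumPy sketch illustrates the idealized crossbar MAC that such designs exploit: with weights programmed as conductances and activations applied as word-line voltages, each bitline current is an analog dot product by Kirchhoff's current law. Device non-idealities and ADC quantization are ignored, and the array size and value ranges are illustrative only.

```python
import numpy as np

# Idealized RRAM crossbar MAC: weights are programmed as conductances G,
# activations are applied as word-line voltages V, and each bitline current
# is the analog dot product I_j = sum_i V_i * G_ij (Kirchhoff's current law).
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(128, 64))    # conductances of a 128 x 64 array [S]
V = rng.uniform(0.0, 0.2, size=128)            # read voltages on 128 word lines [V]

I_bitline = V @ G                              # 64 bitline currents [A], one MAC column each
print(I_bitline.shape)                         # (64,)
```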
TIMELY is an analog dot-product machine based on RRAM crossbar arrays with three key innovations: analog local buffers (ALBs) to enhance data locality, time-domain interfaces (TDIs) to reduce the number of data conversions, and only-once input read (O²IR) to reuse input activations.[94] When transferring data from an analog to a digital domain, the register in the digital domain consumes considerable energy and time. In this regard, ALBs eliminate the necessity for this data conversion. Additionally, the energy cost is reduced by replacing the ADCs and DACs (used in the conventional crossbar array) with time-to-digital converters (TDCs) and digital-to-time converters. O²IR also reduces the energy cost by reducing the memory access frequency. TIMELY was designed using a 65 nm CMOS process, working at 40 MHz and 1.2 V supply voltage. TIMELY improves energy efficiency by 18.2× and computational density by 20× compared to ISAAC.

MRAM is also a resistance-based nonvolatile memory, based on current-controlled magnetic tunnel junctions (MTJs) that exhibit nonvolatile high- and low-resistance states depending on their spin configuration. Jung et al. introduced a spin-transfer-torque MRAM (STT-MRAM)-based CIM unit for binary NNs (binary activations and binary weights).[96] CIM in this unit is realized by using strings of MTJs instead of MTJ crossbar arrays to reduce the power consumption in the memory domain. Given the considerable current through MTJs even in the high-resistance state, MTJ crossbar arrays consume high power during the current summation for parallel MAC OPs. To cope with this power issue, the proposed design uses strings of MTJs like NAND flash, where the total resistance of each string is determined by the number of high-resistance-state MTJs. Each bitcell is of 2T2M type, i.e., two MTJs of complementary resistance states and two transistors with complementary inputs, to realize the XNOR logic for binary NNs. Thus, the resistance sum of a given string is equivalent to the dot product of the binary weight and activation vectors. The string resistance is subsequently converted to a digital value using a TDC.
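The following Python sketch shows the digital equivalent of this scheme: a dot product of binary (±1) vectors computed with XNOR and popcount, which is what the complementary-MTJ (2T2M) bitcells realize in the resistance domain. The vector length is arbitrary.

```python
import numpy as np

def xnor_popcount_dot(w_bits: np.ndarray, a_bits: np.ndarray) -> int:
    """Dot product of {-1, +1} vectors encoded as {0, 1} bits:
    matches = XNOR(w, a), dot = 2 * popcount(matches) - length.
    Digitally equivalent to summing complementary-MTJ bitcell states."""
    matches = np.logical_not(np.logical_xor(w_bits, a_bits))
    return 2 * int(matches.sum()) - len(w_bits)

rng = np.random.default_rng(0)
w_bits = rng.integers(0, 2, size=256)
a_bits = rng.integers(0, 2, size=256)

# Reference dot product in the +/-1 domain
w_pm, a_pm = 2 * w_bits - 1, 2 * a_bits - 1
assert xnor_popcount_dot(w_bits, a_bits) == int(np.dot(w_pm, a_pm))
```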
4. Accelerators for SNN-Based DL

SNNs are time-dependent models for DL with unique features distinct from DNNs, as addressed in Section 2.3. Particularly, their binary activation and highly sparse feature maps allow accelerator architectures of extreme power efficiency, which are essentially distinguishable from DNN accelerators. Nevertheless, SNNs for DL are of the same topology as DNNs, and SynOPs (their major operations) are equivalent to MAC OPs. In this regard, GPUs are frequently used to accelerate SNN computation, and they can significantly accelerate SynOPs given their high memory bandwidth and peak performance. However, given their prohibitive power consumption, GPUs hardly utilize the inherent operational efficiency of SNNs. To leverage the inherent operational efficiency due to operational sparsity and binary activation, event processors (based on ad hoc event routing) need to be used to implement SNNs, which are referred to as neuromorphic processors. In this review, we mainly address event-based neuromorphic processors, which are the mainstream neuromorphic hardware. Note that hereafter neuromorphic processors indicate SNN inference accelerators without an on-chip learning engine unless otherwise specified. Additionally, we refer to event-based neuromorphic processors as neuromorphic processors unless otherwise specified. Section 4.1 introduces the generic architecture of neuromorphic processors and several key working principles. Section 4.2 is dedicated to various event-routing architectures that are the key to event processors. Section 4.3 overviews various neuromorphic processors introduced to date, which are classified as 1) mixed-signal and 2) digital neuromorphic processors. Section 4.4 addresses nonevent-based neuromorphic processors and compares them with event-based neuromorphic processors.

4.1. Generic Architecture of Neuromorphic Processors

Unlike DNN accelerators based on the von Neumann architecture, neuromorphic processors are standalone hardware in need of neither a host CPU nor main memory. Generally, a neuromorphic processor consists of multiple neuromorphic cores and network-on-chips (NoCs) that are responsible for communication between cores, as illustrated in Figure 12a. Each core consists of a neuron block, weight memory, event router, and event queue. The neuron block calculates the membrane potential values. The weight memory stores the synaptic weights used for SynOPs. The event router sends input spikes (events) to the postsynaptic (destination) neurons. The event queue buffers input spikes temporarily before the event router sends them to the destination neurons. Note that some designs merge the weight memory with the event router, as for crossbar-based event-routing architectures.[49,123,124] Spiking neurons are distributed over multiple cores that compute the membrane potentials (state variables) of their spiking neurons in parallel. The events from a given neuron are routed to its postsynaptic (fan-out) neurons through the NoCs. Despite the limited bandwidth of the NoCs, this event routing through NoCs barely causes heavy traffic because of 1) the binary activation ('0' and '1', nonspike and spike, respectively) instead of the real-valued activation as for DNNs and 2) the high sparsity of the feature maps.

The forward pass of an SNN for inference depends on local data only. That is, the membrane potential update in Equation (9) (based on SynOPs in Equation (12)) uses several data like the potential time constant τ_m, the spiking threshold θ, and the fan-in synaptic weights w_ij, which are all local. Therefore, when the fan-in synaptic weights w_ij are placed in the same core as the postsynaptic neurons, each core depends on its own local memory without access to a main memory. This allows the multiple cores to operate independently in parallel. Thus, event data
f_sp = SynOPs(0) / (SynOPs(0) + SynOPs(1))    (16)
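Since the discussion surrounding Equation (16) is not reproduced here, the following sketch assumes that SynOPs(b) counts the synaptic operations whose input bit equals b, so that f_sp measures the fraction of operations that can be skipped for non-spiking ('0') inputs; it estimates f_sp from a binary event map with an assumed fan-out.

```python
import numpy as np

# Operational sparsity in the sense of Equation (16), estimated from a binary
# event map; the interpretation of SynOPs(b) and the fan-out value are assumptions.
rng = np.random.default_rng(0)
event_map = (rng.random((64, 32, 32)) < 0.05).astype(np.uint8)   # ~5% spikes
fan_out = 9 * 128                                                # e.g., 3x3 kernel, 128 output channels

synops_1 = int(event_map.sum()) * fan_out          # operations triggered by '1' inputs
synops_0 = int((event_map == 0).sum()) * fan_out   # operations skippable for '0' inputs
f_sp = synops_0 / (synops_0 + synops_1)
print(round(f_sp, 3))                              # close to 0.95 for this event map
```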
because of its simplicity. However, the quadratic increase of memory usage (≈N²) with the number of neurons N hinders it from applying to large-scale neuromorphic processors.

Hierarchical AER is a hierarchical tree-based event-routing architecture, which supports exponential expandability for a given number of event hops over the hierarchy.[126] The tree hierarchy is configured such that the leaves are dedicated to the neurons that send and receive events, while the nodes on each hierarchical level relay the events from the neurons throughout the hierarchy. Given the exponential expandability of the hierarchical tree, this event-routing architecture supports the minimal number of event hops (which corresponds to the minimal latency) for event routing in large-scale SNNs.

The K-tag routing scheme[50] is a two-stage event-routing method in which the first stage routes an event to the clusters including the destination neurons and the second stage broadcasts it to the destination neurons. The key advantage of K-tag routing lies in its optimal use of event-routing memory by optimizing the number of fan-out connections for the second stage. The K-tag scheme is employed in DYNAPs[50] and Loihi.[51]

The pointer-based event-routing scheme proposed by Kornijcuk et al.[127] is an LUT-based event-routing method that aims to reduce the latency of event routing and of the inverse lookup used for spike timing-dependent plasticity (STDP)-based on-chip learning. This method uses three LUTs (PTR_LUT, FOUT_LUT, and FIN_LUT). FOUT_LUT is sorted according to source neuron address such that the destination neuron addresses for a given source neuron are adjacently allocated in FOUT_LUT. FIN_LUT is sorted according to destination neuron address such that the source neuron addresses for a given destination neuron are grouped in FIN_LUT. PTR_LUT stores the ranges of the addresses in FOUT_LUT and FIN_LUT for a given neuron, so that, upon an event from a neuron, the addresses of its destination neurons in FOUT_LUT and of its source neurons in FIN_LUT can be found in one cycle without a sequential search of FOUT_LUT and FIN_LUT.
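The following Python sketch illustrates the lookup flow with toy tables; the table contents and field layout are hypothetical and only convey the idea that PTR_LUT returns, in a single access, the address ranges of a neuron's fan-out entries in FOUT_LUT and fan-in entries in FIN_LUT.

```python
# Toy pointer-based routing tables (contents are hypothetical, for illustration).
fout_lut = [3, 4, 7, 0, 2, 5, 1, 6]        # destination neuron addresses, grouped by source
fin_lut  = [6, 2, 4, 0, 1, 5, 3, 7]        # source neuron addresses, grouped by destination
ptr_lut  = {                                # per-neuron (start, end) ranges into the two LUTs
    0: {"fout": (0, 3), "fin": (0, 2)},
    1: {"fout": (3, 5), "fin": (2, 5)},
    2: {"fout": (5, 8), "fin": (5, 8)},
}

def route_event(src_neuron: int) -> list[int]:
    """Return all destination neurons of an event with one PTR_LUT access,
    without scanning FOUT_LUT sequentially."""
    start, end = ptr_lut[src_neuron]["fout"]
    return fout_lut[start:end]

def fan_in_of(dst_neuron: int) -> list[int]:
    """Inverse lookup (e.g., for STDP-style learning): sources of a neuron."""
    start, end = ptr_lut[dst_neuron]["fin"]
    return fin_lut[start:end]

print(route_event(1), fan_in_of(1))
```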
4.2.2. Layer-Wise Event Routing

The neuron-wise event routing supports high flexibility in topology configuration. Yet, SNNs for DL frequently consist of basic elementary layers (see Section 2.1) instead of arbitrary connections, so that they barely need such high flexibility. When the layer type (conv or dense) and the hyperparameters are fixed for a given layer, all neuron-to-neuron connections are determined, allowing us to avoid neuron-wise configured LUTs that cause significant memory usage. That is, layer-wise rather than neuron-wise event routing can remarkably boost the efficiency of on-chip memory usage. The layer-centric event-routing architecture (LaCERA)[52] realizes this layer-wise event routing by using a tiny LUT that defines the type of each layer in a given SNN (Figure 13b). Consequently, LaCERA reduces the memory usage for event routing by more than two orders of magnitude compared with the K-tag scheme, as compared in Table 1. Further, LaCERA supports an ideal weight-reuse rate for conv layers, which is barely realized in neuron-wise event-routing architectures, so that the efficiency of on-chip memory usage can be further enhanced.

Table 1. Comparison of event-routing memory usage.

Network   | Crossbar[49] | K-tag[50,51] | LaCERA[52]
cNet a)   | 42.2 Mib     | 4.58 Mib     | 15.31 Kib
LeNet b)  | 22.0 Mib     | 3.77 Mib     | 16.54 Kib

a) (3 × 32 × 32)–16C4@2–32C3–64C3@2–10C4–10. b) (3 × 32 × 32)–6C5–AP2–16C5–AP2–120C5–84–10.

4.3. Overview on Neuromorphic Processors

Mixed-signal neuromorphic processors are implemented using both analog and digital circuits; frequently, neurons and synapses are realized using analog circuits, while network configuration and event routing are achieved using digital circuits. The advantages of analog building blocks (neurons and synapses) include their ability to mimic biological dynamics at low power consumption. However, given that analog circuits are sensitive to noise, mismatch, and power, voltage, and temperature variations, the scalability of analog neurons and synapses is somewhat limited. In contrast, digital neuromorphic processors are fully implemented using digital circuits: neurons and synapses are implemented using digital logic cells. The advantages of digital circuits include excellent scalability, reliability, reconfigurability, and operation speed, allowing digital neuromorphic processors to be a solution for large-scale SNN-based DL accelerators. Sections 4.3.1 and 4.3.2 introduce several neuromorphic processors designed using mixed-signal and digital circuits, respectively.

4.3.1. Mixed-Signal Neuromorphic Processors

NeuroGrid is a multichip system that works with a million neurons and billions of synapses in real time.[128] This system consists of 16 mixed-signal neuromorphic chips, each of which contains a single neuromorphic core. Each chip (core) was fabricated using a 180 nm CMOS process within a 168 mm² die. Each core is of the shared synapse and dendrite architecture for its area efficiency and local connectivity. The 16 cores are connected using the tree-based event-routing architecture for its higher throughput in multicasting packets than the mesh architecture. Each core consists of a 256 × 256 neuron array, a transmitter, a receiver, a router, and two RAMs. The analog neuron array realizes soma, dendrite, synapse-population, and ion-channel-population circuits. The analog neurons in the core operate in a fully parallel manner. The transmitter, receiver, and router are used to route AER packets. The transmitter encodes the coordinates of spiking neurons and sends them to the output port. The receiver decodes the coordinates of the AER packets and delivers them to the target neurons. The router multicasts AER packets using the information retrieved from the memory in the core. The RAMs store the target synapse locations and configuration parameters that are shared among all neurons in the core. In NeuroGrid, a million neurons and eight billion synapses operate in real time at a power consumption of 2.7 W.

ROLLS[123] is a single-core neuromorphic processor fabricated using a 180 nm CMOS process within a 51.4 mm² die.
It embodies an event-based online learning engine that allows and axon indices. The AER packets are routed to the destination
ad hoc update of synaptic weights, which is based on analog cir- cores through the NoCs. Subsequently, the synaptic fan-in states,
cuit design. The learning algorithm implemented is spike-driven e.g., fan-in weight and delay, are retrieved from the memory in
synaptic plasticity (SDSP).[129] The processor consists of an the destination core, and event routing is completed. A barrier
analog neuron circuit realizing 256 adaptive exponential IF synchronization mechanism is employed to let the whole cores
(AdExp-IF) neurons,[130] two 256 256 arrays of trainable synap- operate timestep-wise without a global clock. The authors dem-
ses, and synapse demultiplexer. The 256 AdExp-IF neurons work onstrate the performance of Loihi on various tasks.
in parallel. Each of the two synapse arrays realizes short- and Frenkel et al. introduced ODIN which is a digital single-core
long-term plasticity, respectively. In the synapse arrays, weights neuromorphic processor with an on-chip learning engine sup-
are update using an analog circuit, while the digital circuit con- porting the SDSP algorithm.[124] ODIN was fabricated using
trols the update protocol and manages handshaking signals with 28 nm CMOS process in a 0.086 mm2 die. This processor sup-
AER packets. The synapse demultiplexer is to allocate one of the ports 256 neurons conforming to the LIF model or custom
256 rows in the 256 256 array to each neuron. ROLLS success- phenomenological model to realize biologically plausible neural
fully emulate attractor networks of cortical neurons for image dynamics. The neuronal state variables are stored in a 4 KB
classification at a power consumption of 4 mW. SRAM memory. Similar to TrueNorth, an SRAM crossbar of
Moradi et al. introduced mixed-signal neuromorphic process- 256 256 in size was adopted for event routing and weight stor-
ors (DYNAPs), which support excellent reconfigurability of SNN age, but it supports weights of 4-b precision unlike TrueNorth.
topology.[50] DYNAP is a quad-core neuromorphic processor fab- The time-multiplexed learning engine for SDSP modifies the
ricated using a 180 nm CMOS process within a 43.79 mm2 die. synaptic weights subject to update. ODIN successfully repro-
Each core consists of 256 AdExp-IF neurons and 16 k synapses duced the dynamics of Izhikevich model[72] and demonstrated
that operate in parallel. The mixed-signal computing node in a its on-chip learning capability on MNIST.
core comprises an analog neuron and four synaptic dynamic cir- There exist several digital neuromorphic processors proto-
cuits. The excellent reconfigurability arises from the novel K-tag typed on field-programmable gate array (FPGA) boards. We
event-routing architecture (see Section 4.2) that first routes an introduce some of them as follows. Ye et al. prototyped a 32-core
event to the clusters including the destination neurons and neuromorphic processor in a Virtex-7 FPGA, which features its
subsequently broadcasts them to the destination neurons with novel layer-wise event-routing architecture (LaCERA introduced
optimal memory usage. The K-tag scheme is realized using in Section 4.2) for convolutional SNNs (conv-SNNs). This archi-
three-level routers. tecture supports significant reductions in 1) event-routing mem-
ory usage compared with conventional neuron-wise routing
methods, e.g., K-tag and crossbar architecture, and 2) weight
4.3.2. Digital Implementation
memory usage given the ideal weight reuse supported by this
architecture for conv-SNNs. Additionally, the hyperparameters
TrueNorth is a digital multicore neuromorphic processor fabri-
for each layer is fully reconfigurable in this architecture. This pro-
cated using a 28 nm CMOS process.[49] Globally asynchronous
cessor supports the LIF and IF models, and SRM. Each core
locally synchronous design was adopted to minimize the power
includes up to 2 k neurons whose state variables (membrane
consumption. Each of 4096 cores in total realizes maximal 256
potential for LIF and membrane potential and synaptic current
LIF neurons and 64 k programmable synapses with several digital
for SRM) are approximated using the template-scaling exponen-
modules, including neuron, memory, scheduler, router, and con-
tial function approximation[131,132] which allows high-precision
troller modules. The router, scheduler, and controller modules
approximations of exponential functions with minimal use of
operate asynchronously in a handshaking manner based on
LUT memory. The 32 cores are distributed conforming to 2D
AER packets. The neuron module is implemented in a synchro-
mesh architecture with eight NoCs.
Loihi is another digital multicore (128-core) neuromorphic processor designed using a 14 nm CMOS process in a 60 mm2 die.[51] Notably, Loihi is equipped with an on-chip learning engine that supports several event-based learning algorithms. Each core realizes 1 k time-multiplexed LIF neurons and synapses whose number ranges from 114 k to 1M, depending on the user-programmable weight precision. The core implements synapse, dendrite, axon, and learning engine modules using 2 Mb of distributed SRAM in total. Upon event generation, the axon module generates AER packets containing the destination core and axon addresses, which are routed to the destination cores through the NoC.
ODIN is a single-core digital neuromorphic processor that supports on-chip learning based on spike-driven synaptic plasticity (SDSP).[124] A crossbar architecture 256 × 256 in size was adopted for event routing and weight storage, but it supports weights of 4-b precision unlike TrueNorth. The time-multiplexed learning engine for SDSP modifies the synaptic weights subject to update. ODIN successfully reproduced the dynamics of the Izhikevich model[72] and demonstrated its on-chip learning capability on MNIST.
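For readers unfamiliar with SDSP, the snippet below gives a deliberately simplified, schematic spike-driven update in the spirit of Brader, Senn, and Fusi;[129] the thresholds, calcium window, and step size are placeholders, and ODIN's hardware implementation differs in detail.

# Schematic SDSP-style weight update triggered by a presynaptic spike.
# All constants are illustrative placeholders, not ODIN's parameters.
THETA_MEM = 0.8          # postsynaptic membrane-potential threshold
CA_LO, CA_HI = 0.2, 1.5  # calcium-variable bounds gating plasticity
STEP = 1                 # weight increment/decrement (integer weights)
W_MIN, W_MAX = 0, 7      # e.g., a 3-b weight range

def sdsp_on_pre_spike(w, v_mem_post, ca_post):
    """Potentiate if the postsynaptic neuron is depolarized, depress otherwise,
    but only while the calcium variable lies inside the plasticity window."""
    if not (CA_LO <= ca_post <= CA_HI):
        return w                              # outside window: no update
    if v_mem_post >= THETA_MEM:
        return min(w + STEP, W_MAX)           # potentiation
    return max(w - STEP, W_MIN)               # depression

print(sdsp_on_pre_spike(3, v_mem_post=0.9, ca_post=0.5))  # -> 4
print(sdsp_on_pre_spike(3, v_mem_post=0.1, ca_post=0.5))  # -> 2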
There exist several digital neuromorphic processors prototyped on field-programmable gate array (FPGA) boards. We introduce some of them as follows. Ye et al. prototyped a 32-core neuromorphic processor in a Virtex-7 FPGA, which features a novel layer-wise event-routing architecture (LaCERA, introduced in Section 4.2) for convolutional SNNs (conv-SNNs). This architecture supports significant reductions in 1) event-routing memory usage compared with conventional neuron-wise routing methods, e.g., the K-tag and crossbar architectures, and 2) weight memory usage, given the ideal weight reuse supported by this architecture for conv-SNNs. Additionally, the hyperparameters for each layer are fully reconfigurable in this architecture. The processor supports the LIF and IF models as well as the SRM. Each core includes up to 2 k neurons whose state variables (membrane potential for LIF, and membrane potential and synaptic current for the SRM) are approximated using the template-scaling exponential function approximation,[131,132] which allows high-precision approximations of exponential functions with minimal use of LUT memory. The 32 cores are distributed conforming to a 2D mesh architecture with eight NoCs.
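The memory saving of LUT-based exponential evaluation can be sketched as follows: the decay exponent is split into an integer and a fractional part so that only a short table over one unit interval needs to be stored and scaled. This generic sketch is our own illustration with an arbitrary table size; it is not claimed to match the exact template-scaling scheme of refs. [131,132].

# Approximate exp(-t/tau) with a small LUT over one unit interval plus scaling,
# instead of a large table over the full argument range. Illustrative only.
import math

LUT_SIZE = 16
# Template: exp(-f) sampled for f in [0, 1).
TEMPLATE = [math.exp(-i / LUT_SIZE) for i in range(LUT_SIZE)]
E_INV = math.exp(-1.0)          # scaling factor per integer step

def exp_decay(t_over_tau):
    """Approximate exp(-t/tau) as exp(-n) * exp(-f) with n integer, f in [0, 1)."""
    n = int(t_over_tau)                     # integer part -> repeated scaling
    f = t_over_tau - n                      # fractional part -> LUT lookup
    return (E_INV ** n) * TEMPLATE[int(f * LUT_SIZE)]

for x in (0.3, 1.7, 4.2):
    print(x, exp_decay(x), math.exp(-x))    # approximation vs. reference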
Yang et al. proposed a multicore neuromorphic processor of 6 × 6 × 6 3D mesh architecture, which was prototyped in an Altera Stratix III FPGA.[133] The keys to the 3D mesh architecture are a novel 3D NoC and the router in each core, which multicasts and receives AER packets through six ports (up, down, north, east, west, and south). This processor aims to realize, in real time, large-scale SNNs of conductance-based neuron models with high fidelity to their biological counterparts. The authors successfully reproduced the behavior of cortico-basal ganglia-thalamocortical networks in real time.

4.4. Nonevent-Based Neuromorphic Processors
We addressed event-based neuromorphic processors that allow ad hoc event routing when events are generated. This event-routing method fully leverages the generic property of SNNs, i.e., the high sparsity of feature (event) maps. However, the lower the sparsity, the larger the latency for event routing because of the limited parallelism in event routing through NoCs of limited bandwidth. There exist several neuromorphic processors that, unlike event-based neuromorphic processors, build full event maps that are used for subsequent SynOPs; we term these nonevent-based neuromorphic processors. These processors support parallel SynOPs with multiple PEs operating in parallel, as in DNN accelerators. Thus, these nonevent-based processors are advantageous over event-based processors for SNNs of low sparsity. Note that, in this case, SynOPs are identical to MAC OPs, so that the same PEs can be used for both operations.

Tianjic[134] is an example that realizes a method to leverage highly sparse feature maps by skipping SynOPs for zero events (i.e., no-spike cases). In this processor, the multiple cores communicate using AER packets through the NoCs. However, the AER packets are buffered into an in-core memory to construct event vectors (corresponding to binary feature maps) rather than being routed ad hoc. An input event vector to a given core undergoes parallel SynOPs using multiple PEs with the weights stored in the same core; SynOPs for zero activations are skipped using a zero-filtering mask before the PEs.
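The benefit of the zero-filtering mask is easy to see in a small sketch: with a binary event vector, only the weight rows of spiking inputs contribute, so the dense product reduces to accumulating a few selected rows. The code below is an illustrative model of this idea in NumPy (sizes and sparsity are arbitrary), not a description of Tianjic's datapath.

# Dense SynOPs vs. zero-skipped SynOPs on a binary event vector (illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 64
events = (rng.random(n_in) < 0.1).astype(np.int8)     # ~10% spike sparsity
weights = rng.standard_normal((n_in, n_out)).astype(np.float32)

# Dense version: every input contributes a multiply-accumulate.
dense_out = events @ weights                           # n_in * n_out MAC OPs

# Zero-skipped version: the mask selects spiking inputs; for binary events the
# "multiplication" degenerates to accumulating the selected weight rows.
active = np.flatnonzero(events)                        # zero-filtering mask
sparse_out = weights[active].sum(axis=0)               # ~0.1 * n_in * n_out OPs

assert np.allclose(dense_out, sparse_out, atol=1e-5)
print(f"{active.size}/{n_in} weight rows accumulated instead of {n_in}")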
5. Concluding Remarks and Outlook

Given that the progress in DL is expected to continue, the computational complexity keeps growing. For the moment, GPUs support this DL progress as the mainstream DL accelerators. This is not only because of the excellent performance, memory bandwidth, flexibility, and versatility of GPUs but also because of the solidly built ecosystem of DL research based on CUDA-based DL libraries, e.g., PyTorch and TensorFlow. To bring the alternative accelerators (NPUs and CIM units) into play in the market, they should outperform GPUs at the hardware level by a margin large enough that users are willing to leave the current, solid ecosystem. However, for the moment, NPUs and CIM units have obvious disadvantages with regard to versatility in that NPUs and CIM units are advantageous only for compute- and memory-bound models, respectively. Notably, the computational bottleneck mostly arises from the limited memory bandwidth, which is unavoidable in the conventional computer architecture. In this regard, further development of high-bandwidth memory is a key premise for boosting the performance of DL accelerators.

Given that the DL accelerators addressed have clear pros and cons, it may be feasible to build a system equipped with various accelerators to harness their respective strengths for various DL tasks. An example is heterogeneous computing using a system combining CPUs, GPUs, field-programmable gate arrays, and so forth, as recently proposed by Intel. To support this computing environment, several challenges remain, including 1) minimization of data movement among different accelerators and 2) optimal task assignment to accelerators of different run times to maximize the device utilization rate.

DL acceleration at the edge is strictly constrained by power consumption, ruling out GPUs as on-device AI accelerators. For the moment, on-device AI is an emerging market that is predicted to grow. Inference-only NPUs and CIM units at low power may be brought into play in the on-device AI market, but these low-power accelerators strictly limit the capacity of the DNNs they can compute. In this regard, SNN-based DL using neuromorphic processors at extremely low power may offer the largest model capacity at a given power constraint among these on-device AI accelerators.

Yet, neuromorphic processor technology is hardly as mature as DNN-based DL accelerators, and its ecosystem has barely solidified, although several SNN libraries have emerged, e.g., snnTorch,[135] SpikingJelly,[136] and Spyx.[137] Additionally, to bring SNN-based DL to the edge, several challenges remain to be overcome, mainly with regard to the training efficiency and scalability of SNNs. SNNs are time-dependent models like RNNs: when training, the model is commonly unrolled over time, and its parameters are optimized using backpropagation through time (BPTT).[138] Thus, the space complexity scales with the number of timesteps in use, so that the length of each training sample is strictly limited by the memory capacity of a given training platform. To cope with this issue, the online training through time (OTTT) algorithm has recently been proposed, which, unlike BPTT, learns weights using temporally local data only.[139] Yet, the scalability of OTTT to deeper and larger SNNs remains to be verified. Another challenge is the scalability of SNNs to a degree similar to that of DNNs. The difficulty lies in the number of hyperparameters included in SNNs, e.g., various time constants and spiking thresholds, which should be optimized. Nonetheless, these challenges do not necessarily imply inherent disadvantages of SNNs compared with DNNs, given that they may be overcome in the near future. Certainly, the fascinating advantage (ultralow power consumption) of neuromorphic processors for on-device AI will likely fuel the activities to overcome these challenges.
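To make the memory argument concrete, the minimal sketch below unrolls a single LIF layer over T timesteps in plain PyTorch and backpropagates through the whole unrolled graph; the surrogate gradient, constants, and shapes are arbitrary illustrative choices rather than any particular library's implementation. Because the activations of every timestep are retained for the backward pass, activation memory grows roughly linearly with T, which is precisely the limitation that temporally local schemes such as OTTT target.

# Minimal BPTT sketch for one LIF layer in plain PyTorch (illustrative).
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, sigmoid surrogate in the backward."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        s = torch.sigmoid(4.0 * v)
        return grad_out * 4.0 * s * (1.0 - s)

def lif_forward(x, w, beta=0.9, v_th=1.0):
    """x: (T, batch, n_in) binary inputs; returns a rate-coded output (batch, n_out)."""
    v = torch.zeros(x.shape[1], w.shape[1])
    spikes = []
    for t in range(x.shape[0]):          # unrolled over time -> O(T) activations
        v = beta * v + x[t] @ w          # leaky integration
        s = SpikeFn.apply(v - v_th)      # threshold crossing
        v = v - s * v_th                 # soft reset
        spikes.append(s)
    return torch.stack(spikes).mean(dim=0)

T, batch, n_in, n_out = 16, 8, 100, 10
x = (torch.rand(T, batch, n_in) < 0.1).float()
w = (0.1 * torch.randn(n_in, n_out)).requires_grad_()
loss = lif_forward(x, w).sum()
loss.backward()                          # BPTT through all T timesteps
print(w.grad.shape)                      # torch.Size([100, 10])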
Acknowledgements

C.S. and C.Y. contributed equally to this work. This research was supported by the National R&D Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (grant no. NRF-2021M3F3A2A01037632). This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (grant no. RS-2023-00229689). This work was partly supported by the IITP under the artificial intelligence semiconductor support program to nurture the best talents (IITP-(2004)-RS-2023-00253914) grant funded by the Korea government (MSIT).

Conflict of Interest

The authors declare no conflict of interest.

Keywords

compute-in-memory, deep learning, deep learning accelerators, graphics processing units, neural processing units, neuromorphic processors

Received: November 13, 2023
Revised: February 6, 2024
Published online:
[1] A. L. Samuel, IBM J. Res. Dev. 1959, 3, 210.
[2] M.-T. Luong, H. Pham, C. D. Manning (Preprint), arXiv:1508.04025, v1, Submitted: August 2015.
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Adv. Neural Inf. Process. Syst. 2013, 26, 1.
[4] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, S. Khudanpur, in Interspeech, Vol. 2, International Speech Communication Association (ISCA), Makuhari 2010, pp. 1045–1048.
[5] J. Hirschberg, C. D. Manning, Science 2015, 349, 261.
[6] D. Yu, L. Deng, Automatic Speech Recognition, Vol. 1, Springer, New York 2016.
[7] Y. Zhang, W. Chan, N. Jaitly, in 2017 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Piscataway, NJ 2017, pp. 4845–4849.
[8] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., in Int. Conf. on Machine Learning, JMLR, New York, NY 2016, pp. 173–182.
[9] A. Vedaldi, B. Fulkerson, in Proc. of the 18th ACM Int. Conf. on Multimedia, ACM, New York, NY 2010, pp. 1469–1472.
[10] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. Kruthiventi, R. V. Babu, Front. Robot. AI 2016, 2, 36.
[11] A. Krizhevsky, I. Sutskever, G. E. Hinton, Commun. ACM 2017, 60, 84.
[12] A. Kendall, Y. Gal, Adv. Neural Inf. Process. Syst. 2017, 30, 1.
[13] D. Bahdanau, K. Cho, Y. Bengio (Preprint), arXiv:1409.0473, v1, Submitted: September 2014.
[14] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, W.-Y. Ma, Adv. Neural Inf. Process. Syst. 2016, 29, 1.
[15] Y. Liu, C. Niu, Z. Wang, Y. Gan, Y. Zhu, S. Sun, T. Shen, J. Mater. Sci. Technol. 2020, 57, 113.
[16] Q. Yang, S. Fu, H. Wang, H. Fang, IEEE Network 2021, 35, 96.
[17] M. F. Dixon, I. Halperin, P. Bilokon, Machine Learning in Finance, Vol. 1170, Springer, New York 2020.
[18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Nature 2016, 529, 484.
[19] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, et al., Nature 2019, 575, 350.
[20] W. S. McCulloch, W. Pitts, Bull. Math. Biophys. 1943, 5, 115.
[21] V. Nair, G. E. Hinton, in Proc. of the 27th Int. Conf. on Int. Conf. on Machine Learning, ICML'10, Omnipress, Madison, WI 2010, pp. 807–814.
[22] A. L. Maas, A. Y. Hannun, A. Y. Ng, in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, June 2013.
[23] D. Hendrycks, K. Gimpel, Gaussian Error Linear Units (Gelus), Arxiv 2016.
[24] P. Ramachandran, B. Zoph, Q. V. Le, Searching for Activation Functions, Arxiv 2017.
[25] D. S. Jeong, J. Appl. Phys. 2018, 124, 152002.
[26] OpenAI (Preprint), arXiv:2303.08774, v1, Submitted: March 2023.
[27] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, et al., Llama 2: Open Foundation and Fine-Tuned Chat Models, Arxiv 2023.
[28] A. Krizhevsky, I. Sutskever, G. E. Hinton, Adv. Neural Inf. Process. Syst. 2012, 25, 1.
[29] K. Simonyan, A. Zisserman (Preprint), arXiv:1409.1556, v1, Submitted: September 2014.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE, Piscataway, NJ 2015.
[31] T. P. Morgan, Nvidia Rounds Out "Ampere" Lineup with Two New Accelerators 2021, https://www.nextplatform.com/2021/04/15/nvidia-rounds-out-ampere-lineup-with-two-new-accelerators/ (accessed: April 2021).
[32] R. Krashinsky, O. Giroux, S. Jones, N. Stam, S. Ramaswamy, Nvidia Ampere Architecture In-Depth 2020, https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/ (accessed: May 2020).
[33] P. Alcorn, Nvidia Infuses dgx-1 with Volta, Eight v100s in a Single Chassis 2017, https://www.tomshardware.com/news/nvidia-volta-v100-dgx-1-hgx-1,34380.html (accessed: May 2017).
[34] I. Cutress, Nvidia's DGX-2: Sixteen Tesla v100s, 30tb of NVME, Only $400k, 2018, https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen-v100-gpus-30-tb-of-nvme-only-400k (accessed: March 2018).
[35] C. Campa, C. Kawalek, H. Vo, J. Bessoudo, Defining AI Innovation with Nvidia DGX A100, 2020, https://developer.nvidia.com/blog/defining-ai-innovation-with-dgx-a100 (accessed: May 2020).
[36] R. Smith, Nvidia Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder, 2022, https://www.anandtech.com/show/17327/nvidia-hopper-gpu-architecture-and-h100-accelerator-announced (accessed: March 2022).
[37] R. Smith, Nvidia Gives Jetson AGX Xavier A Trim, Announces Nano-Sized Jetson Xavier Nx 2019, https://www.anandtech.com/show/15070/nvidia-gives-jetson-xavier-a-trim-announces-nanosized-jetson-xavier-nx (accessed: November 2019).
[38] B. Funk, Nvidia Jetson AGX Orin: The Next-Gen Platform that will Power Our AI Robot Overloads Unveiled, 2022, https://hothardware.com/news/nvidia-jetson-agx-orin (accessed: March 2022).
[39] D. Franklin, Nvidia Jetson TX2 Delivers Twice the Intelligence to the Edge, 2017, https://developer.nvidia.com/blog/jetson-tx2-delivers-twice-intelligence-edge/ (accessed: March 2017).
[40] B. Hill, Nvidia Unveils Ampere-Infused Drive AGX for Autonomous Cars, Isaac Robotics Platform with BMW Partnership, 2022, https://hothardware.com/news/nvidia-drive-agx-pegasus-orin-ampere-next-gen-autonomous-cars (accessed: May 2020).
[41] R. Smith, 16gb Nvidia Tesla v100 Gets Reprieve; Remains in Production 2018, https://www.anandtech.com/show/12809/16gb-nvidia-tesla-v100-gets-reprieve-remains-in-production (accessed: May 2018).
[42] N. C. Thompson, K. Greenewald, K. Lee, G. F. Manso (Preprint), arXiv:2007.05558, v1, Submitted: July 2020.
[43] H. Esmaeilzadeh, A. Sampson, L. Ceze, D. Burger, in 2012 45th Annual IEEE/ACM Int. Symp. on Microarchitecture, IEEE, Piscataway, NJ 2012, pp. 449–460.
[44] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, et al., in Proc. of the 44th Annual Int. Symp. on Computer Architecture, 2017, pp. 1–12.
[45] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, D. Patterson, in 2021 ACM/IEEE 48th Annual Int. Symp. on Computer Architecture (ISCA), IEEE, Piscataway, NJ 2021, pp. 1–14.
[46] K. J. Lee, in Hardware Accelerator Systems for Artificial Intelligence and Machine Learning (Eds: S. Kim, G. C. Deka), Vol. 122, Advances in Computers, Elsevier, Amsterdam 2021, pp. 217–245.
[47] Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, et al., in 2021 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ 2021, pp. 350–352.
[48] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, et al., in 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3.
[49] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, D. S. Modha, Science 2014, 345, 668.
[50] S. Moradi, N. Qiao, F. Stefanini, G. Indiveri, IEEE Trans. Biomed. Circuits Syst. 2017, 12, 106.
[51] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C.-K. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y.-H. Weng, A. Wild, Y. Yang, H. Wang, IEEE Micro 2018, 38, 82.
[52] C. Ye, V. Kornijcuk, D. Yoo, J. Kim, D. S. Jeong, Neurocomputing 2023, 520, 46.
[53] G. Kim, D. S. Jeong, Adv. Neural Inf. Process. Syst. 2021, 34, 28274.
[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Adv. Neural Inf. Process. Syst. 2019, 32, 1.
[55] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, Tensorflow: A System for Large-Scale Machine Learning, Arxiv 2016.
[56] F. C. Bauer, G. Lenz, S. Haghighatshoar, S. Sheik, Front. Neurosci. 2023, 17, 1.
[57] K. He, X. Zhang, S. Ren, J. Sun, in The IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ 2016.
[58] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks, Arxiv 2018.
[59] S. Ioffe, C. Szegedy, in Int. Conf. on Machine Learning, PMLR, New York, NY 2015, pp. 448–456.
[60] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Arxiv 2017.
[61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE, Piscataway, NJ 2018.
[62] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, IEEE (CVPR), New York 2017.
[63] M. Tan, Q. V. Le, Efficientnet: Rethinking Model Scaling for Convolutional Neural Networks, JMLR, New York, NY 2020.
[64] M. Schuster, K. Paliwal, IEEE Trans. Signal Process. 1997, 45, 2673.
[65] S. Hochreiter, J. Schmidhuber, Neural Comput. 1997, 9, 1735.
[66] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, Arxiv 2014.
[67] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. U. Kaiser, I. Polosukhin, in Advances in Neural Information Processing Systems (Eds: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett), Vol. 30, Curran Associates, Inc., Red Hook, NY 2017.
[68] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova (Preprint), arXiv:1810.04805, v1, Submitted: October 2018.
[69] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, et al., Adv. Neural Inf. Process. Syst. 2020, 33, 1877.
[70] P. Dayan, L. F. Abbott, Theoretical Neuroscience, MIT Press, London 2001.
[71] W. Gerstner, W. M. Kistler, Spiking Neuron Models: Single Neurons, Populations, Plasticity, Cambridge University Press, Cambridge, England 2002.
[72] E. M. Izhikevich, IEEE Trans. Neural Netw. 2003, 14, 1569.
[73] S. Williams, A. Waterman, D. Patterson, Commun. ACM 2009, 52, 65.
[74] P. Dhilleswararao, S. Boppu, M. S. Manikandan, L. R. Cenkeramaddi, IEEE Access 2022.
[75] JEDEC Standards, https://www.jedec.org/standards-documents (accessed: August 2022).
[76] Nvidia a100 Tensor Core GPU, 2022, https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
[77] Nvidia h100 Tensor Core GPU, 2023, https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
[78] H. T. Kung, Computer 1982, 15, 37.
[79] S. Lee, S.-H. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, et al., in 2021 ACM/IEEE 48th Annual Int. Symp. on Computer Architecture (ISCA), IEEE, Piscataway, NJ 2021, pp. 43–56.
[80] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, et al., in 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3.
[81] J.-H. Kim, J. Lee, J. Lee, J. Heo, J.-Y. Kim, IEEE J. Solid-State Circuits 2021, 56, 1093.
[82] H. Fujiwara, H. Mori, W.-C. Zhao, M.-C. Chuang, R. Naous, C.-K. Chuang, T. Hashizume, D. Sun, C.-F. Lee, K. Akarvardar, S. Adham, T.-L. Chou, M. E. Sinangil, Y. Wang, Y.-D. Chih, Y.-H. Chen, H.-J. Liao, T.-Y. J. Chang, in 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3.
[83] C.-F. Lee, C.-H. Lu, C.-E. Lee, H. Mori, H. Fujiwara, Y.-C. Shih, T.-L. Chou, Y.-D. Chih, T.-Y. J. Chang, in 2022 IEEE Symp. on VLSI Technology and Circuits (VLSI Technology and Circuits), IEEE, Piscataway, NJ 2022, pp. 24–25.
[84] F. Tu, Y. Wang, Z. Wu, L. Liang, Y. Ding, B. Kim, L. Liu, S. Wei, Y. Xie, S. Yin, in 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3.
[85] S. Liu, P. Li, J. Zhang, Y. Wang, H. Zhu, W. Jiang, S. Tang, C. Chen, Q. Liu, M. Liu, in 2023 IEEE Int. Solid-State Circuits Conf. (ISSCC), IEEE, Piscataway, NJ 2023, pp. 250–252.
[86] Y.-D. Chih, P.-H. Lee, H. Fujiwara, Y.-C. Shih, C.-F. Lee, R. Naous, Y.-L. Chen, C.-P. Lo, C.-H. Lu, H. Mori, et al., in 2021 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ 2021, pp. 252–254.
[87] Z. Li, Z. Wang, L. Xu, Q. Dong, B. Liu, C.-I. Su, W.-T. Chu, G. Tsou, Y.-D. Chih, T.-Y. J. Chang, et al., IEEE J. Solid-State Circuits 2020, 56, 1105.
[88] A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. P. Strachan, N. Muralimanohar, IEEE Micro 2018, 38, 41.
[89] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, H. Qian, Nature 2020, 577, 641.
[90] S. Yin, X. Sun, S. Yu, J. S. Seo, IEEE Trans. Electron Devices 2020, 67, 4185.
[91] J.-H. Yoon, M. Chang, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, A. Raychowdhury, in 2021 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ 2021, pp. 404–406.
[92] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, V. Srikumar, ACM SIGARCH Comput. Archit. News 2016, 44, 14.
[93] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, ACM SIGARCH Comput. Archit. News 2016, 44, 27.
[94] W. Li, P. Xu, Y. Zhao, H. Li, Y. Xie, Y. Lin, in 2020 ACM/IEEE 47th Annual Int. Symp. on Computer Architecture (ISCA), IEEE, Piscataway, NJ 2020, pp. 832–845.
[95] C. Song, J. Kim, D. S. Jeong, Adv. Intell. Syst. 2023, 5, 2200289.
[96] S. Jung, H. Lee, S. Myung, H. Kim, S. K. Yoon, S.-W. Kwon, Y. Ju, M. Kim, W. Yi, S. Han, et al., Nature 2022, 601, 211.
[97] J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, in Proc. of the 42nd Annual Int. Symp. on Computer Architecture, ACM, New York, NY 2015, pp. 105–117.
[98] F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, F. Franchetti, in Proc. of the 52nd Annual IEEE/ACM Int. Symp. on Microarchitecture, IEEE, Piscataway, NJ 2019, pp. 347–358.
[99] M. Zhu, T. Zhang, Z. Gu, Y. Xie, in Proc. of the 52nd Annual IEEE/ACM Int. Symp. on Microarchitecture, IEEE, Piscataway, NJ 2019, pp. 359–371.
[100] G. K. Chen, P. C. Knag, C. Tokunaga, R. K. Krishnamurthy, IEEE J. Solid-State Circuits 2022, 58, 1117.
[101] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das, in 2017 IEEE Int. Symp. on High Performance Computer Architecture (HPCA), IEEE, Piscataway, NJ 2017, pp. 481–492.
[102] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, R. Das, in 2018 ACM/IEEE 45th Annual Int. Symp. on Computer Architecture (ISCA), IEEE, Piscataway, NJ 2018, pp. 383–396.
[103] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ 2016, pp. 2818–2826.
[104] D. Fujiki, S. Mahlke, R. Das, in Proc. of the 46th Int. Symp. on Computer Architecture, 2019, pp. 397–410.
[105] J. E. Volder, IRE Trans. Electron. Comput. 1959, EC-8, 330.
[106] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, K. Skadron, in 2009 IEEE Int. Symp. on Workload Characterization (IISWC), IEEE, Piscataway, NJ 2009, pp. 44–54.
[107] Intel AMX, https://www.intel.com/content/www/us/en/content-details/785250/accelerate-artificial-intelligence-ai-workloads-with-intel-advanced-matrix-extensions-intel-amx.html (accessed: January 2024).
[108] A. Patel, F. Afram, S. Chen, K. Ghose, in Proc. of the 48th Design Automation Conf., ACM, New York, NY 2011, pp. 1050–1055.
[109] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, L. Benini, in Proc. of the 25th Edition on Great Lakes Symp. on VLSI, ACM, New York, NY 2015, pp. 199–204.
[110] Y.-H. Chen, J. Emer, V. Sze, ACM SIGARCH Comput. Archit. News 2016, 44, 367.
[111] Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, IEEE J. Solid-State Circuits 2016, 52, 127.
[112] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, ACM SIGARCH Comput. Archit. News 2014, 42, 269.
[113] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, Y. Chen, IEEE Trans. Comput. 2016, 66, 73.
[114] J. Yue, C. He, Z. Wang, Z. Cong, Y. He, M. Zhou, W. Sun, X. Li, C. Dou, F. Zhang, et al., in 2023 IEEE Int. Solid-State Circuits Conf. (ISSCC), IEEE, Piscataway, NJ 2023, pp. 1–3.
[115] J.-W. Su, Y.-C. Chou, R. Liu, T.-W. Liu, P.-J. Lu, P.-C. Wu, Y.-L. Chung, L.-Y. Hung, J.-S. Ren, T. Pan, et al., in 2021 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ 2021, pp. 250–252.
[116] K. Ueyoshi, I. A. Papistas, P. Houshmand, G. M. Sarda, V. Jain, M. Shi, Q. Zheng, S. Giraldo, P. Vrancx, J. Doevenspeck, et al., in 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3.
[117] I. A. Papistas, S. Cosemans, B. Rooseleer, J. Doevenspeck, M.-H. Na, A. Mallik, P. Debacker, D. Verkest, in 2021 IEEE Custom Integrated Circuits Conf. (CICC), IEEE, Piscataway, NJ 2021, pp. 1–2.
[118] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W.-S. Khwa, H.-J. Liao, Y. Wang, J. Chang, in 2020 IEEE Int. Solid-State Circuits Conf. (ISSCC), IEEE, Piscataway, NJ 2020, pp. 242–244.
[119] B. Wang, C. Xue, Z. Feng, Z. Zhang, H. Liu, L. Ren, X. Li, A. Yin, T. Xiong, Y. Xue, et al., in 2023 IEEE Int. Solid-State Circuits Conf. (ISSCC), IEEE, Piscataway, NJ 2023, pp. 134–136.
[120] S. Xie, C. Ni, A. Sayal, P. Jain, F. Hamzaoglu, J. P. Kulkarni, in 2021 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 64, 2021, pp. 248–250.
[121] S. Kim, Z. Li, S. Um, W. Jo, S. Ha, J. Lee, S. Kim, D. Han, H.-J. Yoo, in 2023 IEEE Int. Solid-State Circuits Conf. (ISSCC), IEEE, Piscataway, NJ 2023, pp. 256–258.
[122] D. S. Jeong, R. Thomas, R. S. Katiyar, J. F. Scott, H. Kohlstedt, A. Petraru, C. S. Hwang, Rep. Prog. Phys. 2012, 75, 076502.
[123] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, F. Stefanini, D. Sumislawska, G. Indiveri, Front. Neurosci. 2015, 9, 141.
[124] C. Frenkel, M. Lefebvre, J.-D. Legat, D. Bol, IEEE Trans. Biomed. Circuits Syst. 2018, 13, 145.
[125] K. A. Boahen, IEEE Trans. Circuits Syst. II 2000, 47, 416.
[126] J. Park, T. Yu, S. Joshi, C. Maier, G. Cauwenberghs, IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2408.
[127] V. Kornijcuk, J. Park, G. Kim, D. Kim, I. Kim, J. Kim, J. Y. Kwak, D. S. Jeong, Adv. Mater. Technol. 2019, 4, 1800345.
[128] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, K. Boahen, Proc. IEEE 2014, 102, 699.
[129] J. M. Brader, W. Senn, S. Fusi, Neural Comput. 2007, 19, 2881.
[130] R. Brette, W. Gerstner, J. Neurophysiol. 2005, 94, 3637.
[131] J. Kim, V. Kornijcuk, D. S. Jeong, in 2020 21st Int. Symp. on Quality Electronic Design (ISQED), IEEE, New York, NY 2020, pp. 358–363.
[132] J. Kim, V. Kornijcuk, C. Ye, D. S. Jeong, IEEE Trans. Circuits Syst. I 2021, 68, 350.
[133] S. Yang, J. Wang, B. Deng, C. Liu, H. Li, C. Fietkiewicz, K. A. Loparo, IEEE Trans. Cybern. 2018, 49, 2490.
[134] L. Deng, G. Wang, G. Li, S. Li, L. Liang, M. Zhu, Y. Wu, Z. Yang, Z. Zou, J. Pei, Z. Wu, X. Hu, Y. Ding, W. He, Y. Xie, L. Shi, IEEE J. Solid-State Circuits 2020, 55, 2228.
[135] J. K. Eshraghian, M. Ward, E. O. Neftci, X. Wang, G. Lenz, G. Dwivedi, M. Bennamoun, D. S. Jeong, W. D. Lu, Proc. IEEE 2023, 111.
[136] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang, H. Zhou, G. Li, Y. Tian, Sci. Adv. 2023, 9, eadi1480.
[137] K. Heckel, kmheckel/spyx: v0.1.0-beta 2023, https://doi.org/10.5281/zenodo.8241588.
[138] P. Werbos, Proc. IEEE 1990, 78, 1550.
[139] M. Xiao, Q. Meng, Z. Zhang, D. He, Z. Lin, in Advances in Neural Information Processing Systems (Eds: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh), Vol. 35, Curran Associates, Inc., Red Hook, NJ 2022, pp. 20717–20730.
Choongseok Song received his B.S. degree in electronics and information engineering from Sejong University, Seoul, South Korea, in 2020. He is currently pursuing his Ph.D. degree in materials science and engineering at Hanyang University, Seoul, South Korea. His research interests include computer architecture for deep learning acceleration based on neural processing units and processing-in-memory.
ChangMin Ye received his B.S. degree in materials science and engineering from Hanyang University,
Seoul, South Korea, in 2020, where he is currently pursuing his Ph.D. degree in materials science and
engineering. Since 2020, he has been focusing on learning digital neuromorphic processor design.
Yonguk Sim received his B.S. degree in electronic engineering from Hanyang University, Seoul, South
Korea, in 2023, where he is currently pursuing the integrated Ph.D. degree in semiconductor engineering.
Since 2023, he has been focusing on NVM-based deep learning accelerator design.
Doo Seok Jeong is a professor at Hanyang University, Republic of Korea. He received his B.E. and M.E.
in materials science from Seoul National University in 2002 and 2005, respectively. He received his Ph.D.
degree in materials science from RWTH Aachen, Germany in 2008. He was with the Korea Institute of
Science and Technology from 2008 to 2018. His research interest includes spiking neural networks for
sequence learning and future prediction. Learning algorithms, spiking neural network design, and digital
neuromorphic processor design are his current research focus.