0% found this document useful (0 votes)

35 views20 pages

Hardware For Deep Learning Acceleration

Uploaded by

Bill Petrie

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

35 views20 pages

Hardware For Deep Learning Acceleration

Uploaded by

Bill Petrie

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

REVIEW

www.advintellsyst.com

Hardware for Deep Learning Acceleration

Choongseok Song, ChangMin Ye, Yonguk Sim, and Doo Seok Jeong*

DNNs are graph models that consist of

Deep learning (DL) has proven to be one of the most pivotal components of nodes (neurons) and edges (synapses),
machine learning given its notable performance in a variety of application which were initially inspired by the biolog-
domains. Neural networks (NNs) for DL are tailored to specific application ical behavior of the brain.[20] The node is a
local processor that encodes its input
domains by varying in their topology and activation nodes. Nevertheless, the
(weighted sum of the outputs from its
major operation type (with the largest computational complexity) is commonly fan-in nodes) as an output value with vari-
multiply-accumulate operation irrespective of their topology. Recent trends in ous precision ranging from 32b floating-
DL highlight the evolution of NNs such that they become deeper and larger, and point (FP32) to 1b (’00 or ’10 ) depending
thus their prohibitive computational complexity. To cope with the consequent on the neuron model employed. Diverse
neuron models of different computational
prohibitive latency for computation, 1) general-purpose hardware, e.g., central
complexities are available, e.g., sigmoid,
processing units and graphics processing units, has been redesigned, and hyperbolic tangent, rectified linear unit
2) various DL accelerators have been newly introduced, e.g., neural processing (ReLU),[21] leaky ReLU,[22] Gaussian error
units, and computing-in-memory units for deep NN-based DL, and neuromor- linear unit,[23] swish,[24] and spiking
phic processors for spiking NN-based DL. In this review, these accelerators and unit.[25] The last model should be distin-
their pros and cons are overviewed with particular focus on their performance guished from the others given that spiking
units are time-dependent models unlike
and memory bandwidth.
the others, rendering DNNs with spiking
units time-dependent. Additionally, spik-
ing units output binary numbers (’00 and ’10
1. Introduction indicating nonspiking and spiking, respectively) unlike the
others. Given these unique properties, spiking unit-based DNNs
Machine learning (ML)—coined by Arthur Samuel in 1959[1]— are particularly referred to as deep spiking NNs (SNNs).
has had a significant impact on human life by being applied to The output from a given neuron serves as inputs to its fan-out
various applications, including natural language processing,[2–5] neurons with particular weights that are trainable parameters.
speech recognition,[6–8] computer vision,[9–12] and machine Subsequently, the fan-out neurons encode the sum of weighted
translation.[13,14] Recently, ML has also been applied to genome inputs as outputs. This neural processing continues through the
analysis,[15] autonomous driving,[16] finance,[17] and game layers in a DNN until the output neurons respond.
industry.[18,19] Particularly, deep learning (DL)—a subset of A common trend is that DNNs evolve such that they become
ML—is the key substrate of this remarkable progress in ML. deeper and larger, so that the numbers of weight (parameters)
DL uses deep neural networks (DNNs) as learning models, which and computations required become significantly larger. For
are significantly versatile at various tasks with their excellent instance, the state-of-the-art language models incorporate a tre-
learning capability. mendous number of parameters insomuch as GPT-4 includes
1.7 trillion parameters[26] and Llma-2 70 billion parameters,[27]
which causes severe space complexity. DNNs are particularly
C. Song, C. M. Ye, D. S. Jeong tailored to their application domains, e.g., convolutional NNs
Division of Materials Science and Engineering (CNNs)[28–30] for image feature extraction in computer vision
Hanyang University and transformer-based networks to capture the degree of signifi-
222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Republic of Korea
E-mail: dooseokj@hanyang.ac.kr cance (attention) between tokens in natural language processing.
Irrespective of DNN topology, the major workload arises from
Y. Sim, D. S. Jeong
Department of Semiconductor Engineering the multiplication of a feature and kernel weight and accumula-
Hanyang University tion of the products, i.e., multiply-accumulate (MAC) operations
222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Republic of Korea (OPs). MAC OPs apply to a vast amount of data brought from the
The ORCID identification number(s) for the author(s) of this article memory through data buses, so that the overall operation latency
can be found under https://doi.org/10.1002/aisy.202300762. is dictated by the arithmetic operations (MAC OPs) and/or mem-
© 2024 The Authors. Advanced Intelligent Systems published by Wiley-
ory bandwidth.
VCH GmbH. This is an open access article under the terms of the The megatrend in DNN evolution is barely supported by cen-
Creative Commons Attribution License, which permits use, distribution tral processing units (CPUs) because of their limited capability of
and reproduction in any medium, provided the original work is parallel computing (due to the limited number of arithmetic logic
properly cited. units (ALUs) in a CPU) and limited memory bandwidth. Given
DOI: 10.1002/aisy.202300762 that the major workload in DNN operation results from simple

Adv. Intell. Syst. 2024, 2300762 2300762 (1 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

MAC OPs (based on a single instruction) applied to massive data, The rest of the article is organized as follows. Section 2 over-
graphics processing units (GPUs) for single-instruction multiple- views key computations (major workload) in DNNs classified as
data and single-instruction multiple-threads (SIMD/SIMT) compute- and memory-bound models, software frameworks for
with high-bandwidth memory are the very suitable type of autodifferentiation, and the suitable topology of DNNs to use
general-purpose hardware for DL acceleration.[31–41] SIMD/SIMT such autodifferentiation frameworks. Additionally, this section
GPUs can significantly accelerate DNN operations (i.e., MAC overviews SNNs and their major workload in comparison with
OPs) irrespective of DNN topology, and thus serving as the DNNs. Section 3 addresses various processors for DL accelera-
mainstream hardware for DL acceleration. However, their high tion, including GPUs, NPUs, and CIM units, and overviews
power consumption, insomuch as they consume a few hundreds several recent designs for each type of accelerator. Section 4 is
watts,[42] is a challenge to their applications to AI at the edge dedicated to neuromorphic processors that accelerate computa-
(also known as edge AI and on-device AI). tions in SNNs. In this section, we overview various neuromor-
As alternatives, various types of application-specific integrated phic processor designs and explain distinguishable concerns
circuit (ASIC)-based accelerators have been introduced, includ- in neuromorphic processor design from that in DNN-based
ing neural processing units (NPUs),[43–46] computing-in-memory DL accelerator design. Section 5 concludes this review with con-
(CIM) units,[47,48] and neuromorphic processors.[49–52] Note that cluding remarks and outlook.
neuromorphic processors (also known as event processors)
accelerate SNN while the others DNN. These alternatives to
GPUs aim to boost the operational efficiency at the cost of loss 2. Computation in NNs
in versatility and flexibility to some degree. That is, a given alter-
native leverages its operational efficiency for a particular class of We address generic features of NNs with regard to computation
DNN. Generally, DNNs are classified as compute- and memory- in Section 2.1, which hold for DNNs and SNNs. Computational
bound models with regard to the overall operation-dictating features of DNNs and SNNs are addressed in Section 2.2 and 2.3,
factor: 1) arithmetic operation rate and 2) memory bandwidth, respectively.
respectively. This classification will be elaborated in Section 3.
Note that the aforementioned accelerators for DNN-based DL 2.1. Generic Properties of Operations in NNs
are responsible for MAC OPs only, so that they need host
CPUs to complete the whole computations in DNNs. 2.1.1. Elementary Layers and Their Computation
Neuromorphic processors (event processors) are standalone
devices without host CPU and main memory unlike the afore- Although NNs differ in topology for different tasks, they are com-
mentioned accelerators. In this review, we refer to neuromorphic monly of layer-wise topology, and neighboring layers are joined
processors as event-based inference accelerators without learn- by unidirectional connections as illustrated in Figure 1a. Despite
ing engine unless otherwise stated. They can be standalone given the diversity in NN topology, the major types of elementary layers
several crucial features of SNNs such that 1) a feature map (event include a dense layer and convolutional layer (conv layer for
map) is remarkably sparse, i.e., the feature map at a given time- short)—illustrated in Figure 1b,c, respectively. Note that we
step includes only few ’10 s and mostly ’00 s and 2) the data assume a mini-batch size of one and FP32 data format, i.e.,
required for inference are local to each spiking neuron. These real-value data, hereafter unless otherwise stated. For a dense
allow the spiking neurons to be distributed over multiple cores layer of N neurons with a fan-in layer of M neurons, its weight
in a neuromorphic processor and to send the events from them matrix w is of N M in dimension. The total number of weights
to their fan-out neurons in an ad hoc manner, significantly boost-
ing power efficiency.
Despite proposals of various DL acceleration platforms to date,
there exist no perfect acceleration platforms with high perfor-
mance in all key performance metrics, e.g., operational through-
put, power efficiency, versatility, flexibility, and so forth. Each of
these platforms has relative pros and cons in terms of these
performance metrics, so that appropriate platforms need to be
chosen to accelerate key workloads for a given model. To begin,
one should understand such key workloads for various models.
In this review, we aims to provide comprehensive reviews on
workloads for DNNs and SNNs of different topologies and vari-
ous hardware platforms to accelerate their major operations. The
primary contributions of this review are as follows: 1) We review
generic properties of DNNs and SNNs and classify them with
regard to the bottleneck in their computations. 2) We overview
various acceleration platforms for DNNs, including, CPUs,
GPUs, NPUs, and CIM units, and compare them with regard
to the key performance metrics. 3) We overview neuromorphic Figure 1. Schematics of a a) feedforward neural network, b) dense layer,
processors for SNNs and comprehensively analyze their distin- and c) conv layer. The dense and conv layers calculate y ¼ wx and
guishable features from DNN accelerators. y ¼ k x, respectively.

Adv. Intell. Syst. 2024, 2300762 2300762 (2 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

Algorithm 1. Naive Matrix–Vector Multiplication. Multiplication of an Algorithm 2. Convolution of input tensor x ∈ ℝcin hw using kernel
0 0
input vector x ∈ ℝM and weight matrix w ∈ ℝNM , yielding an output k ∈ ℝcout cin kh kw , yielding output tensor y ∈ ℝcout h w .
vector y ∈ ℝN .
Initialize y;
Initialize y for i ¼ 1 to cout do
for i ¼ 1 to N do for j ¼ 1 to h0 do
s ← 0; for k ¼ 1 to w0 do
for j ¼ 1 to M do s ← 0;
/* Single MAC OP */ for l ¼ 1 to cin do
s ← s þ w½i, j x½j; for m ¼ 1 to kh do
end for o ¼ 1 to kw do
y½i ← s; /* Single MAC OP */
end s ← s þ k½i, l, m, o
x½l, j þ m 1, k þ o 1;
end
and memory dedicated to them are NM and 32NM bits, respec-
end
tively. The feature map z for this dense layer is calculated by a
linear equation y ¼ wx (x denotes the feature map of the fan-in end
layer) and subsequent nonlinear equation z ¼ f ðyÞ, where f y½i, j, k ← s;
denotes a nonlinear activation function. The linear equation end
(matrix–vector multiplication) involves two nested for-loops, end
and thus NM MAC ﬂoating-point operations (FLOPs) in total,
end
as shown in Algorithm 1. Therefore, the time complexity for the
linear equation is Oðn2 Þ. The nonlinear equation z ¼ f ðyÞ involves
N FLOPs, i.e., OðnÞ. Therefore, the major workload arises from
the MAC FLOPs of Oðn2 Þ in complexity. Given that the major layers with FWR = 1. In this case, the arithmetic operational
workload involves NM weights and NM FLOPs, the ratio of throughput (rather than memory bandwidth) likely dictates the
the number of FLOPs to the number of weights (FWR) is one. overall operational throughput, so that an appropriate strategy
to accelerate the conv layer computation is to increase the arith-
FWR ¼ 1 for dense layers (1) metic operational throughput by employing multiple ALUs that
work in parallel.
That is, one weight loaded is used for one FLOP, and thus, for
each FLOP, one weight value needs to be loaded. In this case,
memory-access latency and memory bandwidth (rather than 2.1.2. Data Formats
the throughput of the arithmetic operations) likely dictate the
overall operational throughput. Consequently, employing a A commonly used data format is FP32 (single-precision ﬂoating-
high-bandwidth memory is an appropriate strategy to accelerate point): 1b sign, 8b exponent, and 23b mantissa (Figure 2a).
the dense layer computation.
A feature map for a conv layer is of c out h0 w 0 , where c out ,
h0 , and w0 denote the channel, height, and width of a rank-3 ten-
sor (see Figure 1c). The feature map is calculated by convolution
(linear operation) of a fan-in feature map x of cin h w using a
rank-4 kernel k of c out cin kh kw , i.e., y ¼ k ∗ x and subse-
quent nonlinear equation z ¼ f ðyÞ. The convolution (y ¼ k ∗ x)
involves six nested for-loops for a unit MAC OP as shown in
Algorithm 2, so that its time complexity is Oðn6 Þ. The number
of FLOPs involved in the convolution is therefore c out h0 w0 cin kh kw .
The nonlinear equation z ¼ f ðyÞ involves c out h0 w0 FLOPs, i.e., the
time complexity is Oðn3 Þ, and thus the major workload for con-
volution layers arises from the convolution operation of Oðn6 Þ in
complexity. FWR for this conv layer is therefore given by

c out h0 w0 cin kh kw
FWR ¼ ¼ h0 w 0 for conv layers: (2)
cout c in kh kw

That is, one weight loaded is h0 w0 times reused for FLOPs, Figure 2. a) Schematics of FP32, BFLOAT16, and FP16 formats.
which is a distinguishable feature of conv layers from dense b) Architecture of a FP multiplier.

Adv. Intell. Syst. 2024, 2300762 2300762 (3 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

Numbers in FP32 are considered as real values. Given the afore- by using GPUs with autodifferentiation frameworks like
mentioned trend in NN evolution, the space complexity (memory PyTorch[54] and TensorFlow.[55] The backpropagation of error
usage) prohibitively increases, leading to the desire for the use of algorithm (backprop for short) uses the backward pass of the
lower precision FP formats such as FP16 (sign/exponent/ error (evaluated at the output nodes), which is based on the chain
mantissa: 1b/5b/10b) and BFLOAT16 (sign/exponent/mantissa: rule on the gradient tensor computed for each node.
1b/8b/7b). These formats are illustrated in Figure 2a. These Autodifferentiation frameworks compute the gradient tensors
low-precision FP formats allow the reduction in memory usage in a user friendly manner. The chain rule should apply to
by 50% and memory-loading latency for a given memory DAGs only because it yields wrong results for graphs with
bandwidth. Particularly, BFLOAT16 allows a similar number self-association like feedback connections.[56] Feedforward NNs
range to FP32 because of eight exponent bits but largely reduces can directly be mapped onto DAGs given no inherent self-
power and area overheads for FP multipliers. The representative association included. Recurrent NNs (RNNs) include in-layer
architecture of a FP multiplier is shown in Figure 2b. The mul- connections, which inevitably include closed data paths within
tiplication of two numbers in FP involves the addition of two a layer. To use autodifferentiation frameworks for such RNNs,
integer exponents and multiplication of two integer mantissa they are unrolled (duplicated) over time on DAGs in that the
parts; integer adder and multiplier are included in Figure 2b. same node at different timesteps is considered as different
Given that the integer multiplier mainly dictates the power nodes, which removes self-association. There exist NNs with
and area overheads for the FP multiplier, BFLOAT16 with ten feedback paths at the same timestep, e.g., SNN with spiking
mantissa bits can significantly reduce the power and area neurons whose state variables (membrane potential) are reset
overheads compared with FP32 and even with FP16. Mixed pre- upon their spiking. In this case, the reset signals are delayed
cision is often used as for Tensor Processing Units (TPUs) for one timestep to avoid self-association within the same time-
(BFLOAT16 for multiplication and FP32 for accumulation).[44] step. Otherwise, particular mathematical techniques based on
In this case, FLOAT16 numbers are easily converted to FP32 the implicit function theorem are required as for EXODUS.[56]
by simply attaching 16 null bits to the right-hand side of the
LSB in FLOAT16.
2.2. Computation in DNNs
There exist efforts to use low-precision integer formats (e.g.,
INT8 and INT4) in place of floating-point formats to further
CNNs are broadly used for computer vision, which include
reduce the memory usage, memory-loading latency, and power
AlexNet,[28] VGG,[29] ResNet,[57] DenseNet,[58] GoogLeNet,[30]
and area overheads for ALUs. To this end, additional algorithms
and so on. Most layers in CNNs are conv layers for feature extrac-
for weight and activation quantization should apply to NNs,
tion but with a few dense layer for classification. Therefore, the
which cause inevitable loss in NN performance.[53] As such,
major operation type is convolution elaborated in the pseudocode
an integer multiplier is much lighter than a FP multiplier as
in Algorithm 2. A pooling layer generally follows a given conv
it is a component of a FP multiplier as shown in Figure 2b, which
layer, which reduces the dimension of the conv layer. There exist
reduces the overheads for the multiplier logic. The extreme cases
various types of available pooling layers such as max pooling,
include binary weight and activation and power-of-two weight,
average pooling, and adaptive max pooling. Max pooling for
which replace multipliers by simple XNOR logics and shift regis-
2D kernels (kh kw in size) involves a simple max function that
ters, respectively.
outputs the largest feature among the features in the 2D kernel.
GPUs have excellent flexibility as they support various data
Average pooling outputs the feature value averaged over the fea-
formats, e.g., FP64/32/16, BFLOAT16, INT8/4, and even binary.
ture map within the kernel, requiring addition and division oper-
ASIC-based accelerators are frequently designed for particular
ations. Batch normalization (BN)[59] is commonly deployed after
data formats to boost their performance at the cost of loss in flex-
pooling. BN L2 normalizes the feature map after pooling using
ibility. An extreme case is the CIM units based on analog MAC
the mean and standard deviation of the features for a given chan-
OPs, which limits the data format to integer only. Generally, the
nel over the samples in a mini-batch. The normalized feature
performance is significantly dictated by the data format used
map undergoes scale and shift using two trainable parameters,
such that the lower the data precision, the higher the perfor-
which require multiply and addition operations such that the
mance insomuch as the performance for INT4 is ≈64 that
normalized feature map is multiplied by the scale parameter,
for FP32 for NVIDIA A100.
and the shift parameter is subsequently added. The nonlinear
Note that the term operation (OP) indicates format-unspecific
function finally applies to the result to compute the activation.
operation, so that it is a rather general term that includes FLOP.
Although various activation functions are available, ReLU and
Hereafter, we use the general terms OP and OPs to refer to oper-
its variants are often chosen as activation functions for feedfor-
ations in various data formats.
ward CNNs.
There are several lightweight DNNs that use depth-wise
2.1.3. Directed Acyclic Graphs separable convolution in place of the aforementioned normal
convolution to reduce their time and space complexities, e.g.,
Directed acyclic graphs (DAGs) define the sequence of successive MobileNets,[60,61] ShuffleNets,[62] and EfficientNets.[63] For a
computations and the consequent data flow. They are acyclic cout h0 w0 feature map with a cin h w fan-in feature
such that the data processed by a given node (function) are dis- map, depth-wise separable convolution requires the total number
allowed to be directed to the node through any paths, i.e., no self- of OPs (time complexity) and parameters (space complexity) as
association is allowed. NNs are mapped onto DAGs to train them follows

Adv. Intell. Syst. 2024, 2300762 2300762 (4 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

(
# OPs ¼ c in h0 w0 kh kw þ c out cin h0 w0 , et=τm HðtÞ for LIF
|fflfflfflfflfflfflffl{zfflfflfflfflfflfflffl} |fflfflfflfflfflffl{zfflfflfflfflfflffl} ε¼ (6)
Depthwis econv Separable conv
(3) HðtÞ for IF
# parameters ¼ cin kh kw þ cout c in :
|fflfflffl{zfflfflffl} |fflffl{zfflffl}
Depthwise conv Separable conv where H denotes the Heaviside step function. Note that the tem-
poral kernel for the IF model is considered as a special case of the
These highlight significant reductions in time and space com- LIF model with τm ! ∞.
plexities compared with the normal convolution whose time and Equation (5) is expressed as a form of convolution integral as
space complexities are cout h0 w 0 c in kh kw and cout cin kh kw , respec- follows
tively. However, FWR for a depth-wise separable conv layer is X Z t
the same as the normal convolution. ui ðtÞ ¼ wij εðτÞsj ðt τÞdτ (7)
j 0
c h0 w 0 kh kw þ c out cin h0 w0
FWR ¼ in ¼ h0 w 0 : (4)
c in kh kw þ c out cin With the temporal kernel in Equation (6), Equation (7) is
equivalent to the following differential equation
Natural language processing depends on dense layer-based
DNNs such as 1) RNNs, e.g., bidirectional RNN,[64] long dui u X
¼ i þ wij sj (8)
short-term memory,[65] and gated recurrent units,[66] and 2) trans- dt τm j
former[67] and its variants, e.g., bidirectional encoder representa-
tions from transformers[68] and generative pretrained This equation can be expressed as a discrete form (with
transformer.[26,69] These models use either dense layers and/or Δt ¼ 1) using the Euler method (explicit method), and we have
dense layer-like operations, i.e., matrix–vector or matrix–matrix the following recursive form
multiplications, so that FWR for the major operation is unity. X
Additionally, these models include exponential function-based ui ½t ¼ e1=τm ui ½t 1 þ wij sj ½t (9)
nonlinear activation functions, e.g., sigmoid, hyperbolic tangent, j

and Gaussian error liner unit (GELU).[23] Particularly, GELUs are

frequently employed in transformer-based models. When the membrane potential ui exceeds a given threshold θ
at t, an LIF or IF neuron fires an event (spike), which can be
implemented using the Heaviside step function such that
2.3. Computation in SNNs
si ðtÞ ¼ Hðui θÞ (10)
SNNs are dynamic (time-dependent) models that extract spatio-
temporal features of input data.[25] This is mainly because of spik- This spike function corresponds to the nonlinear activation
ing neurons (corresponding to activation functions in DNNs) function. Upon the spike generation, the potential ui is reset
to a preset potential value (often to zero) and ready for the inte-
and/or synapse models, which use spatio-temporal kernels to
gration of subsequent spikes. The generated spike si is subse-
compute their state variables. There are diverse neuron models
quently routed to the fan-out (postsynaptic) neurons with the
with different computational complexities, e.g., integrate-and-fire
corresponding kernel values.
(IF) model, leaky IF (LIF) model,[70] and spike response model
The aforementioned sequence of spiking-neuron computa-
(SRM),[71] Hodgkin–Huxley model,[70] Izhikevich’s model,[72]
tions (Equation (9) and (10)) is mapped onto a graph in
and so forth. Particularly, the first three models are often
Figure 3. Note that the superscript of each variable indicates
employed in deep SNNs given their low complexities (such that
IF and LIF models use a single-state variable and SRM two-state the layer index. The signal from sðlÞ to reset the potential uðlÞ
variables) and simplicity in implementation. The last two models is delayed for one timestep, i.e., sðlÞ ½t 1 resets uðlÞ ½t rather than
concern their biological plausibility rather than computational uðlÞ ½t 1. This is to avoid self-association as explained in
complexity. Hereafter, we address IF and LIF models for deep Section 2.1.3. The spike function in Eq. (10) for a layer l results
SNNs with the minimal complexity. IF and LIF models use sep- a spike (event) map sðlÞ which is equivalent to a feature map for a
arable spatio-temporal kernels k to compute their state variables, DNN. Its dimension is determined by the layer type (conv or
i.e., subthreshold membrane potential ui for a neuron i. dense layer) as elaborated in Section 2.1. Notably, the spike
ðlÞ
P map includes binary data only given the elements si ∈ f0, 1g.
ui ðtÞ ¼ kij ∗ ρj ðtÞ
j A crucial feature of such spike maps is that the number of
P
¼ w |{z} ε ∗sj ðtÞ ’10 s in the map is remarkably sparse, i.e., the sparsity f sp is close
|{z}
j (5) to unity, which is defined by
ij temporal
P spatial 8
sj ¼ δðt t̂k Þ >
> Number of 00 s
< for s ∈ f0, 1gchw
k
f sp ¼ chw (11)
>
> Number of 00 s
where sj indicates a train of spikes at t̂k from a fan-in (presynap- : for s ∈ f0, 1gm
m
tic) neuron j, and wij is the weight between neurons i and j.
IF and LIF models differ for the temporal contribution ε in As a generic property of NNs (explained in Section 2.1), the
Equation (5) such that major operations in SNNs are the multiplication of a weight

Adv. Intell. Syst. 2024, 2300762 2300762 (5 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

CPUs and GPUs, and 2) ASIC-based accelerators, e.g., NPUs and

CIM units. The former is addressed in Section 3.3, and the latter
in Section 3.4 and 3.5, respectively.

3.1. Rooﬂine Model

The rooﬂine model proposed in ref. [73] provides insights into

computers with particular peak performance in OPS (OPs/s)
and memory bandwidth in an easy-to-understand manner. For
a given computer, the roofline model designates the attainable
performance with respect to arithmetic intensity, i.e., the ratio
of the number of OPs to the number of data for a given task.
Particularly, this model highlights the bottleneck for given tasks.
The peak performance cannot fully be utilized when the data
transfer is slow due to the limited memory bandwidth, so that
the attainable performance is dictated by the memory bandwidth
(memory-bound case). When the data transfer is fast (high mem-
Figure 3. SNN mapped onto a DAG. ory bandwidth), the attainable performance is dictated by the
peak performance (compute-bound case). Hence, the attainable
performance for a given computer is expressed as
wij and activation sj and the subsequent accumulation of the
products to compute the membrane potential, which are Attainable performance ¼
expressed in Equation (9). This MAC OP is referred to as synaptic
Peak performance (13)
operation (SynOP) min
Arithmetic intensity memory bandwidth:
ui ← ui þ wij sj (12)
Figure 4 illustrates a schematic of the roofline model.
which is equivalent to MAC OP in DNNs. Given the high sparsity The arithmetic intensity value at the cross point between the
of spike maps, sj in Equation (12) is mostly zero, so that SynOPs two lines is referred to as a ridge point. Tasks of arithmetic inten-
for zero sj may be skipped using dedicated processors, particu- sity below the ridge point are termed memory-bound tasks while
the opposite tasks compute-bound tasks.
larly, event processors. A unit SynOP is equivalent to routing an
As detailed in Section 2.1, FWR for conv and depth-wise sep-
event from a presynaptic (fan-in) neuron to a neuron i and sub-
arable conv layers (Equation (2) and (4)) is h0 w0 FWR for dense
sequently updating its membrane potential ui by wij , which is
layers (Equation (1)). Notably, the arithmetic intensity in the roof-
carried out only upon event, i.e., no SynOP when sj ¼ 0. This
line model likely scales with FWR. Thus, the computation of conv
ad hoc update of membrane potential is the key to event process- layers is likely located on the right-hand side of the ridge point
ors that may largely reduce the computational overhead, which (i.e., compute-bound) while that of dense layers on the left-hand
will be detailed in Section 4.1. side of the ridge point (i.e., memory-bound). Consequently,
CNNs are likely classified as compute-bound models while
RNNs and transformer-based models for natural language proc-
3. Accelerators for DNN-Based DL essing as shown in ref. [44] for TPUs.
The y-intercept in Figure 4 (highlighted using a red-filled
For the moment, the megatrend in DNN evolution (deeper and
circle) corresponds to the memory bandwidth on a logarithmic
larger) is supported by the hardware that enables massively par-
scale. For a fixed peak performance, the increase of memory
allel computing with large memory bandwidth, e.g., GPUs and
NPUs. These processors support massive parallelism in arithme-
tic operation, significantly accelerating the major arithmetic
operations. Yet, their limited memory bandwidth can limit the
acceleration within the von Neumann architecture in which
the main memory and processing unit are separate and commu-
nicate through buses with limited bandwidth. There barely exist
ideal processors with regard to operational throughput, power
efficiency, versatility, flexibility, and price. In the following sec-
tions, we address various types of DL accelerators and their pros
and cons in terms of the aforementioned key metrics. To begin,
Section 3.1 explains the roofline model that frequently applies to
DL accelerators for performance estimation for given DNNs.
Section 3.2 provides a brief overview on such accelerators mainly
for comparisons among them. We address various accelerators
classified as 1) general-purpose hardware-based accelerators, e.g., Figure 4. Schematic of the roofline model on a log–log plot.

Adv. Intell. Syst. 2024, 2300762 2300762 (6 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

bandwidth reduces the ridge point, widening the compute-bound

range in which the peak performance is utilized. In this way,
dense layer-based models may belong to the compute-bound
regime. Therefore, the DL accelerator considered should estab-
lish an optimization strategy for DL workloads with limited
memory bandwidth. In the following section, we overview com-
mon features for each accelerator type and present acceleration
strategies with regard to the rooﬂine model.

3.2. Overview on Accelerators

For memory- and compute-bound tasks (depicted in Figure 4),

DL accelerators employ two distinct accelerating strategies:
1) memory optimization, e.g., data-transfer optimization and
increase of off-chip memory bandwidth, and 2) boost in peak per-
formance, e.g., parallel computation and analog computation.
This review also discusses the power efficiency of various plat-
forms, which is not addressed in the roofline model. DL accel-
erators have been developed on various hardware platforms that
include CPUs and GPUs in general-purpose architecture, and
NPU and CIM in application-specific architecture, but within Figure 6. Architecture of a multicore CPU.
the von Neumann architecture in which a main memory and
processing units are separate. A schematic of the roofline model
for CPUs, GPUs, NPUs, and CIM units is illustrated in Figure 5.
A CPU consists of a large number of branch control logic units
and cache hierarchy of high memory bandwidth, and a small
number of ALUs as illustrated in Figure 6. CPUs are widely used
in general-purpose computers because of their high flexibility,
large-size cache memories of high-bandwidth, and large number
of branches supported. However, the limited number of cores
(particularly, ALUs) limits their capability to process parallel
workloads.[74] Hence, CPU-based accelerators need to employ
additional computation engines within CPUs and/or new type
of cache (detailed in Section 3.3). Additionally, an off-chip double
data rate (DDR) memory of limited bandwidth is commonly used
as a main memory.
Unlike CPUs, GPUs are equipped with numerous ALUs that
work in parallel and most widely used as DL accelerators. A com-
mon architecture of GPUs is illustrated in Figure 7. GPUs exploit
parallelism by SIMT in that massive multiple threads allocated by
a GPU kernel are designated for computation and

Figure 7. Architecture of a GPU. SM and SP stand for streaming multi-

processor and streaming processor, respectively.

simultaneously processed. To boost the off-chip memory band-

width, GPUs use graphics DDR (GDDR) or high-bandwidth
memory (HBM) of a higher bandwidth than DDR. The commer-
cialized GDDR6 realizes ≈4 10 the bandwidth of DDR5.[75]
The schematic of the roofline model for GPUs in Figure 5 high-
lights the higher peak performance and memory bandwidth than
CPUs. However, their inherent high thermal design power
(TDP) is a challenge to devices for on-device AI.[76,77] As general-
purpose platforms, the key advantage of GPUs lies in their
Figure 5. Schematic of the roofline model for various accelerators. flexibility and versatility, accelerating various DL models in

Adv. Intell. Syst. 2024, 2300762 2300762 (7 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

various data formats (e.g., FP64/32/16, BFLOAT16, INT8/4,

even binary). Also, GPUs can remarkably accelerate both
memory- and compute-bound models given their high memory
bandwidth and high peak performance based on massive paral-
lelism. While the maximum parallelism is attainable through the
flexible configuration of threads via software, data transfer
between threads is realized only through the shared memory
because GPUs support SIMT. This leads to a significant amount
of data movement, and thus latency and power consumption .
In contrast to GPUs, NPUs (ASIC-based accelerators includ-
ing TPU) can enhance computational efficiency through data-
movement optimization by 1) an array of processing elements
(PEs) that are connected using registers and 2) on-chip memory
hierarchy (e.g., buffer, register, and scratchpad) (Figure 8). A unit
PE carries out a unit MAC OP per cycle. Systolic arrays[78] are
Figure 9. Schematic of a systolic array and dataflow.
frequently employed in NPUs, which realize a sequence of
MAC OPs based on partial sums buffered in the registers (each
of which is embedded in a PE). A schematic of a generic systolic
CIM aims to cope with the massive data movement from the
array is sketched in Figure 9. This systolic array-based architec-
off-chip memory and the consequent latency and power con-
ture is suitable for convolution operations. Given the distributed
sumption by placing PEs near or in the memory domain.
registers over PEs in use, the off-chip memory access frequency
This can significantly reduce the frequency of the off-chip mem-
can significantly be reduced, boosting the performance for a
ory access and data-movement latency for a given memory band-
given memory bandwidth and reducing power consumption.
width. This is equivalent to the increase of memory bandwidth in
Figure 5 shows a schematic of the roofline model for NPUs, indi-
effect as highlighted in Figure 5. However, CIM is endowed with
cating the higher peak performance than CPUs but the similar
a lower peak performance because of the limited parallelism in
memory bandwidth to CPUs, assuming DDR in use. Unlike
arithmetic operation compared to NPUs. A common architecture
general-purpose platforms, NPUs hardly support various data of CIM units is illustrated in Figure 10. Hereafter, CIM with
formats given the PEs are designed for specific data format as digital PEs near or in the memory domain is referred to as
for TPUv4 (BFLOAT16 and FP32 for multiplication and accumu- digital CIM while CIM based on fully or partially analog comput-
lation, respectively).[45] NPUs are suitable for compute-bound ing as analog CIM. To date, digital CIM has been demonstrated
models only given their high peak performance but lower using 1) various types of on-chip memories such as volatile
memory bandwidth than GPUs. For memory-bound models, memories, e.g., dynamic RAM (DRAM)[47,79,80] and static
the lower memory bandwidth hinders NPUs from fully utilizing RAM (SRAM),[81–86] and nonvolatile memories, e.g., resistive
the peak performance. RAM (RRAM),[87] and 2) various architectures, e.g., near-data
processing (PEs in the vicinity of the memory domain) and

Figure 8. Architecture of NPU. Each core indicates a PE for a unit

operation. Figure 10. Architecture of a CIM unit.

Adv. Intell. Syst. 2024, 2300762 2300762 (8 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

PEs merged into the memory domain. High-bandwidth DRAM- and carryless multiplication, are supported. Compute cache
based CIM units have been prototyped, for instance, a HBM- enhances operational performance by 1.9 and reduces energy
based digital CIM unit (FIMDRAM)[47,79] and GDDR6-based consumption by 2.4 compared to an Intel’s eight-core Sandy
CIM units (AiM).[80] These examples use high-bandwidth Bridge processor with three-level cache hierarchy.
DRAMs as embedded memories in conjunction with PEs in Neural cache[102] is the extension of compute cache, which
the vicinity of the memories. There exist diverse SRAM-based supports in-cache integer arithmetic operations in addition to
digital CIM units that fully utilize the advantages of SRAM such the logical and supplementary operations supported in compute
as fast operation, high bandwidth, and perfect compatibility with cache. For the in-cache arithmetic operations, data are trans-
complementary metal–oxide–semiconductor (CMOS) logic posed and mapped onto the SRAM array, i.e., each n-bit element
circuits.[81–86] is stored over n word lines. Two operands sharing a single bitline
Analog CIM is based on PEs merged into the memory are iteratively computed for each bit by simultaneously activating
domain, where full or partial MAC OPs are performed in an ana- the two word lines. Neural cache realizes integer addition, mul-
log manner. Consequently, memory access and full or partial tiplication, and division for n þ 1, n2 þ 5n 2, 1.5n2 þ 5.5n
MAC OP are performed at a single cycle, so that the latency cycles, respectively. Compared to baseline CPU (Xeon E5) and
for load and store in digital CIM can be avoided. A front-runner GPU (Titan Xp) for Inception-v3,[103] neural cache reduces infer-
is SRAM-based CIM in which bitwise multiplication is simply ence latency by 18.3 and 7.7 compared with a Xeon E5 CPU
implemented by using an AND gate (digital) and accumulation and Titan Xp GPU, respectively. Further, its energy efficiency
by using a bitline capacitor (analog). Resistance-based nonvolatile increases by 37.1 and 16.6 compared to the CPU and
memory (e.g., RRAM and MRAM)-based analog CIM has been GPU, respectively.
successfully demonstrated.[88–96] Particularly, RRAM-based ana- Duality cache is an in-cache computation architecture to sup-
log CIM is based on Kirschhoff ’s current law in a nonvolatile port various in-cache arithmetic operations in integer, fixed-
resistor array, inherently realizing bit-wise multiplication and point, and floating-point formats.[104] For addition in FP (which
accumulation of currents through RRAM bitcells that share is of large complexity as explained in Section 2.1.2), duality cache
the same bitline. Both digital and analog CIM units will be is equipped with a new FP addition algorithm (referred to as bit-
addressed in detail in Section 3.5. serial) of low time complexity, which applies to multiple data in
parallel. For INT multiplication and division, zero skipping algo-
rithms are adopted to skip redundant arithmetic operations. The
3.3. CPU-Based Accelerators
CORDIC algorithm[105] is used for transcendental functions in
FP. Duality cache implemented in a two-socket Xeon server by
CPU-based accelerators proposed to date mostly aim to perform
using the entire cache can support 150 the number of threads
operations in caches. As shown in Figure 6, cache hierarchy is
supported by a NVIDIA Titan Xp GPU, achieving an average
used to efficiently retrieve data in a CPU for various
speedup of ≈3.6 and ≈4 compared with the GPU for
applications.[97–99] CPU-based DL accelerators aim to perform
Rodinia[106] and OpenACC benchmarks, respectively, at the cost
operations in the cache to minimize data movement, and thus
of an increase in area overhead by merely 3.5% and TDP by 3 W.
latency and power consumption. However, the limited number
Intel has recently announced DL acceleration using scalable
of cores (and thus cache memories) in a CPU limits the opera-
Xeon CPUs with a built-in AI accelerator referred to as Intel
tional parallelism, so that its performance is hardly comparable
advanced matrix extensions (Intel AMX).[107] The instruction
to the other accelerators when using deep and large DNNs.
set for Intel AMX is an extended version of the x86 ISA to opti-
Nevertheless, lightweight DNNs with high operational sparsity
mize DL training and inference tasks. To minimize data move-
can solely be handled by CPU-based accelerators without the cost
ment, maximize parallelism, and reduce needs for additional
of data movement between CPU and other accelerators.
hardware design, the Intel AMX architecture is based on 1) eight
Compute near last level cache (CNC) realizes cache-level DL
1KB 2D registers (tiles) for data storage and 2) a tile matrix mul-
acceleration by integrating an auxiliary MAC unit in the last level
tiplication unit attached to the tiles. Intel AMX supports
cache (LLC) of 512 KB in an eight-core 64b RISC-V CPU.[100]
BFLOAT16 and INT8 formats, offering a respective speed boost
Instead of loading data into the core for MAC OPs, performing
of 16 and 8 compared to the previous generation Xeon CPUs
the operations within the LLC can boost the performance and
without Intel AMX.
energy efficiency. The CNC MAC unit multiplies an 8 8
INT8 matrix by an 8-long INT8 vector and accumulates the prod-
uct vector in INT32. Custom instructions are added to the 3.4. NPU
RV64GC instruction set architecture (ISA) for the MAC OP in
the LLC. The processor (fabricated using the Intel 4 CMOS pro- NPU was coined by Esmaeilzadeh et al. to refer to an ASIC unit
cess) consumes 510 mW (at 0.85 V and 1.15 GHz) and 73 mW (at that supports parallel arithmetic operations for NNs.[43] The NPU
0.55 V and 350 MHz). The CNC achieved 46 and 27 perfor- proposed by Esmaeilzadeh et al. is equipped with eight PEs to
mance of the scalar ISA for dense and conv layers, respectively. accelerate neural programs and interface with a CPU. Each
Compute cache allows logical operators, e.g., AND, NOR, and PE comprises a multiply-add unit, accumulator registers, and sig-
XOR, in the cache without a large area overhead by designing an moid unit to compute multilayer perceptron (MLP). The NPU
additional decoder for activating multiple word lines in parallel includes three first-in first-out (FIFO) modules (configuration
and a single-ended sense amplifier for each bitline.[101] FIFO and input and output FIFOs) to interface with the CPU
Additionally, compound operations, e.g., compare, search, copy, in use. The configuration FIFO is used to send and retrieve

Adv. Intell. Syst. 2024, 2300762 2300762 (9 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

the MLP parameters, and the input and output FIFOs to send Eyeriss is an NPU based on systolic arrays for row-stationary
and retrieve the input and output data in the execute phase of dataflow, which was fabricated using a 65 nm CMOS process in a
CPU cores, respectively. The programmer chooses the program 16 mm2 die.[110,111] 2D kernel and feature map data in 16b fixed-
segments subject to acceleration in the NPU with regard to the point are distributed over the 12 14 PEs, which are stationary.
following conditions: execution frequency, approximability, and Each PE computes dot-product of a given row of the kernel and a
well-defined input and output. Subsequently, the program seg- feature map row of the same size and buffers the result in the
ments are converted into NPU instructions through a compiler scratchpad. Each kernel row and activation row is horizontally
and processed in the NPU. The performance of the CPU–NPU and diagonally distributed, respectively; partial sums are verti-
was identified using the MARSSx86 cycle-accurate x86-64 simu- cally accumulated (Figure 11a). Eyeriss attains 33.6 GOPS at
lator.[108] With the Intel’s Penryn microarchitecture, the NPU 200 MHz core clock and 1 V supply voltage.
achieves up to 2.3 speedup and 3.0 energy saving. Origami is an NPU based on weight-stationary systolic arrays
Frequently, NPUs are designed to minimize data movement (Figure 9b), designed using 65 nm CMOS process in a 3.09 mm2
from their off-chip memory (and thus power consumption and die.[109] This NPU comprises four processing channels, each of
latency) by employing systolic arrays of PEs,[78] which realize a which is given kh kw multipliers for parallel multiplications of 12b
sequence of MAC OPs based on the partial sums buffered in data and adder-tree for accumulation. The results in the process-
the registers, each of which is embedded in a PE (see ing channel are truncated to 12-b data. In each processing chan-
Figure 9). Systolic array-based NPUs are classified as weight- nel, cin kh kw kernel data are registered for weight reuse.
stationary (storing a weight in each PE), e.g., Origami[109] and Origami attains a maximum performance and power efficiency
TPU[44] and row-stationary (storing more data of an input activa- of 274 GOPS and 369 GOPS/W, respectively.
tion and weight in each PE), e.g., Eyeriss.[110] The latter is to Another NPU based on weight-stationary systolic arrays is
improve data reuse and to reduce energy consumption by mini- TPU.[44,45] TPUv1 consists of a matrix multiply unit, unified
mizing data movement from the off-chip memory but at the cost buffer for local activation, accumulators, control unit, and inter-
of a larger local memory in each PE. A schematic of a systolic face unit. The matrix multiply unit contains 256 256 MAC PEs
array for row-stationary and weight-stationary is illustrated in for INT8 data. 64KB weights are loaded in the matrix multiply
unit from a 8 GB off-chip DRAM. The 16b multiplication results
Figure 11a,b, respectively.
are accumulated in a 4 MB accumulator for 32b data. The large
on-chip memory of 28 MB in total supports the memory-hungry
weight-stationary scheme, yielding a peak performance of
92 TOPS at 700 MHz. TPUv1 is 15 30 faster than a K80
GPU and Haswell E5-2699 v3 CPU, and its power efficiency
is 30 80 higher. TPUv4 supports BFLOAT16 for multiplica-
tion and FP32 for accumulation.[45] Its peak performance attains
275 TOPS at a TDP of 170 W at 1 GHz.
The aforementioned stationary-based NPUs focus on
minimizing data movement, but require additional registers
(scratchpad memory for Eyeriss) for inter-PE data movement.
This is because the full operation is split into a number of unit
operations (MACs) and the results of unit operations need to be
buffered in each PE. An alternative strategy is to perform the full
operation using a multiply and adder-tree structure through
which the data flow without data buffer. An example is
DianNao developed to accelerate the computation of large
scale NNs with a tiling method to alleviate memory bandwidth
requirements.[112] The architecture comprises three stages of
neural functional units (NFUs), three split buffers, and a control
processor. In the NFU, input activations and weights (all in 16b
fixed-point) are multiplied in the first stage, and the products are
added through a adder-tree in the second stage. The final stage
computes the activation function for the results from the adder-
tree. The three buffers store input activations, weights, and out-
put activations. DianNao was designed using a 65 nm CMOS
process and simulated to identify its performance, yielding
452 GOPS at 0.98 GHz. Subsequently, Luo et al. introduced an
extended version of DianNao, referred to as DaDianNao.[113]
DaDianNao (designed using a 22 nm CMOS process) consists
of 16 computing tiles in total and can perform 16 operations
per tile (i.e., total 256 operations). Moreover, a four-bank embed-
Figure 11. Schematic of a systolic array for a) row stationary and b) weight ded DRAM (eDRAM) in each tile accommodates a number of
stationary. weights on chip.

Adv. Intell. Syst. 2024, 2300762 2300762 (10 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

3.5. CIM Units 3.5.2. Analog CIM Units

3.5.1. Digital CIM Analog CIM demonstrates its high operational efficiency by
reducing the area and power overheads for multipliers and add-
Function-In-Memory (FIM) DRAM is a digital CIM unit with ers, and eliminating on-chip data loading. However, they support
programmable computing units (PCUs) supporting FP16 data integer MAC OPs only, and operational reliability is unlikely
and HBM2 as an embedded memory.[47] FIMDRAM includes comparable to digital CIM. There exist a number of analog
a 16-wide SIMD engine in the memory banks to achieve CIM designs using various memories such as mainstream mem-
bank-level parallelism. The physical dimension of HBM2 was ories, e.g., SRAM[115–119] and DRAM,[120,121] and emerging non-
maintained by replacing half of the memory array by PCUs. volatile memories, e.g., resistive RAM (RRAM)[87–95] and
Multibank operations are supported using the FIM mode while magnetic RAM (MRAM).[96] We introduce a few examples in
the general memory operations using the normal mode. The each class of analog CIM as follows.
PCU comprises a register group, execution unit, and interface Dong et al. introduced an 8 T SRAM-based analog CIM macro
unit, and is controlled using the conventional memory com- fabricated using a 7 nm CMOS process.[118] A sufficient noise
mands (CMDs) from the host without modifying the conven- margin allows the 8 T SRAM array of 6464 in size to remain
tional memory controller. FIMDRAM was fabricated using a stable even when multiple words are activated for parallel MAC
20 nm DRAM process and achieves 1.2 TFLOPS in FP16 at
OPs. The macro computes multiply-average (MAV) operations
300 MHz.
for 4b activations and also 4b weights. Multibit inputs are real-
Another example of eDRAM-based digital CIM units is AiM
ized in a bit-serial manner by a 4b digital counter. Multibit
which is based on 4 Gb GDDR6.[80] In AiM, one processing unit
weights are realized by using multiple capacitors of power-of-
(PU) is dedicated to each of 16 banks and placed in the vicinity of
two relative capacitance (1:2:4:8) at the end of each bitline.
the bank. Notably, a set of new CMDs is introduced for bank acti-
MAV operations for 64 4b inputs and 16 4b weights are com-
vation, compute, and data movement unlike FIMDRAM. This
puted using a flash analog-to-digital converter (ADC). The macro
new CMD set allows swift switches between the memory mode
attains 455.1 GOPS and 321 TOPS/W for INT4 MAV operations
and DL operation mode without commands from the host. Each
at 1 V.
PU is equipped with 16 multipliers and a four-stage adder-tree
Fully analog CIM is prone to operational errors due to the lim-
for BFLOAT16 data. For parallel MAC OPs, the PU receives
ited signal-to-noise ratio (SNR) of ADCs. Mixed-signal CIM may
1) 256b weights from its bank and also 256b activations from
be a noise-robust alternative. In this regard, Su et al. introduced
the global buffer or 2) 256b weights from its bank and activations
of the same size from the paired bank. Additionally, AiM sup- 384 Kb 6 T SRAM-based CIM.[115] The proposed macro consists
ports various activation functions, e.g., sigmoid, hyperbolic tan- of SRAM subarrays, ADCs, and digital shifter and adder (DSaA).
gent, GELU, ReLU, and leaky ReLU, based on lookup tables In each subarray, 32 6 T-SRAMs are connected to a local bitline
(LUTs), each of which stores uniform inputs and their function pair (LBL/LBLB), and 16 sub-arrays are connected to a global bit-
values for each activation function. AiM attains 1 TFLOPS for line pair (GBL/GBLB). In a subarray, voltage-scaled 2b activa-
BFLOAT16 at 1 GHz. tions and 1b weights (stored in SRAM) are multiplied using
SRAM is frequently used as an embedded memory in digital LBL/LBLB, and the products are averaged using GBL/GBLB.
CIM units.[81–86] Mostly, weights are stored in the SRAM other The ADC converts the averaged product into a 5b digital value.
than a few cases in which activations are stored in the SRAM, These values from multiple ADCs are combined in DSaA to
e.g., Z-PIM.[81] Most of designs separate a processing domain obtain a 20b output. The macro achieves a peak performance
from a memory domain[82–85,114] other than a few cases[81,86] of 22.75 TOPS/W (peak) for INT8 data at 0.85 V supply voltage.
as follows. Chih et al. proposed a 6 T-SRAM-based CIM macro DynaPlasia is a system-level CIM unit with reconfigurable
design with integrated bit-wise multipliers (4 T NOR gate) in the 3T2C eDRAM of 9.6 Mb.[121] Each bitcell is reconfigurable such
SRAM domain.[86] For massive computing parallelism, the archi- that one of memory, in-memory computing, and ADC modes
tecture employs bit-serial multipliers (in the memory domain) can be chosen. Particularly, a bitcell capacitor serves as a unit
and parallel adder-trees (in the processing domain). This design capacitor of a successive-approximation ADC in the ADC mode,
supports the programmable precision of input activations (binary significantly reducing the area overhead for ADCs. DynaPlasia
to INT8) and weights (INT4/8/12/16). The memory domain was fabricated using a 28 nm CMOS technology in a
includes a 256 4 4b SRAM array in which a 4 T NOR gate 20.25 mm2 die. It attains 56 TOPS/W peak performance for
for bit-wise multiplication is dedicated to each bit cell, supporting INT4 activations, INT5 weights at 250 MHz, and 1 V supply
256 bit-wise multiplications of 256 activations and 256 weights in voltage.
parallel. The 256 products are summed through the add-tree. RRAM is resistance-based nonvolatile memory[122] which was
A single macro was fabricated using a 22 nm CMOS process considered as storage-class memory. Following Kirschhoff ’s cur-
in a 0.202 mm2 die, yielding a performance of 3.3 TOPS (for rent law, the 1T1R bitcells sharing the same bitline inherently
4b activation and 4b weight) at 0.72 V supply voltage. realize bit-wise multiplications and accumulation of the currents
Z-PIM also uses bit-wise multipliers integrated in the SRAM through the bitcells, equivalent to parallel MAC OPs. ISAAC is
domain.[81] Notably, the SRAM array stores input activations an analog dot-product machine that leverages this inherent prop-
instead of weights unlike the above explained CIM designs. erty of RRAM crossbar arrays.[92] ISAAC comprises many tiles,
The key feature of this CIM macro is a zero-skipping scheme each of which includes eDRAM to store input activations, output
that avoids MAC OPs for zero weights. registers, in situ multiply-accumulate (IMA) units, shift-and-add,

Adv. Intell. Syst. 2024, 2300762 2300762 (11 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

sigmoid, and max pooling units. Each IMA includes RRAM OPs. In this regard, GPUs are frequently used to accelerate
crossbar arrays, digital-to-analog converters (DACs) for input SNN computation, which can significantly accelerate SynOPs
activations, and ADCs for outputs. The authors proposed three given the high memory bandwidth and peak performance.
performance metrics: computational efficiency (CE in However, given their prohibitive power consumption, GPUs
GOPS mm2), power efficiency (PE in GOPS/W), and storage hardly utilize the inherent operational efficiency of SNNs. To
efficiency (SE in MB/mm2). These metrics were maximized by leverage the inherent operational efficiency due to operational
searching for the optimal size of each RRAM crossbar array, the sparsity and binary activation, event processors (based on ad hoc
numbers of crossbar arrays and ADCs in each IMA, and the event routing) need to be used to implement SNNs, which are
number of IMAs in each tile. The system-level simulation results referred to as neuromorphic processors. In this review, we
highlight improvements of throughput, power efficiency, and mainly address event-based neuromorphic processors, which
computational density by 14.8, 5.5, and 7.5, respectively, are the mainstream neuromorphic hardware. Note that hereafter
compared to DaDianNao. neuromorphic processors indicate SNN inference accelerators
TIMELY is an analog dot product machine based on RRAM without on-chip learning engine unless otherwise specified.
crossbar arrays with three key innovations: analog local buffers Additionally, we refer to event-based neuromorphic processors
(ALBs) to enhance data locality, time-domain interfaces (TDIs) to as neuromorphic processors unless otherwise specified.
reduce the number of data conversions, and only-once input read Section 4.1 introduces the generic architecture of neuromorphic
(O2 IR) to reuse input activations.[94] When transferring data from processors and several key working principles. Section 4.2 is ded-
an analog to a digital domain, the register in the digital domain icated to various event-routing architectures that are the key to
consumes considerable energy and time. In this regard, ALBs event processors. Section 4.3 overviews various neuromorphic
eliminate the necessity for the data conversion. Additionally, processors introduced to date, which are classified as 1) mix-
energy cost is reduced by replacing ADCs and DACs (used in signal and 2) digital neuromorphic processors. Section 4.4
the conventional crossbar array) by time-to-digital converters addresses nonevent-based neuromorphic processors and
(TDCs) and digital-to-time converters. O2 IR also reduces the compares them with event-based neuromorphic processors.
energy cost by reducing memory access frequency. TIMELY
was designed using a 65 nm CMOS process, working at 4.1. Generic Architecture of Neuromorphic Processors
40 MHz and 1.2 V supply voltage. TIMELY improves energy effi-
ciency by 18.2 and computational density by 20 compared to Unlike DNN accelerators based on the von Neumann architec-
ISAAC. ture, neuromorphic processors are standalone hardware in need
MRAM is also resistance-based nonvolatile memory based on of neither host CPU nor main memory. Generally, a neuromor-
current-controlled magnetic tunnel junctions (MTJs) that exhibit phic processor consists of multiple neuromorphic cores and
nonvolatile high- and low-resistance states depending on their network-on-chips (NoCs) that are responsible for communica-
spin configuration. Jung et al. introduced a spin-transfer-torque tion between cores as illustrated in Figure 12a. Each core consists
MRAM (STT-MRAM)-based CIM unit for binary NNs (binary of a neuron block, weight memory, event router, and event
activation and binary weight).[96] CIM in this unit is realized queue. The neuron block calculates the membrane potential
by using strings of MTJs instead of MTJ crossbar arrays to reduce values. The weight memory stores synaptic weights used for
the power consumption in the memory domain. Given the con- SynOPs. The event router sends input spikes (events) to the post-
siderable current in even high-resistance state MTJs, MTJ cross- synaptic (destination) neurons. The event queue buffers input
bar arrays consume high power during the current summation
spikes temporarily before the event router sends them to the des-
for parallel MAC OPs. To cope with this power issue, the pro-
tination neurons. Note that some designs merge the weight
posed design uses strings of MTJs like NAND flash, where
memory with the event router as for crossbar-based event-
the total resistance of each string is determined by the number
routing architectures.[49,123,124] Spiking neurons are distributed
of high-resistance state MTJs. Each bitcell is of 2T2M, i.e., two
over multiple cores that compute the membrane potentials (state
MTJs of complementary resistance states and two transistors
variables) of their spiking neurons in parallel. The events from a
with complementary inputs, to realize the XNOR logic for binary
given neuron are routed to its postsynaptic (fan-out) neurons
NNs. Thus, the resistance sum of a given string is equivalent to
through the NoCs. Despite the limited bandwidth of the
the dot product of binary weight and activation vectors. The
NoCs, this event routing through NoCs barely causes heavy
string resistance is subsequently converted to a digital value
traffic because of 1) the binary activation (’00 and ’10 , nonspike
using a TDC.
and spike, respectively) instead of real-value activation as for
DNNs and 2) the high sparsity of the feature maps.
The forward pass of an SNN for inference depends on local
4. Accelerators for SNN-Based DL
data only. That is, the membrane potential update in
SNNs are time-dependent models for DL with unique features Equation (9) (based on SynOPs in Equation (12)) uses several
distinct from DNNs as addressed in Section 2.3. Particularly, data like the potential time-constant τm , spiking threshold θ,
their binary activation and highly sparse feature maps allow and fan-in synaptic weights w ij , which are all local. Therefore,
accelerator architectures of extreme power efficiency, which when the fan-in synaptic weights wij are placed in the same core
are essentially distinguishable from DNN accelerators. as the postsynaptic neurons, each core depends on its own local
Nevertheless, SNNs for DL are of the same topology as memory without access to a main memory. This allows the mul-
DNNs, and SynOPs (major operations) are equivalent to MAC tiple cores to operate independently in parallel. Thus, event data

Adv. Intell. Syst. 2024, 2300762 2300762 (12 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

where SynOPs(1) denotes SynOPs for nonzero activation (’10 ),

which are actually performed. Therefore, we have
SynOPS
SynOPSeff ¼ (15)
1 f sp

where the sparsity f sp is given by

SynOPsð0Þ
f sp ¼ , (16)
ðSynOPsð0Þ þ SynOPsð1ÞÞ

which is equivalent to Equation (11). Equation (13) highlights

signiﬁcant boost in effective performance by skipping
SynOPs(0) for a given sparsity f sp . For instance, for
f sp ¼ 0.99, SynOPSeff ¼ 100 SynOPS, highlighting 100
improvement in performance by skipping SynOPS(0).

4.2. Event-Routing Architectures

Figure 12. a) Architecture of a multicore neuromorphic processor.
b) Schematic of an event-data packet for AER. 4.2.1. Neuron-Wise Event Routing

Although the event-routing architecture in a neuromorphic core

(event packets) are the only data that move between the cores
differs for different prototypes, the architecture commonly con-
through NoCs. This multicore architecture with local memory
cerns efficient event routing among neurons, which is referred to
only is of excellent scalability in that the processor capacity sim-
as neuron-wise event routing. A schematic of neuron-wise event
ply scales with the number of cores without imposing additional
routing is illustrated in Figure 13a, where the topology of a given
overheads on their common hardware unlike DNN accelerators
SNN is defined in an LUT configured neuron-wise. Neuron-wise
that impose additional overheads on their main memory when
event routing supports full flexibility of SNN topology configura-
deploying more cores. This becomes obvious for multichip neu-
tion. Crossbar-based event-routing architecture hardwires N neu-
romorphic systems which can accommodate large SNNs by
rons using an N wN SRAM array for w-b weights. This
deploying many neuromorphic chips. On the other hand, the
architecture allows all-to-all connections. A pair of one word line
capacity of distributed on-chip local memory limits the capability
and one column line (w bit lines) are dedicated to each of N neu-
of neuromorphic processors without main memory, e.g., sizes of
rons, and each word corresponds to a single synaptic weight.
SNNs implemented. Therefore, efficiency in memory usage is
Upon event generation from a neuron, the word line dedicated
one of the key metrics. to the neuron is pulled up, and all words (fan-out weights) are
Address event representation (AER) is a widely used protocol
read simultaneously. Subsequently, each of these fan-out weights
for event routing.[125] In AER, an event-data packet from a given
is accumulated in a dedicated PE for SynOPs. This architecture
presynaptic (source) neuron consists of the addresses of the
has been employed in several neuromorphic processors[49,50,124]
source and postsynaptic (destination) neurons as illustrated in
Figure 12b. This event-data packet is routed to its destination
neurons through NoCs using the addresses. Upon the arrival
of the event-data packet at the cores including the destination
neurons, their membrane potential values are updated in paral-
lel, i.e., SynOPs in Equation (12) are performed. That is, ad hoc
SynOPs are performed by routing events in ad hoc manner
instead of constructing and buffering feature (event) maps to per-
form synchronous SynOPs. These ad hoc SynOPs inherently
skip SynOPs for zero activation, i.e., nonspike cases, and thus
unnecessary computational latency and power consumption
are avoided. The performance metric of a given neuromorphic
processor is SynOPS (SynOPs/s), which corresponds to OPS
for DNN accelerators. We define an effective performance
(SynOPSeff ) that counts SynOPs for zero activation
(SynOPs(0)), which is equivalent to the performance of acceler-
ators without zero skipping like GPUs.

SynOPsð0Þ þ SynOPsð1Þ SynOPsð1Þ

¼ (14) Figure 13. Schematic of a) neuron-wise event-routing and b) layer-wise
SynOPSeff SynOPS event-routing.

Adv. Intell. Syst. 2024, 2300762 2300762 (13 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

because of its simplicity. However, the quadratic increase of Table 1. Comparison of event-routing memory usage.
memory usage (≈N 2 ) with the number of neurons N hinders
it from applying to large-scale neuromorphic processors. Network Memory usage
Hierarchical AER is a hierarchical tree-based event-routing – Crossbar [49]
K-tag[50,51] LaCERA[52]
architecture, which supports exponential expandability for a cNet a)
42.2 Mib 4.58 Mib 15.31 Kib
given number of event hops over the hierarchy.[126] The tree hier-
LeNetb) 22.0 Mib 3.77 Mib 16.54 Kib
archy is configured such that the leaves are dedicated to the neu-
rons that send and receive events while the nodes on each
hierarchical level relay the events from the neurons throughout
a)
ð3 32 32Þ 16C4@2 32C3 64C3@2 10C4 10.
the hierarchy. Given the exponential expandability of the hierar-
b)
ð3 32 32Þ 6C5 AP2 16C5 AP2 120C5 84 10.
chical tree, this event-routing architecture supports the minimal
number of event hops (which corresponds to the minimal 4.3. Overview on Neuromorphic Processors
latency) for event routing in large-scale SNNs.
The K-tag routing scheme[50] is a two-stage event-routing Mixed-signal neuromorphic processors are implemented using
method in that the first stage routes an event to the clusters both analog and digital circuits; frequently, neurons and synap-
including the destination neurons and the second stage broad- ses are realized using analog circuits, while network configura-
casts them to their destination neurons. The key advantage of tion and event routing are achieved using digital circuits. The
K-tag routing lies in its optimal use of event-routing memory advantages of analog building blocks (neurons and synapses)
by optimizing the number of fan-out connections for the second include their ability to mimic biological dynamics at low power
stage. The K-tag scheme is employed in DYNAPs[50] and Loihi.[51] consumption. However, given that analog circuits are sensitive to
The pointer-based event-routing scheme proposed by noise, mismatch, and power, voltage, and temperature variations,
Kornijcuk et al.[127] is an LUT-based event-routing method that the scalability of analog neurons and synapses is somewhat lim-
aims to reduce latency for event routing and inverse lookup ited. As such, digital neuromorphic processors are fully imple-
for spike timing-dependent plasticity -based on-chip learning. mented using digital circuits. Neurons and synapses are
This method uses three LUTs (PTR_LUT, FOUT_LUT, and implemented using digital logic cells. The advantages of digital
FIN_LUT). FOUT_LUT is sorted according to source neuron circuits include excellent scalability, reliability, reconfigurability,
address such that the destination neuron addresses for a given and operation speed, allowing digital neuromorphic processors
source neuron are adjacently allocated in FOUT_LUT. FIN_LUT to be a solution to large-scale SNN-based DL accelerators.
is sorted according to destination neuron address such that the Section 4.3.1 and 4.3.2 introduce several neuromorphic proces-
source neuron addresses for a given destination neuron are sor designed using mixed-signal and digital circuits, respectively.
grouped in FIN_LUT. PTR_LUT stores the ranges of the
addresses in FOUT_LUT and FIN_LUT for a given neuron,
4.3.1. Mixed-Signal Neuromorphic Processors
so that, upon an event from a neuron, the addresses of its desti-
nation neurons in FOUT_LUT and its source neurons in
NeuroGrid is a multichip system that works with million neu-
FIN_LUT can be found at one cycle without sequential search
rons and billions of synapses in real time.[128] This system con-
of FOUT_LUT and in FIN_LUT.
sists of 16 mixed-signal neuromorphic chips, each of which is
with a single neuromorphic core. Each chip (core) was fabricated
4.2.2. Layer-Wise Event Routing using a 180 nm CMOS process, within a 168 mm2 die. Each core
is of the shared synapse and dendrite architecture for its area
The neuron-wise event routing supports high flexibility in topol- efficiency and local connectivity. The 16 cores are connected
ogy configuration. Yet, SNNs for DL frequently consist of basic using the tree-based event-routing architecture for its higher
elementary layers (see Section 2.1) instead of arbitrary connec- throughput in multicasting packets than the mesh architecture.
tions, so that they barely need such high flexibility. When a layer Each core consists of a 256 256 neuron array, transmitter,
type (conv or dense) and hyperparameters are fixed for a given receiver, router, and two RAMs. The analog neuron array
layer, all neuron-to-neuron connections are determined, allowing realizes a soma, dendrite, synapse-population, and ion-channel-
us to avoid using neuron-wise configured LUTs that cause sig- population circuits. The analog neurons in the core operate in a
nificant memory usage. That is, layer-wise rather than neuron- fully parallel manner. The transmitter, receiver, and router are
wise event routing can remarkably boost the efficiency in on-chip used to route AER packets. The transmitter encodes the coordi-
memory usage. The layer-centric event-routing architecture nates of spiking neurons and sends them to the output port. The
(LaCERA)[52] realizes this layer-wise event routing by using a tiny receiver decodes the coordinates of the AER packets and delivers
LUT that defines the type of each layer in a given SNN them to the target neurons. The router multicast AER packets
(Figure 13b). Consequently, LaCERA reduces the memory usage using the information retrieved from the memory in the core.
for event routing by more than two orders of magnitude com- The RAMs store the target synapse locations and configuration
pared with the K-tag scheme as compared in Table 1. Further, parameters that are shared among all neurons in the core.
LaCERA supports ideal weight-reuse rate for conv layers, which In NeuroGrid, a million neurons and eight billion synapses
is barely realized in neuron-wise event-routing architectures, so real-time operate at 2.7 W power consumption.
that the efficiency in on-chip memory usage can be further ROLLS[123] is a single core neuromorphic processor fabricated
enhanced. using a 180 nm CMOS process within a 51.4 mm2 die.

Adv. Intell. Syst. 2024, 2300762 2300762 (14 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

It embodies an event-based online learning engine that allows and axon indices. The AER packets are routed to the destination
ad hoc update of synaptic weights, which is based on analog cir- cores through the NoCs. Subsequently, the synaptic fan-in states,
cuit design. The learning algorithm implemented is spike-driven e.g., fan-in weight and delay, are retrieved from the memory in
synaptic plasticity (SDSP).[129] The processor consists of an the destination core, and event routing is completed. A barrier
analog neuron circuit realizing 256 adaptive exponential IF synchronization mechanism is employed to let the whole cores
(AdExp-IF) neurons,[130] two 256 256 arrays of trainable synap- operate timestep-wise without a global clock. The authors dem-
ses, and synapse demultiplexer. The 256 AdExp-IF neurons work onstrate the performance of Loihi on various tasks.
in parallel. Each of the two synapse arrays realizes short- and Frenkel et al. introduced ODIN which is a digital single-core
long-term plasticity, respectively. In the synapse arrays, weights neuromorphic processor with an on-chip learning engine sup-
are update using an analog circuit, while the digital circuit con- porting the SDSP algorithm.[124] ODIN was fabricated using
trols the update protocol and manages handshaking signals with 28 nm CMOS process in a 0.086 mm2 die. This processor sup-
AER packets. The synapse demultiplexer is to allocate one of the ports 256 neurons conforming to the LIF model or custom
256 rows in the 256 256 array to each neuron. ROLLS success- phenomenological model to realize biologically plausible neural
fully emulate attractor networks of cortical neurons for image dynamics. The neuronal state variables are stored in a 4 KB
classification at a power consumption of 4 mW. SRAM memory. Similar to TrueNorth, an SRAM crossbar of
Moradi et al. introduced mixed-signal neuromorphic process- 256 256 in size was adopted for event routing and weight stor-
ors (DYNAPs), which support excellent reconfigurability of SNN age, but it supports weights of 4-b precision unlike TrueNorth.
topology.[50] DYNAP is a quad-core neuromorphic processor fab- The time-multiplexed learning engine for SDSP modifies the
ricated using a 180 nm CMOS process within a 43.79 mm2 die. synaptic weights subject to update. ODIN successfully repro-
Each core consists of 256 AdExp-IF neurons and 16 k synapses duced the dynamics of Izhikevich model[72] and demonstrated
that operate in parallel. The mixed-signal computing node in a its on-chip learning capability on MNIST.
core comprises an analog neuron and four synaptic dynamic cir- There exist several digital neuromorphic processors proto-
cuits. The excellent reconfigurability arises from the novel K-tag typed on field-programmable gate array (FPGA) boards. We
event-routing architecture (see Section 4.2) that first routes an introduce some of them as follows. Ye et al. prototyped a 32-core
event to the clusters including the destination neurons and neuromorphic processor in a Virtex-7 FPGA, which features its
subsequently broadcasts them to the destination neurons with novel layer-wise event-routing architecture (LaCERA introduced
optimal memory usage. The K-tag scheme is realized using in Section 4.2) for convolutional SNNs (conv-SNNs). This archi-
three-level routers. tecture supports significant reductions in 1) event-routing mem-
ory usage compared with conventional neuron-wise routing
methods, e.g., K-tag and crossbar architecture, and 2) weight
4.3.2. Digital Implementation
memory usage given the ideal weight reuse supported by this
architecture for conv-SNNs. Additionally, the hyperparameters
TrueNorth is a digital multicore neuromorphic processor fabri-
for each layer is fully reconfigurable in this architecture. This pro-
cated using a 28 nm CMOS process.[49] Globally asynchronous
cessor supports the LIF and IF models, and SRM. Each core
locally synchronous design was adopted to minimize the power
includes up to 2 k neurons whose state variables (membrane
consumption. Each of 4096 cores in total realizes maximal 256
potential for LIF and membrane potential and synaptic current
LIF neurons and 64 k programmable synapses with several digital
for SRM) are approximated using the template-scaling exponen-
modules, including neuron, memory, scheduler, router, and con-
tial function approximation[131,132] which allows high-precision
troller modules. The router, scheduler, and controller modules
approximations of exponential functions with minimal use of
operate asynchronously in a handshaking manner based on
LUT memory. The 32 cores are distributed conforming to 2D
AER packets. The neuron module is implemented in a synchro-
mesh architecture with eight NoCs.
nous design that receives the generated clock from the asynchro-
Yang et al. proposed a multicore neuromorphic processor of
nous controller module. The 256 410-b memory stores the
6 6 6 3D mesh architecture, which was prototyped in an
synaptic connections, neurons’ state variables, and parameters.
Altera Stratix III FPGA.[133] The keys to the 3D mesh architecture
The scheduler and controller manage the sequence of the core
are a novel 3D NoCs and the router in each core, which multi-
operation. Event routing in TrueNorth is configured such that
casts and receives AER packets through six ports (up, down,
1) intracore event routing uses the crossbar architecture explained
north, east, west, and south). This processor is to real-time realize
in Section 4.2, and 2) mesh architecture-based event routing for
large-scale SNNs of conductance-based neuron models with high
intercore event routing. TrueNorth consumes 63 mW power.
fidelity to their biological counterparts. They authors successfully
Loihi is another digital multicore (128-core) neuromorphic
reproduced the behavior of cortico-basal ganglia-thalamocortical
processor designed using a 14 nm CMOS process in a
networks real-time.
60 mm2 die.[51] Notably, Loihi is equipped with an on-chip learn-
ing engine that supports several event-based learning algorithms.
Each core realizes 1 k time-multiplexed LIF neurons and synap- 4.4. Nonevent-Based Neuromorphic Processors
ses whose number ranges from 114 k to 1M, depending on the
user-programmable weight precision. The core implements syn- We addressed event-based neuromorphic processors that allow
apse, dendrite, axon, and a learning engine modules using 2Mb ad hoc event routing when events are generated. This event-
distributed SRAMs in total. Upon event generation, the axon routing method fully leverages the generic property of SNNs,
module generates AER packets containing the destination core i.e., high sparsity of feature (event) maps. However, the lower

Adv. Intell. Syst. 2024, 2300762 2300762 (15 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

the sparsity, the larger the latency for event routing because of low-power accelerators are endowed with strictly limited capacity
the limited parallelism in event routing through NoCs of limited of DNNs under computation. In this regard, SNN-based DL
bandwidth. There exist several neuromorphic processors that using neuromorphic processors at extremely low power may
build full event maps which are used for subsequent SynOPs offer the largest model capacity at a given power constraint
unlike event-based neuromorphic processors, which we term among these on-device AI accelerators.
nonevent-based neuromorphic processors. These processors Yet, the neuromorphic processor technology is hardly as
support parallel SynOPs with multiple PEs operating in parallel matured as DNN-based DL accelerators, and its ecosystem has
as for DNN accelerators. Thus, these nonevent-based processors barely been solidified although there emerge several SNN
are advantageous over event-based processors for SNNs of low libraries, e.g., snnTorch,[135] SpikingJelly,[136] and Spyx.[137]
sparsity. Note that, in this case, SynOPs are identical to MAC Additionally, to bring the SNN-based DL to the edge, there exist
OPs, so that the same PEs can be used for both operations. several challenges to be overcome, mainly with regard to the
Tianjic[134] is an example which realizes a method to leverage training efficiency and scalability of SNNs. As such, SNNs are
highly sparse feature maps by skipping SynOPs for zero events time-dependent models like RNNs. When training, the model
(i.e., no-spike cases). In the processor, the multiple cores com- is commonly unrolled over time, and its parameters are opti-
municate using AER packets through the NoCs. However, the mized using backpropagation through time (BPTT).[138] Thus,
AER packets are buffered into a memory in the core to construct the space complexity scales with the number of timesteps in
the event vectors (corresponding to binary feature maps) rather use, so that the length of each training sample is strictly limited
than ad hoc routing of the AER packets. An input event vector to a by the memory capacity of a given training platform. To cope
given core undergoes parallel SynOPs using multiple PEs with with this issue, the online training through time (OTTT) algo-
the weights in the same core. SynOPs for zero activations are rithm has recently been proposed, which learns weights using
skipped using a zero-filtering mask before the PEs. the data of temporal locality unlike BPTT.[139] Yet, the scalability
of OTTT to deeper and larger SNNs remains to be identified.
Another challenge is the scalability of SNNs to a similar degree
5. Concluding Remark and Outlook to DNNs. The difficulty lies in a number of hyperparameters
included in SNNs, e.g., various time-constants and spiking
Given that the progress in DL is regarded to continue onward, thresholds, which should be optimized. Nonetheless, these chal-
the computational complexity keeps growing. For the moment, lenges unnecessarily assure the inherent disadvantages of SNNs
GPUs support this DL progress as mainstream DL accelerators. compared with DNNs given that they may be overcome in the
This is not only because of excellent performance and memory near future. Certainly, the fascinating advantage (ultralow power
bandwidth, flexibility, and versatility of GPUs but also because of consumption) of neuromorphic processors for on-device AI
the solidly built ecosystem of DL research based on CUDA-based likely fuels the activities to overcome the challenges.
DL libraries, e.g., PyTorch and TensorFlow. To bring the alter-
native accelerators (NPUs and CIM units) into play in the market,
they should outperform GPUs on the hardware level insomuch Acknowledgements
as the users desire the change of the current solid ecosystem.
C.S. and C.Y. contributed equally to this work. This research was supported
However, for the moment, NPUs and CIM units are endowed
by the National R&D Program through the National Research Foundation
with obvious disadvantages with regard to their versatility such of Korea (NRF) funded by the Ministry of Science and ICT (grant no. NRF-
that NPUs and CIM units are advantageous only for compute- 2021M3F3A2A01037632). This work was supported by the Institute of
and memory-bound models, respectively. Notably, the computa- Information & communications Technology Planning & Evaluation
tional bottleneck mostly arises from the limited memory (IITP) grant funded by the Korea government (MSIT) (grant no.
bandwidth, which is unavoidable for the conventional computer RS-2023-00229689). This work was partly supported by the IITP under
architecture. In this regard, further developments of high- the artificial intelligence semiconductor support program to nurture the
best talents (IITP-(2004)-RS-2023-00253914) grant funded by the Korea
bandwidth memory are the key premise of boosting the government (MSIT).
performance of DL accelerators.
Given that the DL accelerators addressed have clear pros and
cons, it may be feasible to build a system equipped with various Conflict of Interest
accelerators to harness their pros only for various DL tasks. An
example is the heterogeneous computing using a system com- The authors declare no conflict of interest.
bining CPUs, GPUs, field-programmable gate arrays, and so
forth, proposed by Intel recently. To support this computing
environment, there remain several challenges including 1) mini- Keywords
mization of data movements among different accelerators and
2) optimal task assignment to different accelerators of different compute-in-memory, deep learning, deep learning accelerators, graphics
run-times to maximize device utilization rate. processing units, neural processing units, neuromorphic processors
DL acceleration at the edge is strictly constrained by power Received: November 13, 2023
consumption, ruling out GPUs as on-device AI accelerators. Revised: February 6, 2024
For the moment, on-device AI is an emerging market that is pre- Published online:
dicted to grow. Inference-only NPUs and CIM units at low power
may be brought into play in the on-device AI market, but these

Adv. Intell. Syst. 2024, 2300762 2300762 (16 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

[1] A. L. Samuel, IBM J. Res. Dev. 1959, 3, 210. [28] A. Krizhevsky, I. Sutskever, G. E. Hinton, Adv. Neural Inf. Process. Syst.
[2] M.-T. Luong, H. Pham, C. D. Manning (Preprint), arXiv:1508.04025, 2012, 25, 1.
v1, August Submitted: 2015. [29] K. Simonyan, A. Zisserman (Preprint), arXiv:1409.1556, v1,
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Adv. Neural Submitted: September 2014.
Inf. Process. Syst. 2013, 26, 1. [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
[4] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, S. Khudanpur, in V. Vanhoucke, A. Rabinovich, in Proc. of the IEEE Conf. on Computer
Interspeech, Vol. 2, International Speech Communication Vision and Pattern Recognition (CVPR), IEEE, Piscataway, NJ 2015.
Association (ISCA), Makuhari 2010 pp. 1045–1048. [31] T. P. Morgan, Nvidia Rounds Out “Ampere” Lineup with Two New
[5] J. Hirschberg, C. D. Manning, Science 2015, 349, 261. Accelerators 2021, https://www.nextplatform.com/2021/04/15/
[6] D. Yu, L. Deng, Automatic Speech Recognition, Vol. 1, Springer, New nvidia-rounds-out-ampere-lineup-with-two-new-accelerators/ (accessd:
York 2016. April, 2021).
[7] Y. Zhang, W. Chan, N. Jaitly, in 2017 IEEE Int. Conf. on Acoustics, [32] R. Krashinsky, O. Giroux, S. Jones, N. Stam, S. Ramaswamy, Nvidia
Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ 2017, Ampere Architecture In-Depth 2020, https://developer.nvidia.com/
pp. 4845–4849. blog/nvidia-ampere-architecture-in-depth/ (accessed: May 2020).
[8] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, [33] P. Alcorn, Nvidia Infuses dgx-1 with Volta, Eight v100s in a Single
C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., in Int. Chassis 2017, https://www.tomshardware.com/news/nvidia-volta-
Conf. on Machine Learning, JMLR, New York, NY 2016, pp. 173–182. v100-dgx-1-hgx-1,34380.html (accessed: May 2017).
[9] A. Vedaldi, B. Fulkerson, in Proc. of the 18th ACM Int. Conf. on [34] I. Cutress, Nvidia’s DGX-2: Sixteen Tesla v100s, 30tb of NVME,
Multimedia, ACM, New York, NY 2010, pp. 1469–1472. Only $400k, 2018, https://www.anandtech.com/show/12587/
[10] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, nvidias-dgx2-sixteen-v100-gpus-30-tb-of-nvme-only-400k (accessed:
S. S. Kruthiventi, R. V. Babu, Front. Robot. AI 2016, 2, 36. March 2018).
[11] A. Krizhevsky, I. Sutskever, G. E. Hinton, Commun. ACM 2017, 60, 84. [35] C. Campa, C. Kawalek, H. Vo, J. Bessoudo, Defining AI Innovation
[12] A. Kendall, Y. Gal, Adv. Neural Inf. Process. Syst. 2017, 30, 1. with Nvidia DGX A100, 2020, https://developer.nvidia.com/blog/
[13] D. Bahdanau, K. Cho, Y. Bengio (Preprint), arXiv:1409.0473, v1, defining-ai-innovation-with-dgx-a100 (accessed: May 2020).
Submitted: September 2014. [36] R. Smith, Nvidia Hopper GPU Architecture and H100 Accelerator
[14] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, W.-Y. Ma, Adv. Neural Announced: Working Smarter and Harder, 2022, https://www.
Inf. Process. Syst. 2016, 29, 1. anandtech.com/show/17327/nvidia-hopper-gpu-architecture-and-
[15] Y. Liu, C. Niu, Z. Wang, Y. Gan, Y. Zhu, S. Sun, T. Shen, J. Mater. Sci. h100-accelerator-announced (accessed: March 2022).
Technol. 2020, 57 113. [37] R. Smith, Nvidia Gives Jetson AGX Xaview A Trim, Announces Nano-
[16] Q. Yang, S. Fu, H. Wang, H. Fang, IEEE Network 2021, 35, 96. Sized Jetson Xavier Nx 2019, https://www.anandtech.com/show/
[17] M. F. Dixon, I. Halperin, P. Bilokon, Machine Learning in Finance, Vol. 15070/nvidia-gives-jetson-xavier-a-trim-announces-nanosized-jetson-
1170, Springer, New York 2020. xavier-nx (accessed: November 2019).
[18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den [38] B. Funk, Nvidia Jetson AGX Orin: The Next-Gen Platform that will Power
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, Our AI Robot Overloads Unveiled, 2022, https://hothardware.com/
M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, news/nvidia-jetson-agx-orin (accessed: March 2022).
I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, [39] D. Franklin, Nvidia Jetson TX2 Delivers Twice the Intelligence to the
D. Hassabis, Nature 2016, 529, 484. Edge, 2017, https://developer.nvidia.com/blog/jetson-tx2-delivers-
[19] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, twice-intelligence-edge/ (accessed: March 2017).
J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, [40] B. Hill, Nvidia Unveils Ampere-Infused Drive AGX for Autonomous
D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, Cars, Isaac Robotics Platform with BMW Partnership, 2022, https://
J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, hothardware.com/news/nvidia-drive-agx-pegasus-orin-ampere-next-
T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, gen-autonomous-cars (accessed: May 2020).
C. Gulcehre, Z. Wang, T. Pfaff, et al., Nature 2019, 575, 350. [41] R. Smith, 16gb Nvidia Tesla v100 Gets Reprieve; Remains in
[20] W. S. McCulloch, W. Pitts, Bull. Math. Biophys. 1943, 5 115. Production 2018, https://www.anandtech.com/show/12809/16gb-
[21] V. Nair, G. E. Hinton, in Proc. of the 27th Int. Conf. on Int. Conf. on nvidia-tesla-v100-gets-reprieve-remains-in-production (accessed:
Machine Learning, ICML’10, Omnipress, Madison, WI 2010, May 2018).
pp. 807–814. [42] N. C. Thompson, K. Greenewald, K. Lee, G. F. Manso (Preprint),
[22] A. L. Maas, A. Y. Hannun, A. Y. Ng, in ICML Workshop on Deep arXiv:2007.05558, v1, Submitted: July 2020.
Learning for Audio, Speech and Language Processing, Atlanta, [43] H. Esmaeilzadeh, A. Sampson, L. Ceze, D. Burger, in 2012 45th
June 2013. Annual IEEE/ACM Int. Symp. on Microarchitecture, IEEE,
[23] D. Hendrycks, K. Gimpel, Gaussian Error Linear Units (Gelus), Piscataway, NJ 2012, pp. 449–460.
Arxiv 2016. [44] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
[24] P. Ramachandran, B. Zoph, Q. V. Le, Searching for Activation S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin,
Functions, Arxiv 2017. C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
[25] D. S. Jeong, J. Appl. Phys. 2018, 124, 152002. T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann,
[26] OpenAI (Preprint), arXiv:2303.08774, v1, Submitted: March 2023. C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey,
[27] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, et al., in Proc. of the 44th Annual Int. Symp. on Computer
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, Architecture, 2017, pp. 1–12.
L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, [45] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin,
J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil,
N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, S. Prasad, C. Young, Z. Zhou, D. Patterson, in 2021 ACM/IEEE
V. Kerkez, M. Khabsa, et al., Llama 2: Open Foundation and Fine- 48th Annual Int. Symp. on Computer Architecture (ISCA), IEEE,
Tuned Chat Models, Arxiv 2023. Piscataway, NJ 2021, pp. 1–14.

Adv. Intell. Syst. 2024, 2300762 2300762 (17 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

[46] K. J. Lee, in Hardware Accelerator Systems for Artificial Intelligence and [68] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova (Preprint),
Machine Learning (Eds:S. Kim, G. C. Deka)), Vol. 122, Advances in arXiv:1810.04805, v1, Submitted: October 2018.
Computers, Elsevier, Amsterdam 2021, pp. 217–245. [69] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
[47] Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-
O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, et al., in 2021 IEEE Int. Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu,
Solid-State Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess,
2021 pp. 350–352. J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, et al.,
[48] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, Adv. Neural Inf. Process. Syst. 2020, 33 1877.
K. Kang, J. Kim, et al., in 2022 IEEE Int. Solid-State Circuits Conf. [70] P. Dayan, L. F. Abbott, Theoretical Neuroscience, MIT Press, London
(ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3. 2001.
[49] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, [71] W. Gerstner, W. M. Kistler, Spiking Neuron Models: Single Neurons,
F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, Populations, Plasticity, Cambridge University Press, Cambridge,
I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, England 2002.
W. P. Risk, R. Manohar, D. S. Modha, Science 2014, 345, 668. [72] E. M. Izhikevich, IEEE Trans. Neural Netw. 2003, 14, 1569.
[50] S. Moradi, N. Qiao, F. Stefanini, G. Indiveri, IEEE Trans. Biomed. [73] S. Williams, A. Waterman, D. Patterson, Commun. ACM 2009, 52, 65.
Circuits Syst. 2017, 12, 106. [74] P. Dhilleswararao, S. Boppu, M. S. Manikandan, L. R. Cenkeramaddi,
[51] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, IEEE Access 2022.
G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C.-K. Lin, A. Lines, [75] JEDEC Standards, https://www.jedec.org/standards-documents
R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, (accessed: August 2022).
Y.-H. Weng, A. Wild, Y. Yang, H. Wang, IEEE Micro 2018, 38, 82. [76] Nvidia a100 Tensor Core GPU, 2022, https://resources.nvidia.com/
[52] C. Ye, V. Kornijcuk, D. Yoo, J. Kim, D. S. Jeong, Neurocomputing 2023, en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
520, 46. [77] Nvidia h100 Tensor Core GPU, 2023, https://resources.nvidia.com/
[53] G. Kim, D. S. Jeong, Adv. Neural Inf. Process. Syst. 2021, 34, 28274. en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, [78] H. T. Kung, Computer 1982, 15, 37.
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, [79] S. Lee, S.-H. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee,
A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, K. Lim, H. Shin, et al., in 2021 ACM/IEEE 48th Annual Int. Symp.
B. Steiner, L. Fang, J. Bai, S. Chintala, Adv. Neural Inf. Process. on Computer Architecture (ISCA), IEEE, Piscataway, NJ 2021,
Syst. 2019, 32, 1. pp. 43–56.
[55] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, [80] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park,
S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, K. Kang, J. Kim, et al., in 2022 IEEE Int. Solid-State Circuits Conf.
S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, (ISSCC), Vol. 65, IEEE, Piscataway ,NJ 2022, pp. 1–3.
P. Warden, M. Wicke, Y. Yu, X. Zheng, Tensorflow: A System for [81] J.-H. Kim, J. Lee, J. Lee, J. Heo, J.-Y. Kim, IEEE J. Solid-State Circuits
Large-Scale Machine Learning, Arxiv 2016. 2021, 56, 1093.
[56] F. C. Bauer, G. Lenz, S. Haghighatshoar, S. Sheik, Front. Neurosci. [82] H. Fujiwara, H. Mori, W.-C. Zhao, M.-C. Chuang, R. Naous,
2023, 17, 1. C.-K. Chuang, T. Hashizume, D. Sun, C.-F. Lee, K. Akarvardar,
[57] K. He, X. Zhang, S. Ren, J. Sun, in The IEEE Conf. on Computer Vision S. Adham, T.-L. Chou, M. E. Sinangil, Y. Wang, Y.-D. Chih,
and Pattern Recognition, IEEE, Piscataway, NJ 2016. Y.-H. Chen, H.-J. Liao, T.-Y. J. Chang, in 2022 IEEE Int. Solid-State
[58] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely Circuits Conf. (ISSCC), Vol. 65, IEEE, Piscataway, NJ 2022, pp. 1–3.
Connected Convolutional Networks, Arxiv 2018. [83] C.-F. Lee, C.-H. Lu, C.-E. Lee, H. Mori, H. Fujiwara, Y.-C. Shih,
[59] S. Ioffe, C. Szegedy, in Int. Conf. on Machine Learning, PMLR, T.-L. Chou, Y.-D. Chih, T.-Y. J. Chang, in 2022 IEEE Symp. on VLSI
New York, NY 2015, pp. 448–456. Technology and Circuits (VLSI Technology and Circuits), IEEE,
[60] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, Piscataway, NJ 2022, pp. 24–25.
T. Weyand, M. Andreetto, H. Adam, Mobilenets: Efficient Convolutional [84] F. Tu, Y. Wang, Z. Wu, L. Liang, Y. Ding, B. Kim, L. Liu, S. Wei, Y. Xie,
Neural Networks for Mobile Vision Applications, Arxiv 2017. S. Yin, in 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65,
[61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, in Proc. of IEEE, Piscataway, NJ 2022, pp. 1–3.
the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), [85] S. Liu, P. Li, J. Zhang, Y. Wang, H. Zhu, W. Jiang, S. Tang, C. Chen,
IEEE, Piscataway, NJ 2018. Q. Liu, M. Liu, in 2023 IEEE Int. Solid-State Circuits Conf. (ISSCC),
[62] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An Extremely Efficient IEEE, Piscataway, NJ 2023 pp. 250–252.
Convolutional Neural Network for Mobile Devices, IEEE (CVPR), [86] Y.-D. Chih, P.-H. Lee, H. Fujiwara, Y.-C. Shih, C.-F. Lee, R. Naous,
New York 2017. Y.-L. Chen, C.-P. Lo, C.-H. Lu, H. Mori, et al., in 2021 IEEE Int.
[63] M. Tan, Q. V. Le, Efficientnet: Rethinking Model Scaling for Solid-State Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ
Convolutional Neural Networks, JMLR, New York, NY 2020. 2021, pp. 252–254.
[64] M. Schuster, K. Paliwal, IEEE Trans. Signal Process. 1997, 45, [87] Z. Li, Z. Wang, L. Xu, Q. Dong, B. Liu, C.-I. Su, W.-T. Chu, G. Tsou,
2673. Y.-D. Chih, T.-Y. J. Chang, et al., IEEE J. Solid-State Circuits 2020, 56, 1105.
[65] S. Hochreiter, J. Schmidhuber, Neural Comput. 1997, 9, 1735. [88] A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee,
[66] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, J. P. Strachan, N. Muralimanohar, IEEE Micro 2018, 38, 41.
H. Schwenk, Y. Bengio, Learning Phrase Representations Using RNN [89] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang,
Encoder-Decoder for Statistical Machine Translation, Arxiv 2014. H. Qian, Nature 2020, 577, 641.
[67] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, [90] S. Yin, X. Sun, S. Yu, J. S. Seo, IEEE Trans. Electron Devices 2020, 67,
A. N. Gomez, L. U. Kaiser, I. Polosukhin, in Advances in Neural 4185.
Information Processing Systems (Eds: I. Guyon, U. V. Luxburg, [91] J.-H. Yoon, M. Chang, W.-S. Khwa, Y.-D. Chih, M.-F. Chang,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett) A. Raychowdhury, in 2021 IEEE Int. Solid- State Circuits Conf.
Vol. 30. Curran Associates, Inc., Red Hook, NY 2017. (ISSCC), Vol. 64, IEEE, Piscataway, NJ 2021 pp. 404–406.

Adv. Intell. Syst. 2024, 2300762 2300762 (18 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

[92] A. Shaﬁee, A. Nag, N. Muralimanohar, R. Balasubramonian, Circuits Conf. (ISSCC), Vol. 64, IEEE, Piscataway, NJ 2021,
J. P. Strachan, M. Hu, R. S. Williams, V. Srikumar, ACM pp. 250–252.
SIGARCH Comput. Archit. News 2016, 44, 14. [116] K. Ueyoshi, I. A. Papistas, P. Houshmand, G. M. Sarda, V. Jain,
[93] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, ACM M. Shi, Q. Zheng, S. Giraldo, P. Vrancx, J. Doevenspeck, et al., in
SIGARCH Comput. Archit. News 2016, 44, 27. 2022 IEEE Int. Solid-State Circuits Conf. (ISSCC), Vol. 65, IEEE,
[94] W. Li, P. Xu, Y. Zhao, H. Li, Y. Xie, Y. Lin, in 2020 ACM/IEEE 47th Piscataway, NJ 2022, pp. 1–3.
Annual Int. Symp. on Computer Architecture (ISCA), IEEE, Piscataway, [117] I. A. Papistas, S. Cosemans, B. Rooseleer, J. Doevenspeck,
NJ 2020, pp. 832–845. M.-H. Na, A. Mallik, P. Debacker, D. Verkest, in 2021 IEEE
[95] C. Song, J. Kim, D. S. Jeong, Adv. Intell. Syst. 2023, 5, 2200289. Custom Integrated Circuits Conf. (CICC), IEEE, Piscataway, NJ
[96] S. Jung, H. Lee, S. Myung, H. Kim, S. K. Yoon, S.-W. Kwon, Y. Ju, 2021, pp. 1–2.
M. Kim, W. Yi, S. Han, et al., Nature 2022, 601, 211. [118] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W.-S. Khwa, H.-J. Liao,
[97] J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, in Proc. of the 42nd Y. Wang, J. Chang, in 2020 IEEE Int. Solid-State Circuits Conf.-
Annual Int. Symp. on Computer Architecture, ACM, New York, NY (ISSCC), IEEE, Piscataway, NJ 2020, pp. 242–244.
2015, pp. 105–117. [119] B. Wang, C. Xue, Z. Feng, Z. Zhang, H. Liu, L. Ren, X. Li, A. Yin,
[98] F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, F. Franchetti, in T. Xiong, Y. Xue, et al., in 2023 IEEE Int. Solid-State Circuits Conf.
Proc. of the 52nd Annual IEEE/ACM Int. Symp. on Microarchitecture, (ISSCC), IEEE, Piscataway, NJ 2023, pp. 134–136.
IEEE, Piscataway, NJ 2019, pp. 347–358. [120] S. Xie, C. Ni, A. Sayal, P. Jain, F. Hamzaoglu, J. P. Kulkarni,
[99] M. Zhu, T. Zhang, Z. Gu, Y. Xie, in Proc. of the 52nd Annual IEEE/ in 2021 IEEE Int. Solid- State Circuits Conf. (ISSCC), Vol. 64,
ACM Int. Symp. on Microarchitecture, IEEE, Piscataway, NJ 2019, 2021, pp. 248–250.
pp. 359–371. [121] S. Kim, Z. Li, S. Um, W. Jo, S. Ha, J. Lee, S. Kim, D. Han, H.-J. Yoo, in
[100] G. K. Chen, P. C. Knag, C. Tokunaga, R. K. Krishnamurthy, IEEE J. 2023 IEEE Int. Solid-State Circuits Conf. (ISSCC), IEEE, Piscataway,
Solid-State Circuits 2022, 58, 1117. NJ 2023, pp. 256–258.
[101] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, [122] D. S. Jeong, R. Thomas, R. S. Katiyar, J. F. Scott, H. Kohlstedt,
R. Das, in 2017 IEEE International Symposium on High A. Petraru, C. S. Hwang, Rep. Prog. Phys. 2012, 75, 076502.
Performance Computer Architecture (HPCA), IEEE, Piscataway, NJ [123] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, F. Stefanini,
2017, pp. 481–492. D. Sumislawska, G. Indiveri, Front. Neurosci. 2015, 9 141.
[102] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, [124] C. Frenkel, M. Lefebvre, J.-D. Legat, D. Bol, IEEE Trans. Biomed.
D. Blaaauw, R. Das, in 2018 ACM/IEEE 45Th Annual International Circuits Syst. 2018, 13, 145.
Symposium on Computer Architecture (ISCA), IEEE, Piscataway, NJ [125] K. A. Boahen, IEEE Trans. Circuits Syst. II 2000, 47, 416.
2018, pp. 383–396. [126] J. Park, T. Yu, S. Joshi, C. Maier, G. Cauwenberghs, IEEE Trans.
[103] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, in Proc. of the Neural Netw. Learn. Syst. 2017, 28, 2408.
IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, [127] V. Kornijcuk, J. Park, G. Kim, D. Kim, I. Kim, J. Kim, J. Y. Kwak,
Piscataway, NJ 2016, pp. 2818–2826. D. S. Jeong, Adv. Mater. Technol. 2019, 4, 1800345.
[104] D. Fujiki, S. Mahlke, R. Das, in Proc. of the 46th Int. Symp. on [128] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary,
Computer Architecture, 2019, pp. 397–410. A. R. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur,
[105] J. E. Volder, IRE Trans. Electron. Comput. 1959, EC-8, 330. P. A. Merolla, K. Boahen, Proc. IEEE 2014, 102, 699.
[106] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, [129] J. M. Brader, W. Senn, S. Fusi, Neural Comput. 2007, 19, 2881.
K. Skadron, in 2009 IEEE Int. Symp. on Workload Characterization [130] R. Brette, W. Gerstner, J. Neurophysiol. 2005, 94, 3637.
(IISWC), IEEE, Piscataway, NJ 2009, pp. 44–54. [131] J. Kim, V. Kornijcuk, D. S. Jeong, in 2020 21st Int. Symp. on
[107] Intel AMX, https://www.intel.com/content/www/us/en/content- Quality Electronic Design (ISQED), IEEE, New York, Ny 2020,
details/785250/accelerate-artiﬁcial-intelligence-ai-workloads-with- pp. 358–363.
intel-advanced-matrix-extensions-intel-amx.html (accessed: January [132] J. Kim, V. Kornijcuk, C. Ye, D. S. Jeong, IEEE Trans. Circuits Syst. I
2024). 2021, 68, 350.
[108] A. Patel, F. Afram, S. Chen, K. Ghose, in Proc. of the 48th Design [133] S. Yang, J. Wang, B. Deng, C. Liu, H. Li, C. Fietkiewicz, K. A. Loparo,
Automation Conf., ACM, New York, NY 2011, pp. 1050–1055. IEEE Trans. Cybern. 2018, 49, 2490.
[109] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, L. Benini, [134] L. Deng, G. Wang, G. Li, S. Li, L. Liang, M. Zhu, Y. Wu, Z. Yang,
in Proc. of the 25th Edition on Great Lakes Symp. on VLSI, ACM, Z. Zou, J. Pei, Z. Wu, X. Hu, Y. Ding, W. He, Y. Xie, L. Shi, IEEE
New York, NY 2015, pp. 199–204. J. Solid-State Circuits 2020, 55, 2228.
[110] Y.-H. Chen, J. Emer, V. Sze, ACM SIGARCH Comput. Archit. News [135] J. K. Eshraghian, M. Ward, E. O. Neftci, X. Wang, G. Lenz,
2016, 44, 367. G. Dwivedi, M. Bennamoun, D. S. Jeong, W. D. Lu, Proc. IEEE
[111] Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, IEEE J. Solid-State Circuits 2023, 111.
2016, 52, 127. [136] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang,
[112] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, ACM H. Zhou, G. Li, Y. Tian, Sci. Adv. 2023, 9, eadi1480.
SIGARCH Comput. Archit. News 2014, 42, 269. [137] K. Heckel, kmheckel/spyx: v0.1.0-beta 2023, https://doi.org/10.
[113] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, 5281/zenodo.8241588.
Y. Chen, IEEE Trans. Comput. 2016, 66, 73. [138] P. Werbos, Proc. IEEE 1990, 78, 1550.
[114] J. Yue, C. He, Z. Wang, Z. Cong, Y. He, M. Zhou, W. Sun, X. Li, [139] M. Xiao, Q. Meng, Z. Zhang, D. He, Z. Lin, in Advances
C. Dou, F. Zhang, et al., in 2023 IEEE Int. Solid-State Circuits in Neural Information Processing Systems (Eds:S. Koyejo,
Conf. (ISSCC), IEEE, Piscataway, NJ 2023, pp. 1–3. S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh),
[115] J.-W. Su, Y.-C. Chou, R. Liu, T.-W. Liu, P.-J. Lu, P.-C. Wu, Y.-L. Chung, Vol. 35, Curran Associates, Inc., Red Hook NJ 2022,
L.-Y. Hung, J.-S. Ren, T. Pan, et al., in 2021 IEEE Int. Solid-State pp. 20717–20730.

Adv. Intell. Syst. 2024, 2300762 2300762 (19 of 20) © 2024 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

Choongseok Song received his B.S. degree in electronics and information engineering from Sejong
University, Seoul, South Korea, in 2020. He is currently pursuing his Ph.D. degree in materials science
and engineering from Hanyang University, Seoul, South Korea. His research interests include computer
architecture for deep learning acceleration based on neural processing unit and processing in memory.

ChangMin Ye received his B.S. degree in materials science and engineering from Hanyang University,
Seoul, South Korea, in 2020, where he is currently pursuing his Ph.D. degree in materials science and
engineering. Since 2020, he has been focusing on learning digital neuromorphic processor design.

Yonguk Sim received his B.S. degree in electronic engineering from Hanyang University, Seoul, South
Korea, in 2023, where he is currently pursuing the integrated Ph.D. degree in semiconductor engineering.
Since 2023, he has been focusing on NVM-based deep learning accelerator design.

Doo Seok Jeong is a professor at Hanyang University, Republic of Korea. He received his B.E. and M.E.
in materials science from Seoul National University in 2002 and 2005, respectively. He received his Ph.D.
degree in materials science from RWTH Aachen, Germany in 2008. He was with the Korea Institute of
Science and Technology from 2008 to 2018. His research interest includes spiking neural networks for
sequence learning and future prediction. Learning algorithms, spiking neural network design, and digital
neuromorphic processor design are his current research focus.

DP_unit1
No ratings yet
DP_unit1
12 pages
NNDL-unit 3
No ratings yet
NNDL-unit 3
25 pages
Process-in-Memory forAI
No ratings yet
Process-in-Memory forAI
168 pages
SNN HW Challenges
No ratings yet
SNN HW Challenges
35 pages
Advanced Intelligent Systems - 2024 - Song - Hardware for Deep Learning Acceleration
No ratings yet
Advanced Intelligent Systems - 2024 - Song - Hardware for Deep Learning Acceleration
20 pages
make-04-00004-v3
No ratings yet
make-04-00004-v3
37 pages
EXSY Apr 21 455 Proof Hi
No ratings yet
EXSY Apr 21 455 Proof Hi
14 pages
FPGA CNN Project Paper
No ratings yet
FPGA CNN Project Paper
31 pages
GENERATIVE AI
No ratings yet
GENERATIVE AI
15 pages
Unit-1 Notes Complete
No ratings yet
Unit-1 Notes Complete
75 pages
Deep NN - Theory, Tutorial and Survey
No ratings yet
Deep NN - Theory, Tutorial and Survey
32 pages
Engineering: Yiran Chen, Yuan Xie, Linghao Song, Fan Chen, Tianqi Tang
No ratings yet
Engineering: Yiran Chen, Yuan Xie, Linghao Song, Fan Chen, Tianqi Tang
11 pages
Topic 07-Part1 Introduction To Deep Neural Networks
No ratings yet
Topic 07-Part1 Introduction To Deep Neural Networks
27 pages
13737_AI Module V
No ratings yet
13737_AI Module V
40 pages
Hardware Architectures For Deep Neural Networks: ISCA Tutorial June 24, 2017
No ratings yet
Hardware Architectures For Deep Neural Networks: ISCA Tutorial June 24, 2017
290 pages
Hardware Implementation of Neural Networks
No ratings yet
Hardware Implementation of Neural Networks
5 pages
Mod-1 Part 1
No ratings yet
Mod-1 Part 1
143 pages
DLBench A Comprehensive Experimental Evaluation of
No ratings yet
DLBench A Comprehensive Experimental Evaluation of
23 pages
02-92
No ratings yet
02-92
15 pages
NN&DP Unit3
No ratings yet
NN&DP Unit3
41 pages
Intro to deep Learning (4)
No ratings yet
Intro to deep Learning (4)
8 pages
Deep Learning in Spiking Neural Networks
No ratings yet
Deep Learning in Spiking Neural Networks
23 pages
A Developer's Guide To Artificial Intelligence (AI) : Definitions, Insights & Tools For Getting Started in AI
No ratings yet
A Developer's Guide To Artificial Intelligence (AI) : Definitions, Insights & Tools For Getting Started in AI
9 pages
Integrated Photonic Tensor Processing Unit For A M
No ratings yet
Integrated Photonic Tensor Processing Unit For A M
14 pages
Deep Learning 2 July 2014
No ratings yet
Deep Learning 2 July 2014
75 pages
Design and implementation of deep neural network hardware chip and its performance analysis
No ratings yet
Design and implementation of deep neural network hardware chip and its performance analysis
10 pages
Module-2
No ratings yet
Module-2
37 pages
2015WS HS SpikingVision
No ratings yet
2015WS HS SpikingVision
23 pages
Capra 2020
No ratings yet
Capra 2020
48 pages
Deep Learning For Intelligent Wireless Networks: A Comprehensive Survey
No ratings yet
Deep Learning For Intelligent Wireless Networks: A Comprehensive Survey
25 pages
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
No ratings yet
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
41 pages
Deep Learning Module-01 Search Creators
No ratings yet
Deep Learning Module-01 Search Creators
17 pages
w1--Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability [Feature]
No ratings yet
w1--Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability [Feature]
19 pages
Deep Learning in Neural Networks An Overview
No ratings yet
Deep Learning in Neural Networks An Overview
89 pages
Deep_Learning_1737909076
No ratings yet
Deep_Learning_1737909076
29 pages
1 - Deep Learning 10-10-2023
No ratings yet
1 - Deep Learning 10-10-2023
30 pages
Futureinternet 12 00113 v2
No ratings yet
Futureinternet 12 00113 v2
22 pages
Embedded_Deep_Learning_Accelerators_A_Survey_on_Recent_Advances
No ratings yet
Embedded_Deep_Learning_Accelerators_A_Survey_on_Recent_Advances
19 pages
Individual Paper - Nina Luksha - ITEC 625 9080 - Updated
No ratings yet
Individual Paper - Nina Luksha - ITEC 625 9080 - Updated
11 pages
Embedded Deep Learning Accelerators - A Survey On Recent Advances
No ratings yet
Embedded Deep Learning Accelerators - A Survey On Recent Advances
19 pages
Chips For Artificial Intelligence
No ratings yet
Chips For Artificial Intelligence
3 pages
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
No ratings yet
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
42 pages
Hardware Architectures For Deep Neural Networks-MIT'16
No ratings yet
Hardware Architectures For Deep Neural Networks-MIT'16
300 pages
Eng Assng
No ratings yet
Eng Assng
13 pages
Deep Learnings
No ratings yet
Deep Learnings
44 pages
Unit-2 Analysis of Algorithm
No ratings yet
Unit-2 Analysis of Algorithm
118 pages
Network Security and Cryptography: Course Code: 15Cs1105 Pre-Requisites: Computer Networks
No ratings yet
Network Security and Cryptography: Course Code: 15Cs1105 Pre-Requisites: Computer Networks
3 pages
A_DEEP_LEARNING_APPROACH_TO_CLASSIFY_CLASSIFY DRONES AND BIRDS
No ratings yet
A_DEEP_LEARNING_APPROACH_TO_CLASSIFY_CLASSIFY DRONES AND BIRDS
7 pages
Tutorial On DNN 1 of 9 Background of DNNs
No ratings yet
Tutorial On DNN 1 of 9 Background of DNNs
65 pages
dl notes
No ratings yet
dl notes
97 pages
Deep Learning Algorithms and Architectures
No ratings yet
Deep Learning Algorithms and Architectures
26 pages
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
100% (1)
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
37 pages
Deep Learning University
No ratings yet
Deep Learning University
129 pages
High-Street Changes and Populism ssrn-5119375
No ratings yet
High-Street Changes and Populism ssrn-5119375
94 pages
Internet Technology and Web Designing
No ratings yet
Internet Technology and Web Designing
242 pages
A Deep Learning Prediction Process Accelerator Based FPGA PDF
No ratings yet
A Deep Learning Prediction Process Accelerator Based FPGA PDF
4 pages
Programming Well - Harvard CS51
No ratings yet
Programming Well - Harvard CS51
509 pages
Adapting the Adaptive Toolbox_set of cognitive mechanisms
No ratings yet
Adapting the Adaptive Toolbox_set of cognitive mechanisms
70 pages
A Case Study on AI Engineering Practices
No ratings yet
A Case Study on AI Engineering Practices
13 pages
2D to 3D Image Conversion Algorithms
0% (1)
2D to 3D Image Conversion Algorithms
10 pages
Module 2
No ratings yet
Module 2
197 pages
The_Attention_System_of_the_Human_Brain
No ratings yet
The_Attention_System_of_the_Human_Brain
32 pages
2024-11-28-engineered-carbon-removals-energy-security-affordability-quiggin
No ratings yet
2024-11-28-engineered-carbon-removals-energy-security-affordability-quiggin
53 pages
A Beginner’s Guide to Variational Inference
No ratings yet
A Beginner’s Guide to Variational Inference
48 pages
Logical Reasoning in Large Language Models A Survey
No ratings yet
Logical Reasoning in Large Language Models A Survey
9 pages
(Yet) Another Theoretical Model of Thinking
No ratings yet
(Yet) Another Theoretical Model of Thinking
24 pages
Dojo System v25
No ratings yet
Dojo System v25
45 pages
Agent-Based Modeling the Emergent Behavior of a System of Systems
No ratings yet
Agent-Based Modeling the Emergent Behavior of a System of Systems
10 pages
Adopting Cognitive Computing Solutions in Healthcare
No ratings yet
Adopting Cognitive Computing Solutions in Healthcare
14 pages
Mind the Gaps. Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles
No ratings yet
Mind the Gaps. Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles
14 pages
Quantum Computer Vs Traditional Computer
From Everand
Quantum Computer Vs Traditional Computer
Arief Muinnudin
No ratings yet
Coding Session Question Bank-2
No ratings yet
Coding Session Question Bank-2
34 pages
Deep Neural Network
No ratings yet
Deep Neural Network
12 pages
Functional Architecture of The Cerebral Cortex
No ratings yet
Functional Architecture of The Cerebral Cortex
30 pages
Deep Learning Step by Step
No ratings yet
Deep Learning Step by Step
171 pages
]Evolution of brain size and juvenile periods in primates
No ratings yet
]Evolution of brain size and juvenile periods in primates
10 pages
Week1 2
No ratings yet
Week1 2
93 pages
Ppt-Ii NNFL
No ratings yet
Ppt-Ii NNFL
43 pages
Agents and Ambient Intelligence Case Studies
No ratings yet
Agents and Ambient Intelligence Case Studies
11 pages
Accelerating a Just Transition to Smart, Sustainable Cities
No ratings yet
Accelerating a Just Transition to Smart, Sustainable Cities
10 pages
Deep Learning
100% (3)
Deep Learning
32 pages
DLD Chapter 1
No ratings yet
DLD Chapter 1
15 pages
AIAgentFrameworkinHealthcareIndustry
No ratings yet
AIAgentFrameworkinHealthcareIndustry
8 pages
Menu Recommendation_OOps
No ratings yet
Menu Recommendation_OOps
2 pages
Language_evolution_as_cultural_evolution
No ratings yet
Language_evolution_as_cultural_evolution
6 pages
A Trading Agent With No Intelligence Routinely Outperforms AI
No ratings yet
A Trading Agent With No Intelligence Routinely Outperforms AI
8 pages
A Case of Applying AI to an Ethylene Plant
No ratings yet
A Case of Applying AI to an Ethylene Plant
6 pages
Auditory Cortex_Science
No ratings yet
Auditory Cortex_Science
4 pages
Building the Unified Data Warehouse and Data Lake TDWI Best Practices Report
No ratings yet
Building the Unified Data Warehouse and Data Lake TDWI Best Practices Report
30 pages
AI and the Frontiers of Finance
No ratings yet
AI and the Frontiers of Finance
7 pages
Deep Learning
No ratings yet
Deep Learning
243 pages
Discrete Problems
No ratings yet
Discrete Problems
2 pages
Dsa and ML 10
No ratings yet
Dsa and ML 10
18 pages
Elon Musk’s Neuralink Brain Chip
No ratings yet
Elon Musk’s Neuralink Brain Chip
5 pages
Accelerating-AI-impact-by-taming-the-data-beast
No ratings yet
Accelerating-AI-impact-by-taming-the-data-beast
6 pages
UNIT-1 Foundations of Deep Learning
100% (1)
UNIT-1 Foundations of Deep Learning
51 pages
Pic Programming Tutorial
100% (1)
Pic Programming Tutorial
18 pages
Our DSA lab 12[1]
No ratings yet
Our DSA lab 12[1]
5 pages
Triangular Fuzzy Number
No ratings yet
Triangular Fuzzy Number
10 pages
Face Detectionand Recognition Using Open CV
No ratings yet
Face Detectionand Recognition Using Open CV
13 pages
10 DSD Presentation
No ratings yet
10 DSD Presentation
10 pages
NEURALINK A Brain-Machine Interface Device
No ratings yet
NEURALINK A Brain-Machine Interface Device
5 pages
PSTC QUESTION BANK ON MODULE -3_4_5
No ratings yet
PSTC QUESTION BANK ON MODULE -3_4_5
3 pages
UNIT 4,5 Ques
No ratings yet
UNIT 4,5 Ques
5 pages
Software-Defined Networks: A Systems Approach
From Everand
Software-Defined Networks: A Systems Approach
Larry Peterson
5/5 (1)
Astar - 23b.trace - XZ Bimodal Next - Line Next - Line Next - Line Next - Line Drrip 1core
No ratings yet
Astar - 23b.trace - XZ Bimodal Next - Line Next - Line Next - Line Next - Line Drrip 1core
4 pages
Lab Assignment 2: There Are Total 5 Test Cases in Weak Normal Equivalence Class Testing
No ratings yet
Lab Assignment 2: There Are Total 5 Test Cases in Weak Normal Equivalence Class Testing
2 pages
23cb511 - Cryptography and Network Security
No ratings yet
23cb511 - Cryptography and Network Security
2 pages
Reaction Paper For Video #1
No ratings yet
Reaction Paper For Video #1
2 pages
Tech Engineer Resume Template
No ratings yet
Tech Engineer Resume Template
1 page
1. Deep Learning
No ratings yet
1. Deep Learning
127 pages
315 14 - Lab Data Structure Using C ++
No ratings yet
315 14 - Lab Data Structure Using C ++
106 pages
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
100% (10)
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
168 pages
Python UNIT 1
No ratings yet
Python UNIT 1
30 pages
Lab Manual OF Big Data Analtyics Lab (Bca04207) : BCA General II Year IV Semester Academic Session 2021-22
No ratings yet
Lab Manual OF Big Data Analtyics Lab (Bca04207) : BCA General II Year IV Semester Academic Session 2021-22
79 pages
UGRD-ITE6201-Data-Structures-and-Algorithms-legit-not-quizess MidALL
100% (3)
UGRD-ITE6201-Data-Structures-and-Algorithms-legit-not-quizess MidALL
17 pages
Cs 48 Form
No ratings yet
Cs 48 Form
1 page
Grade 7 Fractions: Answer The Questions
No ratings yet
Grade 7 Fractions: Answer The Questions
3 pages
A Comparative Study of AI Agent Orchestration Frameworks
No ratings yet
A Comparative Study of AI Agent Orchestration Frameworks
13 pages
5B Practice 36-39
No ratings yet
5B Practice 36-39
4 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Hardware For Deep Learning Acceleration

Uploaded by

Hardware For Deep Learning Acceleration

Uploaded by

REVIEW

Hardware for Deep Learning Acceleration

DNNs are graph models that consist of

and Gaussian error liner unit (GELU).[23] Particularly, GELUs are

CPUs and GPUs, and 2) ASIC-based accelerators, e.g., NPUs and

3.1. Rooﬂine Model

The rooﬂine model proposed in ref. [73] provides insights into

bandwidth reduces the ridge point, widening the compute-bound

3.2. Overview on Accelerators

For memory- and compute-bound tasks (depicted in Figure 4),

Figure 7. Architecture of a GPU. SM and SP stand for streaming multi-

simultaneously processed. To boost the off-chip memory band-

various data formats (e.g., FP64/32/16, BFLOAT16, INT8/4,

Figure 8. Architecture of NPU. Each core indicates a PE for a unit

3.5. CIM Units 3.5.2. Analog CIM Units

where SynOPs(1) denotes SynOPs for nonzero activation (’10 ),

where the sparsity f sp is given by

which is equivalent to Equation (11). Equation (13) highlights

4.2. Event-Routing Architectures

Although the event-routing architecture in a neuromorphic core

SynOPsð0Þ þ SynOPsð1Þ SynOPsð1Þ

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.