
CHAPTER EIGHT

Energy-efficient deep learning inference on edge devices
Francesco Daghero, Daniele Jahier Pagliari, and Massimo Poncino
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy

Contents
1. Introduction
2. Theoretical background
2.1 Neurons and layers
2.2 Training and inference
2.3 Feed-forward models
2.4 Sequential models
3. Deep learning frameworks and libraries
4. Advantages of deep learning on the edge
5. Applications of deep learning at the edge
5.1 Computer vision
5.2 Language and speech processing
5.3 Time series processing
6. Hardware support for deep learning inference at the edge
6.1 Custom accelerators
6.2 Embedded GPUs
6.3 Embedded CPUs and MCUs
7. Static optimizations for deep learning inference at the edge
7.1 Quantization
7.2 Pruning
7.3 Knowledge distillation
7.4 Collaborative inference
7.5 Limitations of static optimizations
8. Dynamic (input-dependent) optimizations for deep learning inference at the edge
8.1 Ensemble learning
8.2 Conditional inference and fast exiting
8.3 Hierarchical inference
8.4 Input-dependent collaborative inference
8.5 Dynamic tuning of inference algorithm parameters
9. Open challenges and future directions
References
About the authors

Advances in Computers, Volume 122. © 2021 Elsevier Inc. All rights reserved.
ISSN 0065-2458. https://doi.org/10.1016/bs.adcom.2020.07.002

Abstract
The success of deep learning comes at the cost of very high computational complexity.
Consequently, Internet of Things (IoT) edge nodes typically offload deep learning tasks
to powerful cloud servers, an inherently inefficient solution. In fact, transmitting raw data
to the cloud through wireless links incurs long latencies and high energy consumption.
Moreover, pure cloud offloading is not scalable due to network pressure and poses
security concerns related to the transmission of user data.
The straightforward solution to these issues is to perform deep learning inference at
the edge. However, cost and power-constrained embedded processors with limited
processing and memory capabilities cannot handle complex deep learning models.
Even resorting to hardware acceleration, a common approach to handle such complex-
ity, embedded devices are still not able to directly manage models designed for cloud
servers. It becomes then necessary to employ proper optimization strategies to enable
deep learning processing at the edge.
In this chapter, we survey the most relevant optimizations to support embedded
deep learning inference. We focus in particular on optimizations that favor hardware
acceleration (such as quantization and big-little architectures). We divide our analysis
in two parts. First, we review classic approaches based on static (design time) optimi-
zations. We then show how these solutions are often suboptimal, as they produce
models that are either over-optimized for complex inputs (yielding accuracy losses)
or under-optimized for simple inputs (losing energy saving opportunities). Finally, we
review the more recent trend of dynamic (input-dependent) optimizations, which solve
this problem by adapting the optimization to the processed input.

1. Introduction
In recent years, machine learning techniques have become pervasive
in our society, as the backbone of an increasing number of applications in the
mobile and IoT domains. This spread has been mainly fueled by the advent
of deep learning. In fact, one of the limitations of classical (i.e., pre-deep-
learning) ML models is their reliance on carefully hand-engineered feature
extractors, which makes model design a long, complex and costly process.
Furthermore, the resulting models are often not reusable if the specifications
of the problem change even slightly [1]. Deep learning, in contrast, over-
comes the need for hand-crafted features [1], by using representation learn-
ing to extract meaningful features directly from raw data. This approach has
been applied successfully to a number of applications, from smart
manufacturing [2], to medical analysis [3, 4] and agriculture [5]. In particular,
for tasks such as computer vision [6], natural language processing [7] and
speech recognition [8], deep learning models have achieved outstanding
results, sometimes even outperforming humans [1].
In practice, although most of the theoretical concepts at the basis of deep
learning have been around for decades, this approach has only become main-
stream in the last decade, mostly due to the availability of parallel hardware
with increasingly large storage and computing power [9]. Indeed, in order to
generate complex representations from raw data, deep learning models
require both large datasets and huge amounts of computations. For example,
a relatively small (by modern standards) computer vision model such as
AlexNet [6] requires 61M weights and 724M multiply-and-accumulate
operations (MACs) to process a single 227 × 227 image [9].
Besides being data hungry, deep learning workloads are also “embarrass-
ingly parallel,” which makes them perfectly suited for Graphical Processing
Units (GPUs) or cloud clusters [9]. In contrast, executing these models on cost
and power-constrained edge devices requires a synergy of hardware special-
ization (i.e., accelerators) and model optimization. In this chapter, we describe
some of the most relevant research efforts in this sense, focusing in particular
on model optimizations for energy efficiency.
The rest of the document is organized as follows. Section 2 provides the
required theoretical background on deep learning models from a computa-
tional perspective. Section 3 briefly describes the available frameworks for
the development and deployment of deep learning models. Sections 4
and 5 provide the motivation for performing deep learning at the edge,
while Section 6 describes the main hardware platforms available for such
task. Finally, Sections 7 and 8 describe the most relevant model optimiza-
tions for efficient deep learning at the edge. In particular, Section 7 presents
static (i.e., input-independent) optimizations, while Section 8 focuses on
dynamic (i.e., input-dependent) methods.

2. Theoretical background
In this section, we provide a (noncomprehensive) background on
deep learning models. We focus mostly on computational aspects, which
are relevant for the optimizations presented in the rest of the chapter, pro-
viding examples based on “standard” architectures, while intentionally skip-
ping some theoretical details related to the most advanced and exotic
models. Readers interested in those aspects can refer to [10].

2.1 Neurons and layers


The atomic blocks of a deep learning model are neurons, generic compu-
tational units performing a weighted and biased sum of their inputs.

Fig. 1 The conceptual view of an artificial neuron.


Fig. 2 Some of the most common activation functions in deep neural networks.
(A) Sigmoid; (B) Tanh; (C) ReLU.

A nonlinear activation function is then applied to the output of each neuron.
Fig. 1 shows a graphical representation of a neuron whose computation can
be summarized as:
z = \sum_{i=1}^{n} w_i x_i + b, \qquad y = h(z)    (1)
where w_i are the weights applied to the inputs x_i, b is the bias and h is the
activation function. During the training phase, weights and biases are
iteratively updated to approximate the target function, as explained below.
Fig. 2 shows some of the most commonly used activation functions, in
particular:
• The sigmoid (sigm) function:
h(z) = \frac{1}{1 + e^{-z}}    (2)
which squeezes its input onto the range (0, 1), as shown in Fig. 2A.
• The hyperbolic tangent (tanh):
h(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}    (3)
which is similar to sigmoid but maps its input to the interval (−1, 1), as
shown in Fig. 2B.
• The Rectified Linear Unit (ReLU) [11]:
h(z) = \max(0, z) = \begin{cases} 0 & z \le 0 \\ z & z > 0 \end{cases}    (4)
which is often used in modern deep learning models to solve the
vanishing gradient problem of sigmoid and tanh, i.e., the fact that for
very large input magnitude, the gradients of those two functions become
very small, complicating the back-propagation of errors during training.
In contrast, the gradient of ReLU is piece-wise constant, as clear from
Fig. 2C.
Besides reducing the impact of vanishing gradients, ReLU is also advanta-
geous from a computational perspective. Indeed, its evaluation simply con-
sists of a hardware-friendly max() function. In contrast, sigmoid and tanh
must be approximated either via a software routine or using table look-
up, depending on the hardware platform [12]. Furthermore, ReLU maps
all negative inputs to zero, leading to sparse activation outputs, which favor
optimization techniques such as pruning (see Section 7.2) [9].
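As a minimal illustration of Eqs. (1)–(4), a single neuron can be sketched in NumPy as follows (the numerical values are arbitrary examples, not taken from any reference):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # Eq. (2), output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # Eq. (3), output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # Eq. (4), a hardware-friendly max()

def neuron(x, w, b, h=relu):
    z = np.dot(w, x) + b              # weighted and biased sum, Eq. (1)
    return h(z)                       # nonlinear activation

x = np.array([0.5, -1.2, 3.0])        # example inputs
w = np.array([0.8, 0.1, -0.4])        # example weights
print(neuron(x, w, b=0.2, h=relu))
```

A layer simply applies this computation to many neurons at once, which leads to the matrix formulations discussed in Section 2.3.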
Most deep learning models are neural networks (NNs), i.e., combina-
tions of neurons organized in a sequence of representation layers. Layers
can be divided in three main categories with respect to their position in
the network. The input layer processes raw input data and contains one
neuron per input variable. Its outputs are then fed to one or more hidden
layers, which are at the core of NN processing. Each hidden layer builds
an increasingly complex representation of the input, projecting it into a
high-dimensional feature space. Finally, the output layer takes the last hidden
layer output and produces the final result of the NN computation. Clearly,
the structure of this layer and its activation function depend on the task for
which the NN is used. For classification, the output layer typically includes a
number of neurons equal to the number of classes, each producing an esti-
mate of the probability that the input belongs to the corresponding class. In
this case, a common activation function is the softmax:

q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}    (5)

which converts the preactivation output z for each class i into a probability
q. T is the so-called temperature and is normally set to 1.
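For illustration, Eq. (5) can be sketched in a few lines of NumPy, with the temperature T exposed as a parameter (a sketch, not a reference implementation):

```python
import numpy as np

def softmax(z, T=1.0):
    # Subtracting the maximum improves numerical stability without
    # changing the result of Eq. (5).
    e = np.exp((z - np.max(z)) / T)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))          # class probabilities, summing to 1
print(softmax(logits, T=4.0))   # a higher temperature gives a softer distribution
```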
Fig. 3 A neural network with three layers.

The depth of a NN is defined as its number of layers. NNs are generally
called deep when they include more than one hidden layer. Fig. 3 shows an
overview of a network with fully connected layers (see Section 2.3). Deep
NNs represent the majority of the models, with the average number of hid-
den layers for state-of-the-art NNs steadily increasing from a few to some
thousands in recent years [9].

2.2 Training and inference


Neural networks have to undergo a learning or training phase before being
able to accurately approximate a given function. Usually, training is per-
formed iteratively feeding the NN with chunks of data, called mini-batches,
in order to exploit the available hardware parallelism while still using a small
amount of memory compared to the entire training dataset. In the so-called
forward pass, the NN processes the mini-batch and produces a predicted out-
put y^. A cost or loss function L is then applied to such output. In supervised
learning, which is the most common approach for many deep learning appli-
cations, the loss measures the difference between y^ and the expected output
y. Finally, in the backward pass, the network’s weights are updated based on
the value of the loss using gradient descent. Gradients of the loss with respect to
neuron weights (wi) and biases (bi) are obtained through a process known as
back-propagation [10]. This sequence of forward and backward passes is
repeated for a large number of iterations, spanning the entire training dataset
multiple times.
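As an illustrative sketch, one training iteration as described above can be written with PyTorch as follows; the model architecture, optimizer, and data are placeholders chosen only to make the example self-contained:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(x_batch, y_batch):
    optimizer.zero_grad()
    y_pred = model(x_batch)          # forward pass on a mini-batch
    loss = loss_fn(y_pred, y_batch)  # loss between prediction and expected output
    loss.backward()                  # backward pass: gradients via back-propagation
    optimizer.step()                 # gradient descent update of weights and biases
    return loss.item()

# One mini-batch of random data, for illustration only.
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
print(train_step(x, y))
```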
Once the learning phase has concluded, the network can be used to per-
form predictions on unknown data. This process, called inference, consists
only of the forward pass and uses fixed weight values. While it is possible to use
mini-batches also in this phase, most inference tasks, especially for embedded
and IoT applications, have tight latency constraints, i.e., outputs have to be
produced as soon as possible after inputs become available. In those scenarios,
batching is not an option, and inference has to be performed on single inputs,
in a streaming fashion. Clearly, this negatively affects the exploitable parallel-
ism and data sharing opportunities.

2.3 Feed-forward models


Feed-forward NNs are arguably the most popular type of deep learning
model and are characterized by the absence of feedback connections. In
other words, neuron outputs in a given layer are only fed as inputs to sub-
sequent layers. The two most popular types of feed-forward NNs are fully
connected neural networks and convolutional neural networks.

2.3.1 Fully connected neural networks


Fully connected neural networks have been among the first types of NN to
be developed. They are constructed as a sequence of fully connected
layers, i.e., layers in which each neuron receives as input all outputs from
the previous layer [10]. This type of connectivity implies the presence of a
separate parameter wi, j for each pair of neurons in adjacent layers.
Therefore, the forward pass in fully connected networks can be described
compactly as a large matrix multiplication, as depicted in Fig. 4. This is real-
ized in hardware as a sequence of multiply-and-accumulate (MAC) oper-
ations. As shown in the figure, the input x and preactivation output z can be
either vectors or matrices, depending on the use of batching during infer-
ence. Given the large number of neurons in each layer of modern NNs, the
weight matrix in Fig. 4 can easily contain thousands or millions of elements.
Therefore, fully connected forward passes tend to be strongly memory bound,
i.e., to require a large number of data transfers to/from memory for each

Fig. 4 The matrix multiplication Z_{ik} = W_{ij} · X_{jk} performed for each fully connected layer; one dimension of X and Z corresponds to the mini-batch size.


MAC operation. This is only partially mitigated by the weight reuse made
available by batching, when it is an option [13].
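A minimal sketch of the matrix formulation of Fig. 4, with illustrative shapes, is shown below; each output element costs one MAC per input neuron, which is why the layer becomes memory bound when the weight matrix does not fit on-chip:

```python
import numpy as np

def fully_connected(W, X, b):
    # W: (n_out, n_in) weights, X: (n_in, batch) inputs, b: (n_out,) biases.
    # One MAC per (output neuron, input neuron) pair; with a large W the layer
    # is memory bound, since each weight is loaded for very few MACs.
    Z = W @ X + b[:, None]      # pre-activations, as in Fig. 4
    return np.maximum(0.0, Z)   # element-wise ReLU

W = np.random.randn(128, 256)
X = np.random.randn(256, 4)     # mini-batch of 4 inputs (batch = 1 when streaming)
b = np.zeros(128)
print(fully_connected(W, X, b).shape)   # (128, 4)
```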

2.3.2 Convolutional neural networks


Convolutional neural networks (CNNs) have achieved outstanding results
for computer vision tasks [1]. These networks typically process 3D tensors
as inputs rather than flat 1D arrays, where the three dimensions may corre-
spond to the height, width and number of channels of an image. The fun-
damental layers used in these networks are Convolutional (or Conv)
layers, which apply a number of sliding window filters to the input tensors,
followed by an element-wise activation function, most commonly a ReLU
[10]. A representation of a Conv layer is shown in Fig. 5. Mathematically, a
Conv layer output is computed as:
y_{i,j,c_o} = h\left( \sum_{l=0}^{K} \sum_{m=0}^{K} \sum_{c_i=0}^{C_i} w_{c_o,l,m,c_i} \cdot x_{i+l-\frac{K}{2},\, j+m-\frac{K}{2},\, c_i} + b \right)    (6)

\forall i \in [0, H), \; \forall j \in [0, W), \; \forall c_o \in [0, C_o)


where K corresponds to the height and width of the filter’s weight kernel
(assumed square), H and W are the height and width of the input tensor,
and Ci and Co are the number of input and output channels respectively.
There exist several variants of this basic equation, for example to implement
strided convolution, or for different padding strategies for boundary pixels.
Moreover, many recent architectures resort to depth-wise or group-wise con-
volutions, where the input tensor slice only spans one or few channels in Ci.
Besides Conv layers, CNNs often include other types of layers, such as
pooling and batch normalization (or batch norm). The former are used to reduce

Fig. 5 The convolution operation.


the dimensionality of the input and to provide some degree of translation
invariance, and typically compute either a max or an average over a spatial
tensor window [1]. The latter, instead, apply a linear transformation to all
elements in each channel, using the following equation:

y = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\,\gamma + \beta    (7)

where γ and β are parameters learned during the training, while μ_B and σ_B
are the mini-batch mean and standard deviation; ε is a small constant added
for numerical stability [15]. During inference, μ_B and σ_B are replaced with
the entire training set mean and standard deviation [16]. Batch norm has
been shown to improve training speed by making gradients more stable,
besides favoring the application of quantization (see Section 7.1) [17].
Features extracted by CNNs are often flattened and fed to one or more fully
connected (FC) layers for the final classification. An example of a classical
CNN architecture is shown in Fig. 6.
Conv layers dominate the inference phase of a CNN from a computa-
tional standpoint. This is due to the fact that they process larger tensors with
respect to the final FC layers, which operate on a compressed feature space.
Moreover, the number of Conv layers is typically much larger than the num-
ber of FC layers in a modern CNN. The naive realization of a set of (standard)
convolution filters on a 3D tensor consists of 6 nested loops [13]. More
specifically, the 3 innermost loops perform the weighted sum of a 3D slice
of the input tensor with a 3D filter kernel, i.e., the three summations of
(6). Two additional loops move the slice across the width and height of
the input tensor, and the last loop repeats the entire procedure with multiple
filters, thus generating the various channels of the output tensor. Differently
from fully connected layers, Conv operations are typically compute bound,
even without batching [13].
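For clarity, the six-loop formulation can be sketched as follows (a naive NumPy version of the convolution in Eq. (6), assuming unit stride and "same" zero padding; written for readability, not speed):

```python
import numpy as np

def conv2d_naive(x, w, b):
    # x: (H, W, Ci) input tensor, w: (Co, K, K, Ci) filters, b: (Co,) biases.
    H, W_, Ci = x.shape
    Co, K, _, _ = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))   # "same" zero padding
    y = np.zeros((H, W_, Co))
    for co in range(Co):                 # loop 1: output channels (filters)
        for i in range(H):               # loop 2: output rows
            for j in range(W_):          # loop 3: output columns
                acc = b[co]
                for l in range(K):       # loops 4-6: weighted sum of a 3D slice
                    for m in range(K):
                        for ci in range(Ci):
                            acc += w[co, l, m, ci] * xp[i + l, j + m, ci]
                y[i, j, co] = max(0.0, acc)   # element-wise ReLU activation
    return y

y = conv2d_naive(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3), np.zeros(4))
print(y.shape)   # (8, 8, 4)
```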

Fig. 6 LeNet-5, an example of a classic CNN architecture [14]. FC, fully connected. Layer activations are not shown for simplicity.

Fig. 7 An example of the im2col procedure on a 2 × 2 convolution. The image data is reorganized to obtain a single matrix where columns are the elements in a 2 × 2 window. Fm and om are the filter and output matrices.

Alternatively, convolution can be transformed into a single large matrix
multiplication, using a procedure known as im2col. This operation is used in
particular on CPUs and GPUs, and requires a reorganization of the input ten-
sors and of the filter kernels, as shown in Fig. 7. While this reorganization dupli-
cates some data, thus increasing the memory space required for the operation, it
allows to exploit the extremely optimized CPU/GPU implementations of
GEneral Matrix Multiply (GEMM) available in many mathematical librar-
ies [18].
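A possible sketch of the im2col transformation followed by the GEMM, under the same unit-stride, "same"-padding assumptions as above:

```python
import numpy as np

def im2col(x, K):
    # Rearranges K x K windows of x (H, W, Ci) into the columns of a matrix,
    # duplicating overlapping pixels, so that convolution becomes a GEMM.
    H, W_, Ci = x.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    cols = np.empty((K * K * Ci, H * W_))
    for i in range(H):
        for j in range(W_):
            cols[:, i * W_ + j] = xp[i:i + K, j:j + K, :].ravel()
    return cols

def conv2d_gemm(x, w, b):
    # w: (Co, K, K, Ci) filters, flattened into the rows of the filter matrix Fm.
    H, W_, _ = x.shape
    Co, K = w.shape[0], w.shape[1]
    Fm = w.reshape(Co, -1)                    # (Co, K*K*Ci)
    om = Fm @ im2col(x, K) + b[:, None]       # single large matrix multiplication
    return om.reshape(Co, H, W_).transpose(1, 2, 0)

y = conv2d_gemm(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3), np.zeros(4))
print(y.shape)   # (8, 8, 4)
```

Up to the missing activation function, the result matches the naive loop formulation; the price is the duplication of overlapping input pixels in the column matrix, as noted above.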

2.4 Sequential models


Feed-forward models are unable to model time correlations, nor to process
inputs of variable size. Therefore, different types of neural networks have
been designed to explicitly work with variable-length temporal sequences
of data. In order to be able to remember information about previous input
in the sequence, these networks typically include feedback connections.
Recurrent neural networks (RNNs) are among the most popular deep
sequence models, and have achieved outstanding results in tasks such as neural
machine translation, speech recognition, and summarization [1, 7, 19]. These
networks process each input from a temporal sequence using one or more
Energy-efficient deep learning inference on edge devices 257

Fig. 8 An overview of a LSTM cell (A) and its unrolling during inference (B). In the cell
diagram, circles labeled with σ represent a multiplication with a weight matrix followed
by an element-wise sigmoid or tanh operation, while circles labeled with x or + represent
element-wise products and sums.

layers of so-called cells, i.e., complex “neurons” with memory. Common cell
architectures include the long short-term memory (LSTM) and the gated
recurrent unit (GRU). An example of LSTM cell is shown in Fig. 8; for
the details of the various operations involved, the reader can refer to [10].
The figure also shows how cells are used to process an entire sequence, by
unrolling them a number of times equal to the sequence length. Notice that
each cell “replica” shares the same learned weights.
The sequential nature of RNNs comes at the price of an even higher
complexity compared to feed-forward models. Indeed, for each time-step, a
LSTM cell has to compute the following operations:
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})    (8)
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})    (9)
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})    (10)
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})    (11)
c_t = f_t \circ c_{t-1} + i_t \circ g_t    (12)
h_t = o_t \circ \tanh(c_t)    (13)
where h_t, c_t, and x_t are, respectively, the hidden state, cell state, and input
vectors at time t and are multiplied with eight weight matrices W_{**}.
Vectors i_t, f_t, g_t, and o_t are called the input, forget, cell, and output gates,
respectively; σ is the sigmoid function and ∘ is the element-wise product.
Matrix-vector products in (8)–(13) are the dominant operations from a
computational perspective. In particular, the eight weight matrices can be
combined into a single bigger matrix, reducing the entire time-step com-
putation to a large matrix multiplication, as in FC layers [10]. Differently from
the feed-forward case, however, this multiplication is repeated for each time
step. Moreover, parallelism between time-steps is limited, as each evaluation
of (8)–(13) requires to have available the outputs ct1 and ht1 from the pre-
vious step. RNNs can be also organized in more complex architectures, such
as the encoder–decoder one, which relies on two RNNs to perform sequence-
to-sequence mapping (e.g., for translation). However, the basic computations
involved remain identical.
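A minimal NumPy sketch of one LSTM time-step, with the eight weight matrices of Eqs. (8)–(11) fused into a single matrix as described above (biases are fused as well; sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W packs the eight weight matrices of Eqs. (8)-(11) into a single
    # (4*hidden, input+hidden) matrix, so that one large matrix-vector
    # product per time-step computes all four gates at once.
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i_t = sigmoid(z[0 * hidden:1 * hidden])        # input gate, Eq. (8)
    f_t = sigmoid(z[1 * hidden:2 * hidden])        # forget gate, Eq. (9)
    g_t = np.tanh(z[2 * hidden:3 * hidden])        # cell gate, Eq. (10)
    o_t = sigmoid(z[3 * hidden:4 * hidden])        # output gate, Eq. (11)
    c_t = f_t * c_prev + i_t * g_t                 # Eq. (12)
    h_t = o_t * np.tanh(c_t)                       # Eq. (13)
    return h_t, c_t

hidden, n_in = 8, 4
W, b = np.random.randn(4 * hidden, n_in + hidden), np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in np.random.randn(10, n_in):              # unrolling over a sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)
```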
Finally, it is worth noticing that deep sequence models have seen a very
quick development in recent years, with the advent of attention-based archi-
tectures and transformers [20]. While extremely effective, however, these
networks are still relatively unexplored from the point of view of model
optimization for efficient edge processing and hardware acceleration.

3. Deep learning frameworks and libraries


A reason for the increasing popularity of deep learning is the availabil-
ity of open-source frameworks, which simplify the development of new
models, eliminating the need of rewriting state-of-the-art building blocks
(activation functions, layers, etc.) from scratch. For high-performance sys-
tems, PyTorch [21] and TensorFlow (TF) [22] are currently the two most
popular frameworks. Both provide high-level Python APIs for model devel-
opment, training, and deployment, while leveraging optimized C/C++
libraries for CPU or GPU acceleration under the hood. Cross-framework
model description formats such as the one provided by ONNX [23] are also
increasingly popular.
These frameworks are extremely powerful and flexible, but they are not
suited for low-power edge devices, mostly due to their significant require-
ments in terms of runtime memory occupation. Therefore, recent years have
seen the development of many edge-oriented libraries and frameworks for
deep learning. Both PyTorch and TensorFlow now offer lightweight infer-
ence engines (called PyTorch Mobile and TF Lite, respectively) targeting
resource efficiency. With respect to their full-fledged counterparts, these
stripped-down versions only permit the execution of the inference phase
(i.e., no model development nor training), thus greatly reducing the size
of the corresponding runtime. Conversion tools allow to export a standard
PyTorch/TF model for these engines, while simultaneously applying
efficiency-oriented optimizations to the network (mostly quantization,
see Section 7.1).
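As an example of such a conversion flow, a Keras model can be exported for TF Lite roughly as follows (a sketch based on the public TensorFlow Lite converter API; the available options depend on the TensorFlow version and on the target):

```python
import tensorflow as tf

# A small Keras model standing in for a trained network (architecture is illustrative).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Export the model for the TF Lite inference engine, letting the converter apply
# efficiency-oriented optimizations (here, post-training quantization of the weights).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```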
While PyTorch Mobile and TF Lite target powerful edge devices such as
smartphones or tablets, other projects have addressed the implementation of
deep learning on even more constrained targets, such as Microcontrollers
(MCUs). Tensorflow Lite for Microcontrollers is a recent effort in this sense
from TF developers, targeting devices such as ARM Cortex-M processors; it
runs without the need of operating system support, and occupies only a few
kB of memory. Moreover, several companies and academic researchers are
developing MCU-oriented libraries for deep learning inference, such as
ARM’s CMSIS-NN [18], STMicroelectronics CUBE AI [24], the PULP
Platform library PULP-NN [25], and many others. More details on these
libraries are provided in Section 6.3.

4. Advantages of deep learning on the edge


Nowadays, deep learning is one of the core components of many mobile
and IoT applications. However, due to the heavy processing and memory
requirements of DNNs, the standard approach to implement deep learning
for mobile and IoT is pure cloud offloading, in which raw data are transmitted
to a remote server, which runs the inference using high-performance hard-
ware (e.g., a GPU) and returns the final result. While this approach is accept-
able for training, which is an “offline” task, pure cloud offloading of DNN
inference can be highly inefficient [9, 26]. In contrast, near-sensor edge comput-
ing might provide several benefits, as long as designers have available a set of
optimization strategies that permit the efficient execution of DNNs on cost-
constrained edge devices.
First, cloud offloading might incur long and unpredictable latencies
when devices have a slow or intermittent internet connection. This might
be critical for deep learning applications with tight real-time constraints,
such as autonomous driving [26]. Local computation, instead, can be made
predictable in terms of latency much more easily. Moreover, the cloud
approach also has scalability issues, since having a large number of devices
connected to the cloud increases the network pressure, deteriorating the
quality of service. This is especially true for bandwidth-intensive inputs such
as videos. Performing the entire inference, or at least a part of it, at the edge
would reduce this bottleneck.
Besides latency and bandwidth problems, the transmission of high-
dimensional data over the network is also highly energy inefficient, espe-
cially for wirelessly connected edge devices. In comparison, an optimized
local processing can be orders of magnitude more efficient [13]. This is par-
ticularly critical given that mobile and IoT devices are mostly battery oper-
ated and run on tight energy constraints [27].
Finally, mobile and IoT applications often process sensitive data (e.g., face
detection, speech recognition) whose transmission to the cloud may raise
privacy concerns. Avoiding the transmission or performing a first
preprocessing step locally would therefore also increase security.

5. Applications of deep learning at the edge


The increasing number of smart devices has caused a growth in the num-
ber of deep learning-based applications that would benefit from an execu-
tion at the edge. Tasks such as face recognition, object detection, and
wakeword recognition are in fact already commonly executed in a
decentralized fashion [26]. Some of the most relevant tasks that can benefit
from edge computing are described in the following sections. While describ-
ing them, we try to stress the fact that, in principle, nothing prevents the
implementation of any of these tasks using a standard cloud computing
approach. However, that solution would yield suboptimal results in terms
of responsiveness, security, and energy efficiency.

5.1 Computer vision


Computer vision is one of the most explored applications of deep learning
due to the outstanding results obtained [1]. Tasks such as object detection
and image recognition are common in a broad number of edge-oriented
applications, such as autonomous driving, surveillance, vehicle, and per-
son detection. These tasks often rely on collecting data from cameras and
processing them immediately afterwards [26]. Therefore, the inference
phase has real-time requirements (e.g., car detection in autonomous driv-
ing) which may be violated with a cloud approach due to connectivity
problems. Moreover, vision-based applications often process large amounts
of sensitive user data (e.g., pictures of faces), and therefore would benefit
both from avoiding the energy-consuming transmission of these data to
the cloud, and from the privacy-preserving properties of edge computing.
Evidently, computer vision at the edge can only be realized if the model can
be optimized to efficiently run a highly intensive processing of large videos
on an embedded device, with an acceptable frame rate.

5.2 Language and speech processing


Deep learning has also obtained outstanding results in natural language
processing (NLP) and speech recognition [1]. While language-based appli-
cations are typically not real-time, the perceived user experience (e.g., in an
automatic translation system) would still benefit from the shorter response
times that can be achieved thanks to edge computing. Similarly, privacy
is again a relevant issue when dealing with user audio recordings or text con-
versations. Furthermore, some audio-based edge deep learning tasks, such as
wakeword recognition for smart speakers, are “always-on” (i.e., the infer-
ence is performed constantly in the background). For these tasks, a cloud
approach is even more suboptimal, as it requires a constant energy-hungry
transmission of data to the cloud. Indeed, this is one of the most common
tasks that are already performed at the edge [26]. As for computer vision,
however, state-of-the-art language and speech models are typically of con-
siderable size. In the last years, reduced versions of deep learning models
have successfully been used for simple NLP tasks on embedded devices,
avoiding any use of the network (e.g., the aforementioned wakeword rec-
ognition). However, the challenge is still open for more complex tasks
such as machine translation or summarization, which require far more
computing power.

5.3 Time series processing


Smartphones and wearables continuously collect time series of data from
tens of sensors. These data can be processed to recognize the activity per-
formed by the device owner and react to it in various ways, from simple log-
ging [28] to power optimization [29]. Once again, analyzing this flow of data
directly on the device would avoid energy-consuming transmission of pri-
vate data (e.g., the biometric measurements of a smart watch) to the cloud.
Moreover, it would allow a prompter response for the applications.
Similarly, ubiquitous sensors in smart cities and smart factories can also
greatly benefit from deep learning-based inference on time series. Among
the most popular applications in these domains there are electrical load pre-
diction and balancing in smart grids [30], traffic and pollution monitoring in
cities [31], soil monitoring for agriculture [5] and predictive maintenance of
industrial equipment in factories [2]. Even more than for smartphones and
wearables, energy reduction is critical for these kinds of sensors, which are
typically expected to operate on battery for months or years [26]. Therefore,
avoiding raw data transmission and performing at least a partial local
processing is critical for these scenarios as well.

6. Hardware support for deep learning inference at the edge
The broad diffusion of deep learning has led to two main trends related
to hardware [9]. On the one hand, a lot of custom accelerators for deep learn-
ing inference have been proposed, implemented either as application-
specific integrated circuits (ASICs) or using field-programmable gate arrays
(FPGAs). On the other hand, researchers and companies have also worked in
the direction of improving the efficiency of deep learning processing on gen-
eral purpose hardware, using highly optimized software libraries, some of
which have been mentioned in Section 3.
In this section, we briefly overview the available platforms for deep
learning at the edge, highlighting their advantages and disadvantages.

6.1 Custom accelerators


Resorting to hardware specialization to implement DNN inference leads
to the highest energy efficiency and throughput. This comes at the cost of
high design and manufacturing costs, especially for ASIC implementations.
FPGAs, on the other hand, allow to drastically cut costs due to their field
programmability, which enables easier flows for designers and dilutes
manufacturing costs over large volumes. However, they are significantly
more power hungry than modern ASICs.
Regardless of the target technology, DNN accelerators typically use spa-
tial or systolic architectures to optimize the highly parallel MAC kernels that
are at the core of most key DNN layers (such as FC, Conv, and LSTM layers
described in Section 2) [32–38]. A large number of such accelerators, some-
times called Neural Processing Units (NPUs), has been recently proposed
both by companies and by academic researchers. In the commercial world,
two of the most popular products are Google’s Edge TPU [32] and Intel
Movidius [33]. In the academic world, Eyeriss [34] and Envision [35] are
examples of flexible and powerful accelerators for feed-forward NNs.
One feature of the latter is the use of low-precision quantization (see
Section 7.1) to reduce the memory bandwidth and improve arithmetic
operations efficiency using dynamic voltage scaling [39–41]. The quantiza-
tion concept is brought to the extreme by binary NN accelerators such as the
XNE [36]. Besides these examples, many other designs are constantly being
proposed by hardware designers. Since the focus of this chapter are model-
level, semihardware-independent optimizations, we do not go into details of
the most recent accelerators and refer the interested reader to the excellent
survey in [9].

6.2 Embedded GPUs


GPUs played a fundamental role in the diffusion of deep learning, thanks to
their excellent throughput and efficiency for parallel tasks such as DNN
inference [1]. Most GPUs are high-performance computing devices, with
a power consumption in the order of 100s of Watts, clearly inappropriate
for edge systems. Recently, however, GPU manufacturers have started to
design architectures that focus on efficiency. The NVIDIA Jetson family
of GPUs [42], for example, target explicitly embedded applications and con-
sume in the order of 10 W. While these power values remain excessive for
sensors and battery-operated devices, such kind of embedded GPUs might
be installed in grid-plugged intermediate edge servers. Furthermore, the
high level of parallelism of GPUs would not be fully exploited by end-
nodes due to the aforementioned difficulties for applying batching in
latency-sensitive inference tasks [37]. In contrast, edge servers can typically
resort to batching by gathering data from multiple sources (e.g., multiple
sensors) [43].
Commercial embedded GPUs come with associated libraries and tool-
chains [42] that enable the full exploitation of the capabilities of their hard-
ware for DNN processing. In particular, CuDNN [44] is the library used by
NVIDIA GPUs to map DNN layers from various high-level deep learning
frameworks (TensorFlow, PyTorch, etc.) to optimized implementations
based on CUDA [45]. Compared to the libraries for MCUs described in
the following section, CuDNN supports a very comprehensive set of deep
learning primitives, for both convolutional and sequential models.

6.3 Embedded CPUs and MCUs


Compared to custom accelerators and GPUs, embedded CPUs and micro-
controllers have cost as their main selling point. However, these devices also
have orders of magnitude lower performance, as well as extremely limited
memory spaces. Nonetheless, modern MCUs often support low-precision
integer operations (e.g., 8-bit, 16-bit), sometimes with a certain amount
of parallelism using single instruction multiple data (SIMD) instruc-
tions. Therefore, deep learning libraries for these platforms exploit as much
as possible these capabilities, while minimizing the on-chip memory foot-
print. To achieve both goals at the same time, many MCU deep learning
libraries focus extensively on quantized models.
CMSIS-NN [18] is one of such libraries, targeting ARM Cortex-M
series MCUs. It consists of highly optimized implementations of common
DNN kernels, based on a clever use of the ARM instruction set architecture
(ISA) to increase throughput. For instance, the lack of 8-bit SIMD opera-
tions on Cortex-M is overcome by mapping them to 16-bit instructions and
using a custom reorganization of the data in memory to minimize the instruc-
tions needed for sign extension and merging. CMSIS-NN also comes with a
companion set of tools, which allow to convert pretrained models (e.g., in
TensorFlow, PyTorch or ONNX) and to perform posttraining quantization.
X-CUBE-AI [24] is a similar library from STMicroelectronics, again able to
“translate” networks trained with high-level frameworks into optimized code
for STM32 MCUs. Currently, one limitation of these and other similar librar-
ies is the limited support in terms of types of layers, especially for sequential
models.
In the academic domain, several embedded CPU architectures with spe-
cific hardware features that favor the implementation of deep learning algo-
rithms have been proposed, extending open ISAs. One notable example is
the parallel ultra-low-power (PULP) platform [46], which is at the basis of
the GAP8 processor [47]. GAP8 is a scalable, clustered many-core architec-
ture targeting low-voltage and low-energy processing. The extensible RISC-
V ISA on which PULP is based allows the implementation of DNN-oriented
operations directly in hardware, such as bit-extraction for subbyte quantiza-
tion and popcount for binary NNs (see Section 7.1). These features are
exploited by the PULP-NN library [25] to yield improved efficiency com-
pared to that of other MCUs. Another peculiar feature of GAP8 which is
of great interest for deep learning applications is the fact that the main
MCU cluster does not contain data caches, and uses software-controllable
scratchpad memories instead. This increases the efficiency for memory-
intensive tasks such as DNN layers, at the cost of a greater effort from
the programmer. The latter, however, can be eliminated by resorting to
automatic tools that implement custom scratchpad management starting
from a high-level DNN specification [48].

7. Static optimizations for deep learning inference at the edge
Design time optimizations to improve the energy efficiency of deep
learning models have been actively researched in recent years [9, 49], with
a few main families of methodologies being proposed. Before going into the
details of such approaches, however, it is important to mention what is argu-
ably the single best optimization strategy, i.e., the careful tuning of the
hyper-parameters of the target DNN architecture. Indeed, state-of-the-
art DNNs are often greatly over-parametrized, in order to achieve the best
possible accuracy on complex tasks, such as ImageNet’s 1000-classes classi-
fication [50]. Many researchers have shown that such architectures can be
dramatically shrunk with negligible accuracy losses, even for the same appli-
cation. For example, SqueezeNet [51] is a CNN that obtains the same
ImageNet accuracy as the classical AlexNet [6] with 50× fewer parameters.
Another notable example are MobileNets [52], a family of CNNs that sub-
stitute standard Conv layers with depth-wise convolutions, thus obtaining
high accuracy with a significantly reduced amount of MAC operations.a
Additionally, popular DNN architectures originally proposed for standard
tasks (such as ImageNet) are often reused for much simpler applications,
especially in the IoT domain [53, 54]. In those scenarios, hyper-parameter
a. Although it must be noted that this reduction in the number of MACs does not always translate into hardware efficiency improvements, because, depending on the adopted tensor layout, depth-wise Conv layers may have inefficient data access patterns.
simplifications can be even more dramatic, ranging from input size reduc-
tion to the complete removal of several layers. Clearly, the drawback of
changing the DNN architecture is that it prevents the fine-tuning [1] of pre-
trained models. We do not describe hyper-parameters optimizations in
detail, since they are highly task-specific. However, we remark that, when-
ever they are possible for the task at hand, these should be the first optimi-
zations to be considered, as they can yield the largest complexity reduction
and efficiency improvement.
In contrast, in the rest of this section we focus on four families of general
optimization strategies that can be applied successfully to many different
DNNs: quantization, pruning, distillation, and collaborative inference. We select
these four families as they are currently the most effective and widely used by
researchers and industry. Two other interesting trends, not covered in this
chapter but worth mentioning, are filter decomposition [55] and the use of
approximate computing techniques (e.g., voltage over-scaling, approximate
functional units) [27, 56].
Importantly, the four families of optimizations described in this section,
as well as the dynamic ones treated in Section 8 are not only (almost) orthog-
onal to the details of the DNN used for inference, but also to the type of
inference hardware. This means that, although with different benefits in
terms of energy efficiency, these strategies can be applied to any of the fam-
ilies of platforms described in Section 6. As such, most of the approaches
described in the following do not try to reduce energy by lowering the power
consumption of each individual operation, which is strongly hardware-
dependent.b Instead, they exploit the fact that energy is a time-integral quan-
tity, and try to decrease it by reducing the number of operations performed,
thus being effective regardless of the underlying hardware, even if the power
of each operation remains constant. In practice, therefore, most of the tech-
niques presented in the chapter reduce either the memory occupation and
bandwidth of DNNs, thus cutting the energy associated with loading and
storing data through memory hierarchy levels, or the number and complex-
ity (precision) of the arithmetic (MAC) operations required for inference.
We select these techniques exactly for their generality. Clearly, hardware-
specific power optimizations can be combined to them, yielding even higher
energy savings, and we will mention some scenarios when this combination
has been realized in the following.

b. There are clearly exceptions to this rule. For example, integer quantization reduces both time and power, as integer ALUs are normally more power-efficient than floating point ones, for practically any hardware platform.

7.1 Quantization
One of the most widely diffused optimization techniques for deep learning
models is quantization. It consists of reducing the precision of the DNN
weights and possibly of the activations, exploiting the resilience of DNNs
to small errors and noise [27]. Reduced precision formats for DNNs can lever-
age either floating point (such as the 16-bit minifloat, currently supported by
many deep learning oriented GPUs [57]) or integer (i.e., fixed point) arith-
metic. The former is sometimes also used to speed up training, as minifloat rep-
resentation often does not yield any accuracy loss compared to standard 32-bit
floats. Integer quantization, in contrast, is mostly used to improve the effi-
ciency and speed of the inference phase [16]. At training time, integer quanti-
zation is only (optionally) simulated, in order to account for its impact on the
network outputs, as explained in detail in the following. Since our main focus
is inference, in the following, we concentrate on integer quantization
techniques.

7.1.1 Quantization algorithms


Integer quantization algorithms can be divided into uniform and non-
uniform based on the way in which representable numbers are distributed
on the real axis. Uniform quantization is the most common of the two,
and can be performed in several ways, among which the affine, symmetric,
and stochastic quantizers are the most popular [16].
The affine quantizer maps numbers in the range (x_min, x_max) to the range
(0, N_levels − 1), where N_levels = 2^precision. The input range (x_min, x_max) is the set
of values that can be assumed by a given set of weights or activations (relative
to a channel, a layer, or to the entire network, as explained below).
Therefore, the first step needed to apply quantization is to determine this
range. The way to do so depends on whether quantization is being applied
during or after training, and is explained in Section 7.1.2.
Next, the uniform quantization step (Δ) and the zero-point (z) can be
derived based on x_min, x_max, and N_levels. In particular, z is an integer
corresponding exactly to zero. Ensuring that zero is still represented
exactly after quantization is important, as it ensures that common operations
like zero padding do not introduce a quantization error. One-sided weight and
activation distributions are therefore relaxed in order to include 0, before
quantizing with the following operations:
x_{int} = \text{round}\left(\frac{x}{\Delta}\right) + z    (14)
x_Q = \text{clamp}(0,\, N_{levels} - 1,\, x_{int})    (15)

with clamp defined as:


\text{clamp}(a, b, x) = \begin{cases} a & x \le a \\ x & a \le x \le b \\ b & x \ge b \end{cases}    (16)
The dequantization can be instead performed in the following way:
x_{float} = (x_Q - z)\,\Delta    (17)
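A minimal NumPy sketch of the affine quantizer and dequantizer of Eqs. (14)–(17), assuming 8-bit precision and the common convention Δ = (x_max − x_min)/(N_levels − 1):

```python
import numpy as np

def affine_quantize(x, x_min, x_max, precision=8):
    # Affine quantizer of Eqs. (14)-(15): maps (x_min, x_max) onto the integers
    # [0, N_levels - 1]. Assumes x_min <= 0 <= x_max so that zero stays exact.
    n_levels = 2 ** precision
    delta = (x_max - x_min) / (n_levels - 1)   # uniform quantization step
    z = int(round(-x_min / delta))             # zero-point: the integer mapped to 0.0
    x_int = np.round(x / delta) + z
    x_q = np.clip(x_int, 0, n_levels - 1).astype(np.uint8)
    return x_q, delta, z

def affine_dequantize(x_q, delta, z):
    return (x_q.astype(np.float32) - z) * delta   # Eq. (17)

w = np.random.randn(6).astype(np.float32)
# The range is relaxed to include 0, as described above.
w_q, delta, z = affine_quantize(w, min(float(w.min()), 0.0), max(float(w.max()), 0.0))
print(w)
print(affine_dequantize(w_q, delta, z))           # within delta/2 of the original
```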
The symmetric quantizer is a simplified version of the affine quantizer,
restricting z to 0. Eqs. (14) and (15) then become:
 
x_{int} = \text{round}\left(\frac{x}{\Delta}\right)    (18)
x_Q = \text{clamp}(-N_{levels}/2,\, N_{levels}/2 - 1,\, x_{int}) \quad \text{if signed}    (19)
x_Q = \text{clamp}(0,\, N_{levels} - 1,\, x_{int}) \quad \text{if unsigned}    (20)
with the reverse operation becoming:
x_{float} = x_Q\,\Delta    (21)
The stochastic quantizer adds noise to float numbers before rounding them
to integer [16]. The quantization is then performed in the following way:
x_{int} = \text{round}\left(\frac{x + \epsilon}{\Delta}\right) + z, \quad \epsilon \sim \text{Unif}(-1/2, 1/2)    (22)
x_Q = \text{clamp}(0,\, N_{levels} - 1,\, x_{int})    (23)
with its reverse operation being (17). Although this quantization technique
can be shown to yield improved accuracy for the same precision, it requires
to generate a new random number for each quantization operation. When
this has to be done at runtime, on the output activations of a layer, this adds a
considerable overhead to the inference phase. Therefore, stochastic quanti-
zation is generally used only for data that can be quantized offline, such as
model weights [58, 59].
Most real DNN weights and activations are not distributed uniformly.
Therefore, nonuniform quantization might yield better representations [9,
60]. In particular, having finer levels close to zero would be ideal, since ten-
sor values distributions tend to be bell-shaped [60]. On the other hand, given
that most inference HW platforms are based on uniform number represen-
tations for integers, nonuniform quantizers have to be carefully designed to
be made hardware-friendly [61].
One popular nonuniform quantization approach is based on a logarithmic
number system. This yields the desired finer granularity for smaller
magnitude data, while grouping “outliers” in few levels [62]. In particular,
using base-2 powers for levels yields a very efficient hardware imple-
mentation of logarithmic quantization, allowing multiplications to be
implemented as bitwise shifts [63, 64].
The quantizers described above (both uniform and nonuniform) can be
applied with different granularities to a DNN. The three most common
granularities are:
• Network-wise: this is the simplest approach, with a single bit-width used
for the whole network. Range and zero-value parameters of quantizers
might still be different for different tensors (e.g., different layers' weights).
This is also the easiest solution to support from a HW point of view, since
data widths remain constant throughout the inference, simplifying both
memory accesses and MAC operations. However, for a given accuracy
level, network-wise quantization might lead to significantly larger total
weights and activation sizes compared to finer granularity approaches.
• Layer-wise: in this approach, each layer has a different quantizer, handling
not only different ranges but also possibly different precisions (i.e., bit-
widths). While this approach usually outperforms the previous one, it
requires a more complex support from the underlying hardware.
• Channel-wise: this is an extension of the layer-wise approach tailored for
convolutional neural networks. It consists of adapting the parameters of
the quantizer to each convolutional kernel in a tensor. This strategy usu-
ally leads to an improved final accuracy, although it can only be applied to
weights. In fact, channel-wise quantization of activations would greatly
complicate data alignment for MAC operations in Conv layers [16].
Regardless of the quantization granularity, one important detail to underline is
that weights and activations quantization are significantly different at runtime.
In fact, while weights can be quantized offline, once and for all, the output of
large MAC loops during inference must be requantized at runtime by the
inference platform to produce new activations [65].

7.1.2 Quantization and training


The straightforward way to apply quantization to a DNN is to do it post-
training, i.e., applying quantizers to an already trained model. This requires
minimal setup, while leading to significant energy savings. The main
operation required to implement posttraining quantization consists in
determining the value range for each group of tensors to be quantized, i.e.,
xmin and xmax. For model weights, this range can be obtained immediately,
since their value is fixed posttraining. Quantizing activations, instead,
requires a calibration phase to determine data ranges to be used at runtime.
This is normally done running forward passes on a subset of the dataset and
storing the maximum and minimum values encountered for each activa-
tion tensor. This does not guarantee to find the absolute minimum and
maximum values assumed by a given tensor, but is often sufficient.
Values exceeding the expected range will be simply clipped by the clamp()
function in (15), (19), and (23). The risk of clipping is partially mitigated
by batch normalization, which keeps the values in a stable interval [16].
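The calibration step can be sketched as a simple min/max observer attached to a layer's output (illustrative code; the layer output is replaced by random data here):

```python
import numpy as np

class RangeObserver:
    # Records the minimum and maximum values seen by an activation tensor
    # during calibration; the range is then used to derive delta and z.
    def __init__(self):
        self.x_min, self.x_max = np.inf, -np.inf

    def observe(self, tensor):
        self.x_min = min(self.x_min, float(tensor.min()))
        self.x_max = max(self.x_max, float(tensor.max()))

observer = RangeObserver()
for _ in range(32):                       # forward passes on a small calibration subset
    activations = np.random.randn(64)     # placeholder for one layer's output tensor
    observer.observe(activations)
print(observer.x_min, observer.x_max)     # values outside this range will be clamped
```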
Posttraining quantization does not require a time-consuming training,
nor the availability of training data, except the few samples used for calibra-
tion. However, it may cause noticeable accuracy degradations on complex
tasks [66], although networks with more parameters tend to be less affected
[16].
A very effective solution to cope with these accuracy degradations is
quantization-aware training. In this approach, quantization is simulated while
the model is being trained, thus giving the DNN the opportunity to learn
how to compensate for the loss of precision. Clearly, the drawback of this
approach is that it requires an expensive training run and is only feasible
when training data are available.
During quantization-aware training, weights (and activations) are “fake-
quantized”. This means that their values are rounded in the forward pass, to
mimic low precision, but internal computations are still performed in floating
point. In the backward pass, quantized operations can be dealt with in differ-
ent ways. One classical approach is to approximate them with a straight-
through estimator [67]. This ensures that DNN outputs are produced as if data
had been quantized, while still allowing the back-propagation of small gradi-
ents, which is fundamental for training convergence. At inference time, fake
quantization operations are then removed, and the model uses actual integer
weights and activations.
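A minimal PyTorch sketch of fake quantization with a straight-through estimator, using the symmetric quantizer of Eqs. (18)–(19) for simplicity; the quantization step Δ is fixed here, whereas in practice it is derived from the estimated ranges:

```python
import torch

class FakeQuantize(torch.autograd.Function):
    # Forward pass: simulate the symmetric quantizer of Eqs. (18)-(19) by
    # rounding to the integer grid and mapping back to floating point.
    @staticmethod
    def forward(ctx, x, delta, n_levels):
        x_q = torch.clamp(torch.round(x / delta), -n_levels // 2, n_levels // 2 - 1)
        return x_q * delta

    # Backward pass: straight-through estimator, i.e., the rounding is treated
    # as the identity function and gradients flow through unchanged.
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None, None

w = torch.randn(4, requires_grad=True)
w_fq = FakeQuantize.apply(w, 0.1, 256)   # "fake-quantized" weights used in the forward pass
w_fq.sum().backward()
print(w.grad)                            # all ones: the STE lets gradients through
```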
Quantization-aware training also requires a different procedure for esti-
mating ranges of tensors. For example, since activation values change
depending on the input, in [68] it is proposed to use an exponential moving
average to estimate their range during training. Moreover, in order to avoid
rapidly shifting activation values, their quantization is usually performed
only after a considerable number of initial training steps, so that the network
has reached a more stable state [68].

Besides the aforementioned BatchNorm layers, there are also other DNN
architecture elements that can favor the application of quantization, espe-
cially during training. An important one is the use of bounded activation func-
tions, such as the PArametrized Clipping acTivation (PACT) [69]. PACT is a
bounded ReLU variant that follows this equation:
\text{pact}(x) = \begin{cases} 0 & x \le 0 \\ x & 0 \le x \le k \\ k & x \ge k \end{cases}    (24)
where k is a parameter learned during training. Clipping the maximum
activation output to k has been shown to yield significant improvements
in accuracy for a given quantization precision [69].
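A possible PyTorch sketch of the PACT activation of Eq. (24), with the clipping level k as a learnable parameter (the initial value is illustrative):

```python
import torch
import torch.nn as nn

class PACT(nn.Module):
    # Bounded ReLU of Eq. (24); the clipping level k is a learnable parameter,
    # so training can trade dynamic range for quantization resolution.
    def __init__(self, k_init=6.0):
        super().__init__()
        self.k = nn.Parameter(torch.tensor(k_init))

    def forward(self, x):
        return torch.min(torch.relu(x), self.k)  # 0 for x <= 0, x in [0, k], k above

act = PACT()
x = torch.linspace(-2.0, 10.0, steps=7, requires_grad=True)
y = act(x)
print(y)               # values above k are clipped to k = 6.0
y.sum().backward()
print(act.k.grad)      # nonzero: k is updated by back-propagation like any weight
```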

7.1.3 Binarization
Binarization is an extreme form of quantization, in which precision is
reduced to 1 bit. Initially focusing only on weights [70], this technique
has then been extended also to activations [59]. The most common form
of binarization consists in using binary values 0 and 1 to represent integer
values −1 and +1, respectively. The conversion of a floating point number
to this format is simply obtained with the sign() function.
Binarization of both weights and activations yields extreme complexity
reductions for DNN inference, since MAC operations can be completely
eliminated and replaced by binary operations [59, 70, 71]. In particular, using
the aforementioned semantic for binary values, multiplications can be replaced
by bitwise XNORs, as shown in Table 1. The accumulation of the binarized
elements of a tensor X with N elements, instead, can be computed as follows:
s = 2 \cdot \text{popcount}(X) - N    (25)
where popcount() is a function that counts the number of bits at 1 in X.

Table 1 Equivalence between MUL and XNOR (⊙) when the values −1
and 1 are represented by the binary 0 and 1.
x1_val   x2_val   x1_binary   x2_binary   x1_val * x2_val   x1_binary ⊙ x2_binary
 −1       −1        0           0             1                  1
 −1        1        0           1            −1                  0
  1       −1        1           0            −1                  0
  1        1        1           1             1                  1
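A minimal sketch of a binarized dot product using the encoding of Table 1: multiplications become XNORs and the accumulation follows Eq. (25) (popcount is emulated in software here; on suitable hardware it is a single instruction):

```python
import numpy as np

def binarize(x):
    # sign() encoding of Table 1: value -1 -> bit 0, value +1 -> bit 1.
    return (x >= 0).astype(np.uint8)

def binary_dot(a_bits, b_bits):
    # Multiplications become bitwise XNORs; the accumulation of Eq. (25)
    # is 2 * popcount(XNOR result) - N (popcount emulated by sum() here).
    n = a_bits.size
    xnor = (~(a_bits ^ b_bits)) & 1
    return 2 * int(xnor.sum()) - n

a, b = np.random.randn(16), np.random.randn(16)
print(binary_dot(binarize(a), binarize(b)))   # XNOR/popcount result
print(int(np.dot(np.sign(a), np.sign(b))))    # matches the +/-1 dot product
```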
The dramatic impact of binarization on model size and operations
complexity is paid with significant accuracy degradations for complex tasks
(e.g., ImageNet classification [50]). However, this approach is of extreme
interest for simpler tasks, such as hand-written digits classification and
human activity recognition [71, 72].

7.1.4 Benefits of quantization


Quantization brings significant advantages in terms of memory occupation,
speed and energy consumption. First and foremost, low-precision data not
only reduces the storage occupation of the DNN, but even more impor-
tantly, it limits the memory bandwidth required to bring weights and acti-
vations on-chip, which is often the dominant contributor to inference time
and energy [13]. Moreover, integer operations are also faster and more efficient
than floating point ones on virtually all hardware platforms.
Both energy and speed should ideally improve at least linearly with
respect to bit-width [9]. However, while this is true for what concerns
memory bandwidth and energy, the real trend for inference speedup and
total energy is often very different, depending on the target platform. In
particular, on general purpose hardware such as CPUs and MCUs, the linear
trend definitely stops for subbyte quantization. For these platforms, in fact,
bytes are the atomic load and store elements, and lower precision
quantized data have to be “packed” together in a single memory location.
As a consequence, MAC operations on subbyte values require a set of
additional operations to extract packed data into separate registers and then
perform the reverse operation on results. This overhead greatly affects the
speed of inference, so that, for example, 4-bit quantized DNNs are often
slower than 8-bit ones [25]. Furthermore, some popular CPU architectures,
such as ARM Cortex-M, only support 16-bit Single Instruction Multiple
Data (SIMD) MAC instructions. Therefore, even 8-bit values have to
be moved to 16-bit registers and sign-extended before executing a MAC.
Despite thorough ISA-dependent optimizations [18], this overhead
inevitably limits the speedup and energy gain obtained at 8-bit or lower
quantization.
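The following NumPy sketch illustrates where this packing overhead comes from: two signed 4-bit values share each byte, so they must be extracted and sign-extended before every MAC. Function names and the nibble layout are illustrative.

import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack pairs of signed 4-bit values (range [-8, 7]) into single bytes."""
    v = (vals.astype(np.int16) & 0x0F).reshape(-1, 2)
    return (v[:, 0] | (v[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Extract both nibbles and sign-extend them: the extra work paid before each MAC."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    nibbles = np.stack([lo, hi], axis=1).reshape(-1)
    return np.where(nibbles >= 8, nibbles - 16, nibbles)

w = np.array([-3, 5, 7, -8], dtype=np.int8)
assert (unpack_int4(pack_int4(w)) == w).all()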
The scenario is clearly different for custom hardware accelerators. Indeed,
accelerator architectures specifically designed to handle subbyte quantizations
have been proposed in literature. These designs avoid unpacking quantized
values and thus benefit from the full gains achievable thanks to quantization.
One notable example is the Envision accelerator [35], which combines
quantization with voltage and frequency scaling to achieve much more than
linear savings down to 4-bit.
Contrary to other types of quantization, binarization does not just reduce
the precision of operations, but radically changes them. Therefore, as antic-
ipated, the benefits that can be derived from it are even higher than the intu-
itive 32x reduction in model size and memory bandwidth compared to
floating point. However, obtaining these gains on general purpose hardware
is again not trivial, mainly because commercial CPUs do not offer an efficient
way to implement the popcount() operation. In contrast, recent academic pro-
cessor platforms [25] have added dedicated hardware and a corresponding
instruction for popcount. Custom accelerators for binary NNs are also quite
explored, due to their extreme compactness and efficiency [36, 73].
Regardless of the target hardware and precision, a considerable advan-
tage of quantization lies in its orthogonality to the DNN architecture.
Indeed, although there are architectural elements that favor its application,
such as BatchNorm and bounded activations, quantization does not require
any particular model characteristic in order to work. This is one of the main
reasons why this technique has become popular and is now widely supported
by the major deep learning frameworks, both in its training-aware and post-
training forms [21, 22], making it easier to implement for developers.
Clearly, what does depend on the platform are the energy and time benefits
of quantization, as described above. Therefore, the same quantized model
might lead to very different efficiency on two different hardware targets.
Quantization has proven very successful on convolutional neural
networks, allowing the precision to be reduced significantly with negligible
accuracy loss [16, 71]. However, sequential models are a much harder chal-
lenge [74–76]. While the research on this type of networks has not been as
extensive as the one on CNNs, current results are definitely less outstanding.
In particular, while the easiest tasks and datasets benefit from quantization,
the accuracy deterioration is noticeable on harder ones [75].

7.2 Pruning
It has been known for some time that, due to their overparametrization,
deep learning models can tolerate high levels of sparsity in their weights
[77]. This means that a large portion of the weights can assume value 0, while
still producing accurate results. Furthermore, modern DNN activations are
also inherently sparse, due to the use of functions such as ReLU (see Fig. 9),
which turn all negative inputs to 0.

Fig. 9 Activations sparsity deriving from the application of a ReLU, one of the most fre-
quently used activation functions for DNN hidden layers.

Fig. 10 The common weight pruning workflow.

Sparsity can be exploited to optimize the memory requirements of a DNN
model by compressing it, which is particularly useful in memory-constrained
edge devices. Moreover, if the target hardware offers the required support,
sparsity can also be exploited by recognizing and skipping operations (i.e.,
MACs) on zero-valued weights/activations, thus improving inference speed
and energy [9].
The sparsity of a model, and in particular of its weights, can be artificially
increased by so-called pruning algorithms. These techniques identify the
“least important” weights of a model and replace them with 0, in order
to increase sparsity with the least possible impact on DNN accuracy.

7.2.1 Pruning algorithms


The great majority of pruning algorithms operate after an initial standard
training, and iteratively eliminate connections and fine-tune the model in
order to recover the drop in accuracy, as shown in Fig. 10. In particular,
one of the first published techniques was based on eliminating weights with
the smallest saliency, i.e., the smallest impact on the training loss [77]. The
process was repeated until the desired weight reduction or accuracy were
reached.
Unfortunately, the computation of weights saliency has become too
expensive for modern DNNs, due to their increasing depth and total number
of parameters, thus giving birth to a new family of so-called magnitude-based
pruning approaches [78]. This family of techniques prunes the weights sim-
ply according to their magnitude, under the assumption that the smallest
weights have the least impact on accuracy. Clearly, computing the
magnitude of all weights is much simpler than evaluating their saliency, thus
making these approaches much more computationally efficient at training
time. With magnitude-based pruning, the majority of the weights that can
be safely pruned with negligible impact on the final accuracy is found in fully
connected layers.
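A minimal sketch of one magnitude-based pruning step is shown below (assuming PyTorch; production flows, e.g., torch.nn.utils.prune, additionally keep an explicit mask and alternate pruning with fine-tuning as in Fig. 10).

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    # Threshold = k-th smallest absolute value (ties at the threshold are also pruned)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

# Example: prune 75% of a fully connected layer's weight matrix
w = torch.randn(128, 256)
w_pruned = magnitude_prune(w, sparsity=0.75)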
Both magnitude- and saliency-based pruning simply try to maximize the
number of 0-weights while minimizing the accuracy drop. Other works,
however, have shown that this does not always correspond to the optimal
solution when the target is energy minimization. For instance, the authors
of [79] have shown that, in classical models such as AlexNet [6], most of the
energy for inference is consumed by convolutional layers, and not by fully
connected ones. Therefore, they have introduced energy-driven pruning, in
which the energy impact of each weight is estimated to select the optimal
pruning location, based on the consumption of different layers [79].

7.2.2 Benefits of pruning


The immediate benefit that can be derived from pruning techniques is a
reduction in the total storage size of a DNN, obtained by storing sparse
model parameters in compressed formats. By itself, however, compression does
not reduce the inference energy consumption. Vice versa, it might actually
increase it, due to the need of decompressing weights after loading them.
Therefore, compression formats specifically tailored for DNN inference
have been proposed, which try to simultaneously provide a low-cost
decoding algorithm and a way to exploit sparsity for skipping computations.
One simple format is the compressed sparse row (CSR) [80]. CSR uses
three vectors to store the nonzero values of a DNN weights matrix, and to
recover their original location, as shown in Fig. 11. In particular, the low-
ermost vector contains the values of all the nonzero elements, while the mid-
dle one stores the indexes of these elements in the corresponding matrix
row. Finally, the topmost vector contains pointers to the locations of the
other two arrays where each matrix row starts. To clarify, in the example
of the figure, the topmost vector is interpreted as follows:

Fig. 11 Compressed sparse row (CSR) format.



• Nonzero elements relative to the 1st matrix row start at index 0 of the
other two arrays and end before index 2.
• Elements relative to the 2nd matrix row start at index 2 and end before
index 2 (meaning that this row does not contain any nonzero value).
• Elements relative to the 3rd matrix row start at index 2 and end before
index 4.
CSR decoding is efficient if the matrix is read in row-major order. In fact,
row pointers can be accessed in constant time based on the row index, and
reconstructing the entire row is linear in the number of nonzero elements.
The problem of this format, however, occurs when trying to skip compu-
tations related to zero-weights, and in particular when accessing the activa-
tions tensor with which the sparse matrix is multiplied [81]. In fact, since
each CSR matrix row has nonzero elements in different positions, the
corresponding activations must be accessed multiple times with a sparse pat-
tern. Alternatively, the whole vector has to be loaded at once, but depending
on its size, this might not be feasible for memory-constrained devices [80].
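The loop-based sketch below shows how the three CSR vectors are consumed in a row-major matrix-vector product, and where the scattered accesses to the activation vector x discussed above originate.

import numpy as np

def csr_matvec(row_ptr, col_idx, values, x):
    """y = W @ x, with W stored in CSR (row pointers, column indices, nonzero values)."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=x.dtype)
    for r in range(n_rows):
        # The nonzeros of row r live in values[row_ptr[r]:row_ptr[r + 1]];
        # an empty row simply has row_ptr[r] == row_ptr[r + 1].
        for i in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[i] * x[col_idx[i]]  # sparse, irregular accesses to x
    return y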
Compressed sparse column (CSC) solves this problem using the opposite
approach with respect to CSR, i.e., storing row indices and column
pointers. This allows to read the matrix by column when performing mul-
tiplications, and eliminates the problem of multiple accesses to the input acti-
vations. In fact, in a matrix-vector product, each matrix column is multiplied
with the same input element, thus the input activations are guaranteed to be
read at most once and in order. However, it creates an analogous problem
for the output vector, which has to be either stored as a whole at the end of
the product, or accessed multiple times in a sparse way [80]. Nonetheless,
CSC becomes preferable to CSR when the size of the output is smaller than
the size of the input [80], which is often true for deep learning models.
Both CSC and CSR matrices are built based on the output of unstructured
pruning algorithms, in which weights can be zeroed-out at arbitrary loca-
tions in the weights matrix. This complicates the optimization from a hard-
ware point of view. In fact, virtually all hardware platforms for deep learning
use some form of parallel computation, from SIMD/SIMT operations in
CPUs and GPUs, to systolic processing elements in accelerators [34].
With unstructured pruned formats such as CSC/CSR, however, each
atomic computation step (e.g., the multiplication of a portion of a weights
matrix row with a portion of the activation vector) may require a different
number of operations, depending on the number of nonzero values
involved. Therefore, if computations involving zero-values are skipped, it
becomes hard to fully exploit the available parallel hardware [80].

Structured pruning strategies and compression formats have been introduced to improve hardware utilization. These approaches use the same prun-
ing algorithms described above, but force zeroed-weights to respect certain
patterns. Typically, they constrain given portions of the matrix (e.g., fixed
size subsets of a row or a column) to contain exactly the same number of non-
zero weights. For CNNs, the easiest way to perform structured pruning con-
sists in eliminating entire convolutional filters from layers [82]. For fully
connected and sequential models, instead, more elaborate structured pruning
approaches are needed.
As a representative example, bank-balanced sparsity [83] is a structured
pruning algorithm that splits the weights in subrows (banks), letting each
bank have the same number of pruned weights (see Fig. 12). The resulting
matrix can then be stored in a format called compressed sparse banks (CSB), as
shown in Fig. 13. CSB uses only two vectors, storing the nonzero values and
their indexes in the bank respectively. As shown in the figure, elements are
re-arranged so that the first positions of the two vectors contain the first ele-
ments of each bank, and so on. Since each bank contains the same number of
elements, row pointers are not needed, as they can be automatically inferred
from the sparsity. Assuming that the entire activation vector can be loaded at

Fig. 12 An example of bank-balanced sparsity with a 50% pruning and 1  4 banks.

Fig. 13 An example of the compressed sparse banks (CSB) format.


278 Francesco Daghero et al.

one time and split into banks as well, this reorganization allows to perform
interbank parallelization, where weights from different banks are simulta-
neously multiplied with the corresponding activations. The fact that each
bank contains exactly the same number of elements ensures that parallel
hardware is fully utilized. This improved parallelism comes with minimal
costs in terms of accuracy compared to an unstructured CSR approach,
for the same sparsity level [83].
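A minimal NumPy sketch of the pruning step on a single row is given below; the bank size and the number of weights kept per bank are illustrative, and the full method of [83] also interleaves pruning with fine-tuning.

import numpy as np

def bank_balanced_prune(row: np.ndarray, bank_size: int, keep: int) -> np.ndarray:
    """Keep only the `keep` largest-magnitude weights inside every bank of a row."""
    pruned = row.copy()
    for start in range(0, len(row), bank_size):
        bank = pruned[start:start + bank_size]            # view into `pruned`
        drop = np.argsort(np.abs(bank))[:len(bank) - keep]
        bank[drop] = 0.0                                  # zero the smallest weights
    return pruned

# 1 x 4 banks with 50% sparsity, as in Fig. 12: every bank keeps exactly 2 weights
row = np.array([0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.0, 0.3])
print(bank_balanced_prune(row, bank_size=4, keep=2))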

7.3 Knowledge distillation


Knowledge distillation is a model compression method whose goal is deriving
small but highly accurate networks from ones with far larger sizes. These
models can be then deployed on the edge due to their reduced requirements
and sizes. Specifically, this approach consists of training a small network (stu-
dent) starting from one or more large pretrained DNNs (teachers) [84, 85].
In this scheme, the student learns directly from the teacher rather than just
from data.
As shown in Fig. 14, two different types of predictions are derived from
the networks: hard and soft. Both are typically obtained with a softmax on the
output layer (see Eq. 5). However, hard predictions are obtained setting T to
1, as for standard classification tasks, while soft predictions are obtained with
T > 1.
Distillation is then performed by training the student network on a reduced dataset using two separate losses, as shown in Fig. 14.

Fig. 14 A simplified overview of knowledge distillation.

The student loss measures the difference between the student network's hard predictions and
the reduced dataset’s true labels. The teacher loss, instead, measures the dis-
tance between the soft predictions of the student and those of the teachers.
Therefore, it measures how different the predictions of the student are from
the ones of the bigger network. When teachers are an ensemble of models,
the teacher loss usually employs the geometric mean of their predictions.
This second loss is computed using soft predictions with T > 1 because the
higher temperature softens the output distribution, making it less peaked. The
resulting less extreme values are in fact more informative and can be more easily learned by the
student model [84].
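A common way of combining the two losses is sketched below in PyTorch; the temperature T, the weight alpha and the T^2 rescaling follow the usual formulation of [84], but the specific values are illustrative.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of the student (hard-label) loss and the teacher (soft) loss."""
    # Student loss: standard cross-entropy against the true labels (T = 1)
    hard = F.cross_entropy(student_logits, labels)
    # Teacher loss: distance between the softened distributions of student and teacher
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescaling keeps gradient magnitudes comparable across temperatures
    return alpha * hard + (1.0 - alpha) * soft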
Recent works have proposed variations to the basic architecture of Fig.
14, such as using multiple connections (or bridges) to enforce similar out-
puts between student and teacher at different layers, aside from the output
[86]. Specifically, some hidden layers from the teacher network are chosen
to “guide” the learning of others belonging to the student model. If the two
layer sizes are different, an additional linear layer is added in-between the
two to match the dimensions. More advanced applications of distillation
have also been proposed in literature [87, 88].
The impact of distillation on energy efficiency is evident. At inference
time, only the distilled student network will be used to process inputs, thus
significantly reducing both the model size and the number of operations per
input. However, the degree at which this network shrinking can be applied
clearly depends on the complexity of the problem. Recent works have
shown that when the sizes of student and teachers differ greatly, the perfor-
mance drops significantly [89, 90]. In fact, the student network can only
learn up to a certain extent from the teacher, becoming unable to mimic
networks with too many parameters [90]. To tackle this problem, an addi-
tional intermediate size network called teacher assistant has been proposed in
[89] in order to have a multiple-step distillation.
Network distillation has been shown to work equally effectively on both
feed-forward and sequential models. For example, a distilled version of
BERT [7], one of the state of the art NLP models, called DistilBERT
[19] has been recently proposed. It manages to obtain 97% of the accuracy
of the full model while performing inference 60% faster.

7.4 Collaborative inference


While the optimizations mentioned above and the availability of custom
hardware help running deep learning models entirely at the edge, for large
networks this may still be unfeasible due to memory and performance limi-
tations [26]. Therefore, an interesting research branch seeks for a compromise
between the benefits of edge and cloud computing by means of so-called col-
laborative inference, which consists in distributing the inference computation
among multiple devices (e.g., edge nodes and cloud servers) [91].
One basic form of collaborative inference is proposed in [93], where the
authors suggest to preprocess images used as inputs for a CNN at the edge,
before running the actual inference in the cloud. Specifically, they propose
to discard blurry images, since they will not be useful for the neural network,
thus reducing the total time and energy spent transmitting raw data. A more
advanced collaborative inference framework is presented in [91], where the
authors propose to split the execution of a CNN between edge and cloud
in a layer-wise fashion as shown in Fig. 15A. Specifically, the first layers are
executed on the edge device, while the last ones are computed in the cloud,
based on the observation that intermediate layers outputs (e.g., after a
pooling) are smaller in size compared to raw inputs in many DNNs.
Therefore, they show that computing a few layers at the edge and sending
the resulting activations to the cloud is often the optimal approach in terms of balance between computation and transmission time and/or energy.

Fig. 15 An overview of Neurosurgeon [91] (A) and BottleNet [92] (B), two collaborative inference frameworks. Both compute the first layers locally, transmitting their output to the cloud where the final result is calculated and then sent back.


The best split point for a given DNN is found at runtime, based on the
edge-cloud connectivity conditions and on the load of the servers.
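A simplified sketch of this latency-driven partitioning is shown below (not the actual Neurosurgeon implementation): the per-layer edge/cloud latencies and output sizes would be obtained from offline profiling or prediction models, and the same search can be run with energy figures instead of latencies.

def choose_split(edge_lat, cloud_lat, out_bytes, bandwidth, input_bytes):
    """Return the split index s minimizing total latency: layers [0, s) run on the
    edge device, layers [s, N) in the cloud, and the data at the split is transmitted."""
    n = len(edge_lat)
    best_s, best_lat = 0, float("inf")
    for s in range(n + 1):
        # Data sent to the cloud: raw input if s == 0, else activations of layer s - 1
        tx = (input_bytes if s == 0 else out_bytes[s - 1]) / bandwidth
        total = sum(edge_lat[:s]) + tx + sum(cloud_lat[s:])
        if total < best_lat:
            best_s, best_lat = s, total
    return best_s, best_lat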
An extension of this approach is proposed in [92], where the DNN
architecture is slightly modified to make layer-wise partitioning even more
convenient, as shown in Fig. 15B. In particular, a pair of so-called reduction
(compression) and restoration (decompression) layers are added at the
selected DNN split point, in order to further reduce the size of the transmit-
ted data. Compression and decompression may use standard algorithms such
as JPEG. At training time, similar to quantization, they are approximated by
a straight-through estimator. Finally, the work of [94] considers the case of
multiple split points, for DNNs where feature sizes are not monotonically
decreasing, such as autoencoders.
As an alternative to layer-wise partitioning, other authors propose to
distribute the inference among a number of small IoT devices, each of
which processes only a part of the input (e.g., some rows of an image),
since it would not be able to handle a full layer [95]. However, this scheme
introduces an additional data dependency. It is in fact necessary to have the
results of adjacent partitions before being able to compute the following
layer.
In [96] a three-level hierarchical framework for deep learning applica-
tions that process data from multiple sources (e.g., multiple sensors) is pro-
posed. In this solution, sensors, edge servers and cloud perform a separate
inference on the locally available data, aggregating the results of the previous
level and forwarding theirs to the following one. This approach drastically
reduces the volume of data transmitted, saving energy and reducing the
latency, but may affect the accuracy. In fact, “higher-level” devices only
have access to the final aggregated outputs of lower-level inference.
Therefore, the authors of [43] propose an evolution of this approach, in
which the architecture of a single DNN is modified to favor distributed
processing for multiple-source tasks. In particular, the first layers of this
DNN process the data from each source separately (i.e., there are no weights
connecting features relative to different sources). These layers are processed
locally by each device, which then transmits the result to the cloud. There,
features from multiple sources are concatenated before executing the
remaining layers. With respect to the previous approach, using a single
DNN enables to train the entire system with an end-to-end approach based
on standard back-propagation, which improves the accuracy.

7.5 Limitations of static optimizations


Static design time optimizations of deep learning models have been exten-
sively studied due to their effectiveness in reducing the time, memory, and
energy requirements for inference. However, recent works have pointed
out that static optimizations may be suboptimal in many applications [27,
97–103]. In particular, the main limitation of these approaches comes from
the fact that they are input-independent: that is, since optimizations are fixed
at design time, they cannot be tuned based on the currently processed input. In
contrast, for many realistic applications, inputs are not all equally “difficult” to
process for a DNN. As an intuitive example, for an image classification task, a
blurry image where the subject is small compared to the frame and has been
captured from an unconventional angle might be much more difficult to clas-
sify than one where the subject is clear, large and well-positioned in the center
of the frame. Similarly, a long and ambiguous sentence might represent a
much more challenging task for a translation model than the sentence:
“The cat is near the window.”
In these scenarios, an aggressively optimized network (e.g., one using a low
bit-width quantization or a high-sparsity pruning) will likely misclassify dif-
ficult inputs, whereas a less aggressive optimization would cause unnecessary
time, memory and energy wastes for easy inputs. This has spurred the birth
of optimization strategies that permit the tuning of the complexity versus
accuracy trade-off at runtime, depending on the currently processed input.
These strategies will be analyzed in the next section.

8. Dynamic (input-dependent) optimizations for deep learning inference at the edge

The limitations of static optimizations have led to the development of dynamic (or adaptive) deep learning techniques, where inference execu-
tion time and energy consumption (and the corresponding accuracy) can be
tuned at runtime instead of design time. Such tuning can be implemented in
a wide variety of ways, ranging from using two or more completely separate
DNNs based on the difficulty of the input [97, 98], to a different quantiza-
tion bit-width [27], a different number of layers [99, 100], or a different
hardware platform [101]. In the following, we review the main research
directions in this sense, focusing in particular on approaches that target edge
inference.

8.1 Ensemble learning


Ensemble learning is typically viewed as an approach that exploits multiple
machine learning models to improve accuracy. However, in recent years,
a new sort of ensemble learning has been proposed, where two or more
models are instead used to tune the trade-off between inference complexity
and accuracy at runtime [97, 98]. One of the first embodiments of this idea
are big/little DNNs, shown in Fig. 16 [98]. This approach is based on using
two networks of different size at inference time for a classification task. The
“little” model is always executed first, and its output is evaluated by the suc-
cess checker block, which estimates its classification confidence. If the confi-
dence exceeds a threshold, meaning that the little DNN was “sure” about
its prediction, the inference is stopped and the prediction is simply for-
warded to the output. In the opposite case, the “big” model is run on
the input, and its output is used for the final classification.
The entire big/little scheme is based on the assumption that, for most
applications, easy inputs are more frequent than hard ones [98]. Under this
assumption, for the majority of inputs, only the “little” DNN will be exe-
cuted. This allows to absorb the overheads accumulated in the opposite
cases, i.e., when both models are run on the same input, clearly causing a
larger execution time and energy consumption compared to using only
the “big” model, for the same accuracy. In other words, big/little schemes
yield energy savings only as long as “big” models are rarely activated.
Clearly, the function performed by the success checker, which controls
the activation of the second network, plays a fundamental role. If not carefully
tuned, this block could activate the “big” network even when the “little” had
classified correctly, or vice versa, it could label wrong classifications from the first network as correct. In the first case, this would result in an energy waste, while in the second it would lower the accuracy of the system.

Fig. 16 General scheme of a big/little DNN.
The score margin [98] has been proposed as an effective classification con-
fidence estimate, based on the class probabilities produced by the network’s
output layer. Specifically, the score margin computes the difference between
the largest two of these probabilities. If this difference is large, it means that
the network produced a high probability only for one class, and therefore it is
highly confident that the input belongs to it. On the other hand, a small dif-
ference means that there are at least two classes to which the input could
belong with similar probability, according to the prediction of the DNN,
hence the confidence is low. In summary, the score margin method activates
the “big” network using the following equation:

$$p_{\mathrm{largest}} - p_{\mathrm{2nd\ largest}} < th \qquad (26)$$

that is, whenever the difference between the top-2 probabilities is smaller than
a threshold th. Finding the ideal threshold for a given accuracy level is not an
easy task. The authors of [98] propose to use a fine-tuning step to find the best
th for a given pair of networks and a dataset [98]. Alternatively [27], th can also
be tuned at runtime based on external conditions, e.g., increasing it when the
battery level is low to save more energy.
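The resulting inference flow can be sketched as follows (PyTorch-style, assuming a single input and an illustrative threshold value):

import torch.nn.functional as F

def big_little_predict(x, little_model, big_model, th=0.2):
    """Run the little DNN first; invoke the big one only when confidence is low."""
    probs = F.softmax(little_model(x), dim=-1)
    top2 = probs.topk(2, dim=-1).values
    margin = (top2[..., 0] - top2[..., 1]).item()     # score margin of Eq. (26)
    if margin >= th:            # the little DNN is confident: keep its prediction
        return probs.argmax(dim=-1)
    return big_model(x).argmax(dim=-1)                # otherwise run the big DNN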
The score margin method is effective as long as the “little” model’s out-
puts actually resemble the probability of an input belonging to a given class,
which is not guaranteed for black-box models such as DNNs. In particular,
it has been shown that modern DNNs estimate confidence probabilities less
reliably [104]. In fact, their increased depth positively impacts their accuracy,
but negatively affects their capability to predict the likelihood with which a
given input belongs to a class. In practice, modern models tend to be over-
confident even for inputs that are not actually classified correctly. For a big/
little system, this makes it harder to find the optimal score margin threshold.
To mitigate this problem, one solution is to use so-called calibration tech-
niques, that make DNN scores more similar to actual probabilities, at the
cost of some accuracy degradation [104].
The major disadvantage of big/little DNNs, however, is that they
require a double effort at training time and they result in an increased model
size. It is in fact necessary to separately train two different models, each one
with its own hyper-parameters to be selected. After training, then, the
weights of both models have to be stored on the inference device, which
might not be possible on memory-constrained edge platforms.

Fig. 17 Simplified view of the dynamic inference technique proposed in [97]. The black
portion of the network is executed for all inputs, while the gray part is only activated for
difficult data.

One solution to the storage size issue consists in building an ensemble in which the “little” network is a part of the “big” one, as shown in Fig. 17.
This approach activates a portion of each layer (shown in black in the figure)
for all inputs, whereas the remaining computations (gray part) are only per-
formed for difficult data. Layers can be split at the level of individual neurons,
as shown in the figure for fully connected architectures, while channel-wise
partitioning is the approach proposed in [97] for CNNs. In this way, there is
no need for two separate sets of weights (and inference executions) for the
“little” and “big” models, since the latter reuses the weights and computa-
tions of the former. Moreover, the scheme can be easily extended to more
than two “submodels”. Clearly, building a network like the one of Fig. 17
requires a custom training algorithm. Specifically, the DNN has to be
trained incrementally, starting from the smallest submodel (the black part
in Fig. 17). Initially, that portion has to be trained by itself, while disabling
the rest of the model. Then, the next set of neurons/channels has to be
added and trained, while keeping the weights of the previous portion fixed,
and so on.
The work of [27] has proposed a similar way to trade-off complexity and
accuracy using multiple “versions” of a DNN. In this case, however, the
only difference between the versions is the bit-width used for quantization.
The authors perform an offline characterization to identify the set of relevant
quantization configurations for a given DNN, i.e., those that yield an inter-
esting trade-off in terms of accuracy and energy. Then, a single posttraining
quantization is performed, targeting the largest bit-width, and lower preci-
sion weights are simply obtained by truncation/rounding. This removes the
need for multiple sets of weights, while also not requiring any training.
Therefore, the approach is also applicable when training data are not avail-
able. Moreover, it is complementary to the previous one and can be used in
conjunction with it to generate even more variants of the same DNN.

8.2 Conditional inference and fast exiting


The term conditional inference indicates a family of dynamic optimization
techniques that change the portion of the DNN graph which is executed
at runtime, depending on the input. This approach is based on the observa-
tion that neural networks are trained to learn complex nonlinear decision
boundaries, in order to correctly classify as many inputs as possible. While
this is indeed required for hard inputs, easy inputs often need a much simpler
decision boundary, and therefore a much less deep network is sufficient to
correctly classify them. In order to save energy and time, part of the DNN
graph can then be “switched off” during inference, when an input is detected
as easy. One popular form of conditional inference is the so-called fast exiting,
in which the execution of a DNN is stopped early, avoiding the processing of
the last layers.
BranchyNet [99] implements conditional inference adding branches to
CNNs. Each branch is a possible exit point, which allows to complete
the inference using only a portion of the layers (see Fig. 18 for an example
with two branches). After each branch, the entropy of the softmax output is
calculated to estimate the classification confidence. Being a measure of the
“uncertainty” of a distribution, a larger entropy indicates a lower confi-
dence. Based on a threshold on the entropy, the execution is either stopped
at that branch or continued.
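The exit policy can be sketched as follows (PyTorch-style, single input; the per-branch entropy thresholds are hyper-parameters, tuned as discussed below):

import torch
import torch.nn.functional as F

def entropy(probs: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of a probability vector: higher entropy = lower confidence."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def early_exit_forward(x, backbone_blocks, exit_heads, thresholds):
    """Run the backbone block by block, stopping at the first confident branch."""
    probs = None
    for block, head, th in zip(backbone_blocks, exit_heads, thresholds):
        x = block(x)
        probs = F.softmax(head(x), dim=-1)
        if entropy(probs).item() < th:        # confident enough: exit early
            break
    return probs.argmax(dim=-1)               # deepest evaluated branch otherwise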
Similar to the “big/little” approach, BranchyNet reduces the overall
energy consumption and number of computations as long as the full model
or the deeper branches are rarely used. The authors of [99] suggest inserting
branches deeper in the network for more difficult datasets. Easier tasks will in
fact benefit greatly from branching earlier. Furthermore, the lateral part of
each branch (i.e., the one not shared with the main model) may be com-
posed of more than one layer, for example including one or more Conv
layers before the final fully connected classifier. While this increases the number of computations, it has been shown to improve the accuracy of the branch as well [99].

Fig. 18 An example of the BranchyNet architecture proposed in [99].

As for the score margin in big/little systems, the entropy
threshold on which this approach is based to decide whether to stop the exe-
cution or not usually requires a fine-tuning step, exploring the effects of dif-
ferent values on the final accuracy and energy. The network and its branches
require a custom training procedure, in which the loss function is a
weighted sum of the losses computed at each branch. In the forward pass the output
of each branch is calculated and then used to update the weights with the
standard backward pass. However, this procedure can be optionally based
on a pretrained network with fixed weights for the main model (Branch
B in Fig. 18) which represents the “backbone” of the system. This makes
the training of other branches significantly faster. The main drawback of
BranchyNet is an increase in the memory occupation of the network,
although smaller than for big/little architectures, due to the new weight ten-
sors of the lateral layers [99].
A different approach to conditional inference is proposed in SkipNet
[100]. In this case, instead of adding branches, some of the layers of the
DNN are optionally skipped for easy inputs. For each set of “skippable” layers,
so-called gates are added to the base network. Gates are neural networks them-
selves, although significantly smaller in size compared to the corresponding set
of main network layers. They are run first, and based on their output, a policy
decides whether to execute or skip the corresponding portion of the main net-
work. Specifically, such a policy is based on a combination of supervised and reinforcement learning, and tries to find an acceptable trade-off between the prediction accuracy and the number of layers skipped [100]. The SkipNet sys-
tem requires a custom training procedure, which includes a first pretraining
phase, in which the gate outputs are allowed to assume continuous values.
This enables the gates’ weights to start from suitable values before the start
of the reinforcement learning phase, in which the network is updated to opti-
mize the aforementioned policy. During inference, the model is then able to
turn off layers depending on the complexity of the input, as shown in Fig. 19,
saving computations with a negligible accuracy loss.
Conditional inference has been also applied to sequential models in
[105], but with more modest results than those obtained with feed-forward
networks.

8.3 Hierarchical inference


Hierarchical (or staged) inference is an application of a divide et impera
paradigm to DNN execution. An inference task (e.g., a classification) is split
into multiple subtasks, with the easiest being the most common. These mul-
tiple tasks are then sequentially executed in increasing complexity order,
with the chance of stopping earlier to save energy. Optionally, the sub-
inferences can be performed on different devices, such as edge nodes for
the simplest tasks and cloud server for the most computationally intensive
ones, in a collaborative fashion. The main difficulty in this approach is being
able to split a single machine learning task into multiple subinferences, which
is a strongly application-dependent problem. Therefore, this is a somewhat
less generally applicable method compared to ensemble and conditional
systems.

Fig. 19 An example of inference using the SkipNet architecture proposed in [100].



One domain where this methodology has been successfully applied is speech recognition for voice assistants such as Apple Siri, and in particular
the implementation for the Apple Watch [26]. In general, speech recogni-
tion requires large and power hungry DNNs, unable to fit on a constrained
smartwatch. On the other hand, the latency and energy requirements for
interacting with a voice assistant are strict, in order to achieve a good user
experience, and may not be obtained by a system that constantly transmits to
the cloud. The task is then split in two parts: wakeword recognition and
speech recognition. A reduced-size RNN is deployed on the edge device
(the smart watch), constantly running in the background but with very
low-power consumption. This network performs wakeword recognition, which, being a much easier task than full speech recognition, can be handled by simple and low-power models. When a wakeword is detected, the edge
device offloads the rest of the audio data to the cloud, where a computation-
ally expensive model is used to interpret them. This division in subtasks per-
mits both a consistent reduction of the data transmitted to the cloud, and a
smaller response latency.
A similar hierarchical approach is proposed in [35], where the authors
hierarchically split a face recognition task for personal devices such as
smartphones. In particular, they propose to run separate inferences to under-
stand: (1) whether the input picture contains a face or not; if it does, (2)
whether the face belongs to the device owner or to someone else; if it
belongs to someone else, (3) whether it is one of the owner’s favorite con-
tacts or not; etc. All steps are executed by increasingly complex CNNs on
the same custom hardware accelerator.
Finally, an interesting combination of hierarchical and conditional infer-
ence is proposed in [4], for the classification of patient medical issues based
on wearable sensors data. In that work, a CNN is split in two parts: a small
one deployed on the edge device and a bigger one on the cloud. The local
network only tries to predict whether the patient is sick or healthy with a
minimal number of layers, while the remote one performs a more in depth
analysis to understand the nature of the sickness. The latter model is clearly
only invoked when the edge device predicts that the patient is sick.
Moreover, the hierarchical splitting into multiple tasks is combined with
conditional inference concepts. Indeed, the remote network is fed with
the output of the local one, rather than with raw input data. This prevents
the repetition of similar computations both locally and remotely to extract
basic features. The overall architecture is summarized in Fig. 20.

Fig. 20 Staged inference medical diagnosis based on wearable sensors data proposed
in [4]. Easy inputs are classified directly on the edge device, while harder ones are sent to
the cloud for further (and more computationally expensive) analysis.

8.4 Input-dependent collaborative inference


Collaborative inference, described in Section 7.4 is a very effective approach
to optimize the execution of deep learning models by splitting them
between edge and cloud. The standard approach to collaborative inference
is, however, input-independent.c For example, the layer on which a DNN is
partitioned in Neurosurgeon [91] depends only on the connectivity condi-
tions and on the load of the cloud server, not on the processed input. This
makes sense for some kinds of models, like CNNs, which typically process
fixed-size inputs. However, it is not ideal for sequential models, where the
length of the input time-sequence influences significantly the complexity of
an inference.
Indeed, the works of [101] and [106] demonstrate through a character-
ization that, for RNNs, inference execution time and energy increase linearly
with the length of the input, due to the dependencies between subsequent
steps which prevent interstep parallelism. Therefore, the authors propose a
dynamic framework for energy-efficient input-dependent RNN collabora-
tive inference. The framework deploys a copy of the same RNN both on the
edge device and in the cloud. It then uses a runtime mapping engine to deter-
mine the optimal platform where to execute inference for a given input
sequence. The mapping engine bases its decision on an estimate of the edge
and cloud execution time (or energy consumption) for the current input
length, and on the current status of the connection between the two devices.

c
Notice that the approach of Fig. 20 is also collaborative, as it involves edge and cloud. However, in that
case, the two devices execute different (portions of ) models, while standard collaborative inference is
based on a single model.

Inference time and energy are estimated via linear regression models, based
on the results of the aforementioned characterization. Such dynamic par-
titioning significantly outperforms both edge-only and cloud-only inference
in terms of energy efficiency, for several NLP applications.
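The mapping decision can be sketched as follows, with made-up linear-model coefficients; in [101, 106] the coefficients are fitted offline on the target platforms, and the same structure can target energy rather than latency.

def map_rnn_inference(seq_len, round_trip_ms,
                      edge_coef=(2.0, 0.8), cloud_coef=(1.0, 0.1)):
    """Choose where to run one RNN inference; execution time is modeled as
    a + b * seq_len (the coefficients here are purely illustrative)."""
    t_edge = edge_coef[0] + edge_coef[1] * seq_len
    # Cloud execution pays the round-trip network latency on top of compute time
    t_cloud = round_trip_ms + cloud_coef[0] + cloud_coef[1] * seq_len
    return ("edge", t_edge) if t_edge <= t_cloud else ("cloud", t_cloud)

# Short sequences stay on the edge; long ones amortize the transmission overhead
print(map_rnn_inference(seq_len=5, round_trip_ms=40))    # -> ('edge', 6.0)
print(map_rnn_inference(seq_len=200, round_trip_ms=40))  # -> ('cloud', 61.0)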
Importantly, the engine in [101, 106] maps the entire RNN execution on
one of the two platforms, rather than partitioning it as typically done for
CNNs [91, 92]. This is because, for most RNN applications, input sizes
are much smaller than for CNNs, and most importantly they are smaller than
hidden layer outputs. This eliminates the data compression advantage deriv-
ing from partial local processing described in Section 7.4. Indeed, the authors
show that, for NLP tasks, the total communication time is dominated by the
round-trip network latency, which is independent from data size.

8.5 Dynamic tuning of inference algorithm parameters


In this section, we show how deep learning-based tasks at the edge can be
optimized by tuning some parameters of the inference algorithm, not directly
related with the DNN. While this approach can in principle be applied to
many different domains, it is still largely unexplored, therefore we present
it using a recent example taken from NLP with sequential models
(RNNs, transformers, etc.).
For basic classification or regression tasks, deep learning inference simply
consists of the execution of a DNN forward pass. More complex tasks, such
as machine translation or reinforcement learning, instead, use the DNN as
part of a larger algorithm. Similar to the hyper-parameters of the network,
the configurations of these algorithms influence both the inference accuracy
and the processing complexity. Such configurations are usually chosen stat-
ically at design time, in order to obtain an acceptable accuracy on average.
Exactly as detailed before for DNN architectures, this typically corresponds
to an over-design for easy inputs.
Recently, examples of dynamic input-dependent tuning of inference
algorithm parameters have been proposed in [102, 103, 107, 108] for
sequential NLP models, in particular those based on the encoder–decoder
architecture. In this architecture, the decoder DNN takes as input the fixed-
length representation of an input sequence (e.g., a sentence in English) gen-
erated by the encoder, and produces an output sequence (e.g., a translation in
German). At each step, the decoder outputs the likelihood of all possible out-
puts (e.g., all words in the German vocabulary), given the input representation
and the previous outputs [1]. Since greedily selecting the most likely output at
each step generates suboptimal translations, decoding is typically performed using beam search. With this algorithm, the BW most likely partial sentences
are expanded in each step, where BW is a parameter called beam width, and
the final overall most likely translation is selected at the end of decoding.
The beam width influences both accuracy and energy consumption, since
BW forward passes of the decoder DNN must be executed in each step
[1]. Classically, BW is statically chosen at design time, generating the scenario
depicted in Fig. 21A. In contrast, multiple works have recently proposed to
adapt this value dynamically at runtime, lowering it for easier inputs and
increasing it for harder ones [102, 103, 107, 108]. Specifically, the authors
of [103] developed an entropy-based policy to estimate the complexity of a given sequence at each decoding step and change BW accordingly. An example of the result is shown in Fig. 21B.

Fig. 21 On the top, a standard beam search with a fixed beam width of 2. On the bottom, its dynamic version, where the Sel. Policy chooses the best beam width to be used in the following step.

With experiments on translation and summarization tasks, they have demonstrated that this approach yields significant time and energy reductions, especially for single-core MCUs, where the different decoder executions must happen sequentially with one another.
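A simplified sketch of an entropy-based beam width selection policy in the spirit of [103] is shown below; the candidate widths and the thresholds are illustrative and would be tuned per task.

import math

def next_beam_width(step_probs, widths=(1, 2, 4), thresholds=(0.5, 1.5)):
    """Pick the beam width for the next decoding step from the entropy of the
    current output distribution: low entropy = easy step = narrow beam."""
    h = -sum(p * math.log(p) for p in step_probs if p > 0.0)
    for width, th in zip(widths, thresholds):
        if h < th:
            return width
    return widths[-1]    # highest entropy: use the widest beam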

9. Open challenges and future directions


Despite the constantly increasing interest in implementing deep
learning-based applications at the edge, there are still several open challenges
that are just starting to be addressed.
Among the optimization methods presented in this chapter, quantization
(especially to 8-bit integer) is by far the most widely supported by commer-
cial products and frameworks. Additional efforts on other techniques such as
(structured) pruning, collaborative inference, etc., are required to complete
the transition from research prototypes and narrow-scoped implementations
that work on a single combination of hardware platform and task, to industry-
ready methodologies. The same is true for dynamic/adaptive models, which
are potentially the main direction for edge deep learning in the future, but are
still far from being extensively used by industry.
In terms of models, the great majority of the research works in the past
have been focusing on CNNs, while other models (e.g., sequential ones) are
comparatively much less studied. Nonetheless, sequential models are poten-
tially even more relevant for edge devices, with applications in smart devices
(voice recognition, translation, image description, etc.) as well as in IoT sys-
tems deployed in cities and factories to perform time series processing.
Finally, as pointed out in [109], deep learning on edge devices still lacks a
standard and comprehensive set of benchmarks on which to perform mean-
ingful and fair comparisons among different optimization techniques. These
benchmarks should be flexible enough to support the highly heterogeneous
platforms on which edge deep learning can be performed, ranging from
MCUs to custom accelerators.

References
[1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
https://doi.org/10.1038/nature14539, http://www.nature.com/articles/nature14539.
[2] J. Wang, Y. Ma, L. Zhang, R.X. Gao, D. Wu, Deep learning for smart manufacturing:
methods and applications, J. Manufact. Syst. 48 (2018) 144–156.
[3] J. Ker, L. Wang, J. Rao, T. Lim, Deep learning applications in medical image analysis,
IEEE Access 6 (2017) 9375–9389.

[4] M. Parsa, P. Panda, S. Sen, K. Roy, Staged inference using conditional deep learning
for energy efficient real-time smart diagnosis, in: 2017 39th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society (EMBC),
IEEE, 2017, pp. 78–81.
[5] A. Kamilaris, F.X. Prenafeta-Boldú, Deep learning in agriculture: a survey, Comput.
Electron. Agric. 147 (2018) 70–90.
[6] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep con-
volutional neural networks, in: Advances in Neural Information Processing
Systems, 2012, pp. 1097–1105.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirec-
tional transformers for language understanding, Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies 1 (2019) 4171–4186.
[8] L. Deng, G. Hinton, B. Kingsbury, New types of deep neural network learning
for speech recognition and related applications: an overview, in: 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013,
pp. 8599–8603.
[9] V. Sze, Y.H. Chen, T.J. Yang, J.S. Emer, Efficient processing of deep neural networks:
a tutorial and survey. Proc. IEEE 105 (12) (2017) 2295–2329. ISSN: 15582256.
https://doi.org/10.1109/JPROC.2017.2761740.
[10] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[11] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines,
in: Proceedings of the 27th International Conference on Machine Learning
(ICML-10), 2010, pp. 807–814.
[12] A.H. Namin, K. Leboeuf, R. Muscedere, H. Wu, M. Ahmadi, Efficient hardware imple-
mentation of the hyperbolic tangent sigmoid function. in: Proceedings—IEEE
International Symposium on Circuits and Systems, ISSN 02714310, IEEE, 2009, ISBN:
9781424438280, pp. 2117–2120. https://doi.org/10.1109/ISCAS.2009.5118213.
[13] L. Benini, Plenty of room at the bottom? Micropower deep learning for cognitive
cyber physical systems. in: 2017 7th IEEE International Workshop on Advances in
Sensors and Interfaces (IWASI), IEEE, 2017, p. 165. https://doi.org/10.1109/iwasi.
2017.7974239. 165.
[14] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard,
L.D. Jackel, Backpropagation applied to handwritten zip code recognition. Neural
Comput. 1 (4) (1989) 541–551. https://doi.org/10.1162/neco.1989.1.4.541.
[15] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reduc-
ing internal covariate shift, in: Proceedings of the 32nd International Conference on
Machine Learning (ICML), 2015, pp. 448–456.
[16] R. Krishnamoorthi, Quantizing deep convolutional networks for efficient inference: a
whitepaper, arXiv preprint arXiv:1806.08342 (2018), http://arxiv.org/abs/1806.08342.
[17] S. Santurkar, D. Tsipras, A. Ilyas, A. Madry, How does batch normalization help opti-
mization? in: Advances in Neural Information Processing Systems, ISSN 10495258,
vol. 2018, 2018, pp. 2483–2493.
[18] L. Lai, N. Suda, V. Chandra, CMSIS-NN: efficient neural network kernels for
arm cortex-M CPUs, arXiv preprint arXiv:1801.06601 (2018), http://arxiv.org/
abs/1801.06601.
[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT:
smaller, faster, cheaper and lighter, in: Proceedings of the 5th Workshop on Energy
Efficient Machine Learning and Cognitive Computing (EMC2), 2019, pp. 1–5.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, in: Advances in Neural Information
Processing Systems, ISSN 10495258, vol. 2017-Decem, 2017, pp. 5999–6009.

[21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison,
A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: an
imperative style, high-performance deep learning library, in: H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alche-Buc, E. Fox, R. Garnett (Eds.),
Advances in Neural Information Processing Systems 32, Curran Associates, Inc.,
2019, pp. 8024–8035. http://arxiv.org/abs/1912.01703
[22] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane,
R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner,
I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas,
O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng,
TensorFlow: large-scale machine learning on heterogeneous distributed systems,
arXiv preprint arXiv:1603.04467 (2016), http://arxiv.org/abs/1603.04467.
[23] ONNX, https://onnx.ai/.
[24] ST, STM32Cube-AI, https://www.st.com/en/ecosystems/stm32cube.html.
[25] A. Garofalo, M. Rusci, F. Conti, D. Rossi, L. Benini, Pulp-NN: accelerating quantized
neural networks on parallel ultra-low-power RISC-V processors. in: Philosophical
Transactions of the Royal Society A: Mathematical, Physical and Engineering
Sciences, ISSN 1364503X, vol. 378, 2020, https://doi.org/10.1098/rsta.2019.0155.
[26] J. Chen, X. Ran, Deep learning with edge computing: a review. Proc. IEEE 107 (8)
(2019) 1655–1674. https://doi.org/10.1109/JPROC.2019.2921977.
[27] D. Jahier Pagliari, E. Macii, M. Poncino, Dynamic bit-width reconfiguration for
energy-efficient deep learning hardware. in: Proceedings of the International
Symposium on Low Power Electronics and Design, ISSN 15334678, 2018, ISBN:
9781450357043, pp. 1–6. https://doi.org/10.1145/3218603.3218611.
[28] ST, iNemo, https://www.st.com/en/mems-and-sensors/lsm6dsox.html.
[29] D. Jahier Pagliari, M. Ansaldi, E. Macii, M. Poncino, CNN-based camera-less user atten-
tion detection for smartphone power management, in: 2019 IEEE/ACM International
Symposium on Low Power Electronics and Design (ISLPED), IEEE, 2019, pp. 1–6.
[30] L. Li, K. Ota, M. Dong, When weather matters: IoT-based electrical load forecasting
for smart grid. IEEE Commun. Mag. 55 (10) (2017) 46–51. https://doi.org/10.1109/
MCOM.2017.1700168.
[31] Y. Duan, Y. Lv, Y.L. Liu, F.Y. Wang, An efficient realization of deep learning for
traffic data imputation. Transp. Res. C Emerg. Technol. 72 (2016) 168–181. ISSN:
0968090X. https://doi.org/10.1016/j.trc.2016.09.015.
[32] Google, Edge TPU, https://cloud.google.com/edge-tpu/.
[33] Intel, Movidius, https://software.intel.com/content/www/us/en/develop/articles/
intel-movidius-neural-compute-stick.html.
[34] Y.-H. Chen, T. Krishna, J.S. Emer, V. Sze, Eyeriss: an energy-efficient reconfigurable
accelerator for deep convolutional neural networks, IEEE J. Solid-State Circuits 52 (1)
(2016) 127–138.
[35] B. Moons, R. Uytterhoeven, W. Dehaene, M. Verhelst, Envision: a 0.26-to-10TOPS/W
subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural
Network processor in 28nm FDSOI. in: Digest of Technical Papers—IEEE Interna-
tional Solid-State Circuits Conference, ISSN 01936530, vol. 60, IEEE, 2017, ISBN:
9781509037575, pp. 246–247. https://doi.org/10.1109/ISSCC.2017.7870353.
[36] F. Conti, P.D. Schiavone, L. Benini, XNOR Neural engine: a hardware accelerator
IP for 21.6-fJ/op binary neural network inference, IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems 37 (11) (2018) 2940–2951. https://
doi.org/10.1109/TCAD.2018.2857019. http://arxiv.org/abs/1807.03010.

[37] A. Shawahna, S.M. Sait, A. El-Maleh, FPGA-based accelerators of deep learning net-
works for learning and classification: a review. IEEE Access 7 (2019) 7823–7859.
ISSN: 21693536. https://doi.org/10.1109/ACCESS.2018.2890150.
[38] J.E. Stone, D. Gohara, G. Shi, OpenCL: a parallel programming standard for hetero-
geneous computing systems, Comput. Sci. Eng. 12 (3) (2010) 66–73.
[39] B. Moons, R. Uytterhoeven, W. Dehaene, M. Verhelst, DVAFS: trading computa-
tional accuracy for energy through dynamic-voltage-accuracy-frequency-scaling.
in: Proceedings of the 2017 Design, Automation and Test in Europe, DATE 2017,
IEEE, 2017, ISBN: 9783981537093, pp. 488–493. https://doi.org/10.23919/DATE.
2017.7927038.
[40] D. Jahier Pagliari, E. Macii, M. Poncino, Automated synthesis of energy-efficient
reconfigurable-precision circuits, IEEE Access 7 (2019) 172030–172044.
[41] D. Jahier Pagliari, M. Poncino, Application-driven synthesis of energy-efficient
reconfigurable-precision operators, in: 2018 IEEE International Symposium on
Circuits and Systems (ISCAS), IEEE, 2018, pp. 1–5.
[42] J.T.X. Nvidia, Developer Kit, 2015, https://www.nvidia.com/it-it/autonomous-
machines/embedded-systems/.
[43] A. Thomas, Y. Guo, Y. Kim, B. Aksanli, A. Kumar, T.S. Rosing, Hierarchical and
distributed machine learning inference beyond the edge. in: Proceedings of the
2019 IEEE 16th International Conference on Networking, Sensing and Control,
ICNSC 2019, IEEE, 2019, ISBN: 9781728100838, pp. 18–23. https://doi.org/
10.1109/ICNSC.2019.8743164.
[44] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
E. Shelhamer, cuDNN: efficient primitives for deep learning, arXiv preprint
arXiv:1410.0759 (2014), http://arxiv.org/abs/1410.0759.
[45] D. Kirk, NVIDIA CUDA software and GPU parallel computing architecture.
in: International Symposium on Memory Management, ISMM, Vol. 7, 2007,
ISBN: 9781595938930, p. 103. https://doi.org/10.1145/1296907.1296909.
[46] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini, A. Marongiu, Energy efficient par-
allel computing on the PULP platform with support for OpenMP. in: 2014 IEEE 28th
Convention of Electrical and Electronics Engineers in Israel, IEEEI 2014, 2014, ISBN:
9781479959877, pp. 1–5. https://doi.org/10.1109/EEEI.2014.7005803.
[47] GAP8—The IoT Application Processor, https://greenwaves-technologies.com/ai_
processor_gap8/, (Accessed May, 2020).
[48] A. Burrello, F. Conti, A. Garofalo, D. Rossi, L. Benini, Work-in-progress: DORY: light-
weight memory hierarchy management for deep NN inference on IoT endnodes.
in: Proceedings of the International Conference on Hardware/Software Codesign
and System Synthesis Companion, CODES/ISSS 2019, IEEE, 2019, ISBN:
9781450369237, pp. 1–2. https://doi.org/10.1145/3349567.3351726.
[49] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, P.A. Muller, Deep learning for
time series classification: a review. Data Mining and Knowledge Discovery 33 (4)
(2019) 917–963. ISSN: 1573756X. https://doi.org/10.1007/s10618-019-00619-1.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hier-
archical image database. in: 2009 IEEE Conference on Computer Vision and Pattern
Recognition, IEEE, 2009, pp. 248–255. https://doi.org/10.1109/cvprw.2009.5206848.
[51] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer,
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model
size, arXiv preprint arXiv:1602.07360 (2016). http://arxiv.org/abs/1602.07360.
[52] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for
mobile vision applications, arXiv preprint arXiv:1704.04861 (2017), http://arxiv.
org/abs/1704.04861.
[53] M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu, W. Zhuang, Learning-based compu-
tation offloading for IoT devices with energy harvesting. IEEE Trans. Veh. Technol.
68 (2) (2019) 1930–1941. ISSN: 00189545. https://doi.org/10.1109/TVT.2018.
2890685.
[54] C.M.J.M. Dourado, S.P.P. da Silva, R.V.M. da Nóbrega, A.C. Antonio, P.P. Filho,
V.H.C. de Albuquerque, Deep learning IoT system for online stroke detection in skull
computed tomography images. Comput. Netw. 152 (2019) 25–39. https://doi.org/
10.1016/j.comnet.2019.01.019.
[55] Y.D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, D. Shin, Compression of deep con-
volutional neural networks for fast and low power mobile applications, in: 4th
International Conference on Learning Representations, ICLR 2016—Conference
Track Proceedings, 2016.
[56] D. Jahier Pagliari, M. Poncino, E. Macii, Energy-efficient digital processing via
approximate computing. in: Smart Systems Integration and Simulation, Springer,
2016, ISBN: 9783319273921, pp. 55–89. https://doi.org/10.1007/978-3-319-
27392-1_4.
[57] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg,
M. Houston, O. Kuchaiev, G. Venkatesh, H. Wu, Mixed precision training,
in: Proceedings of the 6th International Conference on Learning Representations
(ICLR), 2018, pp. 1–12.
[58] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural
networks: training deep neural networks with weights and activations constrained
to +1 or −1, arXiv preprint arXiv:1602.02830 (2016), http://arxiv.org/abs/1602.
02830.
[59] M. Courbariaux, Y. Bengio, J.P. David, Binaryconnect: training deep neural networks
with binary weights during propagations, in: Advances in Neural Information
Processing Systems, ISSN 10495258, vol. 2015, 2015, pp. 3123–3131.
[60] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks
with pruning, trained quantization and Huffman coding, in: Proceedings of the 4th
International Conference on Learning Representations (ICLR), 2016, pp. 1–14.
[61] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, DoReFa-Net: training low bitwidth
convolutional neural networks with low bitwidth gradients, arXiv preprint arXiv:
1606.06160 (2016).
[62] E.H. Lee, D. Miyashita, E. Chai, B. Murmann, S.S. Wong, LogNet: energy-efficient
neural networks using logarithmic computation, in: 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017,
pp. 5900–5904.
[63] P. Gysel, J. Pimentel, M. Motamedi, S. Ghiasi, Ristretto: A framework for empirical
study of resource-efficient inference in convolutional neural networks, IEEE Trans.
Neural Netw. Learn. Syst. 29 (11) (2018) 5784–5789.
[64] D. Miyashita, E.H. Lee, B. Murmann, Convolutional neural networks using logarith-
mic data representation, arXiv preprint arXiv:1603.01025 (2016).
[65] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited
numerical precision, in: International Conference on Machine Learning, 2015,
pp. 1737–1746.
[66] P. Gysel, J. Pimentel, M. Motamedi, S. Ghiasi, Ristretto: a framework for empirical
study of resource-efficient inference in convolutional neural networks. IEEE Trans.
Neural Netw. Learn. Syst. 29 (11) (2018) 5784–5789. ISSN: 21622388. https://
doi.org/10.1109/TNNLS.2018.2808319.
[67] Y. Bengio, N. Leonard, A. Courville, Estimating or propagating gradients through
stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432
(2013).
[68] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam,
D. Kalenichenko, Quantization and training of neural networks for efficient
integer-arithmetic-only inference. in: Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, ISSN 10636919, 2018,
ISBN: 9781538664209, pp. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286.
[69] J. Choi, Z. Wang, S. Venkataramani, P.I.-J. Chuang, V. Srinivasan,
K. Gopalakrishnan, PACT: parameterized clipping activation for quantized neural net-
works, arXiv preprint arXiv:1805.06085 (2018), http://arxiv.org/abs/1805.06085.
[70] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification
using binary convolutional neural networks, in: European Conference on Computer
Vision, Springer, 2016, pp. 525–542.
[71] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural net-
works, in: Advances in Neural Information Processing Systems, ISSN 10495258, 2016,
pp. 4114–4122.
[72] M. Edel, E. Köppe, Binarized-BLSTM-RNN based human activity recognition, in: 2016
International Conference on Indoor Positioning and Indoor Navigation (IPIN), IEEE,
2016, pp. 1–7.
[73] R. Andri, L. Cavigelli, D. Rossi, L. Benini, YodaNN: an ultra-low power con-
volutional neural network accelerator based on binary weights, in: 2016 IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), IEEE, 2016, pp. 236–241.
[74] J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, Y. Bengio, Recurrent neural networks with lim-
ited numerical precision, arXiv preprint arXiv:1608.06902 (2016), http://arxiv.org/
abs/1611.07065.
[75] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, Y. Zou, Effective quantization
methods for recurrent neural networks, arXiv preprint arXiv:1611.10176 (2016),
http://arxiv.org/abs/1611.10176.
[76] S. Shin, K. Hwang, W. Sung, Fixed-point performance analysis of recurrent neural
networks. in: ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing—Proceedings, ISSN 15206149, vol. 2016, IEEE, 2016, ISBN:
9781479999880, pp. 976–980. https://doi.org/10.1109/ICASSP.2016.7471821.
[77] Y. LeCun, J.S. Denker, S.A. Solla, Optimal brain damage, in: Advances in Neural
Information Processing Systems, 1990, pp. 598–605.
[78] S. Han, J. Pool, J. Tran, W.J. Dally, Learning both weights and connections for effi-
cient neural networks, in: Advances in Neural Information Processing Systems, ISSN
10495258, vol. 2015, 2015, pp. 1135–1143.
[79] T.-J. Yang, Y.-H. Chen, V. Sze, Designing energy-efficient convolutional neural net-
works using energy-aware pruning, in: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 5687–5695.
[80] R. Dorrance, F. Ren, D. Markovic, A scalable sparse matrix-vector multiplication
kernel for energy-efficient Sparse-BLAS on FPGAs. in: ACM/SIGDA International
Symposium on Field Programmable Gate Arrays—FPGA, 2014, ISBN: 9781450
326711, pp. 161–169. https://doi.org/10.1145/2554688.2554785.
[81] G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, N. Koziris, Understanding the
performance of sparse matrix-vector multiplication. in: Proceedings of the 16th
Euromicro Conference on Parallel, Distributed and Network-Based Processing,
PDP, 2008, IEEE, 2008, ISBN: 0769530893, pp. 283–292. https://doi.org/10.1109/
PDP.2008.41.
[82] S. Anwar, K. Hwang, W. Sung, Structured pruning of deep convolutional neural net-
works. ACM J. Emerg. Technol. Comput. Syst. 13 (3) (2017) 1–18. ISSN: 15504840.
https://doi.org/10.1145/3005348.
[83] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, L. Zhang,
Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity.
in: FPGA 2019—Proceedings of the 2019 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, 2019, ISBN: 9781450361378, pp. 63–72. https://
doi.org/10.1145/3289602.3293898.
[84] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network,
in: NIPS Deep Learning and Representation Learning Workshop, 2015. http://
arxiv.org/abs/1503.02531.
[85] C. Bucilǎ, R. Caruana, A. Niculescu-Mizil, Model compression. in: Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, vol. 2006, 2006, ISBN: 1595933395, pp. 535–541. https://doi.org/
10.1145/1150402.1150464.
[86] A. Romero, N. Ballas, S.E. Kahou, A. Chassang, C. Gatta, Y. Bengio, FitNets: hints for
thin deep nets, in: Proceedings of the 3rd International Conference on Learning
Representations (ICLR), 2015, pp. 1–13.
[87] Y. Tian, D. Krishnan, P. Isola, Contrastive representation distillation, in: Proceedings
of the 8th International Conference on Learning Representations (ICLR), 2020,
pp. 1–19.
[88] B.B. Sau, V.N. Balasubramanian, Deep model compression: distilling knowledge from
noisy teachers, arXiv preprint arXiv:1610.09650 (2016), http://arxiv.org/abs/1610.
09650.
[89] S.-I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, H. Ghasemzadeh,
Improved knowledge distillation via teacher assistant, in: Proceedings of the 34th
Conference on Artificial Intelligence (AAAI), 2020, pp. 5191–5198.
[90] J.H. Cho, B. Hariharan, On the efficacy of knowledge distillation. in: Proceedings of
the IEEE International Conference on Computer Vision, ISSN 15505499, vol. 2019,
2019, ISBN: 9781728148038, pp. 4793–4801. https://doi.org/10.1109/ICCV.2019.
00489.
[91] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, L. Tang,
Neurosurgeon: collaborative intelligence between the cloud and mobile edge.
in: Proceedings of the International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 2017, pp. 615–629.
https://doi.org/10.1145/3037697.3037698.
[92] A.E. Eshratifar, A. Esmaili, M. Pedram, BottleNet: a deep learning architecture for intel-
ligent mobile cloud computing services. in: Proceedings of the International Symposium
on Low Power Electronics and Design, ISSN 15334678, vol. 2019, IEEE, 2019, ISBN:
9781728129549, pp. 1–6. https://doi.org/10.1109/ISLPED.2019.8824955.
[93] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, M. Yunsheng, S. Chen, P. Hou,
A new deep learning-based food recognition system for dietary assessment on an edge
computing service infrastructure. IEEE Trans. Services Comput. 11 (2) (2017)
249–261. https://doi.org/10.1109/TSC.2017.2662008.
[94] A.E. Eshratifar, M.S. Abrishami, M. Pedram, JointDNN: an efficient training and
inference engine for intelligent mobile cloud computing services. IEEE Trans.
Mob. Comput. Early Access (2019) 1. https://doi.org/10.1109/tmc.2019.2947893.
[95] Z. Zhao, K.M. Barijough, A. Gerstlauer, DeepThings: distributed adaptive deep learn-
ing inference on resource-constrained IoT edge clusters. IEEE Trans. Comput. Aided
Des. Integrated Circuits Syst. 37 (11) (2018) 2348–2359. ISSN: 02780070. https://doi.
org/10.1109/TCAD.2018.2858384.
[96] H. Yin, Z. Wang, N.K. Jha, A hierarchical inference model for internet-of-things,
IEEE Trans. Multi-Scale Comput. Syst. 4 (3) (2018) 260–271.
[97] H. Tann, S. Hashemi, R.I. Bahar, S. Reda, Runtime configurable deep neural net-
works for energy-accuracy trade-off. in: 2016 International Conference on
Hardware/Software Codesign and System Synthesis, CODES+ISSS 2016, IEEE,
2016, ISBN: 9781450330503, pp. 1–10. https://doi.org/10.1145/2968456.2968458.
[98] E. Park, D. Kim, S. Kim, Y.D. Kim, G. Kim, S. Yoon, S. Yoo, Big/little deep neural
network for ultra low power inference. in: 2015 International Conference on
Hardware/Software Codesign and System Synthesis, CODES+ISSS 2015, IEEE,
2015, ISBN: 9781467383219, pp. 124–132. https://doi.org/10.1109/CODESISSS.
2015.7331375.
[99] S. Teerapittayanon, B. McDanel, H.T. Kung, BranchyNet: fast inference via early exiting
from deep neural networks. in: Proceedings—International Conference on Pattern
Recognition, ISSN 10514651, IEEE, 2016, ISBN: 9781509048472, pp. 2464–2469.
https://doi.org/10.1109/ICPR.2016.7900006.
[100] X. Wang, F. Yu, Z.Y. Dou, T. Darrell, J.E. Gonzalez, SkipNet: learning dynamic routing
in convolutional networks. in: Lecture Notes in Computer Science (Including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN
16113349, vol. 11217 LNCS, 2018, ISBN: 9783030012601, pp. 420–436. https://
doi.org/10.1007/978-3-030-01261-8_25.
[101] D. Jahier Pagliari, R. Chiaro, Y. Chen, E. Macii, M. Poncino, Optimal input-
dependent edge-cloud partitioning for RNN inference, in: 2019 26th IEEE
International Conference on Electronics, Circuits and Systems (ICECS), IEEE,
2019, pp. 442–445.
[102] D. Jahier Pagliari, F. Panini, E. Macii, M. Poncino, Dynamic beam width tuning for
energy-efficient recurrent neural networks, in: Proceedings of the 2019 on Great Lakes
Symposium on VLSI, 2019, pp. 69–74.
[103] D. Jahier Pagliari, F. Daghero, M. Poncino, Sequence-to-sequence neural networks
inference on embedded processors using dynamic beam search. Electronics 9 (2)
(2020) 337. ISSN: 20799292. https://doi.org/10.3390/electronics9020337.
[104] C. Guo, G. Pleiss, Y. Sun, K.Q. Weinberger, On calibration of modern neural
networks, in: 34th International Conference on Machine Learning, ICML 2017,
vol. 3, JMLR.org, 2017, ISBN: 9781510855144, pp. 2130–2143.
[105] A. Graves, Adaptive computation time for recurrent neural networks, arXiv preprint
arXiv:1603.08983 (2016).
[106] D. Jahier Pagliari, R. Chiaro, Y. Chen, S. Vinco, E. Macii, M. Poncino, Input-
dependent edge-cloud mapping of recurrent neural networks inference, in: 2020
57th ACM/EDAC/IEEE Design Automation Conference (DAC), 2020, pp. 1–6.
[107] M. Mejia-Lavalle, C.G.P. Ramos, Beam search with dynamic pruning for artificial
intelligence hard problems, in: 2013 International Conference on Mechatronics,
Electronics and Automotive Engineering, IEEE, 2013, pp. 59–64.
[108] M. Freitag, Y. Al-Onaizan, Beam search strategies for neural machine translation,
in: Proceedings of the First Workshop on Neural Machine Translation, 2017,
pp. 56–60.
[109] C.R. Banbury, V.J. Reddi, M. Lam, W. Fu, A. Fazel, J. Holleman, X. Huang,
R. Hurtado, D. Kanter, A. Lokhmotov, D. Patterson, D. Pau, J.-s. Seo, J. Sieracki,
U. Thakker, M. Verhelst, P. Yadav, Benchmarking TinyML systems: challenges
and direction, arXiv preprint arXiv:2003.04821 (2020).

About the authors


Francesco Daghero is a PhD student at
Politecnico di Torino. He received a M.Sc.
degree in computer engineering from
Politecnico di Torino, Italy, in 2019. His
research interests concern embedded
machine learning and Industry 4.0.

Daniele Jahier Pagliari received the M.Sc.
and Ph.D. degrees in computer engineering
from Politecnico di Torino, Italy, in 2014
and 2018, respectively. He is currently an
Assistant Professor at the same institution.
His research interests include computer-aided
design of digital systems and low-power opti-
mization for embedded systems, with particu-
lar focus on embedded machine learning.

Massimo Poncino is a Full Professor of
Computer Engineering with the Politecnico
di Torino, Italy. His current research interests
include several aspects of design automation of
digital systems, with emphasis on the model-
ing and optimization of energy-efficient sys-
tems. He received a PhD in computer
engineering and a Dr.Eng. in electrical engi-
neering from Politecnico di Torino.
