Energy-Efficient Deep Learning Inference On Edge Devices
Contents
1. Introduction 248
2. Theoretical background 249
2.1 Neurons and layers 249
2.2 Training and inference 252
2.3 Feed-forward models 253
2.4 Sequential models 256
3. Deep learning frameworks and libraries 258
4. Advantages of deep learning on the edge 259
5. Applications of deep learning at the edge 260
5.1 Computer vision 261
5.2 Language and speech processing 261
5.3 Time series processing 262
6. Hardware support for deep learning inference at the edge 262
6.1 Custom accelerators 262
6.2 Embedded GPUs 263
6.3 Embedded CPUs and MCUs 264
7. Static optimizations for deep learning inference at the edge 265
7.1 Quantization 267
7.2 Pruning 273
7.3 Knowledge distillation 278
7.4 Collaborative inference 279
7.5 Limitations of static optimizations 282
8. Dynamic (input-dependent) optimizations for deep learning inference at the edge 282
8.1 Ensemble learning 283
8.2 Conditional inference and fast exiting 286
8.3 Hierarchical inference 288
8.4 Input-dependent collaborative inference 290
8.5 Dynamic tuning of inference algorithm parameters 291
9. Open challenges and future directions 293
References 293
About the authors 301
Abstract
The success of deep learning comes at the cost of very high computational complexity.
Consequently, Internet of Things (IoT) edge nodes typically offload deep learning tasks
to powerful cloud servers, an inherently inefficient solution. In fact, transmitting raw data
to the cloud through wireless links incurs long latencies and high energy consumption.
Moreover, pure cloud offloading is not scalable due to network pressure and poses
security concerns related to the transmission of user data.
The straightforward solution to these issues is to perform deep learning inference at
the edge. However, cost and power-constrained embedded processors with limited
processing and memory capabilities cannot handle complex deep learning models.
Even when resorting to hardware acceleration, a common approach to handle such
complexity, embedded devices are still not able to directly manage models designed for cloud
servers. It then becomes necessary to employ proper optimization strategies to enable
deep learning processing at the edge.
In this chapter, we survey the most relevant optimizations to support embedded
deep learning inference. We focus in particular on optimizations that favor hardware
acceleration (such as quantization and big-little architectures). We divide our analysis
into two parts. First, we review classic approaches based on static (design time) optimi-
zations. We then show how these solutions are often suboptimal, as they produce
models that are either over-optimized for complex inputs (yielding accuracy losses)
or under-optimized for simple inputs (losing energy saving opportunities). Finally, we
review the more recent trend of dynamic (input-dependent) optimizations, which solve
this problem by adapting the optimization to the processed input.
1. Introduction
In recent years, machine learning techniques have become pervasive
in our society, as the backbone of an increasing number of applications in the
mobile and IoT domains. This spread has been mainly fueled by the advent
of deep learning. In fact, one of the limitations of classical (i.e., pre-deep-
learning) ML models is their reliance on carefully hand-engineered feature
extractors, which makes model design a long, complex and costly process.
Furthermore, the resulting models are often not reusable if the specifications
of the problem change even slightly [1]. Deep learning, in contrast, over-
comes the need for hand-crafted features [1], by using representation learn-
ing to extract meaningful features directly from raw data. This approach has
been applied successfully to a number of applications, from smart
manufacturing [2], to medical analysis [3, 4] and agriculture [5]. In particular,
for tasks such as computer vision [6], natural language processing [7] and
speech recognition [8], deep learning models have achieved outstanding
results, sometimes even outperforming humans [1].
2. Theoretical background
In this section, we provide a (noncomprehensive) background on
deep learning models. We focus mostly on computational aspects, which
are relevant for the optimizations presented in the rest of the chapter, pro-
viding examples based on “standard” architectures, while intentionally skip-
ping some theoretical details related to the most advanced and exotic
models. Readers interested in those aspects can refer to [10].
Fig. 2 Some of the most common activation functions in deep neural networks.
(A) Sigmoid; (B) Tanh; (C) ReLU.
which is similar to sigmoid but maps its input to the interval (−1, 1), as
shown in Fig. 2B.
• the Rectified Linear Unit (ReLU) [11]:
$h(z) = \max(0, z) = \begin{cases} 0, & z \le 0 \\ z, & z > 0 \end{cases}$    (4)
which is often used in modern deep learning models to solve the
vanishing gradient problem of sigmoid and tanh, i.e., the fact that for
very large input magnitude, the gradients of those two functions become
very small, complicating the back-propagation of errors during training.
In contrast, the gradient of ReLU is piece-wise constant, as is clear from
Fig. 2C.
Besides reducing the impact of vanishing gradients, ReLU is also advanta-
geous from a computational perspective. Indeed, its evaluation simply con-
sists of a hardware-friendly max() function. In contrast, sigmoid and tanh
must be approximated either via a software routine or using table look-
up, depending on the hardware platform [12]. Furthermore, ReLU maps
all negative inputs to zero, leading to sparse activation outputs, which favor
optimization techniques such as pruning (see Section 7.2) [9].
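As a concrete illustration (a minimal NumPy sketch, with function names of our choosing), the following compares ReLU, which reduces to a single max() against zero, with sigmoid, which needs an exponential, and shows the activation sparsity that ReLU induces:

```python
import numpy as np

def relu(z):
    # A single max() against zero: cheap on virtually any hardware.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Requires an exponential, typically approximated in SW or via a look-up table.
    return 1.0 / (1.0 + np.exp(-z))

z = np.random.randn(1000)
a = relu(z)
print("ReLU output sparsity:", np.mean(a == 0.0))  # roughly 50% zeros for zero-mean inputs
```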
Most deep learning models are neural networks (NNs), i.e., combina-
tions of neurons organized in a sequence of representation layers. Layers
can be divided into three main categories with respect to their position in
the network. The input layer processes raw input data and contains one
neuron per input variable. Its outputs are then fed to one or more hidden
layers, which are at the core of NN processing. Each hidden layer builds
an increasingly complex representation of the input, projecting it into a
high-dimensional feature space. Finally, the output layer takes the last hidden
layer output and produces the final result of the NN computation. Clearly,
the structure of this layer and its activation function depend on the task for
which the NN is used. For classification, the output layer typically includes a
number of neurons equal to the number of classes, each producing an esti-
mate of the probability that the input belongs to the corresponding class. In
this case, a common activation function is the softmax:
$q_i = \dfrac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$    (5)
which converts the preactivation output z for each class i into a probability
q. T is the so-called temperature and is normally set to 1.
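A numerically stable implementation of (5), including the temperature T, takes only a few lines; the following NumPy sketch (function name ours) is for illustration only:

```python
import numpy as np

def softmax(z, T=1.0):
    # Subtracting the maximum before exponentiating does not change the result
    # but avoids overflow.
    z = (z - np.max(z)) / T
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))          # standard softmax (T = 1)
print(softmax(logits, T=4.0))   # higher T -> softer, less peaked distribution
```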
only of the forward pass and uses fixed weight values. While it is possible to use
mini-batches also in this phase, most inference tasks, especially for embedded
and IoT applications, have tight latency constraints, i.e., outputs have to be
produced as soon as possible after inputs become available. In those scenarios,
batching is not an option, and inference has to be performed on single inputs,
in a streaming fashion. Clearly, this negatively affects the exploitable parallel-
ism and data sharing opportunities.
MAC operation. This is only partially mitigated by the weight reuse made
available by batching, when it is an option [13].
$y = \dfrac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \, \gamma + \beta$    (7)
where γ and β are parameters learned during training, while μ_B and σ_B
are the mini-batch mean and standard deviation; ε is a small constant added
for numerical stability [15]. During inference, μ_B and σ_B are replaced with
the entire training set mean and standard deviations [16]. Batch norm has
been shown to improve training speed by making gradients more stable,
besides favoring the application of quantization (see Section 7.1) [17].
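Since at inference time the statistics and the learned parameters of (7) are all constants, batch norm reduces to a per-channel affine transform that can be folded into the preceding layer; the following NumPy sketch (variable names ours) illustrates this:

```python
import numpy as np

def batchnorm_inference(x, mean, var, gamma, beta, eps=1e-5):
    # With fixed statistics, (7) is just y = a * x + b per channel,
    # and a, b can be folded into the preceding Conv/FC layer's weights and bias.
    a = gamma / np.sqrt(var + eps)
    b = beta - a * mean
    return a * x + b

x = np.random.randn(8)  # one activation value per channel
y = batchnorm_inference(x, mean=np.zeros(8), var=np.ones(8),
                        gamma=np.ones(8), beta=np.zeros(8))
```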
Features extracted by CNNs are often flattened and fed to one or more fully
connected (FC) layers for the final classification. An example of a classical
CNN architecture is shown in Fig. 6.
Conv layers dominate the inference phase of a CNN from a computa-
tional standpoint. This is due to the fact that they process larger tensors with
respect to the final FC layers, which operate on a compressed feature space.
Moreover, the number of Conv layers is typically much larger than the num-
ber of FC layers in a modern CNN. The naive realization of a set of (standard)
convolution filters on a 3D tensor consists of 6 nested loops [13]. More
specifically, the 3 innermost loops perform the weighted sum of a 3D slice
of the input tensor with a 3D filter kernel, i.e., the three summations of
(6). Two additional loops move the slice across the width and height of
the input tensor, and the last loop repeats the entire procedure with multiple
filters, thus generating the various channels of the output tensor. Differently
from fully connected layers, Conv operations are typically compute bound,
even without batching [13].
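For illustration, a minimal sketch of the naive 6-loop realization described above (stride 1, no padding; the tensor layout and names are our assumptions):

```python
import numpy as np

def conv2d_naive(x, w):
    """x: input tensor (C_in, H, W); w: filters (C_out, C_in, K, K)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    y = np.zeros((c_out, h - k + 1, wd - k + 1))
    for co in range(c_out):               # loop 1: output channels (filters)
        for i in range(h - k + 1):        # loop 2: slide over input height
            for j in range(wd - k + 1):   # loop 3: slide over input width
                acc = 0.0
                for ci in range(c_in):    # loops 4-6: weighted sum of a 3D slice
                    for ki in range(k):
                        for kj in range(k):
                            acc += x[ci, i + ki, j + kj] * w[co, ci, ki, kj]
                y[co, i, j] = acc
    return y

y = conv2d_naive(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))  # -> (4, 6, 6)
```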
Fig. 6 Lenet-5, an example of classic CNN architecture [14]. FC, fully connected. Layer
activations are not shown for simplicity.
Fig. 8 An overview of a LSTM cell (A) and its unrolling during inference (B). In the cell
diagram, circles labeled with σ represent a multiplication with a weight matrix followed
by an element-wise sigmoid or tanh operation, while circles labeled with x or + represent
element-wise products and sums.
layers of so-called cells, i.e., complex “neurons” with memory. Common cell
architectures include the long short-term memory (LSTM) and the gated
recurrent unit (GRU). An example of LSTM cell is shown in Fig. 8; for
the details of the various operations involved, the reader can refer to [10].
The figure also shows how cells are used to process an entire sequence, by
unrolling them a number of times equal to the sequence length. Notice that
each cell “replica” shares the same learned weights.
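The unrolling of Fig. 8B boils down to a loop over the sequence in which every time step reuses the same cell, and hence the same weights; the sketch below uses a plain tanh recurrence instead of a full LSTM cell, purely for brevity, and all names are ours:

```python
import numpy as np

def rnn_unroll(x_seq, w_x, w_h, b, h0):
    """Process a sequence by applying the same cell at every time step."""
    h = h0
    outputs = []
    for x_t in x_seq:                         # unrolling: one step per sequence element
        h = np.tanh(w_x @ x_t + w_h @ h + b)  # the same weights are reused at every step
        outputs.append(h)
    return np.stack(outputs), h

# Tiny usage example with made-up sizes.
steps, in_dim, hid_dim = 5, 4, 8
x_seq = np.random.randn(steps, in_dim)
w_x = np.random.randn(hid_dim, in_dim)
w_h = np.random.randn(hid_dim, hid_dim)
out, h_last = rnn_unroll(x_seq, w_x, w_h, np.zeros(hid_dim), np.zeros(hid_dim))
```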
popular frameworks. Both provide high-level Python APIs for model devel-
opment, training, and deployment, while leveraging optimized C/C++
libraries for CPU or GPU acceleration under the hood. Cross-framework
model description formats such as the one provided by ONNX [23] are also
increasingly popular.
These frameworks are extremely powerful and flexible, but they are not
suited for low-power edge devices, mostly due to their significant require-
ments in terms of runtime memory occupation. Therefore, recent years have
seen the development of many edge-oriented libraries and frameworks for
deep learning. Both PyTorch and TensorFlow now offer lightweight infer-
ence engines (called PyTorch Mobile and TF Lite, respectively) targeting
resource efficiency. With respect to their full-fledged counterparts, these
stripped-down versions only permit the execution of the inference phase
(i.e., no model development nor training), thus greatly reducing the size
of the corresponding runtime. Conversion tools make it possible to export a standard
PyTorch/TF model for these engines, while simultaneously applying
efficiency-oriented optimizations to the network (mostly quantization,
see Section 7.1).
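As an example, the sketch below exports a Keras model with the TensorFlow Lite converter; the exact options depend on the TF version, and the toy model is only a placeholder:

```python
import tensorflow as tf

# Stand-in for a trained tf.keras model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables post-training optimizations
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)  # flatbuffer consumed by the TF Lite runtime
```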
While PyTorch Mobile and TF Lite target powerful edge devices such as
smartphones or tablets, other projects have addressed the implementation of
deep learning on even more constrained targets, such as Microcontrollers
(MCUs). TensorFlow Lite for Microcontrollers is a recent effort in this sense
from TF developers, targeting devices such as ARM Cortex-M processors; it
runs without the need for operating system support, and occupies only a few
kB of memory. Moreover, several companies and academic researchers are
developing MCU-oriented libraries for deep learning inference, such as
ARM’s CMSIS-NN [18], STMicroelectronics CUBE AI [24], the PULP
Platform library PULP-NN [25], and many others. More details on these
libraries are provided in Section 6.3.
inference can be highly inefficient [9, 26]. In contrast, near-sensor edge comput-
ing might provide several benefits, as long as designers have at their disposal a
set of optimization strategies that permit the efficient execution of DNNs on cost-
constrained edge devices.
First, cloud offloading might incur long and unpredictable latencies
when devices have a slow or intermittent internet connection. This might
be critical for deep learning applications with tight real-time constraints,
such as autonomous driving [26]. Local computation, instead, can be made
predictable in terms of latency much more easily. Moreover, the cloud
approach also has scalability issues, since having a large number of devices
connected to the cloud increases the network pressure, deteriorating the
quality of service. This is especially true for bandwidth-intensive inputs such
as videos. Performing the entire inference, or at least a part of it, at the edge
would reduce this bottleneck.
Besides latency and bandwidth problems, the transmission of high-
dimensional data over the network is also highly energy inefficient, espe-
cially for wirelessly connected edge devices. In comparison, an optimized
local processing can be orders of magnitude more efficient [13]. This is par-
ticularly critical given that mobile and IoT devices are mostly battery oper-
ated and run on tight energy constraints [27].
Finally, mobile and IoT applications often process sensitive data (e.g., face
detection, speech recognition) whose transmission to the cloud may raise
privacy concerns. Avoiding the transmission or performing a first
preprocessing step locally would therefore also increase security.
simplifications can be even more dramatic, ranging from input size reduc-
tion to the complete removal of several layers. Clearly, the drawback of
changing the DNN architecture is that it prevents the fine-tuning [1] of pre-
trained models. We do not describe hyper-parameter optimizations in
detail, since they are highly task-specific. However, we remark that, when-
ever they are possible for the task at hand, these should be the first optimi-
zations to be considered, as they can yield the largest complexity reduction
and efficiency improvement.
In contrast, in the rest of this section we focus on four families of general
optimization strategies that can be applied successfully to many different
DNNs: quantization, pruning, distillation, and collaborative inference. We select
these four families as they are currently the most effective and widely used by
researchers and industry. Two other interesting trends, not covered in this
chapter but worth mentioning, are filter decomposition [55] and the use of
approximate computing techniques (e.g., voltage over-scaling, approximate
functional units) [27, 56].
Importantly, the four families of optimizations described in this section,
as well as the dynamic ones treated in Section 8, are not only (almost) orthog-
onal to the details of the DNN used for inference, but also to the type of
inference hardware. This means that, although with different benefits in
terms of energy efficiency, these strategies can be applied to any of the fam-
ilies of platforms described in Section 6. As such, most of the approaches
described in the following do not try to reduce energy by lowering the power
consumption of each individual operation, which is strongly hardware-
dependent.b Instead, they exploit the fact that energy is a time-integral quan-
tity, and try to decrease it by reducing the number of operations performed,
thus being effective regardless of the underlying hardware, even if the power
of each operation remains constant. In practice, therefore, most of the tech-
niques presented in the chapter reduce either the memory occupation and
bandwidth of DNNs, thus cutting the energy associated with loading and
storing data through memory hierarchy levels, or the number and complex-
ity (precision) of the arithmetic (MAC) operations required for inference.
We select these techniques exactly for their generality. Clearly, hardware-
specific power optimizations can be combined with them, yielding even higher
energy savings; in the following, we will mention some scenarios where this
combination has been realized.
b
There are clearly exceptions to this rule. For example, integer quantization reduces both time and power
as integer ALUs are normally more power-efficient than floating point ones, for practically any hardware
platform.
7.1 Quantization
One of the most widely diffused optimization techniques for deep learning
models is quantization. It consists of reducing the precision of the DNN
weights and possibly of the activations, exploiting the resilience of DNNs
to small errors and noise [27]. Reduced precision formats for DNNs can lever-
age either floating point (such as the 16-bit minifloat, currently supported by
many deep learning oriented GPUs [57]) or integer (i.e., fixed point) arith-
metic. The former is sometimes used also to speed up training, as minifloat rep-
resentation often does not yield any accuracy loss compared to standard 32-bit
floats. Integer quantization, in contrast, is mostly used to improve the effi-
ciency and speed of the inference phase [16]. At training time, integer quanti-
zation is only (optionally) simulated, in order to account for its impact on the
network outputs, as explained in detail in the following. Since our main focus
is inference, in the following, we concentrate on integer quantization
techniques.
xmin and xmax. For model weights, this range can be obtained immediately,
since their value is fixed posttraining. Quantizing activations, instead,
requires a calibration phase to determine data ranges to be used at runtime.
This is normally done by running forward passes on a subset of the dataset and
storing the maximum and minimum values encountered for each activa-
tion tensor. This does not guarantee finding the absolute minimum and
maximum values assumed by a given tensor, but is often sufficient.
Values exceeding the expected range will be simply clipped by the clamp()
function in (15), (19), and (23). The risk of clipping is partially mitigated
by batch normalization, which keeps the values in a stable interval [16].
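A minimal sketch of asymmetric (affine) posttraining quantization with min/max calibration is shown below; it follows the generic scheme described above rather than the exact formulas (15), (19), and (23), and all names are ours:

```python
import numpy as np

def quantize(x, x_min, x_max, bits=8):
    """Asymmetric uniform quantization of x to unsigned `bits`-bit integers."""
    qmax = 2 ** bits - 1
    scale = (x_max - x_min) / qmax
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)  # the clamp() of the text
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# Calibration: run a few inputs and record the per-tensor min/max of the activations.
calib_acts = np.random.randn(100, 64)          # stand-in for recorded activation tensors
x_min, x_max = calib_acts.min(), calib_acts.max()

q, s, zp = quantize(calib_acts[0], x_min, x_max)
err = np.abs(dequantize(q, s, zp) - calib_acts[0]).max()  # quantization error
```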
Posttraining quantization does not require time-consuming training,
nor the availability of training data, except the few samples used for calibra-
tion. However, it may cause noticeable accuracy degradations on complex
tasks [66], although networks with more parameters tend to be less affected
[16].
A very effective solution to cope with these accuracy degradations is
quantization-aware training. In this approach, quantization is simulated while
the model is being trained, thus giving the DNN the opportunity to learn
how to compensate for the loss of precision. Clearly, the drawback of this
approach is that it requires an expensive training run and is only feasible
when training data are available.
During quantization-aware training, weights (and activations) are “fake-
quantized”. This means that their values are rounded in the forward pass, to
mimic low precision, but internal computations are still performed in floating
point. In the backward pass, quantized operations can be dealt with in differ-
ent ways. One classical approach is to approximate them with a straight-
through estimator [67]. This ensures that DNN outputs are produced as if data
had been quantized, while still allowing the back-propagation of small gradi-
ents, which is fundamental for training convergence. At inference time, fake
quantization operations are then removed, and the model uses actual integer
weights and activations.
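Conceptually, fake quantization rounds values in the forward pass, while the straight-through estimator treats the rounding as an identity in the backward pass; a minimal PyTorch-style sketch (class and parameter names ours):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Round to a uniform grid in the forward pass; straight-through in backward."""

    @staticmethod
    def forward(ctx, x, scale):
        # Simulate integer quantization, but keep the result in floating point.
        return torch.round(x / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend the rounding was the identity.
        return grad_output, None

w = torch.randn(4, requires_grad=True)
loss = FakeQuant.apply(w, 0.1).sum()
loss.backward()
print(w.grad)  # all ones: gradients pass straight through the rounding
```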
Quantization-aware training also requires a different procedure for esti-
mating ranges of tensors. For example, since activation values change
depending on the input, in [68] it is proposed to use an exponential moving
average to estimate their range during training. Moreover, in order to avoid
rapidly shifting activation values, their quantization is usually performed
only after a considerable number of initial training steps, so that the network
has reached a more stable state [68].
Besides the aforementioned BatchNorm layers, there are also other DNN
architecture elements that can favor the application of quantization, espe-
cially during training. An important one is the use of bounded activation func-
tions, such as the PArametrized Clipping acTivation (PACT) [69]. PACT is a
bounded ReLU variant that follows this equation:
$\mathrm{pact}(x) = \begin{cases} 0, & x \le 0 \\ x, & 0 \le x \le k \\ k, & x \ge k \end{cases}$    (24)
where k is a parameter learned during training. Clipping the maximum
activation output to k has been shown to yield significant improvements
in accuracy for a given quantization precision [69].
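In code, (24) is simply a two-sided clamp with a learnable upper bound; the sketch below (NumPy, names ours) also shows why a bounded range simplifies the choice of the quantization step:

```python
import numpy as np

def pact(x, k):
    # Bounded ReLU of (24): clip to [0, k]; k is learned during training in [69].
    return np.clip(x, 0.0, k)

# With a known upper bound, the activation quantization step is immediate:
k, bits = 6.0, 4
scale = k / (2 ** bits - 1)   # activations map directly to the range 0 .. 2^bits - 1
```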
7.1.3 Binarization
Binarization is an extreme form of quantization, in which precision is
reduced to 1 bit. Initially focusing only on weights [70], this technique
has then been extended also to activations [59]. The most common form
of binarization consists in using binary values 0 and 1 to represent integer
values −1 and +1, respectively. The conversion of a floating point number
to this format is simply obtained with the sign() function.
Binarization of both weights and activations yields extreme complexity
reductions for DNN inference, since MAC operations can be completely
eliminated and replaced by binary operations [59, 70, 71]. In particular, using
the aforementioned semantics for binary values, multiplications can be replaced
by bitwise XNORs, as shown in Table 1. The accumulation of the binarized
elements of a tensor X with N elements, instead, can be computed as follows:
$s = 2 \cdot \mathrm{popcount}(X) - N$    (25)
where popcount() is a function that counts the number of bits at 1 in X.
Table 1 Equivalence between MUL and XNOR (⊙) when the values −1 and +1
are represented by the binary 0 and 1.
X1val   X2val   X1binary   X2binary   X1val * X2val   X1binary ⊙ X2binary
−1      −1      0          0          +1              1
−1      +1      0          1          −1              0
+1      −1      1          0          −1              0
+1      +1      1          1          +1              1
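A bit-level sketch of a binarized dot product following (25) and the encoding of Table 1 is shown below; plain Python integers stand in for hardware registers, and all function names are ours:

```python
def binarize(values):
    """Pack a list of +1/-1 values into an integer bit mask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(x_bits, w_bits, n):
    """Dot product of two binarized vectors of length n via XNOR + popcount."""
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)   # bitwise XNOR, masked to n bits
    return 2 * bin(xnor).count("1") - n          # s = 2 * popcount(X) - N, eq. (25)

x = [+1, -1, +1, +1]
w = [+1, +1, -1, +1]
assert binary_dot(binarize(x), binarize(w), 4) == sum(a * b for a, b in zip(x, w))
```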
quantization with voltage and frequency scaling to achieve much more than
linear savings down to 4-bit.
Contrary to other types of quantization, binarization does not just reduce
the precision of operations, but radically changes them. Therefore, as antic-
ipated, the benefits that can be derived from it are even higher than the intu-
itive 32x reduction in model size and memory bandwidth compared to
floating point. However, obtaining these gains on general purpose hardware
is again not trivial, mainly because commercial CPUs do not offer an efficient
way to implement the popcount() operation. In contrast, recent academic pro-
cessor platforms [25] have added dedicated hardware and a corresponding
instruction for popcount. Custom accelerators for binary NNs have also been
widely explored, due to their extreme compactness and efficiency [36, 73].
Regardless of the target hardware and precision, a considerable advan-
tage of quantization lies in its orthogonality to the DNN architecture.
Indeed, although there are architectural elements that favor its application,
such as BatchNorm and bounded activations, quantization does not require
any particular model characteristic in order to work. This is one of the main
reasons why this technique has become popular and is now widely supported
by the major deep learning frameworks, both in its quantization-aware and
posttraining forms [21, 22], making it easier to implement for developers.
Clearly, what does depend on the platform are the energy and time benefits
of quantization, as described above. Therefore, the same quantized model
might lead to very different efficiency on two different hardware targets.
Quantization has been found to be very successful on convolutional neural
networks, allowing the precision to be reduced significantly with negligible
accuracy loss [16, 71]. However, sequential models are a much harder chal-
lenge [74–76]. While the research on this type of networks has not been as
extensive as the one on CNNs, current results are definitely less outstanding.
In particular, while the easiest tasks and datasets benefit from quantization,
the accuracy deterioration is noticeable on harder ones [75].
7.2 Pruning
It has been known for some time that, due to their overparametrization,
deep learning models can tolerate high levels of sparsity in their weights
[77]. This means that a large portion of the weights can assume value 0, while
still producing accurate results. Furthermore, modern DNN activations are
also inherently sparse, due to the use of functions such as ReLU (see Fig. 9),
which turn all negative inputs to 0.
Fig. 9 Activations sparsity deriving from the application of a ReLU, one of the most fre-
quently used activation functions for DNN hidden layers.
magnitude of all weights is much simpler than evaluating their saliency, thus
making these approaches much more computationally efficient at training
time. With magnitude-based pruning, the majority of the weights that can
be safely pruned with negligible impact on the final accuracy is found in fully
connected layers.
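A minimal sketch of unstructured magnitude-based pruning, which zeroes out the fraction of weights with the smallest absolute value (the sparsity level and all names are ours):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    # The mask is typically kept so that pruned weights stay frozen at zero
    # during a subsequent fine-tuning step.
    return weights * mask, mask

w = np.random.randn(256, 128)          # stand-in for a fully connected layer
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print("sparsity:", 1.0 - mask.mean())  # ~0.9
```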
Both magnitude- and saliency-based pruning simply try to maximize the
number of 0-weights while minimizing the accuracy drop. Other works,
however, have shown that this does not always correspond to the optimal
solution when the target is energy minimization. For instance, the authors
of [79] have shown that, in classical models such as AlexNet [6], most of the
energy for inference is consumed by convolutional layers, and not by fully
connected ones. Therefore, they have introduced energy-driven pruning, in
which the energy impact of each weight is estimated to select the optimal
pruning location, based on the consumption of different layers [79].
• Nonzero elements relative to the 1st matrix row start at index 0 of the
other two arrays and end before index 2.
• Elements relative to the 2nd matrix row start at index 2 and end before
index 2 (meaning that this row does not contain any nonzero value).
• Elements relative to the 3rd matrix row start at index 2 and end before
index 4.
CSR decoding is efficient if the matrix is read in row-major order. In fact,
row pointers can be accessed in constant time based on the row index, and
reconstructing the entire row is linear in the number of nonzero elements.
The problem with this format, however, arises when trying to skip compu-
tations related to zero-weights, and in particular when accessing the activa-
tions tensor with which the sparse matrix is multiplied [81]. In fact, since
each CSR matrix row has nonzero elements in different positions, the
corresponding activations must be accessed multiple times with a sparse pat-
tern. Alternatively, the whole vector has to be loaded at once, but depending
on its size, this might not be feasible for memory-constrained devices [80].
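The row-major CSR traversal described above can be sketched as follows; the row pointers match the example in the text, while the values and column indices are made up for illustration:

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x with the sparse matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for r in range(n_rows):
        # Nonzeros of row r live in values[row_ptr[r]:row_ptr[r+1]].
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[k] * x[col_idx[k]]   # sparse (indirect) access to x
    return y

# Row pointers follow the 3-row example in the text; values/columns are illustrative.
values  = np.array([1.0, 2.0, 3.0, 4.0])
col_idx = np.array([0, 2, 1, 2])
row_ptr = np.array([0, 2, 2, 4])
print(csr_matvec(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))
```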
Compressed sparse column (CSC) solves this problem using the opposite
approach with respect to CSR, i.e., storing row indices and column
pointers. This allows the matrix to be read by column when performing mul-
tiplications, and eliminates the problem of multiple accesses to the input acti-
vations. In fact, in a matrix-vector product, each matrix column is multiplied
with the same input element, thus the input activations are guaranteed to be
read at most once and in order. However, it creates an analogous problem
for the output vector, which has to be either stored as a whole at the end of
the product, or accessed multiple times in a sparse way [80]. Nonetheless,
CSC becomes preferable to CSR when the size of the output is smaller than
the size of the input [80], which is often true for deep learning models.
Both CSC and CSR matrices are built based on the output of unstructured
pruning algorithms, in which weights can be zeroed-out at arbitrary loca-
tions in the weights matrix. This complicates the optimization from a hard-
ware point of view. In fact, virtually all hardware platforms for deep learning
use some form of parallel computation, from SIMD/SIMT operations in
CPUs and GPUs, to systolic processing elements in accelerators [34].
With unstructured pruned formats such as CSC/CSR, however, each
atomic computation step (e.g., the multiplication of a portion of a weights
matrix row with a portion of the activation vector) may require a different
number of operations, depending on the number of nonzero values
involved. Therefore, if computations involving zero-values are skipped, it
becomes hard to fully exploit the available parallel hardware [80].
one time and split into banks as well, this reorganization makes it possible to perform
interbank parallelization, where weights from different banks are simulta-
neously multiplied with the corresponding activations. The fact that each
bank contains exactly the same number of elements ensures that parallel
hardware is fully utilized. This improved parallelism comes with minimal
costs in terms of accuracy compared to an unstructured CSR approach,
for the same sparsity level [83].
measures the difference between the student network’s hard predictions and
the reduced dataset’s true labels. The teacher loss, instead, measures the dis-
tance between the soft predictions of the student and those of the teachers.
Therefore, it measures how different the predictions of the student are from
the ones of the bigger network. When teachers are an ensemble of models,
the teacher loss usually employs the geometric mean of their predictions.
This second loss is computed using soft predictions with T > 1 because this
produces a softer, less peaked probability distribution. Less extreme
values are in fact more informative and can be more easily learned by the
student model [84].
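A sketch of the combined distillation loss, with a hard cross-entropy term against the true label and a soft term comparing student and teacher predictions at temperature T > 1 (the weighting factor alpha and all names are our assumptions):

```python
import numpy as np

def softmax(z, T=1.0):
    z = (z - z.max(axis=-1, keepdims=True)) / T
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    # Hard (student) loss: cross-entropy against the true label.
    hard = -np.log(softmax(student_logits)[label])
    # Soft (teacher) loss: cross-entropy between softened teacher and student predictions.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student))
    return alpha * hard + (1.0 - alpha) * soft

loss = distillation_loss(np.array([1.0, 0.2, -0.5]),
                         np.array([2.0, 0.1, -1.0]), label=0)
```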
Recent works have proposed variations to the basic architecture of Fig.
14, such as using multiple connections (or bridges) to enforce similar out-
puts between student and teacher at different layers, aside from the output
[86]. Specifically, some hidden layers from the teacher network are chosen
to “guide” the learning of others belonging to the student model. If the two
layer sizes are different, an additional linear layer is added in-between the
two to match the dimensions. More advanced applications of distillation
have also been proposed in literature [87, 88].
The impact of distillation on energy efficiency is evident. At inference
time, only the distilled student network will be used to process inputs, thus
significantly reducing both the model size and the number of operations per
input. However, the degree to which this network shrinking can be applied
clearly depends on the complexity of the problem. Recent works have
shown that when the sizes of student and teachers differ greatly, the perfor-
mance drops significantly [89, 90]. In fact, the student network can only
learn up to a certain extent from the teacher, becoming unable to mimic
networks with too many parameters [90]. To tackle this problem, an addi-
tional intermediate size network called teacher assistant has been proposed in
[89] in order to have a multiple-step distillation.
Network distillation has been shown to work equally effectively on both
feed-forward and sequential models. For example, a distilled version of
BERT [7], one of the state-of-the-art NLP models, called DistilBERT
[19] has been recently proposed. It manages to obtain 97% of the accuracy
of the full model while performing inference 60% faster.
networks this may still be unfeasible due to memory and performance limi-
tations [26]. Therefore, an interesting research branch seeks a compromise
between the benefits of edge and cloud computing by means of so-called col-
laborative inference, which consists in distributing the inference computation
among multiple devices (e.g., edge nodes and cloud servers) [91].
One basic form of collaborative inference is proposed in [93], where the
authors suggest preprocessing images used as inputs for a CNN at the edge,
before running the actual inference in the cloud. Specifically, they propose
to discard blurry images, since they will not be useful for the neural network,
thus reducing the total time and energy spent transmitting raw data. A more
advanced collaborative inference framework is presented in [91], where the
authors propose to split the execution of a CNN between edge and cloud
in a layer-wise fashion as shown in Fig. 15A. Specifically, the first layers are
executed on the edge device, while the last ones are computed in the cloud,
based on the observation that intermediate layer outputs (e.g., after a
pooling) are smaller in size compared to raw inputs in many DNNs.
Therefore, they show that computing a few layers at the edge and sending
the resulting activations to the cloud is often the optimal approach in terms
Fig. 15 An overview of Neurosurgeon [91] (A) and BottleNet [92] (B), two collaborative
inference frameworks. Both compute the first layers locally, transmitting their output to
the cloud where the final result is calculated and then sent back.
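To make the idea concrete, the following sketch mimics the layer-wise split of Fig. 15A: the first layers run on the edge device and the (smaller) intermediate activations are shipped to the cloud; the split point, the toy layers, and the transmit function are all placeholders:

```python
import numpy as np

def run_layers(layers, x):
    for f in layers:
        x = f(x)
    return x

def split_inference(layers, x, split_point, transmit):
    """Run layers[:split_point] locally, send the activations, finish remotely."""
    edge_out = run_layers(layers[:split_point], x)    # on the edge device
    payload = transmit(edge_out)                      # much smaller than the raw input
    return run_layers(layers[split_point:], payload)  # on the cloud server

# Toy model: each "layer" halves the feature size, mimicking pooling.
layers = [lambda x: x.reshape(-1, 2).mean(axis=1) for _ in range(4)]
result = split_inference(layers, np.random.randn(64), split_point=2,
                         transmit=lambda t: t)  # stand-in for the wireless link
```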
first network as correct. In the first case, this would result in an energy waste,
while in the second it would lower the accuracy of the system.
The score margin [98] has been proposed as an effective classification con-
fidence estimate, based on the class probabilities produced by the network’s
output layer. Specifically, the score margin computes the difference between
the largest two of these probabilities. If this difference is large, it means that
the network produced a high probability only for one class, and therefore it is
highly confident that the input belongs to it. On the other hand, a small dif-
ference means that there are at least two classes to which the input could
belong with similar probability, according to the prediction of the DNN,
hence the confidence is low. In summary, the score margin method activates
the “big” network whenever the difference between the top-2 probabilities
produced by the “little” network is smaller than a threshold th. Finding the
ideal threshold for a given accuracy level is not an
easy task. The authors of [98] propose to use a fine-tuning step to find the best
th for a given pair of networks and a dataset [98]. Alternatively [27], th can also
be tuned at runtime based on external conditions, e.g., increasing it when the
battery level is low to save more energy.
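A sketch of the resulting big/little decision rule based on the score margin; the two networks and the threshold value are placeholders:

```python
import numpy as np

def score_margin(probs):
    """Difference between the two largest class probabilities."""
    top2 = np.sort(probs)[-2:]
    return top2[1] - top2[0]

def big_little_inference(x, little, big, th=0.3):
    probs = little(x)
    if score_margin(probs) < th:   # low confidence: fall back to the "big" model
        probs = big(x)
    return int(np.argmax(probs))

# Toy stand-ins for the two networks.
little = lambda x: np.array([0.4, 0.35, 0.25])   # uncertain prediction
big    = lambda x: np.array([0.1, 0.85, 0.05])
print(big_little_inference(None, little, big))   # 1: the big model is consulted
```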
The score margin method is effective as long as the “little” model’s out-
puts actually resemble the probability of an input belonging to a given class,
which is not guaranteed for black-box models such as DNNs. In particular,
it has been shown that modern DNNs estimate confidence probabilities less
reliably [104]. In fact, their increased depth positively impacts their accuracy,
but negatively affects their capability to predict the likelihood with which a
given input belongs to a class. In practice, modern models tend to be over-
confident even for inputs that are not actually classified correctly. For a big/
little system, this makes it harder to find the optimal score margin threshold.
To mitigate this problem, one solution is to use so-called calibration tech-
niques, that make DNN scores more similar to actual probabilities, at the
cost of some accuracy degradation [104].
The major disadvantage of big/little DNNs, however, is that they
require a double effort at training time and they result in an increased model
size. It is in fact necessary to separately train two different models, each one
with its own hyper-parameters to be selected. After training, then, the
weights of both models have to be stored on the inference device, which
might not be possible on memory-constrained edge platforms.
Fig. 17 Simplified view of the dynamic inference technique proposed in [97]. The black
portion of the network is executed for all inputs, while the gray part is only activated for
difficult data.
quantization configurations for a given DNN, i.e., those that yield an inter-
esting trade-off in terms of accuracy and energy. Then, a single posttraining
quantization is performed, targeting the largest bit-width, and lower preci-
sion weights are simply obtained by truncation/rounding. This removes the
need for multiple sets of weights, while also not requiring any training.
Therefore, the approach is also applicable when training data are not avail-
able. Moreover, it is complementary to the previous one and can be used in
conjunction with it to generate even more variants of the same DNN.
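For illustration, the sketch below derives a lower-precision variant from a single 8-bit quantization by truncation (arithmetic right shift); this is a conceptual example under symmetric-quantization assumptions, not the exact scheme of the approach described above:

```python
import numpy as np

def truncate_weights(q8, scale8, target_bits):
    """Derive `target_bits`-bit weights from 8-bit quantized ones by truncation."""
    shift = 8 - target_bits
    # Floor division acts as an arithmetic right shift; NumPy has no int4 type,
    # so int8 is used to store the narrower values.
    q_low = (q8.astype(np.int16) // (1 << shift)).astype(np.int8)
    return q_low, scale8 * (1 << shift)   # the scale grows by the same factor

q8 = np.array([120, -64, 7, -128], dtype=np.int8)
q4, scale4 = truncate_weights(q8, scale8=0.02, target_bits=4)
```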
Fig. 20 Staged inference medical diagnosis based on wearable sensors data proposed
in [4]. Easy inputs are classified directly on the edge device, while harder ones are sent to
the cloud for further (and more computationally expensive) analysis.
c
Notice that the approach of Fig. 20 is also collaborative, as it involves edge and cloud. However, in that
case, the two devices execute different (portions of ) models, while standard collaborative inference is
based on a single model.
Inference time and energy are estimated via linear regression models, based
on the results of the aforementioned characterization. Such dynamic par-
titioning significantly outperforms both edge-only and cloud-only inference
in terms of energy efficiency, for several NLP applications.
Importantly, the engine in [101, 106] maps the entire RNN execution onto
one of the two platforms, rather than partitioning it as typically done for
CNNs [91, 92]. This is because, for most RNN applications, input sizes
are much smaller than for CNNs, and most importantly they are smaller than
hidden layer outputs. This eliminates the data compression advantage deriv-
ing from partial local processing described in Section 7.4. Indeed, the authors
show that, for NLP tasks, the total communication time is dominated by the
round-trip network latency, which is independent of data size.
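The mapping decision itself then reduces to comparing per-input cost estimates; the sketch below uses made-up regression coefficients and is not the actual engine of [101, 106]:

```python
def choose_target(input_len, edge_coef=(5.0, 0.8), cloud_coef=(40.0, 0.05)):
    """Pick edge or cloud based on linear-regression energy estimates (coefficients are made up)."""
    edge_energy = edge_coef[0] + edge_coef[1] * input_len     # grows with sequence length
    cloud_energy = cloud_coef[0] + cloud_coef[1] * input_len  # dominated by round-trip cost
    return "edge" if edge_energy <= cloud_energy else "cloud"

print(choose_target(10))   # short input: edge
print(choose_target(200))  # long input: cloud
```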
Fig. 21 On the top, a standard beam search with a fixed beam size of 2. On the bottom,
its dynamic version, where the Sel. Policy chooses the best beam width to be used in the
following step.
References
[1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
https://doi.org/10.1038/nature14539, http://www.nature.com/articles/nature14539.
[2] J. Wang, Y. Ma, L. Zhang, R.X. Gao, D. Wu, Deep learning for smart manufacturing:
methods and applications, J. Manufact. Syst. 48 (2018) 144–156.
[3] J. Ker, L. Wang, J. Rao, T. Lim, Deep learning applications in medical image analysis,
IEEE Access 6 (2017) 9375–9389.
[4] M. Parsa, P. Panda, S. Sen, K. Roy, Staged inference using conditional deep learning
for energy efficient real-time smart diagnosis, in: 2017 39th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society (EMBC),
IEEE, 2017, pp. 78–81.
[5] A. Kamilaris, F.X. Prenafeta-Boldú, Deep learning in agriculture: a survey, Comput.
Electron. Agric. 147 (2018) 70–90.
[6] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep con-
volutional neural networks, in: Advances in Neural Information Processing
Systems, 2012, pp. 1097–1105.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirec-
tional transformers for language understanding, Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies 1 (2019) 4171–4186.
[8] L. Deng, G. Hinton, B. Kingsbury, New types of deep neural network learning
for speech recognition and related applications: an overview, in: 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013,
pp. 8599–8603.
[9] V. Sze, Y.H. Chen, T.J. Yang, J.S. Emer, Efficient processing of deep neural networks:
a tutorial and survey. Proc. IEEE 105 (12) (2017) 2295–2329. ISSN: 15582256.
https://doi.org/10.1109/JPROC.2017.2761740.
[10] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[11] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines,
in: Proceedings of the 27th International Conference on Machine Learning
(ICML-10), 2010, pp. 807–814.
[12] A.H. Namin, K. Leboeuf, R. Muscedere, H. Wu, M. Ahmadi, Efficient hardware imple-
mentation of the hyperbolic tangent sigmoid function. in: Proceedings—IEEE
International Symposium on Circuits and Systems, ISSN 02714310, IEEE, 2009, ISBN:
9781424438280, pp. 2117–2120. https://doi.org/10.1109/ISCAS.2009.5118213.
[13] L. Benini, Plenty of room at the bottom? Micropower deep learning for cognitive
cyber physical systems. in: 2017 7th IEEE International Workshop on Advances in
Sensors and Interfaces (IWASI), IEEE, 2017, p. 165. https://doi.org/10.1109/iwasi.2017.7974239.
[14] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard,
L.D. Jackel, Backpropagation applied to handwritten zip code recognition. Neural
Comput. 1 (4) (1989) 541–551. https://doi.org/10.1162/neco.1989.1.4.541.
[15] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reduc-
ing internal covariate shift, in: Proceedings of the 32nd International Conference on
Machine Learning (ICML), 2015, pp. 448–456.
[16] R. Krishnamoorthi, Quantizing deep convolutional networks for efficient inference: a
whitepaper, arXiv preprint arXiv:1806.08342 (2018), http://arxiv.org/abs/1806.08342.
[17] S. Santurkar, D. Tsipras, A. Ilyas, A. Madry, How does batch normalization help opti-
mization? in: Advances in Neural Information Processing Systems, ISSN 10495258,
vol. 2018, 2018, pp. 2483–2493.
[18] L. Lai, N. Suda, V. Chandra, CMSIS-NN: efficient neural network kernels for
arm cortex-M CPUs, arXiv preprint arXiv:1801.06601 (2018), http://arxiv.org/
abs/1801.06601.
[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT:
smaller, faster, cheaper and lighter, in: Proceedings of the 5th Workshop on Energy
Efficient Machine Learning and Cognitive Computing (EMC2), 2019, pp. 1–5.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, in: Advances in Neural Information
Processing Systems, ISSN 10495258, vol. 2017-December, 2017, pp. 5999–6009.
[37] A. Shawahna, S.M. Sait, A. El-Maleh, FPGA-based accelerators of deep learning net-
works for learning and classification: a review. IEEE Access 7 (2019) 7823–7859.
ISSN: 21693536. https://doi.org/10.1109/ACCESS.2018.2890150.
[38] J.E. Stone, D. Gohara, G. Shi, OpenCL: a parallel programming standard for hetero-
geneous computing systems, Comput. Sci. Eng. 12 (3) (2010) 66–73.
[39] B. Moons, R. Uytterhoeven, W. Dehaene, M. Verhelst, DVAFS: trading computa-
tional accuracy for energy through dynamic-voltage-accuracy-frequency-scaling.
in: Proceedings of the 2017 Design, Automation and Test in Europe, DATE 2017,
IEEE, 2017, ISBN: 9783981537093, pp. 488–493. https://doi.org/10.23919/DATE.
2017.7927038.
[40] D. Jahier Pagliari, E. Macii, M. Poncino, Automated synthesis of energy-efficient
reconfigurable-precision circuits, IEEE Access 7 (2019) 172030–172044.
[41] D. Jahier Pagliari, M. Poncino, Application-driven synthesis of energy-efficient
reconfigurable-precision operators, in: 2018 IEEE International Symposium on
Circuits and Systems (ISCAS), IEEE, 2018, pp. 1–5.
[42] NVIDIA, Jetson TX Developer Kit, 2015, https://www.nvidia.com/it-it/autonomous-
machines/embedded-systems/.
[43] A. Thomas, Y. Guo, Y. Kim, B. Aksanli, A. Kumar, T.S. Rosing, Hierarchical and
distributed machine learning inference beyond the edge. in: Proceedings of the
2019 IEEE 16th International Conference on Networking, Sensing and Control,
ICNSC 2019, IEEE, 2019, ISBN: 9781728100838, pp. 18–23. https://doi.org/
10.1109/ICNSC.2019.8743164.
[44] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
E. Shelhamer, cuDNN: efficient primitives for deep learning, arXiv preprint
arXiv:1410.0759 (2014), http://arxiv.org/abs/1410.0759.
[45] D. Kirk, NVIDIA CUDA software and GPU parallel computing architecture.
in: International Symposium on Memory Management, ISMM, Vol. 7, 2007,
ISBN: 9781595938930, p. 103. https://doi.org/10.1145/1296907.1296909.
[46] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini, A. Marongiu, Energy efficient par-
allel computing on the PULP platform with support for OpenMP. in: 2014 IEEE 28th
Convention of Electrical and Electronics Engineers in Israel, IEEEI 2014, 2014, ISBN:
9781479959877, pp. 1–5. https://doi.org/10.1109/EEEI.2014.7005803.
[47] GAP8—The IoT Application Processor, https://greenwaves-technologies.com/ai_
processor_gap8/, (Accessed May, 2020).
[48] A. Burrello, F. Conti, A. Garofalo, D. Rossi, L. Benini, Work-in-progress: dory: light-
weight memory hierarchy management for deep NN inference on iot endnodes.
in: Proceedings of the International Conference on Hardware/Software Codesign
and System Synthesis Companion, CODES/ISSS 2019, IEEE, 2019, ISBN:
9781450369237, pp. 1–2. https://doi.org/10.1145/3349567.3351726.
[49] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, P.A. Muller, Deep learning for
time series classification: a review. Data Mining and Knowledge Discovery 33 (4)
(2019) 917–963. ISSN: 1573756X. https://doi.org/10.1007/s10618-019-00619-1.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hier-
archical image database. in: 2009 IEEE Conference on Computer Vision and Pattern
Recognition, IEEE, 2009, pp. 248–255. https://doi.org/10.1109/cvprw.2009.5206848.
[51] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer,
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model
size, arXiv preprint arXiv:1602.07360 (2016). http://arxiv.org/abs/1602.07360.
[52] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for
mobile vision applications, arXiv preprint arXiv:1704.04861 (2017), http://arxiv.
org/abs/1704.04861.
[98] E. Park, D. Kim, S. Kim, Y.D. Kim, G. Kim, S. Yoon, S. Yoo, Big/little deep neural
network for ultra low power inference. in: 2015 International Conference on
Hardware/Software Codesign and System Synthesis, CODES+ISSS 2015, IEEE,
2015, ISBN: 9781467383219, pp. 124–132. https://doi.org/10.1109/CODESISSS.
2015.7331375.
[99] S. Teerapittayanon, B. McDanel, H.T. Kung, BranchyNet: fast inference via early exiting
from deep neural networks. in: Proceedings—International Conference on Pattern
Recognition, ISSN 10514651, IEEE, 2016, ISBN: 9781509048472, pp. 2464–2469.
https://doi.org/10.1109/ICPR.2016.7900006.
[100] X. Wang, F. Yu, Z.Y. Dou, T. Darrell, J.E. Gonzalez, SkipNet: learning dynamic routing
in convolutional networks. in: Lecture Notes in Computer Science (Including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN
16113349, vol. 11217 LNCS, 2018, ISBN: 9783030012601, pp. 420–436. https://
doi.org/10.1007/978-3-030-01261-8_25.
[101] D. Jahier Pagliari, R. Chiaro, Y. Chen, E. Macii, M. Poncino, Optimal input-
dependent edge-cloud partitioning for RNN inference, in: 2019 26th IEEE
International Conference on Electronics, Circuits and Systems (ICECS), IEEE,
2019, pp. 442–445.
[102] D. Jahier Pagliari, F. Panini, E. Macii, M. Poncino, Dynamic beam width tuning for
energy-efficient recurrent neural networks, in: Proceedings of the 2019 on Great Lakes
Symposium on VLSI, 2019, pp. 69–74.
[103] D. Jahier Pagliari, F. Daghero, M. Poncino, Sequence-to-sequence neural networks
inference on embedded processors using dynamic beam search. Electronics
(Switzerland) 9 (2) (2020) 337. ISSN: 20799292. https://doi.org/10.3390/
electronics9020337.
[104] C. Guo, G. Pleiss, Y. Sun, K.Q. Weinberger, On calibration of modern neural
networks, in: 34th International Conference on Machine Learning, ICML 2017,
vol. 3, JMLR.org, 2017, ISBN: 9781510855144, pp. 2130–2143.
[105] A. Graves, Adaptive computation time for recurrent neural networks, arXiv preprint
arXiv:1603.08983 (2016).
[106] D. Jahier Pagliari, R. Chiaro, Y. Chen, S. Vinco, E. Macii, M. Poncino, Input-
dependent edge-cloud mapping of recurrent neural networks inference, in: 2020
57th ACM/EDAC/IEEE Design Automation Conference (DAC), 2020, pp. 1–6.
[107] M. Mejia-Lavalle, C.G.P. Ramos, Beam search with dynamic pruning for artificial
intelligence hard problems, in: 2013 International Conference on Mechatronics,
Electronics and Automotive Engineering, IEEE, 2013, pp. 59–64.
[108] M. Freitag, Y. Al-Onaizan, Beam search strategies for neural machine translation,
in: Proceedings of the First Workshop on Neural Machine Translation, 2017,
pp. 56–60.
[109] C.R. Banbury, V.J. Reddi, M. Lam, W. Fu, A. Fazel, J. Holleman, X. Huang,
R. Hurtado, D. Kanter, A. Lokhmotov, D. Patterson, D. Pau, J.-s. Seo, J. Sieracki,
U. Thakker, M. Verhelst, P. Yadav, Benchmarking TinyML systems: challenges
and direction, arXiv preprint arXiv:2003.04821 (2020).