Thesis 2
St Edmund’s College
This dissertation is submitted in September 2021 for the degree of Doctor of Philosophy
Declaration
This dissertation is the result of my own work and includes nothing which is the outcome
of work done in collaboration except as declared in the Preface and specified in the text. It
is not substantially the same as any that I have submitted, or am concurrently submitting,
for a degree or diploma or other qualification at the University of Cambridge or any other
University or similar institution except as declared in the Preface and specified in the text.
I further state that no substantial part of my dissertation has already been submitted, or
is being concurrently submitted, for any such degree, diploma or other qualification at the
University of Cambridge or any other University or similar institution except as declared
in the Preface and specified in the text. This dissertation does not exceed the prescribed
limit of 60 000 words.
Deep Neural Networks (DNNs) offer state-of-the-art performance in many domains but
this success comes at the cost of high computational and memory resources. Since DNN
inference is now a popular workload on both edge and cloud systems, there is an urgent
need to improve its energy efficiency. The improved efficiency enables the use of DNNs in
a broader range of target markets and helps boost performance, as performance is often
limited by power today. This thesis makes a number of contributions that help improve
DNN inference performance on different hardware platforms using a hardware-software
co-design approach.
I first show a number of software optimisation techniques for reducing DNN run-time
costs. Overall, I demonstrate model sizes can be reduced by up to 34×. A combination
of different styles of neural network compression techniques offers multiplicative gains
in shrinking memory footprints. These techniques are suitable for running DNN
inference on memory-sensitive edge devices. Using the run-time and data-dependent
feature information, I develop a dynamic pruning strategy that outperforms existing static
pruning methods by a significant margin. The proposed dynamic pruning not only reduces
the model sizes but also the number of multiply-accumulate operations for GPUs. I also
introduce a novel quantisation mechanism that is tuned to fit the natural distributions
of model parameters; this method decreases the total number of bit-wise operations
required for DNN inference.
I then focus on accelerating DNNs using custom hardware. I build a framework named
Tomato that generates multi-precision and multi-arithmetic hardware accelerators on
FPGA devices. The software-hardware co-generation flow deploys hardware accelerators
from high-level neural network descriptions, and exploits the hardware reconfigurability of
FPGA devices to support a flexible per-layer quantisation strategy. I then demonstrate that
the automatically generated accelerators outperform their closest FPGA-based competitors
by at least 2 to 4× for latency and throughput. The accelerator generated for the ImageNet
classification runs at a rate of more than 3000 frames per second with a latency of only
0.32 ms, making it a suitable candidate for latency-critical, high-throughput inference in
the cloud.
Finally, I show how automated machine learning techniques can be improved with
hardware-awareness to produce efficient network architectures for emerging types of neural
networks and new learning problem setups. Hardware-aware network architecture search
(NAS) is able to discover more power-efficient network architectures and achieve significant
computational savings on emerging neural network types such as graph neural networks.
The proposed Low Precision Graph Network Architecture Search improves the size-accuracy
Pareto frontier when compared to seven manual and NAS-generated baselines on eight
different graph datasets. In addition, I demonstrate hardware-aware NAS can be applied
to a many-task many-device few-shot learning scenario. In popular few-shot learning
benchmarks with various hardware platforms and constraints, the proposed approach
outperforms a variety of NAS and manual baselines by a significant margin. On the
5-way 1-shot Mini-ImageNet classification task, the proposed method outperforms the
best manual baseline by 5.21% in accuracy while using 60% less computation.
Acknowledgements
Many say pursuing a PhD is a lonely endeavour, but throughout my PhD I have received a
great deal of support that made this journey an unforgettable and fun one for me.
I will start by acknowledging and giving my warmest thanks to my supervisor, Prof.
Robert Mullins. His support went far beyond looking at our paper drafts and discussing
research ideas. I have been incredibly lucky to have him as my PhD supervisor
and will always be truly thankful for the amount of research freedom that he allowed me
in my PhD studies.
I would also like to thank my PhD examiners, Prof. Nic Lane and Dr. Elliot Crowley, for
their fruitful discussion and insightful questions during the viva. These suggestions have
definitely helped me to improve my thesis.
I would also give special thanks to my parents (Mr. Qingjie Zhao and Mrs. Shihong
Li) and my partner (Ms. Shu Zhong). I feel lucky to be in a super supportive family, and
I am grateful to my parents for their continuous support and love. I could not have achieved any
of this without your support.
I owe special thanks to all my research collaborators, both in and out of the Computer
Lab in Cambridge. I will always remember the time and effort that they put into these
fun research projects. I also owe a lot of thanks to all my internship mentors over various
summers. I also give my deepest thanks to all my friends, you know who you are. I would
not have got through it without you.
Contents
1 Introduction 15
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Research questions and hypotheses . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Publication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Background 23
2.1 An overview of deep neural networks and their computational requirements 23
2.2 Deep neural network structures . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Neural network compression and automated machine learning . . . . . . . 28
2.3.1 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Quantisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.3 Automated machine learning and network architecture search . . . 31
2.4 Neural network hardware accelerators . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 From general-purpose processors to custom hardware accelerators . 34
2.4.2 Processing elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.3 Data reuse models . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.4 A comparison of DNN hardware accelerators . . . . . . . . . . . . . 37
6 Conclusion 111
6.1 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.1 Software-hardware co-design network architecture search . . . . . . 113
6.1.2 General-purpose AI accelerator . . . . . . . . . . . . . . . . . . . . 114
6.1.3 Large AI chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.4 Handling complex data types in Machine Learning using custom
hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
B Graph dataset details, graph sampler configurations and the grid search
setup for baselines 119
Bibliography 123
Chapter 1
Introduction
1.1 Motivation
Deep Neural Networks (DNNs) are now a popular technique for solving problems in
many fields. Previously, building an object detector or a dialogue system required various
sophisticated components and hand-crafted feature extraction by human experts. With
the success of DNNs, these complicated building blocks have been replaced by one or more
learning systems [72, 93]. Since many real-life problems can be viewed as learning from
labelled or unlabelled data, the application of DNNs is now popular in domains like natural
language processing [162], computer vision [93], decision making [121], autonomous driving
[138], etc. DNNs are not only broadly adopted in computer science but also show potential
in many other fields for accelerating fundamental scientific discoveries, e.g., DNNs have
discovered a new antibiotic [154] and have been used to successfully predict the structure
of a vast number of proteins from their amino acid sequences [141]. From the hardware
perspective, neural networks are becoming a major workload on both power-sensitive
embedded devices and large distributed cloud systems.
The rise of DNNs helped researchers to achieve high accuracy on many datasets and
showed some interesting trends in terms of their computation requirements. AlexNet, an
early model designed by Krizhevsky et al. for the 2012 ImageNet competition, trained
on 1.2 million images and was used for classification with 1000 categories [93]. The 1000
classes include everyday objects (cat, dog, leopard, etc.). The proposed AlexNet utilises
around 60 million parameters and around 1.6 billion floating-point operations (FLOPs)
for a single inference run. The GPT2 language model, proposed in late 2019, is trained to
perform next word prediction but can be further fine-tuned to many downstream language
tasks. The model uses around 1.5 billion parameters and 3.4 trillion FLOPs to produce a
single inference result and was trained on around 8 million documents. Machine learning is
tackling harder and bigger problems and is using more and more data. This in turn requires
much larger networks (e.g. GPT2 is around 25× larger than AlexNet) and greater computational
complexity (e.g. GPT2 requires around 2000× more computation than AlexNet),
posing an enormous challenge for executing these workloads on today’s or future hardware
systems.
The fast-growing computational requirements of Machine Learning (ML) applications
pose a real challenge to existing hardware systems. The possibility that hardware may limit
what ML is able to achieve, e.g. due to power, computation speed or memory bandwidth, is
obviously a concern. The deep learning community faced the second artificial intelligence
winter in the late 1990s. A major reason for the recession was insufficient computational
power: 1) computers at that time were not explicitly designed for massively parallel
workloads such as neural networks, and 2) the CMOS technology of the day (e.g. the number, speed and power
of transistors) placed a fundamental limit on computational capabilities.
Machine Learning algorithms consequently did not have a chance to show their
performance on large-scale problems. It was a lesson that when hardware
design and technology scaling cannot keep up with computational requirements,
novel research is hard to conduct and the whole field moves slowly because of the
computation bottleneck. Any advances in the efficiency of ML hardware, both from
the hardware architecture design and CMOS scaling, will allow us to apply ML methods
to a broader range of more complex problems.
Meanwhile, challenges also exist in terms of technology scaling: we are reaching,
or may already have reached, the post-Moore’s-law era [140], and the speed and capability of our
computers are no longer growing as fast as they used to. Although transistors keep getting smaller
with CMOS scaling and transistor density grows, power and wire delays
have become major limiting factors [43]. In addition, CMOS scaling has slowed
down significantly compared to the past decade. From the hardware architecture design
point of view, meeting the ever-increasing computation requirements of more
and more demanding ML workloads relies on the invention of smart hardware architectures.
From the system design point of view, future systems are going to be heterogeneous, and
these systems will have to reach better performance within a fixed power budget. Future
systems will require an integration of various specialised hardware, and ML accelerators
are expected to play a major role in such systems. Given the growing need of applying
Deep Learning to different applications and the potential limitation of CMOS scaling, this
dissertation aims at exploring possible ways of reducing the computational requirements
of modern DNNs.
hardware friendly with a given accuracy budget.
• Given the increase in the number of distinct types of neural networks (e.g. con-
volutional neural network, recurrent neural network, graph neural network, etc.)
and different learning setups (e.g. few-shot learning, federated learning, etc.), is
it possible to have a hardware-aware Network Architecture Search framework for
emerging networks or new learning setups?
1.3 Contributions
This dissertation makes a number of contributions in the field of efficient deep neural
network inference.
First, I create and evaluate a range of network compression techniques. These methods
are implemented with an open-source software framework (Mayo). The implemented
compression techniques reduce either the number of parameters or the number of bits
required to represent parameters of a neural network. The open-source framework allows
users to apply different compression techniques to arbitrary tensor components in a
neural network, which simplifies the evaluation of different styles of network compression
techniques [201, 126]. During the development of the framework, I also demonstrate
that a combination of compression techniques normally significantly outperforms any one
technique alone, offering a large combinatorial design space for compression techniques.
Second, a novel run-time pruning technique that reduces the computing cost of convolu-
tions with minimal neural network architectural modification. The pruning method explores
feature saliency at run-time and is called Feature Boosting and Suppression (FBS)
[50]. This early work on dynamic computation in deep neural networks later encouraged
exploration of more algorithmic optimisations focusing on the same problem [77, 165, 167]
and also CNN hardware accelerators supporting run-time channel skipping [76].
Third, an efficient distribution-aware quantisation and a new general framework of
quantising sparse neural networks based on the minimum description length principle
[202]. This quantisation illustrates that custom hardware has the ability to produce
a better size-accuracy trade-off by exploiting more flexible hardware operators. In
addition, an efficient combination of different compression techniques (pruning and a
custom quantisation) can significantly improve the compression performance.
Fourth, an automated framework for generating multi-precision and multi-arithmetic
accelerators on custom hardware [203]. Here, instead of using a single large systolic
array core as in most neural network accelerators, I exploit a deeply pipelined streaming
architecture with multiple small cores that has the advantage of not temporally sharing
computing elements for different neural network layers. The streaming processing design
offers flexibility in the arithmetic space since the hardware can now have different layer-
wise quantisations. This framework is an example of how, if the hardware provides
enough flexibility, its corresponding neural network model can have more design freedom
and thus a higher re-trained accuracy.
Fifth, a hardware-aware Network Architecture Search (NAS) flow for graph neural
networks [205, 204]. The proposed NAS flow is not only gradient-based, integrating with
standard Stochastic Gradient Descent training, but also hardware-aware, so that it can produce
highly quantised graph neural network models. The NAS-generated networks can be
viewed as a result of a joint-optimisation for both model architectures and quantisation
choices. The proposed NAS flow is a scalable approach for future new types of neural
networks that are going to be deployed with critical model size and latency constraints.
Finally, a general NAS framework for many-task many-device few-shot learning. The
many-task many-device scenario considers deployments of neural network models to different
learning tasks (normally networks of the same style that operate on different datasets),
and deployments of these models on various hardware systems with different run-time
constraints (e.g. latency, storage size, etc.). I then use a classic few-shot learning setup
as a testbed. The empirical results demonstrate that the proposed NAS framework can
produce optimised models for each task-device pair and it outperforms manually designed
baselines by a significant margin.
The contributions of this dissertation cover several important aspects of accelerating
DNN inference. The first three contributions focus on the software stack; these techniques
reduce the redundancies of neural networks for more energy-efficient inference. However,
this dissertation reveals that software compression methods show varying performance
on different hardware systems: there does not exist a one-size-fits-all algorithmic solution
for accelerating DNNs on a diverse set of hardware platforms. In addition, these contributions
also demonstrate that the best practice is often to apply a series of DNN compression
techniques instead of a single one, as there is a multiplicative gain when applying
several compression techniques together. The fourth contribution shows how flexible,
reconfigurable hardware allows a large, flexible software compression design space to
be considered and then supported through a specialised implementation; this HW/SW
co-design generates highly efficient CNN accelerators. This contribution demonstrates that
detailed hardware architectural choices influence algorithmic compression methods, and
can increase the degrees of freedom of the algorithmic optimisation space. In addition,
it demonstrates how a deeply pipelined streaming architecture has a latency advantage
compared to other systolic-array based architectures. Since the software compression design
space is large and complex, the final two contributions focus on automating the process
of designing efficient DNNs with hardware-awareness. These contributions demonstrate
that NAS can reduce the amount of software tuning significantly. In addition, NAS is a
scalable approach for emerging neural network types (Graph Neural Networks) and new
learning setups (many-task many-device learning).
1.4 Outline
The organisation of this dissertation is as follows:
In Chapter 2, I discuss the background of neural networks and existing techniques
used in software and hardware for minimising their required computation resources. I
also illustrate the importance of, and introduce the key challenges for, hardware-efficient
neural network inference.
In Chapter 3, I show that neural network models are inherently redundant and
algorithmic compression methods can focus on exploring these redundancies. I introduce
the software framework Mayo and show how combining various network compression
methods can explore redundancies in different design spaces. In addition, I present two
novel compression methods for faster inference. The first one is a dynamic pruning
algorithm that can further reduce the computational requirements compared to static
pruning. The second novel compression is a distribution-aware quantisation that can help
sparse models to be represented using a more compact data type.
In Chapter 4, I focus on the hardware implementation of a convolutional neural network
accelerator. I show how to use a streaming-based computation pattern to allow better
flexibility in the arithmetic design space. The flexible arithmetic designs then provide the
opportunity for the network to achieve a better accuracy. I then describe how I extend
this idea to an automated flow for generating multi-precision, multi-arithmetic accelerators
on custom hardware.
In Chapter 5, I introduce the concept of Automated Machine Learning (AutoML) and
Network Architecture Search (NAS). I demonstrate how they can be applied to graph
neural networks and many-task many-device few-shot learning. I then explain how NAS
can be a scalable solution for emerging network types and new learning setups.
In Chapter 6, I conclude the dissertation and provide suggestions for possible future
research on accelerating neural network inference.
1.5 Publication
Some of the research discussed in this dissertation also appeared in the following publica-
tions:
• Yiren Zhao, Xitong Gao, Robert Mullins, and Chengzhong Xu. Mayo: A framework
for auto-generating hardware friendly deep neural networks. In Proceedings of the
2nd International Workshop on Embedded and Mobile Deep Learning, pages 25–30,
2018. (EMDL 2018).
• Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, Chengzhong Xu. Dynamic
channel pruning: Feature boosting and suppression. In International Conference on
Learning Representations 2019 (ICLR 2019).
• Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, and Cheng-Zhong Xu.
Focused quantization for sparse CNNs. In Advances in Neural Information Processing
Systems, pages 5584–5593, 2019 (NeurIPS 2019).
• Milos Nikolic, Mostafa Mahmoud, Andreas Moshovos, Yiren Zhao, and Robert
Mullins. Characterizing sources of ineffectual computations in deep learning networks.
In IEEE International Symposium on Performance Analysis of Systems and Software,
pages 165–176. IEEE, 2019 (ISPASS 2019).
• Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, Robert Mullins, Peter
YK Cheung, George Constantinides, and Cheng-Zhong Xu. Automatic generation of
multi-precision multi-arithmetic CNN accelerators for FPGAs. In 2019 International
Conference on Field-Programmable Technology, pages 45–53. IEEE, 2019 (ICFPT
2019).
• Yiren Zhao, Duo Wang, Xitong Gao, Robert Mullins, Pietro Lio, and Mateja Jamnik.
Probabilistic dual network architecture search on graphs. In Deep Learning on
Graphs: Method and Applications Workshop for 35th AAAI Conference on Artificial
Intelligence (DLG-AAAI 2021), recipient of the best student paper award.
• Yiren Zhao, Duo Wang, Daniel Bates, Robert Mullins, Mateja Jamnik, and Pietro
Lio. Learned low precision graph neural networks. In The 1st Workshop on Machine
Learning and Systems (EuroMLSys 2021).
• Yiren Zhao, Xitong Gao, Ilia Shumailov, Nicolo Fusi, Robert Mullins. Rapid Model
Architecture Adaption for Meta-Learning. In submission
During my PhD study, my research also resulted in the following publications and patents
that are not included in this dissertation:
• Yiren Zhao, Ilia Shumailov, Robert Mullins, Ross Anderson. To compress or not to
compress: Understanding the interactions between adversarial attacks and neural
network compression. Proceedings of the 2nd SysML Conference, 2019 (SysML
2019).
• Kaifeng Wang, Xitong Gao, Yiren Zhao, Xingjian Li, Dejing Dou, Cheng-Zhong Xu.
Pay Attention to Features, Transfer Learn Faster CNNs. International Conference
on Learning Representations, 2020 (ICLR 2020).
• Ilia Shumailov, Yiren Zhao, Robert Mullins, Ross Anderson. Towards Certifiable
Adversarial Sample Detection. Proceedings of the 13th ACM Workshop on Artificial
Intelligence and Security, Pages 13–24 (AISec 2020).
• Yiren Zhao, Ilia Shumailov, Robert Mullins, Ross Anderson. Blackbox attacks on
reinforcement learning agents using approximated temporal information. 2020 50th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Workshops (DSN-W 2020).
• Daniel Lo, Amar Phanishayee, Eric S Chuang, Yiren Zhao, Richie Zhao. Neural
network activation compression with narrow block floating-point. US Patent.
• Daniel Lo, Amar Phanishayee, Eric S Chuang, Yiren Zhao, Richie Zhao. Neural
network activation compression with outlier block floating-point. US Patent.
• Daniel Lo, Amar Phanishayee, Eric S Chuang, Yiren Zhao. Neural network activation
compression with non-uniform mantissas. US Patent.
• Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, Ross
Anderson. Sponge Examples: Energy-Latency Attacks on Neural Networks.
Proceedings of the 6th IEEE European Symposium on Security and Privacy (EuroS&P
2021).
• David Khachaturov, Ilia Shumailov, Yiren Zhao, Nicolas Papernot, Ross Ander-
son. Markpainting: Adversarial Machine Learning meets Inpainting. The 38th
International Conference on Machine Learning (ICML 2021, Short Talk).
• Ilia Shumailov, Zakhar Shumaylov, Dmitry Kazhdan, Yiren Zhao, Murat A Erdogdu,
Nicolas Papernot, Ross Anderson. Manipulating SGD with Data Ordering Attacks.
To appear in Advances in Neural Information Processing Systems, 2021 (NeurIPS
2021).
Chapter 2
Background
This chapter surveys the background literature and recent model design trends of deep
learning. I then use convolutional neural networks as an example to introduce the basic
structure and learning methods of neural networks. Later, I introduce popular neural
network compression techniques and prior automated machine learning algorithms. At the
end, I provide a survey of prior hardware accelerators and their corresponding hardware
optimisations.
translating English to SQL to plotting beautiful graphs from plain English. The GPT3
model uses more than 170 billion parameters, and its training would cost around 4.6 million
dollars using Azure services [15].
[Figure 2.1 consists of two panels, “Representative models and their MACs” and “Efficient models and their MACs”, plotting the number of MACs (B) on a logarithmic scale against publication year for models including AlexNet, VGG16, WideResNet101, Transformer, BERT, ResNet50, MobileNet, MnasNet and ALBERT.]
Figure 2.1: Model sizes and numbers of multiply-accumulate operations (MACs) of representative neural networks
published in different years. These models have various accuracies and target different
tasks; however, the general trend suggests that models are getting bigger and more
computationally heavy.
[Figure 2.2 consists of two panels plotting activation size (MB) and number of FLOPs (G), both on logarithmic scales, against the number of input nodes (up to 1e6) for a GAT model and the vision models AlexNet, ResNet18, Wide-ResNet50-2, MobileNet-v2 and ResNet101.]
Figure 2.2: Evaluating the computational requirements for a two-layer Graph attention
network (GAT) compared to a range of vision models. Numbers are reported for a
single graph/image inference and we treat an image as a graph with 224 × 224 × 3 input
nodes. Inputs for the GAT model are Erdős–Rényi–Gilbert random graphs generated
with different numbers of nodes (N) and a connectivity probability of 1/N.
While machine learning models are growing in size and complexity, smart network
architectural engineering reduces the cost of running inferences. He et al. and Howard
et al. redesigned the building blocks of CNNs, and proposed the ResNet and MobileNet
families respectively [66, 73]. These vision models use far fewer parameters and operations,
but match the performance of their larger counterparts. Similarly, the design of ALBERT
shows that a lightweight transformer model can achieve comparable performance
[97]. In parallel to manual optimisation of networks based on domain expertise, automatic
exploration methods such as Network Architecture Search (NAS) successfully find efficient
models such as MnasNet, which reaches state-of-the-art accuracy-per-operation
performance [157]. In Figure 2.1, I choose the well-known neural networks mentioned
above and show how their computational requirements change across years. I demonstrate
their computational requirements in terms of 1) number of parameters, also known as
model sizes, and 2) the number of multiply-accumulate operations (MACs) required for a single
inference run. The first gives an indication of the hardware memory requirements and the
potential need for high off-chip memory bandwidth depending on the detailed memory
hierarchy designs. The latter challenges the computational ability of the hardware. The
number of parameters of the various networks is measured in millions (M), and the number of
MACs reported is in billions (B). I also label the names and the publication dates of these
networks in Figure 2.1. On the same dataset, researchers have spent a long time reducing
both the number of parameters and the error rates of DNNs. It is noteworthy that efficient
networks normally utilise special network structures and optimization techniques. For
example, MobileNet made use of grouped convolutions [73] and ResNet50 has its special
shortcut connections [66]. There are several interesting trends we can summarise from
Figure 2.1:
• It takes a great amount of time to design efficient neural networks. AlexNet was
introduced in 2012 but the efficient MobileNet family only started to appear in late 2017.
• People tend to first design networks for better accuracy, but these networks are
normally harder to train and risk being more sensitive to the hardware. For
instance, it is easier to support VGG16 on today’s TPUs [85] than WideResNet101
because of the latter’s skip connections.
Apart from the ever increasing model sizes and complexities, recent years have witnessed
a more diverse set of neural network types. In addition to classic learning domains such
as computer vision and natural language processing, emerging research has proven the
capabilities of DNNs on graph [89, 163] or tabular data [183]. Novel neural network types
often show vastly different computational properties. Taking graph neural networks as an
example, Figure 2.2 demonstrates that these models can have much higher computational
requirements in terms of FLOPs and activation sizes compared to vision models if the
input graph size is large. However, these graph models are often very small in terms of
model sizes compared to their vision counterparts.
In general, we can summarise the following trends:
• We will always have larger datasets and larger models. There will always be a demand
to tackle harder problems. The computational requirements are always growing.
• Although over-parameterization is advantageous in training on very large datasets, it
is not always necessary for accurate inference [28]. Efficient networks exist for
the same datasets; smart architectural changes or automated machine learning will
discover these efficient networks within a few years of the release of the dataset.
• The operations and structures of DNNs will be more diverse and complex (residual
blocks, depthwise separable convolutions, attention mechanisms, etc.) over time in
order to make these models more efficient (e.g. run faster or consume less energy for
a given accuracy budget).
• Over time new neural network types are introduced to improve performance or
solve new machine-learning problems. These may have very different memory access
patterns and computational requirements when compared to existing models.
Figure 2.3: AlexNet network architecture with 5 convolutional layers and 3 fully-connected
layers [93].
DNNs are biologically inspired by the neural systems of human brains. Neurons
are the basic units in a DNN and a number of neurons make up a layer. The network
architecture of AlexNet is shown in Figure 2.3. There are currently a great number of
DNN layer types; the network in Figure 2.3 uses fully-connected, pooling and
convolutional layers. AlexNet has 5 convolutional layers, 3 max pooling layers and 3 fully
connected layers. The convolutional layers account for most of the computation and have started
to replace the fully-connected layers because of their feature-extraction capabilities
[73]. The fully-connected layer is nothing more than dense matrix multiplication and
the pooling layer requires only a linear scan of all variables in the feature maps. Since
convolution is a popular and representative computation layer in today’s DNNs, I would
like to introduce the mathematics behind convolutions for a concrete understanding of
this workload.
[Figure 2.4 illustrates the input feature maps $X_{l-1}$ and the weight tensor $W$ of a convolutional layer, drawn as 3 × 3 grids of kernel weights w00 ... w22.]
Figure 2.4: A typical convolution layer, with 3 input feature maps, 4 output feature maps
and 12 3 × 3 kernels.
A convolutional layer computes $X_l = \mathrm{ReLU}(\mathrm{conv}_l(X_{l-1}, \theta_l))$. Here, $\mathrm{ReLU}(x) = \max(x, 0)$ denotes the ReLU activation, and $\mathrm{conv}_l(X_{l-1}, \theta_l)$ computes
the convolution of the input features $X_{l-1}$ using the weight tensor $\theta_l \in \mathbb{R}^{C_l \times C_{l-1} \times k^2}$, where $k$
is the kernel size. As illustrated in Figure 2.4, for each single output pixel at a particular
channel of the output feature map, the entire weights volume is multiplied with the selected
input feature volume in an element-wise manner and then summed to produce the final
single-pixel result. At inference time, a convolution uses $k^2 C_{l-1} C_l H_l W_l$ multiply-accumulate
operations (MACs), which means $2 k^2 C_{l-1} C_l H_l W_l$ floating-point operations (FLOPs), for
the computation of the $l$-th layer.
Another view of this convolution workload is to treat it as a classic hardware-accelera-
tion problem for nested loops. Listing 2.1 shows the code snippet for a single convolution.
OutR, OutC and OutF are the three dimensions of the output feature maps, InF is the
number of input feature maps, and K and stride are the kernel size and stride value
respectively.
for (int row = 0; row < OutR; row++) {
  for (int col = 0; col < OutC; col++) {
    for (int fo = 0; fo < OutF; fo++) {        /* output feature maps */
      for (int fi = 0; fi < InF; fi++) {       /* input feature maps */
        for (int i = 0; i < K; i++) {          /* kernel rows */
          for (int j = 0; j < K; j++) {        /* kernel columns */
            output_fm[fo][row][col] +=
                weights[fo][fi][i][j] *
                in_fm[fi][stride * row + i][stride * col + j];
}}}}}}
Optimisation of nested-loop computation has been a long-standing research subject in
hardware acceleration. The formulation above offers many possible optimisations for better
data-reuse patterns or better contiguous DRAM access patterns [137, 103, 124]; a full treatment of these
optimisations is beyond the scope of this dissertation. A naive strategy is to unroll all
iterators in the nested loops so that all computations are fully parallelised; however, such
designs normally cannot meet the hardware constraints due to the large computational
dimensions (e.g. large channel counts or large feature map sizes) of neural networks. In
Chapter 4, I will revisit different optimisation choices for nested loops and how they
influence the hardware designs.
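As an illustration of one such loop transformation (a sketch of mine rather than the schedule of any particular accelerator; the tile size P = 4 is an arbitrary assumption and OutF is assumed to be a multiple of P), the output-channel loop of Listing 2.1 can be tiled so that a small, fixed group of multiply-accumulates becomes available for full unrolling in hardware:

/* Tiled convolution loop nest: a sketch only, assuming OutF is a multiple of P. */
void conv_tiled(int OutR, int OutC, int OutF, int InF, int K, int stride,
                float ***output_fm, float ****weights, float ***in_fm)
{
    enum { P = 4 };  /* number of parallel processing elements (illustrative) */
    for (int row = 0; row < OutR; row++)
      for (int col = 0; col < OutC; col++)
        for (int fo_t = 0; fo_t < OutF; fo_t += P)   /* tile the output channels */
          for (int fi = 0; fi < InF; fi++)
            for (int i = 0; i < K; i++)
              for (int j = 0; j < K; j++)
                /* A hardware design would fully unroll this innermost tile loop:
                   the P multiply-accumulates share the same input pixel and can
                   map onto P parallel PEs, each holding its own weights. */
                for (int p = 0; p < P; p++)
                  output_fm[fo_t + p][row][col] +=
                      weights[fo_t + p][fi][i][j] *
                      in_fm[fi][stride * row + i][stride * col + j];
}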
• Reshape kernels.
I summarise all the compression techniques in Figure 2.5 as a taxonomy. In this section, I
will mainly focus on pruning and quantisation. In addition, I will introduce optimisation
over the neural network architecture space, namely network architecture search
techniques for finding the best-performing network architectures.
[Figure 2.5: a taxonomy of compression methods, divided into fine-grained and coarse-grained techniques.]
2.3.1 Pruning
Pruning directly reduces the number of connections. Since LeCun et al. [100] introduced
optimal brain damage, the idea of creating more compact and efficient CNNs by removing
connections or neurons has received significant attention. Early approaches to pruning
deep CNNs zeroed out individual weight parameters [65, 57]. This results in highly irregular
sparse connections, which are notoriously difficult for SIMD architectures such as GPUs
to exploit. This has prompted custom accelerator solutions that exploit sparse weights
[128, 63], although supporting both sparse and dense convolutions efficiently normally
involves some compromises in terms of efficiency or performance.
Because of the above-mentioned reasons, recent research is more focused on structured
sparsity [169, 185, 6, 208], which can be directly exploited by GPUs and also allows custom
accelerators to focus solely on efficient dense operations. Wen et al. [169] added a group Lasso
term on channel weights to the model’s training loss function. This has the effect of driving
the magnitudes of channel weights to diminish during training, and removing connections
from zeroed-out channels. To facilitate this process, Alvarez et al. [6] additionally used
proximal gradient descent, while Li et al. [105] and He et al. [69] proposed to prune channels
by thresholds, i.e. they set unimportant channels to zero, and fine-tune the resulting
CNN. The objective to induce sparsity in groups of weights may present difficulties for
gradient-based methods, given the large number of weights that need to be optimised. A
common approach to overcome this is to solve [70] or learn [114, 185] channel saliencies to
drive the sparsification of CNNs. He et al. [70] solved an optimization problem which limits
the number of active convolutional channels while minimising the reconstruction error on
the convolutional output. Liu et al. [114] used Lasso regularisation on channel saliencies
to induce sparsity and prune channels with a global threshold. Ye et al. [185] learned to
sparsify CNNs with an iterative shrinkage/thresholding algorithm applied to the scaling
factors in batch normalisation. There are methods [116, 213] that use greedy algorithms
for channel selection. Luo et al. [116] selectively prune input channels while minimizing
the reconstruction error of the convolutional output. Zhuang et al. [213] similarly perform
channel selection to minimise a joint loss of the reconstruction error and the classification
power of convolutional channels. Huang et al. [78] and He et al. [71] adopted reinforcement
learning to train agents to produce channel pruning decisions.
In general, existing work in the field of network pruning demonstrates that fine-grained
pruning achieves a greater reduction in terms of model sizes. However,
when it comes to real inference-time reduction on commodity hardware such as GPUs,
coarse-grained methods are a better choice.
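The following C function is a minimal sketch of threshold-based channel pruning in the spirit of the coarse-grained methods above; it is a generic illustration rather than the procedure of any specific cited work, and the flattened weight layout and global threshold are my assumptions:

#include <math.h>
#include <stddef.h>

/* Coarse-grained pruning sketch: weights laid out as [c_out][c_in * k * k].
   Channels whose L1-norm saliency falls below `threshold` are zeroed entirely;
   in practice the pruned network would then be fine-tuned to recover accuracy. */
void prune_channels(float *weights, size_t c_out, size_t per_channel,
                    float threshold)
{
    for (size_t c = 0; c < c_out; c++) {
        float saliency = 0.0f;
        for (size_t i = 0; i < per_channel; i++)
            saliency += fabsf(weights[c * per_channel + i]);
        if (saliency < threshold)
            for (size_t i = 0; i < per_channel; i++)
                weights[c * per_channel + i] = 0.0f;
    }
}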
2.3.2 Quantisation
Quantisation methods allow parameters to be represented with much narrower bit-widths
than the 32-bit floating-point numbers used in classic CPU- or GPU-based DNN
implementations. Converting numbers to fixed-point representations drastically reduces
computation and memory requirements [32, 109, 81]. An n-bit fixed-point number with a
binary point position $p$ can represent a value $\hat{x} = 2^{-p}\,m$, where $m$ is an $n$-bit signed integer.
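A small C sketch of this fixed-point conversion is shown below; it is my own illustration of the generic scheme rather than the routine of any cited implementation, with round-to-nearest and saturation as assumed policies:

#include <math.h>

/* Quantise x to an n-bit signed fixed-point value with binary point position p,
   i.e. representable values are m * 2^-p for integer m in [-2^(n-1), 2^(n-1)-1]. */
float fixed_point_quantise(float x, int n, int p)
{
    float scale = ldexpf(1.0f, p);       /* 2^p */
    long  m = lroundf(x * scale);        /* nearest integer mantissa */
    long  max = (1L << (n - 1)) - 1;     /* saturate to the n-bit signed range */
    long  min = -(1L << (n - 1));
    if (m > max) m = max;
    if (m < min) m = min;
    return (float)m / scale;             /* dequantised value m * 2^-p */
}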
Lin et al. claim that a suitable fixed-point quantisation can offer a large reduction in model
sizes without any accuracy loss [109]. In extreme cases, the parameters can be binary
(0 and 1) [80] or ternary values (−1, 0, 1) [104]. This significantly reduces the memory
requirements for parameters, but naively quantising weights and activation values to extremely
low bit-widths can cause a large decrease in network accuracy. The idea of grouping
numbers with common bases is long-standing [172], but Lin et al. and Zhang et al. first
showed that this technique works well with neural network model parameters [110, 192].
They group parameters using a set of shared numerical bases, so that the offsets of these
bases can be quantised to binary or ternary values without a large decrease in network
accuracy [110, 192]. These extremely low-precision number formats might be challenging
to exploit on general computing platforms but can bring a significant performance
gain to custom hardware accelerators. Inference on general-purpose hardware, e.g. CPUs
and GPUs, often utilises 8-bit fixed-point numbers. Intel DL Boost provides built-in
instructions, an extension based on the Advanced Vector Extensions 512 (AVX-512), to
accelerate 8-bit model inference [9]. Recently, Intel also released an Advanced Matrix
extensions (AMX) in addition to AVX-512 [2]. Its supported tile matrix multiplication
enables around 2000 INT8 operations per cycle per core. NVIDIA now has the TensorRT
toolkit to support various low-precision formats (the lowest possible bit-width is 8-bit) on
a variety of GPU devices [161].
Arithmetic formats beyond fixed-point have also been explored for efficient neural network
inference. Miyashita et al. showed that logarithmic data representations can be effective on
DNNs, and Lee et al. also discovered that powers-of-two numbers are sufficient for networks
to retain accuracy while replacing expensive multipliers with shifts [101]. Shift quantisation
results in the following representable values, where s ∈ {−1, 0, 1} indicates the sign of the
value, b is a constant integer shared among weights within the same layer which ensures
no values overflow, and e is a variable exponent:
$\hat{x} = s \times 2^{e-b}$.   (2.3)
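The C sketch below illustrates one plausible way to perform this shift quantisation; the rounding in the log domain and the width of the exponent field are assumptions of mine, not the exact procedure of the cited works:

#include <math.h>

/* Shift quantisation sketch: map x to s * 2^(e - b) with s in {-1, 0, 1},
   e an integer exponent in [0, 2^e_bits - 1] and b a layer-wide constant. */
float shift_quantise(float x, int b, int e_bits)
{
    if (x == 0.0f)
        return 0.0f;
    float s = (x > 0.0f) ? 1.0f : -1.0f;
    /* Nearest exponent in log2 space, then clamp to the representable range. */
    int e = (int)lroundf(log2f(fabsf(x))) + b;
    int e_max = (1 << e_bits) - 1;
    if (e < 0) return 0.0f;                 /* too small: flushed to zero */
    if (e > e_max) e = e_max;
    return s * ldexpf(1.0f, e - b);         /* the value s * 2^(e-b) */
}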
Many quantisation methods also consider re-training the quantised networks for a
better post-quantisation accuracy; this is also known as quantisation-aware training
(QAT). Courbariaux et al. first showed that re-training the quantised networks can
significantly reduce the accuracy loss introduced by quantisation errors [32]. Direct
quantisation and quantisation-aware training are now pervasive in deep learning. Modern
deep learning frameworks such as PyTorch [129] and TensorFlow [129] now both offer
fixed-point quantisation and quantisation-aware training to help developers deploy deep
learning models to mobile or more constrained devices.
a great number of possible operations. Several NAS methods [113, 130, 216, 206] then
focused on a reduced cell-based search space, with the hope that the learned cell structure can
be repeated several times to compose a complete network. There is a clear motivation
behind this cell-based search space: previous well-performing human-designed networks
normally stack a number of fixed blocks [66]. Now if we can build a successful cell, it is
expected that replicating this cell multiple times would construct a good model architecture.
Figure 2.6 demonstrates the cell-based search space in DARTS [113]. Each red edge in Cell_i
is a searchable operation; the designer only searches for reduction and normal cells and
repeats these cells several times to build a complete network. Another popular search space
setup in NAS is to construct a supernet from a seed/backbone network [17, 18, 157]. The
backbone network normally relies on a pre-built successful network such as ResNets [66]
or MobileNets [73]. The NAS methods then expand this network with a set of pre-defined
candidate operations to build a supernet. Finally, the NAS method will exploit a search
method (e.g. genetic algorithms) to pick the best network (architecture adaption phase in
Figure 2.7). It is generally hard to argue which search space is better, since they have
different limitations. The cell-based approach relies on repeating blocks, and the complex
DAG within each cell might induce a large inference latency (I discuss this latency problem in
detail in Chapter 5). The supernet-based approach, on the other hand, depends on its
pre-defined backbone and can have vastly different performance when the backbone is not
designed with care.
[Figure 2.6: the DARTS cell-based search space. Cells Cell_{i-1} and Cell_{i-2} feed Cell_i; each red edge inside a cell is a searchable operation, and the searched normal and reduction cells are repeated to build the complete network.]
[Figure 2.7: the supernet-based NAS flow, from a backbone network to a supernet and finally to the searched network.]
NAS methods can also be classified by their search methodologies. Early work in this
field, such as NASNet [216] and AmoebaNet [136], employed reinforcement-learning (RL) or
evolutionary algorithms (EA) to discover optimal network architectures, but this comes at a
large computational cost. Each update of the controller requires a few hours to train a child
network to convergence which significantly increases the search time. Liu et al. proposed
Differentiable Architecture Search (DARTS) that is a purely gradient-based optimisation
(GO); each candidate operation’s importance is scored using a trainable scalar and updated
using Stochastic Gradient Descent (SGD) [113]. Subsequently, Casale et al. approached
the NAS problem from a probabilistic view, transforming the concrete trainable scalars used by
DARTS [113] into probabilistic priors and training only a few architectures sampled from these
priors at each training iteration [20]. Wu et al. and Xie et al. used the Gumbel-softmax
trick to relax discrete operation selection to continuous random variables [175, 177]. I
present representative NAS techniques in Table 2.1. It can be concluded that these
three mainstream NAS approaches all demonstrate comparable search times in terms of
GPU days.
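To make the gradient-based scoring concrete, the following C sketch (a simplified illustration of mine, not the DARTS implementation) forms the softmax-weighted mixture of candidate operation outputs on a single edge; in a gradient-based NAS method the task loss is back-propagated through this mixture into the architecture scalars alpha:

#include <math.h>
#include <stddef.h>

/* One NAS edge: mix the outputs of `n_ops` candidate operations, each having
   already produced `len` output values, using softmax(alpha) as the weights.
   In gradient-based NAS the alphas are trained by SGD alongside the weights. */
void mixed_op(const float *alpha, const float *op_out /* [n_ops][len] */,
              size_t n_ops, size_t len, float *mixed)
{
    float denom = 0.0f, w[64];                  /* assumes n_ops <= 64 */
    for (size_t k = 0; k < n_ops; k++) {
        w[k] = expf(alpha[k]);
        denom += w[k];
    }
    for (size_t i = 0; i < len; i++) {
        mixed[i] = 0.0f;
        for (size_t k = 0; k < n_ops; k++)
            mixed[i] += (w[k] / denom) * op_out[k * len + i];
    }
}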
There is also now a rise in applying NAS together with hardware design parameters, so that
both the network architecture and hardware architecture search spaces are jointly explored.
Abdelfattah et al. and Zhou et al. have explored how NAS can establish an efficient
exploration in the software-hardware co-design search space [4, 211].
Method Style Top1/Top5 (%) Params (Millions) GPU Days
NASNet [216] RL 82.7/96.2 88.9 2000
Path-level EAS [16] RL 74.6/91.9 594 200
FPNAS [33] RL 73.3/- 3.41 0.8
AmoebaNet [136] EA 82.8/96.1 86.7 3150
OFA [18] EA 80.0/- - 1.7
SMASH [14] GO 61.4/83.7 16.2 3
PARSEC [20] GO 74.0/91.6 5.6 1
ProxylessNAS [17] GO 75.1/92.5 - 8.33
SNAS [177] GO 72.7/90.8 4.3 1.5
FBNet [175] GO 74.9/- 5.5 9
DARTS [113] GO 73.3/91.3 4.7 4
Table 2.1: Different Network Architecture Search Methods and their performance on
the ImageNet 2012 dataset [93]. RL, EA and GO represent Reinforcement Learning,
Evolutionary Algorithm and Gradient-based Optimisation respectively.
like to compare some popular accelerator designs in terms of their processing elements and
data reuse models. Energy consumed by a hardware accelerator can be broadly categorised
as relating to computation, on-chip memory accesses or off-chip memory accesses. The
design of a DNN accelerator needs to carefully allocate its available hardware resources to
processing elements and on-chip memory systems to improve efficiency.
hardware designs in the rest of this chapter.
One particularly popular hardware architecture for DNN inference is the Systolic Array.
This is a homogeneous network of processing elements that are connected to a single
memory system. These processing elements are normally data processing units (DPUs),
and each DPU can be configured to independently compute a partial result as
data arrives from its upstream neighbors. The DPU performs its computation
and then passes the result to its downstream neighbors, as illustrated in Figure 2.8.
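The short C program below simulates, cycle by cycle and in a heavily simplified form, how partial sums travel through a one-dimensional chain of processing elements during a matrix-vector product; it is my own illustration of the dataflow idea rather than the design of any particular systolic-array accelerator:

#include <stdio.h>

#define N 4  /* number of processing elements = matrix dimension (illustrative) */

/* Simplified simulation of a 1-D systolic chain computing y = W x. PE j is
   associated with column j of W; the partial sum for output row i enters the
   chain at cycle i and picks up one multiply-accumulate per PE as it moves
   downstream, so PE j works on row (cycle - j) at every cycle. */
int main(void)
{
    float W[N][N] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
    float x[N] = {1, 1, 1, 1};
    float y[N] = {0};

    for (int cycle = 0; cycle < 2 * N - 1; cycle++)
        for (int j = 0; j < N; j++) {           /* all PEs operate in parallel */
            int i = cycle - j;                  /* output row seen by PE j now */
            if (i >= 0 && i < N)
                y[i] += W[i][j] * x[j];         /* MAC on the passing partial sum */
        }

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);        /* expected: 10, 26, 42, 58 */
    return 0;
}

In this toy model, output row i enters the chain at cycle i and accumulates one product per PE as it moves downstream, so all N outputs complete after 2N − 1 cycles while every PE stays busy in the steady state.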
Accelerator Sparsity Bit-widths
Diannao [24] Dense weights and dense activations 16-bit fixed-point PE
EIE [63] Sparse weights and sparse activations 4-bit encoding, 16-bit fixed-point PE
SCNN [128] Sparse weights and sparse activations 4-bit encoding, 16-bit fixed-point PE
Eyeriss [27] Dense weights and sparse activations 16-bit fixed-point PE
Angel-Eye [56] Dense weights and dense activations 16-bit or 8-bit fixed-point PE
Nullhop [5] Dense weights and sparse activations 16-bit fixed-point PE
XNOR Neural Engine [31] Dense weights and dense activations binary PE with partial sums stored as 16-bit
Bit-Tactical [35] Bit-level sparse weights and activations 16-bit fixed-point PE
YodaNN [7] Dense weights and dense activations binary weights, 12-bit activations
Table 2.2: Sparsities and data formats of various neural network accelerators.
exploits activation sparsity to reduce the computation complexity. The XNOR Neural Engine
focuses on accelerating binary neural networks [31], and Bit-Tactical [35] accelerates
neural networks by exploiting bit-level sparsity using bit-serial operators. YodaNN is
an accelerator for binarised neural networks, whose arithmetic operations are nearly costless
because they are simply bit-wise operations.
Unrolling different iterators in hardware has different impacts on the underlying
architecture (PE arrangements, interconnects, etc.) and the data reuse model. Naive
unrolling of only i, j exploits the hardware parallelism inside each kernel; however,
these kernels are small in size and thus do not provide enough hardware parallelism.
Eyeriss obtains more parallelism by unrolling the row and col iterators together
with the kernels (i, j) [27]: it utilises a systolic-array-like architecture where outputs are
streamed vertically to be summed and inputs are streamed diagonally. The weights are
broadcast to each processing element using a large multicast network. Ma et al. find
parallelism in the input feature maps (fi): each input feature map is fully calculated before
switching to a new one [117]. The partial sums for output feature maps (fo) are cached
on-chip until the calculation finishes. In this case, only weights are not reused. In contrast,
Rahman et al. choose to fully unroll the output feature maps and their corresponding
dimensions (row, col, fo) in hardware [134]; however, these unrollings force partial sums
at the output to be stored back into external memory if on-chip storage is not large
enough.
In short, various unrollings indicate different hardware architectures (PE design, on-chip
interconnects, memory accesses, etc.) and data reuse models. Convolution layers normally
vary in terms of channel count, kernel size and feature map size. An optimal hardware design
for one particular convolution layer does not transfer to another: hardware that is optimised
for one particular convolution layer may perform poorly when used to execute different
layers with different dimensions.
Table 2.4: Performance comparison of neural network accelerators. Bit-width means the
bit-width of the computation cores.
[Figure 2.9 plots GOps/mm²/W on a logarithmic scale against year (2015–2021) for accelerators including EIE, Eyeriss, ShiDianNao, TPU Edge, Intel Movidius Myriad X, TPU-V1 to TPU-V4, Ascend 910, Groq, Nvidia V100, Graphcore C2, FPGAConvNet and BrainWave.]
Figure 2.9: A comparison of various DNN accelerators. Orange circles represent research
ASIC accelerators for embedded systems, blue circles represent commercial ASIC accelera-
tors for embedded systems, blue triangles are commercial ASIC cores for cloud systems,
blue rectangles are commercial FPGA designs for cloud systems and orange stars are
research FPGA designs for embedded systems.
from the fastest arithmetic that the accelerator can offer. For instance, if an accelerator
supports both INT16 and FLOAT16, I report the number of operations as the number
of INT16 operations finished per unit time. The estimation is coarse-grained for several
platforms because of the limited information available, e.g. Cerebras does not report
the exact number of floating-point operations so I have to estimate it from its available
clock frequency and number of processing elements. In particular, the following hardware
systems are considered:
• ASIC accelerators for embedded systems (orange circles and blue circles): ShiDianNao
[41], EIE [63] and Eyeriss [27] are representative designs from academia. These
accelerators tend to use a more aggressive optimisation strategy in the software
space (e.g. more compact number formats). In particular, EIE utilised fine-grained
pruning, fixed-point quantisation and Huffman encoding to aggressively compress
the model [63]. Nvidia Jetson [159, 120], Intel Movidius [3] and TPU Edge [21] have
similar performance although designed in different years, and I consider them
representative designs from industry.
• FPGA-based accelerators for embedded systems (orange stars): FPGAConvNet [164]
and Angel-Eye [56] are two representative designs from academia.
• ASIC accelerators for cloud systems (blue triangles): TPU-V1, TPU-V2, TPU-V3
and TPU-V4 are a series of tensor processing units developed at Google [85]. TPU-V1
and V4 support network inference only, whereas the other two generations support
both training and inference. The Nvidia V100 is a popular platform in many cloud
services [118]. Huawei HiSilicon ships the Ascend 910 chip based on the Huawei
Da Vinci architecture [108]. The Groq Tensor Streaming Processor is built from a set of
Superlanes, where each Superlane is driven by specialised very long instruction word
(VLIW) instructions [60]. Graphcore released the C2 card, showing strong
performance for both ResNet50 training and inference [86]. Cerebras has demonstrated an
incredibly large chip using their wafer-scale technology on the 7nm technology
node [1].
• FPGA-based accelerators for cloud systems (blue rectangles): In this case, BrainWave
from Microsoft might be the most recognised FPGA design that is widely deployed
in today’s Microsoft data centers [48].
In Figure 2.9, I observe several interesting trends that are worth highlighting in this
dissertation:
• The middle band in Figure 2.9 (blue circles) consists mainly of embedded systems designed
by industry. These accelerators show better performance compared to server-
class accelerators. This indicates that simply using a larger die size (making larger
chips) might increase the absolute operations-per-second count but does not provide
better power efficiency per unit die area. This phenomenon is easy to understand since
larger die sizes normally imply an increased number of transistors and thus more
difficult routing. In addition, server-class chips normally integrate more networking
facilities and DRAM channels. On the other hand, loading off-chip data from DRAM
is likely to be a bottleneck for server-class accelerators.
• DNN accelerators designed by industry do not show great performance (GOps/mm²/W)
in Figure 2.9. It is commonly known that industry accelerators have to support
more types of neural networks and even ensure backwards compatibility [85]; this
wider coverage forces the designers to use more conservative optimisation meth-
ods. Some of these industry accelerators have particular applications to support
with particular quality targets, e.g. in some cases forcing the use of some forms of
floating-point arithmetics. They may also have to take and run networks as they are,
i.e. they don’t have the ability to retrain after applying some sort of compression
technique.
It is also worth mentioning that this comparison does not consider the effect of different
fabrication processes. However, it is generally expected that, at least in industry, the
accelerators fabricated in later years use a more advanced nanometre technology node. On
the other hand, recent DNN accelerators include both ASIC and FPGA designs. ASIC
designs have the major advantage of a high clock frequency. FPGAs used to be a testing
facility for ASIC designs [95], but have now started to rise as a general-purpose computation
platform for the following reasons:
• Faster time to market: ASIC designs have a much longer development cycle compared
to FPGAs.
• Design difficulty: The design process of ASIC accelerators is complex and requires
collaboration between different entities; FPGAs, on the other hand, have a much simpler
design flow, especially with the rise of recent high-level synthesis tools [173].
In this dissertation, I mainly focus on FPGA designs and provide a more thorough
discussion of DNN acceleration on FPGAs in Chapter 4.
Chapter 3
In this chapter, I will look at how pruning and quantisation techniques can reduce the
amount of computation required for neural network inference. These techniques are often
used in the context of deep learning with a pre-defined DNN architecture and pre-trained
DNN weights, and the objective is to lower the amount of computation required by the
neural network, which is often a limiting factor.
I first discuss the intriguing properties of deep neural networks: their redundancy and
ability to recover accuracy from re-training. I start with a discussion of the effects of
combining different basic neural network compression methods and a flexible software
tool for achieving this combined compression. I later demonstrate and evaluate two novel
compression techniques, dynamic coarse-grained pruning and focused quantisation. Finally,
I discuss how various approximate computing methods might be applicable to different
hardware platforms.
This chapter includes relevant content published in:
• Yiren Zhao, Xitong Gao, Robert Mullins, and Chengzhong Xu. Mayo: A framework
for auto-generating hardware friendly deep neural networks. In Proceedings of the
2nd International Workshop on Embedded and Mobile Deep Learning, pages 25–30,
2018. (EMDL 2018)
• Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, Chengzhong Xu. Dynamic
channel pruning: Feature boosting and suppression. In International Conference on
Learning Representations 2019 (ICLR 2019).
• Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, and Cheng-Zhong Xu.
Focused quantisation for sparse CNNs. In Advances in Neural Information Processing
Systems,pages 5584–5593, 2019 (NeurIPS 2019)
Yiren Zhao proposed the idea of constructing the Mayo framework. Xitong Gao
developed the pruning functions and Yiren Zhao constructed the quantisation functions
for Mayo. Yiren Zhao conceived and designed the experiments for Dynamic channel
pruning and Focused quantisation. Xitong Gao enhanced the implementation of Dynamic
channel pruning with normalisation and achieved better results, he also provided insights
on developing the feature saliency motivation in the publication.
Han’s Deep Compression further demonstrated the possibility of achieving multiplying gains in shrinking model sizes when applying a series of compression techniques [64]. However, Deep Compression investigates only a single combination of pruning, quantisation and encoding techniques in the large compression optimisation space. Intuitively, there are potentially many other combinations of compression techniques that reach equivalent accuracy but have very different hardware implications. In general, we are presented with the possibility of combining many different compression techniques and selecting from numerous number formats. The Mayo tool was designed to help explore this design space and find good solutions: it aims to be a flexible software compression optimisation framework that supports arbitrary combinations of compression techniques on different components of a neural network, as illustrated in Figure 3.1.
Figure 3.1: A high-level illustration of the Mayo tool. Mayo takes a set of compression techniques and combines them to achieve a single compression rate. The compression setup and the re-training setup can be configured in YAML files. The tool takes a pre-trained model as an input and outputs a compressed model.
At the time of implementing Mayo, the research community had only a few open-source methods and frameworks for compressing neural networks. Table 3.1 shows the differences between Mayo and other existing tools; Mayo supports a wider range of pruning and quantisation methods and can apply these methods to different components of a neural network. Dynamic fixed-point arithmetic refers to a specialised fixed-point arithmetic where the whole layer shares a single scaling factor for all of the fixed-point numbers of that layer.
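To make this idea concrete, the following is a minimal NumPy sketch of layer-wise dynamic fixed-point quantisation, under the assumption that a single power-of-two scaling factor is shared by all values in a layer (the function name and the exact rounding policy are illustrative, not Mayo's implementation):

```python
import numpy as np

def dynamic_fixed_point(weights, n_bits):
    """Quantise a layer with a single shared power-of-two scaling factor.

    The exponent is picked per layer so that the largest magnitude fits into
    the signed n_bits-bit integer range; all values then reuse the same scale.
    """
    max_abs = float(np.abs(weights).max())
    exponent = int(np.ceil(np.log2(max_abs))) if max_abs > 0 else 0
    scale = 2.0 ** (exponent - (n_bits - 1))          # layer-wise scaling factor
    q = np.clip(np.round(weights / scale),
                -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale                                  # de-quantised for simulation

# example: 4-bit dynamic fixed-point applied to one layer's weights
layer_weights = np.random.randn(64, 3, 3, 3) * 0.1
quantised = dynamic_fixed_point(layer_weights, n_bits=4)
```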
All compression methods are implemented as objects called overriders in Mayo. Overriders can be flexibly applied not only to the layer-wise model parameters θl, but also to the underlying algorithm fl, and even to the gradient of each layer computation xl = fl(xl−1, θl), where xl ∈ R^(Cl×Hl×Wl). The design of overriders provides a common abstraction for various compression techniques.
An overrider g ∈ G, configured with suitable hyperparameters, takes a multi-dimensional array (a tensor) as an input, and produces a new array with the same shape as the compressed variant.
                                         Mayo    Ristretto   ADaPTION   DoReFa
Pruning        fine-grained              ✓       ✗           ✗          ✗
               coarse-grained            ✓       ✗           ✗          ✗
Quantization   fixed-point               ✓       ✓           ✓          ✓
               dynamic fixed-point       ✓       ✓           ✓          ✗
               mini-float                ✓       ✓           ✓          ✗
               log                       ✓       ✗           ✗          ✗
               shift                     ✓       ✓           ✓          ✗
Layer-wise customization                 ✓       ✓           ✓          ✗
Automated hyperparameter optimization    ✓       ✗           ✗          ✗
Compression method chaining              ✓       ✗           ✗          ✗
Customizable components                  w/a/g   w/a         w/a        w/a/g
Configuration format                     YAML    Caffe       Caffe      Python
Table 3.1: A comparison: Ristretto [61], ADaPTION [119], DoReFa [210] and Mayo.
Parameters θl can thus be simply substituted using any overrider gparam ∈ G:
θ̃l = gparam(θl). (3.1)
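To illustrate the overrider abstraction, here is a minimal Python sketch (the class and function names are my own and do not reflect Mayo's actual API) that chains a magnitude pruner and a fixed-point quantiser on a parameter tensor, mirroring Equation (3.1):

```python
import numpy as np

class Overrider:
    """Maps a tensor to a compressed tensor of the same shape."""
    def __call__(self, tensor):
        raise NotImplementedError

class MagnitudePruner(Overrider):
    def __init__(self, density):
        self.density = density                      # fraction of weights to keep

    def __call__(self, tensor):
        k = max(1, int(round(self.density * tensor.size)))
        threshold = np.sort(np.abs(tensor), axis=None)[-k]
        return np.where(np.abs(tensor) >= threshold, tensor, 0.0)

class FixedPointQuantiser(Overrider):
    def __init__(self, bits, frac_bits):
        self.bits, self.frac_bits = bits, frac_bits

    def __call__(self, tensor):
        scale = 2.0 ** self.frac_bits
        q = np.clip(np.round(tensor * scale),
                    -(2 ** (self.bits - 1)), 2 ** (self.bits - 1) - 1)
        return q / scale

def apply_overriders(overriders, theta):
    """Chained application: theta is substituted by g_n(...g_1(theta))."""
    for g in overriders:
        theta = g(theta)
    return theta

theta = np.random.randn(128, 64) * 0.05
theta_tilde = apply_overriders(
    [MagnitudePruner(density=0.2), FixedPointQuantiser(bits=6, frac_bits=7)], theta)
```

Chaining overriders in this way is what enables the combined compression effects discussed next.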
I show in Table 3.2 the effect of applying quantisation on top of a sparse CifarNet (the details of this network are explained later in Table 3.3) [201] for classifying CIFAR10 [94] objects. I designed the CifarNet model based on the VGG family [151] but added batch normalisation and dropout layers. Applying both quantisation and pruning offers a multiplying gain in terms of the compression rate: Table 3.2 shows that a pruned model, when quantised with dynamic fixed-point numbers, brings an over 30× decrease in model size with a minimal decrease in classification accuracy (less than 0.4%).
Method                           Bit-width   Density   Compression rate   Top-1/top-5 accuracy
baseline                         32          100.00%   –                  91.37%/99.67%
fixed-point (fixed)              4           100.00%   8.00×              89.64%/99.74%
dynamic fixed-point (DFP)        4           100.00%   8.00×              90.63%/99.68%
fine-grained pruning (pruned)    32          15.65%    6.39×              91.12%/99.70%
pruned + fixed                   6           15.65%    33.92×             90.59%/99.68%
pruned + DFP                     6           15.65%    33.92×             91.04%/99.70%
Table 3.2: The effect of combining pruning and quantisation on a sparse CifarNet for CIFAR10 classification.
The comparison in Table 3.2 provides an interesting insight: pruned + DFP outperforms the pruned + fixed flow utilised in Deep Compression. This demonstrates that there are further compression opportunities if a more flexible combination of pruning and quantisation strategies is permitted. pruned + fixed and pruned + DFP have the same compression rate but different accuracies, which further demonstrates that many combinations of compression techniques offer sub-optimal accuracy; the developed Mayo framework is a suitable tool for traversing this design space. In addition, the Mayo framework quantises not only the weights but also the activation values, and the tool also supports compressing each layer of the network independently, i.e. using different compression techniques or parameters. I will revisit this ability in Chapter 4 to demonstrate how these features can benefit the design of hardware accelerators.
are permanently lost, and the resulting CNN may never regain its accuracy for difficult inputs for which the removed channels were responsible. Secondly, the saliency of a neuron is not static, as illustrated by the feature visualisations in Figure 3.2a, Figure 3.2b and Figure 3.2c. Here, when a CNN is shown a set of input images, certain channel neurons in a convolutional output may get highly excited, whereas another set of images elicits little response from the same channels. This is in line with my understanding of CNNs, whereby neurons in a convolutional layer specialise in recognising distinct features, and the relative importance of a neuron depends heavily on the inputs.
[Figure 3.2 panels: (a) Channel 114; (b) Channel 181; (c) The distribution of maximum activations.]
Figure 3.2: When images from the ImageNet validation dataset are shown to a pre-trained ResNet-18 [66], the outputs from certain channel neurons may vary drastically. The top rows in (a) and (b) are found respectively to greatly excite neurons in channels 114 and 181 of layer block 3b/conv2, whereas the bottom images elicit little activation from the same channel neurons. The number below each image indicates the maximum value observed in the channel before adding the shortcut and activation. Finally, (c) shows the distribution of maximum activations observed in the first 20 channels.
The above shortcomings prompt the question: why should we prune by static importance, if the importance is highly input-dependent? Surely, a more promising alternative is to prune dynamically depending on the current input. A dynamic channel pruning strategy allows the network to learn to prioritise certain convolutional channels and ignore irrelevant ones. Instead of simply reducing model size at the cost of accuracy with pruning, I can accelerate convolution by selectively computing only a subset of channels predicted to be important at run-time, while considering the sparse input from the preceding convolutional layer. In practice, the amount of activation values that stay on-chip and the number of DRAM reads, writes and arithmetic operations used by a well-designed dynamic model can be almost identical to those of an equivalently sparse statically pruned one. In addition to saving computational resources, a dynamic model preserves all neurons of the full model, which minimises the impact on task accuracy.
3.3.2 Feature boosting and suppression
Motivated by the shortcomings described above, I will start with a high-level illustration (Figure 3.3) of how Feature Boosting and Suppression (FBS) accelerates a convolutional layer with batch normalisation (BN). The auxiliary components (in red) predict the importance of each output channel based on the input features, and amplify the output features accordingly. Moreover, certain output channels are predicted to be entirely suppressed (or zero-valued, shown as white blocks in the figure); such output sparsity information can advise the convolution operation to skip the computation of these channels, as indicated by the dashed arrow. It should be noted that convolutions can be skipped due to (1) inactive input channels and (2) the suppression of particular output channels. The rest of this section provides a detailed explanation of the components in Figure 3.3.
Figure 3.3: A high-level view of a convolutional layer with FBS. By way of illustration, I use the lth layer with 8-channel input and output features, where channels are coloured to indicate different saliencies, and the white blocks represent all-zero channels.
where f and π respectively use weight parameters θ and φ and may have additional
inputs, and compute tensors of the same output shape, denoted by F and G. Intuitively,
the expensive F[i] can always be skipped for any index i whenever the cost-effective G[i]
evaluates to 0. Here, the superscript [i] is used to index the ith slice of the tensor. For
example, if we have features F ∈ RC×H×W containing C channels of H-by-W features,
F[c] ∈ RH×W retrieves the cth feature image. I can further sparsify and accelerate the layer
by adding, for instance, a Lasso on π to the total loss, where Ex [z] is the expectation of z
over x:
R(x) = Ex[‖π(x, φ, · · · )‖1], (3.5)
Despite the simplicity of this formulation, it is very tricky to design fˆ properly.
Under the right conditions, I can arbitrarily minimise the Lasso while maintaining the
same output from the layer by scaling parameters. For example, in low-cost collaborative layers [39], f and π are simply convolutions (with or without ReLU activation) that respectively have weights θ and φ. Since f and π are homogeneous functions, one can always halve φ and double θ to decrease Equation (3.5) while the network output remains the same. In other words, the optimal network must have ‖φ‖∞ → 0, which is infeasible in finite-precision arithmetic. For the above reasons, Dong et al. [39] observed that the additional loss in Equation (3.5) always degrades the CNN's task performance. Ye et al. [185] pointed out that gradient-based training algorithms are highly inefficient in exploring such reparameterisation patterns, and channel pruning methods may experience similar difficulties. Shazeer et al. [145] avoided this limitation by finishing π with a softmax normalisation, but Equation (3.5) can then no longer be used, as the softmax renders the ℓ1-norm constant: it now always evaluates to 1. In addition, similar to the sigmoid, the softmax (without the cross entropy) saturates easily, and thus may equally suffer from the vanishing gradient problem.
Instead of imposing sparsity on features or convolutional weight parameters [169, 6, 105, 69], channel pruning methods [114, 185] induce sparsity on the BN scaling factors γl. Inspired by them, FBS similarly generates a channel-wise importance measure. Yet contrary to them, instead of using the constant BN scaling factors γl, FBS predicts channel importance and dynamically amplifies or suppresses channels with a parametric function π(xl−1) dependent on the output of the previous layer, xl−1. Here, I propose to replace the layer definition fl(xl−1) for all l ∈ [1, L] with fˆl(xl−1), which employs dynamic channel pruning:
fˆl (xl−1 ) = (πl (xl−1 ) · (norm (convl (xl−1 , θl )) + βl ))+ , (3.6)
where norm is batch normalisation and a low-overhead policy πl(xl−1) evaluates the pruning decisions for the computationally demanding convl(xl−1, θl):

$$\pi_l(x_{l-1}) = \mathrm{wta}_{\lceil dC_l \rceil}\!\left(g_l(x_{l-1})\right). \qquad (3.7)$$

Here, wta_k(z) is a k-winners-take-all function, i.e. it returns a tensor identical to z, except that the entries of z smaller in absolute magnitude than the k largest are zeroed out. In other words, wta_⌈dCl⌉(gl(xl−1)) provides a pruning strategy that computes only the ⌈dCl⌉ most salient channels predicted by gl(xl−1), and suppresses the remaining channels with zeros. In addition, gl(xl−1) is a simple fully connected layer that learns to predict channel saliencies using only a handful of learnable parameters.
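The pieces above can be sketched compactly in PyTorch; the module layout below is my own illustration of Equations (3.6) and (3.7) (the layer sizes, the ReLU inside the saliency predictor, and global average pooling as the subsampler are assumptions of this sketch, not the released FBS code):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def wta(z, k):
    """k-winners-take-all: keep the k largest-magnitude entries per row, zero the rest."""
    _, idx = z.abs().topk(k, dim=1)
    mask = torch.zeros_like(z).scatter_(1, idx, 1.0)
    return z * mask

class FBSConv(nn.Module):
    """One convolutional layer augmented with feature boosting and suppression."""
    def __init__(self, c_in, c_out, density=0.5):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out, affine=False)
        self.beta = nn.Parameter(torch.zeros(c_out))
        self.saliency = nn.Linear(c_in, c_out)       # g_l: predicts channel saliencies
        self.k = max(1, math.ceil(density * c_out))  # ceil(d * C_l) channels kept

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1).flatten(1)   # channel-wise subsampling of the input
        pi = wta(F.relu(self.saliency(s)), self.k)   # Equation (3.7)
        y = self.bn(self.conv(x)) + self.beta.view(1, -1, 1, 1)
        return F.relu(pi[:, :, None, None] * y)      # Equation (3.6)

out = FBSConv(8, 8, density=0.5)(torch.randn(2, 8, 16, 16))
```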
It is notable that the proposed strategy prunes the Cl − ⌈dCl⌉ least salient output channels from the lth layer, where the density d ∈ (0, 1] can be varied to sweep the trade-off between performance and accuracy. Moreover, pruned channels contain all-zero values. This allows the subsequent (l + 1)th layer to trivially make use of input-side sparsity, since all-zero features can be safely skipped even for zero-padded layers. Because all convolutions can exploit both input- and output-side sparsity, the speed-up gained from pruning is quadratic with respect to the pruning ratio. For instance, dynamically pruning half of the channels in all layers gives rise to a dynamic CNN that uses approximately 1/4 of the original MACs.
Theoretically, FBS does not introduce the reparameterisation discussed above: because the convolution output is batch normalised, the layer output is invariant to a scaling of the convolution kernel θl. Computationally, it is more efficient to train. Many alternative methods use non-differentiable π functions that produce on/off decisions. In general, DNNs with such policy functions are incompatible with SGD and resort to reinforcement learning for training. In contrast, FBS allows end-to-end training, as wta is a piecewise differentiable and continuous function like ReLU. Srivastava et al. [153] suggested that, in general, a network is easier and faster to train for complex tasks and less prone to catastrophic forgetting if it uses functions such as wta that promote local competition between many sub-networks.
perturbed in brightness, saturation and hue.
ImageNet classifiers were trained with a procedure similar to the one above. The differences were that they were trained for a maximum of 35 epochs, the learning rate was decayed every 20 epochs, and NS models were all pruned at 15 epochs. For image preprocessing, we additionally cropped and stretched/squeezed images randomly, following [93].
The proposed FBS method begins by replacing all convolutional layer computations with Equation (3.6) and initialising the new convolutional kernels with the previous parameters. Initially, I do not suppress any channel computations, i.e. I use density d = 1 in Equation (3.7), and fine-tune the resulting network. For a fair comparison against NS, I then follow [114] by iteratively decrementing the overall density d of the network by 10% in each step, gradually using fewer channels to sweep the accuracy/performance trade-off. The difference is that NS prunes channels by ranking them globally, while FBS prunes around a fraction 1 − d of the channels in each layer.
[Figure 3.4: (a) accuracy (82%–92%) trade-off curves for CifarNet, including NS top-1; (b) per-layer channel usage heat maps (conv0–conv7).]
Under the same 90.50% accuracy constraint, FBS, NS+FBS, and NS respectively achieve 3.93×, 3.22×, and 1.19× speed-up ratios. Conversely, for a 2× speed-up target, they respectively produce models with accuracies not lower than 91.55%, 90.90% and 87.54%.
Figure 3.4b uses 8 heat maps to respectively represent the channel skipping probabilities
of the 8 convolutional layers. The brightness of the pixel at location (x, y) denotes the
probability of skipping the xth channel when looking at an image of the y th classification
category. The heat maps verify my belief that the auxiliary network learned to predict
which channels specialise to which features, as channels may have drastically distinct
probabilities of being used for images of different categories. The model here is a CifarNet
using FBS with d = 0.5, which has a top-1 accuracy of 90.59% (top-5 99.65%). Moreover,
channels in the heat maps are sorted so the channels that are on average least frequently
evaluated are placed on the left, and channels shaded in stripes are never evaluated. The
network in Figure 3.4b is not only approximately 4× faster than the original; by removing the unused channels, it also reduces the number of weights by 2.37×. This reveals that
FBS naturally subsumes channel pruning strategies such as NS, as I can simply prune
away channels that are skipped regardless of the input. It is notable that even though I
specified a universal density d, FBS learned to adjust its dynamicity across all layers, and
prune different ratios of channels from the convolutional layers.
I also present a detailed layer-wise study of CifarNet in Table 3.3 on the CIFAR10 classification task. CifarNet is a custom-designed CNN with fewer than 1.30 million parameters, and it takes 174 million MACs to perform inference on a 32-by-32 RGB image. The architecture is shown in Table 3.3, where all convolutional layers use 3 × 3 kernels, the Shape column shows the shapes of each layer's features, and pool7 is a global average pooling layer. Table 3.3 additionally provides comparisons of layer-wise compute costs between FBS, NS, and the composition of the two methods (NS+FBS). It is notable that the FBS column has two output channel counts: the former is the number of channels computed for each inference, and the latter is the number of channels remaining in the layer after removing the unused channels.
Residual networks [66] adopt a sequential structure of residual blocks: xb = K(xb−1) + F(xb−1), where xb is the output of the bth block, K is either an identity function or a downsampling convolution, and F consists of a sequence of convolutions. For residual networks, I directly apply FBS to all convolutional layers, with a difference in the way I handle the feature summation. Because the (b + 1)th block receives as input the sum of two features with sparse channels, K(xb−1) and F(xb−1), a channel of this sum is considered sparse only when the same channel in both features is sparse.
By applying FBS and NS respectively to ResNet-18, I found that the ImageNet validation accuracy of FBS consistently outperforms that of NS under different speed-up constraints. For speed-up, I refer to the number of multiply-accumulates required to finish an inference run compared to the original dense model.
Layer    Shape     Number of MACs (Output Channels)
                   Original       NS             FBS              NS+FBS
conv0    30 × 30   1.5M (64)      1.3M (52)      893K (32/62)     860K (32)
conv1    30 × 30   33.2M (64)     27.0M (64)     8.4M (32/42)     10.2M (39)
conv2    15 × 15   16.6M (128)    15.9M (123)    4.2M (64/67)     5.9M (74)
conv3    15 × 15   33.2M (128)    31.9M (128)    8.3M (64/79)     11.6M (77)
conv4    15 × 15   33.2M (128)    33.1M (128)    8.3M (64/83)     12.1M (77)
conv5    8 × 8     14.1M (192)    13.4M (182)    3.6M (96/128)    4.9M (110)
conv6    8 × 8     21.2M (192)    11.6M (111)    5.4M (96/152)    4.3M (67)
conv7    8 × 8     21.2M (192)    12.3M (192)    5.4M (96/96)     4.5M (116)
pool7    1 × 1
fc       1 × 1     1.9K (10)      1.9K (10)      960 (10)         1.1K (10)
Total              174.3M         146.5M         44.3M            54.2M
Saving             –              1.19×          3.93×            3.21×
Table 3.3: The network structure of CifarNet for CIFAR10 classification. In addition, we provide a detailed per-layer MACs comparison between FBS, NS, and the composition of them (NS+FBS). I minimise the models generated by the three methods while maintaining a classification accuracy of at least 90.5%.
For instance, at d = 0.7, FBS utilises only 1.12 billion multiply-accumulates (MACs) (a 1.62× speed-up) to achieve a top-1 error rate of 31.54%, while NS requires 1.51 billion MACs (1.21× faster) for a similar error rate of 31.70%.
When compared across recent dynamic execution methods examined in Table 3.4, FBS
demonstrates simultaneously the highest speed-up and the lowest error rates. It is notable that the baseline accuracies for FBS refer to a network that has been augmented with the auxiliary FBS layers but suppresses no channels, i.e. d = 1. FBS brings immediate accuracy improvements to the baseline network, an increase of 1.73% in top-1 and 0.46% in top-5 accuracy, which is in line with my observation on CifarNet.
Table 3.4: Comparisons of error rates of the baseline and accelerated ResNet18 models.
Since VGG16 is computationally intensive, with over 15 billion MACs, I first applied NS to this large network to reduce the computational and memory requirements, which eases the training of the FBS-augmented variant. I assigned a 1% budget in top-5 accuracy degradation and compressed the network using NS, which gave a smaller VGG16 with 20% of all channels pruned.
Method                                          Dynamic    ∆ top-5 error (%)
                                                           3×      4×      5×
Filter Pruning ([105], reproduced by [70])                 —       8.6     14.6
Perforated CNNs [46]                                       3.7     5.5     —
Network Slimming ([114], my implementation)                1.37    3.26    5.18
Channel Pruning [70]                                       0.0     1.0     1.7
AMC [71]                                                   —       —       1.4
ThinNet-Conv [116]                                         0.37    —       —
Feature Boosting and Suppression (FBS)          ✓          0.04    0.52    0.59
Table 3.5: Comparisons of top-5 error rate increases for VGG16 on ImageNet validation
set under 3×, 4× and 5× speed-up constraints. The baseline has a 10.1% top-5 error rate.
Results from He et al. [70] only show numbers with one digit after the decimal point.
The resulting network is much less redundant; its compute requirement is almost halved, with only 7.90 billion MACs remaining. Table 3.5 compares the performance of different structured pruning and dynamic execution methods against FBS. At a speed-up of 3.01×, FBS shows a minimal increase of 0.44% and 0.04% in the top-1 and top-5 error rates respectively. At a 5.23× speed-up, it only degrades the top-1 error rate by 1.08% and the top-5 by 0.59%.
fine-grained pruning (see Figure 3.5c for an example). Shift quantisation is highly desirable
as it can be implemented efficiently, but it becomes a poor choice for certain layers in
sparse models, as most near-zero quantisation levels are under-utilized (Figure 3.5d).
(a) Dense layers (b) After shift quantisation (c) Sparse layers (d) After shift quantisation
Figure 3.5: The weight distributions of the first 8 layers of ResNet18 on ImageNet. (a) shows
the weight distributions of the layers, (c) similarly shows the distributions (excluding zeros) for
a sparsified variant. (b) and (d) respectively quantise the weight distributions on the left with
5-bit shift quantisation. Note that in some sparse layers, greedy pruning encourages weights to
avoid near zero values. Shift quantisation on these layers thus results in poor utilisation of the
quantisation levels.
This dichotomy prompts the question: how can we quantise sparse weights efficiently and effectively? Here, efficiency means not only a reduced model size but also a minimised compute cost, while effectiveness means that the quantisation levels are well utilised.
From an information theory perspective, it is desirable to design a quantisation function
Q such that the quantised values in θ̂ = Q(θ) closely match the prior weight distribution.
Shift quantisation represents each value as

$$v = s \cdot 2^{e-b}, \qquad (3.8)$$
where s ∈ {−1, 0, 1} denotes either zero or the sign of the value, e is an integer bounded by [0, 2^k − 1], and b is the bias, a layer-wise constant which scales the magnitudes of the quantised values. I use θ̂ = Q^shift_{n,b}[θ] to denote an n-bit shift quantisation with bias b of a weight value θ to the nearest representable value θ̂.
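A small NumPy sketch of this shift quantisation follows; rounding is done in the log domain and the exponent range is taken as [0, 2^n − 1], both simplifying assumptions of the sketch rather than details taken from the text:

```python
import numpy as np

def shift_quantise(theta, n, b):
    """Q^shift_{n,b}: map each value to s * 2^(e - b) with s in {-1, 0, 1}
    and e an integer clipped to the representable exponent range."""
    sign = np.sign(theta)
    mag = np.abs(theta)
    e = np.round(np.log2(np.maximum(mag, 1e-45))) + b   # nearest exponent (log domain)
    e = np.clip(e, 0, 2 ** n - 1)
    return np.where(mag > 0, sign * 2.0 ** (e - b), 0.0)  # exact zeros stay zero

w = np.random.randn(1000) * 0.05
print(np.unique(np.abs(shift_quantise(w, n=3, b=7))))
```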
Intuitively, it is desirable to concentrate quantisation effort on the high probability
regions in the weight distribution in sparse layers. Focused quantisation Qfocus [θ] is
designed specifically for this purpose, and applied in a layer-wise fashion.
$$Q[\theta] = z_\theta\, \alpha \sum_{c \in C} \delta_{c,\, m_\theta}\, Q^{\mathrm{rec}}_c[\theta], \quad \text{where} \quad Q^{\mathrm{rec}}_c[\theta] = Q^{\mathrm{shift}}_{n,b}\!\left[\frac{\theta - \mu_c}{\sigma_c}\right]\sigma_c + \mu_c. \qquad (3.9)$$
Here, C denotes a set of components corresponding to high-probability regions of the weight distribution on which to focus quantisation effort, each specified by the component's mean µc and standard deviation σc. The Kronecker delta δc,mθ evaluates to 1 when c = mθ, and to 0 otherwise. In other words, the constant mθ ∈ C chooses which component in C is used to quantise θ. Finally, Q^rec_c[θ] locally quantises the component c with shift quantisation. Following Zhu et al. [212] and Leng et al. [102], I additionally introduce a layer-wise learnable scaling factor α initialised to 1, which empirically improves the task accuracy.
The high-probability regions are identified by fitting a mixture of Gaussian components to the layer weights,

$$Q^{\mathrm{mix}}(\theta, \phi) = \sum_{c \in C} \lambda_c\, f(\theta \mid \mu_c, \sigma_c),$$
where f (θ|µc , σc ) is the probability density function of the Gaussian distribution N (µc , σc ),
the non-negative λc defines the mixing weight of the cth component and Σc∈C λc = 1. Here,
I find the set of hyperparameters µc , σc and λc contained in φ that maximises Qmix given
that θ ∼ p(θ|D). This is known as the maximum likelihood estimate (MLE), and the
MLE can be efficiently computed by the expectation-maximisation (EM) algorithm [36].
In practice, it is sufficient to use two Gaussian components, C = {−, +}, for identifying
high-probability regions in the weight distribution. For faster EM convergence, I initialise µ−, σ− and µ+, σ+ with the means and standard deviations of the negative and positive values in the layer weights respectively, and λ−, λ+ with 1/2. I then generate mθ
from the mixture model, which individually selects the component to use for each weight.
For this, mθ is evaluated for each θ by sampling a categorical distribution where the
probability of assigning a component c to mθ , i.e. p(mθ = c), is λc f (θ|µc , σc )/Qmix (θ).
Finally, I set the constant b to a power-of-two value, chosen to ensure that Q^shift_{n,b}[·] allows at most a proportion of 1/2^(n+1) of the values to overflow, clipping them to the maximum representable magnitude. In practice, this heuristic choice makes better use of the quantisation levels provided by shift quantisation than disallowing overflows. After determining all of the relevant hyperparameters with the method described above, θ̂ = Q[θ] can be evaluated to quantise the layer weights.
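Putting these pieces together, a sketch of recentralised quantisation for one sparse layer might look as follows. It reuses the shift_quantise sketch above, fits the two-component mixture with scikit-learn, and uses the maximum-likelihood component assignment rather than sampling mθ; these implementation choices are assumptions of the sketch, not the method's reference code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def recentralised_quantise(theta, n, b, alpha=1.0):
    """Sketch of Q[theta] in Equation (3.9) for a sparse layer's weights."""
    nz = theta != 0                                   # pruned weights stay zero
    w = theta[nz].reshape(-1, 1)
    gmm = GaussianMixture(
        n_components=2,
        means_init=[[w[w < 0].mean()], [w[w > 0].mean()]])  # init from -/+ weights
    gmm.fit(w)
    m = gmm.predict(w)                                # component choice m_theta (MAP)
    mu = gmm.means_.ravel()[m]
    sigma = np.sqrt(gmm.covariances_.ravel())[m]
    # shift-quantise the recentralised values, then map back to each component
    q = shift_quantise((w.ravel() - mu) / sigma, n, b) * sigma + mu
    out = np.zeros_like(theta)
    out[nz] = alpha * q
    return out
```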
As I have discussed earlier, the weight distribution of sparse layers may not always
have multiple high-probability regions. It is possible that the sparse layers have two high
probability masses that are overlapping. Under this scenario, I can simply use n-bit shift
quantisation. To decide whether to use shift or recentralised quantisation, it is necessary
to introduce a metric to compare the similarity between the pair of components. While the KL-divergence provides a measure of similarity, it is non-symmetric, making it unsuitable for this purpose. To address this, I propose to first normalise the distribution of
the mixture, then to use the 2-Wasserstein metric between the two Gaussian components
after normalisation as a decision criterion, which I call the Wasserstein separation:
$$W(c_1, c_2) = \frac{1}{\sigma^2}\left((\mu_{c_1} - \mu_{c_2})^2 + (\sigma_{c_1} - \sigma_{c_2})^2\right), \qquad (3.11)$$
where µc and σc are respectively the mean and standard deviation of the component
c ∈ {c1, c2}, and σ² denotes the variance of the entire weight distribution. FQ can then adaptively use recentralised quantisation for all sparse layers except when W(c1, c2) < wsep, in which case shift quantisation is used instead. I found that wsep = 2.0 usually provides a good decision criterion. In the next subsection, I additionally study the impact of quantising a model with different wsep values.
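Continuing the sketch above (gmm is the mixture fitted in the previous block), the Wasserstein separation of Equation (3.11) and the resulting per-layer decision can be written as:

```python
import numpy as np

def wasserstein_separation(gmm, theta):
    """W(c1, c2) from Equation (3.11), normalised by the variance of the nonzero weights."""
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())
    var_all = theta[theta != 0].var()
    return ((mu[0] - mu[1]) ** 2 + (sigma[0] - sigma[1]) ** 2) / var_all

def choose_quantiser(gmm, theta, w_sep=2.0):
    """Use recentralised quantisation only when the two components are well separated."""
    return "recentralised" if wasserstein_separation(gmm, theta) >= w_sep else "shift"
```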
Table 3.6 demonstrates the effectiveness of FC on these models. Sparsified ResNets with 7-bit weights are at least 16× smaller than the original dense models, with marginal degradation (≤0.24%) in top-5 accuracy. MobileNet, which is a much less redundant and more compute-efficient model to begin with, achieves a smaller CR of around 8× and a slightly larger accuracy degradation (≤0.89%). Yet when compared to the ResNet-18 models, it is not only more accurate, but also has a significantly smaller memory footprint at 1.71 MB.
In Table 3.7, I compare FC with many state-of-the-art model compression schemes. It
shows that FC simultaneously achieves the best accuracy and the highest CR on both
ResNets. Trained Ternary Quantisation (TTQ) [212] quantises weights to ternary values,
while INQ [207] and extremely low bit neural network (denoted as ADMM) [102] quantise
weights to ternary or powers-of-two values using shift quantisation. Distillation and
Quantisation (D&Q) [131] quantises parameters to integers via distillation. Note that D&Q's results used a larger model as the baseline, hence the compressed model has high accuracy and a low CR. I also compared against Coreset-Based Compression [42], which comprises pruning, filter approximation, quantisation and Huffman encoding. This method exploits inter-filter dependencies in the CNN filter banks using coreset representations and prunes unimportant filters. For ResNets, I additionally compare against ThiNet [115], a filter
pruning method, and Clip-Q [158], which interleaves training steps with pruning, weight
sharing and quantisation. FC again achieves the highest CR (18.08×) and accuracy
(74.86%).
Table 3.6: The accuracy (%), sparsities (%) and CRs of focused compression on ImageNet
models. The baseline models are dense models before compression and use 32-bit floating-point
weights, and 5 bits and 7 bits denote the number of bits used by individual weights of the
quantised models before Huffman encoding.
Table 3.7: Comparisons of top-1 and top-5 accuracies (%) and CRs with various compression methods. Numbers with ? indicate results that were not originally reported and were calculated by us. Note that D&Q used a much larger ResNet18; the 5 bases used by ABC-Net denote 5 separate binary convolutions. LQ-Net used a “pre-activation” ResNet-18 [67] with a 1.4% higher accuracy baseline than ours.
As mentioned previously, I use the Wasserstein distance W(c1, c2) between the two components in the Gaussian mixture model as a metric to decide whether recentralised or shift quantisation should be used. It is important to understand the choice of this introduced hyperparameter. In my experiments, I specified a threshold wsep = 2.0 such that for each layer, if W(c1, c2) ≥ wsep then recentralised quantisation is used, otherwise shift quantisation is employed instead. Figure 3.7 shows the impact of choosing different wsep, ranging from 1.0 to 3.5 at 0.1 increments, on the top-1 accuracy. This model is a CIFAR10 [94] classifier with only 9 convolutional layers, so that it is possible to repeat training 100 times for each wsep value to produce high-confidence results. Note that the average validation error is minimised when the layer with only one high-probability region uses shift quantisation and the remaining 8 use recentralised quantisation, which verifies my intuition.
[Figure 3.7 plot: top-1 error (9.00%–10.00%, left axis) and the number of layers with recentralised quantisation (0–9, right axis) against the Wasserstein distance threshold (1.0–3.5).]
Figure 3.7: The effect of different threshold values on the Wasserstein distance. The larger
the threshold, the fewer the number of layers using recentralised quantisation instead of shift
quantisation.
if the inputs are batched. Not only does FBS use far fewer FLOPs, it also demonstrates significant reductions in the bandwidth and memory requirements of hardware platforms. In low-powered edge computing, inference is typically performed on a single image. In Table 3.8, I observe a large reduction in the number of memory accesses required to load weights and read/write activations, as a convolution with FBS uses sparse input and evaluates sparse output activations. Since these read/write operations are often carried out by DRAM accesses, minimising them directly translates to power savings, as much less energy is spent performing expensive DRAM operations. Table 3.8 further reveals that in diverse application scenarios, such as low-end edge devices and cloud-based GPU platforms, the peak memory usage of the optimised models is also significantly smaller than that of the originals, which in general improves cache utilisation.
Table 3.9: Comparison of the original ResNet-18 with successive quantisations applied on weights,
activations and BN parameters. Each row denotes added quantisation on new components.
Table 3.8: Comparisons of the number of memory accesses and peak memory utilisation
of VGG16 and ResNet18 on ImageNet with FBS respectively under 3× and 2× inference
speed-up constraints. The Weights and Activations columns respectively show the total
amount of weight and activation accesses required by all convolutions for a single image
inference. The Peak Memory Usage columns show the peak memory usages by the
same models with different batch sizes.
Table 3.10: Computation resource estimates of custom accelerators for inference assuming the
same compute throughput.
[Figure 3.8 diagram: an 8-bit integer input; inputs extracted per mixture component (µ− and µ+); quantised (5-bit) weights stored as 3-bit shift values with shared denominators (/8, /128, /1024); a learnable scaling factor α_l.]
Figure 3.8: An implementation of the dot-product used in convolution between an integer input
and a filter quantised by recentralised quantisation. The notation /N means the filter values
share a common denominator N .
convolution but with different quantisations. Both ABC-Net and LQ-Net quantise each weight into M binary values, and compute M parallel binary convolutions for each binary weight. The M outputs are then accumulated for each pixel in the output feature map. Even with the optimal compute pattern proposed by the two methods, there are at least O(MN) additional high-precision multiply-adds, where M is the number of parallel binary convolutions, and N is the number of output pixels. This overhead is significant when compared to the other parts of the computation in the convolution. As shown in Table 3.10, both have higher logic usage because of the substantial number of high-precision multiply-adds. In contrast, FQ has only one final learnable layer-wise scaling multiplication, which can be further optimised out as it can be fused into BN for inference. Despite having more quantisation levels, a higher CR, and being more efficient in hardware resources, the fully quantised ResNet-18 in Table 3.9 can still match the accuracy of an LQ-Net ResNet-18.
3.6 Summary
In this chapter, I showed how combining basic compression algorithms can provide multiplicative compression gains (Section 3.2). Although the same phenomenon has been observed in Deep Compression, I revealed that there are further compression opportunities if a more flexible combination of pruning and quantisation strategies is allowed. In Section 3.3 and Section 3.4, I introduced novel compression techniques for pruning and quantisation respectively. These two techniques exploit certain characteristics of neural networks in order to achieve a better compression quality. The first technique pushes models beyond state-of-the-art compression ratios by observing that channel saliencies differ across inputs. The second method exploits the post-pruning weight distributions to build a quantisation scheme. Finally, I explained how these novel compression methods can be used on real hardware and what they mean for general-purpose and custom hardware platforms.
Chapter 4
Multi-precision multi-arithmetic CNN hardware accelerators for efficient neural networks
In this chapter, I will look at how re-designing the hardware can improve the performance of DNN inference. Hardware designs often limit the range of software compression techniques (e.g. pruning and quantisation) that can be exploited. This chapter explores how a more flexible hardware design allows us to apply a group of more sophisticated compression techniques.
I will first provide an overview of different styles of DNN accelerators and then propose an alternative design that focuses on pipelining a number of small compute cores. I show how this hardware design can be integrated with software optimisations, offering a high-throughput, low-latency design on FPGAs without the need for off-chip memory traffic. Later, I will discuss the computation flow and data reuse strategies of the proposed hardware architecture in detail and also demonstrate an automated co-design flow for generating accelerators on custom computing devices.
This chapter includes relevant contents published in:
• Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, Robert Mullins, Peter
YK Cheung, George Constantinides, and Cheng-Zhong Xu. Automatic generation of
multi-precision multi-arithmetic CNN accelerators for FPGAs. In 2019 International
Conference on Field-Programmable Technology, pages 45–53. IEEE, 2019 (ICFPT
2019)
Yiren Zhao conceived and designed the experiments. Yiren Zhao implemented a test version of a multi-arithmetic DNN accelerator in high-level synthesis (HLS); Junyi Liu provided corrections and testing in this process. Erwei Wang extended the test version into a more detailed HLS implementation. Yiren Zhao later realised that the HLS tool was limiting the performance of deeply pipelined designs and produced a hardware description in RTL (SystemVerilog). Xuan Guo later further optimised the implementation. Yiren Zhao and Xitong Gao worked on extending the software framework Mayo to help the hardware generation.
[Figure 4.1 diagram. Homogeneous core: uniform compute and buffer resources for all layers; a single large compute core with weight, activation and output buffers backed by off-chip DRAM. Flattened streaming: non-uniform compute and buffer resources for different layers; an individual block per layer with its own weight buffer, activation buffer and compute cores, held in on-chip RAM.]
Figure 4.1: An illustration of a homogeneous core (left) and flattened streaming cores
(right).
Figure 4.1 shows a graphical illustration of a homogeneous core and flattened streaming cores. The homogeneous core has a large compute unit that consumes most of the design budget, whereas the flattened streaming design uses a number of smaller cores. It is also worth mentioning that, in the flattened streaming design, since the throughput is determined by the slowest stage, there are opportunities for fast (normally later) computation stages to be unrolled less to save resources without any impact on the system throughput. I will discuss this unrolling strategy in detail in Section 4.5.3.
accuracies on large datasets. The most popular quantisation strategies used for DNN accelerators are fixed-point and shift quantisation, since they achieve a good trade-off between model accuracy and hardware efficiency; I discussed these two number representation systems in Chapter 2.
FPGA-based accelerators generally apply one of the above low-precision quantisation methods uniformly [148, 55]. This specialisation provides efficiency and performance gains when compared to GPUs, but does not fully exploit the reconfigurability of such devices. Alternatively, we could consider selecting the most appropriate quantisation and quantisation hardware on a per-layer basis. The aim here is to both minimise the impact of quantisation loss and maximise the model's task accuracy. My discussion in Section 3.2 showed that different layers suit different quantisation strategies because of differences in their weight distributions. In addition, using different quantisations for the computing cores dedicated to each layer reduces the hardware circuit area. The idea of allowing different bit-widths for different layers is not new. Bit-serial accelerators [142, 143] target exactly this problem, as they provide scope to optimise away superfluous computation at the bit level when computing with fixed-point numbers. These accelerators run at a much higher clock frequency than their bit-parallel competitors to compensate for the increased number of clock cycles needed to finish multiplications. However, the use of bit-serial operations remains a limited exploration of adaptive arithmetic within fixed-point numbers; later in this chapter, I will illustrate a workflow that automatically generates accelerators whose computing cores use not only different bit-widths but also different arithmetics.
[Figure 4.2 diagram: the model config, constraints and pre-trained weights enter the Mayo software stack, which produces a compressed model; the Tomato hardware stack combines the compressed model and model config with Verilog templates to produce a bitstream.]
Figure 4.2: A high-level illustration of how Mayo and Tomato can be integrated to directly produce a bitstream for FPGA devices.
that are fully quantised while satisfying the accuracy constraints. In the exploration procedure, it iteratively uses an accurate hardware resource estimator to provide rapid predictions of hardware costs and to minimise the costs for the searched models. I designed an analytical model for predicting hardware utilisation based on my empirical data; the prediction model shows highly accurate estimates (within 2% error) on the target FPGA devices. The cost (latency, and the number of lookup tables (LUTs) and block RAMs (BRAMs) used) is estimated using analytical models generated from post-synthesis results for a wide range of module parameters. The final optimised CNN model is then fine-tuned on the original training dataset to minimise accuracy degradation.
It is notable that, from the optimised model, Tomato generates dedicated compute engines for each convolutional layer in a streaming fashion. As mentioned earlier, the compute engines are connected in a pipeline, each taking a stream of inputs and producing a stream of outputs. The isolated compute engines thus have the freedom to use different quantisations with individual bit-widths. To minimise hardware utilisation, layers that exceed the throughput requirement can be folded (i.e. only partially unrolled) to share individual processing elements temporally. The framework automatically computes the optimal unroll factors required to parallelise each layer, minimising stalls and idle circuits. Finally, the framework generates a SystemVerilog description of the hardware implementation of the input model, which is in turn synthesised into circuits with the fine-tuned weights.
[Figure 4.3 diagram. Software stack (Mayo): a greedy search, guided by a resource estimator and the accuracy and hardware constraints, quantises the model; once the estimated hardware meets the constraints, the pretrained weights are retrained. Hardware stack (Tomato): the model config, searched arithmetic information and scaling factors undergo throughput matching and Verilog template generation; synthesis and routing of the Verilog files produce the bitstream.]
Figure 4.3: Framework overview for generating the accelerator for a target network on a particular dataset. The overall framework is composed of the Software Stack (Mayo) and the Hardware Stack (Tomato).
new configuration q′ from q which results in the steepest reduction in hardware resources, hwcost(q) − hwcost(q′), until no layer can be modified further without violating the accuracy constraint. Additionally, if the hardware resource constraint hwcost(q) ≤ hbudget is already satisfied, I exit early to minimise accuracy loss. In my experiments, I chose αbudget to be 0.95α, where α is the original accuracy, to generate a fully quantised model with efficient hardware usage. The resulting model is then fine-tuned to further increase accuracy.
5:  θ′, α ← finetune(θ, q′, E)
6:  if α ≥ αbudget then
7:      q ← q′
8:      θ ← θ′
9:      if hwcost(q′) ≤ hbudget then
10:         break
11:     end if
12: else
13:     L ← L − layer_changed(q, q′)
14: end if
15: end while
16: return q, θ
17: end function
This quantisation search is the core of the software part of the automatic generation flow, illustrated by the orange regions in Figure 4.2.
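For concreteness, the loop of Algorithm 1 can be rendered schematically in Python; the helpers candidates, finetune and hwcost below are placeholders standing in for Mayo's internals, so this is a sketch of the control flow rather than the framework's actual code:

```python
def layer_changed(q, q_new):
    """The single layer whose quantisation differs between the two configurations."""
    return next(l for l in q if q[l] != q_new[l])

def greedy_quantisation_search(q, theta, alpha_budget, h_budget,
                               candidates, finetune, hwcost, epochs):
    """Greedily lower per-layer precision while accuracy stays above the budget.
    q maps layer names to quantisation choices; candidates(q, active) yields
    single-layer modifications restricted to the still-active layers."""
    active = set(q)
    while active:
        moves = list(candidates(q, active))
        if not moves:
            break
        # the move with the steepest reduction in estimated hardware cost
        q_new = max(moves, key=lambda c: hwcost(q) - hwcost(c))
        theta_new, accuracy = finetune(theta, q_new, epochs)
        if accuracy >= alpha_budget:
            q, theta = q_new, theta_new
            if hwcost(q) <= h_budget:
                break                                  # hardware budget already met
        else:
            active.discard(layer_changed(q, q_new))    # freeze the offending layer
    return q, theta
```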
[Figure 4.4 diagram: a single homogeneous compute core C executes L1, L2, L3 in successive cycles, leaving circuits idle (inactive compute), while the time-shared streaming cores C1, C2, C3 keep all three layers active in a pipelined compute flow; each layer's input is flattened in 2D as a k²×ic by w×h×oc block.]
Figure 4.4: An illustration of the computation flow when executing three layers of convolutions (L1, L2, L3) at different clock cycles. C represents a large homogeneous compute core and C1, C2, C3 are smaller compute cores in a flattened streaming architecture. The rectangular block of each convolution layer represents the input dimensions of a convolution flattened in 2D. k, ic, oc, w, h are the kernel size, input/output channel counts, and the width and height of the input feature maps respectively.
are time-shared for U′ output channels, as the next layer has an input block size equal to U′. As it processes all input channels Cl−1 in blocks of size U, it uses only U × Cl instead of Cl−1 × Cl parallel shift-accumulate or multiply-accumulate units, requiring Cl−1/U cycles to complete the computation of a single pixel of all output channels, as shown in Figure 4.5.
Tomato does not use the roll-unrolled pattern for depthwise convolutions. Figure 4.6 shows the computation pattern for depthwise convolutions. In contrast to normal convolutions, depthwise convolutions are channel-wise operations, i.e. they do not exchange information across channels. By rolling input channels in depthwise layers, the generated outputs are also rolled, which differs from the normal roll-unrolled compute pattern. In this way, I exactly match the throughput of incoming and upcoming computations while minimising resource utilisation. Each parallel adder tree sums up K² values and is fully pipelined, where K is the kernel size.
Roll-unrolled should not be confused with loop tiling. Loop tiling reorders the access pattern so that it is friendlier to either cache accesses or DRAM bandwidth utilisation in systolic-array-based CNN accelerators. As Tomato pipelines multiple frames instead of batching them, I did not change the access pattern. The purpose of rolling and unrolling in Tomato's streaming cores is to minimise hardware and provide a stall-free computation dataflow, a fundamentally different objective.
[Figure 4.5: the roll-unrolled compute pattern; partial results feed an accumulator (ACC).]
[Figure 4.6 diagram: input feature maps (after the slide buffer) and weights feed shift-or-multiply (≪ or ∗) units to produce the output feature maps.]
Figure 4.6: An illustration of depthwise convolution: blue indicates data elements computed
in a single cycle.
pixel in 32 clock cycles. The choice of the input pixel rate directly impacts the trade-off
between performance and the hardware resources required. If this input pixel rate is 1,
the generated hardware is optimised for performance, fully-pipelined, and never stalls
the input pixel stream. When the input pixel rate decreases, because of the automatic
matching, unroll factors of all subsequent convolution layers decrease and the generated
hardware thus utilises fewer resources but has an increased latency.
I now explain how the automated throughput matching works. The framework utilises the classic sliding-window design: one pixel of an output feature map is produced once all pixels of the corresponding sliding window on the input feature maps have arrived [12]. The input and output streams of strided convolutions, however, can have different rates. For instance, when the stride size is 2, the output stream is 4 times slower than the input stream (striding occurs in the two spatial dimensions). Table 4.1 shows the unrolling factors U and U′ that the framework picked for each convolution in MobileNet-V1 when choosing the input pixel rate to be 1. Here, for each pixel, Ci/U represents the number of clock cycles required to iterate over all input channel values, and Co/U′ is the number of cycles required to finish generating all output channel values. Taking the second depthwise convolution layer as an example, this layer has a stride size of 2 and the framework rolls computations on the output channel side by a factor of Co/U′ = 16, so that U′ = 4 output channel values are computed concurrently. Finally, all of the unrolling information is provided to the hardware templates in order to instantiate the appropriate hardware.
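The essence of the throughput matching can be illustrated with a simplified calculation (my own simplification, ignoring depthwise layers and line-buffer effects): each layer's cycle budget per output pixel is the reciprocal of its output pixel rate, and the input-channel unroll factor U is chosen to fit within that budget.

```python
import math

def match_unroll_factors(layers, input_pixel_rate=1.0):
    """layers: list of dicts with 'c_in' and 'stride'. Returns, per layer, the
    input-channel unroll factor U so that ceil(c_in / U) cycles per output
    pixel never stall the streaming pipeline."""
    rate = input_pixel_rate                  # pixels arriving per clock cycle
    factors = []
    for layer in layers:
        rate /= layer["stride"] ** 2         # striding slows the output stream
        cycle_budget = 1.0 / rate            # cycles available per output pixel
        factors.append(math.ceil(layer["c_in"] / cycle_budget))
    return factors

# e.g. the first two convolutions of a MobileNet-like model at input pixel rate 1
print(match_unroll_factors([{"c_in": 3, "stride": 2},
                            {"c_in": 32, "stride": 1}], input_pixel_rate=1.0))
```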
Table 4.1: Unrolling factors U and U′ generated by the throughput matcher for MobileNet-V1, depending on the input and output channel counts (Cl−1, Cl) and the stride of each convolution. dw and pw denote depthwise and pointwise convolutions; s1 and s2 represent strides of 1 and 2.
a channel-wise fashion with a moving mean µ and a moving standard deviation σ, and then applies an affine transformation with the learned γ and β:

$$X_l = \gamma\,\frac{X_l - \mu}{\sigma} + \beta. \qquad (4.1)$$
It is notable that Equation (4.1) can be rearranged into a channel-wise affine transformation. In the CNN feed-forward stage, I respectively quantise the scaling and offset factors of this affine transformation to fixed-point numbers:

$$y = \mathrm{quantise}_{8.8}\!\left(\frac{\gamma}{\sigma}\right) x + \mathrm{quantise}_{8.8}\!\left(\beta - \frac{\gamma\mu}{\sigma}\right), \qquad (4.2)$$

where quantise8.8(z) quantises z into 16-bit fixed-point values with a binary point at 8 bits.
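A NumPy sketch of Equations (4.1) and (4.2), folding batch normalisation into a quantised channel-wise affine transform, is given below; the concrete rounding and clipping behaviour of quantise8.8 is an assumption of this sketch:

```python
import numpy as np

def quantise_8_8(z):
    """16-bit fixed point with the binary point at 8 bits (8 integer, 8 fractional)."""
    q = np.clip(np.round(z * 256.0), -2 ** 15, 2 ** 15 - 1)
    return q / 256.0

def folded_bn(x, gamma, beta, mu, sigma):
    """Channel-wise y = quantise8.8(gamma/sigma) * x + quantise8.8(beta - gamma*mu/sigma)."""
    scale = quantise_8_8(gamma / sigma)
    offset = quantise_8_8(beta - gamma * mu / sigma)
    return scale[None, :, None, None] * x + offset[None, :, None, None]

x = np.random.randn(1, 16, 8, 8).astype(np.float32)
gamma, beta = np.ones(16), np.zeros(16)
mu, sigma = np.zeros(16), np.ones(16)          # illustrative BN statistics
y = folded_bn(x, gamma, beta, mu, sigma)
```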
4.6 Evaluating the performance of automatically generated hardware
4.6.1 Experimental setups
In this section, I report results for automatically generated hardware implementations of three distinct networks, each optimised for a different dataset. I use CifarNet [201], an 8-layer CNN with 1.30M parameters and 174M multiply-accumulates, on the CIFAR-10 dataset [94], MobileNet-V1 [73] on the ImageNet dataset [93], and a customised 5-layer CNN (FashionNet) for the Fashion MNIST dataset [176]. The first two networks are relatively large, but the last network is small. I use the MobileNet-V1 design to showcase the performance achieved by this hardware and software co-design workflow in comparison to other published accelerators. The hardware part (SystemVerilog output) is generated automatically by the framework using templates. Synplify Premier DP is used for synthesis and post-synthesis timing analysis. The final designs are implementable on FPGAs; Xilinx Vivado is used to place and route the full-size MobileNet design.
Table 4.2 shows the total amount of hardware utilised for the generated accelerators for
all networks on different devices. The MobileNet design uses an input pixel rate of 1 for
the best performance (achieving 3109 FPS on an Intel Stratix 10). The proposed workflow
is a scalable one since it can adjust the input pixel rate to control a trade-off between
performance and hardware utilisation. CifarNet results in Table 4.2 show how it is possible
to target a small FPGA device (Cyclone V) by adjusting this factor to 1/288. The results
suggest a 3× reduction in LUT usage compared to the design when the input pixel rate is
set to 1/32. An increase in latency is also observed, but part of the increase is attributable
to the frequency differences when running on various devices. On the other hand, if provided with
a small network (FashionNet), the proposed framework generates hardware that classifies
at a latency of 0.14ms on a very small FPGA device. The quantised FashionNet uses 3-bit
shift quantisation in the most resource-intensive third layer, and the remaining layers use
fixed-point weights with bit-widths from 5 to 7. Although FashionNet is small, it is a good
example of a specialised network produced for resource constrained edge devices; other
examples include emotion recognition [19].
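As a rough illustration of what shift quantisation does to the weights of such a layer, the sketch below maps values to signed powers of two so that multiplications become bit shifts; the exact 3-bit encoding (exponent range and treatment of zero) is an assumption and may differ from the implementation used here.

import numpy as np

def shift_quantise(w, bits=3, max_exp=0):
    # Quantise weights to signed powers of two. A `bits`-bit code here is
    # assumed to hold a sign and a (bits-1)-bit exponent offset below max_exp.
    n_exponents = 2 ** (bits - 1)
    exps = max_exp - np.arange(n_exponents)          # allowed exponents
    levels = 2.0 ** exps                             # positive magnitudes
    idx = np.argmin(np.abs(np.abs(w)[..., None] - levels), axis=-1)
    return np.sign(w) * levels[idx]

w = np.array([0.9, -0.3, 0.12, -0.05])
print(shift_quantise(w, bits=3))   # every weight becomes +/- 2^e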
I explore in Figure 4.7 the optimised CifarNets obtained with Algorithm 1 (denoted
by the “hybrid” points) and compare the results against shift (“shift”) and fixed-point
(“fixed”) models with all layers sharing the same bit-widths ranging from 3 to 8. To
explore the trade-off between top-1 error rates and resources, I ran Algorithm 1 20 times
by taking as input the accuracy budget values α_budget ranging from 80% to 100% at
1% increments. Here h_budget is set to 0, as I always minimise the resource utilisation.
I constrain each layer to use either shift or fixed-point quantised weights and choose
a bit-width ranging from 3 to 8. Additionally, E = 0, meaning that I skip the
fine-tuning process; without fine-tuning the accuracies are sub-optimal but the search
process above can complete within 1 hour. Figure 4.7a shows the trade-off between resource
utilisation and top-1 errors under the same throughput constraints. Figure 4.7b further
varies the throughput scaling of the optimised results, and shows that when synthesised
into circuits, the optimised models (“hybrid”) consistently outperform models (“shift”
and “fixed”) with either shift or fixed-point quantisation under the same bit-width
applied across all layer weights. Finally, Figure 4.7c illustrates all results found by the
three methods and the trade-off relationship between top-1 error rates and resource-latency
products.
Hybrid quantisation reduces the accuracy degradation and improves model compression
rates by utilising multi-precision multi-arithmetic convolutions. Importantly, I consider
the ImageNet [93] classification task for MobileNet. This challenging dataset leaves less
headroom for compression techniques. The classification results achieved on this large
dataset using a relatively compact network prove that the workflow remains robust beyond
cases where the model is over-provisioned for the target dataset.
Table 4.3: A comparison of CNN inference performance on FPGA and GPU platforms.
The quantisation of weights and activations are on the left. Target platform, frequency,
latency, throughput and arithmetic performance are on the right. Metrics with ∗ are our
best-case estimations as they are not mentioned in the original papers. Note VGG16 has
a similar top-5 accuracy to MobileNet-V1 when neither is quantised; many of the works
below do not report ImageNet accuracies after quantisation.
Implementation           Weights   Acts    Platform                    Frequency (MHz)   Latency (ms)   Throughput (FPS)   Arith. perf. (GOP/s)
VGG16
  Throughput-Opt [155]   FXP8      FXP16   Intel Stratix V             120               262.9          3.8*               117.8
  fpgaConvNet [164]      FXP16     FXP16   Xilinx Zynq XC7Z045         125               197*           5.07               156
                                                                                         163*           6.12*
MobileNet-V1
  Ours                   Mixed     FXP8    Xilinx Virtex US+ XCVU9P    125               0.40           2491               2833
  Zhao et al. [199]      FXP16     FXP16   Intel Stratix V             200               0.88           1131               1287
  Zhao et al. [200]      FXP8      FXP8    Intel Stratix V             150               4.33           231                264
  GPU                    FP32      FP32    Nvidia GTX 1080Ti           –                 279.4          515                586
[Figure 4.7 panels: (a) number of LUTs vs top-1 error under the same 1/32 scaling and the same throughput; (b) number of LUTs vs latency (µs) for models with ≤ 10% top-1 error; (c) top-1 error vs resource-latency product (kLUTs s) for all optimized models.]
Figure 4.7: A case study of trade-off options among hardware utilization (LUTs), per-
formance (latency) and model accuracy (top-1 error rate) before fine-tuning, targeting
a clock frequency of 200 MHz. The LUTs and latency numbers are from the hardware
estimator. Here, “shift” and “fixed” respectively indicate using the same shift and fixed-
point quantisation method across all layers with the same weight precisions. The “hybrid”
points are optimized by Algorithm 1. The areas shaded in red, green and blue respectively
denote the 2D Pareto frontiers of the “shift”, “fixed” and “hybrid” optimized results.
The generated design is different from most existing designs examined in Table 4.3 in
the following ways. First, the framework exploits hybrid quantisation to minimise the
impact of quantisation errors. Second, using the throughput scaling trick, the amount
of hardware required is reduced significantly. Most of the examined designs rely on a
high DRAM bandwidth with a large monolithic compute core. As discussed previously,
a large compute core cannot explore multi-precision and multi-arithmetic quantisations
and struggles to fully utilise compute units on convolutions with varying channel counts,
kernel sizes and input feature sizes. Tomato-generated designs compute various layers
concurrently and quantise each layer differently, thus achieving a very high utilisation of
the compute units and operating at around 3.5 TOp/s on a Stratix 10. Note that, for
accelerators that I compare against, the arithmetic performance reported in Table 4.3
considers the peak performance assuming an unbounded DRAM bandwidth. In reality,
such designs can easily be limited by the available memory bandwidth. In contrast, this
is not a concern in this design as all weights and activations are held on the FPGA.
Additionally, the generated design has a high throughput since operations rarely stall.
Designs proposed by Bai et al. [12] and Zhao et al. [199] have to execute computations in
a layer-wise fashion, and thus operations in the next layer only execute when the current
layer finishes. In this framework, similar to Shen et al. [148], computations in different
layers happen concurrently in the same pipeline stage while later layers never stall earlier
layers. Moreover, consecutive image inputs can be fully pipelined, because the generated
hardware utilises streaming sliding windows. These features help the hardware to achieve
high throughput compared to other designs (Table 4.3). The proposed workflow avoids the
complex and time-consuming design space exploration that is necessary in many of the
compared FPGA accelerators [12, 200].
In terms of performance, this design achieves a higher throughput and a lower latency
compared to all designs listed in Table 4.3. I notice that most CNN accelerators report
theoretical upper bounds for arithmetic performance and throughput. In terms of latency,
the numbers are reported optimistically assuming DRAM accesses cause no stalls. In this
design, since I stream in pixels of the input image, the computation pattern differs from
most existing designs. The reported values in Table 4.3 represent the true performance and
make no assumptions regarding DRAM bandwidth. The system automatically produces an
implementation of MobileNet for the Stratix 10 FPGA that outperforms Zhao et al. [199]
by 2.44× in latency and 3.52× in throughput.
As a further case study, I partition MobileNet-V1 onto two Stratix V FPGAs, connected through enhanced small form-factor
pluggable (SFP+) interfaces. I present the performance results in comparison to Zhang
et al. [190] in Table 4.4, and the detailed hardware utilisation information in Table 4.5.
The latency is not penalised thanks to the low latency of SFP+, which contributes only a
0.0013ms latency overhead.
The simple case study of partitioning the same MobileNet-V1 design to two devices
demonstrates that, first, Tomato-generated designs are scalable from single to multiple
FPGAs; second, accelerating new network architectures with mixed quantisation brings
significant improvements in accuracy, latency and throughput.
4.7 Summary
In this chapter, I presented a hardware-software co-design workflow to automatically
generate high-performance CNN accelerators. The workflow is able to quantise weights to
both fixed-point and shift values at various precisions, and keeps activations as fixed-point
numbers. In addition, it transforms batch normalisation to simple affine operations with
fixed-point scaling and offset factors. In hardware, the framework utilises the Roll-Unrolled
compute pattern and provides flexibility in rolling computations in the channel dimension.
As a result, the guided rolling minimises computation while keeping the input stream
stall-free. The results showed state-of-the-art performance in terms of model accuracy,
latency and throughput. The implemented accelerator for MobileNet is fully pipelined
with sub-millisecond latency (0.32ms) and is able to classify at around 3000 frames per
second.
Chapter 5
The previous two chapters describe several software and hardware techniques for improving
the energy efficiency of running DNN inference on different hardware systems. These
methods assume pre-trained networks and pre-defined network architectures. In this
chapter I will explore opportunities in the model architecture space and the creation of
automated tools to aid in their design. This chapter aims to demonstrate that Automated
Machine Learning (AutoML) is a powerful tool for designing and optimising emerging neural
network types and new learning scenarios.
In this chapter, I will first introduce the concepts of Automated Machine Learning
(AutoML) and Network Architecture Search (NAS), and discuss why they are particularly
useful for designing hardware-aware DNNs. I show two particular applications of NAS in
this chapter and demonstrate how they can be an efficient solution for emerging neural
network types and new learning setups. First, I demonstrate how state-of-the-art NAS
algorithms can be made hardware-aware and show a novel NAS algorithm aiming at
finding efficient Graph Neural Networks (GNNs). Later, I show how NAS algorithms can
be deployed in a many-task many-device setup for few-shot learning.
This chapter includes relevant contents published in:
• Yiren Zhao, Duo Wang, Xitong Gao, Robert Mullins, Pietro Lio, and Mateja Jamnik.
Probabilistic dual network architecture search on graphs. In Deep Learning on
Graphs: Method and Applications Workshop for 35th AAAI Conference on Artificial
Intelligence (DLG-AAAI 2021), recipient of the best student paper award.
• Yiren Zhao, Duo Wang, Daniel Bates, Robert Mullins, Mateja Jamnik, and Pietro
Lio. Learned low precision graph neural networks. In The 1st Workshop on Machine
Learning and Systems (EuroMLSys 2021), full version is in submission
• Yiren Zhao, Xitong Gao, Ilia Shumailov, Nicolo Fusi, Robert Mullins. Rapid Model
Architecture Adaption for Meta-Learning. In submission
Yiren Zhao conceived the experiments for graph-based NAS methods. Duo Wang later
enhanced the search process with a Gumbel-softmax based stochastic method. Yiren
Zhao conceived the experiments for the MAML-based NAS and implemented it as an
extension to the MAML++ framework [8]. Ilia Shumailov and Xitong Gao ported other
versions using metric-based MAML frameworks as a comparison baseline.
First, I show how NAS can be an effective approach for emerging network types, and how it can be combined with additional search
spaces to produce models with hardware awareness. Second, there is now a growing need
to apply neural networks to various hardware systems and learning setups. Compared
to a few years ago, there is now a diverse set of hardware devices, ranging from server-class
GPUs to IoT devices, that need to execute DNN inference. This chapter therefore
considers a novel many-task many-device deployment scenario; in particular, I consider
this setup in few-shot learning and demonstrate how NAS can be an effective approach in
this setup.
Second, GNNs employ a variety of aggregation mechanisms, where the choice of method influences the required numerical
precision. Third, previous CNN NAS methods [174] consider two possible quantisable
components (weights and activations) for a single layer. Given n quantisation options, this
provides n^2 combinations. I show that a single GNN layer has n^6 quantisation combinations,
offering a much larger search space.
In this section, I propose Low Precision Graph NAS (LPGNAS), which aims to
automatically design quantised GNNs. The proposed NAS method is single-path, one-shot
and gradient-based. Additionally, LPGNAS’s quantisation options are at a fine granularity
so that different sub-blocks in a graph block can have different quantisation strategies. I
describe the details of how to design a hardware-aware GNN NAS method. Section 5.2.1
and Section 5.2.2 describe the architectural and quantisation search spaces of GNNs
respectively. I discuss the results of the proposed NAS in Section 5.2.4.
Here h^{k-1} ∈ R^{N×D} is the representation of the transformed features, where N is the number of
nodes in a graph and D is the dimension of the features. In GNNs, there is typically a linear
transformation (linear layer in Figure 5.1) of the form W^k h^{k-1}. a^k are the attention
parameters for messages passed from neighbouring nodes (attention in Figure 5.1). AGGR
is an aggregation operation for the messages received and is also responsible for combining
aggregated messages with features of the current node. Act is a non-linear activation.
Figure 5.1: The structure of a single graph block consists of four sub-blocks: Linear Layer,
Attention, Aggregation and Activation.
A Graph Block is similar to a cell employed in the CNN NAS algorithm DARTS [113].
DARTS uses a weighted sum to combine outputs of all candidate operators. In the
proposed NAS, I use the arg max function, which allows only one candidate operator to
be active in each training iteration. Let ō_{i,k} be the k-th sub-block in Graph Block i, and
o_{i,k,t} be the t-th candidate operator for ō_{i,k}. ō_{i,k} is then computed as:

\bar{o}_{i,k} = o_{i,k,\,t^{\max}_{i,k}}, \quad \text{where } t^{\max}_{i,k} = \arg\max_{t \in T} P^{a}_{i,k,t}. \qquad (5.2)

Here, P^{a}_{i,k,t} is the probability of the t-th candidate operator of sub-block k and layer
i assigned by the NAS Controller. P is a probability distribution over all candidate
operators, and P ∈ R^{T×K×I}, where T, K and I are the number of candidate operators,
sub-blocks and layers respectively. This hard-max approach considerably reduces memory
and computational cost since only one operation is active at any training iteration, whilst
still converging, as shown by Wu et al. [175] and Xie et al. [177]. While the arg max function
is non-differentiable, I use a differentiable approximation which ensures that the controller
receives learning signals. I will introduce the full details of the proposed NAS algorithm in
the next section.
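One common way to approximate the hard arg-max selection of Equation (5.2) is a straight-through estimator, sketched below in PyTorch; this is an illustrative approximation and not necessarily the exact estimator used by LPGNAS.

import torch
import torch.nn.functional as F

def hard_select(logits, candidates, x):
    # Forward pass runs only the arg-max candidate; gradients reach the
    # controller logits through a straight-through softmax approximation.
    probs = F.softmax(logits, dim=-1)
    idx = int(torch.argmax(probs))
    one_hot = F.one_hot(torch.tensor(idx), num_classes=len(candidates)).float()
    mask = one_hot + probs - probs.detach()   # hard in forward, soft in backward
    return mask[idx] * candidates[idx](x)

# Example: three candidate operators for one graph-block slot.
candidates = [torch.nn.Identity(), torch.nn.ReLU(), torch.nn.Tanh()]
logits = torch.zeros(3, requires_grad=True)   # controller output for this slot
x = torch.randn(4, 8)
y = hard_select(logits, candidates, x)
y.sum().backward()
print(logits.grad)                            # the controller receives learning signals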
In Table 5.1, I include all the operations described above in our micro-architectural
search space. I show all attention types that are in the search space in Table 5.1. The
attention types include various styles of parametric and non-parametric attention methods.
In Table 5.2, I list all activation types that were considered in the architectural search
Table 5.1: Different types of attention mechanisms. W here is the parameter vector for
attention, ⟨·, ·⟩ is the dot product, and a_ij is the attention for the message from node j
to node i. [Table body not recovered in extraction.]

Table 5.2: Different types of activation functions considered in the search space.

Activation    Equation
None          f(x) = x
Sigmoid       f(x) = 1 / (1 + e^{-x})
Tanh          f(x) = tanh(x)
Softplus      f(x) = (1/β) log(1 + e^{βx})
ReLU          f(x) = max(0, x)
LeakyReLU     f(x) = max(0, x) + α min(0, x)
ReLU6         f(x) = min(max(0, x), 6)
ELU           f(x) = max(0, x) + min(0, α(e^x − 1))
space. Apart from attention and activation types, I also consider searching for the best
hidden feature size, using two fully-connected layers whose intermediate layer has an
expansion factor picked from the set {1, 2, 4, 8}. In terms of aggregation, the
choices considered include mean, add and max.
quantisation from Qw , meaning that a single graph block receives a mixed quantisation
for different quantisable components annotated in Equation (5.4).
Table 5.3: The quantisation search space: weight and activation quantisation options
available to each quantisable operation.

Weights                                     Activations
Quantisation   Frac Bits   Total Bits       Quantisation   Frac Bits   Total Bits
binary         0           1                fix2.2         2           4
binary         0           1                fix4.4         4           8
ternary        0           2                fix2.2         2           4
ternary        0           2                fix4.4         4           8
ternary        0           2                fix4.8         8           12
fix1.3         3           4                fix4.4         4           8
fix2.2         2           4                fix4.4         4           8
fix1.5         5           6                fix4.4         4           8
fix3.3         3           6                fix4.4         4           8
fix2.4         4           6                fix4.4         4           8
fix4.4         4           8                fix4.4         4           8
fix4.4         4           8                fix4.8         8           12
fix4.4         4           8                fix8.8         8           16
fix4.8         8           12               fix4.8         8           12
fix4.12        12          16               fix4.4         4           8
fix4.12        12          16               fix4.8         8           12
fix4.12        12          16               fix8.8         8           16
The reasons for considering input activation values as quantisation candidates are the
following. First, GNNs always consider large input graph data: the computation operates
at a per-node or per-edge resolution but requires the entire graph or the entire sampled
graph as input. The amount of intermediate data produced during the computation is
huge and places a large pressure on the amount of on-chip memory available. Second,
quantised input activations together with quantised parameters simplify the arithmetic
operations. For instance, Linear(Q_w(w^k), Q_h(h^{k-1})) considers both quantised h^{k-1} and
quantised w^k, so that the matrix multiplication with these values can operate on simpler
fixed-point arithmetic.
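The following NumPy sketch illustrates a FIXx.y-style quantisation applied to both weights and input activations before a linear operation; the treatment of the sign bit and the saturation range is an assumption, since the text only specifies integer and fractional bit counts.

import numpy as np

def fix_quantise(z, int_bits, frac_bits):
    # FIXx.y fixed-point quantisation: x integer bits, y fractional bits.
    # The signed range assumed here is roughly [-2**x, 2**x - 2**-y]; the
    # thesis's exact sign convention may differ.
    step = 2.0 ** -frac_bits
    return np.clip(np.round(z / step) * step, -2.0 ** int_bits, 2.0 ** int_bits - step)

# A w4a8-style quantised linear layer: fix2.2 weights (4 bits total) and
# fix4.4 input activations (8 bits total), both options listed in Table 5.3.
rng = np.random.default_rng(0)
w = fix_quantise(rng.normal(0, 0.5, (16, 8)), int_bits=2, frac_bits=2)
h = fix_quantise(rng.normal(0, 1.0, (32, 16)), int_bits=4, frac_bits=4)
out = h @ w        # both operands are now representable in fixed point
print(out.shape)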
The quantisation search space identified in Equation (5.4) is different from the search
space identified by Wu et al. [174] and Guo et al. [59]. Most existing NAS methods
focusing on quantising CNNs look at quantisation at each convolutional block. In a graph
neural network, a single graph block is equivalent to a convolutional block. In a single
graph block, I look at more fine-grained quantisation opportunities within the four sub-
blocks. The quantisation search considers a wide range of fixed-point quantisations. The
weights and activations can pick bit-widths from [1, 2, 4, 6, 8, 12, 16] and [4, 8, 12, 16]
respectively, and I allow different integer and fractional bit combinations. I list the
detailed quantisation options in Table 5.3. Table 5.3 shows the quantisation search
space, each quantisable operation identified will have these quantisation choices available.
BINARY means binary quantisation, and TERNARY means two-bit ternary quantisation.
FIXx.y means fixed-point quantisation with x integer bits and y bits for fractions. I also
demonstrate the number of fractional bits (FRAC BITS) and total number of bits (TOTAL
BITS).
In total, as listed in Table 5.3, there are 17 quantisation options; and as illustrated in
Equation (5.4), there are six possible components that join the quantisation search. This
gives in total 17^6 = 24 137 569 quantisation combinations for a Graph Neural Network.
I describe the Low Precision Graph Network Architecture Search (LPGNAS) algorithm
in Algorithm 2. (x, y) and (x_v, y_v) are the training and validation data respectively. M is the
total number of search epochs, and Ma and Mq are the epochs to start architectural and
quantisation search. Before the number of epochs reaches M_a or M_q, LPGNAS randomly
picks samples and warms up the trainable parameters in the supernet. K is the number
of steps to iterate in training the supernet. After generating noise N , I use this noise
in architectural controller ga and quantisation controller gq together with the validation
data xv to determine the architectural πa and quantisation πq choices. After reaching the
pre-defined starting epoch (Ma or Mq ), LPGNAS starts to optimise the controllers’ weights
(wa and wq ). I choose the hyper-parameters Mq = 20, Ma = 50, α = 1.0, β = 0.1, unless
specified otherwise. I provide an ablation study in the Appendix to justify our choices of
hyper-parameters. Notice that QLoss estimates the current hardware cost based on both
architectural and quantisation probabilities. As shown in Equation (5.5), I define S as a
set of values that represents the costs (number of parameters) of all possible architectural
options: for each value s in this set, I multiply it with the probability pa from architecture
probabilities Pa generated by the NAS architecture controller. Similarly, I produce the
weighted-sum for quantisation from the set of values Sq that represents costs of all possible
quantisations. The dot product <, > between these two weighted-sum vectors produces
the quantisation loss Lq .
L_q = \mathrm{QLoss}(P_a, P_q) = \left\langle \sum_{s \in S,\, p_a \in P_a} s \cdot p_a \,,\; \sum_{s_q \in S_q,\, p_q \in P_q} s_q \cdot p_q \right\rangle \qquad (5.5)
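A short PyTorch sketch of this quantisation loss is given below; the exact indexing of the weighted sums over sub-blocks and layers is my reading of Equation (5.5), and the cost values are placeholders.

import torch

def q_loss(P_a, S_a, P_q, S_q):
    # P_a: (T_a, K, I) architecture probabilities,  S_a: (T_a,) option costs
    # P_q: (T_q, K, I) quantisation probabilities,  S_q: (T_q,) option costs
    exp_arch = torch.einsum('t,tki->ki', S_a, P_a)    # sum_t s * p_a per sub-block/layer
    exp_quant = torch.einsum('t,tki->ki', S_q, P_q)   # sum_t s_q * p_q per sub-block/layer
    return torch.dot(exp_arch.flatten(), exp_quant.flatten())

# Toy example: 3 architectural options, 17 quantisation options, 4 sub-blocks, 2 layers.
P_a = torch.softmax(torch.randn(3, 4, 2), dim=0)
P_q = torch.softmax(torch.randn(17, 4, 2), dim=0)
S_a = torch.tensor([1.0e3, 2.0e3, 4.0e3])   # e.g. parameter counts of the options
S_q = torch.arange(1.0, 18.0)               # e.g. bit-widths of the options
print(q_loss(P_a, S_a, P_q, S_q))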
Figure 5.2 is an illustration of the LPGNAS algorithm. I use two controllers (ga
and gq ) for architecture and quantisation (named as Controller and Q-Controller in the
diagram) respectively. The controllers use trainable embeddings connected to linear layers
to produce architecture and quantisation probabilities. The architecture controller provides
probabilities (Pa ) for picking architectural options of graph blocks and also the router. The
router is in charge of determining how each graph block shortcuts to the subsequent graph
blocks [205]. In addition, the quantisation controller provides quantisation probabilities (Pq )
to both modules in graph blocks and the linear layers in the router. In a graph block, only
the linear and attention layers have quantisable parameters and have matrix multiplication
operations; neither activation nor aggregation layers have any trainable or quantisable
parameters. However, all input activation values of each layer in the graph block are
quantised. The amount of intermediate data produced during the computation causes
memory pressure on many existing or emerging hardware devices, so even for modules
that do not have quantisable parameters, I choose to quantise their input activation values
to help reduce the amount of buffer space required.
Figure 5.2: An overview of the LPGNAS architecture. Pa denotes controller output
(Purple) for selecting different operations within each graph block, and for routing shortcut
connections between graph blocks. Pq denotes quantisation controller (Q-Controller)
output (Green) for selecting quantisations for operations within graph blocks and shortcut
connections. Gij are gating functions conditioned on Pa . Solid lines are input streams into
the router while dashed lines are output streams.
The Citation datasets include nodes representing documents and edges representing citation
links. The task is to distinguish which research field the document belongs to [184].
Table 5.4 shows the performance of GraphSage [62], GAT [163], JKNet [179], PDNAS
[205] and LPGNAS on the citation datasets with a partition of 0.6, 0.2, 0.2 for training,
validation and testing respectively. For the quantisation options of GraphSage, GAT and
JKNet, I manually perform a grid search on the Cora dataset. The grid search considers
various weight and activation quantisations in a decreasing order. I reimplemented
Table 5.4: Accuracy and size comparison on Cora, Pubmed and Citeseer with a data split
of 60%, 20% and 20% for training, validation and testing. Our results are averaged across
3 independent runs.
GraphSage [62], GAT [163] and JKNet [179] for quantisation, and provide the architecture
choices. For the manual baseline networks, I used the quantisation strategy found on Cora
for Citeseer and Pubmed. For LPGNAS, I searched with a set of base configurations that
I clarified in the Appendix.
There are several hyperparameters in the LPGNAS algorithm (Algorithm 2). As
mentioned, I pick M_q = 20, M_a = 50, α = 1.0, β = 0.1 and lr = 0.005 through a hyperparameter
study. For β, M_q and M_a, I justify these choices by sweeping across
different values of β, M_a and M_q on the Pubmed dataset (Figure 5.3).
Figure 5.3: Collected statistical information for quantisation; the horizontal axis shows
the chosen bit-width and the vertical axis shows the occurrences.
The results in Table 5.4 suggest that LPGNAS shows better accuracy on both Cora
and Pubmed with quantised networks. In addition, although LPGNAS does not show the
smallest sizes on these two datasets, it is only slightly bigger than the manual baselines but
shows a much better accuracy. On Citeseer, LPGNAS only shows slightly worse accuracy
(around 0.1% lower) with a considerably smaller size (around a 9× reduction in model size).
Table 5.5: Accuracy and size comparison on Amazon-Computers and Amazon-Photo [146].
Our results are averaged across 3 independent runs.

                               Amazon-Computers                                        Amazon-Photos
Method          Quan     Acc (%)       Size (KB)   Act Size (MB)   Bitops (G)    Acc (%)       Size (KB)   Act Size (MB)   Bitops (G)
SageNet         float    84.0 ± 0.0    49.8        4.6             6.8           84.0 ± 0.1    49.8        4.6             6.8
SageNet         w4a8     83.9 ± 0.1    6.2         1.1             0.21          83.9 ± 0.1    6.2         1.1             0.21
GAT             float    84.4 ± 0.3    199.8       4.6             3.6           84.2 ± 0.1    199.8       4.6             3.6
GAT             w4a8     81.8 ± 1.5    25.0        1.1             0.10          83.5 ± 0.5    25.0        1.1             0.10
JKNet           float    87.6 ± 0.0    130.5       10.1            74.5          87.7 ± 0.1    130.5       10.1            74.5
JKNet           w4a8     38.0 ± 0.0    16.3        2.5             2.3           38.0 ± 0.0    16.3        2.5             2.3
PDNAS-4         w4a8     85.6 ± 0.1    100.0       2.7             2.1           84.6 ± 0.1    75.4        1.4             3.3
LPGNAS          mixed    90.5 ± 0.0    157.3       3.7             3.5           89.7 ± 0.0    164.0       3.7             4.5
LPGNAS-small    mixed    88.7 ± 0.1    3.5         1.0             1.3           88.6 ± 0.1    3.6         0.81            1.2
Table 5.6: Accuracy and size comparison on Flickr [189], Cora-full [13] and Yelp [189].
Our results are averaged across 3 independent runs.
Table 5.5 and Table 5.6 show how LPGNAS performs on large datasets sampled using
the recently proposed GraphSaint sampling [189]. The detailed configuration of the
sampler is in the Appendix. I provide additional baselines on these datasets since the
original GraphSage [62], GAT [163], JKNet [179] under-fit this challenging task. I thus
implemented V2 versions of the baselines that have larger channel counts; the architecture
details are reported in the Appendix. In addition, for quantising the baseline models I
uniformly applied a w4a8 (4-bit weights, 8-bit input activations) quantisation. Although I
train on a sampled sub-graph, evaluating networks requires a full traversal of the complete
graph. Since the datasets considered are large, traversing the entire graph adds a huge
computation overhead. If I used the previous grid search over all quantisation possibilities
(roughly 16^4 in our case) for all the baseline networks, this would require a huge amount of
search time. I therefore fixed the quantisation strategy to w4a8 for the baseline networks.
The results in Table 5.5 and Table 5.6 suggest that LPGNAS found the models with
[Figure 5.4 panels plot accuracy against model size (KB), activation size (MB) and bitops (G) for LPGNAS and the Sage, QSage, GAT, QGAT, JKNet, QJKNet and PDNAS baselines.]
Figure 5.4: Pareto frontier of LPGNAS on Cora-full with baseline models. I show the
trade-offs between model size, activation size and bitops; LPGNAS demonstrates Pareto
dominance.
best accuracy and also with a relatively competitive model size when compared to a
wide range of manual baselines. To further demonstrate that LPGNAS is a flexible NAS
method, I adjust the base configurations to provide models that are equal or smaller than
the smallest baseline networks in Table 5.5 and Table 5.6 and name them LPGNAS-small.
I describe how I turn the knobs in LPGNAS by adjusting the base configurations in the
Appendix. It is clear, in both Table 5.5 and Table 5.6, that LPGNAS outperforms all
baselines in terms of accuracy or micro-F1 scores. LPGNAS-small, on the other hand,
trades off accuracy for a better model size; however, it shows better performance
than baseline models that have a similar size budget. I also visualise the performance of
LPGNAS on sampled datasets using the Pareto frontier plots for Cora-full in the Appendix.
As expected, LPGNAS shows Pareto dominance compared to other manually designed
networks.
LPGNAS is able to achieve significant performance gains in terms of accuracy over the
baseline networks. For instance, on Flickr, CoraFull, and Yelp, I show on average an around
20% increase compared to even the full-precision baseline models, while LPGNAS produced
quantised models. The performance gap between quantised baselines and LPGNAS is
even greater. On the other hand, LPGNAS is able to produce extremely small models.
For instance, LPGNAS-Small is able to produce models that are around 50× smaller on
Amazon-Computers and Amazon-Photos while only suffering an around 1% drop in
accuracy compared to standard LPGNAS.
To further prove the point that LPGNAS generates more efficient networks, I sweep
across different configurations and different quantisation regularisations to produce
Figure 5.4. LPGNAS shows Pareto dominance compared to all evaluated methods, meaning
that it strikes the best balance between computational complexity and accuracy.
Figure 5.4 also shows that LPGNAS models can run with larger batch sizes (reduction in
activation size) and with less computation (reduction in bitops).
Table 5.7: Search cost in GPU hours, all experiments are conducted on an NVIDIA
GeForce RTX 2080 Ti GPU.
of using larger bit-widths (around 8 bits). Based on these results, if one would like to
manually quantise GNNs, I would suggest starting with around 4 bits for weights and 8
bits for activations. However, it is worth mentioning that LPGNAS is able to search for
the best combination of quantisation and architectural decisions; for instance, I observed
that LPGNAS was able to choose operations with fewer parameters but higher bit-widths.
Nevertheless, the large number of runs across a relatively wide range of datasets aims at
sharing empirical insights for practitioners who manually tune quantisations for GNNs.
For GNN services that are energy-, memory- or latency-critical, and for future AI
ASIC designers focusing on GNN inference, the gathered quantisation statistics can be a
useful guideline for fitting a suitable quantisation method to their networks.
Figure 5.5: Collected statistical information for quantisation on Cora, Citeseer, PubMed,
Amazon-Computers, Amazon-Photos, Flickr, CoraFull and Yelp; the horizontal axis shows
the chosen bit-width and the vertical axis shows the occurrences of quantisations (not
weighted by the number of weights). A bit-width count of 16 is not present because it has
never been selected by LPGNAS.
It is also interesting to observe that in both Table 5.5 and Table 5.6, some of the
manually designed GNNs suffer from a large accuracy drop when applying a 4-bit weight and
8-bit activation fixed-point quantisation (w4a8). For instance, both quantised JKNet and
JKNet-V2 drop to 4.4% accuracy on the Cora-full dataset (Table 5.6). However, the
collected statistics of LPGNAS suggest that w4a8 is sufficient, showing that modifications
of the network architecture are the key to maintaining accuracy under aggressive
quantisation strategies.
The applied fixed-point quantisation is a straightforward one; many researchers have
proposed more advanced quantisation schemes for CNNs or RNNs [202, 192, 150]. I
suggest that future research investigating how these quantisation strategies work on GNNs
should consider w4a8 fixed-point quantisation as a baseline case, because the collected
statistics suggest that w4a8 fixed-point quantisation is the most aggressive quantisation
for searched GNNs that still guarantees minimal loss of accuracy, and manual GNNs often
do not perform well using the w4a8 setup.
The ability to quickly adapt to new tasks can potentially shrink the O(THC) complexity illustrated
in Figure 5.6 to O(HC). In the meantime, hardware-aware NAS methods [18, 17, 181],
e.g. the train-once-for-all technique [18], support deployment of searched models that fit
different hardware platforms with various latency constraints. These hardware-aware NAS
techniques reduce the search complexity from O(THC) to O(T) [17].
Figure 5.6: Deploying networks in a many-task many-device setup. This implies a large
search complexity O(THC) and requires a huge amount of architectural engineering time.
• The classic NAS search space contains many over-parameterised sub-models, which
makes it hard to tackle the over-fitting phenomenon in few-shot learning.
To tackle these challenges, I propose to use Global Expansion (GE) and Adaptive
Number of Layers (ANL) to allow a drastic change in model capacity for tasks with
varying difficulties. The experiments presented later show that such changes alleviate
over-fitting in few-shot learning and improve accuracy significantly. I also present a novel
layer-wise profiling strategy that allows the reuse of profiling information across different tasks.
the hope of generalising its knowledge to new tasks [47].
Equation (5.6) captures the optimisation objective of meta-learning, where optimal param-
eters are obtained through optimising on a set of meta-training tasks. Current mainstream
approaches that use meta-learning to tackle few-shot learning problems can be roughly
categorised into three major types: Memory-based, Metric-based and Optimisation-based.
Memory-based methods utilise a memory-augmented neural network [123, 52] to
memorise meta-knowledge for fast adaption to new tasks. Metric-based methods aim to
meta-learn a high-dimensional feature representation of samples, and then apply certain
metrics to distinguish them. For instance, Meta-Baseline utilises the cosine nearest-centroid
metric [26] and DeepEMD applies the Earth Mover's Distance [191]. Optimisation-based
methods, on the other hand, focus on learning a good parameter initialisation (also known
as meta-parameters or meta-weights) from a great number of training tasks, such that these
meta-parameters adapt to new few-shot tasks within a few gradient updates. The most
well-established Optimisation-based method is Model-Agnostic Meta-Learning (MAML)
[47]. MAML is a powerful yet simple method to tackle the few-shot learning problem,
since its adaption relies solely on gradient updates. Antoniou et al. later demonstrate
MAML++, a series of modifications that improved MAML’s performance and stability [8].
While the meta-learning framework is increasingly used to solve few-shot learning
challenges, little attention has been paid to the run-time efficiency of these approaches.
Meta-learning has been explored in key applications such as facial and speech recognition
[74, 54] for mobile devices. Real-life deployments on these devices resemble a many-task
many-device scenario, where learning on each user's data is a few-shot learning task
and different hardware platforms represent the different types of devices under deployment.
Memory-based and Metric-based meta-learning methods are then challenged by the
hardware or latency constraints: Memory-based methods need additional storage space (at
least double) and Metric-based approaches use multiple inference runs (at least two) for a
single image classification. Our proposed NAS method then follows the simple yet effective
MAML++ framework. After adapting the model to new tasks, MAML++ executes exactly
one network inference run for a single test sample without additional memory usage.
Several NAS methods have been proposed under the MAML framework [88, 144, 107];
these methods successfully reduce the search complexity from O(T) to O(1). However, some
of these methods do not show significant performance improvements compared to
carefully designed MAML methods (e.g. MAML++) [88, 144]. In the meantime, some of
these MAML-based NAS methods follow the Gradient-based approach and operate on
complicated cell-based structures [107]. I illustrate later how cell-based NAS causes an
undesirable effect on latency, and also meets fundamental scalability challenges when
trying to deploy in a many-task many-device setup.
[Figure 5.7 panels: Phase 1: Super-net Pre-training; Phase 2: Layer-wise Profiling (profile each layer on different devices); Phase 3: Device-aware Task-specific Deployment.]
Figure 5.7: An overview of the three main stages of the proposed H-Meta-NAS algorithm.
The first stage is to meta-train a super-net, the second stage is to adapt the super-net to
a particular task, and the third stage is to adapt the super-net to a particular hardware
system.
In the MAML setup, I consider a set of tasks T and each task Ti ∈ T contains a support
set Dis and a query set Diq . The support set is used for task-level learning while the query
set is in charge of evaluating the meta-model. All tasks are divided into three sets, namely
meta-training (Ttrain ), meta-validation (Tval ) and meta-testing (Ttest ) sets.
Equation (5.7) formally states the objective of the pre-training stage illustrated in
Figure 5.7 Phase 1. The objective of this process is to optimise the parameters θ of the
super-net for various sub-networks sampled from the architecture set A. This ensures that
the proposed H-Meta-NAS has both the meta-parameters and the meta-architectures ready
for the adaption to new tasks. Section 5.3.2.3 describes the details of the meta-training
and a progressive shrinking strategy to prevent sub-model interference.
H-Meta-NAS considers a search space composed of different kernel sizes, numbers of channels
and activation types. I mostly consider a VGG9-based NAS backbone, which is a 5-layer
CNN model with the last layer being a fully connected layer. I chose this NAS backbone
because both MAML [47] and MAML++ [8] used a VGG9 model architecture. The details
of this backbone are in the Appendix.
I allow kernel sizes to be picked from {1, 3, 5}, channels to be expanded with a set
of scaling factors {0.25, 0.5, 0.75, 1, 1.5, 2, 2.25}, and six different activation functions
(details in the Appendix). For a single layer, there are 3 × 7 × 6 = 126 search options. H-Meta-NAS
also contains an Adaptive Number of Layers strategy: the network is allowed to
use a subset of the total layers in the supernet, with a maximum usage of 4 layers. The
whole VGG9-based backbone then gives in total 126^4 × 4 ≈ 10^9 possible neural network
architectures.
In addition, to demonstrate the ability of H-Meta-NAS on a more complex NAS backbone,
I also studied an alternative ResNet12-based NAS backbone that has approximately
2 × 10^24 possible sub-networks.
As illustrated by prior work [18], progressively shrinking the super-net during meta-training
can reduce sub-model interference. I observe the same phenomenon and use a
similar progressive shrinking strategy in H-Meta-NAS: the architectural sampling process
α ∼ p(A) will pick the largest network with a probability of p, and randomly pick other
sub-networks with a probability of 1 − p. I apply an exponential decay strategy to p:
p = p_e + (p_i - p_e) \cdot \exp\!\left(-\alpha \cdot \frac{e - e_s}{e_m - e_s}\right) \qquad (5.9)
p_e and p_i are the end and initial probabilities, e is the current number of epochs, and e_s
and e_m are the starting and ending epochs of this decay process. α determines
how fast the decay is. In our experiments, I choose p_i = 1.0 and e_s = 30, because the
super-net reaches a relatively stable training accuracy at that point. I then start the
decaying process, and the value α = 5 is determined through the hyper-parameter
study shown in Table 5.8.
Table 5.8: Tuning the decay factor α for pre-training on Mini-ImageNet 5-way 1-shot
classification. Accuracy is averaged across 100 randomly picked sub-networks.
α               0.1      0.5      5        10       50
Avg Accuracy    0.424    0.4145   0.5464   0.5323   0.4423
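The decay schedule of Equation (5.9) can be written as a few lines of Python, sketched below; the end probability p_e and the ending epoch e_m are illustrative values, since the text only fixes p_i = 1.0, e_s = 30 and α = 5.

import math

def largest_net_probability(epoch, p_i=1.0, p_e=0.1, e_s=30, e_m=100, alpha=5.0):
    # Probability of sampling the largest sub-network at a given epoch,
    # following the exponential decay of Equation (5.9).
    if epoch < e_s:
        return p_i
    return p_e + (p_i - p_e) * math.exp(-alpha * (epoch - e_s) / (e_m - e_s))

for e in (30, 40, 60, 100):
    print(e, round(largest_net_probability(e), 3))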
Table 5.9: Comparing the latency predictor with our proposed profiling. MSE Error is the
error between estimated and measured latency; Time is the total time taken to collect
data and build the estimator.
The layer-wise profiling and look-up method is a perfect match for our learning scenario. I pick 16K
training samples and 10K validation samples to train and test the latency predictor; this
training setup is identical to Cai et al. [18]. I use another 10K testing samples to evaluate
the performance of the OFA-based latency predictor against layer-wise profiling on different
hardware systems in terms of MSE (measuring the latency estimation quality) and Time
(measuring the efficiency).
As illustrated in Table 5.9, layer-wise profiling not only saves time but also has a
smaller MSE error compared to the predictor-based strategy that is very popular in today's
evolutionary NAS frameworks [18, 17]. In addition, layer-wise profiling shows orders
of magnitude better run-time when targeting hardware devices with scarce computational
resources. If I consider an IoT-class device as a target (e.g. the Raspberry Pi Zero), it
requires an unreasonably large amount of time to generate training samples for latency
predictors, making them an infeasible approach in real life. For instance, the total time
consumed by the latency predictor is infeasible on the Pi Zero (last row in Table 5.9).
Of course, in reality, there is also a great number of IoT devices using CPUs that are more
low-end than the Pi Zero's (e.g. ARMv5 or ARMv4), making the latency predictor even harder
to deploy on these devices. Also, in the many-hardware setup considered in this work,
this profiling is executed O(H) times.
Most existing layer-wise look-up approaches consider at most mobile systems as
target platforms [181, 182]. These systems are in general more capable than a great
range of IoT devices. In this work, I demonstrate the effectiveness of this approach on
more low-end systems (Raspberry Pi and Pi Zero), illustrating that it is the more scalable
approach for hardware-aware NAS on constrained hardware systems.
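A minimal sketch of the layer-wise profile-and-look-up idea is given below; the per-layer keys, timings and helper names are hypothetical and only illustrate how a device-specific look-up table replaces a learned latency predictor.

import time

def profile_layer(run_layer, n_warmup=10, n_runs=100):
    # Average wall-clock latency of one layer configuration on the target
    # device; executed once per device to populate the look-up table.
    for _ in range(n_warmup):
        run_layer()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_layer()
    return (time.perf_counter() - start) / n_runs

def estimate_latency(arch, lut):
    # Sub-network latency estimated by summing the profiled latencies of its
    # per-layer choices from the device-specific look-up table.
    return sum(lut[layer_choice] for layer_choice in arch)

# Hypothetical per-layer keys: (layer index, kernel size, channel scale).
lut = {(0, 3, 1.0): 0.8e-3, (1, 5, 0.5): 0.6e-3, (2, 3, 2.0): 1.4e-3}
print(estimate_latency([(0, 3, 1.0), (1, 5, 0.5), (2, 3, 2.0)], lut))
print(profile_layer(lambda: sum(range(10_000))))   # example measurement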
5.3.2.5 Adaption strategy
The adaption strategy uses a genetic algorithm [171] to pick the best-suited sub-network
with respect to a given hardware constraint. In general, the adaption algorithm randomly
samples a set of tasks from T_val, and uses the averaged loss value and the satisfaction of the
hardware constraints as indicators for the genetic algorithm. The genetic algorithm
has a pool size P and a number of iterations M; I demonstrate that the optimal values are
P = 100, M = 200 by sweeping different combinations of these hyper-parameters. I
detail this hyper-parameter tuning process in later paragraphs.
Algorithm 3 details the adaption algorithm. In the Mutate function, each architecture
is ranked by its averaged loss across all sampled tasks, and the 10% of architectures
with the lowest loss values are then used to perform a classic genetic-algorithm mutation
[171]. The mutation allows each top-performing architecture to have two randomly
picked architectural choices modified to choices different from the original ones. The
mutation function considers the original pool of architectures (A), their averaged loss
values (L_a), and the cost of each architecture, which can be computed from the pre-built
hardware-specific hash table H_t(A). I then only mutate the subset of A whose hardware
cost satisfies the constraint, {A | A ∈ A ∧ H_t(A) ≤ C}.
I identify two hyperparameters that can potentially affect the performance,
namely the number of iterations M and the pool size P, and run a hyper-parameter
analysis in Figure 5.8. The horizontal axis shows the number of iterations and the vertical
axis shows the averaged accuracy on the sampled tasks for all architectures in the pool.
Figure 5.8 shows that accuracy convergence is reached after around 150 iterations, and
running additional iterations only provides marginal accuracy gains. For this reason, I
picked the number of iterations to be 200 for a balance between accuracy and run-time.
In the meantime, I notice that in general a higher pool size gives better adapted accuracy.
However, this does not mean the final searched accuracy is affected to the same degree:
the final re-trained accuracy of searched architectures shows an accuracy gap of 0.21%
between P = 100 and P = 200 and 0.32% between P = 100 and P = 500. An increase in
pool size can prolong the run-time significantly, so I picked a pool size of 100 since it
offers the best balance between accuracy and run-time.
One particular problem in few-shot learning is that models are prone to over-fitting. This
is because only a small number of training samples are available for each task and the
network normally iterates on these samples many times. I explore the
architectural space to help models overcome over-fitting and conduct a case study
Algorithm 3 The adaption algorithm
Input: M, P, C, H_t
A = Init(P)                              ▷ Initialise a set of architectures with a size of P
for i = 0 to M − 1 do
    L_a = ∅
    T_s ∼ p(T_val)                       ▷ Obtain a subset from the validation task set
    for A ∈ A do
        L_t = ∅
        for T ∈ T_s do
            l = L(T, A)                  ▷ Compute loss
            L_t = L_t ∪ {l}
        end for
        L_a = L_a ∪ {mean(L_t)}          ▷ Collect averaged loss values across all tasks
    end for
    A = Mutate(A, L_a, H_t, C)           ▷ Mutate the architectures based on hardware constraints
end for
Figure 5.8: The effect of different pool sizes (P) and different numbers of iterations (M).
for different design options available for the backbone network. I identify the following
key changes to the NAS backbone that help the models achieve high accuracy in few-shot
learning:
• Global Expansion (GE): Allowing the NAS to globally expand or shrink the number
of channels of all layers.
• Adaptive Number of Layers (ANL): Allowing the NAS to use an arbitrary number
of layers; the network is then able to stop early, using only a subset of the layers.
Figure 5.9 further illustrates that GE and ANL allow a much smaller model compared
to existing NAS backbones. The results in Table 5.10 suggest that these changes to a
backbone allow the search space to reach much smaller models and thereby provide
better accuracy. In addition, Table 5.10 also illustrates that 5 × 5 pooling is necessary for
a higher accuracy. I hypothesise this is because a relatively large fully-connected layer
after the pooling is required for the network to achieve good accuracy in this meta-learning
setup.
Table 5.10: A case study of different design options for the NAS backbone network.
Experiments are executed with a model size constraint of 70K on the Mini-ImageNet
5-way 1-shot classification task.
[Figure 5.9 annotations: existing NAS methods have a fixed number of channels for layers at the 'edge' of each searchable block; GE allows these edge layers to shrink or expand.]
Figure 5.9: A graphical illustration of GE and ANL. Both methods will allow a more
drastic change in model capacity, allowing the searched model to deal with tasks of
varying difficulty.
I consider three popular datasets in the few-shot learning community: Omniglot,
Mini-ImageNet and Few-shot CIFAR100. I use the PytorchMeta framework to handle the
datasets [34].
Omniglot is a handwritten character recognition task, containing 1623 characters with 20
samples per character [96]. I use the meta train/validation/test splits used by Vinyals et al. [166].
These splits are over 1028/172/423 classes (characters).
Mini-ImageNet was first introduced by Vinyals et al. [166]. This dataset contains images
of 100 different classes from the ILSVRC-12 dataset [37]; the splits are taken from Ravi
et al. [135].
FC100 was introduced by Oreshkin et al. [127]; the dataset has 100 different classes
from the CIFAR100 dataset [92].
Table 5.11 details the systems and representative devices considered. I use the ScaleSIM
cycle-accurate simulator [139] for the Eyeriss [25] accelerator. Details about this simulation
and more information with respect to the datasets and search configurations are in our
Appendix.
Table 5.12 displays the results of H-Meta-NAS on the Omniglot 20-way 1-shot and 5-shot
classification tasks. I match the size of H-Meta-NAS to MAML and MAML++ for a fair
comparison. H-Meta-NAS outperforms all competing methods apart from the original
MAML++. MAML++ uses a special evaluation strategy: it creates an ensemble of models
with the best validation-set performance, then picks the best model from the
ensemble based on the support-set loss and reports accuracy on the query set. I therefore locally
replicated MAML++ without this trick, and show that H-Meta-NAS outperforms it by
a significant margin (+1.01% on 1-shot and +0.11% on 5-shot) with around half of the
MACs (4.95M compared to 10.07M).
Table 5.13 shows the results of running the 5-way 1-shot and 5-shot Mini-ImageNet
tasks; similar to the previous results, I match the size of searched networks to MAML
and MAML++. Table 5.13 not only displays results for MAML methods with fixed
architectures; it also shows the performance of searched networks, including Auto-Meta
Table 5.12: Results of Omniglot 20-way few-shot classification. We keep two decimal
places for our experiments, and keep the decimal places as they were reported for other cited
work. * reports a MAML replication implemented by Antoniou et al. [8]. Details are
discussed in Section 5.3.3.1.

Method                            Size       MACs      Accuracy (1-shot)    Accuracy (5-shot)
Siamese Nets [91]                 35.96M     1.36G     99.2%                97.0%
Matching Nets [166]               225.91K    20.29M    93.8%                98.5%
Meta-SGD [106]                    419.86K    46.21M    95.93% ± 0.38%       98.97% ± 0.19%
MAML [47]                         113.21K    10.07M    95.8% ± 0.3%         98.9% ± 0.2%
MAML* (Replication from [8])      113.21K    10.07M    91.27% ± 1.07%       98.78%
MAML++* [8]                       113.21K    10.07M    97.65% ± 0.05%       99.33% ± 0.03%
MAML++ (Local Replication)        113.21K    10.07M    96.60% ± 0.28%       99.00% ± 0.07%
H-Meta-NAS                        110.73K    4.95M     97.61% ± 0.03%       99.11% ± 0.09%
Table 5.13: Results of Mini-ImageNet 5-way classification. We use two decimal places
for our experiments, and keep the decimal places of cited work as they were originally
reported. T-NAS uses the complicated DARTS cell [107]; it has a smaller size but a large
MAC usage. Details are discussed in Section 5.3.3.1.

Method                  Size          MACs            Accuracy (1-shot)   Accuracy (5-shot)
Matching Nets [166]     228.23K       200.31M         43.44 ± 0.77%       55.31 ± 0.73%
CompareNets [156]       337.95K       318.38M         50.44 ± 0.82%       65.32 ± 0.70%
MAML [47]               70.09K        57.38M          48.70 ± 1.84%       63.11 ± 0.92%
MAML++ [8]              70.09K        57.38M          52.15 ± 0.26%       68.32 ± 0.44%
Auto-Meta [88]          98.70K        -               51.16 ± 0.17%       69.18 ± 0.14%
BASE (Softmax) [144]    1200K         -               -                   65.4 ± 0.7%
BASE (Gumbel) [144]     1200K         -               -                   66.2 ± 0.7%
T-NAS* [107]            24.3/26.5K    37.96/52.63M    52.84 ± 1.41%       67.88 ± 0.92%
T-NAS++* [107]          24.3/26.5K    37.96/52.63M    54.11 ± 1.35%       69.59 ± 0.85%
H-Meta-NAS              70.28K        24.09M          57.36 ± 1.11%       77.53 ± 0.77%
[88], BASE [144] and T-NAS [107]. H-Meta-NAS shows interesting results when compared
to T-NAS and T-NAS++. H-Meta-NAS has a much higher accuracy (+3.26% in 1-shot
and +7.94% in 5-shot) and a smaller MAC count, but uses a greater number of parameters.
T-NAS and T-NAS++ use DARTS cells [113]. This NAS cell contains a complex routing
of computational blocks, making it unsuitable for latency-critical applications. I will
demonstrate in Section 5.3.3.2 how this design choice gives a worse on-device latency
performance.
I also show how H-Meta-NAS works with FC100. In Table 5.14, I further demonstrate
the effectiveness of the proposed H-Meta-NAS on the FC100 dataset. T-NAS did not
report its model sizes on this task, and our results suggest that H-Meta-NAS achieves
the best accuracy on both the 1-shot and 5-shot setups.
Table 5.14: Results of FC100 5-way few-shot classification. I keep two decimal places
for our experiments, and keep the decimal places of cited work as they were originally
reported.
Method        Size      Accuracy (1-shot)    Accuracy (5-shot)
MAML          70.09K    38.1 ± 1.7%          50.4 ± 1.0%
MAML++        70.09K    38.7 ± 0.4%          52.9 ± 0.4%
T-NAS         -         39.7 ± 1.4%          53.1 ± 1.0%
T-NAS++       -         40.4 ± 1.2%          54.6 ± 0.9%
H-META-NAS    55.52K    43.29 ± 1.22%        56.86 ± 0.76%
In addition to using model size as a constraint for H-Meta-NAS, I use various latency
targets on various hardware platforms as the optimisation target. Figure 5.10 shows how
model sizes and GPU latencies can be used as constraints. T-NAS and T-NAS++ show
better performance on the size-accuracy plot in Figure 5.10a. However, the smaller model
sizes of T-NAS do not provide a better run-time on GPU devices (Figure 5.10b); in fact,
T-NAS-based models have the worst run-time on GPU devices due to the complicated
dependencies of DARTS cells. Figure 5.11 illustrates the performance of H-Meta-NAS on
different CPU devices and an ASIC accelerator. The details of these hardware platforms are described
different CPU devices and an ASIC hardware. The details of these hardware are described
in Table 5.11. In Figure 5.11, H-Meta-NAS shows a better Pareto-frontier performance
compared to a range of baselines and searched models. I only compare to MAML and
MAML++ when running on Eyeriss due to the limitations of the ScaleSIM simulator
[139]. Our results in both Figure 5.10 and Figure 5.11 demonstrate that H-Meta-NAS
consistently generates more efficient models compared to various MAML-based methods.
I mostly follow the experiment setup of MAML++ [8]. In the pre-training stage, I
train for 100 epochs, where each epoch consists of 500 iterations. I also pick 600 tasks to be
validation tasks. In the adaption stage, I randomly sample from the validation set and
pick 16 tasks to build a data slice for the architectures to traverse. In the final re-training
stage of a searched architecture, I follow the strategy used in MAML++ [8]. The detailed
dataset-specific configurations are as follows:
• Omniglot: I randomly split 1200 characters for training, and the rest is used for
testing. The images are augmented with randomised rotation of multiples of 90
degrees.
[Figure 5.10 panels: (a) targeting model sizes, accuracy vs model size (K); (b) targeting a GPU, accuracy vs inference latency on an NVIDIA 2080 Ti GPU (ms). Baselines include MAML, MAML++, Matching Nets, CompareNets, Auto-Meta, T-NAS and T-NAS++.]
Figure 5.10: Applying H-Meta-NAS with model size and GPU latency as optimisation
targets.
[Figure 5.11 panels plot accuracy vs inference latency (ms) for H-Meta-NAS, MAML, MAML++, Matching Nets, CompareNets, T-NAS and T-NAS++: (a) targeting an ASIC accelerator (Eyeriss); (b) targeting a mid-end CPU (Intel i9); (c) targeting a low-end CPU (Raspberry Pi 4B); (d) targeting an IoT device (Raspberry Pi Zero).]
• Mini-ImageNet: All images are down-sampled to 84 × 84.
I use the ScaleSim framework [139] to simulate the Eyeriss [27] accelerator. ScaleSim
is an open-source cycle-accurate CNN simulator. The simulator has certain limitations
with respect to DRAM simulation; it could be extended with an external DRAM
simulator, but this would incur a large run-time. I therefore kept the original setup, in
which the DRAM simulation reports read/write bandwidth requirements; for simplicity, I assume
these DRAM requirements are met. In addition, it is well known that cycle-accurate
simulators are slow to execute. For this reason, I only launched the MAML and
MAML++ networks in the ScaleSim simulator.
Table 5.15 shows how H-Meta-NAS performs with a more complicated NAS backbone.
In previous experiments, I build the NAS on top of a VGG9 backbone since it is the
architecture utilised in the MAML++ algorithm. For the purpose of having a fair
comparison, I did not manually pick a complex NAS backbone. However, I demonstrate,
in this section, that H-Meta-NAS can be applied with a more complicated backbone and
then shows a better final accuracy, as expected. The trained accuracy of networks searched
using the ResNet12 backbone is 7.31% higher than with the original VGG9 backbone.
In addition, I compare the proposed approach with state-of-the-art Metric-based
meta-learning methods [191, 26]. Although using only a single inference pass (our method
does not conduct inference runs on the support set when deployed), H-Meta-NAS shows
competitive results with SOTA Metric-based methods while having a much smaller MAC
usage (around 20× smaller).
Table 5.15: Applying H-Meta-NAS to different NAS backbones and algorithms for the
Mini-ImageNet 5-way 1-shot classification task.
In Table 5.16, I show a comparison between H-Meta-NAS and various NAS schemes in
the many-task many-device setup. Specifically, I consider a scenario with 500 tasks and 10
Table 5.16: Comparing the NAS search complexity with N tasks, H hardware platforms
and C constraints. Search time is estimated for a deployment scenario with 500 tasks and
10 hardware-constraint pairs; estimation details are discussed in the Appendix.
different hardware-constraint paris. In this case, search complexity refers to the amount
of time spent on searching for a new architecture. Our results in Table 5.16 suggest that
H-Meta-NAS is the most efficient search method because of its low search complexity.
Due to the limited computing facilities available, I estimate the search time of DARTS
[113], Once-for-all [18] and T-NAS [107] in the multi-task multi-device setup. I take the
search times reported in the original publications and multiply them by the appropriate
scaling factors. For DARTS, I take the reported search time (4 GPU days = 96 GPU
hours) and multiply it by H × T = 5000. I additionally assume a linear scaling relationship
between search time and the number of input image pixels, so I further multiply the total
search time by (84 × 84)/(32 × 32), which gives a total search time of around 10^6 GPU
hours. I perform the same estimation for Once-for-all [18] and T-NAS [107].
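As a sanity check, this back-of-the-envelope estimate can be reproduced in a few lines of
Python; the numbers are the ones quoted above and the result is only an order-of-magnitude
figure.

```python
# Scaling the reported DARTS search time to the 500-task, 10-hardware-constraint
# deployment scenario considered in Table 5.16.
base_gpu_hours = 4 * 24                   # 4 GPU days reported for DARTS
num_tasks = 500                           # T
num_hw_constraint_pairs = 10              # H
pixel_scaling = (84 * 84) / (32 * 32)     # assumed linear scaling with input pixels

total = base_gpu_hours * num_tasks * num_hw_constraint_pairs * pixel_scaling
print(f"estimated search time: {total:.2e} GPU hours")  # roughly 3e6, i.e. order 10^6
```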
5.4 Summary
In this chapter, in Section 5.2, I propose a novel single-path, one-shot and gradient-
based NAS algorithm named LPGNAS. LPGNAS is able to generate compact quantised
networks with state-of-the-art accuracy. To my knowledge, this is the first piece of work
that systematically studies the joint effect of quantisation and model architectures of
GNNs in a NAS setup. I define the GNN quantisation search space and show how it can
be co-optimised with the original architectural search space. The end results demonstrate
that a co-optimisation between the architectural and quantisation spaces greatly improves
network accuracy. The searched networks show Pareto dominance on the accuracy versus
model size trade-off over all manually designed networks. In addition, I empirically show a
limitation of GNN quantisation using the proposed NAS algorithm: most of the searched
networks converge to a 4-bit weight, 8-bit activation quantisation setup. Despite this
limitation, LPGNAS demonstrates superior performance when compared to networks
generated manually or by other NAS methods on a wide range of datasets.
In Section 5.3 of this chapter, I show H-Meta-NAS, a NAS method focusing on fast
adaptation of not only model weights but also model architectures in a many-task many-
device few-shot learning setup. H-Meta-NAS shows Pareto dominance when compared
to a wide range of MAML baselines and other NAS results. I study the effectiveness of
H-Meta-NAS on a wide variety of hardware systems and constraints, and demonstrate
its superior performance on real hardware devices with a search time that is orders of
magnitude shorter than that of existing NAS methods.
Chapter 6
Conclusion
The deep learning approach offers state-of-the-art performance in many applications and
is now widely deployed on many of today's hardware architectures. As illustrated in
this dissertation, the optimisation space of DNN inference covers both the software and
hardware stacks. I provided a set of compression techniques on the software stack for
lowering the computational and memory requirements of running DNN inference
(Chapter 3). I also demonstrated that specialised hardware can successfully accelerate
DNNs (Chapter 4). Finally, I demonstrated how automated machine learning, or in
particular network architecture search, can be made hardware-aware to find efficient
models for emerging network types and new learning setups (Chapter 5).
I would like to highlight the following interesting observations:
• Software optimisations, such as compression methods, affect the efficiency of
running ML workloads on hardware, but their design space is determined by the
capabilities of the given hardware. At the same time, these compression methods are
also reshaping today's hardware architectures.
The chosen hardware platform decides the applicability of the software optimisations.
For instance, GPUs struggle to realise the theoretical performance gains of fine-
grained pruning due to limitations of the SIMT architecture [49], but bit-serial
DNN accelerators can utilise not only element-wise but also bit-wise sparsities. In
practice, users have to be aware of the underlying hardware system when selecting
software compression methods; this suggests that DNN implementation frameworks
should carefully select optimisation techniques based on the target hardware. In
this dissertation, I demonstrated that dynamic channel pruning is particularly useful
on CPU devices, but cannot reach its theoretical gains on GPUs because of the
low efficiency of scatter and gather operations on GPUs. In addition, the focused
quantisation scheme presented is only applicable on custom hardware. It is also
interesting to note that software-level optimisations for ML workloads are in turn
reshaping today's hardware architectures. For instance, NVIDIA GPUs now have
tensor cores [118] and Intel CPUs [2] have extended instructions to support low-
precision matrix multiplications. Architectures and micro-architectures adapt to
their target markets (or workloads) and evolve continuously in order to sustain
performance scaling.
• Algorithmic changes offer larger performance gains than pure hardware modifications,
and co-design offers multiplying gains. However, all of these modifications might show
diminishing returns after a while if we try to specialise further.
From the experiments conducted in this dissertation, I observe that huge performance
gains are possible from purely software optimisation; in general, these gains outweigh
those possible from hardware optimisations alone, but co-design often provides the
best performance.
This is an interesting observation for hardware designers: the overall design of DNN
accelerators should be software-oriented and co-design focused.
In addition, the possible returns from both hardware and software optimisations
might have a limit. Previous research in the field of network compression shows
great compression rates, but many of the proposed methods optimise over overlapping
design spaces. In other words, these methods are not orthogonal to each other. For
instance, the proposed dynamic channel pruning shows a compression ratio of around
3× for VGG16 but only 1.98× for ResNet18 with no accuracy loss; this implies that
pruning is not totally orthogonal to architectural engineering. The latter network
apparently has a more carefully designed, compact architecture, so it benefits less
from network pruning.
• Reconfigurable hardware enlarges the design space and provides significant perfor-
mance gains.
In Chapter 4, I utilised a flexible hardware design consisting of many small cores.
This design in turn provides a larger quantisation optimisation space. I think this
offers an interesting trade-off from the co-design point of view: complex hardware
not only provides a larger design space but also enables a more involved exploration of
the software stack.
• Is there a general AI accelerator? New network types and structures can be difficult
to process efficiently on DNN accelerators that are specialised for a narrower range
of models.
There was a long-held belief in the community that we can build a 'general AI
accelerator' and that running AI applications is nothing more than accelerating dense
matrix multiplications. I would argue that this belief might have been true in the
CNN era, but it is less likely to hold now that we face complex models such as
Transformers and GNNs. The diverse structures of DNNs make building a universal
accelerator almost impossible. Specialised accelerators for CNNs cannot exploit the
sparsities that exist in GNNs or NLP models. This suggests that hardware
reconfigurability is an interesting alternative for accelerating a diverse set of models,
although FPGA-like reconfigurability comes at a very high cost. New devices or
technologies, such as memristor-based accelerators [187], might offer the potential
for very low-cost reconfigurability.
• NAS algorithms benefit from search strategies that are specialised for particular
use-cases and applications.
Although Network Architecture Search algorithms show promising performance in
many domains, they normally require special treatment to fit different application
scenarios. Chapter 5 presents two NAS algorithms, namely LPGNAS and H-Meta-
NAS. The search over routing strategies is crucial for LPGNAS, and the proposed
Global Expansion and Adaptive Number of Layers are key for H-Meta-NAS to achieve
high accuracy on few-shot learning tasks. These application-specific modifications are
necessary for the NAS techniques to achieve high accuracy.
transformations to sweep a much larger hardware design space than only the number
of PEs or number of multipliers. These hardware transformations can then be jointly
searched with architectural choices to offer a much larger exploration in the co-design
search space.
Another interesting challenge when performing HW/SW co-design optimisations stems
from the need to perform retraining. The optimisation covers a range of lossy techniques;
these methods reduce run-time costs but might induce accuracy degradation. This
phenomenon makes the optimisation problem drastically different from a classic
software compilation flow. In a standard software compiler, such as LLVM [98], there
are various methods to guarantee that the final execution result is unchanged after a
series of complex code transformations. I realised that it is hard to fulfil this guarantee
when optimising DNN inference, unless re-training is allowed in the optimisation flow.
However, re-training DNNs then introduces additional problems, such as prolonged
optimisation time.
Another fundamental challenge in this co-design is the lack of an ability to measure
hardware performance in a quick and accurate manner. We have seen LLVM use local
profiling information to help the tool further optimise code snippets [98]. Ideally, if
we could quickly profile hardware platforms, there would be a chance to build hardware-
aware optimisations into this co-design loop. However, this performance estimation is
highly data-dependent if we consider different co-design targets. For instance, the
optimisations applicable to different sets of neural networks can be vastly different. In
addition, it is well known that precise hardware simulation can be extremely slow, while
coarse-grained power emulation can be very inaccurate.
finds the sweet spot for designing the ML accelerator by sweeping across Google's
inference workloads; the most dominant neural network types then drive the entire
architecture design.
All of these above-mentioned approaches are possible methods for building a general-
purpose AI accelerator. The differences in energy profiles, costs, design complexities,
etc. will then all affect the cost-effectiveness of these solutions.
Recent trends in AI accelerators, however, are going in the opposite direction to building
general-purpose AI cores. First, as illustrated in Figure 2.9, cores designed for different
applications (e.g. embedded systems or cloud systems) exhibit a large performance gap.
Since the power budget differs across applications, the micro-architectures of these cores
are now vastly different. Second, DNN inference and training have totally different
characteristics, and the industry has started to produce individual designs for these two
workloads. For instance, TPU-v4i only supports DNN inference, whereas TPU-v3 was
able to run both training and inference [85].
It seems that the industry is choosing the path of building different accelerators for
different DNN applications, and it might start to bundle them at the system level
depending on the task demands. These cores can either be integrated into an SoC, for
example for autonomous driving, or connected using fast communication links (fast
rack-level switches or InfiniBand) in data centers.
on these platforms, the optimisation targets and possible transformations are different
from those in traditional GPU programming. For instance, supporting dynamic operators
in a computational graph might be challenging for these streaming-based hardware
accelerators: do programmers now face more restrictions in their high-level languages, or
can compiler engineers invent smart ways of converting these dynamic operations to
static ones? It will be interesting to see how future research in ML compilers can equip
these hardware accelerators with more capabilities.
Appendix A
Figure A.1 shows how the skipping probability heat maps of the convolutional layer Conv4
evolve as we train MCifarNet. The network was trained for 12 epochs, and I saved the
model at every epoch. The heat maps are generated from the saved models in sequence,
where I apply the same reordering to all heat-map channels, using the sorted result from
the first epoch. It can be observed that, as we train the network, the channel skipping
probabilities become more pronounced.
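A minimal sketch of how such heat maps can be produced is shown below; the skipping-
probability tensor is randomly generated here as a stand-in for the values extracted from
the 12 saved checkpoints, and the channel count of 64 is assumed purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: (epochs, Cifar10 categories, Conv4 channels).
probs = np.random.rand(12, 10, 64)

# Sort channels once using the first epoch and reuse that ordering everywhere,
# so the evolution across epochs is easy to compare visually.
order = np.argsort(probs[0].mean(axis=0))

fig, axes = plt.subplots(1, 12, figsize=(24, 2), sharey=True)
for epoch, ax in enumerate(axes):
    ax.imshow(probs[epoch][:, order], aspect="auto", vmin=0.0, vmax=1.0)
    ax.set_title(f"epoch {epoch + 1}", fontsize=8)
axes[0].set_ylabel("Cifar10 category")
axes[len(axes) // 2].set_xlabel("channels (ordering fixed from epoch 1)")
fig.savefig("conv4_skipping_history.png", bbox_inches="tight")
```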
Figure A.1: The training history of the convolutional layer Conv4 in MCifarNet. The history
is visualised by the 12 skipping probability heat maps, where the heights denote the 10
categories in Cifar10, and the channels in Conv4 occupy the width.
Appendix B
In this section, we describe in detail the datasets we used. In our experiments, we use the
dataset preprocessing and loading implementations from PyTorch Geometric [45].
Citation Dataset: The Citation dataset [184] is a standard benchmark for graph
learning. It comprises three publication datasets: Cora, CiteSeer and PubMed. There is
an additional extended version of Cora called Cora-Full [13]; Cora contains all Machine
Learning papers in the Cora-Full graph. In these datasets, nodes correspond to
bag-of-words features of the documents and edges indicate citations. Each node has a
class label. We use a 6:2:2 train/validation/test split ratio.
Amazon Dataset: The Amazon datasets [146], including Amazon Photos and Amazon
Computers, are segments of the Amazon co-purchase graph. Nodes represent goods, while
edges represent frequent co-purchases of linked goods. The node features are bag-of-words
features of product reviews, and the node label is the product category. We use a 6:2:2
train/validation/test split ratio.
Flickr and Yelp Dataset: The Flickr and Yelp datasets were introduced along with
GraphSAINT [189]. In the Flickr dataset, nodes represent images uploaded to Flickr, and
edges represent the sharing of common properties (e.g. location or comments by the same
user). The node features are 500-dimensional bag-of-words representations based on SIFT
descriptors, and the node label is the image class. In the Yelp dataset, nodes represent
users and edges represent friendship between users. The node features are summed
word2vec embeddings of the words in the user's reviews, and the node label is a multi-hot
vector representing which types of businesses the user has reviewed. For both datasets we
use a 6:2:2 train/validation/test split ratio.
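As an illustration, the sketch below loads Cora through PyTorch Geometric [45] and builds
a random 6:2:2 node split; the root path and the splitting code are my own assumptions
rather than the exact implementation used in the experiments. The Amazon, Flickr and
Yelp graphs can be loaded analogously through torch_geometric.datasets.

```python
import torch
from torch_geometric.datasets import Planetoid

# Load Cora with PyTorch Geometric's built-in preprocessing.
dataset = Planetoid(root="data/Cora", name="Cora")
data = dataset[0]

# Random 6:2:2 train/validation/test node split.
num_nodes = data.num_nodes
perm = torch.randperm(num_nodes)
n_train, n_val = int(0.6 * num_nodes), int(0.2 * num_nodes)

data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.val_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.test_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.train_mask[perm[:n_train]] = True
data.val_mask[perm[n_train:n_train + n_val]] = True
data.test_mask[perm[n_train + n_val:]] = True
```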
For producing the quantised baseline networks, we manually grid searched the options
listed in Table 5.3, following the order from bottom to top. We stop the search and revert
to the previous quantisation stage if the current stage shows an accuracy drop of more
than 0.5%. It is worth mentioning that this grid search for quantisation is very
time-consuming; we therefore only performed it on Cora and used the found quantisation
strategy for the rest of the Citation datasets.
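The procedure can be summarised by the following sketch, where quantisation_stages
and train_and_evaluate are hypothetical placeholders for the options of Table 5.3 and
the retraining routine respectively.

```python
MAX_ACC_DROP = 0.005  # revert once the accuracy drop exceeds 0.5%

def grid_search_quantisation(quantisation_stages, train_and_evaluate, fp32_accuracy):
    """Walk the quantisation options from bottom to top (as in Table 5.3) and
    return the last stage whose accuracy drop stays within the threshold."""
    chosen = None
    for stage in quantisation_stages:
        accuracy = train_and_evaluate(stage)
        if fp32_accuracy - accuracy > MAX_ACC_DROP:
            break          # keep the previously accepted stage
        chosen = stage
    return chosen
```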
Table B.1 shows the configurations of the baseline networks used in this work.
Appendix C
The two tables below show the NAS backbones of H-Meta-NAS. The ResNet-based NAS
backbone is clearly significantly more complicated. The kernel size search space is {1, 3, 5}.
The channel expansion search space is {0.25, 0.5, 0.75, 1, 1.5, 2, 2.25} for the VGG-based
NAS backbone but {0.25, 0.5, 1, 1.5, 1.75, 2} for the ResNet-based backbone. The search
space is modified because GPU RAM limitations do not allow an expansion factor of 2.25
on the ResNet-based backbone. The activation search space is {relu, elu, selu, sigmoid,
relu6, leakyrelu}.
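For reference, the search spaces above can be written down compactly as follows; the
dictionary keys are hypothetical names chosen for illustration.

```python
H_META_NAS_SEARCH_SPACE = {
    "kernel_size": [1, 3, 5],
    "channel_expansion": {
        "vgg9":     [0.25, 0.5, 0.75, 1, 1.5, 2, 2.25],
        # 2.25 is dropped for ResNet12 due to GPU RAM limitations.
        "resnet12": [0.25, 0.5, 1, 1.5, 1.75, 2],
    },
    "activation": ["relu", "elu", "selu", "sigmoid", "relu6", "leakyrelu"],
}
```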
Table C.2: Details of the ResNet12 NAS backbone
Bibliography
[4] Mohamed S Abdelfattah, Lukasz Dudziak, Thomas Chau, Royson Lee, Hyeji Kim,
and Nicholas D Lane. Codesign-nas: Automatic FPGA/CNN codesign using neu-
ral architecture search. In Proceedings of the 2020 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, pages 315–315, 2020.
[5] Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ri-
cardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B Milde, Federico Corradi,
Alejandro Linares-Barranco, Shih-Chii Liu, et al. Nullhop: A flexible convolutional
neural network accelerator based on sparse representations of feature maps. IEEE
transactions on neural networks and learning systems, 30(3):644–656, 2018.
[6] Jose M Alvarez and Mathieu Salzmann. Learning the number of neurons in deep
networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, Advances in Neural Information Processing Systems (NIPS), pages 2270–
2278. 2016.
[7] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An archi-
tecture for ultralow power binary-weight CNN acceleration. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 37(1):48–60, 2018.
[8] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your maml.
International Conference on Learning Representations (ICLR), 2019.
[9] Mohamed Arafa, Bahaa Fahim, Sailesh Kottapalli, Akhilesh Kumar, Lily P Looi,
Sreenivas Mandava, Andy Rudoff, Ian M Steiner, Bob Valentine, Geetha Vedaraman,
et al. Cascade lake: Next generation intel xeon scalable processor. IEEE Micro, 39
(2):29–36, 2019.
[10] Sercan O Arık and Tomas Pfister. Tabnet: Attentive interpretable tabular learning.
In AAAI, volume 35, pages 6679–6687, 2021.
[11] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.
[12] Lin Bai, Yiming Zhao, and Xinming Huang. A CNN accelerator on FPGA using
depthwise separable convolution. IEEE Transactions on Circuits and Systems II:
Express Briefs, 2018.
[14] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot
model architecture search through hypernetworks. International Conference on
Learning Representations (ICLR), 2018.
[15] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. Advances in Neural Information Processing
Systems (NeurIPS), 2020.
[16] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level
network transformation for efficient architecture search. In International Conference
on Machine Learning, pages 678–687. PMLR, 2018.
[17] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search
on target task and hardware. International Conference on Learning Representations
(ICLR), 2019.
[18] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all:
Train one network and specialize it for efficient deployment. International Conference
on Learning Representations (ICLR), 2020.
[19] Pierre-Luc Carrier, Aaron Courville, Ian J Goodfellow, Medhi Mirza, and Yoshua
Bengio. FER-2013 face database. Technical report, 2013.
[20] Francesco Paolo Casale, Jonathan Gordon, and Nicolo Fusi. Probabilistic neural
architecture search. arXiv preprint arXiv:1902.05116, 2019.
[21] Stephen Cass. Taking ai to the edge: Google’s tpu now comes in a maker-friendly
package. IEEE Spectrum, 56(5):16–17, 2019.
[22] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional
networks with variance reduction. International Conference on Machine Learning,
2018.
[23] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan
Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Tvm:
End-to-end optimization stack for deep learning. Proceedings of the 13th USENIX
Conference on Operating Systems Design and Implementation, 2018.
[24] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen,
and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for
ubiquitous machine-learning. ACM Sigplan Notices, 49(4):269–284, 2014.
[26] Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-
baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 9062–9071, 2021.
[27] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural networks. IEEE
Journal of Solid-State Circuits, 52(1):127–138, 2017.
[28] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network
compression. IEEE Journal of Selected Topics in Signal Processing, 14(4):715–726,
2020.
[29] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield,
Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al.
Accelerating persistent neural networks at datacenter scale. In Hot Chips, volume 29,
2017.
[30] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate
deep network learning by exponential linear units (elus). International Conference
on Learning Representations (ICLR), 2015.
[31] Francesco Conti, Pasquale Davide Schiavone, and Luca Benini. Xnor neural engine:
A hardware accelerator ip for 21.6-fj/op binary neural network inference. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11):
2940–2951, 2018.
[32] Matthieu Courbariaux, Yoshua Bengio, et al. Training deep neural networks with low
precision multiplications. In International Conference on Learning Representations,
2015.
[33] Jiequan Cui, Pengguang Chen, Ruiyu Li, Shu Liu, Xiaoyong Shen, and Jiaya Jia.
Fast and practical neural architecture search. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 6509–6518, 2019.
[34] Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, and Yoshua Ben-
gio. Torchmeta: A Meta-Learning library for PyTorch, 2019. URL https://arxiv.
org/abs/1909.06576. Available at: https://github.com/tristandeleu/pytorch-meta.
[35] Alberto Delmas Lascorz, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa
Mahmoud, Sayeh Sharify, Milos Nikolic, Kevin Siu, and Andreas Moshovos. Bit-
tactical: A software/hardware approach to exploiting value and bit sparsity in
neural networks. In Proceedings of the Twenty-Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
749–763, 2019.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet:
A large-scale hierarchical image database. In 2009 IEEE conference on computer
vision and pattern recognition, pages 248–255. Ieee, 2009.
[38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. Proceedings
of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, NAACL-HLT 2019,
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2018.
[39] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more
complicated network with less inference complexity. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), July 2017.
[40] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold,
Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image
recognition at scale. International Conference on Learning Representations (ICLR),
2021.
[41] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing
Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer
to the sensor. In Proceedings of the 42nd Annual International Symposium on
Computer Architecture, pages 92–104, 2015.
[42] Abhimanyu Dubey, Moitreya Chatterjee, and Narendra Ahuja. Coreset-based neural
network compression. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 454–470, 2018.
[43] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and
Doug Burger. Dark silicon and the end of multicore scaling. In 2011 38th Annual
international symposium on computer architecture (ISCA), pages 365–376. IEEE,
2011.
[44] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold
hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
[45] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch
Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds,
2019.
[46] Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet Kohli. Per-
foratedCNNs: Acceleration through elimination of redundant convolutions. In
Advances in Neural Information Processing Systems (NIPS), pages 947–955, 2016.
[47] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for
fast adaptation of deep networks. In International Conference on Machine Learning,
pages 1126–1135. PMLR, 2017.
[48] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming
Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi,
et al. A configurable cloud-scale DNN processor for real-time ai. In 45th Annual
International Symposium on Computer Architecture, 2018.
[49] Wilson WL Fung and Tor M Aamodt. Thread block compaction for efficient simt
control flow. In 2011 IEEE 17th International Symposium on High Performance
Computer Architecture, pages 25–36. IEEE, 2011.
[50] Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dy-
namic channel pruning: Feature boosting and suppression. International Conference
on Learning Representations (ICLR), 2019.
[51] Yang Gao, Hong Yang, Peng Zhang, Chuan Zhou, and Yue Hu. GraphNAS:
Graph neural architecture search with reinforcement learning. arXiv preprint
arXiv:1904.09981, 2019.
[52] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without
forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4367–4375, 2018.
[53] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E
Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR.
org, 2017.
[54] Jianzhu Guo, Xiangyu Zhu, Chenxu Zhao, Dong Cao, Zhen Lei, and Stan Z. Li.
Learning meta face recognition in unseen domains. CoRR, abs/2003.07733, 2020.
URL https://arxiv.org/abs/2003.07733.
[55] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and
Huazhong Yang. Angel-Eye: A complete design flow for mapping CNN onto cus-
tomized hardware. In IEEE Computer Society Annual Symposium on VLSI, 2016.
[56] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song
Han, Yu Wang, and Huazhong Yang. Angel-eye: A complete design flow for mapping
CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 37(1):35–47, 2018.
[57] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient
DNNs. In Advances in Neural Information Processing Systems, 2016.
[58] Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou
Huang. Nat: Neural architecture transformer for accurate and compact architectures.
Advances in Neural Information Processing Systems, 32, 2019.
[59] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and
Jian Sun. Single path one-shot neural architecture search with uniform sampling.
European Conference on Computer Vision, 2020.
[60] Linley Gwennap. Groq rocks neural networks. Microprocessor Report, Tech. Rep.,
jan, 2020.
[61] Philipp Gysel et al. Ristretto: A framework for empirical study of resource-efficient
inference in convolutional neural networks. IEEE Transactions on Neural Networks
and Learning Systems, 2018.
[62] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning
on large graphs. In Advances in Neural Information Processing Systems, pages
1024–1034, 2017.
[63] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and
William J Dally. Eie: efficient inference engine on compressed deep neural network.
In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International
Symposium on, pages 243–254. IEEE, 2016.
[64] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing
deep neural networks with pruning, trained quantization and Huffman coding.
International Conference on Learning Representations (ICLR), 2016.
[65] Babak Hassibi, David G. Stork, and Gregory Wolff. Optimal brain surgeon: Exten-
sions and performance comparisons. In J. D. Cowan, G. Tesauro, and J. Alspector,
editors, Advances in Neural Information Processing Systems (NIPS), pages 263–270.
1994.
[66] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.
[67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in
deep residual networks. In European conference on computer vision, pages 630–645.
Springer, 2016.
[68] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art.
Knowledge-Based Systems, 212:106622, 2021.
[69] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter
pruning for accelerating deep convolutional neural networks. In International Joint
Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.
[70] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very
deep neural networks. IEEE International Conference on Computer Vision (ICCV),
pages 1398–1406, 2017.
[71] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML
for model compression and acceleration on mobile devices. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 784–800, 2018.
[72] Geoffrey Hinton, Li Deng, et al. Deep neural networks for acoustic modeling in
speech recognition: The shared views of four research groups. IEEE Signal Processing
Magazine, 2012.
[74] Jui-Yang Hsu, Yuan-Jui Chen, and Hung-yi Lee. Meta learning for end-to-end low-
resource speech recognition. In ICASSP 2020-2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 7844–7848. IEEE, 2020.
[75] Weizhe Hua, Christopher De Sa, Zhiru Zhang, and G. Edward Suh. Channel
gating neural networks. Advances in Neural Information Processing Systems, pages
1886–1896, 2019.
[76] Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, and G Edward Suh.
Boosting the performance of CNN accelerators with dynamic fine-grained channel
gating. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on
Microarchitecture, pages 139–150, 2019.
[77] Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, and G Edward Suh.
Channel gating neural networks. In Advances in Neural Information Processing
Systems, pages 1886–1896, 2019.
[78] Qiangui Huang, Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune
filters in convolutional neural networks. In IEEE Winter Conference on Computer
Vision. 2018.
[79] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio. Binarized neural networks. In Advances in neural information processing
systems, pages 4107–4115, 2016.
[80] Itay Hubara, Matthieu Courbariaux, et al. Binarized neural networks. In Advances
in Neural Information Processing Systems. 2016.
[81] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio. Quantized neural networks: Training neural networks with low precision
weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898,
2017.
[82] Kyuyeon Hwang and Wonyong Sung. Fixed-point feedforward deep neural network
design using weights +1, 0, and −1. In Signal Processing Systems, 2014.
[83] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International Conference on Machine
Learning, pages 448–456, 2015.
[84] Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. Dissecting the
graphcore ipu architecture via microbenchmarking. arXiv preprint arXiv:1912.03413,
2019.
[85] Norman P Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B
Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. Ten
lessons from three generations shaped Google's TPUv4i. In 2021 ACM/IEEE 48th
Annual International Symposium on Computer Architecture (ISCA), 2021.
[86] Ilyes Kacher, Maxime Portaz, Hicham Randrianarivo, and Sylvain Peyronnet. Graph-
core c2 card performance for image-based deep learning application: A report. arXiv
preprint arXiv:2002.11670, 2020.
[87] Kari Kalliojarvi and Jaakko Astola. Roundoff errors in block-floating-point systems.
IEEE transactions on signal processing, 44(4):783–790, 1996.
[88] Jaehong Kim, Sangyeul Lee, Sungwan Kim, Moonsu Cha, Jung Kwon Lee, Youngduck
Choi, Yongseok Choi, Dong-Yeon Cho, and Jiwon Kim. Auto-meta: Automated
gradient based meta learner search. arXiv preprint arXiv:1806.06927, 2018.
[89] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convo-
lutional networks. International Conference on Learning Representations (ICLR),
2016.
[90] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-
normalizing neural networks. Advances in neural information processing systems, 30,
2017.
[91] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks
for one-shot image recognition. In ICML deep learning workshop, volume 2. Lille,
2015.
[92] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical
report, 2009.
[93] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems 25. 2012.
[94] Alex Krizhevsky et al. The CIFAR-10 dataset.
http://www.cs.toronto.edu/kriz/cifar.html, 2014.
[95] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs.
IEEE Transactions on computer-aided design of integrated circuits and systems, 26
(2):203–215, 2007.
[97] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. Albert: A lite bert for self-supervised learning of language
representations. International Conference on Learning Representations (ICLR),
2020.
[98] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program
analysis & transformation. In International Symposium on Code Generation and
Optimization, 2004. CGO 2004., pages 75–86. IEEE, 2004.
[99] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques
Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zi-
nenko. Mlir: A compiler infrastructure for the end of moore’s law. arXiv preprint
arXiv:2002.11054, 2020.
[100] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances
in Neural Information Processing Systems (NIPS), pages 598–605. 1990.
[101] Edward H. Lee, Daisuke Miyashita, et al. Lognet: Energy-efficient neural networks
using logarithmic computation. In IEEE International Conference on Acoustics,
Speech and Signal Processing, 2017.
[102] Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low
bit neural network: Squeeze the last bit out with ADMM. In Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[104] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint
arXiv:1605.04711, 2016.
[105] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning
filters for efficient convnets. 2017.
[106] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn
quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
[107] Dongze Lian, Yin Zheng, Yintao Xu, Yanxiong Lu, Leyu Lin, Peilin Zhao, Junzhou
Huang, and Shenghua Gao. Towards fast adaptation of neural architectures with
meta learning. In International Conference on Learning Representations, 2019.
[108] Heng Liao, Jiajin Tu, Jing Xia, and Xiping Zhou. Davinci: A scalable architecture
for neural network computing. In 2019 IEEE Hot Chips 31 Symposium (HCS), pages
1–44. IEEE Computer Society, 2019.
[109] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization
of deep convolutional networks. In International conference on machine learning,
pages 2849–2858, 2016.
[110] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural
network. In Advances in neural information processing systems, pages 345–353, 2017.
[111] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity
and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI
conference on artificial intelligence, 2015.
[112] Baoyuan Liu, Min Wang, et al. Sparse convolutional neural networks. In IEEE
Conference on Computer Vision and Pattern Recognition, 2015.
[113] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture
search. International Conference on Learning Representations (ICLR), 2019.
[114] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui
Zhang. Learning efficient convolutional networks through network slimming. In
International Conference on Computer Vision (ICCV), 2017.
[115] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method
for deep neural network compression. In Proceedings of the IEEE International
Conference on Computer Vision, pages 5058–5066, 2017.
[116] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method
for deep neural network compression. In ICCV, pages 5058–5066, 2017.
[117] Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, and Sarma Vrudhula. Scalable and
modularized rtl compilation of convolutional neural networks onto FPGA. In Field
Programmable Logic and Applications (FPL), 2016 26th International Conference
on, pages 1–8. IEEE, 2016.
[118] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S
Vetter. Nvidia tensor core programmability, performance & precision. In 2018 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW),
pages 522–531. IEEE, 2018.
[119] Moritz B Milde, Daniel Neil, et al. ADaPTION: Toolbox and benchmark for
training convolutional neural networks with reduced numerical precision weights and
activation. CoRR, abs/1711.04713, 2017.
[121] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness,
Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg
Ostrovski, et al. Human-level control through deep reinforcement learning. nature,
518(7540):529–533, 2015.
[122] Duncan Moss, Srivatsan Krishnan, Eriko Nurvitadhi, Piotr Ratuszniak, Chris John-
son, Jaewoong Sim, Asit Mishra, Debbie Marr, Suchit Subhaschandra, and Philip
H. W. Leong. A customizable matrix multiplication framework for the Intel HARPv2
Xeon + FPGA platform. In ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, 2018.
[123] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference
on Machine Learning, pages 2554–2563. PMLR, 2017.
[125] Milos Nikolic, Mostafa Mahmoud, Andreas Moshovos, Yiren Zhao, and Robert
Mullins. Characterizing sources of ineffectual computations in deep learning networks.
In 2019 IEEE International Symposium on Performance Analysis of Systems and
Software (ISPASS), 2019.
[126] Miloš Nikolić, Mostafa Mahmoud, Andreas Moshovos, Yiren Zhao, and Robert
Mullins. Characterizing sources of ineffectual computations in deep learning networks.
In 2019 IEEE International Symposium on Performance Analysis of Systems and
Software (ISPASS), pages 165–176. IEEE, 2019.
[127] Boris N Oreshkin, Pau Rodriguez, and Alexandre Lacoste. Tadam: Task dependent
adaptive metric for improved few-shot learning. Advances in neural information
processing systems, 31, 2018.
[128] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan
Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally.
SCNN: An accelerator for compressed-sparse convolutional neural networks. In ACM
SIGARCH Computer Architecture News, volume 45, pages 27–40. ACM, 2017.
[129] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch:
An imperative style, high-performance deep learning library. In Advances in neural
information processing systems, pages 8026–8037, 2019.
[130] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural
architecture search via parameters sharing. In International Conference on Machine
Learning, pages 4095–4104. PMLR, 2018.
[131] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distil-
lation and quantization. In International Conference on Learning Representations,
2018.
[132] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu,
Tianqi Tang, Ningyi Xu, and Sen Song. Going deeper with embedded FPGA platform
for convolutional neural network. In ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, 2016.
[133] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):
9, 2019.
[134] Atul Rahman, Jongeun Lee, and Kiyoung Choi. Efficient FPGA acceleration of
convolutional neural networks using logical-3d compute array. In Design, Automation
& Test in Europe Conference & Exhibition (DATE), 2016, pages 1393–1398. IEEE,
2016.
[135] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning.
International Conference on Learning Representations (ICLR), 2017.
[136] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution
for image classifier architecture search. In Proceedings of the aaai conference on
artificial intelligence, volume 33, pages 4780–4789, 2019.
[137] Lakshminarayanan Renganarayanan, Daegon Kim, Michelle Mills Strout, and San-
jay Rajopadhye. Parameterized loop tiling. ACM Transactions on Programming
Languages and Systems (TOPLAS), 34(1):1–41, 2012.
[138] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep
reinforcement learning framework for autonomous driving. Electronic Imaging, 2017
(19):70–76, 2017.
[139] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar
Krishna. Scale-sim: Systolic CNN accelerator simulator. arXiv preprint
arXiv:1811.02883, 2018.
[140] Robert R Schaller. Moore’s law: past, present and future. IEEE spectrum, 34(6):
52–59, 1997.
[141] Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre,
Tim Green, Chongli Qin, Augustin Žı́dek, Alexander WR Nelson, Alex Bridgland,
et al. Improved protein structure prediction using potentials from deep learning.
Nature, 577(7792):706–710, 2020.
[142] Sayeh Sharify, Alberto Delmas Lascorz, Mostafa Mahmoud, Milos Nikolic, Kevin
Siu, Dylan Malone Stuart, Zissis Poulos, and Andreas Moshovos. Laconic deep
learning inference acceleration. In Proceedings of the 46th International Symposium
on Computer Architecture, ISCA ’19, 2019.
[143] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas
Chandra, and Hadi Esmaeilzadeh. Bit fusion: Bit-level dynamically composable
architecture for accelerating deep neural networks. In Proceedings of the 45th Annual
International Symposium on Computer Architecture, 2018.
[144] Albert Shaw, Wei Wei, Weiyang Liu, Le Song, and Bo Dai. Meta architecture search.
arXiv preprint arXiv:1812.09584, 2018.
[145] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey
Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated
mixture-of-experts layer. International Conference on Learning Representations
(ICLR), 2017.
[147] Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, and Chunyuan
Zhang. Towards a uniform template-based architecture for accelerating 2D and 3D
CNNs on FPGA. In ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, 2018.
[148] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing CNN acceler-
ator efficiency through resource partitioning. In 2017 ACM/IEEE 44th Annual
International Symposium on Computer Architecture (ISCA), 2017.
[149] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph
convolutional networks for skeleton-based action recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 12026–12035,
2019.
[150] Sungho Shin, Kyuyeon Hwang, and Wonyong Sung. Fixed-point performance analysis
of recurrent neural networks. In 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2016.
[151] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556, 2014.
[152] David So, Quoc Le, and Chen Liang. The evolved transformer. In International
Conference on Machine Learning, pages 5877–5886. PMLR, 2019.
[153] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15(1):1929–1958, 2014.
[154] Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-
Ruiz, Nina M Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar
Bloom-Ackermann, et al. A deep learning approach to antibiotic discovery. Cell,
180(4):688–702, 2020.
[155] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma
Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized OpenCL-based FPGA
accelerator for large-scale convolutional neural networks. In Proceedings of the 2016
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[156] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M
Hospedales. Learning to compare: Relation network for few-shot learning. In
Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1199–1208, 2018.
[157] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew
Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for
mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2820–2828, 2019.
[158] Frederick Tung, Greg Mori, and Simon Fraser. CLIP-Q: Deep network compres-
sion learning by in-parallel pruning-quantization. 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 7873–7882, 2018.
[159] Yash Ukidave, David Kaeli, Umesh Gupta, and Kurt Keville. Performance of
the nvidia jetson tk1 in hpc. In 2015 IEEE International Conference on Cluster
Computing, pages 533–534. IEEE, 2015.
[160] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization:
The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[162] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30:5998–6008, 2017.
[163] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro
Liò, and Yoshua Bengio. Graph Attention Networks. International Conference
on Learning Representations, 2018. URL https://openreview.net/forum?id=
rJXMpikCZ.
[165] Thomas Verelst and Tinne Tuytelaars. Dynamic convolutions: Exploiting spatial
sparsity for faster inference. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 2320–2329, 2020.
[166] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan
Wierstra. Matching networks for one shot learning. Advances in Neural Information
Processing Systems, 2016.
[167] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S
Yu. Heterogeneous graph attention network. In The World Wide Web Conference,
pages 2022–2032, 2019.
[168] Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S
Huang. Revisiting dilated convolution: A simple approach for weakly-and semi-
supervised semantic segmentation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 7268–7277, 2018.
[169] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured
sparsity in deep neural networks. In Advances in Neural Information Processing
Systems (NIPS), pages 2074–2082. 2016.
[170] Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Co-
ordinating filters for faster deep neural networks. In Proceedings of the IEEE
International Conference on Computer Vision, pages 658–666, 2017.
[171] Darrell Whitley. A genetic algorithm tutorial. Statistics and computing, 4(2):65–85,
1994.
[173] Felix Winterstein, Samuel Bayliss, and George A Constantinides. High-level synthesis
of dynamic data structures: A case study using vivado hls. In 2013 International
Conference on Field-Programmable Technology (FPT), pages 362–365. IEEE, 2013.
[174] Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and
Kurt Keutzer. Mixed precision quantization of convnets via differentiable neural
architecture search. arXiv preprint arXiv:1812.00090, 2018.
[175] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu,
Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-
aware efficient convnet design via differentiable neural architecture search. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 10734–10742, 2019.
[176] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset
for benchmarking machine learning algorithms. arXiv preprint, 2017.
[177] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural
architecture search. International Conference on Learning Representations (ICLR),
2019.
[178] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified
activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[179] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi,
and Stefanie Jegelka. Representation learning on graphs with jumping knowledge
networks. In International Conference on Machine Learning, pages 5449–5458, 2018.
[180] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are
graph neural networks? In International Conference on Learning Representations,
2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
[181] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai
Xiong. Latency-aware differentiable neural architecture search. arXiv preprint
arXiv:2001.06392, 2020.
[182] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivi-
enne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation
for mobile applications. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 285–300, 2018.
[183] Yongxin Yang, Irene Garcia Morillo, and Timothy M Hospedales. Deep neural
decision trees. arXiv preprint arXiv:1806.06988, 2018.
[184] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-
supervised learning with graph embeddings. In Proceedings of the 33rd International
Conference on International Conference on Machine Learning-Volume 48, pages
40–48. JMLR. org, 2016.
[185] Jianbo Ye, Xin Lu, Zhe L. Lin, and James Z. Wang. Rethinking the smaller-norm-
less-informative assumption in channel pruning of convolution layers. In International
Conference on Learning Representations (ICLR), 2018.
[186] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure
Leskovec. Hierarchical graph representation learning with differentiable pooling. In
Advances in neural information processing systems, pages 4800–4810, 2018.
[187] Geng Yuan, Xiaolong Ma, Caiwen Ding, Sheng Lin, Tianyun Zhang, Zeinab S Jalali,
Yilong Zhao, Li Jiang, Sucheta Soundarajan, and Yanzhi Wang. An ultra-efficient
memristor-based dnn framework with structured weight pruning and quantization us-
ing admm. In 2019 IEEE/ACM International Symposium on Low Power Electronics
and Design (ISLPED), pages 1–6. IEEE, 2019.
[188] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. Proceedings of
the British Machine Vision Conference (BMVC), 2016.
[189] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Vik-
tor Prasanna. GraphSAINT: Graph sampling based inductive learning method.
In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=BJe8pkHFwS.
[190] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. Energy-
efficient CNN implementation on a deeply pipelined fpga cluster. In Proceedings of
the 2016 International Symposium on Low Power Electronics and Design, 2016.
[191] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. Deepemd: Few-shot image
classification with differentiable earth mover’s distance and structured classifiers.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 12203–12213, 2020.
[192] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned
quantization for highly accurate and compact deep neural networks. In Proceedings
of the European conference on computer vision (ECCV), pages 365–382, 2018.
[193] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung.
Gaan: Gated attention networks for learning on large and spatiotemporal graphs.
arXiv preprint arXiv:1803.07294, 2018.
[194] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In
Advances in Neural Information Processing Systems, pages 5165–5175, 2018.
[195] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep
learning architecture for graph classification. In Thirty-Second AAAI Conference on
Artificial Intelligence, 2018.
[196] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group
convolutions for deep neural networks. CoRR, abs/1707.02725, 2017. URL
http://arxiv.org/abs/1707.02725.
[197] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely
efficient convolutional neural network for mobile devices. In IEEE Conference on
Computer Vision and Pattern Recognition, 2018.
[198] Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu,
and Deming Chen. DNNBuilder: an automated tool for building high-performance
dnn hardware accelerators for FPGAs. In Proceedings of the International Conference
on Computer-Aided Design, page 56. ACM, 2018.
[199] Ruizhe Zhao, Ho-Cheung Ng, Wayne Luk, and Xinyu Niu. Towards efficient con-
volutional neural network for domain-specific applications on FPGA. 2018 28th
International Conference on Field Programmable Logic and Applications (FPL),
pages 147–1477, 2018.
[200] Ruizhe Zhao, Xinyu Niu, and Wayne Luk. Automatic optimising CNN with depthwise
separable convolution on FPGA: (abstact only). In ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, 2018.
[201] Yiren Zhao, Xitong Gao, Robert Mullins, and Chengzhong Xu. Mayo: A framework
for auto-generating hardware friendly deep neural networks. In Proceedings of the
2nd International Workshop on Embedded and Mobile Deep Learning, pages 25–30,
2018.
[202] Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, and Cheng-Zhong Xu.
Focused quantization for sparse CNNs. In Advances in Neural Information Processing
Systems, pages 5584–5593, 2019.
[203] Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, Robert Mullins, Pe-
ter YK Cheung, George Constantinides, and Cheng-Zhong Xu. Automatic generation
of multi-precision multi-arithmetic CNN accelerators for FPGAs. In 2019 Interna-
tional Conference on Field-Programmable Technology (ICFPT), pages 45–53. IEEE,
2019.
[204] Yiren Zhao, Duo Wang, Daniel Bates, Robert Mullins, Mateja Jamnik, and Pietro
Lio. Learned low precision graph neural networks. arXiv preprint arXiv:2009.09232,
2020.
[205] Yiren Zhao, Duo Wang, Xitong Gao, Robert Mullins, Pietro Lio, and Mateja
Jamnik. Probabilistic dual network architecture search on graphs. arXiv preprint
arXiv:2003.09676, 2020.
[206] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-
wise neural network architecture generation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 2423–2432, 2018.
[207] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremen-
tal network quantization: Towards lossless CNNs with low-precision weights. In
International Conference on Learning Representations, 2017.
[208] Hao Zhou, Jose M. Alvarez, and Fatih Porikli. Less is more: Towards compact CNNs.
In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision
– ECCV 2016, pages 662–677, Cham, 2016. Springer International Publishing. ISBN
978-3-319-46493-0.
[209] Kaixiong Zhou, Qingquan Song, Xiao Huang, and Xia Hu. Auto-GNN: Neural
architecture search of graph neural networks. arXiv preprint arXiv:1909.03184, 2019.
[210] Shuchang Zhou, Zekun Ni, et al. DoReFa-Net: Training low bitwidth convolutional
neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.
[211] Yanqi Zhou, Xuanyi Dong, Berkin Akin, Mingxing Tan, Daiyi Peng, Tianjian
Meng, Amir Yazdanbakhsh, Da Huang, Ravi Narayanaswami, and James Laudon.
Rethinking co-design of neural architectures and hardware accelerators. arXiv
preprint arXiv:2102.08619, 2021.
[212] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary
quantization. International Conference on Learning Representations (ICLR), 2017.
[213] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu,
Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep
neural networks. In Advances in Neural Information Processing Systems (NIPS).
2018.
[214] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-
layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
[215] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning.
International Conference on Learning Representations (ICLR), 2017.
[216] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning trans-
ferable architectures for scalable image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 8697–8710, 2018.