
AutoML: A Survey of the State-of-the-Art

Xin He, Kaiyong Zhao, Xiaowen Chu∗


Department of Computer Science, Hong Kong Baptist University

Abstract
Deep learning (DL) techniques have obtained remarkable achievements on various tasks, such as image
recognition, object detection, and language modeling. However, building a high-quality DL system for a specific task
highly relies on human expertise, hindering its wide application. Meanwhile, automated machine learning (AutoML)
is a promising solution for building a DL system without human assistance and is being extensively studied. This
paper presents a comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. According to the
DL pipeline, we introduce AutoML methods –– covering data preparation, feature engineering, hyperparameter optimization,
and neural architecture search (NAS) –– with a particular focus on NAS, as it is currently a hot sub-topic of AutoML.
We summarize the representative NAS algorithms’ performance on the CIFAR-10 and ImageNet datasets and further discuss
the following subjects of NAS methods: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture
optimization, and resource-aware NAS. Finally, we discuss some open problems related to the existing AutoML methods
for future research.
Keywords: deep learning, automated machine learning (AutoML), neural architecture search (NAS), hyperparameter
optimization (HPO)

1. Introduction

In recent years, deep learning has been applied in various fields and used to solve many challenging AI tasks, in areas such as image classification [1, 2], object detection [3], and language modeling [4, 5]. Specifically, since AlexNet [1] outperformed all other traditional manual methods in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6], increasingly complex and deep neural networks have been proposed. For example, VGG-16 [7] has more than 130 million parameters, occupies nearly 500 MB of memory space, and requires 15.3 billion floating-point operations to process an image of size 224 × 224. Notably, however, these models were all manually designed by experts through a trial-and-error process, which means that even experts require substantial resources and time to create well-performing models.

To reduce these onerous development costs, a novel idea of automating the entire pipeline of machine learning (ML) has emerged, i.e., automated machine learning (AutoML). There are various definitions of AutoML. For example, according to [8], AutoML is designed to reduce the demand for data scientists and enable domain experts to automatically build ML applications without much requirement for statistical and ML knowledge. In [9], AutoML is defined as a combination of automation and ML. In a word, AutoML can be understood as the automated construction of an ML pipeline on a limited computational budget. With the exponential growth of computing power, AutoML has become a hot topic in both industry and academia. A complete AutoML system can dynamically combine various techniques to form an easy-to-use end-to-end ML pipeline system (as shown in Figure 1). Many AI companies have created and publicly shared such systems (e.g., Cloud AutoML by Google, https://cloud.google.com/automl/) to help people with little or no ML knowledge to build high-quality custom models.

As Figure 1 shows, the AutoML pipeline consists of several processes: data preparation, feature engineering, model generation, and model evaluation. Model generation can be further divided into search space and optimization methods. The search space defines the design principles of ML models, which can be divided into two categories: traditional ML models (e.g., SVM and KNN) and neural architectures. The optimization methods are classified into hyperparameter optimization (HPO) and architecture optimization (AO), where the former indicates the training-related parameters (e.g., the learning rate and batch size), and the latter indicates the model-related parameters (e.g., the number of layers for neural architectures and the number of neighbors for KNN). NAS consists of three important components: the search space of neural architectures, AO methods, and model evaluation methods. AO methods may also be referred to as the search strategy [10] or search policy [11]. Zoph et al. [12] were among the first to propose NAS, where a recurrent network is trained by reinforcement learning to automatically search for the best-performing architecture.

∗Corresponding author.
Email addresses: csxinhe@comp.hkbu.edu.hk (Xin He), kyzhao@comp.hkbu.edu.hk (Kaiyong Zhao), chxw@comp.hkbu.edu.hk (Xiaowen Chu).


Figure 1: An overview of AutoML pipeline covering data preparation (Section 2), feature engineering (Section 3), model generation (Section 4)
and model evaluation (Section 5).

Since [12] successfully discovered a neural network achieving comparable results to human-designed models, there has been an explosion of research interest in AutoML, with most work focusing on NAS. NAS aims to search for a robust and well-performing neural architecture by selecting and combining different basic operations from a predefined search space. By reviewing NAS methods, we classify the commonly used search spaces into entire-structured [12, 13, 14], cell-based [13, 15, 16, 17, 18], hierarchical [19], and morphism-based [20, 21, 22] search spaces. The commonly used AO methods include reinforcement learning (RL) [12, 15, 23, 16, 13], evolution-based algorithms (EA) [24, 25, 26, 27, 28, 29, 30], gradient descent (GD) [17, 31, 32], surrogate model-based optimization (SMBO) [33, 34, 35, 36, 37, 38, 39], and hybrid AO methods [40, 41, 42, 43, 44].

Although there are already several excellent AutoML-related surveys [10, 45, 46, 9, 8], to the best of our knowledge, our survey covers a broader range of AutoML methods. As summarized in Table 1, [10, 45, 46] only focus on NAS, while [9, 8] cover little of the NAS technique. In this paper, we summarize the AutoML-related methods according to the complete AutoML pipeline (Figure 1), providing beginners with a comprehensive introduction to the field. Notably, many sub-topics of AutoML are large enough to have their own surveys. However, our goal is not to conduct a thorough investigation of all AutoML sub-topics. Instead, we focus on the breadth of research in the field of AutoML. Therefore, we will summarize and discuss some representative methods of each process in the pipeline.

Survey                    DP   FE   HPO   NAS
NAS Survey [10]           -    -    -     C
A Survey on NAS [45]      -    -    -     C
NAS Challenges [46]       -    -    -     C
A Survey on AutoML [9]    -    C    C     †
AutoML Challenges [47]    C    -    C     †
AutoML Benchmark [8]      C    C    C     -
Ours                      C    C    C     C

Table 1: Comparison between different AutoML surveys. The "Survey" column gives each survey a label based on its title to increase readability. DP, FE, HPO, and NAS indicate data preparation, feature engineering, hyperparameter optimization, and neural architecture search, respectively. "-", "C", and "†" indicate that the content is 1) not mentioned, 2) discussed in detail, or 3) mentioned briefly in the original paper, respectively.

The rest of this paper is organized as follows. The processes of data preparation, feature engineering, model generation, and model evaluation are presented in Sections 2, 3, 4, and 5, respectively. In Section 6, we compare the performance of NAS algorithms on the CIFAR-10 and ImageNet datasets, and discuss several subtopics of great concern in the NAS community: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture optimization, and resource-aware NAS. In Section 7, we describe several open problems in AutoML. We conclude our survey in Section 8.

2. Data Preparation

The first step in the ML pipeline is data preparation. Figure 2 presents the workflow of data preparation, which can be introduced in three aspects: data collection, data cleaning, and data augmentation. Data collection is a
necessary step to build a new dataset or extend an existing dataset. The process of data cleaning is used to filter noisy data so that downstream model training is not compromised. Data augmentation plays an important role in enhancing model robustness and improving model performance. The following subsections cover these three aspects in more detail.

Figure 2: The flow chart for data preparation.

2.1. Data Collection

ML's deepening study has led to a consensus that high-quality datasets are of critical importance for ML; as a result, numerous open datasets have emerged. In the early stages of ML study, a handwritten digit dataset, i.e., MNIST [48], was developed. After that, several larger datasets like CIFAR-10 and CIFAR-100 [49] and ImageNet [50] were developed. A variety of datasets can also be retrieved by entering keywords into these websites: Kaggle (https://www.kaggle.com), Google Dataset Search (GOODS) (https://datasetsearch.research.google.com/), and Elsevier Data Search (https://www.datasearch.elsevier.com/).

However, it is usually challenging to find a proper dataset through the above approaches for some particular tasks, such as those related to medical care or other privacy matters. Two types of methods are proposed to solve this problem: data searching and data synthesis.

2.1.1. Data Searching

As the Internet is an inexhaustible data source, searching for Web data is an intuitive way to collect a dataset [51, 52, 53, 54]. However, there are some problems with using Web data.

First, the search results may not exactly match the keywords. Thus, unrelated data must be filtered. For example, Krause et al. [55] separate inaccurate results as cross-domain or cross-category noise, and remove any images that appear in search results for more than one category. Vo et al. [56] re-rank relevant results and provide search results linearly, according to keywords.

Second, Web data may be incorrectly labeled or even unlabeled. A learning-based self-labeling method is often used to solve this problem. For example, the active learning method [57] selects the most "uncertain" unlabeled individual examples for labeling by a human, and then iteratively labels the remaining data. Roh et al. [58] provided a review of semi-supervised learning self-labeling methods, which can help take the human out of the labeling loop to improve efficiency, and which can be divided into the following categories: self-training [59, 60], co-training [61, 62], and co-learning [63]. Moreover, due to the complexity of Web image content, a single label cannot adequately describe an image. Consequently, Yang et al. [51] assigned multiple labels to a Web image; i.e., if the confidence scores of these labels are very close, or the label with the highest score is the same as the original label of the image, then this image is set as a new training sample.

However, the distribution of Web data can be extremely different from that of the target dataset, which increases the difficulty of training the model. A common solution is to fine-tune on these Web data [64, 65]. Yang et al. [51] proposed an iterative algorithm for model training and Web data filtering. Dataset imbalance is another common problem, as some special classes have a very limited amount of Web data. To solve this problem, the synthetic minority over-sampling technique (SMOTE) [66] is used to synthesize new minority samples between existing real minority samples, instead of simply up-sampling the minority samples or down-sampling the majority samples. In another approach, Guo et al. [67] combined the boosting method with data generation to enhance the generalizability and robustness of the model against imbalanced data sets.
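To make the over-sampling idea concrete, the following is a minimal NumPy sketch of SMOTE-style interpolation between minority samples. It is an illustration of the interpolation step only, not the full SMOTE algorithm of [66]; the array shapes, k, and random seeds are illustrative assumptions.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, rng=np.random.default_rng(0)):
    """Synthesize n_new samples by interpolating between a random minority
    sample and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)     # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]           # k nearest, skipping x itself
        x_nn = minority[rng.choice(neighbors)]
        lam = rng.random()                           # interpolation factor in [0, 1)
        synthetic.append(x + lam * (x_nn - x))       # new point on the segment x -> x_nn
    return np.array(synthetic)

# toy usage: 20 minority samples with 4 features, 10 synthetic samples
minority = np.random.default_rng(1).normal(size=(20, 4))
print(smote_like_oversample(minority, n_new=10).shape)   # (10, 4)
```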
2.1.2. Data Synthesis

A data simulator is one of the most commonly used ways to generate data. For some particular tasks, such as autonomous driving, it is not possible to test and adjust a model in the real world during the research phase, due to safety hazards. Therefore, a practical approach to generating data is to use a data simulator that matches the real world as closely as possible. OpenAI Gym [68] is a popular toolkit that provides various simulation environments, in which developers can concentrate on designing their algorithms instead of struggling to generate data. Wang et al. [69] used a popular game engine, Unreal Engine 4, to build a large synthetic indoor robotics stereo (IRS) dataset, which provides the information for disparity and surface normal estimation. Furthermore, a reinforcement learning-based method is applied in [70] to optimize the parameters of a data simulator and thereby control the distribution of the synthesized data.

Another novel technique for deriving synthetic data is generative adversarial networks (GANs) [71], which can be used to generate image [71, 72, 73, 74], tabular [75, 76], and text [77] data. Karras et al. [78] applied the GAN technique to generate realistic human face images. Oh and Jaroensri et al. [72] built a synthetic dataset that captures small motion for video-motion magnification. Bowles et al. [74] demonstrated the feasibility of using GANs to generate medical images for brain segmentation tasks. In the case of textual data, applying GANs to text has proved difficult, because the commonly used method is to use reinforcement learning to update the gradient of the generator, but the text is discrete, and thus the gradient cannot propagate from the discriminator to the generator. To solve this problem, Donahue et al. [77] used an autoencoder to encode sentences into a smooth sentence representation, removing the barrier to reinforcement learning. Park et al. [75] applied a GAN to synthesize fake tables that are statistically similar to the original table but do not cause information leakage. Similarly, in [76], a GAN is applied to generate tabular data such as medical or educational records.

2.2. Data Cleaning

The collected data inevitably contain noise, and the noise can negatively affect the training of the model. Therefore, the process of data cleaning [79, 80] must be carried out if necessary. Across the literature, the effort of data cleaning is shifting from crowdsourcing to automation. Traditionally, data cleaning requires specialist knowledge, but access to specialists is limited and generally expensive. Hence, Chu et al. [81] proposed Katara, a knowledge-based and crowd-powered data cleaning system. To improve efficiency, some studies [82, 83] proposed cleaning only a small subset of the data while maintaining results comparable to those obtained by cleaning the full dataset. However, these methods require a data scientist to design which data cleaning operations are applied to the dataset. BoostClean [84] attempts to automate this process by treating it as a boosting problem. Each data cleaning operation effectively adds a new cleaning operation to the input of the downstream ML model, and through a combination of boosting and feature selection, a good series of cleaning operations that improves the performance of the ML model can be generated. AlphaClean [85] transforms data cleaning into a hyperparameter optimization problem, which further increases automation. Specifically, the final data cleaning combinatorial operation in AlphaClean is composed of several pipelined cleaning operations that need to be searched from a predefined search space. Gemp et al. [86] attempted to use meta-learning techniques to automate the process of data cleaning.

The data cleaning methods mentioned above are applied to a fixed dataset. However, the real world generates vast amounts of data every day. In other words, how to clean data in a continuous process becomes a problem worth studying, especially for enterprises. Ilyas et al. [87] proposed an effective way of evaluating algorithms for continuously cleaning data. Mahdavi et al. [88] built a cleaning workflow orchestrator, which can learn from previous cleaning tasks, and proposed promising cleaning workflows for new datasets.

Figure 3: A classification of data augmentation techniques.

2.3. Data Augmentation


To some degree, data augmentation (DA) can also
be regarded as a tool for data collection, as it can
generate new data based on the existing data.
However, DA also serves as a regularizer to avoid over-
fitting of model training and has received more and
more attention. Therefore, we introduce DA as a
separate part of data preparation in detail. Figure 3
classifies DA techniques from the perspective of data
type (image, audio, and text), and incorporates
automatic DA techniques that have recently received
much attention.
For image data, the affine transformations include rotation, scaling, random cropping, and reflection; the elastic transformations contain operations like contrast shift, brightness shift, blurring, and channel shuffle; and the advanced transformations involve random erasing, image blending, cutout [89], and mixup [90], etc. These three types of common transformations are available in some open-source libraries, like torchvision (https://pytorch.org/docs/stable/torchvision/transforms.html), ImageAug [91], and Albumentations [92]. The neural-based transformations can be divided into three categories: adversarial noise [93], neural style transfer [94], and the GAN technique [95].
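As an illustration, the sketch below composes a few of the affine and photometric transformations with torchvision and implements mixup [90] in a few lines. The particular transform choices and the Beta parameter are illustrative assumptions, not a recommended policy.

```python
import torch
from torchvision import transforms

# A small pipeline of common transformations (reflection, cropping, brightness/contrast shift).
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

def mixup(x, y, alpha=1.0):
    """Mixup: blend a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[index]
    # training loss would be: lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b)
    return mixed_x, y, y[index], lam

# toy usage on a random batch of 8 CIFAR-sized images
x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
mixed_x, y_a, y_b, lam = mixup(x, y)
print(mixed_x.shape, float(lam))
```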
For textual data, Wong et al. [96] proposed two approaches for creating additional training examples: data warping and synthetic over-sampling. The former generates additional samples by applying transformations in data space, and the latter creates additional samples in feature space. Textual data can be augmented by synonym insertion, or by first translating the text into a foreign language and then translating it back to the original language. In a recent study, Xie et al. [97] proposed a non-domain-specific DA policy that uses noising in RNNs, and this approach works well for the tasks of language modeling and machine translation. Yu et al. [98] proposed a back-translation method for DA to improve reading comprehension. NLPAug [99] is an open-source library that integrates many types of augmentation operations for both textual and audio data.

The above augmentation techniques still require humans to select augmentation operations and then form a specific DA policy for specific tasks, which requires much expertise and time. Recently, many methods [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110] have been proposed to search for augmentation policies for different tasks. AutoAugment [100] is a pioneering work that automates the search for optimal DA policies using reinforcement learning. However, AutoAugment is not efficient, as it takes almost 500 GPU hours for one augmentation search. To improve search efficiency, a number of improved algorithms have subsequently been proposed using different search strategies, such as gradient descent-based search [101, 102], Bayesian optimization [103], online hyperparameter learning [109], greedy search [104], and random search [107]. Besides, LingChen et al. [110] proposed a search-free DA method, namely UniformAugment, by assuming that the augmentation space is approximately distribution invariant.
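As a tiny illustration of what "searching for a policy" means, the sketch below randomly samples augmentation sub-policies (operation, magnitude pairs) in the spirit of the random-search baseline [107]. The operation pool and magnitude ranges are illustrative assumptions and are much smaller than the AutoAugment search space.

```python
import random
from torchvision import transforms

# Candidate operations, each parameterized by a magnitude in [0, 1] (assumed pool).
OP_POOL = {
    "rotate":   lambda m: transforms.RandomRotation(degrees=30 * m),
    "contrast": lambda m: transforms.ColorJitter(contrast=m),
    "crop":     lambda m: transforms.RandomCrop(32, padding=int(1 + 7 * m)),
}

def sample_policy(n_sub=2, rng=random.Random(0)):
    """A policy = a few (operation, magnitude) pairs applied in sequence."""
    ops = []
    for _ in range(n_sub):
        name = rng.choice(list(OP_POOL))
        magnitude = rng.uniform(0.1, 1.0)
        ops.append(OP_POOL[name](magnitude))
    return transforms.Compose(ops)

# In a search loop, each sampled policy would be scored by training a small
# proxy model and the policy with the best validation accuracy would be kept.
policy = sample_policy()
print(policy)
```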

3. Feature Engineering

It is generally accepted that data and features determine the upper bound of ML, and that models and algorithms can only approximate this limit. In this context, feature engineering aims to maximize the extraction of features from raw data for use by algorithms and models. Feature engineering consists of three sub-topics: feature selection, feature extraction, and feature construction. Feature extraction and construction are variants of feature transformation, by which a new set of features is created [111]. In most cases, feature extraction aims to reduce the dimensionality of features by applying specific mapping functions, while feature construction is used to expand the original feature space, and the purpose of feature selection is to reduce feature redundancy by selecting important features. Thus, the essence of automatic feature engineering is, to some degree, a dynamic combination of these three processes.
3.1. Feature Selection
Feature selection builds a feature subset based on
the original feature set by reducing irrelevant or
redundant features. This tends to simplify the model,
hence avoiding overfitting and improving model
performance. The selected features are usually
divergent and highly correlated with object values.
According to [112], there are four basic steps in a
typical process of feature selection (see Figure 4), as
follows:

Figure 4: The iterative process of feature selection. A subset of features is selected based on a search strategy and then evaluated. Then, a validation procedure is implemented to determine whether the subset is valid. The above steps are repeated until the stop criterion is satisfied.

The search strategy for feature selection involves three types of algorithms: complete search, heuristic search, and random search. Complete search comprises exhaustive and non-exhaustive searching; the latter can be further split into four methods: breadth-first search, branch and bound search, beam search, and best-first search. Heuristic search comprises sequential forward selection (SFS), sequential backward selection (SBS), and bidirectional search (BS). In SFS and SBS, features are added starting from an empty set or removed from a full set, respectively, whereas BS uses both SFS and SBS to search until the two algorithms obtain the same subset. The most commonly used random search methods are simulated annealing (SA) and genetic algorithms (GAs).
Methods of subset evaluation can be divided into three different categories. The first is the filter method, which scores each feature according to its divergence or correlation and then selects features according to a threshold. Commonly used scoring criteria for each feature are the variance, the correlation coefficient, the chi-square test, and mutual information. The second is the wrapper method, which classifies the sample set with the selected feature subset, after which the classification accuracy is used as the criterion to measure the quality of the feature subset. The third is the embedded method, in which variable selection is performed as part of the learning procedure.

Regularization, decision tree, and deep learning are all embedded methods.
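For instance, a filter-style selection with mutual information and a simple wrapper-style check of the selected subset can be written with scikit-learn as below; the dataset, scorer, and value of k are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Filter method: score every feature independently, keep the top 10.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_sel = selector.fit_transform(X, y)

# Wrapper-style evaluation: judge the selected subset by downstream accuracy.
clf = LogisticRegression(max_iter=5000)
full_acc = cross_val_score(clf, X, y, cv=5).mean()
subset_acc = cross_val_score(clf, X_sel, y, cv=5).mean()
print(f"all {X.shape[1]} features: {full_acc:.3f}, top-10 subset: {subset_acc:.3f}")
```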
3.2. Feature Construction

Feature construction is a process that constructs new features from the basic feature space or raw data to enhance the robustness and generalizability of the model. Essentially, this is done to increase the representative ability of the original features. This process is traditionally highly dependent on human expertise, and one of the most commonly used methods is preprocessing transformation, such as standardization, normalization, or feature discretization. In addition, the transformation operations for different types of features may vary. For example, operations such as conjunction, disjunction, and negation are typically used for Boolean features; operations such as minimum, maximum, addition, subtraction, and mean are typically used for numerical features; and operations such as the Cartesian product [113] and M-of-N [114] are commonly used for nominal features.
nominal features.
It is impossible to manually explore all possibilities.
Hence, to further improve efficiency, some automatic fea-
ture construction methods [115, 114, 116, 117] have been
proposed to automate the process of searching and
evaluat- ing the operation combination, and shown to
achieve results as good as or superior to those achieved by
human exper- tise. Besides, some feature construction
methods, such as decision tree-based methods [115, 114]
and genetic algo- rithms [116], require a predefined
operation space, while the annotation-based approaches
[117] do not, as they can use domain knowledge (in the
form of annotation) and the training examples, and
hence, can be traced back to the interactive feature-space
construction protocol intro- duced by [118]. Using this
protocol, the learner identifies inadequate regions of
feature space and, in coordination with a domain expert,
adds descriptiveness using existing semantic resources.
After selecting possible operations and constructing a new
feature, feature-selection techniques are applied to evaluate
the new feature.

3.3. Feature Extraction

Feature extraction is a dimensionality-reduction process performed via some mapping functions. It extracts informative and non-redundant features according to certain metrics. Unlike feature selection, feature extraction alters the original features. The kernel of feature extraction is a mapping function, which can be implemented in many ways. The most prominent approaches are principal component analysis (PCA), independent component analysis, isomap, nonlinear dimensionality reduction, and linear discriminant analysis (LDA). Recently, the feed-forward neural network approach has become popular; this uses the hidden units of a pretrained model as extracted features. Furthermore, many autoencoder-based algorithms have been proposed; for example, Zeng et al. [119] proposed a relation autoencoder model that considers data features and their relationships, while an unsupervised feature-extraction method using autoencoder trees is proposed by [120].
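A minimal scikit-learn sketch of PCA-based feature extraction follows; the dataset and the number of components are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 samples, 64 raw pixel features

# Map the original features to a lower-dimensional space; unlike feature
# selection, every new feature is a linear combination of the originals.
pca = PCA(n_components=16)
X_new = pca.fit_transform(X)

print(X.shape, "->", X_new.shape)                                   # (1797, 64) -> (1797, 16)
print("explained variance:", round(pca.explained_variance_ratio_.sum(), 3))
```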
Figure 5: An overview of the neural architecture search pipeline.

4. Model Generation

Model generation is divided into two parts––search


space and optimization methods––as shown in Figure 1.
The search space defines the model structures that can
be designed and optimized in principle. The types of
models can be broadly divided into two categories:
traditional ML models, such as support vector machine
(SVM) [121] and k-nearest neighbors algorithm (KNN)
[122], and deep neural network (DNN). There are two
types of parameters for the optimization methods:
hyperparameters used for training, such as the learning
rate, and those used for model design, such as the
filter size and the number of layers for DNN. Neural
architecture search (NAS) has recently attracted
considerable attention; therefore, in this section, we
introduce the search space and optimization methods
of NAS technique. Readers who are interested in
traditional models (e.g., SVM) can refer to other
reviews [9, 8].
Figure 5 presents an overview of the NAS pipeline,
which is categorized into the following three
dimensions [10, 123]: search space, architecture
optimization (AO) method6, and model evaluation
method.

• Search Space. The search space defines the design principles of neural architectures. Different scenarios require different search spaces. Here, we summarize four types of commonly used search spaces: entire-structured, cell-based, hierarchical, and morphism-based.

6
It can also be referred to as the “search strategy [10, 123]”,
“search policy [11]”, or “optimization method [45, 9]”.

• Architecture Optimization Method. The architecture optimization (AO) method defines how to guide the search to efficiently find the model architecture with high performance after the search space is defined.

• Model Evaluation Method. Once a model is generated, its performance needs to be evaluated. The simplest approach is to train the model to convergence on the training set and then estimate its performance on the validation set; however, this method is time-consuming and resource-intensive. Some advanced methods can accelerate the evaluation process but lose fidelity in the process. Thus, how to balance the efficiency and effectiveness of an evaluation is a problem worth studying.

The search space and AO methods are presented in this section, while the methods of model evaluation are presented in the next section.

4.1. Search Space

A neural architecture can be represented as a directed acyclic graph (DAG) comprising B ordered nodes. In the DAG, each node and each directed edge indicate a feature tensor and an operation, respectively. Eq. 1 presents the formula for the computation at any node Z_k, k ∈ {1, 2, ..., B}:

Z_k = Σ_{i=1}^{N_k} o_i(I_i),   o_i ∈ O        (1)

where N_k indicates the indegree of node Z_k, I_i and o_i represent the i-th input tensor and its associated operation, respectively, and O is a set of candidate operations, such as convolution, pooling, activation functions, skip connection, concatenation, and addition. To further enhance the model performance, many NAS methods use certain advanced human-designed modules as primitive operations, such as depth-wise separable convolution [124], dilated convolution [125], and squeeze-and-excitation (SE) blocks [126]. The selection and combination of these operations vary with the design of the search space. In other words, the search space defines the structural paradigm that AO methods can explore; thus, designing a good search space is a vital but challenging problem. In general, a good search space is expected to exclude human bias and be flexible enough to cover a wide variety of model architectures. Based on the existing NAS studies, we detail the commonly used search spaces as follows.
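Before turning to the specific search spaces, the node computation of Eq. 1 can be sketched directly: a node sums the outputs of the operations applied to its inputs. This is a minimal PyTorch sketch; the candidate operation set and tensor sizes are illustrative assumptions, not a particular NAS system's operation set.

```python
import torch
import torch.nn as nn

# A small candidate operation set O (assumed for illustration).
CANDIDATE_OPS = {
    "conv3x3": lambda c: nn.Conv2d(c, c, 3, padding=1),
    "conv5x5": lambda c: nn.Conv2d(c, c, 5, padding=2),
    "maxpool": lambda c: nn.MaxPool2d(3, stride=1, padding=1),
    "skip":    lambda c: nn.Identity(),
}

class Node(nn.Module):
    """Computes Z_k = sum_i o_i(I_i) over the node's incoming edges (Eq. 1)."""
    def __init__(self, op_names, channels):
        super().__init__()
        self.ops = nn.ModuleList(CANDIDATE_OPS[name](channels) for name in op_names)

    def forward(self, inputs):             # inputs: list of tensors I_1 ... I_Nk
        return sum(op(x) for op, x in zip(self.ops, inputs))

# toy usage: a node with indegree 2
node = Node(["conv3x3", "skip"], channels=8)
i1, i2 = torch.rand(1, 8, 16, 16), torch.rand(1, 8, 16, 16)
print(node([i1, i2]).shape)                # torch.Size([1, 8, 16, 16])
```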
4.1.1. Entire-structured Search Space

The space of entire-structured neural networks [12, 13] is one of the most intuitive and straightforward search spaces. Figure 6 presents two simplified examples of entire-structured models, which are built by stacking a predefined number of nodes, where each node represents a layer and performs a specified operation. The left model shown in Figure 6 indicates the simplest structure, while the right model is relatively complex, as it permits arbitrary skip connections [2] to exist between the ordered nodes; these connections have been proven effective in practice [12]. Although an entire structure is easy to implement, it has several disadvantages. For example, it is widely accepted that the deeper the model, the better its generalization ability; however, searching for such a deep network is onerous and computationally expensive. Furthermore, the generated architecture lacks transferability: a model generated on a small dataset may not fit a larger dataset, which necessitates the generation of a new model for a larger dataset.

Figure 6: Two simplified examples of entire-structured neural architectures. Each layer is specified with a different operation, such as convolution and max-pooling operations. The edges indicate the information flow. The skip-connection operation used in the right example can help explore deeper and more complex neural architectures.

4.1.2. Cell-based Search Space

Motivation. To enable the transferability of the generated model, the cell-based search space has been proposed [15, 16, 13], in which the neural architecture is composed of a fixed number of repeating cell structures. This design approach is based on the observation that many well-performing human-designed models [2, 127] are also built by stacking a fixed number of modules. For example, the ResNet family builds many variants, such as ResNet50, ResNet101, and ResNet152, by stacking several BottleNeck modules [2]. Throughout the literature, this repeated module is referred to as a motif, cell, or block; in this paper, we call it a cell.

Design. Figure 7 (left) presents an example of a final cell-based neural network, which comprises two types of cells: normal and reduction cells. Thus, the problem of searching for a full neural architecture is simplified into
searching for an optimal cell structure in the context of the cell-based search space. Besides, the output of the normal cell retains the same spatial dimension as the input, and the number of normal cell repeats is usually set manually based on the actual demand. The reduction cell follows behind a normal cell and has a similar structure to that of the normal cell, with the differences being that the width and height of the output feature maps of the reduction cell are half those of the input, and the number of channels is twice that of the input. This design approach follows the common practice of manually designing neural networks. Unlike the entire-structured search space, the model built on the cell-based search space can be expanded to form a larger model by simply adding more cells, without re-searching for the cell structure. Meanwhile, many approaches [17, 13, 15] have experimentally demonstrated the transferability of the model generated in the cell-based search space; for example, the model built on CIFAR-10 can also achieve results comparable to those of SOTA human-designed models on ImageNet.

The design paradigm of the internal cell structure of most NAS studies refers to Zoph et al. [15], who were among the first to propose the exploration of the cell-based search space. Figure 7 (right) shows an example of a normal cell structure. Each cell contains B blocks (here B = 2), and each block has two nodes. Each node in a block can be assigned different operations and receive different inputs. The outputs of the two nodes in a block can be combined through an addition or concatenation operation. Therefore, each block can be represented by a five-element tuple, (I1, I2, O1, O2, C), where I1, I2 ∈ I_b indicate the inputs to the block, O1, O2 ∈ O indicate the operations applied to the inputs, and C ∈ C describes how to combine O1 and O2. As the blocks are ordered, the set of candidate inputs I_b for the nodes in block b_k contains the outputs of the previous two cells and the outputs of all previous blocks {b_i, i < k} of the same cell. The first two inputs of the first cell of the whole model are set to the image data by default.

Figure 7: (Left) Example of a cell-based model comprising three motifs, each with n normal cells and one reduction cell. (Right) Example of a normal cell, which contains two blocks, each having two nodes. Each node is specified with a different operation and input.

In the actual implementation, certain essential details need to be noted. First, the number of channels may differ for different inputs. A commonly used solution is to apply a calibration operation on each node's input tensor to ensure that all inputs have the same number of channels. The calibration operation generally uses 1×1 convolution filters, such that it does not change the width and height of the input tensor but keeps the channel number of all input tensors consistent. Second, as mentioned above, the input of a node in a block can be obtained from the previous two cells or the previous blocks within the same cell; hence, the cells' outputs must have the same spatial resolution. To this end, if the input/output resolutions are different, the calibration operation has stride 2; otherwise, it has stride 1. Besides, all blocks have stride 1.

Complexity. Searching for a cell structure is more efficient than searching for an entire structure. To illustrate this, let us assume that there are M predefined candidate operations, the number of layers for both the entire and the cell-based structures is L, and the number of blocks in a cell is B. Then, the number of possible entire structures can be expressed as:

N_entire = M^L × 2^{L(L−1)/2}        (2)

The number of possible cells is (M^B × (B + 2)!)^2. However, as there are two types of cells (i.e., normal and reduction cells), the final size of the cell-based search space is calculated as:

N_cell = (M^B × (B + 2)!)^4        (3)

Evidently, the complexity of searching for the entire structure grows exponentially with the number of layers. For an intuitive comparison, we assign the variables in Eqs. 2 and 3 the typical values from the literature, i.e., M = 5, L = 10, B = 3; then N_entire = 3.44 × 10^20 is much larger than N_cell = 5.06 × 10^16.
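These two counts can be reproduced with a few lines of arithmetic (using the same assumed values M = 5, L = 10, B = 3):

```python
from math import factorial

M, L, B = 5, 10, 3                        # candidate ops, layers, blocks per cell

n_entire = M**L * 2**(L * (L - 1) // 2)   # Eq. 2: ops per layer x possible skip connections
n_cell = (M**B * factorial(B + 2))**4     # Eq. 3: two cell types, each with (M^B * (B+2)!)^2 possibilities

print(f"N_entire = {n_entire:.2e}")       # ~3.44e+20
print(f"N_cell   = {n_cell:.2e}")         # ~5.06e+16
```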
Two-stage Gap. The NAS methods of the cell-based search space usually comprise two phases: search and evaluation. First, in the search phase, the best-performing model is selected; then, in the evaluation phase, it is trained from scratch or fine-tuned. However, there exists a large gap in the model depth between the two phases. As Figure 8 (left) shows, for DARTS [17], the generated model in the search phase only comprises eight cells, to reduce the GPU memory consumption, while in the evaluation phase, the number of cells is extended to 20. Although the search phase finds the best cell structure for the shallow model, this does not mean that it is still suitable for the deeper model in the evaluation phase. In other words, simply adding more cells may deteriorate the model performance. To bridge this gap, Chen et al. [128] proposed an improved method based on DARTS, namely progressive DARTS (P-DARTS), which divides the search phase into multiple stages and gradually increases the depth of the searched networks at the end of each stage, hence bridging the gap between search and evaluation. However, increasing the number of cells in the search phase may result in heavier computational overhead. Thus, to reduce the computational consumption, P-DARTS gradually reduces the number of candidate operations from 5 to 3, and then 2, through search space approximation methods, as shown in Figure 8. Experimentally, P-DARTS obtains a 2.50% error rate on the CIFAR-10 test dataset, outperforming the 2.83% error rate achieved by DARTS.

Figure 8: Difference between DARTS [17] and P-DARTS [128]. Both methods search and evaluate networks on the CIFAR-10 dataset. As the number of cell structures increases from 5 to 11 and then 17, the number of candidate operations is gradually reduced accordingly.

4.1.3. Hierarchical Search Space

The cell-based search space enables the transferability of the generated model, and most of the cell-based methods [13, 15, 23, 16, 25, 26] follow a two-level hierarchy: the inner level is the cell level, which selects the operation and connection for each node in the cell, and the outer level is the network level, which controls the spatial-resolution changes. However, these approaches focus on the cell level and ignore the network level. As shown in Figure 7, whenever a fixed number of normal cells are stacked, the spatial dimension of the feature maps is halved by adding a reduction cell. To jointly learn a suitable combination of repeatable cell and network structures, Liu et al. [129] defined a general formulation for a network-level structure, depicted in Figure 9, from which many existing good network designs can be reproduced. In this way, we can fully explore the different numbers of channels and sizes of feature maps of each layer in the network.

Figure 9: Network-level search space proposed by [129]. The blue point (top-left) indicates the fixed "stem" structure; the remaining gray and orange points are cell structures, as described above. The black arrows along the orange points indicate the final selected network-level structure. "d" and "L" indicate the down-sampling rate and layer, respectively.

In terms of the cell level, the number of blocks (B) in a cell is still manually predefined and fixed in the search stage. In other words, B is a new hyperparameter that requires tuning by human input. To address this problem, Liu et al. [19] proposed a novel hierarchical genetic representation scheme, namely HierNAS, in which a higher-level cell is generated by iteratively incorporating lower-level cells. As shown in Figure 10, level-one cells can be some primitive operations, such as 1 × 1 and 3 × 3 convolution and 3 × 3 max-pooling, and are the basic components of level-two cells. Then, level-two cells are used as primitive operations to generate level-three cells. The highest-level cell is a single motif corresponding to the full architecture. Besides, a higher-level cell is defined by a learnable adjacency upper-triangular matrix G, where G_ij = k indicates that the k-th operation o_k is implemented between nodes i and j. For example, the level-two cell shown in Figure 10(a) is defined by a matrix G, where G_01 = 2, G_02 = 1, G_12 = 0 (the index starts from 0). This method can identify more types of cell structures with more complex and flexible topologies. Similarly, Liu et al. [18] proposed progressive NAS (PNAS) to search for the cell progressively, starting from the simplest cell structure, which is composed of only one block, and then expanding to a higher-level cell by adding more possible block structures. Moreover, PNAS improves the search efficiency by using a surrogate model to predict the top-k promising blocks from the search space at each stage of cell construction.

Figure 10: Example of a three-level hierarchical architecture representation. The level-one primitive operations are assembled into level-two cells. The level-two cells are viewed as primitive operations and assembled into a level-three cell.

For both HierNAS and PNAS, once a cell structure is searched, it is used in all network layers, which limits the layer diversity. Besides, for achieving both high accuracy
and low latency, some studies [130, 131] proposed to search for complex and fragmented cell structures. For example, Tan et al. [130] proposed MnasNet, which uses a novel factorized hierarchical search space to generate different cell structures, namely MBConv, for different layers of the final network. Figure 11 presents the factorized hierarchical search space of MnasNet, which comprises a predefined number of cell structures. Each cell has a different structure and contains a variable number of blocks: all blocks in the same cell share the same structure, while those in other cells exhibit different structures. As this design method can achieve a suitable balance between model performance and latency, many subsequent studies [131, 132] have referred to it. Owing to the large computational consumption, most of the differentiable NAS (DNAS) techniques (e.g., DARTS) first search for a suitable cell structure on a proxy dataset (e.g., CIFAR-10), and then transfer it to a larger target dataset (e.g., ImageNet). Han et al. [132] proposed ProxylessNAS, which can directly search for neural networks on the targeted dataset and hardware platforms by using BinaryConnect [133], which addresses the high memory consumption issue.

Figure 11: Factorized hierarchical search space in MnasNet [130]. The final network comprises different cells. Each cell is composed of a variable number of repeated blocks, where the blocks in the same cell share the same structure but differ from those in the other cells.

4.1.4. Morphism-based Search Space

Isaac Newton is reported to have said that "If I have seen further, it is by standing on the shoulders of giants." Similarly, several training tricks have been proposed, such as knowledge distillation [134] and transfer learning [135]. However, these methods do not directly modify the model structure. To this end, Chen et al. [20] proposed the Net2Net technique for designing new neural networks based on an existing network by inserting identity morphism (IdMorph) transformations between the neural network layers. An IdMorph transformation is function-preserving and can be classified into two types, depth and width IdMorph (shown in Figure 12), which makes it possible to replace the original model with an equivalent model that is deeper or wider.

Figure 12: Net2DeeperNet and Net2WiderNet transformations in [20]. "IdMorph" refers to the identity morphism operation. The value on each edge indicates the weight.

However, IdMorph is limited to width and depth changes, and can only modify them separately; moreover, the sparsity of its identity layer can create problems [2]. Therefore, an improved method has been proposed, namely network morphism [21], which allows a child network to inherit all knowledge from its well-trained parent network and continue to grow into a more robust network within a shortened training time. Compared with Net2Net, network morphism exhibits the following advantages: 1) it can embed nonidentity layers and handle arbitrary nonlinear activation functions, and 2) it can simultaneously perform depth, width, and kernel-size morphing in a single operation, whereas Net2Net has to consider depth and width changes separately. The experimental results in [21] show that network morphism can substantially accelerate the training process, as it uses one-fifteenth of the training time and achieves better results than the original VGG16.

Several subsequent studies [27, 22, 136, 137, 138, 139, 140, 141] are based on network morphism. For instance, Jin et al. [22] proposed a framework that enables Bayesian optimization to guide the network morphism for an efficient neural architecture search. Wei et al. [136] further improved network morphism at a higher level, i.e., by morphing a convolutional layer into an arbitrary module of a neural network. Additionally, Tan and Le [142] proposed EfficientNet, which re-examines the effect of model scaling on convolutional neural networks, and proved that carefully balancing the network depth, width, and resolution can lead to better performance.
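To illustrate the function-preserving idea behind width morphism, the sketch below widens one hidden layer of a tiny network by duplicating a unit and splitting its outgoing weights, so the new, wider network computes exactly the same function. This is a simplified illustration of the width-IdMorph idea, not the full Net2Net or network morphism procedure; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
x = torch.rand(5, 4)
y_before = net(x)

def widen(fc_in: nn.Linear, fc_out: nn.Linear, unit: int):
    """Duplicate one hidden unit and halve its outgoing weights (width IdMorph idea)."""
    new_in = nn.Linear(fc_in.in_features, fc_in.out_features + 1)
    new_out = nn.Linear(fc_out.in_features + 1, fc_out.out_features)
    with torch.no_grad():
        # copy the original incoming weights, then append a copy of row `unit`
        new_in.weight[:] = torch.cat([fc_in.weight, fc_in.weight[unit:unit + 1]], dim=0)
        new_in.bias[:] = torch.cat([fc_in.bias, fc_in.bias[unit:unit + 1]])
        # split the outgoing weights of `unit` between the original and the copy
        w = fc_out.weight.clone()
        w[:, unit] /= 2
        new_out.weight[:] = torch.cat([w, w[:, unit:unit + 1]], dim=1)
        new_out.bias[:] = fc_out.bias
    return new_in, new_out

net[0], net[2] = widen(net[0], net[2], unit=1)
y_after = net(x)
print(torch.allclose(y_before, y_after, atol=1e-6))   # True: the function is preserved
```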
4.2. Architecture Optimization

After defining the search space, we need to search for the best-performing architecture, a process we call architecture optimization (AO). Traditionally, the architecture of a neural network is regarded as a set of static hyperparameters that are tuned based on the performance observed on the validation set. However, this process highly depends on human experts and requires considerable time and resources for trial and error. Therefore, many AO methods have been proposed to free humans from this tedious procedure and to search for novel architectures automatically. Below, we detail the commonly used AO methods.

4.2.1. Evolutionary Algorithm

The evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm that takes inspiration from biological evolution. Compared with traditional optimization algorithms such as exhaustive methods, EA is a mature global optimization method with high robustness and broad applicability. It can effectively address complex problems that traditional optimization algorithms struggle to solve, without being limited by the nature of the problem.

Figure 13: Overview of the evolutionary algorithm.

Encoding Scheme. Different EAs may use different types of encoding schemes for network representation. There are two types of encoding schemes: direct and indirect.

Direct encoding is a widely used method that explicitly specifies the phenotype. For example, genetic CNN [30] encodes the network structure into a fixed-length binary string, e.g., 1 indicates that two nodes are connected, and vice versa. Although binary encoding can be performed easily, its computational space is the square of the number of nodes, which is fixed-length, i.e., predefined manually.
For representing variable-length neural networks, DAG encoding is a promising solution [28, 25, 19]. For example, Suganuma et al. [28] used the Cartesian genetic programming (CGP) [143, 144] encoding scheme to represent a neural network built from a list of sub-modules that are defined as a DAG. Similarly, in [25], the neural architecture is also encoded as a graph, whose vertices indicate rank-3 tensors or activations (with batch normalization performed with rectified linear units (ReLUs) or plain linear units) and whose edges indicate identity connections or convolutions. Neuroevolution of augmenting topologies (NEAT) [24, 25] also uses a direct encoding scheme, where every node and connection is stored.

Indirect encoding specifies a generation rule to build the network and allows for a more compact representation. Cellular encoding (CE) [145] is an example of a system that utilizes indirect encoding of network structures. It encodes a family of neural networks into a set of labeled trees and is based on a simple graph grammar. Some recent studies [146, 147, 148, 27] have described the use of indirect encoding schemes to represent a network. For example, the network in [27] can be encoded by a function, and each network can be modified using function-preserving network morphism operators. Hence, the child network has increased capacity and is guaranteed to perform at least as well as the parent networks.

Four Steps. A typical EA comprises the following steps: selection, crossover, mutation, and update (Figure 13):

• Selection. This step involves selecting a portion of the networks from all generated networks for the crossover, which aims to maintain well-performing neural architectures while eliminating the weak ones. The following three strategies are adopted for network selection (a small sketch of the first and third strategies is given after this list). The first is fitness selection, in which the probability of a network being selected is proportional to its fitness value, i.e., P(h_i) = Fitness(h_i) / Σ_{j=1}^{N} Fitness(h_j), where h_i indicates the i-th network. The second is rank selection, which is similar to fitness selection, but with the network's selection probability being proportional to its relative fitness rather than its absolute fitness. The third method is tournament selection [25, 27, 26, 19]. Here, in each iteration, k (tournament size) networks are randomly selected from the population and sorted according to their performance; then, the best network is selected with a probability of p, the second-best network with a probability of p × (1 − p), and so on.

• Crossover. After selection, every two networks are selected to generate a new offspring network, inheriting half of the genetic information of each of its parents. This process is analogous to the genetic recombination that occurs during biological reproduction and crossover. The particular manner of crossover varies and depends on the encoding scheme. In binary encoding, networks are encoded as a linear string of bits, where each bit represents a unit, such that two parent networks can be combined through one- or multiple-point crossover. However, the crossover of

the data arranged in such a fashion can sometimes action At: sample an architecture
damage the data. Thus, Xie et al. [30] denoted the
basic unit in a crossover as a stage rather than a
bit, which is a higher-level structure constructed by
Controller
a binary string. For cellular encoding, a randomly se- (RNN)
Environment
lected sub-tree is cut from one parent tree to replace
a sub-tree cut from the other parent tree. In another
approach, NEAT performs an artificial synapsis based
Rt+1
on historical markings, adding a new structure with- reward Rt
out losing track of the gene present throughout the St+1
state St
simulation.
• Mutation As the genetic information of the parents Figure 14: Overview of neural architecture search using
is copied and inherited by the next generation, gene reinforcement learning.
mutation also occurs. A point mutation [28, 30] is
one of the most widely used operations and involves
randomly and independently flipping each bit. Two results (such as accuracy) are returned. Many follow-up
types of mutations have been described in [29]: one approaches [23, 15, 16, 13] have used this framework, but
enables or disables a connection between two lay- with different controller policies and neural-architecture
ers, and the other adds or removes skip connections encoding. Zoph et al. [12] first used the policy gradient
between two nodes or layers. Meanwhile, Real and algorithm [150] to train the controller, and sequentially
Moore et al. [25] predefined a set of mutation opera- sampled a string to encode the entire neural architecture.
tors, such as altering the learning rate and removing In a subsequent study [15], they used the proximal policy
skip connections between the nodes. By analogy optimization (PPO) algorithm [151] to update the con-
with the biological process, although a mutation may troller, and proposed the method shown in Figure 15 to
ap- pear as a mistake that causes damage to the build a cell-based neural architecture. MetaQNN [23] is a
network structure and leads to a loss of functionality, meta-modeling algorithm using Q-learning with an ϵ-
it also enables the exploration of more novel greedy exploration strategy and experience replay to
structures and ensures diversity. sequentially search for neural architectures.

• Update Many new networks are generated by completing the above steps, and considering the limitations on computational resources, some of these must be removed. In [25], the worst-performing network of two randomly selected networks is immediately removed from the population. Alternatively, in [26], the oldest networks are removed. Other methods [29, 30, 28] discard all models at regular intervals. However, Liu et al. [19] did not remove any network from the population, and instead allowed the network number to grow with time. Zhu et al. [149] regulated the population number through a variable λ, i.e., removed the worst model with probability λ and the oldest model with 1 − λ.
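To make the crossover, mutation, and update operations described above concrete, the following toy Python sketch strings them together into a simple evolutionary search loop. It is our own illustration rather than code from any surveyed method: the genome encoding, the operation count, and the fitness function (which would normally train and validate the decoded network) are all placeholder assumptions.

import random

NUM_OPS, GENOME_LEN, POP_SIZE = 5, 10, 20

def random_genome():
    # an architecture encoded as a fixed-length string of operation indices
    return [random.randrange(NUM_OPS) for _ in range(GENOME_LEN)]

def crossover(a, b):
    # one-point crossover on the encoding
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def point_mutation(genome, p=0.1):
    # independently perturb each position with probability p
    return [random.randrange(NUM_OPS) if random.random() < p else g for g in genome]

def fitness(genome):
    # placeholder: in practice, decode the genome, train the network,
    # and return its validation accuracy
    return random.random()

population = [(g, fitness(g)) for g in (random_genome() for _ in range(POP_SIZE))]
for step in range(100):
    (g1, _), (g2, _) = random.sample(population, 2)        # pick two parents
    child = point_mutation(crossover(g1, g2))
    population.append((child, fitness(child)))
    # update: remove the worse of two randomly selected individuals (as in [25])
    i, j = random.sample(range(len(population)), 2)
    population.pop(i if population[i][1] <= population[j][1] else j)

best_genome, best_fitness = max(population, key=lambda t: t[1])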

4.2.2. Reinforcement Learning


Zoph et al. [12] were among the first to apply reinforcement learning (RL) to neural architecture search. Figure 14 presents an overview of an RL-based NAS algorithm. Here, the controller is usually a recurrent neural network (RNN) that executes an action At at each step t to sample a new architecture from the search space, and receives an observation of the state St together with a reward scalar Rt from the environment to update the controller's sampling strategy. Environment refers to the use of a standard neural network training procedure to train and evaluate the network generated by the controller, after which the corresponding results (such as accuracy) are returned. Many follow-up approaches [23, 15, 16, 13] have used this framework, but with different controller policies and neural-architecture encoding. Zoph et al. [12] first used the policy gradient algorithm [150] to train the controller, and sequentially sampled a string to encode the entire neural architecture. In a subsequent study [15], they used the proximal policy optimization (PPO) algorithm [151] to update the controller, and proposed the method shown in Figure 15 to build a cell-based neural architecture. MetaQNN [23] is a meta-modeling algorithm using Q-learning with an ϵ-greedy exploration strategy and experience replay to sequentially search for neural architectures.

Figure 15: Example of a controller generating a cell structure. Each block in the cell comprises two nodes that are specified with different operations and inputs. The indices −2 and −1 indicate that the inputs are derived from the prev-previous and previous cell, respectively.
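As a concrete illustration of the controller/environment loop in Figure 14, below is a minimal REINFORCE-style sketch in PyTorch. It is our own simplified example rather than the implementation of [12]: the sequence length, the number of candidate operations, and evaluate_architecture (which would train and validate the sampled child network to obtain the reward) are assumed placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_OPS, SEQ_LEN, HIDDEN = 5, 8, 64

class Controller(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTMCell(NUM_OPS, HIDDEN)
        self.fc = nn.Linear(HIDDEN, NUM_OPS)

    def sample(self):
        h, c = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)
        inp = torch.zeros(1, NUM_OPS)
        log_probs, actions = [], []
        for _ in range(SEQ_LEN):                       # one action per decision step
            h, c = self.rnn(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.fc(h))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            actions.append(action.item())
            inp = F.one_hot(action, NUM_OPS).float()   # feed the choice back in
        return actions, torch.stack(log_probs).sum()

def evaluate_architecture(actions):
    # placeholder reward: train the decoded child network and return val accuracy
    return torch.rand(1).item()

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0
for step in range(10):
    actions, log_prob = controller.sample()
    reward = evaluate_architecture(actions)
    baseline = 0.9 * baseline + 0.1 * reward           # moving-average baseline
    loss = -(reward - baseline) * log_prob             # REINFORCE policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()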

Although the above RL-based algorithms have achieved SOTA results on the CIFAR-10 and Penn Treebank (PTB) [152] datasets, they incur considerable time and computational resources. For instance, the authors in [12] took 28 days and 800 K40 GPUs to search for the best-performing architecture, and MetaQNN [23] also took 10 days and 10 GPUs to complete its search. To this end, some improved RL-based algorithms have been proposed. BlockQNN [16] uses a distributed asynchronous framework and an early-stop strategy to complete searching on only one GPU within 20 hours. The efficient neural architecture search (ENAS) [13] is even better, as it adopts a parameter-sharing strategy in which all child architectures are regarded as sub-graphs of a supernet; this enables these architectures to share parameters, obviating the need to train each child model from scratch. Thus, ENAS took only approximately 10 hours using one GPU to search for the best architecture on the CIFAR-10 dataset, which is
nearly 1000× faster than [12].
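The parameter-sharing idea can be sketched in a few lines: every candidate architecture is a path through a single supernet, so child models reuse the same operation weights instead of being trained from scratch. The snippet below is our own simplified illustration (with an assumed candidate set of three convolutions per layer), not ENAS code.

import random
import torch
import torch.nn as nn

class SuperNetLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # candidate operations whose weights are shared by all child architectures
        self.candidates = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)]
        )

    def forward(self, x, choice):
        return self.candidates[choice](x)    # only the chosen operation is applied

layers = nn.ModuleList([SuperNetLayer(16) for _ in range(4)])
x = torch.randn(2, 16, 8, 8)
arch = [random.randrange(3) for _ in range(len(layers))]    # a sampled child
out = x
for layer, choice in zip(layers, arch):
    out = layer(out, choice)                 # different children reuse these weights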
4.2.3. Gradient Descent

The above-mentioned search strategies sample neural architectures from a discrete search space. A pioneering algorithm, namely DARTS [17], was among the first gradient descent (GD)-based methods to search for neural architectures over a continuous and differentiable search space by using a softmax function to relax the discrete space, as outlined below:

\bar{o}_{i,j}(x) = \sum_{k=1}^{K} \frac{\exp(\alpha_{i,j}^{k})}{\sum_{l=1}^{K} \exp(\alpha_{i,j}^{l})} \, o^{k}(x)    (4)

where o^{k}(x) indicates the operation o^{k} performed on input x, \alpha_{i,j}^{k} indicates the weight assigned to the operation o^{k} between a pair of nodes (i, j), and K is the number of predefined candidate operations. After the relaxation, the task of searching for architectures is transformed into a joint optimization of the neural architecture α and the weights of this neural architecture θ. These two types of parameters are optimized alternately, indicating a bilevel optimization problem. Specifically, α and θ are optimized with the validation and the training sets, respectively. The training and the validation losses are denoted by L_{train} and L_{val}, respectively. Hence, the total loss function can be derived as follows:

\min_{\alpha} L_{val}(\theta^{*}, \alpha) \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta} L_{train}(\theta, \alpha)    (5)

Figure 16 presents an overview of DARTS, where a cell is composed of N (here N = 4) ordered nodes, and the node z_k (k starts from 0) is connected to the node z_i, i ∈ {k + 1, ..., N}. The operation on each edge e_{i,j} is initially a mixture of candidate operations, each being of equal weight. Therefore, the neural architecture α is a supernet that contains all possible child neural architectures. At the end of the search, the final architecture is derived by retaining only the maximum-weight operation among all mixed operations.

Although DARTS substantially reduces the search time, it incurs several problems. First, as Eq. 5 shows, DARTS describes the joint optimization of the neural architecture and weights as a bilevel optimization problem. However, this problem is difficult to solve directly, because both the architecture α and the weights θ are high-dimensional parameters. Another solution is single-level optimization, which can be formalized as

\min_{\theta, \alpha} L_{train}(\theta, \alpha)    (6)

which optimizes both the neural architecture and the weights together. Although the single-level optimization problem can be efficiently solved as a regular training, the searched architecture α commonly overfits the training set and its performance on the validation set cannot be guaranteed. The authors in [153] proposed mixed-level optimization:

\min_{\alpha, \theta} \left[ L_{train}(\theta^{*}, \alpha) + \lambda L_{val}(\theta^{*}, \alpha) \right]    (7)

where α indicates the neural architecture, θ is the weight assigned to it, and λ is a non-negative regularization variable that controls the weights of the training loss and the validation loss. When λ = 0, Eq. 7 reduces to single-level optimization (Eq. 6); in contrast, Eq. 7 becomes a bilevel optimization (Eq. 5). The experimental results presented in [153] showed that mixed-level optimization not only overcomes the overfitting issue of single-level optimization but also avoids the gradient error of bilevel optimization.

Second, in DARTS, the output of each edge is the weighted sum of all candidate operations (shown in Eq. 4) during the whole search stage, which leads to a linear increase in the requirements of GPU memory with the number of candidate operations. To reduce resource consumption, many subsequent studies [154, 155, 153, 156, 131] have developed a differentiable sampler to sample a child architecture from the supernet by using a reparameterization trick, namely Gumbel Softmax [157]. The neural architecture is fully factorized and modeled with a concrete distribution [158], which provides an efficient approach to sampling a child architecture and allows gradient backpropagation. Therefore, Eq. 4 is re-formulated as

\bar{o}_{i,j}(x) = \sum_{k=1}^{K} \frac{\exp\big((\log \alpha_{i,j}^{k} + G_{i,j}^{k})/\tau\big)}{\sum_{l=1}^{K} \exp\big((\log \alpha_{i,j}^{l} + G_{i,j}^{l})/\tau\big)} \, o^{k}(x)    (8)

where G_{i,j}^{k} = −log(−log(u_{i,j}^{k})) is the k-th Gumbel sample, u_{i,j}^{k} is a uniform random variable, and τ is the Softmax
temperature. When τ → 0, the possibility distribution of all operations between each node pair approximates a one-hot distribution. In GDAS [154], only the operation with the maximum possibility for each edge is selected during the forward pass, while the gradient is backpropagated according to Eq. 8. In other words, only one path of the supernet is selected for training, thereby reducing the GPU memory usage. Besides, ProxylessNAS [132] alleviates the huge resource consumption through path binarization. Specifically, it transforms the real-valued path weights [17] to binary gates, which activates only one path of the mixed operations, and hence, solves the memory issue.

Another problem is the optimization of different operations together, as they may compete with each other, leading to a negative influence. For example, several studies [159, 128] have found that the skip-connect operation dominates at a later search stage in DARTS, which causes the network to be shallower and leads to a marked deterioration in performance. To solve this problem, DARTS+ [159] uses an additional early-stop criterion, such that when two or more skip-connects occur in a normal cell, the search process stops. In another example, P-DARTS [128] regularizes the search space by executing operation-level dropout to control the proportion of skip-connect operations occurring during training and evaluation.

Figure 16: Overview of DARTS. (a) The data can only flow from lower-level nodes to higher-level nodes, and the operations on edges are initially unknown. (b) The initial operation on each edge is a mixture of candidate operations, each having equal weight. (c) The weight of each operation is learnable and ranges from 0 to 1, but for previous discrete sampling methods, the weight could only be 0 or 1. (d) The final neural architecture is constructed by preserving the maximum-weight operation on each edge.
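As an illustration of the relaxed search space in Eqs. 4 and 8, the following is a minimal PyTorch-style sketch of a mixed operation on one edge. It is our own simplified example (with an assumed three-operation candidate set), not the official DARTS or GDAS implementation, and it treats the architecture parameters α directly as softmax logits.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # K candidate operations on one edge (an assumed, simplified set)
        self.ops = nn.ModuleList([
            nn.Identity(),                                 # skip-connect
            nn.Conv2d(channels, channels, 3, padding=1),   # 3x3 convolution
            nn.MaxPool2d(3, stride=1, padding=1),          # 3x3 max pooling
        ])
        # architecture parameters alpha_{i,j}^k, one per candidate operation
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x, tau=None):
        if tau is None:
            weights = F.softmax(self.alpha, dim=-1)            # Eq. 4 (DARTS)
        else:
            weights = F.gumbel_softmax(self.alpha, tau=tau)    # Eq. 8 (Gumbel-Softmax)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(channels=16)
x = torch.randn(2, 16, 8, 8)
y_darts = edge(x)             # weighted sum over all candidate operations
y_gumbel = edge(x, tau=1.0)   # differentiable sample from the concrete distribution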
4.2.4. Surrogate Model-based Optimization

Another group of architecture optimization methods is surrogate model-based optimization (SMBO) algorithms [33, 34, 160, 161, 162, 163, 164, 165, 166, 18]. The core concept of SMBO is that it builds a surrogate model of the objective function by iteratively keeping a record of past evaluation results, and uses the surrogate model to predict the most promising architecture. Thus, these methods can substantially shorten the search time and improve efficiency.

SMBO algorithms differ in the surrogate models, which can be broadly divided into Bayesian optimization (BO) methods (including Gaussian process (GP) [167], random forest (RF) [37], and tree-structured Parzen estimator (TPE) [168]) and neural networks [164, 169, 18, 166].

BO [170, 171] is one of the most popular methods for hyperparameter optimization. Many recent studies [33, 34, 160, 161, 162, 163, 164, 165] have attempted to apply these SOTA BO methods to AO. For example, in [172, 173, 160, 165, 174, 175], the validation results of the generated neural architectures were modeled as a Gaussian process, which guides the search for the optimal neural architectures. However, in GP-based BO methods, the inference time scales cubically in the number of observations, and they cannot effectively handle variable-length neural networks. Camero et al. [176] proposed three fixed-length encoding schemes to cope with variable-length problems by using RF as the surrogate model. Similarly, both [33] and [176] used RF as a surrogate model, and [177] showed that it works better in settings of high dimensionality than GP-based methods. Instead of using BO, some studies have used a neural network as the surrogate model. For example, in PNAS [18] and EPNAS [166], an LSTM is derived as the surrogate model to progressively predict variable-sized architectures. Meanwhile, NAO [169] uses a simpler surrogate model, i.e., a multilayer perceptron (MLP), and NAO is more efficient and achieves better results on CIFAR-10 than does PNAS [18]. White et al. [164] trained an ensemble of neural networks to predict the mean and variance of the validation results for candidate neural architectures.

4.2.5. Grid and Random Search

Both grid search (GS) and random search (RS) are simple optimization methods applied in several NAS studies [178, 179, 180, 11]. For instance, Geifman et al. [179] proposed a modular architecture search space (A = {A(B, i, j) | i ∈ {1, 2, ..., Ncells}, j ∈ {1, 2, ..., Nblocks}}) that is spanned by the grid defined by the two corners A(B, 1, 1) and A(B, Ncells, Nblocks), where B is a searched block structure. Evidently, a larger value of Ncells × Nblocks leads to the exploration of a larger space, but requires more resources. The authors in [180] conducted an effectiveness comparison between SOTA NAS methods and RS. The results showed that RS is a competitive NAS baseline. Specifically, RS with an early-stopping strategy performs as well as ENAS [13], which is a leading RL-based NAS method. Besides, Yu et al. [11] demonstrated that the SOTA NAS techniques are not significantly better than random search.

4.2.6. Hybrid Optimization Method

The abovementioned architecture optimization methods have their own advantages and disadvantages. 1) EA is a mature global optimization method with high robustness. However, it requires considerable computational resources [26, 25], and its evolution operations (such as crossover and mutation) are performed randomly. 2) Although RL-based methods (e.g., ENAS [13]) can learn complex architectural patterns, the searching efficiency and stability of the RL controller are not guaranteed, because it may take several
actions to obtain a positive reward. 3) The GD-based methods (e.g., DARTS [17]) substantially improve the searching efficiency by relaxing the categorical candidate operations to continuous variables. Nevertheless, in essence, they all search for a child network from a supernet, which limits the diversity of neural architectures. Therefore, some methods have been proposed to incorporate different optimization methods to capture the best of their advantages; these methods are summarized as follows.

EA+RL. Chen et al. [42] integrated reinforced mutations into an EA, which avoids the randomness of evolution and improves the searching efficiency. Another similar method developed in parallel is the evolutionary-neural hybrid controller (Evo-NAS) [41], which also captures the merits of both RL-based methods and EA. The Evo-NAS
controller’s mutations are guided by an RL-trained neural
network, which can explore a vast search space and
sample architectures efficiently.
EA+GD. Yang et al. [40] combined the EA and GD-
based method. The architectures share parameters within
one supernet and are tuned on the training set with a few
epochs. Then, the populations and the supernet are di-
rectly inherited in the next generation, which substantially
accelerates the evolution. The authors in [40] only took 0.4
GPU days for searching, which is more efficient than early
EA methods (e.g., AmoebaNet [26] took 3150 GPU days
and 450 GPUs for searching).
EA+SMBO. The authors in [43] used RF as a surro-
gate to predict model performance, which accelerates the
fitness evaluation in EA.
GD+SMBO. Unlike DARTS, which learns weights
for candidate operations, NAO [169] proposes a variational
autoencoder to generate neural architectures and further
build a regression model as a surrogate to predict the
performance of the generated architecture. The
encoder maps the representations of the neural
architecture to continuous space, and then a predictor
network takes the continuous representations of the
neural architecture as input and predicts the
corresponding accuracy. Finally, the decoder is used to
derive the final architecture from a continuous network
representation.

4.3. Hyperparameter Optimization


Most NAS methods use the same set of hyperparameters
for all candidate architectures during the whole search stage;
thus, after finding the most promising neural architecture,
it is necessary to redesign a hyperparameter set and use
it to retrain or fine-tune the architecture. As some HPO
methods (such as BO and RS) have also been applied in
NAS, we will only briefly introduce these methods here.

4.3.1. Grid and Random Search


Figure 17 shows the difference between grid search (GS) and random search (RS): GS divides the search space into regular intervals and selects the best-performing point after evaluating all points, while RS selects the best point from a set of randomly drawn points.

Figure 17: Examples of grid search (left) and random search (right) in nine trials for optimizing a two-dimensional function f(x, y) = g(x) + h(y) ≈ g(x) [181]. The parameter in g(x) (light-blue part) is relatively important, while that in h(y) (light-yellow part) is not. In a grid search, nine trials cover only three important parameter values, whereas random search can explore nine distinct values of g. Therefore, random search is more likely to find the optimal combination of parameters than grid search (the figure is adopted from [181]).
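The contrast in Figure 17 can be reproduced with a few lines of Python. This toy example is our own (the function below is an assumed stand-in for f(x, y) = g(x) + h(y)); it simply shows that nine random trials probe nine distinct values of the important parameter, whereas a 3×3 grid probes only three.

import random

def f(x, y):
    # g(x) dominates the objective; h(y) contributes almost nothing
    return -(x - 0.3) ** 2 + 0.001 * y

grid_trials = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
random_trials = [(random.random(), random.random()) for _ in range(9)]

best_grid = max(grid_trials, key=lambda p: f(*p))       # only 3 distinct x values
best_random = max(random_trials, key=lambda p: f(*p))   # 9 distinct x values
print("grid search best:", best_grid, "random search best:", best_random)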

GS is very simple and naturally supports parallel implementation; however, it is computationally expensive and inefficient when the hyperparameter space is very large, as the number of trials grows exponentially with the dimensionality of the hyperparameters. To alleviate this problem, Hsu et al. [182] proposed a coarse-to-fine grid search, in which a coarse grid is first inspected to locate a good region, and then a finer grid search is implemented on the identified region. Similarly, Hesterman et al. [183] proposed a contracting GS algorithm, which first computes the likelihood of each point in the grid, and then generates a new grid centered on the maximum-likelihood value. The point separation in the new grid is reduced to half that of the old grid. The above procedure is iterated until the results converge to a local optimum.

Although the authors in [181] empirically and theoretically showed that RS is more practical and efficient than GS, RS does not promise an optimum value. This means that although a longer search increases the probability of finding optimal hyperparameters, it consumes more resources. Li and Jamieson et al. [184] proposed the hyperband algorithm to create a tradeoff between the performance of the hyperparameters and the resource budget. Hyperband allocates limited resources (such as time or CPUs) to only the most promising hyperparameter configurations, by successively discarding the worst half of the configuration settings long before the training process is finished.
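A minimal sketch of the successive-halving idea at the core of hyperband is given below. It is our own illustration under simplified assumptions (a fixed halving rate and a single bracket), and partial_train is a placeholder that would train a configuration for the given budget and return its validation score.

import random

def partial_train(config, budget):
    # placeholder: train `config` for `budget` epochs and return validation score
    return random.random()

def successive_halving(configs, min_budget=1, max_budget=16):
    budget = min_budget
    while len(configs) > 1 and budget <= max_budget:
        scored = [(partial_train(c, budget), c) for c in configs]
        scored.sort(key=lambda t: t[0], reverse=True)
        configs = [c for _, c in scored[: max(1, len(configs) // 2)]]  # keep the best half
        budget *= 2                                                    # give survivors more budget
    return configs[0]

configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
best_config = successive_halving(configs)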

4.3.2. Bayesian Optimization


Bayesian optimization (BO) is an efficient method for
the global optimization of expensive blackbox
functions. In this section, we briefly introduce BO. For
an in-depth discussion on BO, we recommend readers
to refer to the excellent surveys conducted in [171, 170,
185, 186].
BO is an SMBO method that builds a probabilistic

model mapping from the hyperparameters to the objective metrics evaluated on the validation set. It well balances exploration (evaluating as many hyperparameter sets as possible) and exploitation (allocating more resources to promising hyperparameters).

Algorithm 1 Sequential Model-Based Optimization
INPUT: f, Θ, S, M
D ← INITSAMPLES(f, Θ)
for i in [1, 2, ..., T] do
    p(y|θ, D) ← FITMODEL(M, D)
    θi ← arg max_{θ∈Θ} S(θ, p(y|θ, D))
    yi ← f(θi)        ▷ Expensive step
    D ← D ∪ (θi, yi)
end for

The steps of SMBO are expressed in Algorithm 1 (adopted from [170]). Here, several inputs need to be predefined initially, including an evaluation function f, a search space Θ, an acquisition function S, a probabilistic model M, and a record dataset D. Specifically, D is a dataset that records many sample pairs (θi, yi), where θi ∈ Θ indicates a sampled neural architecture and yi indicates its evaluation result. After the initialization, the SMBO steps are described as follows:

1. The first step is to tune the probabilistic model M to fit the record dataset D.
2. The acquisition function S is used to select the next promising neural architecture from the probabilistic model M.
3. The performance of the selected neural architecture is evaluated by f, which is an expensive step as it involves training the neural network on the training set and evaluating it on the validation set.
4. The record dataset D is updated by appending a new pair of results (θi, yi).

The above four steps are repeated T times, where T needs to be specified according to the total time or resources available. The commonly used surrogate models for the BO method are GP, RF, and TPE. Table 2 summarizes the existing open-source BO libraries, where GP is one of the most popular surrogate models. However, GP scales cubically with the number of data samples, while RF can natively handle large spaces and scales better to many data samples. Besides, Falkner and Klein et al. [38] proposed the BO-based hyperband (BOHB) algorithm, which combines the strengths of TPE-based BO and hyperband, and hence, performs much better than standard BO methods. Furthermore, FABOLAS [35] is a faster BO procedure, which maps the validation loss and training time as functions of dataset size, i.e., trains a generative model on a sub-dataset that gradually increases in size. Here, FABOLAS is 10-100 times faster than other SOTA BO algorithms and identifies the most promising hyperparameters.

Table 2: Open-source Bayesian optimization libraries. GP, RF, and TPE represent Gaussian process [167], random forest [37], and tree-structured Parzen estimator [168], respectively.

Library   | Model | URL
Spearmint | GP    | https://github.com/HIPS/Spearmint
MOE       | GP    | https://github.com/Yelp/MOE
PyBO      | GP    | https://github.com/mwhoffman/pybo
Bayesopt  | GP    | https://github.com/rmcantin/bayesopt
SkGP      | GP    | https://scikit-optimize.github.io
GPyOpt    | GP    | http://sheffieldml.github.io/GPyOpt
SMAC      | RF    | https://github.com/automl/SMAC3
Hyperopt  | TPE   | http://hyperopt.github.io/hyperopt
BOHB      | TPE   | https://github.com/automl/HpBandSter

4.3.3. Gradient-based Optimization

Another group of HPO methods comprises gradient-based optimization (GO) algorithms [187, 188, 189, 190, 191, 192]. Unlike the above blackbox HPO methods (e.g., GS, RS, and BO), GO methods use the gradient information to optimize the hyperparameters and substantially improve the efficiency of HPO. Maclaurin et al. [189] proposed a reversible-dynamics memory-tape approach to handle thousands of hyperparameters efficiently through the gradient information. However, optimizing many hyperparameters is computationally challenging. To alleviate this issue, the authors in [190] used approximate gradient information rather than the true gradient to optimize continuous hyperparameters, where the hyperparameters can be updated before the model is trained to converge. Franceschi et al. [191] studied both reverse- and forward-mode GO methods. The reverse-mode method differs from the method proposed in [189] and does not require reversible dynamics; however, it needs to store the entire training history for computing the gradient with respect to the hyperparameters. The forward-mode method overcomes this problem by updating the hyperparameters in real time, and is demonstrated to significantly improve the efficiency of HPO on large datasets. Chandra [192] proposed a gradient-based ultimate optimizer, which can optimize not only the regular hyperparameters (e.g., the learning rate) but also the hyperparameters of the optimizer itself (e.g., the Adam optimizer [193]'s moment coefficients β1, β2).
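The reverse-mode idea can be illustrated by unrolling a few inner optimization steps and differentiating the validation loss with respect to a continuous hyperparameter. The sketch below is our own toy example (an L2-regularization strength for ridge regression, optimized through unrolled SGD in PyTorch), not the implementation of any of the methods above.

import torch

torch.manual_seed(0)
X_tr, y_tr = torch.randn(64, 10), torch.randn(64, 1)
X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

log_lam = torch.zeros(1, requires_grad=True)         # hyperparameter (log of L2 strength)
outer_opt = torch.optim.Adam([log_lam], lr=0.05)

for outer_step in range(20):
    w = torch.zeros(10, 1, requires_grad=True)        # inner model weights
    for inner_step in range(10):                       # unrolled inner SGD
        train_loss = ((X_tr @ w - y_tr) ** 2).mean() + log_lam.exp() * (w ** 2).sum()
        (grad_w,) = torch.autograd.grad(train_loss, w, create_graph=True)
        w = w - 0.1 * grad_w                           # keep the graph for the hypergradient
    val_loss = ((X_val @ w - y_val) ** 2).mean()
    outer_opt.zero_grad()
    val_loss.backward()                                # d val_loss / d log_lam through the unrolled steps
    outer_opt.step()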
5. Model Evaluation

Once a new neural network has been generated, its performance must be evaluated. An intuitive method is to train the network to convergence and then evaluate its performance. However, this method requires extensive time and computing resources. For example, [12] took 800 K40 GPUs and 28 days in total to search. Additionally, NASNet [15] and AmoebaNet [26] required 500 P100 GPUs and 450 K40 GPUs, respectively. In this section, we summarize several algorithms for accelerating the process of model evaluation.

5.1. Low fidelity

As model training time is highly related to the dataset and model size, model evaluation can be accelerated in different ways. First, the number of images or the resolution of images (in terms of image-classification tasks) can be decreased. For example, FABOLAS [35] trains the model on a subset of the training set to accelerate model evaluation. In [194], ImageNet64×64 and its variants 32×32 and 16×16 are provided, and these lower-resolution datasets can retain characteristics similar to those of the original ImageNet dataset. Second, low-fidelity model evaluation can be realized by reducing the model size, such as by training with fewer filters per layer [15, 26]. By analogy to ensemble learning, [195] proposes the Transfer Series Expansion (TSE), which constructs an ensemble estimator by linearly combining a series of basic low-fidelity estimators, hence avoiding the bias that can derive from using a single low-fidelity estimator. Furthermore, Zela et al. [34] empirically demonstrated that there is only a weak correlation between performance after short and long training times, thus confirming that a prolonged search for network configurations is unnecessary.

5.2. Weight sharing

In [12], once a network has been evaluated, it is dropped. Hence, the technique of weight sharing is used to accelerate the process of NAS. For example, Wong and Lu et al. [196] proposed transfer neural AutoML, which uses knowledge from prior tasks to accelerate network design. ENAS [13] shares parameters among child networks, leading to a thousand-fold faster network design than [12]. Network morphism-based algorithms [20, 21] can also inherit the weights of previous architectures, and single-path NAS [197] uses a single-path over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters.

5.3. Surrogate

The surrogate-based method [198, 199, 200, 43] is another powerful tool that approximates the black-box function. In general, once a good approximation has been obtained, it is trivial to find the configurations that directly optimize the original expensive objective. For example, Progressive Neural Architecture Search (PNAS) [18] introduces a surrogate model to control the method of searching. Although ENAS has been proven to be very efficient, PNAS is even more efficient, as the number of models evaluated by PNAS is over five times that evaluated by ENAS, and PNAS is eight times faster in terms of total computational speed. However, a well-performing surrogate usually requires large amounts of labeled architectures, while the optimization space is too large and hard to quantify, and the evaluation of each configuration is extremely expensive [201]. To alleviate this issue, Luo et al. [202] proposed SemiNAS, a semi-supervised NAS method, which leverages large amounts of unlabeled architectures to train the surrogate, a controller that is used to predict the accuracy of architectures without evaluation. Initially, the surrogate is trained with only a small number of labeled data pairs (architecture, accuracy); the generated data pairs are then gradually added to the original data to further improve the surrogate.

5.4. Early stopping

Early stopping was first used to prevent overfitting in classical ML, and it has been used in several recent studies [203, 204, 205] to accelerate model evaluation by stopping evaluations that are predicted to perform poorly on the validation set. For example, [205] proposes a learning-curve model that is a weighted combination of a set of parametric curve models selected from the literature, thereby enabling the performance of the network to be predicted. Furthermore, [206] presents a novel approach for early stopping based on fast-to-compute local statistics of the computed gradients, which no longer relies on the validation set and allows the optimizer to make full use of all of the training data.

6. NAS Discussion

In Section 4, we reviewed the various search space and architecture optimization methods, and in Section 5, we summarized commonly used model evaluation methods. These two sections introduced many NAS studies, which may cause the readers to get lost in details. Therefore, in this section, we summarize and compare these NAS algorithms' performance from a global perspective to provide readers a clearer and more comprehensive understanding of NAS methods' development. Then, we discuss some major topics of the NAS technique.

6.1. NAS Performance Comparison

Many NAS studies have proposed several neural architecture variants, where each variant is designed for a different scenario. For instance, some architecture variants perform better but are larger, while some are lightweight for mobile devices but with a performance penalty. Therefore, we only report the representative results of each study. Besides, to ensure a valid comparison, we consider the accuracy and
tained, it is trivial to find the configurations that directly
to ensure a valid comparison, we consider the accuracy and
optimize the original expensive objective. For example,
28
algorithm efficiency as comparison indices. As the
number

29
Reference | Published in | #Params (Millions) | Top-1 Acc (%) | GPU Days | #GPUs | AO
ResNet-110 [2] ECCV16 1.7 93.57 - - Manually
PyramidNet [207] CVPR17 26 96.69 - - designed
DenseNet [127] CVPR17 25.6 96.54 - -
GeNet#2 (G-50) [30] ICCV17 - 92.9 17 -
Large-scale ensemble [25] ICML17 40.4 95.6 2,500 250
Hierarchical-EAS [19] ICLR18 15.7 96.25 300 200
CGP-ResSet [28] IJCAI18 6.4 94.02 27.4 2
AmoebaNet-B (N=6, F=128)+c/o [26] AAAI19 34.9 97.87 3,150 450 K40 EA
AmoebaNet-B (N=6, F=36)+c/o [26] AAAI19 2.8 97.45 3,150 450 K40
Lemonade [27] ICLR19 3.4 97.6 56 8 Titan
EENA [149] ICCV19 8.47 97.44 0.65 1 Titan Xp
EENA (more channels)[149] ICCV19 54.14 97.79 0.65 1 Titan Xp
NASv3[12] ICLR17 7.1 95.53 22,400 800 K40
NASv3+more filters [12] ICLR17 37.4 96.35 22,400 800 K40
MetaQNN [23] ICLR17 - 93.08 100 10
NASNet-A (7 @ 2304)+c/o [15] CVPR18 87.6 97.60 2,000 500 P100
NASNet-A (6 @ 768)+c/o [15] CVPR18 3.3 97.35 2,000 500 P100
Block-QNN-Connection more filter [16] CVPR18 33.3 97.65 96 32 1080Ti
Block-QNN-Depthwise, N=3 [16] CVPR18 3.3 97.42 96 32 1080Ti RL
ENAS+macro [13] ICML18 38.0 96.13 0.32 1
ENAS+micro+c/o [13] ICML18 4.6 97.11 0.45 1
Path-level EAS [139] ICML18 5.7 97.01 200 -
Path-level EAS+c/o [139] ICML18 5.7 97.51 200 -
ProxylessNAS-RL+c/o[132] ICLR19 5.8 97.70 - -
FPNAS[208] ICCV19 5.76 96.99 - -
DARTS(first order)+c/o[17] ICLR19 3.3 97.00 1.5 4 1080Ti
DARTS(second order)+c/o[17] ICLR19 3.3 97.23 4 4 1080Ti
sharpDARTS [178] ArXiv19 3.6 98.07 0.8 1 2080Ti
P-DARTS+c/o[128] ICCV19 3.4 97.50 0.3 -
P-DARTS(large)+c/o[128] ICCV19 10.5 97.75 0.3 -
SETN[209] ICCV19 4.6 97.31 1.8 -
GD
GDAS+c/o [154] CVPR19 2.5 97.18 0.17 1
SNAS+moderate constraint+c/o [155] ICLR19 2.8 97.15 1.5 1
BayesNAS[210] ICML19 3.4 97.59 0.1 1
ProxylessNAS-GD+c/o[132] ICLR19 5.7 97.92 - -
PC-DARTS+c/o [211] CVPR20 3.6 97.43 0.1 1 1080Ti
MiLeNAS[153] CVPR20 3.87 97.66 0.3 -
SGAS[212] CVPR20 3.8 97.61 0.25 1 1080Ti
GDAS-NSAS[213] CVPR20 3.54 97.27 0.4 -
NASBOT[160] NeurIPS18 - 91.31 1.7 -
PNAS [18] ECCV18 3.2 96.59 225 -
SMBO
EPNAS[166] BMVC18 6.6 96.29 1.8 1
GHN[214] ICLR19 5.7 97.16 0.84 -
NAO+random+c/o[169] NeurIPS18 10.6 97.52 200 200 V100
SMASH [14] ICLR18 16 95.97 1.5 -
Hierarchical-random [19] ICLR18 15.7 96.09 8 200
RS
RandomNAS [180] UAI19 4.3 97.15 2.7 -
DARTS - random+c/o [17] ICLR19 3.2 96.71 4 1
RandomNAS-NSAS[213] CVPR20 3.08 97.36 0.7 -
NAO+weight sharing+c/o [169] NeurIPS18 2.5 97.07 0.3 1 V100 GD+SMBO
RENASNet+c/o[42] CVPR19 3.5 91.12 1.5 4 EA+RL
CARS[40] CVPR20 3.6 97.38 0.4 - EA+GD

Table 3: Performance of different NAS algorithms on CIFAR-10. The “AO” column indicates the architecture optimization method. The
dash (-) indicates that the corresponding information is not provided in the original paper. “c/o” indicates the use of Cutout [89]. RL,
EA, GD, RS, and SMBO indicate reinforcement learning, evolution-based algorithm, gradient descent, random search, and surrogate model-
based optimization, respectively.

30
Reference | Published in | #Params (Millions) | Top-1/Top-5 Acc (%) | GPU Days | #GPUs | AO
ResNet-152 [2] CVPR16 230 70.62/95.51 - -
PyramidNet [207] CVPR17 116.4 70.8/95.3 - -
SENet-154 [126] CVPR17 - 71.32/95.53 - - Manually
DenseNet-201 [127] CVPR17 76.35 78.54/94.46 - - designed
MobileNetV2 [215] CVPR18 6.9 74.7/- - -
GeNet#2[30] ICCV17 - 72.13/90.26 17 -
AmoebaNet-C(N=4,F=50)[26] AAAI19 6.4 75.7/92.4 3,150 450 K40
Hierarchical-EAS[19] ICLR18 - 79.7/94.8 300 200
EA
AmoebaNet-C(N=6,F=228)[26] AAAI19 155.3 83.1/96.3 3,150 450 K40
GreedyNAS [216] CVPR20 6.5 77.1/93.3 1 -
NASNet-A(4@1056) ICLR17 5.3 74.0/91.6 2,000 500 P100
NASNet-A(6@4032) ICLR17 88.9 82.7/96.2 2,000 500 P100
Block-QNN[16] CVPR18 91 81.0/95.42 96 32 1080Ti
Path-level EAS[139] ICML18 - 74.6/91.9 8.3 -
ProxylessNAS(GPU) [132] ICLR19 - 75.1/92.5 8.3 -
RL
ProxylessNAS-RL(mobile) [132] ICLR19 - 74.6/92.2 8.3 -
MnasNet[130] CVPR19 5.2 76.7/93.3 1,666 -
EfficientNet-B0[142] ICML19 5.3 77.3/93.5 - -
EfficientNet-B7[142] ICML19 66 84.4/97.1 - -
FPNAS[208] ICCV19 3.41 73.3/- 0.8 -
DARTS (searched on CIFAR-10)[17] ICLR19 4.7 73.3/81.3 4 -
sharpDARTS[178] Arxiv19 4.9 74.9/92.2 0.8 -
P-DARTS[128] ICCV19 4.9 75.6/92.6 0.3 -
SETN[209] ICCV19 5.4 74.3/92.0 1.8 -
GDAS [154] CVPR19 4.4 72.5/90.9 0.17 1
SNAS[155] ICLR19 4.3 72.7/90.8 1.5 -
ProxylessNAS-G[132] ICLR19 - 74.2/91.7 - -
BayesNAS[210] ICML19 3.9 73.5/91.1 0.2 1
FBNet[131] CVPR19 5.5 74.9/- 216 -
OFA[217] ICLR20 7.7 77.3/- - - GD
AtomNAS[218] ICLR20 5.9 77.6/93.6 - -
MiLeNAS[153] CVPR20 4.9 75.3/92.4 0.3 -
DSNAS[219] CVPR20 - 74.4/91.54 17.5 4 Titan X
SGAS[212] CVPR20 5.4 75.9/92.7 0.25 1 1080Ti
PC-DARTS [211] CVPR20 5.3 75.8/92.7 3.8 8 V100
DenseNAS[220] CVPR20 - 75.3/- 2.7 -
FBNetV2-L1[221] CVPR20 - 77.2/- 25 8 V100
PNAS-5(N=3,F=54)[18] ECCV18 5.1 74.2/91.9 225 -
PNAS-5(N=4,F=216)[18] ECCV18 86.1 82.9/96.2 225 -
SMBO
GHN[214] ICLR19 6.1 73.0/91.3 0.84 -
SemiNAS[202] CVPR20 6.32 76.5/93.2 4 -
Hierarchical-random[19] ICLR18 - 79.6/94.7 8.3 200
RS
OFA-random[217] CVPR20 7.7 73.8/- - -
RENASNet[42] CVPR19 5.36 75.7/92.6 - - EA+RL
Evo-NAS[41] Arxiv20 - 75.43/- 740 - EA+RL
CARS[40] CVPR20 5.1 75.2/92.5 0.4 - EA+GD

Table 4: Performance of different NAS algorithms on ImageNet. The “AO” column indicates the architecture optimization method. The
dash (-) indicates that the corresponding information is not provided in the original paper. RL, EA, GD, RS, and SMBO indicate
reinforcement learning, evolution-based algorithm, gradient descent, random search, and surrogate model-based optimization,
respectively.

and types of GPUs used vary for different studies, we use


GPU Days to approximate the efficiency, which is defined GPU Days = N × D (9)
as:
where N represents the number of GPUs, and D represents
31
the actual number of days spent searching. Searching Stage Evaluation Stage
Tables 3 and 4 present the performances of different
Architecture
NAS methods on CIFAR-10 and ImageNet, respectively.
Optimization Retraining the
Besides, as most NAS methods first search for the neural Search best-performing
architecture based on a small dataset (CIFAR-10), and then Space model of the
model
transfer the architecture to a larger dataset (ImageNet), searching stage
the search time for both datasets is the same. The tables Parameter
show that the early studies on EA- and RL-based NAS Training
methods focused more on high performance, regardless of
the resource consumption. For example, although (a) Two-stage NAS comprises the searching stage and evaluation
stage. The best-performing model of the searching stage is further
Amoe- baNet [26] achieved excellent results for both
retrained in the evaluation stage.
CIFAR-10 and ImageNet, the searching took 3,150 GPU days
and 450 GPUs. The subsequent NAS studies attempted to Parameter Training
improve the searching efficiency while ensuring the searched
model’s high performance. For instance, EENA [149] Model 1
elaborately designs the mutation and crossover operations,
which can reuse the learned information to guide the Search Architecture Model 2 model
Space Optimization
evolution pro- cess, and hence, substantially improve
...
the efficiency of EA-based NAS methods. ENAS [13] is
one of the first RL-based NAS methods to adopt the Model n
parameter-sharing strategy, which reduces the number
of GPU budgets to 1 and shortens the searching time to (b) One-stage NAS can directly deploy a well-performing model
less than one day. We also observe that gradient descent- without extra retraining or fine-tuning. The two-way arrow indicates
based architecture optimization methods can substantially that the processes of architecture optimization and parameter
training run simultaneously.
reduce the compu- tational resource consumption for
searching, and achieve SOTA results. Several follow-up
Figure 18: Illustration of two- and one-stage neural architecture
studies have been con- ducted to achieve further search flow.
improvement and optimization in this direction.
Interestingly, RS-based methods can also obtain
comparable results. The authors in [180] demon- following properties:
strated that RS with weight-sharing could outperform a
series of powerful methods, such as ENAS [13] and • τ = 1: two rankings are identical
DARTS [17].
• τ = −1: two rankings are completely opposite.
6.1.1. Kendall Tau Metric • τ = 0: there is no relationship between two rankings.
As RS is comparable to more sophisticated methods
(e.g., DARTS and ENAS), a natural question is, what are
the advantages and significance of the other AO algorithms 6.1.2. NAS-Bench Dataset
compared with RS? Researchers have tried to use other Although Tables 3 and 4 present a clear comparison
metrics to answer this question, rather than simply con- between different NAS methods, the results of different
sidering the model’s final accuracy. Most NAS methods methods are obtained under different settings, such as
comprise two stages: 1) search for a best-performing archi- training-related hyperparameters (e.g., batch size and
tecture on the training set and 2) expand it to a deeper train- ing epochs) and data augmentation (e.g., Cutout
architecture and estimate it on the validation set. However, [89]). In other words, the comparison is not quite fair. In
there usually exists a large gap between the two stages. this con- text, NAS-Bench-101 [224] is a pioneering work
In other words, the architecture that achieves the best for improv- ing the reproducibility. It provides a tabular
result in the training set is not necessarily the best one dataset con- taining 423,624 unique neural networks
for the validation set. Therefore, instead of merely generated and eval- uated from a fixed graph-based search
considering the final accuracy and search time cost, space and mapped to their trained and evaluated
many NAS studies [219, 222, 213, 11, 123] have used performance on CIFAR-10. Meanwhile, Dong et al. [225]
Kendall Tau (τ ) metric further built NAS-Bench-201, which is an extension to NAS-
[223] to evaluate the correlation of the model performance Bench-101 and has a differ- ent search space, results on
between the search and evaluation stages. The parameter multiple datasets (CIFAR-10, CIFAR-100, and ImageNet-16-
τ is defined as 120 [194]), and more diag-
NC − ND (10) nostic information. Similarly, Klyuchnikov et al. [226]
τ = NC + ND
proposed a NAS-Bench for the NLP task. These datasets

32
where NC and ND indicate the numbers of concordant and enable NAS researchers to focus solely on verifying the ef-
discordant pairs. τ is a number in the range [-1,1] with the fectiveness and efficiency of their AO algorithms, avoiding

33
repetitive training for selected architectures and substan-
tially helping the NAS community to develop. Search Space Search Space

6.2. One-stage vs. Two-stage weights (Fig- ure 19). However, we observe that most one-
The NAS methods can be roughly divided into two stage NAS methods are based on the one-shot paradigm.
classes according to the flow ––two-stage and one-stage––
as shown in Figure 18.
Two-stage NAS comprises the searching stage and
evaluation stage. The searching stage involves two pro-
cesses: architecture optimization, which aims to find the
optimal architecture, and parameter training, which is
to train the found architecture’s parameter. The
simplest idea is to train all possible architectures’
parameters from scratch and then choose the optimal
architecture. However, it is resource-consuming (e.g.,
NAS-RL [12] took 22,400 GPU days with 800 K40 GPUs
for searching) ), which is in- feasible for most companies
and institutes. Therefore, most NAS methods (such as
ENAS [13] and DARTS [17]) sample and train many
candidate architectures in the searching stage, and then
further retrain the best-performing archi- tecture in the
evaluation stage.
One-stage NAS refers to a class of NAS methods
that can export a well-designed and well-trained neural
architecture without extra retraining, by running AO and
parameter training simultaneously. In this way, the ef-
ficiency can be substantially improved. However, model
architecture and its weight parameters are highly coupled; it
is difficult to optimize them simultaneously. Several recent
studies [217, 227, 228, 218] have attempted to overcome this
challenge. For instance, the authors in [217] proposed
the progressive shrinking algorithm to post-process the
weights after the training was completed. They first
pretrained the entire neural network, and then
progressively fine-tuned the smaller networks that shared
weights with the complete network. Based on well-
designed constraints, the perfor- mance of all
subnetworks was guaranteed. Thus, given a target
deployment device, a specialized subnetwork can be directly
exported without fine-tuning. However, [217] was still
computational resource-intensive, as the whole process took
1,200 GPU hours with V100 GPUs. BigNAS [228] re-
visited the conventional training techniques of stand-alone
networks, and empirically proposed several techniques to
handle a wider set of models, ranging in size from 200M to
1G FLOPs, whereas [217] only handled models under 600M
FLOPs. Both AtomNAS [218] and DSNAS [219] proposed
an end-to-end one-stage NAS framework to further
boost the performance and simplify the flow.

6.3. One-shot/Weight-sharing
One-shot/=one-stage. Note that one shot is not ex-
actly equivalent to one stage. As mentioned above, we
divide the NAS studies into one- and two-stage methods
ac- cording to the flow (Figure 18), whereas whether a NAS
al- gorithm belongs to a one-shot method depends on
whether the candidate architectures share the same
34
training as a constrained optimization problem

Figure 19: (Left) One-shot models. (Right) Non-one-shot


models. Each circle indicates a different model, and its area
indicates the model’s size. We use concentric circles to represent
one-shot models, as they share the weights with each other.

What is One-shot NAS? One-shot NAS methods


embed the search space into an overparameterized
supernet, and thus, all possible architectures can be
derived from the supernet. Figure 18 shows the
difference between the search spaces of one-shot and
non-one-shot NAS. Each circle indicates a different
architecture, where the archi- tectures of one-shot NAS
methods share the same weights. One-shot NAS
methods can be divided into two categories according
to how to handle AO and parameter training: coupled
and decoupled optimization [229, 216].
Coupled optimization. The first category of one-
shot NAS methods optimizes the architecture and
weights in a coupled manner [13, 17, 154, 132, 155].
For instance, ENAS [13] uses an LSTM network to
discretely sample a new architecture, and then uses a
few batches of the training data to optimize the
weight of this architecture. After repeating the above
steps many times, a collection of architectures and their
corresponding performances are recorded. Finally, the
best-performing architecture is se- lected for further
retraining. DARTS [17] uses a similar weight sharing
strategy, but has a continuously parame- terized
architecture distribution. The supernet contains all
candidate operations, each with learnable parameters.
The best architecture can be directly derived from the
distribution. However, as DARTS [17] directly
optimizes the supernet weights and the architecture
distribution, it suffers from vast GPU memory
consumption. Although DARTS-like methods [132, 154,
155] have adopted different approaches to reduce the
resource requirements, coupled optimization inevitably
introduces a bias in both architec- ture distribution
and supernet weights [197, 229], as they treat all
subnetworks unequally. The rapidly converged
architectures can easily obtain more opportunities to
be optimized [17, 159], and are only a small portion
of all candidates; therefore, it is challenging to find
the best architecture.
Another disadvantage of coupling optimization is
that when new architectures are sampled and trained
continu- ously, the weights of previous architectures are
negatively impacted, leading to performance
degradation. The authors in [230] defined this
phenomenon as multimodel forgetting. To overcome
this problem, Zhang et al. [231] modeled supernet

35
of continual learning and proposed novel search-based ar- for each architecture, as these architectures are assigned
chitecture selection (NSAS) loss function. They applied the weights generated by the hypernetwork. Besides, the
the proposed method to RandomNAS [180] and GDAS authors in [232] observed that the architectures with a
[154], where the experimental result demonstrated that smaller symmetrized KL divergence value are more likely
the method effectively reduces the multimodel forgetting to perform better. This can be expressed as follows:
and boosting the predictive ability of the supernet as an
evaluator.
Decoupled optimization. The second category of DSKL = DKL(p q) + DKL(q p)
Σ
n
one-shot NAS methods [209, 232, 229, 217] decouples the s.t. D (p q) = p log (11)
pi
optimization of architecture and weights into two sequential [14] and [232] randomly selected a set of architectures from the
phases: 1) training the supernet and 2) using the trained supernet, and ranked them according to their perfor- mance.
supernet as a predictive performance estimator of SMASH can obtain the validation performance of all selected
different architectures to select the most promising architectures at the cost of a single training run
architecture.
In terms of the supernet training phase, the supernet
cannot be directly trained as a regular neural network
be- cause its weights are also deeply coupled [197]. Yu
et al.
[11] experimentally showed that the weight-sharing strat-
egy degrades the individual architecture’s performance
and negatively impacts the real performance ranking of
the candidate architectures. To reduce the weight
coupling, many one-shot NAS methods [197, 209, 14, 214]
adopt the random sampling policy, which randomly
samples an archi- tecture from the supernet, activating
and optimizing only the weights of this architecture.
Meanwhile, RandomNAS
[180] demonstrates that a random search policy is a compet-
itive baseline method. Although some one-shot approaches
[154, 13, 155, 132, 131] have adopted the strategy that
samples and trains only one path of the supernet at a time,
they sample the path according to the RL controller [13],
Gumbel Softmax [154, 155, 131], or the BinaryConnect net-
work [132], which instead highly couples the architecture
and supernet weights. SMASH [14] adopts an auxiliary
hypernetwork to generate weights for randomly sampled
architectures. Similarly, Zhang et al. [214] proposed a
computation graph representation, and used the graph hy-
pernetwork (GHN) to predict the weights for all possible
architectures faster and more accurately than regular hy-
pernetworks [14]. However, through a careful experimental
analysis conducted to understand the weight-sharing strat-
egy’s mechanism, Bender et al. [232] showed that neither a
hypernetwork nor an RL controller is required to find the
optimal architecture. They proposed a path dropout strat-
egy to alleviate the problem of weight coupling. During
supernet training, each path of the supernet is randomly
dropped with gradually increasing probability. GreedyNAS
[216] adopts a multipath sampling strategy to train the
greedy supernet. This strategy focuses on more potentially
suitable paths, and is demonstrated to effectively achieve
a fairly high rank correlation of candidate architectures
compared with RS.
The second phase involves the selection of the most
promising architecture from the trained supernet, which
is the primary purpose of most NAS tasks. Both SMASH
36
KL i
i=1 q i

where (p1, ..., pn) and (q1, ..., qn) indicate the predictions
of the sampled architecture and one-shot model,
respectively, and n indicates the number of classes. The
cost of calcu- lating the KL value is very small; in [232],
only 64 random training data examples were used.
Meanwhile, EA is also a promising search solution
[197, 216]. For instance, SPOS
[197] uses EA to search for architectures from the
supernet. It is more efficient than the EA methods
introduced in Section 4, because each sampled
architecture only performs inference. The self-evaluated
template network (SETN)
[209] proposes an estimator to predict the
probability of each architecture having a lower
validation loss. The ex- perimental results show that
SETN can potentially find an architecture with better
performance than RS-based methods [232, 14].

6.4. Joint Hyperparameter and Architecture Optimization


Most NAS methods fix the same setting of training-
related hyperparameters during the whole search
stage. Af- ter the search, the hyperparameters of the
best-performing architecture are further optimized.
However, this paradigm may result in suboptimal
results as different architectures tend to fit different
hyperparameters, making the model ranking unfair
[233]. Therefore, a promising solution is the joint
hyperparameter and architecture optimization (HAO)
[34, 234, 233, 235]. We summary the existing joint HAO
methods as follows.
Zela et al. [34] cast NAS as a hyperparameter opti-
mization problem, where the search spaces of NAS and
standard hyperparameters are combined. They applied
BOHB [38], an efficient HPO method, to optimize the ar-
chitecture and hyperparameters jointly. Similarly,
Dong et al. [233] proposed a differentiable method,
namely Au- toHAS, which builds a Cartesian product of
the search spaces of both NAS and HPO by unifying the
represen- tation of all candidate choices for the
architecture (e.g., number of layers) and
hyperparameters (e.g., learning rate). However, a
challenge here is that the candidate choices for the
architecture search space are usually categorical, while
hyperparameters choices can be categorical (e.g., the
type of optimizer) and continuous (e.g., learning rate).
To over- come this challenge, AutoHAS discretizes the
continuous hyperparameters into a linear combination
of multiple cat- egorical bases. For example, the
categorical bases for the learning rate are {0.1, 0.2,
0.3}, and then, the final learning

37
rate is defined as lr = w1 × 0.1 + w2 × 0.2 + w3 × 0.3. Meanwhile, FBNetv3 [235] jointly searches both the architectures and the corresponding training recipes (i.e., hyperparameters). The architectures are represented with one-hot categorical variables and integral (min-max normalized) range variables, and the representation is fed to an encoder network to generate the architecture embedding. Then, the concatenation of the architecture embedding and the training hyperparameters is used to train an accuracy predictor, which is applied at a later stage to search for promising architectures and hyperparameters.

6.5. Resource-aware NAS

Early NAS studies [12, 15, 26] pay more attention to searching for neural architectures that achieve higher performance (e.g., classification accuracy), regardless of the associated resource consumption (i.e., the number of GPUs and time required). Therefore, many follow-up studies investigate resource-aware algorithms to trade off performance against the resource budget. To do so, these algorithms add the computational cost to the loss function as a resource constraint. These algorithms differ in the type of computational cost, which may be 1) the parameter size; 2) the number of Multiply-ACcumulate (MAC) operations; 3) the number of floating-point operations (FLOPs); or 4) the real latency. For example, MONAS [236] considers MAC as the constraint, and as MONAS uses a policy-based reinforcement-learning algorithm to search, the constraint can be directly added to the reward function. MnasNet [130] proposes a customized weighted product to approximate a Pareto optimal solution:

\max_{m} \; ACC(m) \times \left[ \frac{LAT(m)}{T} \right]^{w}    (12)

where LAT(m) denotes the measured inference latency of the model m on the target device, T is the target latency, and w is the weight variable defined as:

w = \begin{cases} \alpha, & \text{if } LAT(m) \le T \\ \beta, & \text{otherwise} \end{cases}    (13)

where the recommended value for both α and β is −0.07.

In terms of a differentiable neural architecture search (DNAS) framework, the constraint (i.e., the loss function) should be differentiable. For this purpose, FBNet [131] uses a latency lookup table model to estimate the overall latency of a network based on the runtime of each operator. The loss function is defined as

L(a, \theta_{a}) = CE(a, \theta_{a}) \cdot \alpha \log(LAT(a))^{\beta}    (14)

where CE(a, θa) indicates the cross-entropy loss of architecture a with weights θa. Similar to MnasNet [130], this loss function also comprises two hyperparameters that need to be set manually: α and β control the magnitude of the loss function and the latency term, respectively. In SNAS [155], the cost of time for the generated child network is linear in the one-hot random variables, such that the differentiability of the resource constraint is ensured.

7. Open Problems and Future Directions

This section discusses several open problems of the existing AutoML methods and proposes some future research directions.

7.1. Flexible Search Space

As summarized in Section 4, there are various search spaces whose primitive operations can be roughly classified into pooling and convolution. Some spaces even use a more complex module (e.g., MBConv [130]) as the primitive operation. Although these search spaces have been proven effective for generating well-performing neural architectures, all of them are based on human knowledge and experience, which inevitably introduces human bias, and hence, still does not break away from the human design paradigm. AutoML-Zero [289] uses very simple mathematical operations (e.g., cos, sin, mean, std) as the primitive operations of the search space to minimize the human bias, and applies EA to discover complete machine learning algorithms. AutoML-Zero successfully designs two-layer neural networks based on these basic mathematical operations. Although the networks searched by AutoML-Zero are much simpler than both human-designed and NAS-designed networks, the experimental results show the potential to discover a new model design paradigm with minimal human design. Therefore, the design of a more general, flexible, and human-bias-free search space, and the discovery of novel neural architectures based on such a search space, would be challenging and advantageous.
7.2. Exploring More Areas
As described in Section 6, the models designed by
NAS algorithms have achieved comparable results in
image clas- sification tasks (CIFAR-10 and ImageNet) to
those of man- ually designed models. Additionally,
many recent studies have applied NAS to other CV
tasks (Table 5).
However, in terms of the NLP task, most NAS
studies have only conducted experiments on the PTB
dataset. Besides, some NAS studies have attempted
to apply NAS to other NLP tasks (shown in Table 5).
However, Figure 20 shows that, even on the PTB
dataset, there is still a big gap in performance between
the NAS-designed models ([13, 17, 12]) and human-
designed models (GPT-2 [290], FRAGE AWD-LSTM-
Mos [4], adversarial AWD-LSTM- Mos [291] and
Transformer-XL [5]). Therefore, the NAS community
still has a long way to achieve comparable results to
those of the models designed by experts on NLP
tasks.
Besides the CV and NLP tasks, Table 5 also shows
that AutoML technique has been applied to other tasks,
such as network compression, federate learning, image
caption,

39
Category Application References
Medical Image Recognition [237, 238, 239]
Object Detection [240, 241, 242, 243, 244, 245]
Semantic Segmentation [246, 129, 247, 248, 249, 250, 251]
Computer Vision Person Re-identification [252]
(CV) Super-Resolution [253, 254, 255]
Image Restoration [256]
Generative Adversarial Network (GAN) [257, 258, 259, 260]
Disparity Estimation [261]
Video Task [262, 263, 264, 265]
Translation [266]
Language Modeling [267]
Natural Language Processing Entity Recognition [267]
(NLP) Text Classification [268]
Sequential Labeling [268]
Keyword Spotting [269]
Network Compression [270, 271, 272, 273, 274, 275, 276, 277]
Graph Neural Network (GNN) [278]
Federate Learning [279, 280]
Loss Function Search [281, 282]
Others Activation Function Search [283]
Image Caption [284, 285]
Text to Speech (TTS) [202]
Recommendation System [286, 287, 288]

Table 5: Summary of the existing automated machine learning applications.

Figure 20: State-of-the-art models on the PTB dataset (test perplexity; the lower the perplexity, the better the performance). The green bars represent the automatically generated models (NAS Cell 64, ENAS 58.6, DARTS 56.1), and the yellow bars represent the models designed by human experts (Transformer-XL 54.55, FRAGE + AWD-LSTM-MoS 46.54, adversarial + AWD-LSTM-MoS 46.01, GPT-2 35.76). Best viewed in color.

7.3. Interpretability
Although AutoML algorithms can find promising configuration settings more efficiently than humans, there is a lack of scientific evidence illustrating why the found settings perform better. For example, in BlockQNN [16], it is unclear why the NAS algorithm tends to select the concatenation operation to process the output of each block in the cell, instead of the element-wise addition operation. Some recent studies [232, 292, 96] have shown that the explanation for these occurrences is usually hindsight and lacks rigorous mathematical proof. Therefore, increasing the mathematical interpretability of AutoML is an important future research direction.

7.4. Reproducibility
A major challenge with ML is reproducibility. AutoML is no exception, especially for NAS, because most of the existing NAS algorithms still have many parameters that need to be set manually at the implementation level, yet the original papers do not cover much of this detail. For instance, Yang et al. [123] experimentally demonstrated that the seed plays an important role in NAS experiments, whereas most NAS studies do not report the seeds used in their experiments. Besides, considerable resource consumption is another obstacle to reproduction. In this context, several NAS-Bench datasets have been proposed, such as NAS-Bench-101 [224], NAS-Bench-201 [225], and NAS-Bench-NLP [226]. These datasets allow NAS researchers to focus on the design of optimization algorithms without wasting much time on model evaluation.
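At the implementation level, one inexpensive step toward reproducible NAS experiments is to fix and report every source of randomness. The following minimal sketch shows a typical way to do this in a PyTorch-based codebase; the chosen seed value is arbitrary, and the exact set of flags required for full determinism can vary across library versions.

import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Fix the random number generators that typically affect a NAS run.
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy (samplers, data pipelines)
    torch.manual_seed(seed)           # PyTorch CPU/CUDA RNGs
    torch.cuda.manual_seed_all(seed)  # all visible GPUs
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Trade speed for determinism in cuDNN convolutions.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # report this value together with the results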

7.5. Robustness
NAS has been proven effective in searching for promising architectures on many open datasets (e.g., CIFAR-10 and ImageNet). These datasets are generally used for research; therefore, most of the images are well-labeled. However, in real-world situations, the data inevitably contain noise (e.g., mislabeling and inadequate information). Even worse, the data might be modified to be adversarial with carefully designed noise. Deep learning models can easily be fooled by adversarial data, and so can NAS.
So far, only a few studies [293, 294, 295, 296] have attempted to boost the robustness of NAS against adversarial data. Guo et al. [294] experimentally explored the intrinsic impact of network architectures on network robustness against adversarial attacks, and observed that densely connected architectures tend to be more robust. They also found that the flow of solution procedure (FSP) matrix [297] is a good indicator of network robustness, i.e., the lower the FSP matrix loss, the more robust the network. Chen et al. [295] proposed a robust loss function to effectively alleviate the performance degradation under symmetric label noise. The authors in [296] adopted EA to search for robust architectures from a well-designed and vast search space, where various adversarial attacks are used as the fitness function for evaluating the robustness of neural architectures.

7.6. Joint Hyperparameter and Architecture Optimization


Most NAS studies have considered HPO and AO as two separate processes. However, as already noted in Section 4, there is tremendous overlap between the methods used in HPO and AO, e.g., both apply RS, BO, and GO methods. In other words, it is feasible to jointly optimize hyperparameters and architectures, as has been experimentally confirmed by several studies [234, 233, 235]. Thus, how to solve the problem of joint hyperparameter and architecture optimization (HAO) elegantly remains a question worth studying.
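To make the idea concrete, the minimal sketch below draws architectural choices and training hyperparameters from a single joint search space and evaluates them together with random search. The search space, its values, and the scoring function are illustrative assumptions rather than the setups of [234, 233, 235].

import random

# Architectural choices and training hyperparameters live in one joint space.
JOINT_SPACE = {
    "num_layers": [8, 14, 20],                        # architecture
    "width_multiplier": [0.5, 1.0, 2.0],              # architecture
    "operator": ["conv3x3", "conv5x5", "sep_conv"],   # architecture
    "learning_rate": [1e-1, 1e-2, 1e-3],              # hyperparameter
    "weight_decay": [0.0, 1e-4, 5e-4],                # hyperparameter
    "batch_size": [64, 128, 256],                     # hyperparameter
}

def sample_configuration(space):
    # Draw one joint (architecture, hyperparameter) configuration.
    return {name: random.choice(choices) for name, choices in space.items()}

def evaluate(config):
    # Stand-in for training the encoded model with the sampled hyperparameters
    # and returning its validation accuracy.
    return random.random()

def random_search_hao(num_trials=50):
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_configuration(JOINT_SPACE)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

print(random_search_hao())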

7.7. Complete AutoML Pipeline


So far, many AutoML pipeline libraries have been proposed, but most of them only focus on some parts of the AutoML pipeline (Figure 1). For instance, TPOT [298], Auto-WEKA [177], and Auto-Sklearn [299] are built on top of scikit-learn [300] for building classification and regression pipelines, but they only search over traditional ML models (such as SVM and KNN). Although TPOT involves neural networks (using a PyTorch [301] backend), it only supports an MLP network. Besides, Auto-Keras [22] is an open-source library developed based on Keras [302], which focuses more on searching for deep learning models and supports multi-modal and multi-task learning. NNI [303] is a more powerful and lightweight AutoML toolkit, as its built-in capabilities cover automated feature engineering, hyperparameter optimization, and neural architecture search. Additionally, the NAS module in NNI supports both PyTorch [301] and TensorFlow [304] and reproduces many SOTA NAS methods [13, 17, 132, 128, 197, 180, 224], which is very friendly for NAS researchers and developers. Besides, NNI also integrates scikit-learn features [300],
which is one step closer to achieving a complete pipeline. Similarly, Vega [305] is another AutoML algorithm tool that constructs a complete pipeline covering a set of highly decoupled functions: data augmentation, HPO, NAS, model compression, and full training. In summary, designing an easy-to-use and complete AutoML pipeline system is a promising research direction.
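As a brief illustration of how such pipeline libraries are typically driven, the sketch below runs TPOT [298] on a small scikit-learn dataset. It follows TPOT's commonly documented interface, but the argument names and defaults may differ across versions, and the search budget shown is deliberately tiny.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# A small benchmark dataset shipped with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Evolutionary search over scikit-learn pipelines (preprocessing, model, HPO).
automl = TPOTClassifier(generations=5, population_size=20,
                        random_state=42, verbosity=2)
automl.fit(X_train, y_train)

print("Test accuracy:", automl.score(X_test, y_test))
automl.export("best_pipeline.py")  # emit the best found pipeline as a Python script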

7.8. Lifelong Learning


Finally, most AutoML algorithms focus only on solving a specific task on some fixed datasets, e.g., image classification on CIFAR-10 and ImageNet. However, a high-quality AutoML system should have the capability of lifelong learning, i.e., it should be able to 1) efficiently learn new data and 2) remember old knowledge.

7.8.1. Learn New Data


First, the system should be able to reuse prior knowledge to solve new tasks (i.e., learning to learn). For example, a child can quickly identify tigers, rabbits, and elephants after seeing several pictures of these animals, whereas current DL models must be trained on considerable data before they can correctly identify images. A hot topic in this area is meta-learning, which aims to design models for new tasks using previous experience.
Meta-learning. Most of the existing NAS methods can search a well-performing architecture for a single task. However, they have to search for a new architecture on a new task; otherwise, the old architecture might not be optimal. Several studies [306, 307, 308, 309] have combined meta-learning and NAS to solve this problem. Recently, Lian et al. [308] proposed a novel meta-learning-based transferable neural architecture search method to generate a meta-architecture, which can adapt to new tasks easily and quickly through a few gradient steps. Another challenge of learning new data is the few-shot learning scenario, where only limited data are available for the new tasks. To overcome this challenge, the authors in [307] and [306] applied NAS to few-shot learning, where they only searched for the most promising architecture and optimized it to work on multiple few-shot learning tasks. Elsken et al. [309] proposed a gradient-based meta-learning NAS method, namely METANAS, which can generate task-specific architectures more efficiently as it does not require meta-retraining.
Unsupervised learning. Meta-learning-based NAS methods focus more on labeled data, while in some cases, only a portion of the data may have labels, or even none at all. Liu et al. [310] proposed a general problem setup, namely unsupervised neural architecture search (UnNAS), to explore whether labels are necessary for NAS. They experimentally demonstrated that the architectures searched without labels are competitive with those searched with labels; therefore, labels are not necessary for NAS, which has provoked some reflection among researchers about which factors do affect NAS.
7.8.2. Remember Old Knowledge
An AutoML system must be able to constantly learn
from new data, without forgetting the knowledge from
old data. However, when we use new datasets to train a
pretrained model, the model’s performance on the previous
datasets is substantially reduced. Incremental learning can
alleviate this problem. For example, Li and Hoiem [311]
proposed the learning without forgetting (LwF) method,
which trains a model using only new data while preserving
its original capabilities. In addition, iCaRL [312] makes
progress based on LwF. It only uses a small proportion of
old data for pretraining, and then gradually increases the
proportion of a new class of data used to train the model.
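To illustrate the idea behind LwF-style training, the sketch below combines the cross-entropy on the new data with a distillation term that keeps the updated model's outputs close to those recorded from the old model. It is a generic simplification rather than the exact formulation of [311] or [312]; the temperature T and weight lam are illustrative.

import torch
import torch.nn.functional as F

def lwf_loss(new_logits, new_labels, old_logits_recorded, T=2.0, lam=1.0):
    # Cross-entropy on the new task plus a penalty against drifting away from
    # the old model's recorded predictions (knowledge distillation).
    ce_new = F.cross_entropy(new_logits, new_labels)
    old_probs = F.softmax(old_logits_recorded / T, dim=1)
    new_log_probs = F.log_softmax(new_logits / T, dim=1)
    distill = F.kl_div(new_log_probs, old_probs, reduction="batchmean") * (T * T)
    return ce_new + lam * distill

# Usage sketch: logits of the current model and logits recorded by the frozen
# old model on the same inputs before training on the new data.
new_logits = torch.randn(8, 10, requires_grad=True)
old_logits_recorded = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = lwf_loss(new_logits, labels, old_logits_recorded)
loss.backward()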

8. Conclusions

This paper provides a detailed and systematic review of AutoML studies according to the DL pipeline (Figure
1), ranging from data preparation to model evaluation.
Additionally, we compare the performance and efficiency of
existing NAS algorithms on the CIFAR-10 and ImageNet
datasets, and provide an in-depth discussion of different
research directions on NAS: one/two-stage NAS, one-shot NAS, and joint HAO. We also describe several
interesting open problems and discuss some important
future research directions. Although research on AutoML
is in its infancy, we believe that future researchers will
effectively solve these problems. In this context, this
review provides a comprehensive and clear understanding
of AutoML for the benefit of those new to this area, and
will thus assist with their future research endeavors.

References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classi-
fication with deep convolutional neural networks, in: P. L.
Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, K. Q.
Weinberger (Eds.), Advances in Neural Information
Processing Systems 25: 26th Annual Conference on Neural
Information Processing Systems 2012. Proceedings of a
meeting held December 3-6, 2012, Lake Tahoe, Nevada,
United States, 2012,
pp. 1106–1114.
URL https://proceedings.neurips.cc/paper/2012/hash/
c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for
image recognition, in: 2016 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016, IEEE Computer Society, 2016, pp.
770–778. doi:10.1109/CVPR.2016.90.
URL https://doi.org/10.1109/CVPR.2016.90
[3] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You
only look once: Unified, real-time object detection, in:
2016 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30,
2016, IEEE
Computer Society, 2016, pp. 779–788. doi:10.1109/CVPR.2016.
91.
URL https://doi.org/10.1109/CVPR.2016.91
[4] C. Gong, D. He, X. Tan, T. Qin, L. Wang, T. Liu, FRAGE:
frequency-agnostic word representation, in: S. Bengio, H. M.
Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
R. Garnett (Eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 2018, pp. 1341–1352.
URL https://proceedings.neurips.cc/paper/2018/hash/e555ebe0ce426f7f9b2bef0706315e0c-Abstract.html
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, Association for Computational
Linguistics, Florence, Italy, 2019, pp. 2978– 2988.
doi:10.18653/v1/P19-1285.
URL https://www.aclweb.org/anthology/P19-1285
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg,
L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge,
International Journal of Computer Vision (IJCV) 115 (3) (2015)
211–252. doi:10.1007/s11263-015-0816-y.
[7] K. Simonyan, A. Zisserman, Very deep convolutional
networks for large-scale image recognition, in: Y. Bengio,
Y. LeCun (Eds.), 3rd International Conference on Learning
Represen- tations, ICLR 2015, San Diego, CA, USA, May 7-
9, 2015, Conference Track Proceedings, 2015.
URL http://arxiv.org/abs/1409.1556
[8] M. Zoller, M. F. Huber, Benchmark and survey of
automated machine learning frameworks, arXiv preprint
arXiv:1904.12054.
[9] Q. Yao, M. Wang, Y. Chen, W. Dai, H. Yi-Qi, L. Yu-Feng,
T. Wei-Wei, Y. Qiang, Y. Yang, Taking human out of
learning applications: A survey on automated machine
learning, arXiv preprint arXiv:1810.13306.
[10] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture
search: A survey, arXiv preprint arXiv:1808.05377.
[11] K. Yu, C. Sciuto, M. Jaggi, C. Musat, M. Salzmann,
Evaluating the search phase of neural architecture
search, in: 8th Inter- national Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020, OpenReview.net, 2020.
URL https://openreview.net/forum?id=H1loF2NFwr
[12] B. Zoph, Q. V. Le, Neural architecture search with reinforce-
ment learning, in: 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26,
2017, Conference Track Proceedings, OpenReview.net,
2017.
URL https://openreview.net/forum?id=r1Ue8Hcxg
[13] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, J. Dean, Efficient
neural architecture search via parameter sharing, in: J. G.
Dy,
A. Krause (Eds.), Proceedings of the 35th International Con-
ference on Machine Learning, ICML 2018,
Stockholmsma¨ssan, Stockholm, Sweden, July 10-15, 2018,
Vol. 80 of Proceedings of Machine Learning Research,
PMLR, 2018, pp. 4092–4101.
URL http://proceedings.mlr.press/v80/pham18a.html
[14] A. Brock, T. Lim, J. M. Ritchie, N. Weston, SMASH: one-shot
model architecture search through hypernetworks, in: 6th
Inter- national Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings, OpenReview.net, 2018.
URL https://openreview.net/forum?id=rydeCEhs-
[15] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning
transferable architectures for scalable image recognition,
in: 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT,
USA, June 18-22, 2018, IEEE Computer Society, 2018, pp.
8697–8710. doi:10.1109/CVPR.2018.00907.
URL

http://openaccess.thecvf.com/content_cvpr_2018/
html/Zoph_Learning_Transferable_Architectures_CVPR_
2018_paper.html
[16] Z. Zhong, J. Yan, W. Wu, J. Shao, C. Liu, Practical block-
wise neural network architecture generation, in: 2018 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society, 2018, pp. 2423–2432. doi:10.1109/CVPR.2018.00257.
URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zhong_Practical_Block-Wise_Neural_CVPR_2018_paper.html
[17] H. Liu, K. Simonyan, Y. Yang, DARTS: differentiable archi- Intelligence, IJCAI
tecture search, in: 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
2019, OpenReview.net, 2019.
URL https://openreview.net/forum?id=S1eYHoC5FX
[18] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li,
L. Fei-Fei, A. Yuille, J. Huang, K. Murphy, Progressive
neural architecture search (2018) 19–34.
[19] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, K. Kavukcuoglu,
Hierarchical representations for efficient architecture search,
in: 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30
- May 3, 2018, Conference Track Proceedings,
OpenReview.net, 2018.
URL https://openreview.net/forum?id=BJQRKzbA-
[20] T. Chen, I. J. Goodfellow, J. Shlens, Net2net: Accelerating
learning via knowledge transfer, in: Y. Bengio, Y. LeCun
(Eds.), 4th International Conference on Learning Represen-
tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016,
Conference Track Proceedings, 2016.
URL http://arxiv.org/abs/1511.05641
[21] T. Wei, C. Wang, Y. Rui, C. W. Chen, Network morphism, in:
M. Balcan, K. Q. Weinberger (Eds.), Proceedings of the
33nd International Conference on Machine Learning, ICML
2016, New York City, NY, USA, June 19-24, 2016, Vol. 48 of
JMLR Workshop and Conference Proceedings, JMLR.org,
2016, pp. 564–572.
URL http://proceedings.mlr.press/v48/wei16.html
[22] H. Jin, Q. Song, X. Hu, Auto-keras: An efficient neural
architecture search system, in: A. Teredesai, V. Kumar,
Y. Li, R. Rosales, E. Terzi, G. Karypis (Eds.), Proceed-
ings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, KDD 2019, Anchor-
age, AK, USA, August 4-8, 2019, ACM, 2019, pp. 1946–
1956. doi:10.1145/3292500.3330648.
URL https://doi.org/10.1145/3292500.3330648
[23] B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural
network architectures using reinforcement learning, in: 5th
International Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings, OpenReview.net, 2017.
URL https://openreview.net/forum?id=S1c2cvqee
[24] K. O. Stanley, R. Miikkulainen, Evolving neural networks
through augmenting topologies, Evolutionary computation
10 (2) (2002) 99–127.
[25] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan,
Q. V. Le, A. Kurakin, Large-scale evolution of image classifiers,
in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th Inter-
national Conference on Machine Learning, ICML 2017, Sydney,
NSW, Australia, 6-11 August 2017, Vol. 70 of Proceedings of
Machine Learning Research, PMLR, 2017, pp. 2902–2911.
URL http://proceedings.mlr.press/v70/real17a.html
[26] E. Real, A. Aggarwal, Y. Huang, Q. V. Le, Regularized evo-
lution for image classifier architecture search, in: The
Thirty- Third AAAI Conference on Artificial Intelligence,
AAAI 2019, The Thirty-First Innovative Applications of
Artificial Intel- ligence Conference, IAAI 2019, The Ninth
AAAI Sympo- sium on Educational Advances in Artificial
Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27
- February 1, 2019,
AAAI Press, 2019, pp. 4780–4789. doi:10.1609/aaai.v33i01.
33014780.
URL https://doi.org/10.1609/aaai.v33i01.33014780
[27] T. Elsken, J. H. Metzen, F. Hutter, Efficient multi-objective
neural architecture search via lamarckian evolution, in: 7th
International Conference on Learning Representations, ICLR
2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net,
2019.
URL https://openreview.net/forum?id=ByME42AqK7
[28] M. Suganuma, S. Shirakawa, T. Nagao, A genetic
programming approach to designing convolutional neural
network architec- tures, in: J. Lang (Ed.), Proceedings of the
Twenty-Seventh International Joint Conference on Artificial
2018, July 13-19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp. evolu- tion approach to designing deep convolutional neural
5369–5373. doi:10.24963/ijcai.2018/755. networks for image classification, in: Australasian Joint
URL https://doi.org/10.24963/ijcai.2018/755 Conference on Artificial Intelligence, Springer, 2018, pp.
[29] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, 237–250.
O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. [45] M. Wistuba, A. Rawat, T. Pedapati, A survey on neural archi-
Duffy, et al., Evolving deep neural networks (2019) 293– tecture search, arXiv preprint arXiv:1905.01392.
312.
[30] L. Xie, A. L. Yuille, Genetic CNN, in: IEEE International
Conference on Computer Vision, ICCV 2017, Venice,
Italy, October 22-29, 2017, IEEE Computer Society, 2017,
pp. 1388–
1397. doi:10.1109/ICCV.2017.154.
URL https://doi.org/10.1109/ICCV.2017.154
[31] K. Ahmed, L. Torresani, Maskconnect: Connectivity
learning by gradient descent (2018) 349–365.
[32] R. Shin, C. Packer, D. Song, Differentiable neural
network architecture search.
[33] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, F.
Hutter, Towards automatically-tuned neural networks
(2016) 58–65.
[34] A. Zela, A. Klein, S. Falkner, F. Hutter, Towards
automated deep learning: Efficient joint neural
architecture and hyperpa- rameter search, arXiv preprint
arXiv:1807.06906.
[35] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, Fast
bayesian optimization of machine learning hyperparameters
on large datasets, in: A. Singh, X. J. Zhu (Eds.),
Proceedings of the 20th International Conference on
Artificial Intelligence and Statistics, AISTATS 2017, 20-22
April 2017, Fort Lauderdale, FL, USA, Vol. 54 of
Proceedings of Machine Learning Research, PMLR, 2017,
pp. 528–536.
URL http://proceedings.mlr.press/v54/klein17a.html
[36] S. Falkner, A. Klein, F. Hutter, Practical hyperparameter
optimization for deep learning.
[37] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-
based optimization for general algorithm configuration, in:
In- ternational conference on learning and intelligent
optimization, 2011, pp. 507–523.
[38] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient
hyperparameter optimization at scale, in: J. G. Dy, A.
Krause (Eds.), Proceedings of the 35th International
Conference on Machine Learning, ICML 2018,
Stockholmsma¨ssan, Stockholm, Sweden, July 10-15, 2018,
Vol. 80 of Proceedings of Machine Learning Research,
PMLR, 2018, pp. 1436–1445.
URL http://proceedings.mlr.press/v80/falkner18a.html
[39] J. Bergstra, D. Yamins, D. D. Cox, Making a science of
model search: Hyperparameter optimization in hundreds
of dimen- sions for vision architectures, in: Proceedings of
the 30th Inter- national Conference on Machine Learning,
ICML 2013, Atlanta, GA, USA, 16-21 June 2013, Vol. 28 of
JMLR Workshop and Conference Proceedings,
JMLR.org, 2013, pp. 115–123.
URL http://proceedings.mlr.press/v28/bergstra13.html
[40] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian,
C. Xu, CARS: continuous evolution for efficient neural
ar- chitecture search, in: 2020 IEEE/CVF Conference on
Com- puter Vision and Pattern Recognition, CVPR 2020,
Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp.
1826–1835. doi:10.1109/CVPR42600.2020.00190.
URL https://doi.org/10.1109/CVPR42600.2020.00190
[41] K. Maziarz, M. Tan, A. Khorlin, M. Georgiev, A.
Gesmundo, Evolutionary-neural hybrid agents for
architecture searcharXiv: 1811.09828.
[42] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu,
X. Wang, Reinforced evolutionary neural architecture
search, arXiv preprint arXiv:1808.00193.
[43] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, M. Zhang,
Surrogate-assisted evolutionary deep learning using an end-
to- end random forest-based performance predictor, IEEE
Trans- actions on Evolutionary Computation.
[44] B. Wang, Y. Sun, B. Xue, M. Zhang, A hybrid differential
[46] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, December 7-13, 2015, IEEE
X. Wang, A comprehensive survey of neural architecture search:
Challenges and solutions (2020). arXiv:2006.02903.
[47] R. Elshawi, M. Maher, S. Sakr, Automated machine learn-
ing: State-of-the-art and open challenges, arXiv preprint
arXiv:1906.02287.
[48] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based
learning applied to document recognition, Proceedings of
the IEEE 86 (11) (1998) 2278–2324.
[49] A. Krizhevsky, V. Nair, G. Hinton, The cifar-10 dataset, online:
http://www. cs. toronto. edu/kriz/cifar. html.
[50] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Ima-
genet: A large-scale hierarchical image database, in: 2009
IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2009), 20-25 June 2009,
Miami, Florida, USA, IEEE Computer Society, 2009, pp.
248–255. doi:10.1109/CVPR.2009.5206848.
URL https://doi.org/10.1109/CVPR.2009.5206848
[51] J. Yang, X. Sun, Y.-K. Lai, L. Zheng, M.-M. Cheng, Recog-
nition from web data: a progressive filtering approach,
IEEE Transactions on Image Processing 27 (11) (2018) 5303–
5315.
[52] X. Chen, A. Shrivastava, A. Gupta, NEIL: extracting visual
knowledge from web data, in: IEEE International
Conference on Computer Vision, ICCV 2013, Sydney,
Australia, December 1-8, 2013, IEEE Computer Society, 2013,
pp. 1409–1416. doi: 10.1109/ICCV.2013.178.
URL https://doi.org/10.1109/ICCV.2013.178
[53] Y. Xia, X. Cao, F. Wen, J. Sun, Well begun is half done:
Generating high-quality seeds for automatic image dataset
construction from web, in: European Conference on Computer
Vision, Springer, 2014, pp. 387–400.
[54] N. H. Do, K. Yanai, Automatic construction of action datasets
using web videos with density-based cluster analysis and outlier
detection, in: Pacific-Rim Symposium on Image and Video
Technology, Springer, 2015, pp. 160–172.
[55] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig,
J. Philbin, L. Fei-Fei, The unreasonable effectiveness of noisy
data for fine-grained recognition, in: European Conference
on Computer Vision, Springer, 2016, pp. 301–320.
[56] P. D. Vo, A. Ginsca, H. Le Borgne, A. Popescu, Harnessing
noisy web images for deep representation, Computer Vision
and Image Understanding 164 (2017) 68–81.
[57] B. Collins, J. Deng, K. Li, L. Fei-Fei, Towards scalable dataset
construction: An active learning approach, in: European con-
ference on computer vision, Springer, 2008, pp. 86–98.
[58] Y. Roh, G. Heo, S. E. Whang, A survey on data collection for
machine learning: a big data-ai integration perspective, IEEE
Transactions on Knowledge and Data Engineering.
[59] D. Yarowsky, Unsupervised word sense disambiguation rivaling
supervised methods, in: 33rd Annual Meeting of the
Association for Computational Linguistics, Association for
Computational Linguistics, Cambridge, Massachusetts, USA,
1995, pp. 189–
196. doi:10.3115/981658.981684.
URL https://www.aclweb.org/anthology/P95-1026
[60] I. Triguero, J. A. Sa´ez, J. Luengo, S. Garc´ıa, F. Herrera, On the
characterization of noise filters for self-training semi-supervised
in nearest neighbor classification, Neurocomputing 132 (2014)
30–41.
[61] M. F. A. Hady, F. Schwenker, Combining committee-based semi-
supervised learning and active learning, Journal of Computer
Science and Technology 25 (4) (2010) 681–698.
[62] A. Blum, T. Mitchell, Combining labeled and unlabeled data
with co-training, in: Proceedings of the eleventh annual
con- ference on Computational learning theory, ACM,
1998, pp. 92–100.
[63] Y. Zhou, S. Goldman, Democratic co-learning, in: Tools
with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE
Interna- tional Conference on, IEEE, 2004, pp. 594–602.
[64] X. Chen, A. Gupta, Webly supervised learning of
convolutional networks, in: 2015 IEEE International
Conference on Computer Vision, ICCV 2015, Santiago, Chile,
Computer Society, 2015, pp. 1431–1439. doi:10.1109/ICCV. URL http://openaccess.thecvf.com/content_CVPR_2019/
2015.168. html/Karras_A_Style-Based_Generator_Architecture_for_
URL https://doi.org/10.1109/ICCV.2015.168 Generative_Adversarial_Networks_CVPR_2019_paper.html
[65] Z. Xu, S. Huang, Y. Zhang, D. Tao, Augmenting strong [79] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data cleaning:
super- vision using web data for fine-grained Overview and emerging challenges, in: F. O¨ zcan, G.
categorization, in: 2015 IEEE International Conference on Koutrika,
Computer Vision, ICCV 2015, Santiago, Chile, December S. Madden (Eds.), Proceedings of the 2016 International Con-
7-13, 2015, IEEE Computer ference on Management of Data, SIGMOD Conference 2016,
Society, 2015, pp. 2524–2532. doi:10.1109/ICCV.2015.290. San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016,
URL https://doi.org/10.1109/ICCV.2015.290 pp. 2201–2206. doi:10.1145/2882903.2912574.
[66] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. URL https://doi.org/10.1145/2882903.2912574
Kegelmeyer, Smote: synthetic minority over-sampling [80] M. Jesmeen, J. Hossen, S. Sayeed, C. Ho, K. Tawsif, A. Rahman,
technique, Journal of artificial intelligence research 16
(2002) 321–357.
[67] H. Guo, H. L. Viktor, Learning from imbalanced data
sets with boosting and data generation: the databoost-im
approach, ACM Sigkdd Explorations Newsletter 6 (1)
(2004) 30–39.
[68] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J.
Schul- man, J. Tang, W. Zaremba, Openai gym, arXiv
preprint arXiv:1606.01540.
[69] Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, X. Chu, Irs: A
large synthetic indoor robotics stereo dataset for disparity
and surface normal estimation, arXiv preprint
arXiv:1912.09678.
[70] N. Ruiz, S. Schulter, M. Chandraker, Learning to
simulate, in: 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May
6-9, 2019, OpenRe- view.net, 2019.
URL https://openreview.net/forum?id=HJgkx2Aqt7
[71] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.
Warde- Farley, S. Ozair, A. C. Courville, Y. Bengio,
Generative adversarial nets, in: Z. Ghahramani, M.
Welling, C. Cortes,
N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in
Neural Information Processing Systems 27: Annual
Conference on Neural Information Processing Systems
2014, December 8-13 2014, Montreal, Quebec, Canada,
2014, pp. 2672–2680.
URL
https://proceedings.neurips.cc/paper/2014/hash/
5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[72] T.-H. Oh, R. Jaroensri, C. Kim, M. Elgharib, F. Durand,
W. T. Freeman, W. Matusik, Learning-based video
motion magnification, in: Proceedings of the European
Conference on Computer Vision (ECCV), 2018, pp. 633–
648.
[73] L. Sixt, Rendergan: Generating realistic labeled data–with an
application on decoding bee tags, unpublished Bachelor
Thesis, Freie Universit¨at, Berlin.
[74] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A.
Ham- mers, D. A. Dickie, M. V. Herna´ndez, J. Wardlaw, D.
Rueckert, Gan augmentation: Augmenting training data
using generative adversarial networks, arXiv preprint
arXiv:1810.10863.
[75] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park,
Y. Kim, Data synthesis based on generative adversarial net-
works, Proceedings of the VLDB Endowment 11 (10) (2018)
1071–1083.
[76] L. Xu, K. Veeramachaneni, Synthesizing tabular data using
gen- erative adversarial networks, arXiv preprint
arXiv:1811.11264.
[77] D. Donahue, A. Rumshisky, Adversarial text generation
without reinforcement learning, arXiv preprint
arXiv:1810.06640.
[78] T. Karras, S. Laine, T. Aila, A style-based generator ar-
chitecture for generative adversarial networks, in: IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA, USA, June
16-20, 2019,
Computer Vision Foundation / IEEE, 2019, pp. 4401–
4410.
doi:10.1109/CVPR.2019.00453.
E. Arif, A survey on cleaning dirty data using machine [96] S. C. Wong, A. Gatt, V. Stamatescu, M. D. McDonnell, Under-
learning paradigm for big data analytics, Indonesian Journal of standing data augmentation for classification: when to warp?,
Electrical Engineering and Computer Science 10 (3) (2018)
1234–1243.
[81] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N.
Tang,
Y. Ye, KATARA: A data cleaning system powered by
knowledge bases and crowdsourcing, in: T. K. Sellis, S.
B. Davidson,
Z. G. Ives (Eds.), Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data, Melbourne,
Victoria, Australia, May 31 - June 4, 2015, ACM, 2015, pp.
1247–1261. doi:10.1145/2723372.2749431.
URL https://doi.org/10.1145/2723372.2749431
[82] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska,
T. Milo, E. Wu, Sampleclean: Fast and reliable analytics on
dirty data., IEEE Data Eng. Bull. 38 (3) (2015) 59–75.
[83] S. Krishnan, M. J. Franklin, K. Goldberg, J. Wang, E. Wu, Ac-
tiveclean: An interactive data cleaning framework for
modern machine learning, in: F. O¨ zcan, G. Koutrika, S.
Madden (Eds.),
Proceedings of the 2016 International Conference on Manage-
ment of Data, SIGMOD Conference 2016, San Francisco, CA,
USA, June 26 - July 01, 2016, ACM, 2016, pp. 2117–2120.
doi:10.1145/2882903.2899409.
URL https://doi.org/10.1145/2882903.2899409
[84] S. Krishnan, M. J. Franklin, K. Goldberg, E. Wu, Boostclean:
Automated error detection and repair for machine learning,
arXiv preprint arXiv:1711.01299.
[85] S. Krishnan, E. Wu, Alphaclean: Automatic generation of data
cleaning pipelines, arXiv preprint arXiv:1904.11827.
[86] I. Gemp, G. Theocharous, M. Ghavamzadeh, Automated data
cleansing through meta-learning, in: S. P. Singh, S.
Markovitch (Eds.), Proceedings of the Thirty-First AAAI
Conference on Artificial Intelligence, February 4-9, 2017, San
Francisco, Cali- fornia, USA, AAAI Press, 2017, pp. 4760–4761.
URL http://aaai.org/ocs/index.php/IAAI/IAAI17/paper/
view/14236
[87] I. F. Ilyas, Effective data cleaning with continuous evaluation.,
IEEE Data Eng. Bull. 39 (2) (2016) 38–46.
[88] M. Mahdavi, F. Neutatz, L. Visengeriyeva, Z. Abedjan,
Towards automated data cleaning workflows, Machine
Learning 15 (2019) 16.
[89] T. DeVries, G. W. Taylor, Improved regularization of con-
volutional neural networks with cutout, arXiv preprint
arXiv:1708.04552.
[90] H. Zhang, M. Ciss´e, Y. N. Dauphin, D. Lopez-Paz, mixup:
Beyond empirical risk minimization, in: 6th International Con-
ference on Learning Representations, ICLR 2018, Vancouver,
BC, Canada, April 30 - May 3, 2018, Conference Track Pro-
ceedings, OpenReview.net, 2018.
URL https://openreview.net/forum?id=r1Ddp1-Rb
[91] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving,
C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft,
Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer,
B. Cook, I. Fern´andez, F.-M. De Rainville, C.-H. Weng,
A. Ayala-Acevedo, R. Meudec, M. Laporte, et al., imgaug,
https://github.com/aleju/imgaug, online; accessed 01-Feb-
2020 (2020).
[92] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, A. A.
Kalinin, Albumentations: fast and flexible image augmenta-
tions, ArXiv e-printsarXiv:1809.06839.
[93] A. Miko-lajczyk, M. Grochowski, Data augmentation for im-
proving deep learning in image classification problem, in:
2018 international interdisciplinary PhD workshop (IIPhDW),
IEEE, 2018, pp. 117–122.
[94] A. Miko-lajczyk, M. Grochowski, Style transfer-based image
synthesis as an efficient regularization technique in deep learn-
ing, in: 2019 24th International Conference on Methods and
Models in Automation and Robotics (MMAR), IEEE, 2019,
pp. 42–47.
[95] A. Antoniou, A. Storkey, H. Edwards, Data augmentation gen-
erative adversarial networks, arXiv preprint arXiv:1711.04340.
arXiv preprint arXiv:1609.08764. autoaug- ment, in: 8th International Conference on Learning
[97] Z. Xie, S. I. Wang, J. Li, D. L´evy, A. Nie, D. Jurafsky, A. Y. Represen- tations, ICLR 2020, Addis Ababa, Ethiopia, April 26-
Ng, Data noising as smoothing in neural network 30, 2020,
language models, in: 5th International Conference on OpenReview.net, 2020.
Learning Representations, ICLR 2017, Toulon, France, URL https://openreview.net/forum?id=ByxdUySKvS
April 24-26, 2017, Conference Track Proceedings, [109] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin,
OpenReview.net, 2017. W. Ouyang, Online hyper-parameter learning for auto-
URL https://openreview.net/forum?id=H1VyHY9gg augmentation strategy, in: 2019 IEEE/CVF International Con-
[98] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. ference on Computer Vision, ICCV 2019, Seoul, Korea (South),
Norouzi, October 27 - November 2, 2019, IEEE, 2019, pp. 6578–
Q. V. Le, Qanet: Combining local convolution with 6587. doi:10.1109/ICCV.2019.00668.
global self-attention for reading comprehension, in: 6th
International Conference on Learning Representations,
ICLR 2018, Vancou- ver, BC, Canada, April 30 - May 3,
2018, Conference Track Proceedings, OpenReview.net,
2018.
URL https://openreview.net/forum?id=B14TlG-RW
[99] E. Ma, Nlp augmentation,
https://github.com/makcedward/ nlpaug (2019).
[100] E. D. Cubuk, B. Zoph, D. Man´e, V. Vasudevan, Q. V.
Le, Autoaugment: Learning augmentation strategies
from data, in: IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2019, Long Beach, CA,
USA, June 16-20, 2019, Computer Vision Foundation /
IEEE, 2019, pp. 113–123.
doi:10.1109/CVPR.2019.00020.
URL

http://openaccess.thecvf.com/content_CVPR_
2019/html/Cubuk_AutoAugment_Learning_Augmentation_
Strategies_From_Data_CVPR_2019_paper.html
[101] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M.
Robertson,
Y. Yang, Dada: Differentiable automatic data
augmentation, arXiv preprint arXiv:2003.03780.
[102] R. Hataya, J. Zdenek, K. Yoshizoe, H. Nakayama, Faster
au- toaugment: Learning augmentation strategies using
backpropa- gation, arXiv preprint arXiv:1911.06987.
[103] S. Lim, I. Kim, T. Kim, C. Kim, S. Kim, Fast autoaugment,
in:
H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alch´e-Buc,
E. B. Fox, R. Garnett (Eds.), Advances in Neural
Information Processing Systems 32: Annual Conference
on Neural Informa- tion Processing Systems 2019,
NeurIPS 2019, December 8-14,
2019, Vancouver, BC, Canada, 2019, pp. 6662–6672.
URL
https://proceedings.neurips.cc/paper/2019/hash/
6add07cf50424b14fdf649da87843d01-Abstract.html
[104] A. Naghizadeh, M. Abavisani, D. N. Metaxas, Greedy
autoaug- ment, arXiv preprint arXiv:1908.00704.
[105] D. Ho, E. Liang, X. Chen, I. Stoica, P. Abbeel,
Population based augmentation: Efficient learning of
augmentation policy schedules, in: K. Chaudhuri, R.
Salakhutdinov (Eds.), Proceed- ings of the 36th
International Conference on Machine Learning, ICML
2019, 9-15 June 2019, Long Beach, California, USA,
Vol. 97 of Proceedings of Machine Learning Research,
PMLR, 2019, pp. 2731–2741.
URL http://proceedings.mlr.press/v97/ho19b.html
[106] T. Niu, M. Bansal, Automatically learning data
augmenta- tion policies for dialogue tasks, in:
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Pro- cessing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Com-
putational Linguistics, Hong Kong, China, 2019, pp. 1317–
1323. doi:10.18653/v1/D19-1132.
URL https://www.aclweb.org/anthology/D19-1132
[107] M. Geng, K. Xu, B. Ding, H. Wang, L. Zhang, Learning
data augmentation policies using augmented random
search, arXiv preprint arXiv:1811.04768.
[108] X. Zhang, Q. Wang, J. Zhang, Z. Zhong, Adversarial
URL https://doi.org/10.1109/ICCV.2019.00668 [128] X. Chen, L. Xie, J. Wu, Q. Tian, Progressive differentiable
[110] T. C. LingChen, A. Khonsari, A. Lashkari, M. R. Nazari, architecture search: Bridging the depth gap between search and
J. S. Sambee, M. A. Nascimento, Uniformaugment: A search- evaluation, in: 2019 IEEE/CVF International Conference on
free probabilistic data augmentation approach, arXiv preprint Computer Vision, ICCV 2019, Seoul, Korea (South), October
arXiv:2003.14348. 27 - November 2, 2019, IEEE, 2019, pp. 1294–1303. doi:
[111] H. Motoda, H. Liu, Feature selection, extraction and construc- 10.1109/ICCV.2019.00138.
tion, Communication of IICM (Institute of Information and URL https://doi.org/10.1109/ICCV.2019.00138
Computing Machinery, Taiwan) Vol 5 (67-72) (2002) 2. [129] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille,
[112] M. Dash, H. Liu, Feature selection for classification, Intelligent F. Li, Auto-deeplab: Hierarchical neural architecture search
data analysis 1 (1-4) (1997) 131–156. for semantic image segmentation, in: IEEE Conference on
[113] M. J. Pazzani, Constructive induction of cartesian product Computer Vision and Pattern Recognition, CVPR 2019, Long
attributes, in: Feature Extraction, Construction and Selection, Beach, CA, USA, June 16-20, 2019, Computer Vision Founda-
Springer, 1998, pp. 341–354. tion / IEEE, 2019, pp. 82–92. doi:10.1109/CVPR.2019.00017.
[114] Z. Zheng, A comparison of constructing different types of new URL http://openaccess.thecvf.com/content_CVPR_
feature for decision tree learning, in: Feature Extraction, Con- 2019/html/Liu_Auto-DeepLab_Hierarchical_Neural_
struction and Selection, Springer, 1998, pp. 239–255. Architecture_Search_for_Semantic_Image_Segmentation_
[115] J. Gama, Functional trees, Machine Learning 55 (3) (2004) CVPR_2019_paper.html
219–250. [130] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler,
[116] H. Vafaie, K. De Jong, Evolutionary feature space transfor- A. Howard, Q. V. Le, Mnasnet: Platform-aware neural archi-
mation, in: Feature Extraction, Construction and Selection, tecture search for mobile, in: IEEE Conference on Computer
Springer, 1998, pp. 307–323. Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,
[117] P. Sondhi, Feature construction methods: a survey, sifaka. cs. USA, June 16-20, 2019, Computer Vision Foundation / IEEE,
uiuc. edu 69 (2009) 70–71. 2019, pp. 2820–2828. doi:10.1109/CVPR.2019.00293.
[118] D. Roth, K. Small, Interactive feature space construction using URL http://openaccess.thecvf.com/content_CVPR_2019/
semantic information, in: Proceedings of the Thirteenth Con- html/Tan_MnasNet_Platform-Aware_Neural_Architecture_
ference on Computational Natural Language Learning (CoNLL- Search_for_Mobile_CVPR_2019_paper.html
2009), Association for Computational Linguistics, Boulder, [131] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian,
Colorado, 2009, pp. 66–74. P. Vajda, Y. Jia, K. Keutzer, Fbnet: Hardware-aware efficient
URL https://www.aclweb.org/anthology/W09-1110 convnet design via differentiable neural architecture search,
[119] Q. Meng, D. Catchpoole, D. Skillicom, P. J. Kennedy, Rela- in: IEEE Conference on Computer Vision and Pattern
tional autoencoder for feature extraction, in: 2017 International Recognition, CVPR 2019, Long Beach, CA, USA, June
Joint Conference on Neural Networks (IJCNN), IEEE, 2017, 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp.
pp. 364–371. 10734–10742. doi:10.1109/CVPR.2019.01099.
[120] O. Irsoy, E. Alpaydın, Unsupervised feature extraction with URL http://openaccess.thecvf.com/content_CVPR_
autoencoder trees, Neurocomputing 258 (2017) 63–73. 2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_
[121] C. Cortes, V. Vapnik, Support-vector networks, Machine learn- Design_via_Differentiable_Neural_Architecture_Search_
ing 20 (3) (1995) 273–297. CVPR_2019_paper.html
[122] N. S. Altman, An introduction to kernel and nearest-neighbor [132] H. Cai, L. Zhu, S. Han, Proxylessnas: Direct neural architecture
nonparametric regression, The American Statistician 46 (3) search on target task and hardware, in: 7th International
(1992) 175–185. Conference on Learning Representations, ICLR 2019, New
[123] A. Yang, P. M. Esperan¸ca, F. M. Carlucci, NAS evaluation is Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.
frustratingly hard, in: 8th International Conference on Learning URL https://openreview.net/forum?id=HylVB3AqYm
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26- [133] M. Courbariaux, Y. Bengio, J. David, Binaryconnect: Training
30, 2020, OpenReview.net, 2020. deep neural networks with binary weights during propagations,
URL https://openreview.net/forum?id=HygrdpVKvr in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
[124] F. Chollet, Xception: Deep learning with depthwise separable R. Garnett (Eds.), Advances in Neural Information Processing
convolutions, in: 2017 IEEE Conference on Computer Vision Systems 28: Annual Conference on Neural Information
and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, Processing Systems 2015, December 7-12, 2015, Montreal,
July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800–1807. Quebec, Canada, 2015, pp. 3123–3131.
doi:10.1109/CVPR.2017.195. URL https://proceedings.neurips.cc/paper/2015/hash/
URL https://doi.org/10.1109/CVPR.2017.195 3e15cc11f979ed25912dff5b0669f2cd-Abstract.html
[125] F. Yu, V. Koltun, Multi-scale context aggregation by dilated [134] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a
convolutions, in: Y. Bengio, Y. LeCun (Eds.), 4th International neural network, arXiv preprint arXiv:1503.02531.
Conference on Learning Representations, ICLR 2016, San Juan, [135] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable
Puerto Rico, May 2-4, 2016, Conference Track Proceedings, are features in deep neural networks?, in: Z. Ghahramani,
2016. M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger
URL http://arxiv.org/abs/1511.07122 (Eds.), Advances in Neural Information Processing Systems 27:
[126] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Annual Conference on Neural Information Processing Systems
in: 2018 IEEE Conference on Computer Vision and Pattern 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014,
Recognition, CVPR 2018, Salt Lake City, UT, USA, June pp. 3320–3328.
18-22, 2018, IEEE Computer Society, 2018, pp. 7132–7141. URL https://proceedings.neurips.cc/paper/2014/hash/
doi:10.1109/CVPR.2018.00745. 375c71349b295fbe2dcdca9206f20a06-Abstract.html
URL http://openaccess.thecvf.com/content_cvpr_2018/ [136] T. Wei, C. Wang, C. W. Chen, Modularized morphing of neural
html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_ networks, arXiv preprint arXiv:1701.03281.
paper.html [137] H. Cai, T. Chen, W. Zhang, Y. Yu, J. Wang, Efficient
[127] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, architecture search by network transformation, in: S. A.
Densely
connected convolutional networks, in: 2017 IEEE Conference McIlraith, K. Q. Weinberger (Eds.), Proceedings of the
on Computer Vision and Pattern Recognition, CVPR 2017, Thirty-Second AAAI Conference on Artificial Intelligence,
Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, (AAAI-18), the 30th innovative Applications of Artificial
2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243. Intelligence (IAAI-18), and the 8th AAAI Symposium on
URL https://doi.org/10.1109/CVPR.2017.243 Educational Advances in Artificial Intelligence (EAAI-18),

New Orleans, Louisiana, USA, February 2-7, 2018, AAAI URL https://www.aclweb.org/anthology/H94-1020
Press, 2018, pp. 2787–2794. [153] C. He, H. Ye, L. Shen, T. Zhang, Milenas: Efficient neu-
URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/ ral architecture search via mixed-level reformulation, in: 2020
paper/view/16755 IEEE/CVF Conference on Computer Vision and Pattern Recog-
[138] A. Kwasigroch, M. Grochowski, M. Mikolajczyk, Deep neu- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE,
ral network architecture search using network morphism, in: 2020, pp. 11990–11999. doi:10.1109/CVPR42600.2020.01201.
2019 24th International Conference on Methods and Models in URL https://doi.org/10.1109/CVPR42600.2020.01201
Automation and Robotics (MMAR), IEEE, 2019, pp. 30–35. [154] X. Dong, Y. Yang, Searching for a robust neural architecture
[139] H. Cai, J. Yang, W. Zhang, S. Han, Y. Yu, Path-level network in four GPU hours, in: IEEE Conference on Computer Vision
transformation for efficient architecture search, in: J. G. Dy, and Pattern Recognition, CVPR 2019, Long Beach, CA, USA,
A. Krause (Eds.), Proceedings of the 35th International Con- June 16-20, 2019, Computer Vision Foundation / IEEE,
2019,
ference on Machine Learning, ICML 2018, Stockholmsma¨ssan, pp. 1761–1770. doi:10.1109/CVPR.2019.00186.
Stockholm, Sweden, July 10-15, 2018, Vol. 80 of Proceedings URL http://openaccess.thecvf.com/content_CVPR_2019/
of Machine Learning Research, PMLR, 2018, pp. 677–686. html/Dong_Searching_for_a_Robust_Neural_Architecture_
URL http://proceedings.mlr.press/v80/cai18a.html in_Four_GPU_Hours_CVPR_2019_paper.html
[140] J. Fang, Y. Sun, K. Peng, Q. Zhang, Y. Li, W. Liu, X. Wang, [155] S. Xie, H. Zheng, C. Liu, L. Lin, SNAS: stochastic neural ar-
Fast neural network adaptation via parameter remapping and chitecture search, in: 7th International Conference on Learning
architecture search, in: 8th International Conference on Learn- Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
ing Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2019, OpenReview.net, 2019.
26-30, 2020, OpenReview.net, 2020. URL https://openreview.net/forum?id=rylqooRqK7
URL https://openreview.net/forum?id=rklTmyBKPH [156] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, K. Keutzer,
[141] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, Mixed precision quantization of convnets via differentiable
E. Choi, Morphnet: Fast & simple resource-constrained neural architecture search (2018). arXiv:1812.00090.
structure learning of deep networks, in: 2018 IEEE Conference [157] E. Jang, S. Gu, B. Poole, Categorical reparameterization with
on Computer Vision and Pattern Recognition, CVPR 2018, gumbel-softmax, in: 5th International Conference on Learning
Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Society, 2018, pp. 1586–1595. doi:10.1109/CVPR.2018.00171. Conference Track Proceedings, OpenReview.net, 2017.
URL http://openaccess.thecvf.com/content_cvpr_2018/ URL https://openreview.net/forum?id=rkE3y85ee
html/Gordon_MorphNet_Fast CVPR_2018_paper.html [158] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution:
[142] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for A continuous relaxation of discrete random variables, in: 5th
convolutional neural networks, in: K. Chaudhuri, R. Salakhut- International Conference on Learning Representations, ICLR
dinov (Eds.), Proceedings of the 36th International Conference 2017, Toulon, France, April 24-26, 2017, Conference Track
on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, Proceedings, OpenReview.net, 2017.
California, USA, Vol. 97 of Proceedings of Machine Learning URL https://openreview.net/forum?id=S1jE5L5gl
Research, PMLR, 2019, pp. 6105–6114. [159] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, Z. Li,
URL http://proceedings.mlr.press/v97/tan19a.html Darts+: Improved differentiable architecture search with early
[143] J. F. Miller, S. L. Harding, Cartesian genetic programming, stopping, arXiv preprint arXiv:1909.06035.
in: Proceedings of the 10th annual conference companion on [160] K. Kandasamy, W. Neiswanger, J. Schneider, B. Po´czos, E. P.
Genetic and evolutionary computation, ACM, 2008, pp. 2701– Xing, Neural architecture search with bayesian optimisation
2726. and optimal transport, in: S. Bengio, H. M. Wallach,
[144] J. F. Miller, S. L. Smith, Redundancy and computational H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett
efficiency in cartesian genetic programming, IEEE Transactions (Eds.), Advances in Neural Information Processing Systems 31:
on Evolutionary Computation 10 (2) (2006) 167–174. Annual Conference on Neural Information Processing Systems
[145] F. Gruau, Cellular encoding as a graph grammar, in: IEEE 2018, NeurIPS 2018, December 3-8, 2018, Montr´eal, Canada,
Colloquium on Grammatical Inference: Theory, Applications 2018, pp. 2020–2029.
& Alternatives, 1993. URL https://proceedings.neurips.cc/paper/2018/hash/
[146] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, f33ba15effa5c10e873bf3842afb46a6-Abstract.html
M. Jaderberg, M. Lanctot, D. Wierstra, Convolution by evolu- [161] R. Negrinho, G. Gordon, Deeparchitect: Automatically design-
tion: Differentiable pattern producing networks, in: Proceed- ing and training deep architectures (2017). arXiv:1704.08792.
ings of the Genetic and Evolutionary Computation Conference [162] R. Negrinho, M. R. Gormley, G. J. Gordon, D. Patil, N. Le,
2016, ACM, 2016, pp. 109–116. D. Ferreira, Towards modular and programmable architecture
[147] M. Kim, L. Rigazio, Deep clustered convolutional kernels, in: search, in: H. M. Wallach, H. Larochelle, A. Beygelzimer,
Feature Extraction: Modern Questions and Challenges, 2015, F. d’Alch´e-Buc, E. B. Fox, R. Garnett (Eds.), Advances in
pp. 160–172. Neural Information Processing Systems 32: Annual Conference
[148] J. K. Pugh, K. O. Stanley, Evolving multimodal controllers on Neural Information Processing Systems 2019, NeurIPS
with hyperneat, in: Proceedings of the 15th annual conference 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
on Genetic and evolutionary computation, ACM, 2013, pp. 13715–13725.
735–742. URL https://proceedings.neurips.cc/paper/2019/hash/
[149] H. Zhu, Z. An, C. Yang, K. Xu, E. Zhao, Y. Xu, Eena: Efficient 4ab50afd6dcc95fcba76d0fe04295632-Abstract.html
evolution of neural architecture (2019). arXiv:1905.07320. [163] G. Dikov, J. Bayer, Bayesian learning of neural network ar-
[150] R. J. Williams, Simple statistical gradient-following algorithms chitectures, in: K. Chaudhuri, M. Sugiyama (Eds.), The 22nd
for connectionist reinforcement learning, Machine learning 8 (3- International Conference on Artificial Intelligence and Statis-
4) (1992) 229–256. tics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan,
[151] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Vol. 89 of Proceedings of Machine Learning Research, PMLR,
Proximal policy optimization algorithms, arXiv preprint 2019, pp. 730–738.
arXiv:1707.06347. URL http://proceedings.mlr.press/v89/dikov19a.html
[152] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, [164] C. White, W. Neiswanger, Y. Savani, Bananas: Bayesian op-
A. Bies, M. Ferguson, K. Katz, B. Schasberger, The Penn timization with neural architectures for neural architecture
Treebank: Annotating predicate argument structure, in: Hu- search (2019). arXiv:1910.11858.
man Language Technology: Proceedings of a Workshop held at [165] M. Wistuba, Bayesian optimization combined with incremen-
Plainsboro, New Jersey, March 8-11, 1994, 1994. tal evaluation for neural network architecture optimization,

in: Proceedings of the International Workshop on Automatic [179] Y. Geifman, R. El-Yaniv, Deep active learning with a
Selection, Configuration and Composition of Machine Learning neural architecture search, in: H. M. Wallach, H. Larochelle,
Algorithms, 2017. A. Beygelzimer, F. d’Alch´e-Buc, E. B. Fox, R. Garnett (Eds.),
[166] J. Perez-Rua, M. Baccouche, S. Pateux, Efficient progressive Advances in Neural Information Processing Systems 32:
neural architecture search, in: British Machine Vision Confer- Annual Conference on Neural Information Processing Systems
ence 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
BMVA Press, 2018, p. 150. Canada, 2019, pp. 5974–5984.
URL http://bmvc2018.org/contents/papers/0291.pdf URL https://proceedings.neurips.cc/paper/2019/hash/
[167] C. E. Rasmussen, Gaussian processes in machine learning, b59307fdacf7b2db12ec4bd5ca1caba8-Abstract.html
Lecture Notes in Computer Science (2003) 63–71. [180] L. Li, A. Talwalkar, Random search and reproducibility for