
AutoML: A Survey of the State-of-the-Art

Xin He, Kaiyong Zhao, Xiaowen Chu∗


Department of Computer Science, Hong Kong Baptist University

Abstract
Deep learning (DL) techniques have obtained remarkable achievements on various tasks, such as image recognition,
object detection, and language modeling. However, building a high-quality DL system for a specific task highly relies
on human expertise, hindering its wide application. Meanwhile, automated machine learning (AutoML) is a promising
solution for building a DL system without human assistance and is being extensively studied. This paper presents a
comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. According to the DL pipeline, we
introduce AutoML methods –– covering data preparation, feature engineering, hyperparameter optimization, and neural
architecture search (NAS) –– with a particular focus on NAS, as it is currently a hot sub-topic of AutoML. We summarize
the representative NAS algorithms’ performance on the CIFAR-10 and ImageNet datasets and further discuss the following
subjects of NAS methods: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture optimization, and
resource-aware NAS. Finally, we discuss some open problems related to the existing AutoML methods for future research.
Keywords: deep learning, automated machine learning (AutoML), neural architecture search (NAS), hyperparameter
optimization (HPO)

1. Introduction

In recent years, deep learning has been applied in various fields and used to solve many challenging AI tasks, in areas such as image classification [1, 2], object detection [3], and language modeling [4, 5]. Specifically, since AlexNet [1] outperformed all other traditional manual methods in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6], increasingly complex and deep neural networks have been proposed. For example, VGG-16 [7] has more than 130 million parameters, occupies nearly 500 MB of memory space, and requires 15.3 billion floating-point operations to process an image of size 224 × 224. Notably, however, these models were all manually designed by experts through a trial-and-error process, which means that even experts require substantial resources and time to create well-performing models.

To reduce these onerous development costs, a novel idea of automating the entire pipeline of machine learning (ML) has emerged, i.e., automated machine learning (AutoML). There are various definitions of AutoML. For example, according to [8], AutoML is designed to reduce the demand for data scientists and enable domain experts to automatically build ML applications without much requirement for statistical and ML knowledge. In [9], AutoML is defined as a combination of automation and ML. In a word, AutoML can be understood as the automated construction of an ML pipeline on a limited computational budget. With the exponential growth of computing power, AutoML has become a hot topic in both industry and academia. A complete AutoML system can dynamically combine various techniques to form an easy-to-use end-to-end ML pipeline system (as shown in Figure 1). Many AI companies have created and publicly shared such systems (e.g., Cloud AutoML 1 by Google) to help people with little or no ML knowledge to build high-quality custom models.

As Figure 1 shows, the AutoML pipeline consists of several processes: data preparation, feature engineering, model generation, and model evaluation. Model generation can be further divided into search space and optimization methods. The search space defines the design principles of ML models, which can be divided into two categories: traditional ML models (e.g., SVM and KNN) and neural architectures. The optimization methods are classified into hyperparameter optimization (HPO) and architecture optimization (AO), where the former indicates the training-related parameters (e.g., the learning rate and batch size), and the latter indicates the model-related parameters (e.g., the number of layers for neural architectures and the number of neighbors for KNN). NAS consists of three important components: the search space of neural architectures, AO methods, and model evaluation methods. AO methods may also be referred to as the search strategy [10] or search policy [11].

∗ Corresponding author. Email addresses: csxinhe@comp.hkbu.edu.hk (Xin He), kyzhao@comp.hkbu.edu.hk (Kaiyong Zhao), chxw@comp.hkbu.edu.hk (Xiaowen Chu)
1 https://cloud.google.com/automl/

Preprint submitted to Knowledge-Based Systems, April 19, 2021


Figure 1: An overview of the AutoML pipeline covering data preparation (Section 2), feature engineering (Section 3), model generation (Section 4) and model evaluation (Section 5). Data preparation comprises data collection, data cleaning, and data augmentation; feature engineering comprises feature selection, feature extraction, and feature construction; model generation consists of the search space (traditional models such as SVM and KNN, or deep neural networks such as CNN and RNN) and the optimization methods (hyperparameter optimization and architecture optimization, i.e., NAS); model estimation covers low-fidelity, early-stopping, surrogate-model, and weight-sharing strategies.

Zoph et al. [12] were one of the first to propose NAS, where a recurrent network is trained by reinforcement learning to automatically search for the best-performing architecture. Since [12] successfully discovered a neural network achieving comparable results to human-designed models, there has been an explosion of research interest in AutoML, with most work focusing on NAS. NAS aims to search for a robust and well-performing neural architecture by selecting and combining different basic operations from a predefined search space. By reviewing NAS methods, we classify the commonly used search spaces into entire-structured [12, 13, 14], cell-based [13, 15, 16, 17, 18], hierarchical [19], and morphism-based [20, 21, 22] search spaces. The commonly used AO methods comprise reinforcement learning (RL) [12, 15, 23, 16, 13], evolution-based algorithms (EA) [24, 25, 26, 27, 28, 29, 30], gradient descent (GD) [17, 31, 32], surrogate model-based optimization (SMBO) [33, 34, 35, 36, 37, 38, 39], and hybrid AO methods [40, 41, 42, 43, 44].

Although there are already several excellent AutoML-related surveys [10, 45, 46, 9, 8], to the best of our knowledge, our survey covers a broader range of AutoML methods. As summarized in Table 1, [10, 45, 46] only focus on NAS, while [9, 8] cover little of the NAS technique. In this paper, we summarize the AutoML-related methods according to the complete AutoML pipeline (Figure 1), providing beginners with a comprehensive introduction to the field. Notably, many sub-topics of AutoML are large enough to have their own surveys. However, our goal is not to conduct a thorough investigation of all AutoML sub-topics. Instead, we focus on the breadth of research in the field of AutoML. Therefore, we will summarize and discuss some representative methods of each process in the pipeline.

Table 1: Comparison between different AutoML surveys. The "Survey" column gives each survey a label based on its title to increase readability. DP, FE, HPO, and NAS indicate data preparation, feature engineering, hyperparameter optimization, and neural architecture search, respectively. "-", "X", and "†" indicate that the content is 1) not mentioned, 2) discussed in detail, or 3) mentioned briefly in the original paper, respectively.

Survey                   DP   FE   HPO   NAS
NAS Survey [10]          -    -    -     X
A Survey on NAS [45]     -    -    -     X
NAS Challenges [46]      -    -    -     X
A Survey on AutoML [9]   -    X    X     †
AutoML Challenges [47]   X    -    X     †
AutoML Benchmark [8]     X    X    X     -
Ours                     X    X    X     X

The rest of this paper is organized as follows. The processes of data preparation, feature engineering, model generation, and model evaluation are presented in Sections 2, 3, 4, and 5, respectively. In Section 6, we compare the performance of NAS algorithms on the CIFAR-10 and ImageNet datasets, and discuss several subtopics of great concern in the NAS community: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture optimization, and resource-aware NAS. In Section 7, we describe several open problems in AutoML. We conclude our survey in Section 8.

2. Data Preparation

The first step in the ML pipeline is data preparation. Figure 2 presents the workflow of data preparation, which can be introduced in three aspects: data collection, data cleaning, and data augmentation.
Data collection is a necessary step to build a new dataset or extend an existing dataset. The process of data cleaning is used to filter noisy data so that downstream model training is not compromised. Data augmentation plays an important role in enhancing model robustness and improving model performance. The following subsections cover these three aspects in more detail.

Figure 2: The flow chart for data preparation: depending on whether enough data are available, whether existing datasets can be used, and whether the data quality can be improved, data are collected (by data searching or data synthesis), cleaned, and augmented before model training.

2.1. Data Collection

ML's deepening study has led to a consensus that high-quality datasets are of critical importance for ML; as a result, numerous open datasets have emerged. In the early stages of ML study, a handwritten digit dataset, i.e., MNIST [48], was developed. After that, several larger datasets like CIFAR-10 and CIFAR-100 [49] and ImageNet [50] were developed. A variety of datasets can also be retrieved by entering keywords into these websites: Kaggle 2, Google Dataset Search (GOODS) 3, and Elsevier Data Search 4.

2 https://www.kaggle.com
3 https://datasetsearch.research.google.com/
4 https://www.datasearch.elsevier.com/

However, it is usually challenging to find a proper dataset through the above approaches for some particular tasks, such as those related to medical care or other privacy matters. Two types of methods are proposed to solve this problem: data searching and data synthesis.

2.1.1. Data Searching

As the Internet is an inexhaustible data source, searching for Web data is an intuitive way to collect a dataset [51, 52, 53, 54]. However, there are some problems with using Web data.

First, the search results may not exactly match the keywords. Thus, unrelated data must be filtered. For example, Krause et al. [55] separate inaccurate results as cross-domain or cross-category noise, and remove any images that appear in search results for more than one category. Vo et al. [56] re-rank relevant results and provide search results linearly, according to keywords.

Second, Web data may be incorrectly labeled or even unlabeled. A learning-based self-labeling method is often used to solve this problem. For example, the active learning method [57] selects the most "uncertain" unlabeled individual examples for labeling by a human, and then iteratively labels the remaining data. Roh et al. [58] provided a review of semi-supervised learning self-labeling methods, which can help take the human out of the loop of labeling to improve efficiency, and which can be divided into the following categories: self-training [59, 60], co-training [61, 62], and co-learning [63]. Moreover, due to the complexity of Web image content, a single label cannot adequately describe an image. Consequently, Yang et al. [51] assigned multiple labels to a Web image, i.e., if the confidence scores of these labels are very close or the label with the highest score is the same as the original label of the image, then this image will be set as a new training sample.

However, the distribution of Web data can be extremely different from that of the target dataset, which will increase the difficulty of training the model. A common solution is to fine-tune on these Web data [64, 65]. Yang et al. [51] proposed an iterative algorithm for model training and Web data filtering. Dataset imbalance is another common problem, as some special classes have a very limited number of Web data. To solve this problem, the synthetic minority over-sampling technique (SMOTE) [66] is used to synthesize new minority samples between existing real minority samples, instead of simply up-sampling minority samples or down-sampling the majority samples. In another approach, Guo et al. [67] combined the boosting method with data generation to enhance the generalizability and robustness of the model against imbalanced data sets.
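To make the interpolation idea behind SMOTE concrete, the following minimal NumPy sketch synthesizes new minority samples on the line segment between a real minority sample and one of its k nearest minority neighbors. The function name and parameters are illustrative only; this is not the reference implementation of [66].

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, rng=None):
    """Synthesize new minority samples by interpolating between a real
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)      # distances to other minority samples
        neighbors = np.argsort(d)[1:k + 1]            # skip the sample itself
        x_nn = minority[rng.choice(neighbors)]
        lam = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(x + lam * (x_nn - x))        # point on the segment x -> x_nn
    return np.array(synthetic)

# Toy example: a 2-D minority class with 20 samples, extended by 30 synthetic ones.
minority = np.random.default_rng(0).normal(size=(20, 2))
new_samples = smote_like_oversample(minority, n_new=30, k=5, rng=0)
print(new_samples.shape)  # (30, 2)
```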
2.1.2. Data Synthesis

A data simulator is one of the most commonly used methods to generate data. For some particular tasks, such as autonomous driving, it is not possible to test and adjust a model in the real world during the research phase, due to safety hazards. Therefore, a practical approach to generating data is to use a data simulator that matches the real world as closely as possible. OpenAI Gym [68] is a popular toolkit that provides various simulation environments, in which developers can concentrate on designing their algorithms, instead of struggling to generate data. Wang et al. [69] used a popular game engine, Unreal Engine 4, to build a large synthetic indoor robotics stereo (IRS) dataset, which provides the information for disparity and surface normal estimation. Furthermore, a reinforcement learning-based method is applied in [70] for optimizing the parameters of a data simulator to control the distribution of the synthesized data.
Another novel technique for deriving synthetic data is Generative Adversarial Networks (GANs) [71], which can be used to generate image [71, 72, 73, 74], tabular [75, 76] and text [77] data. Karras et al. [78] applied the GAN technique to generate realistic human face images. Oh and Jaroensri et al. [72] built a synthetic dataset, which captures small motion for video-motion magnification. Bowles et al. [74] demonstrated the feasibility of using GAN to generate medical images for brain segmentation tasks. In the case of textual data, applying GAN to text has proved difficult because the commonly used method is to use reinforcement learning to update the gradient of the generator, but the text is discrete, and thus the gradient cannot propagate from the discriminator to the generator. To solve this problem, Donahue et al. [77] used an autoencoder to encode sentences into a smooth sentence representation to remove the barrier of reinforcement learning. Park et al. [75] applied GAN to synthesize fake tables that are statistically similar to the original table but do not cause information leakage. Similarly, in [76], GAN is applied to generate tabular data like medical or educational records.

2.2. Data Cleaning

The collected data inevitably contain noise, and the noise can negatively affect the training of the model. Therefore, the process of data cleaning [79, 80] must be carried out if necessary. Across the literature, the effort of data cleaning is shifting from crowdsourcing to automation. Traditionally, data cleaning requires specialist knowledge, but access to specialists is limited and generally expensive. Hence, Chu et al. [81] proposed Katara, a knowledge-based and crowd-powered data cleaning system. To improve efficiency, some studies [82, 83] proposed to clean only a small subset of the data while maintaining results comparable to the case of cleaning the full dataset. However, these methods require a data scientist to design which data cleaning operations are applied to the dataset. BoostClean [84] attempts to automate this process by treating it as a boosting problem. Each data cleaning operation effectively adds a new cleaning operation to the input of the downstream ML model, and through a combination of boosting and feature selection, a good series of cleaning operations, which can well improve the performance of the ML model, can be generated. AlphaClean [85] transforms data cleaning into a hyperparameter optimization problem, which further increases automation. Specifically, the final data cleaning combinatorial operation in AlphaClean is composed of several pipelined cleaning operations that need to be searched from a predefined search space. Gemp et al. [86] attempted to use the meta-learning technique to automate the process of data cleaning.

The data cleaning methods mentioned above are applied to a fixed dataset. However, the real world generates vast amounts of data every day. In other words, how to clean data in a continuous process becomes a problem worth studying, especially for enterprises. Ilyas et al. [87] proposed an effective way of evaluating algorithms for continuously cleaning data. Mahdavi et al. [88] built a cleaning workflow orchestrator, which can learn from previous cleaning tasks, and proposed promising cleaning workflows for new datasets.

2.3. Data Augmentation

To some degree, data augmentation (DA) can also be regarded as a tool for data collection, as it can generate new data based on the existing data. However, DA also serves as a regularizer to avoid over-fitting during model training and has received more and more attention. Therefore, we introduce DA in detail as a separate part of data preparation. Figure 3 classifies DA techniques from the perspective of data type (image, audio, and text), and incorporates automatic DA techniques that have recently received much attention.

Figure 3: A classification of data augmentation techniques.

For image data, the affine transformations include rotation, scaling, random cropping, and reflection; the elastic transformations contain operations like contrast shift, brightness shift, blurring, and channel shuffle; the advanced transformations involve random erasing, image blending, cutout [89], and mixup [90], etc. These three types of common transformations are available in some open-source libraries, like torchvision 5, ImageAug [91], and Albumentations [92].

5 https://pytorch.org/docs/stable/torchvision/transforms.html
In terms of neural-based transformations, they can be divided into three categories: adversarial noise [93], neural style transfer [94], and the GAN technique [95].
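As an illustration of how such image transformations are typically composed in practice, the following sketch builds a training pipeline with torchvision.transforms. The particular operations and parameter values are arbitrary choices for illustration, and cutout-like behavior is approximated here with RandomErasing.

```python
import torchvision.transforms as T

# Affine/elastic-style transformations chained with torchvision.transforms.
train_transform = T.Compose([
    T.RandomResizedCrop(224),                     # random cropping + scaling
    T.RandomHorizontalFlip(),                     # reflection
    T.RandomRotation(degrees=15),                 # rotation
    T.ColorJitter(brightness=0.4, contrast=0.4),  # brightness/contrast shift
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.5),                       # random erasing, applied on the tensor
])

# augmented = train_transform(pil_image)  # pil_image: a PIL.Image instance
```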
For textual data, Wong et al. [96] proposed two approaches for creating additional training examples: data warping and synthetic over-sampling. The former generates additional samples by applying transformations in data-space, and the latter creates additional samples in feature-space. Textual data can be augmented by synonym insertion or by first translating the text into a foreign language and then translating it back to the original language. In a recent study, Xie et al. [97] proposed a non-domain-specific DA policy that uses noising in RNNs, and this approach works well for the tasks of language modeling and machine translation. Yu et al. [98] proposed a back-translation method for DA to improve reading comprehension. NLPAug [99] is an open-source library that integrates many types of augmentation operations for both textual and audio data.

The above augmentation techniques still require humans to select augmentation operations and then form a specific DA policy for specific tasks, which requires much expertise and time. Recently, many methods [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110] have been proposed to search for augmentation policies for different tasks. AutoAugment [100] is a pioneering work that automates the search for optimal DA policies using reinforcement learning. However, AutoAugment is not efficient, as it takes almost 500 GPU hours for one augmentation search. In order to improve search efficiency, a number of improved algorithms have subsequently been proposed using different search strategies, such as gradient descent-based search [101, 102], Bayesian-based optimization [103], online hyperparameter learning [109], greedy-based search [104] and random search [107]. Besides, LingChen et al. [110] proposed a search-free DA method, namely UniformAugment, by assuming that the augmentation space is approximately distribution invariant.

3. Feature Engineering

It is generally accepted that data and features determine the upper bound of ML, and that models and algorithms can only approximate this limit. In this context, feature engineering aims to maximize the extraction of features from raw data for use by algorithms and models. Feature engineering consists of three sub-topics: feature selection, feature extraction, and feature construction. Feature extraction and construction are variants of feature transformation, by which a new set of features is created [111]. In most cases, feature extraction aims to reduce the dimensionality of features by applying specific mapping functions, while feature construction is used to expand original feature spaces, and the purpose of feature selection is to reduce feature redundancy by selecting important features. Thus, the essence of automatic feature engineering is, to some degree, a dynamic combination of these three processes.

3.1. Feature Selection

Feature selection builds a feature subset based on the original feature set by reducing irrelevant or redundant features. This tends to simplify the model, hence avoiding overfitting and improving model performance. The selected features are usually divergent and highly correlated with object values. According to [112], there are four basic steps in a typical process of feature selection (see Figure 4).

Figure 4: The iterative process of feature selection (subset generation based on a search strategy, subset evaluation, stopping criterion, and validation). A subset of features is selected, based on a search strategy, and then evaluated. Then, a validation procedure is implemented to determine whether the subset is valid. The above steps are repeated until the stop criterion is satisfied.

The search strategy for feature selection involves three types of algorithms: complete search, heuristic search, and random search. Complete search comprises exhaustive and non-exhaustive searching; the latter can be further split into four methods: breadth-first search, branch and bound search, beam search, and best-first search. Heuristic search comprises sequential forward selection (SFS), sequential backward selection (SBS), and bidirectional search (BS). In SFS and SBS, the features are added from an empty set or removed from a full set, respectively, whereas BS uses both SFS and SBS to search until these two algorithms obtain the same subset. The most commonly used random search methods are simulated annealing (SA) and genetic algorithms (GAs).

Methods of subset evaluation can be divided into three different categories. The first is the filter method, which scores each feature according to its divergence or correlation and then selects features according to a threshold. Commonly used scoring criteria for each feature are variance, the correlation coefficient, the chi-square test, and mutual information. The second is the wrapper method, which classifies the sample set with the selected feature subset, after which the classification accuracy is used as the criterion to measure the quality of the feature subset. The third method is the embedded method, in which variable selection is performed as part of the learning procedure.
Regularization, decision tree, and deep learning are all embedded methods.
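The filter method described above can be illustrated with a short scikit-learn sketch that scores each feature independently and keeps the top-k; the dataset and the value of k here are arbitrary examples.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_digits(return_X_y=True)

# Filter method: score every feature independently (here with the chi-square
# test) and keep the 20 highest-scoring features.
selector = SelectKBest(score_func=chi2, k=20)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # (1797, 64) -> (1797, 20)

# Mutual information is another commonly used filter criterion.
mi_selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_mi = mi_selector.fit_transform(X, y)
```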
3.2. Feature Construction

Feature construction is a process that constructs new features from the basic feature space or raw data to enhance the robustness and generalizability of the model. Essentially, this is done to increase the representative ability of the original features. This process is traditionally highly dependent on human expertise, and one of the most commonly used methods is preprocessing transformation, such as standardization, normalization, or feature discretization. In addition, the transformation operations for different types of features may vary. For example, operations such as conjunction, disjunction and negation are typically used for Boolean features; operations such as minimum, maximum, addition, subtraction, and mean are typically used for numerical features; and operations such as the Cartesian product [113] and M-of-N [114] are commonly used for nominal features.

It is impossible to manually explore all possibilities. Hence, to further improve efficiency, some automatic feature construction methods [115, 114, 116, 117] have been proposed to automate the process of searching for and evaluating operation combinations, and they have been shown to achieve results as good as or superior to those achieved by human expertise. Besides, some feature construction methods, such as decision tree-based methods [115, 114] and genetic algorithms [116], require a predefined operation space, while the annotation-based approaches [117] do not, as they can use domain knowledge (in the form of annotation) and the training examples, and hence can be traced back to the interactive feature-space construction protocol introduced by [118]. Using this protocol, the learner identifies inadequate regions of feature space and, in coordination with a domain expert, adds descriptiveness using existing semantic resources. After selecting possible operations and constructing a new feature, feature-selection techniques are applied to evaluate the new feature.

3.3. Feature Extraction

Feature extraction is a dimensionality-reduction process performed via some mapping functions. It extracts informative and non-redundant features according to certain metrics. Unlike feature selection, feature extraction alters the original features. The kernel of feature extraction is a mapping function, which can be implemented in many ways. The most prominent approaches are principal component analysis (PCA), independent component analysis, isomap, nonlinear dimensionality reduction, and linear discriminant analysis (LDA). Recently, the feed-forward neural network approach has become popular; this uses the hidden units of a pretrained model as extracted features. Furthermore, many autoencoder-based algorithms have been proposed; for example, Zeng et al. [119] proposed a relation autoencoder model that considers data features and their relationships, while an unsupervised feature-extraction method using autoencoder trees is proposed by [120].
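A minimal scikit-learn sketch of feature extraction via a mapping function is shown below, using PCA to project the original features onto a lower-dimensional space; the dataset and the variance threshold are arbitrary examples.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Feature extraction via a mapping function: project the original 64-dimensional
# features onto the principal components that retain 95% of the variance.
pca = PCA(n_components=0.95)
X_new = pca.fit_transform(X)
print(X.shape, "->", X_new.shape)   # e.g., (1797, 64) -> (1797, ~29)
```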
proposed to automate the process of searching and evaluat-
ing the operation combination, and shown to achieve results
4. Model Generation
as good as or superior to those achieved by human exper-
tise. Besides, some feature construction methods, such as Model generation is divided into two parts––search
decision tree-based methods [115, 114] and genetic algo- space and optimization methods––as shown in Figure 1.
rithms [116], require a predefined operation space, while The search space defines the model structures that can be
the annotation-based approaches [117] do not, as they can designed and optimized in principle. The types of models
use domain knowledge (in the form of annotation) and can be broadly divided into two categories: traditional
the training examples, and hence, can be traced back to ML models, such as support vector machine (SVM) [121]
the interactive feature-space construction protocol intro- and k-nearest neighbors algorithm (KNN) [122], and deep
duced by [118]. Using this protocol, the learner identifies neural network (DNN). There are two types of parameters
inadequate regions of feature space and, in coordination for the optimization methods: hyperparameters used for
with a domain expert, adds descriptiveness using existing training, such as the learning rate, and those used for model
semantic resources. After selecting possible operations and design, such as the filter size and the number of layers
constructing a new feature, feature-selection techniques are for DNN. Neural architecture search (NAS) has recently
applied to evaluate the new feature. attracted considerable attention; therefore, in this section,
we introduce the search space and optimization methods of
3.3. Feature Extraction NAS technique. Readers who are interested in traditional
Feature extraction is a dimensionality-reduction process models (e.g., SVM) can refer to other reviews [9, 8].
performed via some mapping functions. It extracts infor- Figure 5 presents an overview of the NAS pipeline,
mative and non-redundant features according to certain which is categorized into the following three dimensions
metrics. Unlike feature selection, feature extraction alters [10, 123]: search space, architecture optimization (AO)
the original features. The kernel of feature extraction is a method6 , and model evaluation method.
mapping function, which can be implemented in many ways.
The most prominent approaches are principal component • Search Space. The search space defines the design
analysis (PCA), independent component analysis, isomap, principles of neural architectures. Different scenarios
nonlinear dimensionality reduction, and linear discriminant require different search spaces. Here, we summarize
analysis (LDA). Recently, the feed-forward neural network four types of commonly used search spaces: entire-
approach has become popular; this uses the hidden units structured, cell-based, hierarchical, and morphism-
of a pretrained model as extracted features. Furthermore, based.
many autoencoder-based algorithms are proposed; for ex-
ample, Zeng et al. [119] proposed a relation autoencoder 6 It can also be referred to as the “search strategy [10, 123]”,
model that considers data features and their relationships, “search policy [11]”, or “optimization method [45, 9]”.

6
• Architecture Optimization Method. The architecture optimization (AO) method defines how to guide the search to efficiently find the model architecture with high performance after the search space is defined.

• Model Evaluation Method. Once a model is generated, its performance needs to be evaluated. The simplest approach is to train the model to converge on the training set and then estimate its performance on the validation set; however, this method is time-consuming and resource-intensive. Some advanced methods can accelerate the evaluation process but lose fidelity in the process. Thus, how to balance the efficiency and effectiveness of an evaluation is a problem worth studying.

The search space and AO methods are presented in this section, while the methods of model evaluation are presented in the next section.

4.1. Search Space

A neural architecture can be represented as a directed acyclic graph (DAG) comprising B ordered nodes. In the DAG, each node and directed edge indicate a feature tensor and an operation, respectively. Eq. 1 presents the formula for the computation at any node Z_k, k ∈ {1, 2, ..., B}:

Z_k = \sum_{i=1}^{N_k} o_i(I_i), \quad o_i \in O    (1)

where N_k indicates the indegree of node Z_k, I_i and o_i represent the i-th input tensor and its associated operation, respectively, and O is a set of candidate operations, such as convolution, pooling, activation functions, skip connection, concatenation, and addition. To further enhance the model performance, many NAS methods use certain advanced human-designed modules as primitive operations, such as depth-wise separable convolution [124], dilated convolution [125], and squeeze-and-excitation (SE) blocks [126]. The selection and combination of these operations vary with the design of the search space. In other words, the search space defines the structural paradigm that AO methods can explore; thus, designing a good search space is a vital but challenging problem. In general, a good search space is expected to exclude human bias and be flexible enough to cover a wider variety of model architectures.
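To make Eq. 1 concrete, the following PyTorch sketch computes one DAG node as the sum of candidate operations applied to its input tensors. The operation set, channel sizes, and class names are illustrative placeholders rather than a prescribed NAS search space.

```python
import torch
import torch.nn as nn

# Candidate operation set O (illustrative): each entry builds a module for a given channel count.
O = {
    "conv3x3": lambda c: nn.Conv2d(c, c, 3, padding=1),
    "maxpool": lambda c: nn.MaxPool2d(3, stride=1, padding=1),
    "skip":    lambda c: nn.Identity(),
}

class Node(nn.Module):
    """One DAG node: Z_k = sum_i o_i(I_i), with one operation per incoming edge."""
    def __init__(self, channels, op_names):
        super().__init__()
        self.ops = nn.ModuleList([O[name](channels) for name in op_names])

    def forward(self, inputs):                 # inputs: list of tensors I_1..I_Nk
        return sum(op(x) for op, x in zip(self.ops, inputs))

node = Node(channels=16, op_names=["conv3x3", "skip"])
x1 = torch.randn(1, 16, 32, 32)
x2 = torch.randn(1, 16, 32, 32)
z_k = node([x1, x2])                           # Z_k = conv3x3(I_1) + skip(I_2)
print(z_k.shape)                               # torch.Size([1, 16, 32, 32])
```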
4.1.1. Entire-structured Search Space
4.1. Search Space
The space of entire-structured neural networks [12, 13]
A neural architecture can be represented as a direct is one of the most intuitive and straightforward search
acyclic graph (DAG) comprising B ordered nodes. In DAG, spaces. Figure 6 presents two simplified examples of entire-
each node and directed edge indicate a feature tensor and structured models, which are built by stacking a predefined
an operation, respectively. Eq. 1 presents a formula for number of nodes, where each node represents a layer and
computation at any node Zk , k ∈ {1, 2, ..., B}. performs a specified operation. The left model shown in
Figure 6 indicates the simplest structure, while the right
output output model is relatively complex, as it permits arbitrary skip
connections [2] to exist between the ordered nodes; these
connections have been proven effective in practice [12]. Al-
though an entire structure is easy to implement, it has
L4 conv 3x3 conv 3x3 several disadvantages. For example, it is widely accepted
that the deeper is the model, the better is its generaliza-
tion ability; however, searching for such a deep network
is onerous and computationally expensive. Furthermore,
L3 conv 5x5 conv 5x5
the generated architecture lacks transferability; that is, a
model generated on a small dataset may not fit a larger
dataset, which necessitates the generation of a new model
L2 conv 3x3 conv 3x3 for a larger dataset.

4.1.2. Cell-based Search Space


Motivation. To enable the transferability of the gener-
L1 max pool max pool ated model, the cell-based search space has been proposed
[15, 16, 13], in which the neural architecture is composed
of a fixed number of repeating cell structures. This de-
sign approach is based on the observation that many well-
input input
performing human-designed models [2, 127] are also built
by stacking a fixed number of modules. For example, the
Figure 6: Two simplified examples of entire-structured neural archi- ResNet family builds many variants, such as ResNet50,
tectures. Each layer is specified with a different operation, such as
convolution and max-pooling operations. The edge indicates the infor- ResNet101, and ResNet152, by stacking several BottleNeck
mation flow. The skip-connection operation used in the right example modules [2]. Throughout the literature, this repeated mod-
can help explore deeper and more complex neural architectures. ule is referred to as a motif, cell, or block, while in this
paper, we call it a cell.
Design. Figure 7 (left) presents an example of a final
Nk
X cell-based neural network, which comprises two types of
Zk = oi (Ii ), oi ∈ O (1)
cells: normal and reduction cells. Thus, the problem of
i=1
searching for a full neural architecture is simplified into
7
Design. Figure 7 (left) presents an example of a final cell-based neural network, which comprises two types of cells: normal and reduction cells. Thus, the problem of searching for a full neural architecture is simplified into searching for an optimal cell structure in the context of the cell-based search space. Besides, the output of the normal cell retains the same spatial dimension as the input, and the number of normal-cell repeats is usually set manually based on the actual demand. The reduction cell follows behind a normal cell and has a similar structure to that of the normal cell, with the differences being that the width and height of the output feature maps of the reduction cell are half the input, and the number of channels is twice the input. This design approach follows the common practice of manually designing neural networks. Unlike the entire-structured search space, the model built on the cell-based search space can be expanded to form a larger model by simply adding more cells, without re-searching for the cell structure. Meanwhile, many approaches [17, 13, 15] have experimentally demonstrated the transferability of the model generated in the cell-based search space; for example, the model built on CIFAR-10 can also achieve results comparable to those of SOTA human-designed models on ImageNet.

Figure 7: (Left) Example of a cell-based model comprising three motifs, each with n normal cells and one reduction cell. (Right) Example of a normal cell, which contains two blocks, each having two nodes. Each node is specified with a different operation and input.

The design paradigm of the internal cell structure of most NAS studies refers to Zoph et al. [15], who were among the first to propose the exploration of the cell-based search space. Figure 7 (right) shows an example of a normal cell structure. Each cell contains B blocks (here B = 2), and each block has two nodes. Each node in a block can be assigned different operations and receive different inputs. The outputs of the two nodes in a block can be combined through an addition or concatenation operation. Therefore, each block can be represented by a five-element tuple (I_1, I_2, O_1, O_2, C), where I_1, I_2 ∈ I_b indicate the inputs to the block, O_1, O_2 ∈ O indicate the operations applied to the inputs, and C ∈ C describes how to combine O_1 and O_2. As the blocks are ordered, the set of candidate inputs I_b for the nodes in block b_k contains the output of the previous two cells and the output set of all previous blocks {b_i, i < k} of the same cell. The first two inputs of the first cell of the whole model are set to the image data by default.

In the actual implementation, certain essential details need to be noted. First, the number of channels may differ for different inputs. A commonly used solution is to apply a calibration operation on each node's input tensor to ensure that all inputs have the same number of channels. The calibration operation generally uses 1×1 convolution filters, such that it will not change the width and height of the input tensor, but keeps the channel number of all input tensors consistent. Second, as mentioned above, the input of a node in a block can be obtained from the previous two cells or the previous blocks within the same cell; hence, the cells' outputs must have the same spatial resolution. To this end, if the input/output resolutions are different, the calibration operation has stride 2; otherwise, it has stride 1. Besides, all blocks have stride 1.

Complexity. Searching for a cell structure is more efficient than searching for an entire structure. To illustrate this, let us assume that there are M predefined candidate operations, the number of layers for both the entire and the cell-based structures is L, and the number of blocks in a cell is B. Then, the number of possible entire structures can be expressed as

N_{entire} = M^{L} \times 2^{\frac{L \times (L-1)}{2}}    (2)

The number of possible cells is (M^{B} \times (B+2)!)^{2}. However, as there are two types of cells (i.e., normal and reduction cells), the final size of the cell-based search space is calculated as

N_{cell} = (M^{B} \times (B+2)!)^{4}    (3)

Evidently, the complexity of searching for the entire structure grows exponentially with the number of layers. For an intuitive comparison, we assign the variables in Eqs. 2 and 3 the typical values in the literature, i.e., M = 5, L = 10, B = 3; then N_{entire} = 3.44 × 10^{20} is much larger than N_{cell} = 5.06 × 10^{16}.
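These counts can be reproduced with a few lines of Python, which may help when experimenting with other values of M, L, and B:

```python
from math import factorial

M, L, B = 5, 10, 3   # candidate operations, layers, blocks per cell (Eqs. 2 and 3)

n_entire = M ** L * 2 ** (L * (L - 1) // 2)        # Eq. 2
n_cell = (M ** B * factorial(B + 2)) ** 4          # Eq. 3

print(f"N_entire = {n_entire:.2e}")   # 3.44e+20
print(f"N_cell   = {n_cell:.2e}")     # 5.06e+16
```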
Two-stage Gap. The NAS methods of the cell-based search space usually comprise two phases: search and evaluation. First, in the search phase, the best-performing model is selected, and then, in the evaluation phase, it is trained from scratch or fine-tuned. However, there exists a large gap in model depth between the two phases. As Figure 8 (left) shows, for DARTS [17], the generated model in the search phase only comprises eight cells to reduce the GPU memory consumption, while in the evaluation phase, the number of cells is extended to 20.
Although the search phase finds the best cell structure for the shallow model, this does not mean that it is still suitable for the deeper model in the evaluation phase. In other words, simply adding more cells may deteriorate the model performance. To bridge this gap, Chen et al. [128] proposed an improved method based on DARTS, namely progressive DARTS (P-DARTS), which divides the search phase into multiple stages and gradually increases the depth of the searched networks at the end of each stage, hence bridging the gap between search and evaluation. However, increasing the number of cells in the search phase may result in heavier computational overhead. Thus, to reduce the computational consumption, P-DARTS gradually reduces the number of candidate operations from 5 to 3, and then 2, through search space approximation methods, as shown in Figure 8. Experimentally, P-DARTS obtains a 2.50% error rate on the CIFAR-10 test dataset, outperforming the 2.83% error rate achieved by DARTS.

Figure 8: Difference between DARTS [17] and P-DARTS [128]. Both methods search and evaluate networks on the CIFAR-10 dataset. As the number of cell structures increases from 5 to 11 and then 17, the number of candidate operations is gradually reduced accordingly.

4.1.3. Hierarchical Search Space

The cell-based search space enables the transferability of the generated model, and most of the cell-based methods [13, 15, 23, 16, 25, 26] follow a two-level hierarchy: the inner level is the cell level, which selects the operation and connection for each node in the cell, and the outer level is the network level, which controls the spatial-resolution changes. However, these approaches focus on the cell level and ignore the network level. As shown in Figure 7, whenever a fixed number of normal cells are stacked, the spatial dimension of the feature maps is halved by adding a reduction cell. To jointly learn a suitable combination of repeatable cell and network structures, Liu et al. [129] defined a general formulation for a network-level structure, depicted in Figure 9, from which many existing good network designs can be reproduced. In this way, we can fully explore the different numbers of channels and sizes of feature maps of each layer in the network.

Figure 9: Network-level search space proposed by [129]. The blue point (top-left) indicates the fixed "stem" structure, the remaining gray and orange points are cell structures, as described above. The black arrows along the orange points indicate the final selected network-level structure. "d" and "L" indicate the down-sampling rate and layer, respectively.

Figure 10: Example of a three-level hierarchical architecture representation. The level-one primitive operations (e.g., 1×1 convolution, 3×3 convolution, and 3×3 max-pooling) are assembled into level-two cells. The level-two cells are viewed as primitive operations and assembled into level-three cells.

In terms of the cell level, the number of blocks (B) in a cell is still manually predefined and fixed in the search stage. In other words, B is a new hyperparameter that requires tuning by human input. To address this problem, Liu et al. [19] proposed a novel hierarchical genetic representation scheme, namely HierNAS, in which a higher-level cell is generated by iteratively incorporating lower-level cells. As shown in Figure 10, level-one cells can be some primitive operations, such as 1×1 and 3×3 convolution and 3×3 max-pooling, and are the basic components of level-two cells. Then, level-two cells are used as primitive operations to generate level-three cells. The highest-level cell is a single motif corresponding to the full architecture. Besides, a higher-level cell is defined by a learnable adjacency upper-triangular matrix G, where G_{ij} = k indicates that the k-th operation o_k is implemented between nodes i and j. For example, the level-two cell shown in Figure 10(a) is defined by a matrix G, where G_{01} = 2, G_{02} = 1, G_{12} = 0 (the index starts from 0). This method can identify more types of cell structures with more complex and flexible topologies. Similarly, Liu et al. [18] proposed progressive NAS (PNAS) to search for the cell progressively, starting from the simplest cell structure, which is composed of only one block, and then expanding to a higher-level cell by adding more possible block structures. Moreover, PNAS improves the search efficiency by using a surrogate model to predict the top-k promising blocks from the search space at each stage of cell construction.
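The upper-triangular encoding can be illustrated with a small NumPy sketch that reproduces the G_{01} = 2, G_{02} = 1, G_{12} = 0 example; the mapping from operation indices to concrete primitives, and treating index 0 as "no operation", are assumptions made here for illustration only.

```python
import numpy as np

# Assumed, illustrative numbering of level-one primitives; index 0 is read as "no edge".
ops = {0: "none", 1: "3x3 conv", 2: "1x1 conv", 3: "3x3 max-pooling"}

# Upper-triangular adjacency matrix of a three-node level-two cell:
# G[i, j] = k means operation k is placed on the edge from node i to node j.
G = np.zeros((3, 3), dtype=int)
G[0, 1], G[0, 2], G[1, 2] = 2, 1, 0   # the example cell from the text

for i, j in zip(*np.triu_indices(3, k=1)):
    print(f"node {i} -> node {j}: {ops[G[i, j]]}")
# node 0 -> node 1: 1x1 conv
# node 0 -> node 2: 3x3 conv
# node 1 -> node 2: none
```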
at each stage of cell construction.
For both HierNAS and PNAS, once a cell structure is
The cell-based search space enables the transferability searched, it is used in all network layers, which limits the
of the generated model, and most of the cell-based methods layer diversity. Besides, for achieving both high accuracy
9
Besides, to achieve both high accuracy and low latency, some studies [130, 131] proposed to search for complex and fragmented cell structures. For example, Tan et al. [130] proposed MnasNet, which uses a novel factorized hierarchical search space to generate different cell structures, namely MBConv, for different layers of the final network. Figure 11 presents the factorized hierarchical search space of MnasNet, which comprises a predefined number of cell structures. Each cell has a different structure and contains a variable number of blocks: all blocks in the same cell exhibit the same structure, while those in other cells exhibit different structures. As this design method can achieve a suitable balance between model performance and latency, many subsequent studies [131, 132] have referred to it. Owing to the large computational consumption, most of the differentiable NAS (DNAS) techniques (e.g., DARTS) first search for a suitable cell structure on a proxy dataset (e.g., CIFAR-10), and then transfer it to a larger target dataset (e.g., ImageNet). Han et al. [132] proposed ProxylessNAS, which can directly search for neural networks on the targeted dataset and hardware platforms by using BinaryConnect [133], which addresses the high memory consumption issue.

Figure 11: Factorized hierarchical search space in MnasNet [130]. The final network comprises different cells. Each cell is composed of a variable number of repeated blocks, where the blocks in the same cell share the same structure but differ from those in the other cells.

4.1.4. Morphism-based Search Space

Isaac Newton is reported to have said that "If I have seen further, it is by standing on the shoulders of giants." Similarly, several training tricks have been proposed, such as knowledge distillation [134] and transfer learning [135]. However, these methods do not directly modify the model structure. To this end, Chen et al. [20] proposed the Net2Net technique for designing new neural networks based on an existing network by inserting identity morphism (IdMorph) transformations between the neural network layers. An IdMorph transformation is function-preserving and can be classified into two types, depth and width IdMorph (shown in Figure 12), which makes it possible to replace the original model with an equivalent model that is deeper or wider.

Figure 12: Net2DeeperNet and Net2WiderNet transformations in [20]. "IdMorph" refers to the identity morphism operation. The value on each edge indicates the weight.

However, IdMorph is limited to width and depth changes, and can only modify them separately; moreover, the sparsity of its identity layer can create problems [2]. Therefore, an improved method has been proposed, namely network morphism [21], which allows a child network to inherit all knowledge from its well-trained parent network and continue to grow into a more robust network within a shortened training time. Compared with Net2Net, network morphism exhibits the following advantages: 1) it can embed non-identity layers and handle arbitrary nonlinear activation functions, and 2) it can simultaneously perform depth, width, and kernel-size morphing in a single operation, whereas Net2Net has to consider depth and width changes separately. The experimental results in [21] show that network morphism can substantially accelerate the training process, as it uses one-fifteenth of the training time and achieves better results than the original VGG16.
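The function-preserving property of a depth IdMorph can be checked with a tiny NumPy sketch: inserting an identity-initialized layer after a ReLU leaves the network output unchanged. This is only a toy illustration under simplified assumptions, not the Net2Net implementation from [20].

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network with ReLU activation.
relu = lambda v: np.maximum(v, 0.0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
forward = lambda x: W2 @ relu(W1 @ x)

# Depth IdMorph: insert a layer initialized to the identity matrix between W1 and W2.
# Because ReLU(I @ h) == h for h >= 0 (the output of the previous ReLU), the deeper
# network initially computes exactly the same function as the original one.
W_new = np.eye(8)
deeper_forward = lambda x: W2 @ relu(W_new @ relu(W1 @ x))

x = rng.normal(size=4)
print(np.allclose(forward(x), deeper_forward(x)))   # True
```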
Several subsequent studies [27, 22, 136, 137, 138, 139, 140, 141] are based on network morphism. For instance, Jin et al. [22] proposed a framework that enables Bayesian optimization to guide the network morphism for an efficient neural architecture search. Wei et al. [136] further improved network morphism at a higher level, i.e., by morphing a convolutional layer into an arbitrary module of a neural network. Additionally, Tan and Le [142] proposed EfficientNet, which re-examines the effect of model scaling on convolutional neural networks, and proved that carefully balancing the network depth, width, and resolution can lead to better performance.
4.2. Architecture Optimization

After defining the search space, we need to search for the best-performing architecture, a process we call architecture optimization (AO). Traditionally, the architecture of a neural network is regarded as a set of static hyperparameters that are tuned based on the performance observed on the validation set. However, this process highly depends on human experts and requires considerable time and resources for trial and error. Therefore, many AO methods have been proposed to free humans from this tedious procedure and to search for novel architectures automatically. Below, we detail the commonly used AO methods.

4.2.1. Evolutionary Algorithm

The evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm that takes inspiration from biological evolution. Compared with traditional optimization algorithms such as exhaustive methods, EA is a mature global optimization method with high robustness and broad applicability. It can effectively address complex problems that traditional optimization algorithms struggle to solve, without being limited by the problem's nature.

Encoding Scheme. Different EAs may use different types of encoding schemes for network representation. There are two types of encoding schemes: direct and indirect.

Direct encoding is a widely used method that explicitly specifies the phenotype. For example, genetic CNN [30] encodes the network structure into a fixed-length binary string, e.g., 1 indicates that two nodes are connected, and vice versa. Although binary encoding can be performed easily, its computational space is the square of the number of nodes, and the length is fixed, i.e., predefined manually. For representing variable-length neural networks, DAG encoding is a promising solution [28, 25, 19]. For example, Suganuma et al. [28] used the Cartesian genetic programming (CGP) [143, 144] encoding scheme to represent a neural network built from a list of sub-modules that are defined as a DAG. Similarly, in [25], the neural architecture is also encoded as a graph, whose vertices indicate rank-3 tensors or activations (with batch normalization performed with rectified linear units (ReLUs) or plain linear units) and whose edges indicate identity connections or convolutions. Neuroevolution of augmenting topologies (NEAT) [24, 25] also uses a direct encoding scheme, where each node and connection is stored.

Indirect encoding specifies a generation rule to build the network and allows for a more compact representation. Cellular encoding (CE) [145] is an example of a system that utilizes indirect encoding of network structures. It encodes a family of neural networks into a set of labeled trees and is based on a simple graph grammar. Some recent studies [146, 147, 148, 27] have described the use of indirect encoding schemes to represent a network. For example, the network in [27] can be encoded by a function, and each network can be modified using function-preserving network morphism operators. Hence, the child network has increased capacity and is guaranteed to perform at least as well as the parent networks.

Figure 13: Overview of the evolutionary algorithm: after initialization, the population is evolved by repeated selection, crossover, mutation, and update steps until the stopping criterion triggers termination.

Four Steps. A typical EA comprises the following steps: selection, crossover, mutation, and update (Figure 13):

• Selection. This step involves selecting a portion of the networks from all generated networks for the crossover, which aims to maintain well-performing neural architectures while eliminating the weak ones. The following three strategies are adopted for network selection. The first is fitness selection, in which the probability of a network being selected is proportional to its fitness value, i.e., P(h_i) = \frac{Fitness(h_i)}{\sum_{j=1}^{N} Fitness(h_j)}, where h_i indicates the i-th network. The second is rank selection, which is similar to fitness selection, but with the network's selection probability being proportional to its relative fitness rather than its absolute fitness. The third method is tournament selection [25, 27, 26, 19]. Here, in each iteration, k (tournament size) networks are randomly selected from the population and sorted according to their performance; then, the best network is selected with a probability of p, the second-best network has a probability of p × (1 − p), and so on (a minimal sketch of tournament selection is given after this list).

• Crossover. After selection, every two networks are selected to generate a new offspring network, inheriting half of the genetic information of each of its parents. This process is analogous to genetic recombination, which occurs during biological reproduction and crossover. The particular manner of crossover varies and depends on the encoding scheme. In binary encoding, networks are encoded as a linear string of bits, where each bit represents a unit, such that two parent networks can be combined through one- or multiple-point crossover.
However, the crossover of data arranged in such a fashion can sometimes damage the data. Thus, Xie et al. [30] denoted the basic unit in a crossover as a stage rather than a bit, which is a higher-level structure constructed by a binary string. For cellular encoding, a randomly selected sub-tree is cut from one parent tree to replace a sub-tree cut from the other parent tree. In another approach, NEAT performs an artificial synapsis based on historical markings, adding a new structure without losing track of the gene present throughout the simulation.

• Mutation. As the genetic information of the parents is copied and inherited by the next generation, gene mutation also occurs. A point mutation [28, 30] is one of the most widely used operations and involves randomly and independently flipping each bit. Two types of mutations have been described in [29]: one enables or disables a connection between two layers, and the other adds or removes skip connections between two nodes or layers. Meanwhile, Real and Moore et al. [25] predefined a set of mutation operators, such as altering the learning rate and removing skip connections between the nodes. By analogy with the biological process, although a mutation may appear as a mistake that causes damage to the network structure and leads to a loss of functionality, it also enables the exploration of more novel structures and ensures diversity.

• Update. Many new networks are generated by completing the above steps, and considering the limitations on computational resources, some of these must be removed. In [25], the worst-performing network of two randomly selected networks is immediately removed from the population. Alternatively, in [26], the oldest networks are removed. Other methods [29, 30, 28] discard all models at regular intervals. However, Liu et al. [19] did not remove any network from the population, and instead allowed the network number to grow with time. Zhu et al. [149] regulated the population number through a variable λ, i.e., they removed the worst model with probability λ and the oldest model with probability 1 − λ.
SOTA results on the CIFAR-10 and Penn Treebank (PTB)
4.2.2. Reinforcement Learning [152] datasets, they incur considerable time and computa-
Zoph et al. [12] were among the first to apply reinforce- tional resources. For instance, the authors in [12] took 28
ment learning (RL) to neural architecture search. Figure 14 days and 800 K40 GPUs to search for the best-performing
presents an overview of an RL-based NAS algorithm. Here, architecture, and MetaQNN [23] also took 10 days and 10
the controller is usually a recurrent neural network (RNN) GPUs to complete its search. To this end, some improved
that executes an action At at each step t to sample a new ar- RL-based algorithms have been proposed. BlockQNN [16]
chitecture from the search space and receives an observation uses a distributed asynchronous framework and an early-
of the state St together with a reward scalar Rt from the stop strategy to complete searching on only one GPU
environment to update the controller’s sampling strategy. within 20 hours. The efficient neural architecture search
Environment refers to the use of a standard neural net- (ENAS) [13] is even better, as it adopts a parameter-sharing
work training procedure to train and evaluate the network strategy in which all child architectures are regarded as
generated by the controller, after which the corresponding sub-graphs of a supernet; this enables these architectures

12
to share parameters, obviating the need to train each child can be efficiently solved as a regular training, the searched
model from scratch. Thus, ENAS took only approximately architecture α commonly overfits the training set and its
10 hours using one GPU to search for the best architecture performance on the validation set cannot be guaranteed.
on the CIFAR-10 dataset, which is nearly 1000× faster The authors in [153] proposed mixed-level optimization:
than [12].
min [Ltrain (θ∗ , α) + λLval (θ∗ , α)] (7)
α,θ
4.2.3. Gradient Descent
The above-mentioned search strategies sample neural where α indicates the neural architecture, θ is the weight as-
architectures from a discrete search space. A pioneering al- signed to it, and λ is a non-negative regularization variable
gorithm, namely DARTS [17], was among the first gradient to control the weights of the training loss and validation
descent (GD)-based method to search for neural architec- loss. When λ = 0, Eq. 7 reduces to a single-level opti-
tures over a continuous and differentiable search space by mization (Eq. 6); in contrast, Eq. 7 becomes a bilevel
using a softmax function to relax the discrete space, as optimization (Eq. 5). The experimental results presented
outlined below: in [153] showed that mixed-level optimization not only over-
comes the overfitting issue of single-level optimization but
K k

also avoids the gradient error of bilevel optimization.
X exp αi,j
oi,j (x) = PK l
 ok (x) (4) Second, in DARTS, the output of each edge is the
k=1 l=1 exp αi,j weighted sum of all candidate operations (shown in Eq.
where o(x) indicates the operation performed on input 4) during the whole search stage, which leads to a linear
k
x, αi,j indicates the weight assigned to the operation ok increase in the requirements of GPU memory with the
between a pair of nodes (i, j), and K is the number of number of candidate operations. To reduce resource con-
predefined candidate operations. After the relaxation, the sumption, many subsequent studies [154, 155, 153, 156, 131]
task of searching for architectures is transformed into a have developed a differentiable sampler to sample a child
joint optimization of neural architecture α and the weights architecture from the supernet by using a reparameteri-
of this neural architecture θ. These two types of parameters zation trick, namely Gumbel Softmax [157]. The neural
are optimized alternately, indicating a bilevel optimization architecture is fully factorized and modeled with a concrete
problem. Specifically, α and θ are optimized with the distribution [158], which provides an efficient approach to
validation and the training sets, respectively. The training sampling a child architecture and allows gradient backprop-
and the validation losses are denoted by Ltrain and Lval , agation. Therefore, Eq. 4 is re-formulated as
respectively. Hence, the total loss function can be derived
as follows: K  
k
X exp log αi,j + Gki,j /τ
minα Lval (θ∗ , α) oki,j (x) = PK   ok (x) (8)
exp log α l + Gl
(5) k=1 l=1 i,j i,j /τ
s.t. θ∗ = argminθ Ltrain (θ, α)
Figure 16 presents an overview of DARTS, where a cell where Gki,j = −log(−log(uki,j )) is the k-th Gumbel sample,
is composed of N (here N = 4) ordered nodes and the uki,j is a uniform random variable, and τ is the Softmax
node z k (k starts from 0) is connected to the node z i , i ∈ temperature. When τ → ∞, the possibility distribution of
{k + 1, ..., N }. The operation on each edge ei,j is initially all operations between each node pair approximates to one-
a mixture of candidate operations, each being of equal hot distribution. In GDAS [154], only the operation with
weight. Therefore, the neural architecture α is a supernet the maximum possibility for each edge is selected during
that contains all possible child neural architectures. At the forward pass, while the gradient is backpropagated
the end of the search, the final architecture is derived by according to Eq. 8. In other words, only one path of the
retaining only the maximum-weight operation among all supernet is selected for training, thereby reducing the GPU
mixed operations. memory usage. Besides, ProxylessNAS [132] alleviates
Although DARTS substantially reduces the search time, the huge resource consumption through path binarization.
it incurs several problems. First, as Eq. 5 shows, DARTS Specifically, it transforms the real-valued path weights [17]
describes a joint optimization of the neural architecture to binary gates, which activates only one path of the mixed
and weights as a bilevel optimization problem. However, operations, and hence, solves the memory issue.
this problem is difficult to solve directly, because both ar- Another problem is the optimization of different op-
chitecture α and weights θ are high dimensional parameters. erations together, as they may compete with each other,
Another solution is single-level optimization, which can be leading to a negative influence. For example, several studies
formalized as [159, 128] have found that skip-connect operation domi-
nates at a later search stage in DARTS, which causes the
min Ltrain (θ, α) (6) network to be shallower and leads to a marked deterioration
θ,α
in performance. To solve this problem, DARTS+ [159] uses
which optimizes both neural architecture and weights to- an additional early-stop criterion, such that when two or
gether. Although the single-level optimization problem
13
0 0 0 0
?

?
1 ? 1 1 1

?
?
2 2 2 2

0.1
0.6
0.3
3 3 3 3

(a) (b) (c) (d)

Figure 16: Overview of DARTS. (a) The data can only flow from lower-level nodes to higher-level nodes, and the operations on edges are
initially unknown. (b) The initial operation on each edge is a mixture of candidate operations, each having equal weight. (c) The weight of
each operation is learnable and ranges from 0 to 1, but for previous discrete sampling methods, the weight could only be 0 or 1. (d) The final
neural architecture is constructed by preserving the maximum weight-value operation on each edge.

more skip-connects occur in a normal cell, the search pro- network as the surrogate model. For example, in PNAS
cess stops. In another example, P-DARTS [128] regularizes [18] and EPNAS [166], an LSTM is derived as the surrogate
the search space by executing operation-level dropout to model to progressively predict variable-sized architectures.
control the proportion of skip-connect operations occurring Meanwhile, NAO [169] uses a simpler surrogate model, i.e.,
during training and evaluation. multilayer perceptron (MLP), and NAO is more efficient
and achieves better results on CIFAR-10 than does PNAS
4.2.4. Surrogate Model-based Optimization [18]. White et al. [164] trained an ensemble of neural
Another group of architecture optimization methods is networks to predict the mean and variance of the validation
surrogate model-based optimization (SMBO) algorithms results for candidate neural architectures.
[33, 34, 160, 161, 162, 163, 164, 165, 166, 18, 161]. The
core concept of SMBO is that it builds a surrogate model 4.2.5. Grid and Random Search
of the objective function by iteratively keeping a record Both grid search (GS) and random search (RS) are sim-
of past evaluation results, and uses the surrogate model ple optimization methods applied to several NAS studies
to predict the most promising architecture. Thus, these [178, 179, 180, 11]. For instance, Geifman et al. [179] pro-
methods can substantially shorten the search time and posed a modular architecture search space (A = {A(B, i, j)|i ∈
improve efficiency. {1, 2, ..., Ncells }, j ∈ {1, 2, ..., Nblocks }}) that is spanned
SMBO algorithms differ from the surrogate models, by the grid defined by the two corners A(B, 1, 1) and
which can be broadly divided into Bayesian optimization A(B, Ncells , Nblocks ), where B is a searched block struc-
(BO) methods (including Gaussian process (GP) [167], ture. Evidently, a larger value Ncells × Nblocks leads to the
random forest (RF) [37], tree-structured Parzen estimator exploration of a larger space, but requires more resources.
(TPE) [168]), and neural networks [164, 169, 18, 166]. The authors in [180] conducted an effectiveness com-
BO [170, 171] is one of the most popular methods for parison between SOTA NAS methods and RS. The results
hyperparameter optimization. Many recent studies [33, showed that RS is a competitive NAS baseline. Specifically,
34, 160, 161, 162, 163, 164, 165] have attempted to apply RS with an early-stopping strategy performs as well as
these SOTA BO methods to AO. For example, in [172, 173, ENAS [13], which is an RL-based leading NAS method.
160, 165, 174, 175], the validation results of the generated Besides, Yu et al. [11] demonstrated that the SOTA NAS
neural architectures were modeled as a Gaussian process, techniques are not significantly better than random search.
which guides the search for the optimal neural architectures.
However, in GP-based BO methods, the inference time 4.2.6. Hybrid Optimization Method
scales cubically in the number of observations, and they The abovementioned architecture optimization methods
cannot effectively handle variable-length neural networks. have their own advantages and disadvantages. 1) EA is a
Camero et al. [176] proposed three fixed-length encoding mature global optimization method with high robustness.
schemes to cope with variable-length problems by using However, it requires considerable computational resources
RF as the surrogate model. Similarly, both [33] and [176] [26, 25], and its evolution operations (such as crossover and
used RF as a surrogate model, and [177] showed that it mutations) are performed randomly. 2) Although RL-based
works better in setting high dimensionality than GP-based methods (e.g., ENAS [13]) can learn complex architectural
methods. patterns, the searching efficiency and stability of the RL
Instead of using BO, some studies have used a neural controller are not guaranteed because it may take several

14
actions to obtain a positive reward. 3) The GD-based meth-
ods (e.g., DARTS [17]) substantially improve the searching
efficiency by relaxing the categorical candidate operations
to continuous variables. Nevertheless, in essence, they all

Unimportant parameter

Unimportant parameter
search for a child network from a supernet, which limits the
diversity of neural architectures. Therefore, some methods
have been proposed to incorporate different optimization
methods to capture the best of their advantages; these
methods are summarized as follows
EA+RL. Chen et al. [42] integrated reinforced muta-
tions into an EA, which avoids the randomness of evolution
Important parameter Important parameter
and improves the searching efficiency. Another similar
method developed in parallel is the evolutionary-neural
hybrid controller (Evo-NAS) [41], which also captures the Figure 17: Examples of grid search (left) and random search (right) in
nine trials for optimizing a two-dimensional space function f (x, y) =
merits of both RL-based methods and EA. The Evo-NAS g(x) + h(y) ≈ g(x) [181]. The parameter in g(x) (light-blue part)
controller’s mutations are guided by an RL-trained neural is relatively important, while that in h(y) (light-yellow part) is not
network, which can explore a vast search space and sample important. In a grid search, nine trials cover only three important
architectures efficiently. parameter values; however, random search can explore nine distinct
values of g. Therefore, random search is more likely to find the
EA+GD. Yang et al. [40] combined the EA and GD- optimal combination of parameters than grid search (the figure is
based method. The architectures share parameters within adopted from [181]).
one supernet and are tuned on the training set with a few
epochs. Then, the populations and the supernet are di-
rectly inherited in the next generation, which substantially GS is very simple and naturally supports parallel imple-
accelerates the evolution. The authors in [40] only took 0.4 mentation; however, it is computationally expensive and
GPU days for searching, which is more efficient than early inefficient when the hyperparameter space is very large, as
EA methods (e.g., AmoebaNet [26] took 3150 GPU days the number of trials grows exponentially with the dimen-
and 450 GPUs for searching). sionality of hyperparameters. To alleviate this problem,
EA+SMBO. The authors in [43] used RF as a surro- Hsu et al. [182] proposed a coarse-to-fine grid search, in
gate to predict model performance, which accelerates the which a coarse grid is first inspected to locate a good re-
fitness evaluation in EA. gion, and then a finer grid search is implemented on the
GD+SMBO. Unlike DARTS, which learns weights identified region. Similarly, Hesterman et al. [183] pro-
for candidate operations, NAO [169] proposes a variational posed a contracting GS algorithm, which first computes
autoencoder to generate neural architectures and further the likelihood of each point in the grid, and then generates
build a regression model as a surrogate to predict the a new grid centered on the maximum-likelihood value. The
performance of the generated architecture. The encoder point separation in the new grid is reduced to half that
maps the representations of the neural architecture to on the old grid. The above procedure is iterated until the
continuous space, and then a predictor network takes the results converge to a local minimum.
continuous representations of the neural architecture as Although the authors in [181] empirically and theoreti-
input and predicts the corresponding accuracy. Finally, cally showed that RS is more practical and efficient than
the decoder is used to derive the final architecture from a GS, RS does not promise an optimum value. This means
continuous network representation. that although a longer search increases the probability
of finding optimal hyperparameters, it consumes more re-
4.3. Hyperparameter Optimization sources. Li and Jamieson et al. [184] proposed a hyperband
Most NAS methods use the same set of hyperparameters algorithm to create a tradeoff between the performance
for all candidate architectures during the whole search stage; of the hyperparameters and resource budgets. The hyper-
thus, after finding the most promising neural architecture, band algorithm allocates limited resources (such as time
it is necessary to redesign a hyperparameter set and use or CPUs) to only the most promising hyperparameters, by
it to retrain or fine-tune the architecture. As some HPO successively discarding the worst half of the configuration
methods (such as BO and RS) have also been applied in settings long before the training process is finished.
NAS, we will only briefly introduce these methods here.
4.3.2. Bayesian Optimization
4.3.1. Grid and Random Search Bayesian optimization (BO) is an efficient method for
Figure 17 shows the difference between grid search (GS) the global optimization of expensive blackbox functions.
and random search (RS): GS divides the search space into In this section, we briefly introduce BO. For an in-depth
regular intervals and selects the best-performing point after discussion on BO, we recommend readers to refer to the
evaluating all points; while RS selects the best point from excellent surveys conducted in [171, 170, 185, 186].
a set of randomly drawn points. BO is an SMBO method that builds a probabilistic
15
model mapping from the hyperparameters to the objective Library Model
metrics evaluated on the validation set. It well balances Spearmint
GP
exploration (evaluating as many hyperparameter sets as https://github.com/HIPS/Spearmint
possible) and exploitation (allocating more resources to MOE
GP
promising hyperparameters). https://github.com/Yelp/MOE
PyBO
GP
Algorithm 1 Sequential Model-Based Optimization https://github.com/mwhoffman/pybo
INPUT: f, Θ, S, M Bayesopt
GP
D ← INITSAMPLES (f, Θ) https://github.com/rmcantin/bayesopt
for i in [1, 2, .., T ] do SkGP
GP
p(y|θ, D) ← FITMODEL (M, D) https://scikit-optimize.github.io
θi ← arg maxθ∈Θ S(θ, p(y|θ, D)) GPyOpt
GP
yi ← f (θi ) . Expensive step http://sheffieldml.github.io/GPyOpt
D ← D ∪ (θi , yi ) SMAC
RF
end for https://github.com/automl/SMAC3
Hyperopt
TPE
The steps of SMBO are expressed in Algorithm 1 (adopted http://hyperopt.github.io/hyperopt
from [170]). Here, several inputs need to be predefined ini- BOHB
TPE
tially, including an evaluation function f , search space Θ, https://github.com/automl/HpBandSter
acquisition function S, probabilistic model M, and record
Table 2: Open-source Bayesian optimization libraries. GP, RF, and
dataset D. Specifically, D is a dataset that records many TPE represent Gaussian process [167], random forest [37], and tree-
sample pairs (θi , yi ), where θi ∈ Θ indicates a sampled structured Parzen estimator [168], respectively.
neural architecture and yi indicates its evaluation result.
After the initialization, the SMBO steps are described as
follows: 4.3.3. Gradient-based Optimization
Another group of HPO methods are gradient-based op-
1. The first step is to tune the probabilistic model M timization (GO) algorithms [187, 188, 189, 190, 191, 192].
to fit the record dataset D. Unlike the above blackbox HPO methods (e.g., GS, RS,
2. The acquisition function S is used to select the next and BO), GO methods use the gradient information to
promising neural architecture from the probabilistic optimize the hyperparameters and substantially improve
model M. the efficiency of HPO. Maclaurin et al. [189] proposed a
3. The performance of the selected neural architecture reversible-dynamics memory-tape approach to handle thou-
is evaluated by f , which is an expensive step as it sands of hyperparameters efficiently through the gradient
involves training the neural network on the training information. However, optimizing many hyperparameters
set and evaluating it on the validation set. is computationally challenging. To alleviate this issue, the
4. The record dataset D is updated by appending a new authors in [190] used approximate gradient information
pair of results (θi , yi ). rather than the true gradient to optimize continuous hy-
perparameters, where the hyperparameters can be updated
The above four steps are repeated T times, where T before the model is trained to converge. Franceschi et al.
needs to be specified according to the total time or resources [191] studied both reverse- and forward-mode GO meth-
available. The commonly used surrogate models for the ods. The reverse-mode method differs from the method
BO method are GP, RF, and TPE. Table 2 summarizes proposed in [189] and does not require reversible dynamics;
the existing open-source BO methods, where GP is one of however, it needs to store the entire training history for
the most popular surrogate models. However, GP scales computing the gradient with respect to the hyperparame-
cubically with the number of data samples, while RF can ters. The forward-mode method overcomes this problem by
natively handle large spaces and scales better to many data using real-time updating hyperparameters, and is demon-
samples. Besides, Falkner and Klein et al. [38] proposed the strated to significantly improve the efficiency of HPO on
BO-based hyperband (BOHB) algorithm, which combines large datasets. Chandra [192] proposed a gradient-based
the strengths of TPE-based BO and hyperband, and hence, ultimate optimizer, which can optimize not only the regular
performs much better than standard BO methods. Fur- hyperparameters (e.g., learning rate) but also those of the
thermore, FABOLAS [35] is a faster BO procedure, which optimizer (e.g., Adam optimizer [193]’s moment coefficient
maps the validation loss and training time as functions of β1 , β2 ).
dataset size, i.e., trains a generative model on a sub-dataset
that gradually increases in size. Here, FABOLAS is 10-100
times faster than other SOTA BO algorithms and identifies
the most promising hyperparameters.

16
5. Model Evaluation Progressive Neural Architecture Search (PNAS) [18] intro-
duces a surrogate model to control the method of searching.
Once a new neural network has been generated, its Although ENAS has been proven to be very efficient, PNAS
performance must be evaluated. An intuitive method is is even more efficient, as the number of models evaluated
to train the network to convergence and then evaluate its by PNAS is over five times that evaluated by ENAS, and
performance. However, this method requires extensive time PNAS is eight times faster in terms of total computational
and computing resources. For example, [12] took 800 K40 speed. A well-performing surrogate usually requires large
GPUs and 28 days in total to search. Additionally, NASNet amounts of labeled architectures, while the optimization
[15] and AmoebaNet [26] required 500 P100 GPUs and 450 space is too large and hard to quantify, and the evalua-
K40 GPUs, respectively. In this section, we summarize tion of each configuration is extremely expensive [201]. To
several algorithms for accelerating the process of model alleviate this issue, Luo et al. [202] proposed SemiNAS,
evaluation. a semi-supervised NAS method, which leverages amounts
of unlabeled architectures to train the surrogate, a con-
5.1. Low fidelity troller that is used to predict the accuracy of architectures
As model training time is highly related to the dataset without evaluation. Initially, the surrogate is only trained
and model size, model evaluation can be accelerated in dif- with a small number of labeled data pairs (architectures,
ferent ways. First, the number of images or the resolution accuracy), then the generated data pairs will be gradually
of images (in terms of image-classification tasks) can be added to the original data to further improve the surrogate.
decreased. For example, FABOLAS [35] trains the model
on a subset of the training set to accelerate model evalu- 5.4. Early stopping
ation. In [194], ImageNet64×64 and its variants 32×32, Early stopping was first used to prevent overfitting in
16×16 are provided, while these lower resolution datasets classical ML, and it has been used in several recent studies
can retain characteristics similar to those of the original [203, 204, 205] to accelerate model evaluation by stopping
ImageNet dataset. Second, low-fidelity model evaluation evaluations that are predicted to perform poorly on the
can be realized by reducing the model size, such as by validation set. For example, [205] proposes a learning-curve
training with fewer filters per layer [15, 26]. By analogy model that is a weighted combination of a set of parametric
to ensemble learning, [195] proposes the Transfer Series curve models selected from the literature, thereby enabling
Expansion (TSE), which constructs an ensemble estimator the performance of the network to be predicted. Further-
by linearly combining a series of basic low-fidelity estima- more, [206] presents a novel approach for early stopping
tors, hence avoiding the bias that can derive from using based on fast-to-compute local statistics of the computed
a single low-fidelity estimator. Furthermore, Zela et al. gradients, which no longer relies on the validation set and
[34] empirically demonstrated that there is a weak corre- allows the optimizer to make full use of all of the training
lation between performance after short or long training data.
times, thus confirming that a prolonged search for network
configurations is unnecessary.
6. NAS Discussion
5.2. Weight sharing
In Section 4, we reviewed the various search space and
In [12], once a network has been evaluated, it is dropped. architecture optimization methods, and in Section 5, we
Hence, the technique of weight sharing is used to acceler- summarized commonly used model evaluation methods.
ate the process of NAS. For example, Wong and Lu et al. These two sections introduced many NAS studies, which
[196] proposed transfer neural AutoML, which uses knowl- may cause the readers to get lost in details. Therefore, in
edge from prior tasks to accelerate network design. ENAS this section, we summarize and compare these NAS algo-
[13] shares parameters among child networks, leading to rithms’ performance from a global perspective to provide
a thousand-fold faster network design than [12]. Network readers a clearer and more comprehensive understanding of
morphism based algorithms [20, 21] can also inherit the NAS methods’ development. Then, we discuss some major
weights of previous architectures, and single-path NAS topics of the NAS technique.
[197] uses a single-path over-parameterized ConvNet to
encode all architectural decisions with shared convolutional 6.1. NAS Performance Comparison
kernel parameters.
Many NAS studies have proposed several neural archi-
tecture variants, where each variant is designed for different
5.3. Surrogate
scenarios. For instance, some architecture variants perform
The surrogate-based method [198, 199, 200, 43] is an- better but are larger, while some are lightweight for a mo-
other powerful tool that approximates the black-box func- bile device but with a performance penalty. Therefore, we
tion. In general, once a good approximation has been ob- only report the representative results of each study. Besides,
tained, it is trivial to find the configurations that directly to ensure a valid comparison, we consider the accuracy and
optimize the original expensive objective. For example, algorithm efficiency as comparison indices. As the number

17
Published #Params Top-1 GPU
Reference #GPUs AO
in (Millions) Acc(%) Days
ResNet-110 [2] ECCV16 1.7 93.57 - - Manually
PyramidNet [207] CVPR17 26 96.69 - - designed
DenseNet [127] CVPR17 25.6 96.54 - -
GeNet#2 (G-50) [30] ICCV17 - 92.9 17 -
Large-scale ensemble [25] ICML17 40.4 95.6 2,500 250
Hierarchical-EAS [19] ICLR18 15.7 96.25 300 200
CGP-ResSet [28] IJCAI18 6.4 94.02 27.4 2
AmoebaNet-B (N=6, F=128)+c/o [26] AAAI19 34.9 97.87 3,150 450 K40 EA
AmoebaNet-B (N=6, F=36)+c/o [26] AAAI19 2.8 97.45 3,150 450 K40
Lemonade [27] ICLR19 3.4 97.6 56 8 Titan
EENA [149] ICCV19 8.47 97.44 0.65 1 Titan Xp
EENA (more channels)[149] ICCV19 54.14 97.79 0.65 1 Titan Xp
NASv3[12] ICLR17 7.1 95.53 22,400 800 K40
NASv3+more filters [12] ICLR17 37.4 96.35 22,400 800 K40
MetaQNN [23] ICLR17 - 93.08 100 10
NASNet-A (7 @ 2304)+c/o [15] CVPR18 87.6 97.60 2,000 500 P100
NASNet-A (6 @ 768)+c/o [15] CVPR18 3.3 97.35 2,000 500 P100
Block-QNN-Connection more filter [16] CVPR18 33.3 97.65 96 32 1080Ti
Block-QNN-Depthwise, N=3 [16] CVPR18 3.3 97.42 96 32 1080Ti RL
ENAS+macro [13] ICML18 38.0 96.13 0.32 1
ENAS+micro+c/o [13] ICML18 4.6 97.11 0.45 1
Path-level EAS [139] ICML18 5.7 97.01 200 -
Path-level EAS+c/o [139] ICML18 5.7 97.51 200 -
ProxylessNAS-RL+c/o[132] ICLR19 5.8 97.70 - -
FPNAS[208] ICCV19 5.76 96.99 - -
DARTS(first order)+c/o[17] ICLR19 3.3 97.00 1.5 4 1080Ti
DARTS(second order)+c/o[17] ICLR19 3.3 97.23 4 4 1080Ti
sharpDARTS [178] ArXiv19 3.6 98.07 0.8 1 2080Ti
P-DARTS+c/o[128] ICCV19 3.4 97.50 0.3 -
P-DARTS(large)+c/o[128] ICCV19 10.5 97.75 0.3 -
SETN[209] ICCV19 4.6 97.31 1.8 -
GD
GDAS+c/o [154] CVPR19 2.5 97.18 0.17 1
SNAS+moderate constraint+c/o [155] ICLR19 2.8 97.15 1.5 1
BayesNAS[210] ICML19 3.4 97.59 0.1 1
ProxylessNAS-GD+c/o[132] ICLR19 5.7 97.92 - -
PC-DARTS+c/o [211] CVPR20 3.6 97.43 0.1 1 1080Ti
MiLeNAS[153] CVPR20 3.87 97.66 0.3 -
SGAS[212] CVPR20 3.8 97.61 0.25 1 1080Ti
GDAS-NSAS[213] CVPR20 3.54 97.27 0.4 -
NASBOT[160] NeurIPS18 - 91.31 1.7 -
PNAS [18] ECCV18 3.2 96.59 225 -
SMBO
EPNAS[166] BMVC18 6.6 96.29 1.8 1
GHN[214] ICLR19 5.7 97.16 0.84 -
NAO+random+c/o[169] NeurIPS18 10.6 97.52 200 200 V100
SMASH [14] ICLR18 16 95.97 1.5 -
Hierarchical-random [19] ICLR18 15.7 96.09 8 200
RS
RandomNAS [180] UAI19 4.3 97.15 2.7 -
DARTS - random+c/o [17] ICLR19 3.2 96.71 4 1
RandomNAS-NSAS[213] CVPR20 3.08 97.36 0.7 -
NAO+weight sharing+c/o [169] NeurIPS18 2.5 97.07 0.3 1 V100 GD+SMBO
RENASNet+c/o[42] CVPR19 3.5 91.12 1.5 4 EA+RL
CARS[40] CVPR20 3.6 97.38 0.4 - EA+GD

Table 3: Performance of different NAS algorithms on CIFAR-10. The “AO” column indicates the architecture optimization method. The
dash (-) indicates that the corresponding information is not provided in the original paper. “c/o” indicates the use of Cutout [89]. RL, EA,
GD, RS, and SMBO indicate reinforcement learning, evolution-based algorithm, gradient descent, random search, and surrogate model-based
optimization, respectively.

18
Published #Params Top-1/5 GPU
Reference #GPUs AO
in (Millions) Acc(%) Days
ResNet-152 [2] CVPR16 230 70.62/95.51 - -
PyramidNet [207] CVPR17 116.4 70.8/95.3 - -
SENet-154 [126] CVPR17 - 71.32/95.53 - - Manually
DenseNet-201 [127] CVPR17 76.35 78.54/94.46 - - designed
MobileNetV2 [215] CVPR18 6.9 74.7/- - -
GeNet#2[30] ICCV17 - 72.13/90.26 17 -
AmoebaNet-C(N=4,F=50)[26] AAAI19 6.4 75.7/92.4 3,150 450 K40
Hierarchical-EAS[19] ICLR18 - 79.7/94.8 300 200
EA
AmoebaNet-C(N=6,F=228)[26] AAAI19 155.3 83.1/96.3 3,150 450 K40
GreedyNAS [216] CVPR20 6.5 77.1/93.3 1 -
NASNet-A(4@1056) ICLR17 5.3 74.0/91.6 2,000 500 P100
NASNet-A(6@4032) ICLR17 88.9 82.7/96.2 2,000 500 P100
Block-QNN[16] CVPR18 91 81.0/95.42 96 32 1080Ti
Path-level EAS[139] ICML18 - 74.6/91.9 8.3 -
ProxylessNAS(GPU) [132] ICLR19 - 75.1/92.5 8.3 -
RL
ProxylessNAS-RL(mobile) [132] ICLR19 - 74.6/92.2 8.3 -
MnasNet[130] CVPR19 5.2 76.7/93.3 1,666 -
EfficientNet-B0[142] ICML19 5.3 77.3/93.5 - -
EfficientNet-B7[142] ICML19 66 84.4/97.1 - -
FPNAS[208] ICCV19 3.41 73.3/- 0.8 -
DARTS (searched on CIFAR-10)[17] ICLR19 4.7 73.3/81.3 4 -
sharpDARTS[178] Arxiv19 4.9 74.9/92.2 0.8 -
P-DARTS[128] ICCV19 4.9 75.6/92.6 0.3 -
SETN[209] ICCV19 5.4 74.3/92.0 1.8 -
GDAS [154] CVPR19 4.4 72.5/90.9 0.17 1
SNAS[155] ICLR19 4.3 72.7/90.8 1.5 -
ProxylessNAS-G[132] ICLR19 - 74.2/91.7 - -
BayesNAS[210] ICML19 3.9 73.5/91.1 0.2 1
FBNet[131] CVPR19 5.5 74.9/- 216 -
OFA[217] ICLR20 7.7 77.3/- - - GD
AtomNAS[218] ICLR20 5.9 77.6/93.6 - -
MiLeNAS[153] CVPR20 4.9 75.3/92.4 0.3 -
DSNAS[219] CVPR20 - 74.4/91.54 17.5 4 Titan X
SGAS[212] CVPR20 5.4 75.9/92.7 0.25 1 1080Ti
PC-DARTS [211] CVPR20 5.3 75.8/92.7 3.8 8 V100
DenseNAS[220] CVPR20 - 75.3/- 2.7 -
FBNetV2-L1[221] CVPR20 - 77.2/- 25 8 V100
PNAS-5(N=3,F=54)[18] ECCV18 5.1 74.2/91.9 225 -
PNAS-5(N=4,F=216)[18] ECCV18 86.1 82.9/96.2 225 -
SMBO
GHN[214] ICLR19 6.1 73.0/91.3 0.84 -
SemiNAS[202] CVPR20 6.32 76.5/93.2 4 -
Hierarchical-random[19] ICLR18 - 79.6/94.7 8.3 200
RS
OFA-random[217] CVPR20 7.7 73.8/- - -
RENASNet[42] CVPR19 5.36 75.7/92.6 - - EA+RL
Evo-NAS[41] Arxiv20 - 75.43/- 740 - EA+RL
CARS[40] CVPR20 5.1 75.2/92.5 0.4 - EA+GD

Table 4: Performance of different NAS algorithms on ImageNet. The “AO” column indicates the architecture optimization method. The dash
(-) indicates that the corresponding information is not provided in the original paper. RL, EA, GD, RS, and SMBO indicate reinforcement
learning, evolution-based algorithm, gradient descent, random search, and surrogate model-based optimization, respectively.

and types of GPUs used vary for different studies, we use


GPU Days to approximate the efficiency, which is defined GPU Days = N × D (9)
as:
where N represents the number of GPUs, and D represents

19
the actual number of days spent searching. Searching Stage Evaluation Stage
Tables 3 and 4 present the performances of different
Architecture
NAS methods on CIFAR-10 and ImageNet, respectively. Searching Stage
Optimization Evaluation Stage
Retraining the
Besides, as most NAS methods first search for the neural Search best-performing
architecture based on a small dataset (CIFAR-10), and then Architecture model
Space Optimization model of the
Retraining the
transfer the architecture to a larger dataset (ImageNet), searching stage
Search best-performing
the search time for both datasets is the same. The tables Parameter model
Space model of the
show that the early studies on EA- and RL-based NAS Training
searching stage
methods focused more on high performance, regardless of Parameter
the resource consumption. For example, although Amoe- (a) Two-stage NASTraining
comprises the searching stage and evaluation
stage. The best-performing model of the searching stage is further
baNet [26] achieved excellent results for both CIFAR-10
retrained in the evaluation stage. Parameter Training
and ImageNet, the searching took 3,150 GPU days and 450
GPUs. The subsequent NAS studies attempted to improve Model 1
Parameter Training
the searching efficiency while ensuring the searched model’s
Search Architecture Model 2
high performance. For instance, EENA [149] elaborately Model 1 model
Space Optimization
designs the mutation and crossover operations, which can
Search Architecture ...
Model 2
reuse the learned information to guide the evolution pro- model
Space Optimization
cess, and hence, substantially improve the efficiency of Model n
EA-based NAS methods. ENAS [13] is one of the first ...
RL-based NAS methods to adopt the parameter-sharing
Model n
strategy, which reduces the number of GPU budgets to
1 and shortens the searching time to less than one day. (b) One-stage NAS can directly deploy a well-performing model
We also observe that gradient descent-based architecture without extra retraining or fine-tuning. The two-way arrow indicates
optimization methods can substantially reduce the compu- that the processes of architecture optimization and parameter training
run simultaneously.
tational resource consumption for searching, and achieve
SOTA results. Several follow-up studies have been con-
Figure 18: Illustration of two- and one-stage neural architecture
ducted to achieve further improvement and optimization search flow.
in this direction. Interestingly, RS-based methods can also
obtain comparable results. The authors in [180] demon-
strated that RS with weight-sharing could outperform a following properties:
series of powerful methods, such as ENAS [13] and DARTS
[17]. • τ = 1: two rankings are identical
• τ = −1: two rankings are completely opposite.
6.1.1. Kendall Tau Metric
As RS is comparable to more sophisticated methods • τ = 0: there is no relationship between two rankings.
(e.g., DARTS and ENAS), a natural question is, what are
the advantages and significance of the other AO algorithms
compared with RS? Researchers have tried to use other 6.1.2. NAS-Bench Dataset
metrics to answer this question, rather than simply con- Although Tables 3 and 4 present a clear comparison
sidering the model’s final accuracy. Most NAS methods between different NAS methods, the results of different
comprise two stages: 1) search for a best-performing archi- methods are obtained under different settings, such as
tecture on the training set and 2) expand it to a deeper training-related hyperparameters (e.g., batch size and train-
architecture and estimate it on the validation set. However, ing epochs) and data augmentation (e.g., Cutout [89]). In
there usually exists a large gap between the two stages. In other words, the comparison is not quite fair. In this con-
other words, the architecture that achieves the best result text, NAS-Bench-101 [224] is a pioneering work for improv-
in the training set is not necessarily the best one for the ing the reproducibility. It provides a tabular dataset con-
validation set. Therefore, instead of merely considering taining 423,624 unique neural networks generated and eval-
the final accuracy and search time cost, many NAS studies uated from a fixed graph-based search space and mapped
[219, 222, 213, 11, 123] have used Kendall Tau (τ ) metric to their trained and evaluated performance on CIFAR-10.
[223] to evaluate the correlation of the model performance Meanwhile, Dong et al. [225] further built NAS-Bench-201,
between the search and evaluation stages. The parameter which is an extension to NAS-Bench-101 and has a differ-
τ is defined as ent search space, results on multiple datasets (CIFAR-10,
CIFAR-100, and ImageNet-16-120 [194]), and more diag-
NC − ND nostic information. Similarly, Klyuchnikov et al. [226]
τ= (10)
NC + ND proposed a NAS-Bench for the NLP task. These datasets
where NC and ND indicate the numbers of concordant and enable NAS researchers to focus solely on verifying the ef-
discordant pairs. τ is a number in the range [-1,1] with the fectiveness and efficiency of their AO algorithms, avoiding
20
repetitive training for selected architectures and substan-
tially helping the NAS community to develop. Search Space Search Space

6.2. One-stage vs. Two-stage


The NAS methods can be roughly divided into two
classes according to the flow ––two-stage and one-stage––
as shown in Figure 18.
Two-stage NAS comprises the searching stage and
evaluation stage. The searching stage involves two pro- Figure 19: (Left) One-shot models. (Right) Non-one-shot models.
cesses: architecture optimization, which aims to find the Each circle indicates a different model, and its area indicates the
optimal architecture, and parameter training, which is to model’s size. We use concentric circles to represent one-shot models,
as they share the weights with each other.
train the found architecture’s parameter. The simplest
idea is to train all possible architectures’ parameters from
scratch and then choose the optimal architecture. However, What is One-shot NAS? One-shot NAS methods
it is resource-consuming (e.g., NAS-RL [12] took 22,400 embed the search space into an overparameterized supernet,
GPU days with 800 K40 GPUs for searching) ), which is in- and thus, all possible architectures can be derived from
feasible for most companies and institutes. Therefore, most the supernet. Figure 18 shows the difference between the
NAS methods (such as ENAS [13] and DARTS [17]) sample search spaces of one-shot and non-one-shot NAS. Each
and train many candidate architectures in the searching circle indicates a different architecture, where the archi-
stage, and then further retrain the best-performing archi- tectures of one-shot NAS methods share the same weights.
tecture in the evaluation stage. One-shot NAS methods can be divided into two categories
One-stage NAS refers to a class of NAS methods according to how to handle AO and parameter training:
that can export a well-designed and well-trained neural coupled and decoupled optimization [229, 216].
architecture without extra retraining, by running AO and Coupled optimization. The first category of one-
parameter training simultaneously. In this way, the ef- shot NAS methods optimizes the architecture and weights
ficiency can be substantially improved. However, model in a coupled manner [13, 17, 154, 132, 155]. For instance,
architecture and its weight parameters are highly coupled; it ENAS [13] uses an LSTM network to discretely sample
is difficult to optimize them simultaneously. Several recent a new architecture, and then uses a few batches of the
studies [217, 227, 228, 218] have attempted to overcome this training data to optimize the weight of this architecture.
challenge. For instance, the authors in [217] proposed the After repeating the above steps many times, a collection
progressive shrinking algorithm to post-process the weights of architectures and their corresponding performances are
after the training was completed. They first pretrained the recorded. Finally, the best-performing architecture is se-
entire neural network, and then progressively fine-tuned lected for further retraining. DARTS [17] uses a similar
the smaller networks that shared weights with the complete weight sharing strategy, but has a continuously parame-
network. Based on well-designed constraints, the perfor- terized architecture distribution. The supernet contains
mance of all subnetworks was guaranteed. Thus, given a all candidate operations, each with learnable parameters.
target deployment device, a specialized subnetwork can be The best architecture can be directly derived from the
directly exported without fine-tuning. However, [217] was distribution. However, as DARTS [17] directly optimizes
still computational resource-intensive, as the whole process the supernet weights and the architecture distribution, it
took 1,200 GPU hours with V100 GPUs. BigNAS [228] re- suffers from vast GPU memory consumption. Although
visited the conventional training techniques of stand-alone DARTS-like methods [132, 154, 155] have adopted different
networks, and empirically proposed several techniques to approaches to reduce the resource requirements, coupled
handle a wider set of models, ranging in size from 200M to optimization inevitably introduces a bias in both architec-
1G FLOPs, whereas [217] only handled models under 600M ture distribution and supernet weights [197, 229], as they
FLOPs. Both AtomNAS [218] and DSNAS [219] proposed treat all subnetworks unequally. The rapidly converged
an end-to-end one-stage NAS framework to further boost architectures can easily obtain more opportunities to be
the performance and simplify the flow. optimized [17, 159], and are only a small portion of all
candidates; therefore, it is challenging to find the best
6.3. One-shot/Weight-sharing architecture.
One-shot6=one-stage. Note that one shot is not ex- Another disadvantage of coupling optimization is that
actly equivalent to one stage. As mentioned above, we when new architectures are sampled and trained continu-
divide the NAS studies into one- and two-stage methods ac- ously, the weights of previous architectures are negatively
cording to the flow (Figure 18), whereas whether a NAS al- impacted, leading to performance degradation. The authors
gorithm belongs to a one-shot method depends on whether in [230] defined this phenomenon as multimodel forgetting.
the candidate architectures share the same weights (Fig- To overcome this problem, Zhang et al. [231] modeled
ure 19). However, we observe that most one-stage NAS supernet training as a constrained optimization problem
methods are based on the one-shot paradigm.
21
of continual learning and proposed novel search-based ar- for each architecture, as these architectures are assigned
chitecture selection (NSAS) loss function. They applied the weights generated by the hypernetwork. Besides, the
the proposed method to RandomNAS [180] and GDAS authors in [232] observed that the architectures with a
[154], where the experimental result demonstrated that smaller symmetrized KL divergence value are more likely
the method effectively reduces the multimodel forgetting to perform better. This can be expressed as follows:
and boosting the predictive ability of the supernet as an
evaluator.
Decoupled optimization. The second category of DSKL = DKL (pkq) + DKL (qkp)
n
one-shot NAS methods [209, 232, 229, 217] decouples the X pi (11)
optimization of architecture and weights into two sequential s.t. DKL (pkq) = pi log
i=1
qi
phases: 1) training the supernet and 2) using the trained
supernet as a predictive performance estimator of different where (p1 , ..., pn ) and (q1 , ..., qn ) indicate the predictions of
architectures to select the most promising architecture. the sampled architecture and one-shot model, respectively,
In terms of the supernet training phase, the supernet and n indicates the number of classes. The cost of calcu-
cannot be directly trained as a regular neural network be- lating the KL value is very small; in [232], only 64 random
cause its weights are also deeply coupled [197]. Yu et al. training data examples were used. Meanwhile, EA is also
[11] experimentally showed that the weight-sharing strat- a promising search solution [197, 216]. For instance, SPOS
egy degrades the individual architecture’s performance and [197] uses EA to search for architectures from the supernet.
negatively impacts the real performance ranking of the It is more efficient than the EA methods introduced in
candidate architectures. To reduce the weight coupling, Section 4, because each sampled architecture only performs
many one-shot NAS methods [197, 209, 14, 214] adopt the inference. The self-evaluated template network (SETN)
random sampling policy, which randomly samples an archi- [209] proposes an estimator to predict the probability of
tecture from the supernet, activating and optimizing only each architecture having a lower validation loss. The ex-
the weights of this architecture. Meanwhile, RandomNAS perimental results show that SETN can potentially find
[180] demonstrates that a random search policy is a compet- an architecture with better performance than RS-based
itive baseline method. Although some one-shot approaches methods [232, 14].
[154, 13, 155, 132, 131] have adopted the strategy that
samples and trains only one path of the supernet at a time, 6.4. Joint Hyperparameter and Architecture Optimization
they sample the path according to the RL controller [13],
Most NAS methods fix the same setting of training-
Gumbel Softmax [154, 155, 131], or the BinaryConnect net-
related hyperparameters during the whole search stage. Af-
work [132], which instead highly couples the architecture
ter the search, the hyperparameters of the best-performing
and supernet weights. SMASH [14] adopts an auxiliary
architecture are further optimized. However, this paradigm
hypernetwork to generate weights for randomly sampled
may result in suboptimal results as different architectures
architectures. Similarly, Zhang et al. [214] proposed a
tend to fit different hyperparameters, making the model
computation graph representation, and used the graph hy-
ranking unfair [233]. Therefore, a promising solution is the
pernetwork (GHN) to predict the weights for all possible
joint hyperparameter and architecture optimization (HAO)
architectures faster and more accurately than regular hy-
[34, 234, 233, 235]. We summary the existing joint HAO
pernetworks [14]. However, through a careful experimental
methods as follows.
analysis conducted to understand the weight-sharing strat-
Zela et al. [34] cast NAS as a hyperparameter opti-
egy’s mechanism, Bender et al. [232] showed that neither a
mization problem, where the search spaces of NAS and
hypernetwork nor an RL controller is required to find the
standard hyperparameters are combined. They applied
optimal architecture. They proposed a path dropout strat-
BOHB [38], an efficient HPO method, to optimize the ar-
egy to alleviate the problem of weight coupling. During
chitecture and hyperparameters jointly. Similarly, Dong
supernet training, each path of the supernet is randomly
et al. [233] proposed a differentiable method, namely Au-
dropped with gradually increasing probability. GreedyNAS
toHAS, which builds a Cartesian product of the search
[216] adopts a multipath sampling strategy to train the
spaces of both NAS and HPO by unifying the represen-
greedy supernet. This strategy focuses on more potentially
tation of all candidate choices for the architecture (e.g.,
suitable paths, and is demonstrated to effectively achieve
number of layers) and hyperparameters (e.g., learning rate).
a fairly high rank correlation of candidate architectures
However, a challenge here is that the candidate choices for
compared with RS.
the architecture search space are usually categorical, while
The second phase involves the selection of the most
hyperparameters choices can be categorical (e.g., the type
promising architecture from the trained supernet, which
of optimizer) and continuous (e.g., learning rate). To over-
is the primary purpose of most NAS tasks. Both SMASH
come this challenge, AutoHAS discretizes the continuous
[14] and [232] randomly selected a set of architectures from
hyperparameters into a linear combination of multiple cat-
the supernet, and ranked them according to their perfor-
egorical bases. For example, the categorical bases for the
mance. SMASH can obtain the validation performance of
learning rate are {0.1, 0.2, 0.3}, and then, the final learning
all selected architectures at the cost of a single training run
22
rate is defined as lr = w1 × 0.1 + w2 × 0.2 + w3 × 0.3. Mean- to the one-hot random variables, such that the resource
while, FBNetv3 [235] jointly searches both architectures constraint’s differentiability is ensured.
and the corresponding training recipes (i.e., hyperparam-
eters). The architectures are represented with one-hot
7. Open Problems and Future Directions
categorical variables and integral (min-max normalized)
range variables, and the representation is fed to an encoder This section discusses several open problems of the ex-
network to generate the architecture embedding. Then, the isting AutoML methods and proposes some future research
concatenation of architecture embedding and the training directions.
hyperparameters is used to train the accuracy predictor,
which will be applied to search for promising architectures 7.1. Flexible Search Space
and hyperparameters at a later stage.
As summarized in Section 4, there are various search
spaces where the primitive operations can be roughly clas-
6.5. Resource-aware NAS
sified into pooling and convolution. Some spaces even
Early NAS studies [12, 15, 26] pay more attention to use a more complex module (e.g., MBConv [130]) as the
searching for neural architectures that achieve higher per- primitive operation. Although these search spaces have
formance (e.g., classification accuracy), regardless of the been proven effective for generating well-performing neural
associated resource consumption (i.e., the number of GPUs architectures, all of them are based on human knowledge
and time required). Therefore, many follow-up studies and experience, which inevitably introduce human bias,
investigate resource-aware algorithms to trade off perfor- and hence, still do not break away from the human design
mance against the resource budget. To do so, these algo- paradigm. AutoML-Zero [289] uses very simple mathemat-
rithms add computational cost to the loss function as a ical operations (e.g., cos, sin, mean,std) as the primitive
resource constraint. These algorithms differ in the type operations of the search space to minimize the human
of computational cost, which may be 1) the parameter bias, and applies EA to discover complete machine learning
size; 2) the number of Multiply-ACcumulate (MAC) opera- algorithms. AutoML-Zero successfully designs two-layer
tions; 3) the number of float-point operations (FLOPs); or neural networks based on these basic mathematical opera-
4) the real latency. For example, MONAS [236] considers tions. Although the network searched by AutoML-Zero is
MAC as the constraint, and as MONAS uses a policy-based much simpler than both human-designed and NAS-designed
reinforcement-learning algorithm to search, the constraint networks, the experimental results show the potential to
can be directly added to the reward function. MnasNet discover a new model design paradigm with minimal human
[130] proposes a customized weighted product to approxi- design. Therefore, the design of a more general, flexible,
mate a Pareto optimal solution: and free of human bias search space and the discovery of
 w novel neural architectures based on this search space would
LAT (m)
maximize ACC(m) × (12) be challenging and advantageous.
m T
where LAT (m) denotes measured inference latency of the 7.2. Exploring More Areas
model m on the target device, T is the target latency, and As described in Section 6, the models designed by NAS
w is the weight variable defined as: algorithms have achieved comparable results in image clas-
 sification tasks (CIFAR-10 and ImageNet) to those of man-
α, if LAT (m) ≤ T ually designed models. Additionally, many recent studies
w= (13)
β, otherwise have applied NAS to other CV tasks (Table 5).
where the recommended value for both α and β is −0.07. However, in terms of the NLP task, most NAS studies
In terms of a differentiable neural architecture search have only conducted experiments on the PTB dataset.
(DNAS) framework, the constraint (i.e., loss function) Besides, some NAS studies have attempted to apply NAS
should be differentiable. For this purpose, FBNet [131] to other NLP tasks (shown in Table 5). However, Figure
uses a latency lookup table model to estimate the overall 20 shows that, even on the PTB dataset, there is still a
latency of a network based on the runtime of each operator. big gap in performance between the NAS-designed models
The loss function is defined as ([13, 17, 12]) and human-designed models (GPT-2 [290],
FRAGE AWD-LSTM-Mos [4], adversarial AWD-LSTM-
L (a, θa ) = CE (a, θa ) · α log(LAT(a))β (14) Mos [291] and Transformer-XL [5]). Therefore, the NAS
community still has a long way to achieve comparable
where CE(a, θa ) indicates the cross-entropy loss of architec-
results to those of the models designed by experts on NLP
ture a with weights θa . Similar to MnasNet [130], this loss
tasks.
function also comprises two hyperparameters that need to
Besides the CV and NLP tasks, Table 5 also shows that
be set manually: α and β control the magnitude of the loss
AutoML technique has been applied to other tasks, such
function and the latency term, respectively. In SNAS [155],
as network compression, federate learning, image caption,
the cost of time for the generated child network is linear

23
Category Application References
Medical Image Recognition [237, 238, 239]
Object Detection [240, 241, 242, 243, 244, 245]
Semantic Segmentation [246, 129, 247, 248, 249, 250, 251]
Computer Vision Person Re-identification [252]
(CV) Super-Resolution [253, 254, 255]
Image Restoration [256]
Generative Adversarial Network (GAN) [257, 258, 259, 260]
Disparity Estimation [261]
Video Task [262, 263, 264, 265]
Translation [266]
Language Modeling [267]
Natural Language Processing Entity Recognition [267]
(NLP) Text Classification [268]
Sequential Labeling [268]
Keyword Spotting [269]
Network Compression [270, 271, 272, 273, 274, 275, 276, 277]
Graph Neural Network (GNN) [278]
Model Federate Learning [279, 280]
WRN Others 97.7 Loss Function Search [281, 282]
AmoebaNet 97.87 Activation Function Search [283]
SENet 97.88 Image Caption [284, 285]
Human
ProxylessNAS 97.92 Text to
Auto Speech (TTS) [202]
Fast AA 98.3 Recommendation System [286, 287, 288]
EfficientNet 98.9
GPIPE 99
Table 5: Summary of the existing automated machine learning applications.
Acc(%)

BiT-L 99.3

96.5 97 97.5 98 98.5 99 99.5


recommendation system, and searching for loss and activation functions. Therefore, these interesting studies have indicated the potential of AutoML to be applied in more areas.

[Figure 20: perplexity on the PTB dataset — NAS Cell 64, ENAS 58.6, DARTS 56.1, Transformer-XL 54.55, FRAGE + AWD-LSTM-MoS 46.54, adversarial AWD-LSTM-MoS 46.01, GPT-2 35.76.]
Figure 20: State-of-the-art models on the PTB dataset. The lower the perplexity, the better is the performance. The green bar represents the automatically generated model, and the yellow bar represents the model designed by human experts. Best viewed in color.

7.3. Interpretability
Although AutoML algorithms can find promising configuration settings more efficiently than humans, there is a lack of scientific evidence for illustrating why the found settings perform better. For example, in BlockQNN [16], it is unclear why the NAS algorithm tends to select the concatenation operation to process the output of each block in the cell, instead of the element-wise addition operation. Some recent studies [232, 292, 96] have shown that the explanation for these occurrences is usually given in hindsight and lacks rigorous mathematical proof. Therefore, increasing the mathematical interpretability of AutoML is an important future research direction.

7.4. Reproducibility
A major challenge with ML is reproducibility. AutoML is no exception, especially for NAS, because most of the existing NAS algorithms still have many parameters that need to be set manually at the implementation level; however, the original papers do not cover much detail. For instance, Yang et al. [123] experimentally demonstrated that the seed plays an important role in NAS experiments; however, most NAS studies do not mention the seed used in their experiments. Besides, considerable resource consumption is another obstacle to reproduction. In this context, several NAS-Bench datasets have been proposed, such as NAS-Bench-101 [224], NAS-Bench-201 [225], and NAS-Bench-NLP [226]. These datasets allow NAS researchers to focus on the design of optimization algorithms without wasting much time on model evaluation.
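One small but concrete step toward reproducibility is to fix and report all random seeds. The snippet below is a minimal sketch of the kind of seed control a NAS experiment would need to document; the seed value and the PyTorch-specific settings are illustrative assumptions.

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix the random seeds of the common sources of nondeterminism."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN-backed operations.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # the chosen value is arbitrary but must be reported
```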
7.5. Robustness
NAS has been proven effective in searching for promising architectures on many open datasets (e.g., CIFAR-10 and ImageNet). These datasets are generally used for research; therefore, most of the images are well-labeled. However, in real-world situations, the data inevitably contain noise (e.g., mislabeling and inadequate information). Even worse, the data might be modified to be adversarial with carefully designed noises. Deep learning models can be easily fooled by adversarial data, and so can NAS.

So far, a few studies [293, 294, 295, 296] have attempted to boost the robustness of NAS against adversarial data. Guo et al. [294] experimentally explored the intrinsic impact of network architectures on network robustness against adversarial attacks, and observed that densely connected architectures tend to be more robust. They also found that the flow of solution procedure (FSP) matrix [297] is a good indicator of network robustness, i.e., the lower the FSP matrix loss, the more robust the network. Chen et al. [295] proposed a robust loss function for effectively alleviating the performance degradation under symmetric label noise. The authors in [296] adopted EA to search for robust architectures from a well-designed and vast search space, where various adversarial attacks are used as the fitness function for evaluating the robustness of neural architectures.
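For reference, the FSP matrix mentioned above is essentially a normalized Gram matrix between the feature maps of two layers [297]. The sketch below shows one plausible way to compute it, together with an L2-style distance between two sets of such matrices; the exact pairing of feature maps used as a robustness indicator follows [294] and is only summarized schematically here.

```python
import torch

def fsp_matrix(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a: (N, C1, H, W), feat_b: (N, C2, H, W) -> FSP matrix (N, C1, C2)."""
    n, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(n, c1, h * w)
    b = feat_b.reshape(n, c2, h * w)
    return torch.bmm(a, b.transpose(1, 2)) / (h * w)

def fsp_loss(feat_pairs_1, feat_pairs_2):
    """Mean squared distance between corresponding FSP matrices of two sets of
    feature-map pairs (e.g., features of a network under different inputs or of
    two different networks, depending on the protocol of the cited work)."""
    losses = [
        ((fsp_matrix(a1, b1) - fsp_matrix(a2, b2)) ** 2).mean()
        for (a1, b1), (a2, b2) in zip(feat_pairs_1, feat_pairs_2)
    ]
    return sum(losses) / len(losses)
```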
7.6. Joint Hyperparameter and Architecture Optimization
Most NAS studies have considered HPO and AO as two separate processes. However, as already noted in Section 4, there is a tremendous overlap between the methods used in HPO and AO, e.g., both of them apply RS, BO, and GO methods. In other words, it is feasible to jointly optimize both hyperparameters and architectures, which is experimentally confirmed by several studies [234, 233, 235]. Thus, how to elegantly solve the problem of joint hyperparameter and architecture optimization (HAO) is an issue worth studying.
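As a simple illustration of what joint HAO can look like, the sketch below samples architecture decisions and training hyperparameters from a single search space under plain random search; the particular choices, value ranges, and the train_and_evaluate() objective are illustrative assumptions.

```python
import random

JOINT_SPACE = {
    # architecture decisions
    "num_cells":     [4, 8, 12, 16],
    "op":            ["conv3x3", "conv5x5", "sep_conv", "skip"],
    "width_mult":    [0.5, 0.75, 1.0, 1.25],
    # training hyperparameters
    "learning_rate": [1e-1, 3e-2, 1e-2, 3e-3],
    "weight_decay":  [1e-5, 3e-5, 1e-4],
    "batch_size":    [64, 128, 256],
}

def sample_config(space):
    """Draw one joint architecture + hyperparameter configuration."""
    return {name: random.choice(choices) for name, choices in space.items()}

def random_search_hao(train_and_evaluate, budget: int = 50):
    """Jointly samples architectures and hyperparameters, keeping the best trial.
    train_and_evaluate is a user-supplied black-box objective: config -> score."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = sample_config(JOINT_SPACE)
        score = train_and_evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The same joint space can of course be handed to a BO or evolutionary optimizer instead of random search; the point is only that nothing forces the two groups of decisions to be optimized separately.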
7.7. Complete AutoML Pipeline
So far, many AutoML pipeline libraries have been proposed, but most of them only focus on some parts of the AutoML pipeline (Figure 1). For instance, TPOT [298], Auto-WEKA [177], and Auto-Sklearn [299] are built on top of scikit-learn [300] for building classification and regression pipelines, but they only search for traditional ML models (such as SVM and KNN). Although TPOT involves neural networks (using a Pytorch [301] backend), it only supports an MLP network. Besides, Auto-Keras [22] is an open-source library developed based on Keras [302], which focuses more on searching for deep learning models and supports multi-modal and multi-task learning. NNI [303] is a more powerful and lightweight AutoML toolkit, as its built-in capability covers automated feature engineering, hyperparameter optimization, and neural architecture search. Additionally, the NAS module in NNI supports both Pytorch [301] and Tensorflow [304] and reproduces many SOTA NAS methods [13, 17, 132, 128, 197, 180, 224], which is very friendly for NAS researchers and developers. Besides, NNI also integrates scikit-learn features [300], which is one step closer to achieving a complete pipeline. Similarly, Vega [305] is another AutoML tool that constructs a complete pipeline covering a set of highly decoupled functions: data augmentation, HPO, NAS, model compression, and full training. In summary, designing an easy-to-use and complete AutoML pipeline system is a promising research direction.
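As an example of how such a partial pipeline is typically used in practice, the following sketch runs TPOT [298] on a toy classification task through its scikit-learn-style interface; the dataset, the generation/population settings, and the exported file name are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Toy data: handwritten-digit classification shipped with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=0)
tpot.fit(X_train, y_train)            # evolves a scikit-learn pipeline
print(tpot.score(X_test, y_test))     # held-out accuracy of the best pipeline
tpot.export("best_pipeline.py")       # emits the winning pipeline as code
```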
7.8. Lifelong Learning
Finally, most AutoML algorithms focus only on solving a specific task on some fixed datasets, e.g., image classification on CIFAR-10 and ImageNet. However, a high-quality AutoML system should have the capability of lifelong learning, i.e., it should be able to 1) efficiently learn new data and 2) remember old knowledge.

7.8.1. Learn New Data
First, the system should be able to reuse prior knowledge to solve new tasks (i.e., learning to learn). For example, a child can quickly identify tigers, rabbits, and elephants after seeing several pictures of these animals. However, current DL models must be trained on considerable data before they can correctly identify images. A hot topic in this area is meta-learning, which aims to design models for new tasks using previous experience.

Meta-learning. Most of the existing NAS methods can search a well-performing architecture for a single task. However, they have to search for a new architecture on a new task; otherwise, the old architecture might not be optimal. Several studies [306, 307, 308, 309] have combined meta-learning and NAS to solve this problem. Recently, Lian et al. [308] proposed a novel meta-learning-based transferable neural architecture search method to generate a meta-architecture, which can adapt to new tasks easily and quickly through a few gradient steps (a minimal sketch of this few-step adaptation idea is given after this subsection). Another challenge of learning new data is the few-shot learning scenario, where there are only limited data for the new tasks. To overcome this challenge, the authors in [307] and [306] applied NAS to few-shot learning, where they only searched for the most promising architecture and optimized it to work on multiple few-shot learning tasks. Elsken et al. [309] proposed a gradient-based meta-learning NAS method, namely METANAS, which can generate task-specific architectures more efficiently as it does not require meta-retraining.

Unsupervised learning. Meta-learning-based NAS methods focus more on labeled data, while in some cases, only a portion of the data may have labels or even none at all. Liu et al. [310] proposed a general problem setup, namely unsupervised neural architecture search (UnNAS), to explore whether labels are necessary for NAS. They experimentally demonstrated that the architectures searched without labels are competitive with those searched with labels; therefore, labels are not necessary for NAS, which has provoked some reflection among researchers about which factors do affect NAS.
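The sketch below illustrates, in a generic way, the "adapt a meta-learned model to a new task with a few gradient steps" idea referred to above; it is not the METANAS algorithm [309] or the method of [308], and the model, loss function, and data handling are assumptions.

```python
import copy

import torch

def adapt_to_task(meta_model, support_x, support_y, loss_fn,
                  steps: int = 3, inner_lr: float = 0.01):
    """Clone a meta-learned model and fine-tune it on a new task's support set."""
    task_model = copy.deepcopy(meta_model)          # keep the meta-model intact
    optimizer = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(steps):                          # only a handful of updates
        optimizer.zero_grad()
        loss = loss_fn(task_model(support_x), support_y)
        loss.backward()
        optimizer.step()
    return task_model                               # task-specific model
```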
7.8.2. Remember Old Knowledge
An AutoML system must be able to constantly learn from new data, without forgetting the knowledge from old data. However, when we use new datasets to train a pretrained model, the model's performance on the previous datasets is substantially reduced. Incremental learning can alleviate this problem. For example, Li and Hoiem [311] proposed the learning without forgetting (LwF) method, which trains a model using only new data while preserving its original capabilities. In addition, iCaRL [312] makes progress based on LwF. It only uses a small proportion of old data for pretraining, and then gradually increases the proportion of a new class of data used to train the model.
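To make the LwF idea more tangible, the following is a minimal sketch of an LwF-style objective, in which a distillation term keeps the updated model's predictions on the old task close to those recorded from a frozen copy of the old model; the temperature and weighting values are illustrative assumptions rather than the exact formulation of [311].

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_task_logits, old_head_logits, frozen_old_logits,
             targets, temperature: float = 2.0, lam: float = 1.0):
    """new_task_logits: new model's outputs for the new task's classes.
    old_head_logits:   new model's outputs on the old task's output head.
    frozen_old_logits: outputs of a frozen copy of the old model on the same inputs."""
    # Standard cross-entropy on the new task's labels.
    ce_new = F.cross_entropy(new_task_logits, targets)
    # Distillation: match the soft predictions of the frozen old model.
    t = temperature
    soft_targets = F.softmax(frozen_old_logits / t, dim=1)
    log_probs = F.log_softmax(old_head_logits / t, dim=1)
    distill = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)
    return ce_new + lam * distill
```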
8. Conclusions

This paper provides a detailed and systematic review of AutoML studies according to the DL pipeline (Figure 1), ranging from data preparation to model evaluation. Additionally, we compare the performance and efficiency of existing NAS algorithms on the CIFAR-10 and ImageNet datasets, and provide an in-depth discussion of different research directions on NAS: one/two-stage NAS, one-shot NAS, and joint HAO. We also describe several interesting open problems and discuss some important future research directions. Although research on AutoML is in its infancy, we believe that future researchers will effectively solve these problems. In this context, this review provides a comprehensive and clear understanding of AutoML for the benefit of those new to this area, and will thus assist with their future research endeavors.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2012, pp. 1106–1114. URL https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90
[3] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 779–788. doi:10.1109/CVPR.2016.91. URL https://doi.org/10.1109/CVPR.2016.91
[4] C. Gong, D. He, X. Tan, T. Qin, L. Wang, T. Liu, FRAGE: frequency-agnostic word representation, in: S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 2018, pp. 1341–1352. URL https://proceedings.neurips.cc/paper/2018/hash/e555ebe0ce426f7f9b2bef0706315e0c-Abstract.html
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2978–2988. doi:10.18653/v1/P19-1285. URL https://www.aclweb.org/anthology/P19-1285
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (3) (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[7] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556
[8] M. Zoller, M. F. Huber, Benchmark and survey of automated machine learning frameworks, arXiv preprint arXiv:1904.12054.
[9] Q. Yao, M. Wang, Y. Chen, W. Dai, H. Yi-Qi, L. Yu-Feng, T. Wei-Wei, Y. Qiang, Y. Yang, Taking human out of learning applications: A survey on automated machine learning, arXiv preprint arXiv:1810.13306.
[10] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: A survey, arXiv preprint arXiv:1808.05377.
[11] K. Yu, C. Sciuto, M. Jaggi, C. Musat, M. Salzmann, Evaluating the search phase of neural architecture search, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL https://openreview.net/forum?id=H1loF2NFwr
[12] B. Zoph, Q. V. Le, Neural architecture search with reinforcement learning, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg
[13] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, J. Dean, Efficient neural architecture search via parameter sharing, in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 4092–4101. URL http://proceedings.mlr.press/v80/pham18a.html
[14] A. Brock, T. Lim, J. M. Ritchie, N. Weston, SMASH: one-shot model architecture search through hypernetworks, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018. URL https://openreview.net/forum?id=rydeCEhs-
[15] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society, 2018, pp. 8697–8710. doi:10.1109/CVPR.2018.00907. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zoph_Learning_Transferable_Architectures_CVPR_2018_paper.html
[16] Z. Zhong, J. Yan, W. Wu, J. Shao, C. Liu, Practical block-wise neural network architecture generation, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society, 2018, pp. 2423–2432. doi:10.1109/CVPR.2018.00257. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zhong_Practical_Block-Wise_Neural_CVPR_2018_paper.html
[17] H. Liu, K. Simonyan, Y. Yang, DARTS: differentiable archi- 2018, July 13-19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp.
tecture search, in: 7th International Conference on Learning 5369–5373. doi:10.24963/ijcai.2018/755.
Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, URL https://doi.org/10.24963/ijcai.2018/755
2019, OpenReview.net, 2019. [29] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink,
URL https://openreview.net/forum?id=S1eYHoC5FX O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy,
[18] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, et al., Evolving deep neural networks (2019) 293–312.
L. Fei-Fei, A. Yuille, J. Huang, K. Murphy, Progressive neural [30] L. Xie, A. L. Yuille, Genetic CNN, in: IEEE International
architecture search (2018) 19–34. Conference on Computer Vision, ICCV 2017, Venice, Italy,
[19] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, K. Kavukcuoglu, October 22-29, 2017, IEEE Computer Society, 2017, pp. 1388–
Hierarchical representations for efficient architecture search, 1397. doi:10.1109/ICCV.2017.154.
in: 6th International Conference on Learning Representations, URL https://doi.org/10.1109/ICCV.2017.154
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, [31] K. Ahmed, L. Torresani, Maskconnect: Connectivity learning
Conference Track Proceedings, OpenReview.net, 2018. by gradient descent (2018) 349–365.
URL https://openreview.net/forum?id=BJQRKzbA- [32] R. Shin, C. Packer, D. Song, Differentiable neural network
[20] T. Chen, I. J. Goodfellow, J. Shlens, Net2net: Accelerating architecture search.
learning via knowledge transfer, in: Y. Bengio, Y. LeCun [33] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, F. Hutter,
(Eds.), 4th International Conference on Learning Represen- Towards automatically-tuned neural networks (2016) 58–65.
tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, [34] A. Zela, A. Klein, S. Falkner, F. Hutter, Towards automated
Conference Track Proceedings, 2016. deep learning: Efficient joint neural architecture and hyperpa-
URL http://arxiv.org/abs/1511.05641 rameter search, arXiv preprint arXiv:1807.06906.
[21] T. Wei, C. Wang, Y. Rui, C. W. Chen, Network morphism, in: [35] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, Fast
M. Balcan, K. Q. Weinberger (Eds.), Proceedings of the 33nd bayesian optimization of machine learning hyperparameters on
International Conference on Machine Learning, ICML 2016, large datasets, in: A. Singh, X. J. Zhu (Eds.), Proceedings of
New York City, NY, USA, June 19-24, 2016, Vol. 48 of JMLR the 20th International Conference on Artificial Intelligence and
Workshop and Conference Proceedings, JMLR.org, 2016, pp. Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale,
564–572. FL, USA, Vol. 54 of Proceedings of Machine Learning Research,
URL http://proceedings.mlr.press/v48/wei16.html PMLR, 2017, pp. 528–536.
[22] H. Jin, Q. Song, X. Hu, Auto-keras: An efficient neural URL http://proceedings.mlr.press/v54/klein17a.html
architecture search system, in: A. Teredesai, V. Kumar, [36] S. Falkner, A. Klein, F. Hutter, Practical hyperparameter
Y. Li, R. Rosales, E. Terzi, G. Karypis (Eds.), Proceed- optimization for deep learning.
ings of the 25th ACM SIGKDD International Conference on [37] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-
Knowledge Discovery & Data Mining, KDD 2019, Anchor- based optimization for general algorithm configuration, in: In-
age, AK, USA, August 4-8, 2019, ACM, 2019, pp. 1946–1956. ternational conference on learning and intelligent optimization,
doi:10.1145/3292500.3330648. 2011, pp. 507–523.
URL https://doi.org/10.1145/3292500.3330648 [38] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient
[23] B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural hyperparameter optimization at scale, in: J. G. Dy, A. Krause
network architectures using reinforcement learning, in: 5th (Eds.), Proceedings of the 35th International Conference on
International Conference on Learning Representations, ICLR Machine Learning, ICML 2018, Stockholmsmässan, Stockholm,
2017, Toulon, France, April 24-26, 2017, Conference Track Sweden, July 10-15, 2018, Vol. 80 of Proceedings of Machine
Proceedings, OpenReview.net, 2017. Learning Research, PMLR, 2018, pp. 1436–1445.
URL https://openreview.net/forum?id=S1c2cvqee URL http://proceedings.mlr.press/v80/falkner18a.html
[24] K. O. Stanley, R. Miikkulainen, Evolving neural networks [39] J. Bergstra, D. Yamins, D. D. Cox, Making a science of model
through augmenting topologies, Evolutionary computation search: Hyperparameter optimization in hundreds of dimen-
10 (2) (2002) 99–127. sions for vision architectures, in: Proceedings of the 30th Inter-
[25] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, national Conference on Machine Learning, ICML 2013, Atlanta,
Q. V. Le, A. Kurakin, Large-scale evolution of image classifiers, GA, USA, 16-21 June 2013, Vol. 28 of JMLR Workshop and
in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th Inter- Conference Proceedings, JMLR.org, 2013, pp. 115–123.
national Conference on Machine Learning, ICML 2017, Sydney, URL http://proceedings.mlr.press/v28/bergstra13.html
NSW, Australia, 6-11 August 2017, Vol. 70 of Proceedings of [40] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian,
Machine Learning Research, PMLR, 2017, pp. 2902–2911. C. Xu, CARS: continuous evolution for efficient neural ar-
URL http://proceedings.mlr.press/v70/real17a.html chitecture search, in: 2020 IEEE/CVF Conference on Com-
[26] E. Real, A. Aggarwal, Y. Huang, Q. V. Le, Regularized evo- puter Vision and Pattern Recognition, CVPR 2020, Seattle,
lution for image classifier architecture search, in: The Thirty- WA, USA, June 13-19, 2020, IEEE, 2020, pp. 1826–1835.
Third AAAI Conference on Artificial Intelligence, AAAI 2019, doi:10.1109/CVPR42600.2020.00190.
The Thirty-First Innovative Applications of Artificial Intel- URL https://doi.org/10.1109/CVPR42600.2020.00190
ligence Conference, IAAI 2019, The Ninth AAAI Sympo- [41] K. Maziarz, M. Tan, A. Khorlin, M. Georgiev, A. Gesmundo,
sium on Educational Advances in Artificial Intelligence, EAAI Evolutionary-neural hybrid agents for architecture searcharXiv:
2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 1811.09828.
AAAI Press, 2019, pp. 4780–4789. doi:10.1609/aaai.v33i01. [42] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu,
33014780. X. Wang, Reinforced evolutionary neural architecture search,
URL https://doi.org/10.1609/aaai.v33i01.33014780 arXiv preprint arXiv:1808.00193.
[27] T. Elsken, J. H. Metzen, F. Hutter, Efficient multi-objective [43] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, M. Zhang,
neural architecture search via lamarckian evolution, in: 7th Surrogate-assisted evolutionary deep learning using an end-to-
International Conference on Learning Representations, ICLR end random forest-based performance predictor, IEEE Trans-
2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, actions on Evolutionary Computation.
2019. [44] B. Wang, Y. Sun, B. Xue, M. Zhang, A hybrid differential evolu-
URL https://openreview.net/forum?id=ByME42AqK7 tion approach to designing deep convolutional neural networks
[28] M. Suganuma, S. Shirakawa, T. Nagao, A genetic programming for image classification, in: Australasian Joint Conference on
approach to designing convolutional neural network architec- Artificial Intelligence, Springer, 2018, pp. 237–250.
tures, in: J. Lang (Ed.), Proceedings of the Twenty-Seventh [45] M. Wistuba, A. Rawat, T. Pedapati, A survey on neural archi-
International Joint Conference on Artificial Intelligence, IJCAI tecture search, arXiv preprint arXiv:1905.01392.

[46] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, Computer Society, 2015, pp. 1431–1439. doi:10.1109/ICCV.
X. Wang, A comprehensive survey of neural architecture search: 2015.168.
Challenges and solutions (2020). arXiv:2006.02903. URL https://doi.org/10.1109/ICCV.2015.168
[47] R. Elshawi, M. Maher, S. Sakr, Automated machine learn- [65] Z. Xu, S. Huang, Y. Zhang, D. Tao, Augmenting strong super-
ing: State-of-the-art and open challenges, arXiv preprint vision using web data for fine-grained categorization, in: 2015
arXiv:1906.02287. IEEE International Conference on Computer Vision, ICCV
[48] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based 2015, Santiago, Chile, December 7-13, 2015, IEEE Computer
learning applied to document recognition, Proceedings of the Society, 2015, pp. 2524–2532. doi:10.1109/ICCV.2015.290.
IEEE 86 (11) (1998) 2278–2324. URL https://doi.org/10.1109/ICCV.2015.290
[49] A. Krizhevsky, V. Nair, G. Hinton, The cifar-10 dataset, online: [66] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer,
http://www. cs. toronto. edu/kriz/cifar. html. Smote: synthetic minority over-sampling technique, Journal of
[50] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Ima- artificial intelligence research 16 (2002) 321–357.
genet: A large-scale hierarchical image database, in: 2009 [67] H. Guo, H. L. Viktor, Learning from imbalanced data sets
IEEE Computer Society Conference on Computer Vision and with boosting and data generation: the databoost-im approach,
Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, ACM Sigkdd Explorations Newsletter 6 (1) (2004) 30–39.
Florida, USA, IEEE Computer Society, 2009, pp. 248–255. [68] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schul-
doi:10.1109/CVPR.2009.5206848. man, J. Tang, W. Zaremba, Openai gym, arXiv preprint
URL https://doi.org/10.1109/CVPR.2009.5206848 arXiv:1606.01540.
[51] J. Yang, X. Sun, Y.-K. Lai, L. Zheng, M.-M. Cheng, Recog- [69] Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, X. Chu, Irs: A
nition from web data: a progressive filtering approach, IEEE large synthetic indoor robotics stereo dataset for disparity and
Transactions on Image Processing 27 (11) (2018) 5303–5315. surface normal estimation, arXiv preprint arXiv:1912.09678.
[52] X. Chen, A. Shrivastava, A. Gupta, NEIL: extracting visual [70] N. Ruiz, S. Schulter, M. Chandraker, Learning to simulate,
knowledge from web data, in: IEEE International Conference in: 7th International Conference on Learning Representations,
on Computer Vision, ICCV 2013, Sydney, Australia, December ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenRe-
1-8, 2013, IEEE Computer Society, 2013, pp. 1409–1416. doi: view.net, 2019.
10.1109/ICCV.2013.178. URL https://openreview.net/forum?id=HJgkx2Aqt7
URL https://doi.org/10.1109/ICCV.2013.178 [71] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
[53] Y. Xia, X. Cao, F. Wen, J. Sun, Well begun is half done: Farley, S. Ozair, A. C. Courville, Y. Bengio, Generative
Generating high-quality seeds for automatic image dataset adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes,
construction from web, in: European Conference on Computer N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural
Vision, Springer, 2014, pp. 387–400. Information Processing Systems 27: Annual Conference on
[54] N. H. Do, K. Yanai, Automatic construction of action datasets Neural Information Processing Systems 2014, December 8-13
using web videos with density-based cluster analysis and outlier 2014, Montreal, Quebec, Canada, 2014, pp. 2672–2680.
detection, in: Pacific-Rim Symposium on Image and Video URL https://proceedings.neurips.cc/paper/2014/hash/
Technology, Springer, 2015, pp. 160–172. 5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[55] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, [72] T.-H. Oh, R. Jaroensri, C. Kim, M. Elgharib, F. Durand,
J. Philbin, L. Fei-Fei, The unreasonable effectiveness of noisy W. T. Freeman, W. Matusik, Learning-based video motion
data for fine-grained recognition, in: European Conference on magnification, in: Proceedings of the European Conference on
Computer Vision, Springer, 2016, pp. 301–320. Computer Vision (ECCV), 2018, pp. 633–648.
[56] P. D. Vo, A. Ginsca, H. Le Borgne, A. Popescu, Harnessing [73] L. Sixt, Rendergan: Generating realistic labeled data–with an
noisy web images for deep representation, Computer Vision application on decoding bee tags, unpublished Bachelor Thesis,
and Image Understanding 164 (2017) 68–81. Freie Universität, Berlin.
[57] B. Collins, J. Deng, K. Li, L. Fei-Fei, Towards scalable dataset [74] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Ham-
construction: An active learning approach, in: European con- mers, D. A. Dickie, M. V. Hernández, J. Wardlaw, D. Rueckert,
ference on computer vision, Springer, 2008, pp. 86–98. Gan augmentation: Augmenting training data using generative
[58] Y. Roh, G. Heo, S. E. Whang, A survey on data collection for adversarial networks, arXiv preprint arXiv:1810.10863.
machine learning: a big data-ai integration perspective, IEEE [75] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park,
Transactions on Knowledge and Data Engineering. Y. Kim, Data synthesis based on generative adversarial net-
[59] D. Yarowsky, Unsupervised word sense disambiguation rivaling works, Proceedings of the VLDB Endowment 11 (10) (2018)
supervised methods, in: 33rd Annual Meeting of the Association 1071–1083.
for Computational Linguistics, Association for Computational [76] L. Xu, K. Veeramachaneni, Synthesizing tabular data using gen-
Linguistics, Cambridge, Massachusetts, USA, 1995, pp. 189– erative adversarial networks, arXiv preprint arXiv:1811.11264.
196. doi:10.3115/981658.981684. [77] D. Donahue, A. Rumshisky, Adversarial text generation without
URL https://www.aclweb.org/anthology/P95-1026 reinforcement learning, arXiv preprint arXiv:1810.06640.
[60] I. Triguero, J. A. Sáez, J. Luengo, S. Garcı́a, F. Herrera, On the [78] T. Karras, S. Laine, T. Aila, A style-based generator ar-
characterization of noise filters for self-training semi-supervised chitecture for generative adversarial networks, in: IEEE
in nearest neighbor classification, Neurocomputing 132 (2014) Conference on Computer Vision and Pattern Recognition,
30–41. CVPR 2019, Long Beach, CA, USA, June 16-20, 2019,
[61] M. F. A. Hady, F. Schwenker, Combining committee-based semi- Computer Vision Foundation / IEEE, 2019, pp. 4401–4410.
supervised learning and active learning, Journal of Computer doi:10.1109/CVPR.2019.00453.
Science and Technology 25 (4) (2010) 681–698. URL http://openaccess.thecvf.com/content_CVPR_2019/
[62] A. Blum, T. Mitchell, Combining labeled and unlabeled data html/Karras_A_Style-Based_Generator_Architecture_for_
with co-training, in: Proceedings of the eleventh annual con- Generative_Adversarial_Networks_CVPR_2019_paper.html
ference on Computational learning theory, ACM, 1998, pp. [79] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data cleaning:
92–100. Overview and emerging challenges, in: F. Özcan, G. Koutrika,
[63] Y. Zhou, S. Goldman, Democratic co-learning, in: Tools with S. Madden (Eds.), Proceedings of the 2016 International Con-
Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE Interna- ference on Management of Data, SIGMOD Conference 2016,
tional Conference on, IEEE, 2004, pp. 594–602. San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016,
[64] X. Chen, A. Gupta, Webly supervised learning of convolutional pp. 2201–2206. doi:10.1145/2882903.2912574.
networks, in: 2015 IEEE International Conference on Computer URL https://doi.org/10.1145/2882903.2912574
Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, IEEE [80] M. Jesmeen, J. Hossen, S. Sayeed, C. Ho, K. Tawsif, A. Rahman,

E. Arif, A survey on cleaning dirty data using machine learning arXiv preprint arXiv:1609.08764.
paradigm for big data analytics, Indonesian Journal of Electrical [97] Z. Xie, S. I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, A. Y. Ng,
Engineering and Computer Science 10 (3) (2018) 1234–1243. Data noising as smoothing in neural network language models,
[81] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, in: 5th International Conference on Learning Representations,
Y. Ye, KATARA: A data cleaning system powered by knowledge ICLR 2017, Toulon, France, April 24-26, 2017, Conference
bases and crowdsourcing, in: T. K. Sellis, S. B. Davidson, Track Proceedings, OpenReview.net, 2017.
Z. G. Ives (Eds.), Proceedings of the 2015 ACM SIGMOD URL https://openreview.net/forum?id=H1VyHY9gg
International Conference on Management of Data, Melbourne, [98] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi,
Victoria, Australia, May 31 - June 4, 2015, ACM, 2015, pp. Q. V. Le, Qanet: Combining local convolution with global
1247–1261. doi:10.1145/2723372.2749431. self-attention for reading comprehension, in: 6th International
URL https://doi.org/10.1145/2723372.2749431 Conference on Learning Representations, ICLR 2018, Vancou-
[82] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska, ver, BC, Canada, April 30 - May 3, 2018, Conference Track
T. Milo, E. Wu, Sampleclean: Fast and reliable analytics on Proceedings, OpenReview.net, 2018.
dirty data., IEEE Data Eng. Bull. 38 (3) (2015) 59–75. URL https://openreview.net/forum?id=B14TlG-RW
[83] S. Krishnan, M. J. Franklin, K. Goldberg, J. Wang, E. Wu, Ac- [99] E. Ma, Nlp augmentation, https://github.com/makcedward/
tiveclean: An interactive data cleaning framework for modern nlpaug (2019).
machine learning, in: F. Özcan, G. Koutrika, S. Madden (Eds.), [100] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, Q. V. Le,
Proceedings of the 2016 International Conference on Manage- Autoaugment: Learning augmentation strategies from data,
ment of Data, SIGMOD Conference 2016, San Francisco, CA, in: IEEE Conference on Computer Vision and Pattern
USA, June 26 - July 01, 2016, ACM, 2016, pp. 2117–2120. Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20,
doi:10.1145/2882903.2899409. 2019, Computer Vision Foundation / IEEE, 2019, pp. 113–123.
URL https://doi.org/10.1145/2882903.2899409 doi:10.1109/CVPR.2019.00020.
[84] S. Krishnan, M. J. Franklin, K. Goldberg, E. Wu, Boostclean: URL http://openaccess.thecvf.com/content_CVPR_
Automated error detection and repair for machine learning, 2019/html/Cubuk_AutoAugment_Learning_Augmentation_
arXiv preprint arXiv:1711.01299. Strategies_From_Data_CVPR_2019_paper.html
[85] S. Krishnan, E. Wu, Alphaclean: Automatic generation of data [101] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson,
cleaning pipelines, arXiv preprint arXiv:1904.11827. Y. Yang, Dada: Differentiable automatic data augmentation,
[86] I. Gemp, G. Theocharous, M. Ghavamzadeh, Automated data arXiv preprint arXiv:2003.03780.
cleansing through meta-learning, in: S. P. Singh, S. Markovitch [102] R. Hataya, J. Zdenek, K. Yoshizoe, H. Nakayama, Faster au-
(Eds.), Proceedings of the Thirty-First AAAI Conference on toaugment: Learning augmentation strategies using backpropa-
Artificial Intelligence, February 4-9, 2017, San Francisco, Cali- gation, arXiv preprint arXiv:1911.06987.
fornia, USA, AAAI Press, 2017, pp. 4760–4761. [103] S. Lim, I. Kim, T. Kim, C. Kim, S. Kim, Fast autoaugment, in:
URL http://aaai.org/ocs/index.php/IAAI/IAAI17/paper/ H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc,
view/14236 E. B. Fox, R. Garnett (Eds.), Advances in Neural Information
[87] I. F. Ilyas, Effective data cleaning with continuous evaluation., Processing Systems 32: Annual Conference on Neural Informa-
IEEE Data Eng. Bull. 39 (2) (2016) 38–46. tion Processing Systems 2019, NeurIPS 2019, December 8-14,
[88] M. Mahdavi, F. Neutatz, L. Visengeriyeva, Z. Abedjan, Towards 2019, Vancouver, BC, Canada, 2019, pp. 6662–6672.
automated data cleaning workflows, Machine Learning 15 (2019) URL https://proceedings.neurips.cc/paper/2019/hash/
16. 6add07cf50424b14fdf649da87843d01-Abstract.html
[89] T. DeVries, G. W. Taylor, Improved regularization of con- [104] A. Naghizadeh, M. Abavisani, D. N. Metaxas, Greedy autoaug-
volutional neural networks with cutout, arXiv preprint ment, arXiv preprint arXiv:1908.00704.
arXiv:1708.04552. [105] D. Ho, E. Liang, X. Chen, I. Stoica, P. Abbeel, Population
[90] H. Zhang, M. Cissé, Y. N. Dauphin, D. Lopez-Paz, mixup: based augmentation: Efficient learning of augmentation policy
Beyond empirical risk minimization, in: 6th International Con- schedules, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceed-
ference on Learning Representations, ICLR 2018, Vancouver, ings of the 36th International Conference on Machine Learning,
BC, Canada, April 30 - May 3, 2018, Conference Track Pro- ICML 2019, 9-15 June 2019, Long Beach, California, USA,
ceedings, OpenReview.net, 2018. Vol. 97 of Proceedings of Machine Learning Research, PMLR,
URL https://openreview.net/forum?id=r1Ddp1-Rb 2019, pp. 2731–2741.
[91] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving, URL http://proceedings.mlr.press/v97/ho19b.html
C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft, [106] T. Niu, M. Bansal, Automatically learning data augmenta-
Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer, tion policies for dialogue tasks, in: Proceedings of the 2019
B. Cook, I. Fernández, F.-M. De Rainville, C.-H. Weng, Conference on Empirical Methods in Natural Language Pro-
A. Ayala-Acevedo, R. Meudec, M. Laporte, et al., imgaug, cessing and the 9th International Joint Conference on Natural
https://github.com/aleju/imgaug, online; accessed 01-Feb- Language Processing (EMNLP-IJCNLP), Association for Com-
2020 (2020). putational Linguistics, Hong Kong, China, 2019, pp. 1317–1323.
[92] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, A. A. doi:10.18653/v1/D19-1132.
Kalinin, Albumentations: fast and flexible image augmenta- URL https://www.aclweb.org/anthology/D19-1132
tions, ArXiv e-printsarXiv:1809.06839. [107] M. Geng, K. Xu, B. Ding, H. Wang, L. Zhang, Learning data
[93] A. Mikolajczyk, M. Grochowski, Data augmentation for im- augmentation policies using augmented random search, arXiv
proving deep learning in image classification problem, in: 2018 preprint arXiv:1811.04768.
international interdisciplinary PhD workshop (IIPhDW), IEEE, [108] X. Zhang, Q. Wang, J. Zhang, Z. Zhong, Adversarial autoaug-
2018, pp. 117–122. ment, in: 8th International Conference on Learning Represen-
[94] A. Mikolajczyk, M. Grochowski, Style transfer-based image tations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020,
synthesis as an efficient regularization technique in deep learn- OpenReview.net, 2020.
ing, in: 2019 24th International Conference on Methods and URL https://openreview.net/forum?id=ByxdUySKvS
Models in Automation and Robotics (MMAR), IEEE, 2019, [109] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin,
pp. 42–47. W. Ouyang, Online hyper-parameter learning for auto-
[95] A. Antoniou, A. Storkey, H. Edwards, Data augmentation gen- augmentation strategy, in: 2019 IEEE/CVF International Con-
erative adversarial networks, arXiv preprint arXiv:1711.04340. ference on Computer Vision, ICCV 2019, Seoul, Korea (South),
[96] S. C. Wong, A. Gatt, V. Stamatescu, M. D. McDonnell, Under- October 27 - November 2, 2019, IEEE, 2019, pp. 6578–6587.
standing data augmentation for classification: when to warp?, doi:10.1109/ICCV.2019.00668.

URL https://doi.org/10.1109/ICCV.2019.00668 [128] X. Chen, L. Xie, J. Wu, Q. Tian, Progressive differentiable
[110] T. C. LingChen, A. Khonsari, A. Lashkari, M. R. Nazari, architecture search: Bridging the depth gap between search and
J. S. Sambee, M. A. Nascimento, Uniformaugment: A search- evaluation, in: 2019 IEEE/CVF International Conference on
free probabilistic data augmentation approach, arXiv preprint Computer Vision, ICCV 2019, Seoul, Korea (South), October
arXiv:2003.14348. 27 - November 2, 2019, IEEE, 2019, pp. 1294–1303. doi:
[111] H. Motoda, H. Liu, Feature selection, extraction and construc- 10.1109/ICCV.2019.00138.
tion, Communication of IICM (Institute of Information and URL https://doi.org/10.1109/ICCV.2019.00138
Computing Machinery, Taiwan) Vol 5 (67-72) (2002) 2. [129] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille,
[112] M. Dash, H. Liu, Feature selection for classification, Intelligent F. Li, Auto-deeplab: Hierarchical neural architecture search
data analysis 1 (1-4) (1997) 131–156. for semantic image segmentation, in: IEEE Conference on
[113] M. J. Pazzani, Constructive induction of cartesian product Computer Vision and Pattern Recognition, CVPR 2019, Long
attributes, in: Feature Extraction, Construction and Selection, Beach, CA, USA, June 16-20, 2019, Computer Vision Founda-
Springer, 1998, pp. 341–354. tion / IEEE, 2019, pp. 82–92. doi:10.1109/CVPR.2019.00017.
[114] Z. Zheng, A comparison of constructing different types of new URL http://openaccess.thecvf.com/content_CVPR_
feature for decision tree learning, in: Feature Extraction, Con- 2019/html/Liu_Auto-DeepLab_Hierarchical_Neural_
struction and Selection, Springer, 1998, pp. 239–255. Architecture_Search_for_Semantic_Image_Segmentation_
[115] J. Gama, Functional trees, Machine Learning 55 (3) (2004) CVPR_2019_paper.html
219–250. [130] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler,
[116] H. Vafaie, K. De Jong, Evolutionary feature space transfor- A. Howard, Q. V. Le, Mnasnet: Platform-aware neural archi-
mation, in: Feature Extraction, Construction and Selection, tecture search for mobile, in: IEEE Conference on Computer
Springer, 1998, pp. 307–323. Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,
[117] P. Sondhi, Feature construction methods: a survey, sifaka. cs. USA, June 16-20, 2019, Computer Vision Foundation / IEEE,
uiuc. edu 69 (2009) 70–71. 2019, pp. 2820–2828. doi:10.1109/CVPR.2019.00293.
[118] D. Roth, K. Small, Interactive feature space construction using URL http://openaccess.thecvf.com/content_CVPR_2019/
semantic information, in: Proceedings of the Thirteenth Con- html/Tan_MnasNet_Platform-Aware_Neural_Architecture_
ference on Computational Natural Language Learning (CoNLL- Search_for_Mobile_CVPR_2019_paper.html
2009), Association for Computational Linguistics, Boulder, [131] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian,
Colorado, 2009, pp. 66–74. P. Vajda, Y. Jia, K. Keutzer, Fbnet: Hardware-aware efficient
URL https://www.aclweb.org/anthology/W09-1110 convnet design via differentiable neural architecture search,
[119] Q. Meng, D. Catchpoole, D. Skillicom, P. J. Kennedy, Rela- in: IEEE Conference on Computer Vision and Pattern
tional autoencoder for feature extraction, in: 2017 International Recognition, CVPR 2019, Long Beach, CA, USA, June
Joint Conference on Neural Networks (IJCNN), IEEE, 2017, 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp.
pp. 364–371. 10734–10742. doi:10.1109/CVPR.2019.01099.
[120] O. Irsoy, E. Alpaydın, Unsupervised feature extraction with URL http://openaccess.thecvf.com/content_CVPR_
autoencoder trees, Neurocomputing 258 (2017) 63–73. 2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_
[121] C. Cortes, V. Vapnik, Support-vector networks, Machine learn- Design_via_Differentiable_Neural_Architecture_Search_
ing 20 (3) (1995) 273–297. CVPR_2019_paper.html
[122] N. S. Altman, An introduction to kernel and nearest-neighbor [132] H. Cai, L. Zhu, S. Han, Proxylessnas: Direct neural architecture
nonparametric regression, The American Statistician 46 (3) search on target task and hardware, in: 7th International
(1992) 175–185. Conference on Learning Representations, ICLR 2019, New
[123] A. Yang, P. M. Esperança, F. M. Carlucci, NAS evaluation is Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.
frustratingly hard, in: 8th International Conference on Learning URL https://openreview.net/forum?id=HylVB3AqYm
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26- [133] M. Courbariaux, Y. Bengio, J. David, Binaryconnect: Training
30, 2020, OpenReview.net, 2020. deep neural networks with binary weights during propagations,
URL https://openreview.net/forum?id=HygrdpVKvr in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
[124] F. Chollet, Xception: Deep learning with depthwise separable R. Garnett (Eds.), Advances in Neural Information Processing
convolutions, in: 2017 IEEE Conference on Computer Vision Systems 28: Annual Conference on Neural Information
and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, Processing Systems 2015, December 7-12, 2015, Montreal,
July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800–1807. Quebec, Canada, 2015, pp. 3123–3131.
doi:10.1109/CVPR.2017.195. URL https://proceedings.neurips.cc/paper/2015/hash/
URL https://doi.org/10.1109/CVPR.2017.195 3e15cc11f979ed25912dff5b0669f2cd-Abstract.html
[125] F. Yu, V. Koltun, Multi-scale context aggregation by dilated [134] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a
convolutions, in: Y. Bengio, Y. LeCun (Eds.), 4th International neural network, arXiv preprint arXiv:1503.02531.
Conference on Learning Representations, ICLR 2016, San Juan, [135] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable
Puerto Rico, May 2-4, 2016, Conference Track Proceedings, are features in deep neural networks?, in: Z. Ghahramani,
2016. M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger
URL http://arxiv.org/abs/1511.07122 (Eds.), Advances in Neural Information Processing Systems 27:
[126] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Annual Conference on Neural Information Processing Systems
in: 2018 IEEE Conference on Computer Vision and Pattern 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014,
Recognition, CVPR 2018, Salt Lake City, UT, USA, June pp. 3320–3328.
18-22, 2018, IEEE Computer Society, 2018, pp. 7132–7141. URL https://proceedings.neurips.cc/paper/2014/hash/
doi:10.1109/CVPR.2018.00745. 375c71349b295fbe2dcdca9206f20a06-Abstract.html
URL http://openaccess.thecvf.com/content_cvpr_2018/ [136] T. Wei, C. Wang, C. W. Chen, Modularized morphing of neural
html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_ networks, arXiv preprint arXiv:1701.03281.
paper.html [137] H. Cai, T. Chen, W. Zhang, Y. Yu, J. Wang, Efficient
[127] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely architecture search by network transformation, in: S. A.
connected convolutional networks, in: 2017 IEEE Conference McIlraith, K. Q. Weinberger (Eds.), Proceedings of the
on Computer Vision and Pattern Recognition, CVPR 2017, Thirty-Second AAAI Conference on Artificial Intelligence,
Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, (AAAI-18), the 30th innovative Applications of Artificial
2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243. Intelligence (IAAI-18), and the 8th AAAI Symposium on
URL https://doi.org/10.1109/CVPR.2017.243 Educational Advances in Artificial Intelligence (EAAI-18),

New Orleans, Louisiana, USA, February 2-7, 2018, AAAI URL https://www.aclweb.org/anthology/H94-1020
Press, 2018, pp. 2787–2794. [153] C. He, H. Ye, L. Shen, T. Zhang, Milenas: Efficient neu-
URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/ ral architecture search via mixed-level reformulation, in: 2020
paper/view/16755 IEEE/CVF Conference on Computer Vision and Pattern Recog-
[138] A. Kwasigroch, M. Grochowski, M. Mikolajczyk, Deep neu- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE,
ral network architecture search using network morphism, in: 2020, pp. 11990–11999. doi:10.1109/CVPR42600.2020.01201.
2019 24th International Conference on Methods and Models in URL https://doi.org/10.1109/CVPR42600.2020.01201
Automation and Robotics (MMAR), IEEE, 2019, pp. 30–35. [154] X. Dong, Y. Yang, Searching for a robust neural architecture
[139] H. Cai, J. Yang, W. Zhang, S. Han, Y. Yu, Path-level network in four GPU hours, in: IEEE Conference on Computer Vision
transformation for efficient architecture search, in: J. G. Dy, and Pattern Recognition, CVPR 2019, Long Beach, CA, USA,
A. Krause (Eds.), Proceedings of the 35th International Con- June 16-20, 2019, Computer Vision Foundation / IEEE, 2019,
ference on Machine Learning, ICML 2018, Stockholmsmässan, pp. 1761–1770. doi:10.1109/CVPR.2019.00186.
Stockholm, Sweden, July 10-15, 2018, Vol. 80 of Proceedings URL http://openaccess.thecvf.com/content_CVPR_2019/
of Machine Learning Research, PMLR, 2018, pp. 677–686. html/Dong_Searching_for_a_Robust_Neural_Architecture_
URL http://proceedings.mlr.press/v80/cai18a.html in_Four_GPU_Hours_CVPR_2019_paper.html
[140] J. Fang, Y. Sun, K. Peng, Q. Zhang, Y. Li, W. Liu, X. Wang, [155] S. Xie, H. Zheng, C. Liu, L. Lin, SNAS: stochastic neural ar-
Fast neural network adaptation via parameter remapping and chitecture search, in: 7th International Conference on Learning
architecture search, in: 8th International Conference on Learn- Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
ing Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2019, OpenReview.net, 2019.
26-30, 2020, OpenReview.net, 2020. URL https://openreview.net/forum?id=rylqooRqK7
URL https://openreview.net/forum?id=rklTmyBKPH [156] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, K. Keutzer,
[141] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, Mixed precision quantization of convnets via differentiable
E. Choi, Morphnet: Fast & simple resource-constrained neural architecture search (2018). arXiv:1812.00090.
structure learning of deep networks, in: 2018 IEEE Conference [157] E. Jang, S. Gu, B. Poole, Categorical reparameterization with
on Computer Vision and Pattern Recognition, CVPR 2018, gumbel-softmax, in: 5th International Conference on Learning
Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Society, 2018, pp. 1586–1595. doi:10.1109/CVPR.2018.00171. Conference Track Proceedings, OpenReview.net, 2017.
URL http://openaccess.thecvf.com/content_cvpr_2018/ URL https://openreview.net/forum?id=rkE3y85ee
html/Gordon_MorphNet_Fast__CVPR_2018_paper.html [158] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution:
[142] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for A continuous relaxation of discrete random variables, in: 5th
convolutional neural networks, in: K. Chaudhuri, R. Salakhut- International Conference on Learning Representations, ICLR
dinov (Eds.), Proceedings of the 36th International Conference 2017, Toulon, France, April 24-26, 2017, Conference Track
on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, Proceedings, OpenReview.net, 2017.
California, USA, Vol. 97 of Proceedings of Machine Learning URL https://openreview.net/forum?id=S1jE5L5gl
Research, PMLR, 2019, pp. 6105–6114. [159] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, Z. Li,
URL http://proceedings.mlr.press/v97/tan19a.html Darts+: Improved differentiable architecture search with early
[143] J. F. Miller, S. L. Harding, Cartesian genetic programming, stopping, arXiv preprint arXiv:1909.06035.
in: Proceedings of the 10th annual conference companion on [160] K. Kandasamy, W. Neiswanger, J. Schneider, B. Póczos, E. P.
Genetic and evolutionary computation, ACM, 2008, pp. 2701– Xing, Neural architecture search with bayesian optimisation
2726. and optimal transport, in: S. Bengio, H. M. Wallach,
[144] J. F. Miller, S. L. Smith, Redundancy and computational H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett
efficiency in cartesian genetic programming, IEEE Transactions (Eds.), Advances in Neural Information Processing Systems 31:
on Evolutionary Computation 10 (2) (2006) 167–174. Annual Conference on Neural Information Processing Systems
[145] F. Gruau, Cellular encoding as a graph grammar, in: IEEE 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada,
Colloquium on Grammatical Inference: Theory, Applications 2018, pp. 2020–2029.
& Alternatives, 1993. URL https://proceedings.neurips.cc/paper/2018/hash/
[146] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, f33ba15effa5c10e873bf3842afb46a6-Abstract.html
M. Jaderberg, M. Lanctot, D. Wierstra, Convolution by evolu- [161] R. Negrinho, G. Gordon, Deeparchitect: Automatically design-
tion: Differentiable pattern producing networks, in: Proceed- ing and training deep architectures (2017). arXiv:1704.08792.
ings of the Genetic and Evolutionary Computation Conference [162] R. Negrinho, M. R. Gormley, G. J. Gordon, D. Patil, N. Le,
2016, ACM, 2016, pp. 109–116. D. Ferreira, Towards modular and programmable architecture
[147] M. Kim, L. Rigazio, Deep clustered convolutional kernels, in: search, in: H. M. Wallach, H. Larochelle, A. Beygelzimer,
Feature Extraction: Modern Questions and Challenges, 2015, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances in
pp. 160–172. Neural Information Processing Systems 32: Annual Conference
[148] J. K. Pugh, K. O. Stanley, Evolving multimodal controllers on Neural Information Processing Systems 2019, NeurIPS
with hyperneat, in: Proceedings of the 15th annual conference 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
on Genetic and evolutionary computation, ACM, 2013, pp. 13715–13725.
735–742. URL https://proceedings.neurips.cc/paper/2019/hash/
[149] H. Zhu, Z. An, C. Yang, K. Xu, E. Zhao, Y. Xu, Eena: Efficient 4ab50afd6dcc95fcba76d0fe04295632-Abstract.html
evolution of neural architecture (2019). arXiv:1905.07320. [163] G. Dikov, J. Bayer, Bayesian learning of neural network ar-
[150] R. J. Williams, Simple statistical gradient-following algorithms chitectures, in: K. Chaudhuri, M. Sugiyama (Eds.), The 22nd
for connectionist reinforcement learning, Machine learning 8 (3- International Conference on Artificial Intelligence and Statis-
4) (1992) 229–256. tics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan,
[151] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Vol. 89 of Proceedings of Machine Learning Research, PMLR,
Proximal policy optimization algorithms, arXiv preprint 2019, pp. 730–738.
arXiv:1707.06347. URL http://proceedings.mlr.press/v89/dikov19a.html
[152] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, [164] C. White, W. Neiswanger, Y. Savani, Bananas: Bayesian op-
A. Bies, M. Ferguson, K. Katz, B. Schasberger, The Penn timization with neural architectures for neural architecture
Treebank: Annotating predicate argument structure, in: Hu- search (2019). arXiv:1910.11858.
man Language Technology: Proceedings of a Workshop held at [165] M. Wistuba, Bayesian optimization combined with incremen-
Plainsboro, New Jersey, March 8-11, 1994, 1994. tal evaluation for neural network architecture optimization,

in: Proceedings of the International Workshop on Automatic [179] Y. Geifman, R. El-Yaniv, Deep active learning with a
Selection, Configuration and Composition of Machine Learning neural architecture search, in: H. M. Wallach, H. Larochelle,
Algorithms, 2017. A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.),
[166] J. Perez-Rua, M. Baccouche, S. Pateux, Efficient progressive Advances in Neural Information Processing Systems 32:
neural architecture search, in: British Machine Vision Confer- Annual Conference on Neural Information Processing Systems
ence 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
BMVA Press, 2018, p. 150. Canada, 2019, pp. 5974–5984.
URL http://bmvc2018.org/contents/papers/0291.pdf URL https://proceedings.neurips.cc/paper/2019/hash/