AutoML: A Survey of the State-of-the-Art
Abstract
Deep learning (DL) techniques have obtained remarkable achievements on various tasks, such as image recognition,
object detection, and language modeling. However, building a high-quality DL system for a specific task highly relies
on human expertise, hindering its wide application. Meanwhile, automated machine learning (AutoML) is a promising
solution for building a DL system without human assistance and is being extensively studied. This paper presents a
comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. According to the DL pipeline, we
introduce AutoML methods covering data preparation, feature engineering, hyperparameter optimization, and neural
architecture search (NAS), with a particular focus on NAS, as it is currently a hot sub-topic of AutoML. We summarize
the representative NAS algorithms’ performance on the CIFAR-10 and ImageNet datasets and further discuss the following
subjects of NAS methods: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture optimization, and
resource-aware NAS. Finally, we discuss some open problems related to the existing AutoML methods for future research.
Keywords: deep learning, automated machine learning (AutoML), neural architecture search (NAS), hyperparameter
optimization (HPO)
Figure 1: An overview of AutoML pipeline covering data preparation (Section 2), feature engineering (Section 3), model generation (Section 4)
and model evaluation (Section 5).
necessary step to build a new dataset or extend the existing dataset. The process of data cleaning is used to filter noisy data so that downstream model training is not compromised. Data augmentation plays an important role in enhancing model robustness and improving model performance. The following subsections cover these three aspects in more detail.

Figure 2: The flow chart for data preparation.

2.1. Data Collection
The deepening study of ML has led to a consensus that high-quality datasets are of critical importance; as a result, numerous open datasets have emerged. In the early stages of ML study, a handwritten digit dataset, i.e., MNIST [48], was developed. After that, several larger datasets like CIFAR-10 and CIFAR-100 [49] and ImageNet [50] were developed. A variety of datasets can also be retrieved by entering keywords into these websites: Kaggle (https://www.kaggle.com), Google Dataset Search (GOODS) (https://datasetsearch.research.google.com/), and Elsevier Data Search (https://www.datasearch.elsevier.com/).
However, it is usually challenging to find a proper dataset through the above approaches for some particular tasks, such as those related to medical care or other privacy matters. Two types of methods are proposed to solve this problem: data searching and data synthesis.

2.1.1. Data Searching
As the Internet is an inexhaustible data source, searching for Web data is an intuitive way to collect a dataset [51, 52, 53, 54]. However, there are some problems with using Web data.
First, the search results may not exactly match the keywords. Thus, unrelated data must be filtered. For example, Krause et al. [55] separate inaccurate results as cross-domain or cross-category noise, and remove any images that appear in search results for more than one category. Vo et al. [56] re-rank relevant results and provide search results linearly, according to keywords.
Second, Web data may be incorrectly labeled or even unlabeled. A learning-based self-labeling method is often used to solve this problem. For example, the active learning method [57] selects the most "uncertain" unlabeled individual examples for labeling by a human, and then iteratively labels the remaining data. Roh et al. [58] provided a review of semi-supervised learning self-labeling methods, which can help take the human out of the labeling loop to improve efficiency, and which can be divided into the following categories: self-training [59, 60], co-training [61, 62], and co-learning [63]. Moreover, due to the complexity of Web image content, a single label cannot adequately describe an image. Consequently, Yang et al. [51] assigned multiple labels to a Web image: if the confidence scores of these labels are very close, or if the label with the highest score is the same as the original label of the image, then the image is kept as a new training sample.
However, the distribution of Web data can be extremely different from that of the target dataset, which will increase the difficulty of training the model. A common solution is to fine-tune the model on these Web data [64, 65]. Yang et al. [51] proposed an iterative algorithm for model training and Web data filtering. Dataset imbalance is another common problem, as some special classes have a very limited number of Web data. To solve this problem, the synthetic minority over-sampling technique (SMOTE) [66] is used to synthesize new minority samples between existing real minority samples, instead of simply up-sampling minority samples or down-sampling the majority samples. In another approach, Guo et al. [67] combined the boosting method with data generation to enhance the generalizability and robustness of the model against imbalanced datasets.

2.1.2. Data Synthesis
A data simulator is one of the most commonly used methods to generate data. For some particular tasks, such as autonomous driving, it is not possible to test and adjust a model in the real world during the research phase, due to safety hazards. Therefore, a practical approach to generating data is to use a data simulator that matches the real world as closely as possible. OpenAI Gym [68] is a popular toolkit that provides various simulation environments, in which developers can concentrate on designing their algorithms, instead of struggling to generate data. Wang et al. [69] used a popular game engine, Unreal Engine 4, to build a large synthetic indoor robotics stereo (IRS) dataset, which provides the information for disparity and surface normal estimation. Furthermore, a reinforcement learning-based method is applied in [70] for optimizing the parameters of a data simulator to control the distribution of the synthesized data.
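Returning briefly to the class-imbalance problem discussed in Section 2.1.1, the snippet below is a minimal sketch of SMOTE-style rebalancing. It assumes the third-party imbalanced-learn package is available and uses a toy, randomly generated dataset; it only illustrates the idea of interpolating new minority samples between existing ones, not the setup of any surveyed work.

# Minimal sketch of rebalancing an imbalanced dataset with SMOTE,
# using the third-party imbalanced-learn package (assumed installed).
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# Toy imbalanced dataset: 1000 majority samples vs. 50 minority samples.
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 16)),
               rng.normal(2.0, 1.0, size=(50, 16))])
y = np.array([0] * 1000 + [1] * 50)

# SMOTE interpolates new minority samples between existing minority
# neighbors instead of simply duplicating them.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # {0: 1000, 1: 50} -> {0: 1000, 1: 1000}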
Another novel technique for deriving synthetic data
is Generative Adversarial Networks (GANs) [71], which
can be used to generate images [71, 72, 73, 74], tabular
[75, 76], and text [77] data. Karras et al. [78] applied the GAN technique to generate realistic human face images. Oh and
Jaroensri et al. [72] built a synthetic dataset, which cap-
tures small motion for video-motion magnification. Bowles
et al. [74] demonstrated the feasibility of using GAN to
generate medical images for brain segmentation tasks. In
the case of textual data, applying GANs has proved difficult: the commonly used method is to update the generator via reinforcement learning, but because text is discrete, the gradient cannot propagate from the discriminator to the generator. To solve this problem,
Donahue et al. [77] used an autoencoder to encode sen-
tences into a smooth sentence representation to remove the
barrier of reinforcement learning. Park et al. [75] applied
GAN to synthesize fake tables that are statistically similar
to the original table but do not cause information leakage.
Similarly, in [76], GAN is applied to generate tabular data
like medical or educational records.
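As a concrete, simplified illustration of the adversarial setup described above, the following PyTorch sketch trains a small generator/discriminator pair on a stand-in batch of tabular records. The layer sizes, noise dimension, and training loop are illustrative assumptions and do not correspond to the architectures used in the cited works.

# Minimal GAN sketch for synthesizing tabular data (illustrative sizes only).
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 32, 10   # hypothetical latent and record dimensions

G = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(),
                  nn.Linear(64, DATA_DIM))                  # generator
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1))                         # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(128, DATA_DIM)          # stand-in for a batch of real records

for step in range(200):
    # Discriminator update: real records -> 1, generated records -> 0.
    fake = G(torch.randn(128, NOISE_DIM)).detach()
    loss_d = bce(D(real), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: try to make the discriminator predict 1 on fakes.
    loss_g = bce(D(G(torch.randn(128, NOISE_DIM))), torch.ones(128, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

synthetic = G(torch.randn(1000, NOISE_DIM)).detach()   # 1000 synthetic records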
Regularization, decision tree, and deep learning are all embedded methods.

3.2. Feature Construction
Feature construction is a process that constructs new
• Architecture Optimization Method. The architecture optimization (AO) method defines how to guide the search to efficiently find the model architecture with high performance after the search space is defined.

• Model Evaluation Method. Once a model is generated, its performance needs to be evaluated. The simplest approach is to train the model to convergence on the training set and then estimate its performance on the validation set; however, this is time-consuming and resource-intensive. Some advanced methods can accelerate the evaluation process, but lose fidelity in doing so. Thus, how to balance the efficiency and effectiveness of an evaluation is a problem worth studying.

The search space and AO methods are presented in this section, while the methods of model evaluation are presented in the next section.

4.1. Search Space
A neural architecture can be represented as a directed acyclic graph (DAG) comprising B ordered nodes. In the DAG, each node and directed edge indicate a feature tensor and an operation, respectively. Eq. 1 presents a formula for the computation at any node Z_k, k ∈ {1, 2, ..., B}:

Z_k = Σ_{i=1}^{N_k} o_i(I_i),  o_i ∈ O   (1)

where N_k indicates the indegree of node Z_k, I_i and o_i represent the i-th input tensor and its associated operation, respectively, and O is the set of candidate operations, such as convolution, pooling, activation functions, skip connection, concatenation, and addition. To further enhance the model performance, many NAS methods use certain advanced human-designed modules as primitive operations, such as depth-wise separable convolution [124], dilated convolution [125], and squeeze-and-excitation (SE) blocks [126]. The selection and combination of these operations vary with the design of the search space. In other words, the search space defines the structural paradigm that AO methods can explore; thus, designing a good search space is a vital but challenging problem. In general, a good search space is expected to exclude human bias and be flexible enough to cover a wide variety of model architectures.
Based on the existing NAS studies, we detail the commonly used search spaces as follows.

Figure 6: Two simplified examples of entire-structured models, built by stacking convolutional layers; the right model additionally uses skip connections.

4.1.1. Entire-structured Search Space
The space of entire-structured neural networks [12, 13] is one of the most intuitive and straightforward search spaces. Figure 6 presents two simplified examples of entire-structured models, which are built by stacking a predefined number of nodes, where each node represents a layer and performs a specified operation. The left model shown in Figure 6 indicates the simplest structure, while the right model is relatively complex, as it permits arbitrary skip connections [2] to exist between the ordered nodes; these connections have been proven effective in practice [12]. Although an entire structure is easy to implement, it has several disadvantages. For example, it is widely accepted that the deeper the model, the better its generalization ability; however, searching for such a deep network is onerous and computationally expensive. Furthermore, the generated architecture lacks transferability: a model generated on a small dataset may not fit a larger dataset, which necessitates the generation of a new model for a larger dataset.
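To make the entire-structured search space concrete, the following sketch encodes an architecture as a flat list of layer-wise operations and instantiates it as a plain stack of layers in PyTorch. The candidate operation set, depth, and channel width are illustrative assumptions rather than the configuration of any particular NAS paper.

# Sketch of an entire-structured search space: an architecture is simply a
# sequence of layer-wise operations drawn from a small candidate set.
import random
import torch.nn as nn

CANDIDATES = {
    "conv3x3":  lambda c: nn.Conv2d(c, c, 3, padding=1),
    "conv5x5":  lambda c: nn.Conv2d(c, c, 5, padding=2),
    "maxpool":  lambda c: nn.MaxPool2d(3, stride=1, padding=1),
    "identity": lambda c: nn.Identity(),
}

def sample_architecture(depth=8):
    """Encode an entire-structured model as a list of operation names."""
    return [random.choice(list(CANDIDATES)) for _ in range(depth)]

def build_model(arch, channels=16, num_classes=10):
    """Instantiate the sampled architecture as a plain layer stack."""
    layers = [nn.Conv2d(3, channels, 3, padding=1), nn.ReLU()]
    for op_name in arch:
        layers += [CANDIDATES[op_name](channels), nn.ReLU()]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(channels, num_classes)]
    return nn.Sequential(*layers)

arch = sample_architecture()
model = build_model(arch)   # train/evaluate, then feed the score back to the AO method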
Figure 11: Factorized hierarchical search space in MnasNet [130]. The final network comprises different cells. Each cell is composed of a variable number of repeated blocks, where the blocks in the same cell share the same structure but differ from those in the other cells.

4.1.4. Morphism-based Search Space
Isaac Newton is reported to have said that "If I have seen further, it is by standing on the shoulders of giants." Similarly, several training tricks have been proposed, such as knowledge distillation [134] and transfer learning [135]. However, these methods do not directly modify the model structure. To this end, Chen et al. [20] proposed the Net2Net technique for designing new neural networks based on an existing network, by inserting identity morphism transformations between its layers.
Several subsequent studies [27, 22, 136, 137, 138, 139, 140, 141] are based on network morphism. For instance, Jin et al. [22] proposed a framework that enables Bayesian optimization to guide the network morphism for an efficient neural architecture search. Wei et al. [136] further improved network morphism at a higher level, i.e., by morphing a convolutional layer into an arbitrary module of a neural network. Additionally, Tan and Le [142] proposed EfficientNet, which re-examines the effect of model scaling on convolutional neural networks, and proved that carefully balancing the network depth, width, and resolution can lead to better performance.
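The following sketch illustrates a function-preserving "deepen" operation in the spirit of Net2Net [20]: a new convolution initialized to the identity mapping is inserted into an existing PyTorch model, so the deeper child initially computes exactly the same function as its parent and can then be trained further. It is a simplified illustration (for example, no new nonlinearity is added along with the layer), not the authors' implementation.

# Function-preserving "deepen" morphism sketch: insert an identity convolution.
import torch
import torch.nn as nn

def insert_identity_conv(parent: nn.Sequential, position: int) -> nn.Sequential:
    layers = list(parent)
    # Find the nearest preceding convolution to match the channel count.
    prev_conv = next(m for m in reversed(layers[:position]) if isinstance(m, nn.Conv2d))
    channels = prev_conv.out_channels
    new_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
    nn.init.dirac_(new_conv.weight)   # identity kernel: output equals input
    nn.init.zeros_(new_conv.bias)
    layers.insert(position, new_conv)
    return nn.Sequential(*layers)

parent = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
child = insert_identity_conv(parent, position=4)

x = torch.randn(2, 3, 32, 32)
assert torch.allclose(parent(x), child(x), atol=1e-6)  # same function, more capacity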
4.2. Architecture Optimization
After defining the search space, we need to search for the best-performing architecture, a process we call architecture optimization (AO). Traditionally, the architecture of a neural network is regarded as a set of static hyperparameters that are tuned based on the performance observed on the validation set. However, this process highly depends on human experts and requires considerable time and resources for trial and error. Therefore, many AO methods have been proposed to free humans from this tedious procedure and to search for novel architectures automatically. Below, we detail the commonly used AO methods.

4.2.1. Evolutionary Algorithm
The evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm that takes inspiration from biological evolution. Compared with traditional optimization algorithms such as exhaustive methods, EA is a mature global optimization method with high robustness and broad applicability. It can effectively address complex problems that traditional optimization algorithms struggle to solve, without being limited by the nature of the problem.
Encoding Scheme. Different EAs may use different types of encoding schemes for network representation. There are two types of encoding schemes: direct and indirect.
Direct encoding is a widely used method that explicitly specifies the phenotype. For example, Genetic CNN [30] encodes the network structure into a fixed-length binary string, e.g., 1 indicates that two nodes are connected, and vice versa. Although binary encoding can be performed easily, its computational space is the square of the number of nodes, which is fixed-length, i.e., predefined manually. For representing variable-length neural networks, DAG encoding is a promising solution [28, 25, 19]. For example, Suganuma et al. [28] used the Cartesian genetic programming (CGP) [143, 144] encoding scheme to represent a neural network built from a list of sub-modules that are defined as DAGs. Similarly, in [25], the neural architecture is also encoded as a graph, whose vertices indicate rank-3 tensors or activations (with batch normalization performed with rectified linear units (ReLUs) or plain linear units) and whose edges indicate identity connections or convolutions. Neuroevolution of augmenting topologies (NEAT) [24, 25] also uses a direct encoding scheme, where each node and connection is stored.
Indirect encoding specifies a generation rule to build the network and allows for a more compact representation. Cellular encoding (CE) [145] is an example of a system that utilizes indirect encoding of network structures. It encodes a family of neural networks into a set of labeled trees and is based on a simple graph grammar. Some recent studies [146, 147, 148, 27] have described the use of indirect encoding schemes to represent a network. For example, the network in [27] can be encoded by a function, and each network can be modified using function-preserving network morphism operators. Hence, the child network has increased capacity and is guaranteed to perform at least as well as the parent networks.

Figure 13: Overview of the evolutionary algorithm.

Four Steps. A typical EA comprises the following steps: selection, crossover, mutation, and update (Figure 13):

• Selection. This step involves selecting a portion of the networks from all generated networks for crossover, which aims to maintain well-performing neural architectures while eliminating the weak ones. The following three strategies are adopted for network selection. The first is fitness selection, in which the probability of a network being selected is proportional to its fitness value, i.e., P(h_i) = Fitness(h_i) / Σ_{j=1}^{N} Fitness(h_j), where h_i indicates the i-th network. The second is rank selection, which is similar to fitness selection, but with the network's selection probability being proportional to its relative fitness rather than its absolute fitness. The third method is tournament selection [25, 27, 26, 19]. Here, in each iteration, k (tournament size) networks are randomly selected from the population and sorted according to their performance; then, the best network is selected with a probability of p, the second-best network with a probability of p × (1 − p), and so on.

• Crossover. After selection, every two networks are selected to generate a new offspring network, inheriting half of the genetic information of each of its parents. This process is analogous to genetic recombination, which occurs during biological reproduction and crossover. The particular manner of crossover varies and depends on the encoding scheme. In binary encoding, networks are encoded as a linear string of bits, where each bit represents a unit, such that two parent networks can be combined through one- or multiple-point crossover. However, the crossover of
the data arranged in such a fashion can sometimes damage useful information.

• Mutation. As the genetic information of the parents is copied and inherited by the next generation, gene mutation also occurs. A point mutation [28, 30] is one of the most widely used operations and involves randomly and independently flipping each bit. Two types of mutations have been described in [29]: one enables or disables a connection between two layers, and the other adds or removes skip connections between two nodes or layers. Meanwhile, Real and Moore et al. [25] predefined a set of mutation operators, such as altering the learning rate and removing skip connections between the nodes. By analogy with the biological process, although a mutation may appear as a mistake that causes damage to the network structure and leads to a loss of functionality, it also enables the exploration of more novel structures and ensures diversity.

• Update. Many new networks are generated by completing the above steps, and considering the limitations on computational resources, some of them must be removed. In [25], the worse-performing of two randomly selected networks is immediately removed from the population. Alternatively, in [26], the oldest networks are removed. Other methods [29, 30, 28] discard all models at regular intervals. However, Liu et al. [19] did not remove any network from the population and instead allowed the population to grow with time. Zhu et al. [149] regulated the population size through a variable λ, i.e., removing the worst model with probability λ and the oldest model with probability 1 − λ.

4.2.2. Reinforcement Learning
Zoph et al. [12] were among the first to apply reinforcement learning (RL) to neural architecture search. Figure 14 presents an overview of an RL-based NAS algorithm. Here, the controller is usually a recurrent neural network (RNN) that executes an action A_t at each step t to sample a new architecture from the search space, and receives an observation of the state S_t together with a reward scalar R_t from the environment to update the controller's sampling strategy. Environment refers to the use of a standard neural network training procedure to train and evaluate the network generated by the controller, after which the corresponding results (such as accuracy) are returned. Many follow-up approaches [23, 15, 16, 13] have used this framework, but with different controller policies and neural-architecture encodings. Zoph et al. [12] first used the policy gradient algorithm [150] to train the controller, and sequentially sampled a string to encode the entire neural architecture. In a subsequent study [15], they used the proximal policy optimization (PPO) algorithm [151] to update the controller, and proposed the method shown in Figure 15 to build a cell-based neural architecture. MetaQNN [23] is a meta-modeling algorithm that uses Q-learning with an ε-greedy exploration strategy and experience replay to sequentially search for neural architectures.

Figure 14: Overview of neural architecture search using reinforcement learning.

Figure 15: Example of a controller generating a cell structure. Each block in the cell comprises two nodes that are specified with different operations and inputs. The indices −2 and −1 indicate that the inputs are derived from the prev-previous and previous cells, respectively.

Although the above RL-based algorithms have achieved SOTA results on the CIFAR-10 and Penn Treebank (PTB) [152] datasets, they incur considerable time and computational resources. For instance, the authors in [12] took 28 days and 800 K40 GPUs to search for the best-performing architecture, and MetaQNN [23] also took 10 days and 10 GPUs to complete its search. To this end, some improved RL-based algorithms have been proposed. BlockQNN [16] uses a distributed asynchronous framework and an early-stop strategy to complete its search on only one GPU within 20 hours. The efficient neural architecture search (ENAS) [13] is even better, as it adopts a parameter-sharing strategy in which all child architectures are regarded as sub-graphs of a supernet; this enables these architectures
to share parameters, obviating the need to train each child model from scratch. Thus, ENAS took only approximately 10 hours on one GPU to search for the best architecture on the CIFAR-10 dataset, which is nearly 1000× faster than [12].

4.2.3. Gradient Descent
The above-mentioned search strategies sample neural architectures from a discrete search space. A pioneering algorithm, namely DARTS [17], was among the first gradient descent (GD)-based methods to search for neural architectures over a continuous and differentiable search space, by using a softmax function to relax the discrete space, as outlined below:

o_{i,j}(x) = Σ_{k=1}^{K} [ exp(α_{i,j}^k) / Σ_{l=1}^{K} exp(α_{i,j}^l) ] o^k(x)   (4)

where o(x) indicates the operation performed on input x, α_{i,j}^k indicates the weight assigned to the operation o^k between a pair of nodes (i, j), and K is the number of predefined candidate operations. After the relaxation, the task of searching for architectures is transformed into a joint optimization of the neural architecture α and the weights θ of this neural architecture. These two types of parameters are optimized alternately, indicating a bilevel optimization problem. Specifically, α and θ are optimized on the validation and the training sets, respectively. The training and validation losses are denoted by L_train and L_val, respectively. Hence, the total loss function can be derived as follows:

min_α L_val(θ*, α)   s.t.   θ* = argmin_θ L_train(θ, α)   (5)

Figure 16 presents an overview of DARTS, where a cell is composed of N (here N = 4) ordered nodes, and the node z^k (k starts from 0) is connected to the nodes z^i, i ∈ {k + 1, ..., N}. The operation on each edge e_{i,j} is initially a mixture of candidate operations, each being of equal weight. Therefore, the neural architecture α is a supernet that contains all possible child neural architectures. At the end of the search, the final architecture is derived by retaining only the maximum-weight operation among all mixed operations.
Although DARTS substantially reduces the search time, it incurs several problems. First, as Eq. 5 shows, DARTS describes the joint optimization of the neural architecture and weights as a bilevel optimization problem. However, this problem is difficult to solve directly, because both the architecture α and the weights θ are high-dimensional parameters. Another solution is single-level optimization, which can be formalized as

min_{θ,α} L_train(θ, α)   (6)

which optimizes the neural architecture and weights together. Although the single-level optimization problem can be efficiently solved as regular training, the searched architecture α commonly overfits the training set and its performance on the validation set cannot be guaranteed. The authors in [153] proposed mixed-level optimization:

min_{α,θ} [ L_train(θ*, α) + λ L_val(θ*, α) ]   (7)

where α indicates the neural architecture, θ is the weight assigned to it, and λ is a non-negative regularization variable that controls the weights of the training loss and validation loss. When λ = 0, Eq. 7 reduces to single-level optimization (Eq. 6); in contrast, Eq. 7 becomes a bilevel optimization (Eq. 5). The experimental results presented in [153] showed that mixed-level optimization not only overcomes the overfitting issue of single-level optimization but also avoids the gradient error of bilevel optimization.
Second, in DARTS, the output of each edge is the weighted sum of all candidate operations (shown in Eq. 4) during the whole search stage, which leads to a linear increase in GPU memory requirements with the number of candidate operations. To reduce resource consumption, many subsequent studies [154, 155, 153, 156, 131] have developed a differentiable sampler to sample a child architecture from the supernet by using a reparameterization trick, namely Gumbel Softmax [157]. The neural architecture is fully factorized and modeled with a concrete distribution [158], which provides an efficient approach to sampling a child architecture and allows gradient backpropagation. Therefore, Eq. 4 is re-formulated as

o_{i,j}(x) = Σ_{k=1}^{K} [ exp((log α_{i,j}^k + G_{i,j}^k)/τ) / Σ_{l=1}^{K} exp((log α_{i,j}^l + G_{i,j}^l)/τ) ] o^k(x)   (8)

where G_{i,j}^k = −log(−log(u_{i,j}^k)) is the k-th Gumbel sample, u_{i,j}^k is a uniform random variable, and τ is the softmax temperature. As τ approaches 0, the probability distribution over the operations between each node pair approaches a one-hot distribution. In GDAS [154], only the operation with the maximum probability on each edge is selected during the forward pass, while the gradient is backpropagated according to Eq. 8. In other words, only one path of the supernet is selected for training, thereby reducing GPU memory usage. Besides, ProxylessNAS [132] alleviates the huge resource consumption through path binarization. Specifically, it transforms the real-valued path weights [17] into binary gates, which activate only one path of the mixed operations and hence solve the memory issue.
Another problem is the joint optimization of different operations, as they may compete with each other, leading to a negative influence. For example, several studies [159, 128] have found that the skip-connect operation dominates at a later search stage in DARTS, which causes the network to become shallower and leads to a marked deterioration in performance. To solve this problem, DARTS+ [159] uses an additional early-stop criterion, such that when two or
more skip-connects occur in a normal cell, the search process stops. In another example, P-DARTS [128] regularizes the search space by executing operation-level dropout to control the proportion of skip-connect operations occurring during training and evaluation.

Figure 16: Overview of DARTS. (a) The data can only flow from lower-level nodes to higher-level nodes, and the operations on edges are initially unknown. (b) The initial operation on each edge is a mixture of candidate operations, each having equal weight. (c) The weight of each operation is learnable and ranges from 0 to 1, whereas for previous discrete sampling methods, the weight could only be 0 or 1. (d) The final neural architecture is constructed by preserving the maximum-weight operation on each edge.

4.2.4. Surrogate Model-based Optimization
Another group of architecture optimization methods is surrogate model-based optimization (SMBO) algorithms [33, 34, 160, 161, 162, 163, 164, 165, 166, 18]. The core concept of SMBO is that it builds a surrogate model of the objective function by iteratively keeping a record of past evaluation results, and uses the surrogate model to predict the most promising architecture. Thus, these methods can substantially shorten the search time and improve efficiency.
SMBO algorithms differ in their surrogate models, which can be broadly divided into Bayesian optimization (BO) methods (including Gaussian process (GP) [167], random forest (RF) [37], and tree-structured Parzen estimator (TPE) [168]) and neural networks [164, 169, 18, 166].
BO [170, 171] is one of the most popular methods for hyperparameter optimization. Many recent studies [33, 34, 160, 161, 162, 163, 164, 165] have attempted to apply these SOTA BO methods to AO. For example, in [172, 173, 160, 165, 174, 175], the validation results of the generated neural architectures were modeled as a Gaussian process, which guides the search for the optimal neural architectures. However, in GP-based BO methods, the inference time scales cubically in the number of observations, and they cannot effectively handle variable-length neural networks. Camero et al. [176] proposed three fixed-length encoding schemes to cope with variable-length problems by using an RF as the surrogate model. Similarly, both [33] and [176] used an RF as the surrogate model, and [177] showed that it works better in high-dimensional settings than GP-based methods.
Instead of using BO, some studies have used a neural network as the surrogate model. For example, in PNAS [18] and EPNAS [166], an LSTM is derived as the surrogate model to progressively predict variable-sized architectures. Meanwhile, NAO [169] uses a simpler surrogate model, i.e., a multilayer perceptron (MLP), and NAO is more efficient and achieves better results on CIFAR-10 than does PNAS [18]. White et al. [164] trained an ensemble of neural networks to predict the mean and variance of the validation results for candidate neural architectures.

4.2.5. Grid and Random Search
Both grid search (GS) and random search (RS) are simple optimization methods that have been applied in several NAS studies [178, 179, 180, 11]. For instance, Geifman et al. [179] proposed a modular architecture search space A = {A(B, i, j) | i ∈ {1, 2, ..., N_cells}, j ∈ {1, 2, ..., N_blocks}} that is spanned by the grid defined by the two corners A(B, 1, 1) and A(B, N_cells, N_blocks), where B is a searched block structure. Evidently, a larger value of N_cells × N_blocks leads to the exploration of a larger space, but requires more resources.
The authors in [180] conducted an effectiveness comparison between SOTA NAS methods and RS. The results showed that RS is a competitive NAS baseline. Specifically, RS with an early-stopping strategy performs as well as ENAS [13], which is a leading RL-based NAS method. Besides, Yu et al. [11] demonstrated that the SOTA NAS techniques are not significantly better than random search.

4.2.6. Hybrid Optimization Method
The abovementioned architecture optimization methods have their own advantages and disadvantages. 1) EA is a mature global optimization method with high robustness. However, it requires considerable computational resources [26, 25], and its evolution operations (such as crossover and mutation) are performed randomly. 2) Although RL-based methods (e.g., ENAS [13]) can learn complex architectural patterns, the searching efficiency and stability of the RL controller are not guaranteed, because it may take several
actions to obtain a positive reward. 3) GD-based methods (e.g., DARTS [17]) substantially improve the searching efficiency by relaxing the categorical candidate operations to continuous variables. Nevertheless, in essence, they all search for a child network from a supernet, which limits the diversity of neural architectures. Therefore, some methods have been proposed that incorporate different optimization methods to capture the best of their advantages; these methods are summarized as follows.
EA+RL. Chen et al. [42] integrated reinforced mutations into an EA, which avoids the randomness of evolution and improves the searching efficiency. Another similar method developed in parallel is the evolutionary-neural hybrid controller (Evo-NAS) [41], which also captures the merits of both RL-based methods and EA. The Evo-NAS controller's mutations are guided by an RL-trained neural network, which can explore a vast search space and sample architectures efficiently.
EA+GD. Yang et al. [40] combined the EA and GD-based methods. The architectures share parameters within one supernet and are tuned on the training set for a few epochs. Then, the populations and the supernet are directly inherited in the next generation, which substantially accelerates the evolution. The authors in [40] took only 0.4 GPU days for searching, which is more efficient than early EA methods (e.g., AmoebaNet [26] took 3,150 GPU days and 450 GPUs for searching).
EA+SMBO. The authors in [43] used an RF as a surrogate to predict model performance, which accelerates the fitness evaluation in EA.
GD+SMBO. Unlike DARTS, which learns weights for candidate operations, NAO [169] proposes a variational autoencoder to generate neural architectures and further builds a regression model as a surrogate to predict the performance of the generated architectures. The encoder maps the representations of the neural architecture to a continuous space, and then a predictor network takes the continuous representation of the neural architecture as input and predicts the corresponding accuracy. Finally, the decoder is used to derive the final architecture from a continuous network representation.

4.3. Hyperparameter Optimization
Most NAS methods use the same set of hyperparameters for all candidate architectures during the whole search stage; thus, after finding the most promising neural architecture, it is necessary to redesign a hyperparameter set and use it to retrain or fine-tune the architecture. As some HPO methods (such as BO and RS) have also been applied in NAS, we only briefly introduce these methods here.

4.3.1. Grid and Random Search
Figure 17 shows the difference between grid search (GS) and random search (RS): GS divides the search space into regular intervals and selects the best-performing point after evaluating all points, while RS selects the best point from a set of randomly drawn points.

Figure 17: Examples of grid search (left) and random search (right) in nine trials for optimizing a two-dimensional function f(x, y) = g(x) + h(y) ≈ g(x) [181]. The parameter in g(x) (light-blue part) is relatively important, while that in h(y) (light-yellow part) is not. In a grid search, nine trials cover only three important parameter values, whereas random search can explore nine distinct values of g. Therefore, random search is more likely to find the optimal combination of parameters than grid search (the figure is adapted from [181]).

GS is very simple and naturally supports parallel implementation; however, it is computationally expensive and inefficient when the hyperparameter space is very large, as the number of trials grows exponentially with the dimensionality of the hyperparameters. To alleviate this problem, Hsu et al. [182] proposed a coarse-to-fine grid search, in which a coarse grid is first inspected to locate a good region, and then a finer grid search is implemented on the identified region. Similarly, Hesterman et al. [183] proposed a contracting GS algorithm, which first computes the likelihood of each point in the grid, and then generates a new grid centered on the maximum-likelihood value. The point separation in the new grid is reduced to half that of the old grid. The above procedure is iterated until the results converge to a local minimum.
Although the authors in [181] empirically and theoretically showed that RS is more practical and efficient than GS, RS does not promise an optimum value. This means that although a longer search increases the probability of finding optimal hyperparameters, it consumes more resources. Li and Jamieson et al. [184] proposed the hyperband algorithm to create a tradeoff between the performance of the hyperparameters and resource budgets. The hyperband algorithm allocates limited resources (such as time or CPUs) to only the most promising hyperparameters, by successively discarding the worst half of the configuration settings long before the training process is finished.

4.3.2. Bayesian Optimization
Bayesian optimization (BO) is an efficient method for the global optimization of expensive blackbox functions. In this section, we briefly introduce BO. For an in-depth discussion of BO, we recommend readers refer to the excellent surveys conducted in [171, 170, 185, 186].
BO is an SMBO method that builds a probabilistic
model mapping from the hyperparameters to the objective metrics evaluated on the validation set. It well balances exploration (evaluating as many hyperparameter sets as possible) and exploitation (allocating more resources to promising hyperparameters).

Algorithm 1 Sequential Model-Based Optimization
INPUT: f, Θ, S, M
D ← INITSAMPLES(f, Θ)
for i in [1, 2, ..., T] do
    p(y|θ, D) ← FITMODEL(M, D)
    θ_i ← argmax_{θ∈Θ} S(θ, p(y|θ, D))
    y_i ← f(θ_i)    ▷ Expensive step
    D ← D ∪ (θ_i, y_i)
end for

The steps of SMBO are expressed in Algorithm 1 (adopted from [170]). Here, several inputs need to be predefined initially, including an evaluation function f, a search space Θ, an acquisition function S, a probabilistic model M, and a record dataset D. Specifically, D is a dataset that records many sample pairs (θ_i, y_i), where θ_i ∈ Θ indicates a sampled neural architecture and y_i indicates its evaluation result. After the initialization, the SMBO steps are as follows:

1. The probabilistic model M is tuned to fit the record dataset D.
2. The acquisition function S is used to select the next promising neural architecture from the probabilistic model M.
3. The performance of the selected neural architecture is evaluated by f, which is an expensive step as it involves training the neural network on the training set and evaluating it on the validation set.
4. The record dataset D is updated by appending the new pair of results (θ_i, y_i).

The above four steps are repeated T times, where T needs to be specified according to the total time or resources available. The commonly used surrogate models for the BO method are GP, RF, and TPE. Table 2 summarizes the existing open-source BO libraries, where GP is one of the most popular surrogate models. However, GP scales cubically with the number of data samples, while RF can natively handle large spaces and scales better to many data samples. Besides, Falkner and Klein et al. [38] proposed the BO-based hyperband (BOHB) algorithm, which combines the strengths of TPE-based BO and hyperband, and hence performs much better than standard BO methods. Furthermore, FABOLAS [35] is a faster BO procedure, which maps the validation loss and training time as functions of dataset size, i.e., trains a generative model on a sub-dataset that gradually increases in size. Here, FABOLAS is 10-100 times faster than other SOTA BO algorithms and identifies the most promising hyperparameters.

Table 2: Open-source Bayesian optimization libraries. GP, RF, and TPE represent Gaussian process [167], random forest [37], and tree-structured Parzen estimator [168], respectively.

Library (URL) | Model
Spearmint (https://github.com/HIPS/Spearmint) | GP
MOE (https://github.com/Yelp/MOE) | GP
PyBO (https://github.com/mwhoffman/pybo) | GP
Bayesopt (https://github.com/rmcantin/bayesopt) | GP
SkGP (https://scikit-optimize.github.io) | GP
GPyOpt (http://sheffieldml.github.io/GPyOpt) | GP
SMAC (https://github.com/automl/SMAC3) | RF
Hyperopt (http://hyperopt.github.io/hyperopt) | TPE
BOHB (https://github.com/automl/HpBandSter) | TPE

4.3.3. Gradient-based Optimization
Another group of HPO methods comprises gradient-based optimization (GO) algorithms [187, 188, 189, 190, 191, 192]. Unlike the above blackbox HPO methods (e.g., GS, RS, and BO), GO methods use gradient information to optimize the hyperparameters and substantially improve the efficiency of HPO. Maclaurin et al. [189] proposed a reversible-dynamics memory-tape approach to handle thousands of hyperparameters efficiently through gradient information. However, optimizing many hyperparameters is computationally challenging. To alleviate this issue, the authors in [190] used approximate gradient information rather than the true gradient to optimize continuous hyperparameters, where the hyperparameters can be updated before the model is trained to convergence. Franceschi et al. [191] studied both reverse- and forward-mode GO methods. The reverse-mode method differs from the method proposed in [189] and does not require reversible dynamics; however, it needs to store the entire training history for computing the gradient with respect to the hyperparameters. The forward-mode method overcomes this problem by updating the hyperparameters in real time, and is demonstrated to significantly improve the efficiency of HPO on large datasets. Chandra [192] proposed a gradient-based ultimate optimizer, which can optimize not only the regular hyperparameters (e.g., the learning rate) but also those of the optimizer (e.g., the Adam optimizer [193]'s moment coefficients β1 and β2).
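As a compact illustration of the SMBO loop in Algorithm 1, the sketch below uses a random-forest surrogate (scikit-learn) and a simple upper-confidence-bound acquisition over randomly drawn candidates. The objective function and the two-dimensional search space (learning rate and weight decay) are toy placeholders, and the acquisition rule is a deliberately simple stand-in for the acquisition functions used by the libraries in Table 2.

# Toy SMBO loop with a random-forest surrogate and a UCB-style acquisition.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def f(theta):                       # expensive evaluation (stand-in objective)
    lr, wd = theta
    return -(np.log10(lr) + 2.5) ** 2 - (np.log10(wd) + 4.0) ** 2

def sample_space(n):                # log-uniform samples of (lr, weight_decay)
    return np.stack([10 ** rng.uniform(-4, -1, n),
                     10 ** rng.uniform(-6, -2, n)], axis=1)

# D <- INITSAMPLES(f, Theta)
thetas = sample_space(5)
ys = np.array([f(t) for t in thetas])

for _ in range(20):
    # p(y | theta, D) <- FITMODEL(M, D)
    surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(thetas, ys)
    # theta_i <- argmax_theta S(theta, p(y | theta, D)): UCB over random candidates
    cand = sample_space(256)
    per_tree = np.stack([t.predict(cand) for t in surrogate.estimators_])
    score = per_tree.mean(0) + per_tree.std(0)   # exploitation + exploration
    theta_i = cand[int(score.argmax())]
    y_i = f(theta_i)                             # expensive step
    # D <- D U (theta_i, y_i)
    thetas = np.vstack([thetas, theta_i])
    ys = np.append(ys, y_i)

print("best (lr, weight_decay):", thetas[int(ys.argmax())])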
5. Model Evaluation
Once a new neural network has been generated, its performance must be evaluated. An intuitive method is to train the network to convergence and then evaluate its performance. However, this method requires extensive time and computing resources. For example, [12] took 800 K40 GPUs and 28 days in total to search. Additionally, NASNet [15] and AmoebaNet [26] required 500 P100 GPUs and 450 K40 GPUs, respectively. In this section, we summarize several algorithms for accelerating the process of model evaluation.

5.1. Low fidelity
As model training time is highly related to the dataset and model size, model evaluation can be accelerated in different ways. First, the number of images or the resolution of images (in terms of image-classification tasks) can be decreased. For example, FABOLAS [35] trains the model on a subset of the training set to accelerate model evaluation. In [194], ImageNet64×64 and its variants 32×32 and 16×16 are provided; these lower-resolution datasets can retain characteristics similar to those of the original ImageNet dataset. Second, low-fidelity model evaluation can be realized by reducing the model size, such as by training with fewer filters per layer [15, 26]. By analogy to ensemble learning, [195] proposes the Transfer Series Expansion (TSE), which constructs an ensemble estimator by linearly combining a series of basic low-fidelity estimators, hence avoiding the bias that can derive from using a single low-fidelity estimator. Furthermore, Zela et al. [34] empirically demonstrated that there is only a weak correlation between performance after short and long training times, thus confirming that a prolonged search for network configurations is unnecessary.

5.2. Weight sharing
In [12], once a network has been evaluated, it is dropped. Hence, the technique of weight sharing is used to accelerate the process of NAS. For example, Wong and Lu et al. [196] proposed Transfer Neural AutoML, which uses knowledge from prior tasks to accelerate network design. ENAS [13] shares parameters among child networks, leading to a thousand-fold faster network design than [12]. Network morphism-based algorithms [20, 21] can also inherit the weights of previous architectures, and single-path NAS [197] uses a single-path over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters.

5.3. Surrogate
The surrogate-based method [198, 199, 200, 43] is another powerful tool that approximates the black-box function. In general, once a good approximation has been obtained, it is trivial to find the configurations that directly optimize the original expensive objective. For example, Progressive Neural Architecture Search (PNAS) [18] introduces a surrogate model to control the method of searching. Although ENAS has been proven to be very efficient, PNAS is even more efficient, as the number of models evaluated by PNAS is over five times that evaluated by ENAS, and PNAS is eight times faster in terms of total computational speed. A well-performing surrogate usually requires a large number of labeled architectures, while the optimization space is too large and hard to quantify, and the evaluation of each configuration is extremely expensive [201]. To alleviate this issue, Luo et al. [202] proposed SemiNAS, a semi-supervised NAS method, which leverages large amounts of unlabeled architectures to train the surrogate, a controller that is used to predict the accuracy of architectures without evaluation. Initially, the surrogate is trained with only a small number of labeled data pairs (architecture, accuracy); then, the generated data pairs are gradually added to the original data to further improve the surrogate.

5.4. Early stopping
Early stopping was first used to prevent overfitting in classical ML, and it has been used in several recent studies [203, 204, 205] to accelerate model evaluation by stopping evaluations that are predicted to perform poorly on the validation set. For example, [205] proposes a learning-curve model that is a weighted combination of a set of parametric curve models selected from the literature, thereby enabling the performance of the network to be predicted. Furthermore, [206] presents a novel approach for early stopping based on fast-to-compute local statistics of the computed gradients, which no longer relies on the validation set and allows the optimizer to make full use of all of the training data.

6. NAS Discussion
In Section 4, we reviewed the various search spaces and architecture optimization methods, and in Section 5, we summarized commonly used model evaluation methods. These two sections introduced many NAS studies, which may cause readers to get lost in the details. Therefore, in this section, we summarize and compare the performance of these NAS algorithms from a global perspective, to provide readers with a clearer and more comprehensive understanding of the development of NAS methods. We then discuss some major topics of the NAS technique.

6.1. NAS Performance Comparison
Many NAS studies have proposed several neural architecture variants, where each variant is designed for a different scenario. For instance, some architecture variants perform better but are larger, while some are lightweight for mobile devices but with a performance penalty. Therefore, we only report the representative results of each study. Besides, to ensure a valid comparison, we consider the accuracy and algorithm efficiency as comparison indices. As the number of GPUs used differs across studies, we use GPU Days, defined as the number of GPUs multiplied by the actual number of days spent searching, to approximately measure the search efficiency.
Reference | Published in | #Params (Millions) | Top-1 Acc (%) | GPU Days | #GPUs | AO
ResNet-110 [2] ECCV16 1.7 93.57 - - Manually
PyramidNet [207] CVPR17 26 96.69 - - designed
DenseNet [127] CVPR17 25.6 96.54 - -
GeNet#2 (G-50) [30] ICCV17 - 92.9 17 -
Large-scale ensemble [25] ICML17 40.4 95.6 2,500 250
Hierarchical-EAS [19] ICLR18 15.7 96.25 300 200
CGP-ResSet [28] IJCAI18 6.4 94.02 27.4 2
AmoebaNet-B (N=6, F=128)+c/o [26] AAAI19 34.9 97.87 3,150 450 K40 EA
AmoebaNet-B (N=6, F=36)+c/o [26] AAAI19 2.8 97.45 3,150 450 K40
Lemonade [27] ICLR19 3.4 97.6 56 8 Titan
EENA [149] ICCV19 8.47 97.44 0.65 1 Titan Xp
EENA (more channels)[149] ICCV19 54.14 97.79 0.65 1 Titan Xp
NASv3[12] ICLR17 7.1 95.53 22,400 800 K40
NASv3+more filters [12] ICLR17 37.4 96.35 22,400 800 K40
MetaQNN [23] ICLR17 - 93.08 100 10
NASNet-A (7 @ 2304)+c/o [15] CVPR18 87.6 97.60 2,000 500 P100
NASNet-A (6 @ 768)+c/o [15] CVPR18 3.3 97.35 2,000 500 P100
Block-QNN-Connection more filter [16] CVPR18 33.3 97.65 96 32 1080Ti
Block-QNN-Depthwise, N=3 [16] CVPR18 3.3 97.42 96 32 1080Ti RL
ENAS+macro [13] ICML18 38.0 96.13 0.32 1
ENAS+micro+c/o [13] ICML18 4.6 97.11 0.45 1
Path-level EAS [139] ICML18 5.7 97.01 200 -
Path-level EAS+c/o [139] ICML18 5.7 97.51 200 -
ProxylessNAS-RL+c/o[132] ICLR19 5.8 97.70 - -
FPNAS[208] ICCV19 5.76 96.99 - -
DARTS(first order)+c/o[17] ICLR19 3.3 97.00 1.5 4 1080Ti
DARTS(second order)+c/o[17] ICLR19 3.3 97.23 4 4 1080Ti
sharpDARTS [178] ArXiv19 3.6 98.07 0.8 1 2080Ti
P-DARTS+c/o[128] ICCV19 3.4 97.50 0.3 -
P-DARTS(large)+c/o[128] ICCV19 10.5 97.75 0.3 -
SETN[209] ICCV19 4.6 97.31 1.8 -
GD
GDAS+c/o [154] CVPR19 2.5 97.18 0.17 1
SNAS+moderate constraint+c/o [155] ICLR19 2.8 97.15 1.5 1
BayesNAS[210] ICML19 3.4 97.59 0.1 1
ProxylessNAS-GD+c/o[132] ICLR19 5.7 97.92 - -
PC-DARTS+c/o [211] CVPR20 3.6 97.43 0.1 1 1080Ti
MiLeNAS[153] CVPR20 3.87 97.66 0.3 -
SGAS[212] CVPR20 3.8 97.61 0.25 1 1080Ti
GDAS-NSAS[213] CVPR20 3.54 97.27 0.4 -
NASBOT[160] NeurIPS18 - 91.31 1.7 -
PNAS [18] ECCV18 3.2 96.59 225 -
SMBO
EPNAS[166] BMVC18 6.6 96.29 1.8 1
GHN[214] ICLR19 5.7 97.16 0.84 -
NAO+random+c/o[169] NeurIPS18 10.6 97.52 200 200 V100
SMASH [14] ICLR18 16 95.97 1.5 -
Hierarchical-random [19] ICLR18 15.7 96.09 8 200
RS
RandomNAS [180] UAI19 4.3 97.15 2.7 -
DARTS - random+c/o [17] ICLR19 3.2 96.71 4 1
RandomNAS-NSAS[213] CVPR20 3.08 97.36 0.7 -
NAO+weight sharing+c/o [169] NeurIPS18 2.5 97.07 0.3 1 V100 GD+SMBO
RENASNet+c/o[42] CVPR19 3.5 91.12 1.5 4 EA+RL
CARS[40] CVPR20 3.6 97.38 0.4 - EA+GD
Table 3: Performance of different NAS algorithms on CIFAR-10. The “AO” column indicates the architecture optimization method. The
dash (-) indicates that the corresponding information is not provided in the original paper. “c/o” indicates the use of Cutout [89]. RL, EA,
GD, RS, and SMBO indicate reinforcement learning, evolution-based algorithm, gradient descent, random search, and surrogate model-based
optimization, respectively.
Reference | Published in | #Params (Millions) | Top-1/Top-5 Acc (%) | GPU Days | #GPUs | AO
ResNet-152 [2] CVPR16 230 70.62/95.51 - -
PyramidNet [207] CVPR17 116.4 70.8/95.3 - -
SENet-154 [126] CVPR17 - 71.32/95.53 - - Manually
DenseNet-201 [127] CVPR17 76.35 78.54/94.46 - - designed
MobileNetV2 [215] CVPR18 6.9 74.7/- - -
GeNet#2[30] ICCV17 - 72.13/90.26 17 -
AmoebaNet-C(N=4,F=50)[26] AAAI19 6.4 75.7/92.4 3,150 450 K40
Hierarchical-EAS[19] ICLR18 - 79.7/94.8 300 200
EA
AmoebaNet-C(N=6,F=228)[26] AAAI19 155.3 83.1/96.3 3,150 450 K40
GreedyNAS [216] CVPR20 6.5 77.1/93.3 1 -
NASNet-A(4@1056) ICLR17 5.3 74.0/91.6 2,000 500 P100
NASNet-A(6@4032) ICLR17 88.9 82.7/96.2 2,000 500 P100
Block-QNN[16] CVPR18 91 81.0/95.42 96 32 1080Ti
Path-level EAS[139] ICML18 - 74.6/91.9 8.3 -
ProxylessNAS(GPU) [132] ICLR19 - 75.1/92.5 8.3 -
RL
ProxylessNAS-RL(mobile) [132] ICLR19 - 74.6/92.2 8.3 -
MnasNet[130] CVPR19 5.2 76.7/93.3 1,666 -
EfficientNet-B0[142] ICML19 5.3 77.3/93.5 - -
EfficientNet-B7[142] ICML19 66 84.4/97.1 - -
FPNAS[208] ICCV19 3.41 73.3/- 0.8 -
DARTS (searched on CIFAR-10)[17] ICLR19 4.7 73.3/81.3 4 -
sharpDARTS[178] Arxiv19 4.9 74.9/92.2 0.8 -
P-DARTS[128] ICCV19 4.9 75.6/92.6 0.3 -
SETN[209] ICCV19 5.4 74.3/92.0 1.8 -
GDAS [154] CVPR19 4.4 72.5/90.9 0.17 1
SNAS[155] ICLR19 4.3 72.7/90.8 1.5 -
ProxylessNAS-G[132] ICLR19 - 74.2/91.7 - -
BayesNAS[210] ICML19 3.9 73.5/91.1 0.2 1
FBNet[131] CVPR19 5.5 74.9/- 216 -
OFA[217] ICLR20 7.7 77.3/- - - GD
AtomNAS[218] ICLR20 5.9 77.6/93.6 - -
MiLeNAS[153] CVPR20 4.9 75.3/92.4 0.3 -
DSNAS[219] CVPR20 - 74.4/91.54 17.5 4 Titan X
SGAS[212] CVPR20 5.4 75.9/92.7 0.25 1 1080Ti
PC-DARTS [211] CVPR20 5.3 75.8/92.7 3.8 8 V100
DenseNAS[220] CVPR20 - 75.3/- 2.7 -
FBNetV2-L1[221] CVPR20 - 77.2/- 25 8 V100
PNAS-5(N=3,F=54)[18] ECCV18 5.1 74.2/91.9 225 -
PNAS-5(N=4,F=216)[18] ECCV18 86.1 82.9/96.2 225 -
SMBO
GHN[214] ICLR19 6.1 73.0/91.3 0.84 -
SemiNAS[202] CVPR20 6.32 76.5/93.2 4 -
Hierarchical-random[19] ICLR18 - 79.6/94.7 8.3 200
RS
OFA-random[217] CVPR20 7.7 73.8/- - -
RENASNet[42] CVPR19 5.36 75.7/92.6 - - EA+RL
Evo-NAS[41] Arxiv20 - 75.43/- 740 - EA+RL
CARS[40] CVPR20 5.1 75.2/92.5 0.4 - EA+GD
Table 4: Performance of different NAS algorithms on ImageNet. The “AO” column indicates the architecture optimization method. The dash
(-) indicates that the corresponding information is not provided in the original paper. RL, EA, GD, RS, and SMBO indicate reinforcement
learning, evolution-based algorithm, gradient descent, random search, and surrogate model-based optimization, respectively.
the actual number of days spent searching. Searching Stage Evaluation Stage
Tables 3 and 4 present the performances of different
Architecture
NAS methods on CIFAR-10 and ImageNet, respectively. Searching Stage
Optimization Evaluation Stage
Retraining the
Besides, as most NAS methods first search for the neural Search best-performing
architecture based on a small dataset (CIFAR-10), and then Architecture model
Space Optimization model of the
Retraining the
transfer the architecture to a larger dataset (ImageNet), searching stage
Search best-performing
the search time for both datasets is the same. The tables Parameter model
Space model of the
show that the early studies on EA- and RL-based NAS Training
searching stage
methods focused more on high performance, regardless of Parameter
the resource consumption. For example, although Amoe- (a) Two-stage NASTraining
comprises the searching stage and evaluation
stage. The best-performing model of the searching stage is further
baNet [26] achieved excellent results for both CIFAR-10
retrained in the evaluation stage. Parameter Training
and ImageNet, the searching took 3,150 GPU days and 450
GPUs. The subsequent NAS studies attempted to improve Model 1
Parameter Training
the searching efficiency while ensuring the searched model’s
Search Architecture Model 2
high performance. For instance, EENA [149] elaborately Model 1 model
Space Optimization
designs the mutation and crossover operations, which can
Search Architecture ...
Model 2
reuse the learned information to guide the evolution pro- model
Space Optimization
cess, and hence, substantially improve the efficiency of Model n
EA-based NAS methods. ENAS [13] is one of the first ...
RL-based NAS methods to adopt the parameter-sharing
Model n
strategy, which reduces the number of GPU budgets to
1 and shortens the searching time to less than one day. (b) One-stage NAS can directly deploy a well-performing model
We also observe that gradient descent-based architecture without extra retraining or fine-tuning. The two-way arrow indicates
optimization methods can substantially reduce the compu- that the processes of architecture optimization and parameter training
run simultaneously.
tational resource consumption for searching, and achieve
SOTA results. Several follow-up studies have been con-
Figure 18: Illustration of two- and one-stage neural architecture
ducted to achieve further improvement and optimization search flow.
in this direction. Interestingly, RS-based methods can also
obtain comparable results. The authors in [180] demon-
strated that RS with weight-sharing could outperform a following properties:
series of powerful methods, such as ENAS [13] and DARTS
[17]. • τ = 1: two rankings are identical
• τ = −1: two rankings are completely opposite.
6.1.1. Kendall Tau Metric
As RS is comparable to more sophisticated methods (e.g., DARTS and ENAS), a natural question is: what are the advantages and significance of the other AO algorithms compared with RS? Researchers have tried to answer this question with metrics other than the model's final accuracy. Most NAS methods comprise two stages: 1) searching for a best-performing architecture on the training set, and 2) expanding it to a deeper architecture and evaluating it on the validation set. However, there usually exists a large gap between the two stages; in other words, the architecture that achieves the best result on the training set is not necessarily the best one on the validation set. Therefore, instead of merely considering the final accuracy and the search time cost, many NAS studies [219, 222, 213, 11, 123] have used the Kendall Tau (τ) metric [223] to evaluate the correlation of model performance between the search and evaluation stages. The parameter τ is defined as

\tau = \frac{N_C - N_D}{N_C + N_D} \qquad (10)

where N_C and N_D indicate the numbers of concordant and discordant pairs, respectively. τ is a number in the range [−1, 1] with the following properties:
• τ = 1: the two rankings are identical;
• τ = −1: the two rankings are completely opposite;
• τ = 0: there is no relationship between the two rankings.
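To make the metric concrete, the following minimal Python sketch computes τ exactly as in Eq. (10) from the scores of a set of candidate architectures in the two stages; the function name and the toy scores are illustrative assumptions rather than code from any surveyed method.

```python
from itertools import combinations

def kendall_tau(search_scores, eval_scores):
    """Kendall Tau (Eq. 10) between the rankings induced by two score lists.

    search_scores[i] and eval_scores[i] are the performances of candidate
    architecture i in the search stage and the evaluation stage, respectively.
    """
    n_c = n_d = 0
    for i, j in combinations(range(len(search_scores)), 2):
        sign = ((search_scores[i] - search_scores[j])
                * (eval_scores[i] - eval_scores[j]))
        if sign > 0:
            n_c += 1  # concordant: both stages order the pair the same way
        elif sign < 0:
            n_d += 1  # discordant: the two stages disagree on the pair
        # ties (sign == 0) are ignored in this simple variant
    return (n_c - n_d) / (n_c + n_d)

# Five hypothetical architectures: proxy accuracy during the search stage
# versus accuracy after full retraining. Identical orderings give tau = 1.0.
print(kendall_tau([0.72, 0.68, 0.75, 0.64, 0.70],
                  [0.955, 0.949, 0.961, 0.940, 0.952]))
```

A high τ means that the ranking observed during the search stage is predictive of the ranking after retraining, which is precisely the gap between the two stages discussed above.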
6.1.2. NAS-Bench Dataset
Although Tables 3 and 4 present a clear comparison between different NAS methods, the results of different methods are obtained under different settings, such as training-related hyperparameters (e.g., batch size and training epochs) and data augmentation (e.g., Cutout [89]). In other words, the comparison is not quite fair. In this context, NAS-Bench-101 [224] is a pioneering work for improving reproducibility. It provides a tabular dataset containing 423,624 unique neural networks, generated and evaluated from a fixed graph-based search space and mapped to their trained and evaluated performance on CIFAR-10. Meanwhile, Dong et al. [225] further built NAS-Bench-201, which is an extension of NAS-Bench-101 with a different search space, results on multiple datasets (CIFAR-10, CIFAR-100, and ImageNet-16-120 [194]), and more diagnostic information. Similarly, Klyuchnikov et al. [226] proposed a NAS-Bench for the NLP task. These datasets enable NAS researchers to focus solely on verifying the effectiveness and efficiency of their AO algorithms, avoiding repetitive training of selected architectures, and substantially help the NAS community to develop.
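The practical benefit of such tabular benchmarks can be illustrated with a small sketch: once every architecture in a fixed search space is mapped to precomputed statistics, "evaluating" a candidate inside a search loop becomes a table lookup instead of hours of training. The toy dictionary and cell encodings below are invented stand-ins, not the actual NAS-Bench-101/201 data or API.

```python
import random

# Toy stand-in for a tabular NAS benchmark: each architecture (here encoded
# as a tuple of operations) maps to precomputed training statistics.
BENCHMARK = {
    ("conv3x3", "conv3x3", "maxpool"): {"val_acc": 0.938, "train_seconds": 1350},
    ("conv3x3", "conv1x1", "maxpool"): {"val_acc": 0.931, "train_seconds": 1120},
    ("conv1x1", "conv3x3", "skip"):    {"val_acc": 0.925, "train_seconds": 980},
    ("skip",    "conv3x3", "maxpool"): {"val_acc": 0.917, "train_seconds": 860},
}

def random_search(n_samples=10, seed=0):
    """Random search whose costly evaluation step is replaced by a lookup."""
    rng = random.Random(seed)
    archs = list(BENCHMARK)
    sampled = [rng.choice(archs) for _ in range(n_samples)]
    best = max(sampled, key=lambda a: BENCHMARK[a]["val_acc"])
    return best, BENCHMARK[best]

print(random_search())
```

Because every AO algorithm queries the same table, differences in reported accuracy can be attributed to the search strategy itself rather than to training tricks or hyperparameter settings.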
Table 5: Summary of the existing automated machine learning applications.

Computer Vision (CV):
  Medical Image Recognition [237, 238, 239]
  Object Detection [240, 241, 242, 243, 244, 245]
  Semantic Segmentation [246, 129, 247, 248, 249, 250, 251]
  Person Re-identification [252]
  Super-Resolution [253, 254, 255]
  Image Restoration [256]
  Generative Adversarial Network (GAN) [257, 258, 259, 260]
  Disparity Estimation [261]
  Video Task [262, 263, 264, 265]

Natural Language Processing (NLP):
  Translation [266]
  Language Modeling [267]
  Entity Recognition [267]
  Text Classification [268]
  Sequential Labeling [268]
  Keyword Spotting [269]

Others:
  Network Compression [270, 271, 272, 273, 274, 275, 276, 277]
  Graph Neural Network (GNN) [278]
  Federated Learning [279, 280]
  Loss Function Search [281, 282]
  Activation Function Search [283]
  Image Caption [284, 285]
  Text to Speech (TTS) [202]
  Recommendation System [286, 287, 288]
Figure 19: Accuracy (Acc %) of state-of-the-art models, comparing automatically generated and human-designed models: WRN 97.7, AmoebaNet 97.87, SENet 97.88, ProxylessNAS 97.92, Fast AA 98.3, EfficientNet 98.9, GPipe 99, BiT-L 99.3.

Figure 20: State-of-the-art models on the PTB dataset (e.g., GPT-2 reaches a perplexity of 35.76). The lower the perplexity, the better the performance. The green bar represents the automatically generated model, and the yellow bar represents the model designed by human experts. Best viewed in color.

recommendation system, and searching for loss and activation functions. Therefore, these interesting studies have indicated the potential of AutoML to be applied in more areas.

7.3. Interpretability
Although AutoML algorithms can find promising configuration settings more efficiently than humans, there is a lack of scientific evidence to illustrate why the found settings perform better. For example, in BlockQNN [16], it is unclear why the NAS algorithm tends to select the concatenation operation.

7.4. Reproducibility
A major challenge with ML is reproducibility, and AutoML is no exception, especially for NAS: most of the existing NAS algorithms still have many parameters that need to be set manually at the implementation level, yet the original papers do not cover these details in depth. For instance, Yang et al. [123] experimentally demonstrated that the seed plays an important role in NAS experiments, whereas most NAS studies do not mention the seed used in their experiments. Besides, considerable resource consumption is another obstacle to reproduction. In this context, several NAS-Bench datasets have been proposed, such as NAS-Bench-101 [224], NAS-Bench-201 [225], and NAS-Bench-NLP [226]. These datasets allow NAS researchers to focus on the design of optimization algorithms without wasting much time on model evaluation.
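Since the seed alone can change the outcome of a NAS experiment [123], a minimal reporting practice is to fix and publish every source of randomness. The sketch below assumes a PyTorch-based implementation (the surveyed methods use a variety of frameworks), so the exact knobs shown are an illustrative assumption.

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Fix the random seeds that typically influence a NAS experiment."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Trade some speed for deterministic cuDNN convolutions.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # report this value together with the search results
```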
7.5. Robustness
NAS has been proven effective in searching for promising architectures on many open datasets (e.g., CIFAR-10 and ImageNet). These datasets are generally used for research; therefore, most of the images are well labeled. However, in real-world situations, the data inevitably contain noise (e.g., mislabeling and inadequate information). Even worse, the data might be modified to be adversarial with carefully designed noise. Deep learning models can be easily fooled by adversarial data, and so can NAS.

So far, a few studies [293, 294, 295, 296] have attempted to boost the robustness of NAS against adversarial data. Guo et al. [294] experimentally explored the intrinsic impact of network architectures on network robustness against adversarial attacks, and observed that densely connected architectures tend to be more robust. They also found that the flow of solution procedure (FSP) matrix [297] is a good indicator of network robustness, i.e., the lower the FSP matrix loss, the more robust the network. Chen et al. [295] proposed a robust loss function for effectively alleviating the performance degradation under symmetric label noise. The authors of [296] adopted EA to search for robust architectures from a well-designed and vast search space, where various adversarial attacks are used as the fitness function for evaluating the robustness of neural architectures.
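As a concrete illustration of using an adversarial attack as a fitness signal, the sketch below scores a candidate network by its accuracy under a single-step FGSM attack; it is a generic PyTorch example, not the exact protocol of [296], and the ε value is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def fgsm_robust_accuracy(model, loader, epsilon=8 / 255, device="cpu"):
    """Accuracy under a one-step FGSM attack, usable as a robustness fitness
    value when ranking candidate architectures in an evolutionary search."""
    model = model.to(device).eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        images.requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        grad, = torch.autograd.grad(loss, images)
        # Perturb each pixel in the direction that increases the loss.
        adv = (images + epsilon * grad.sign()).clamp(0.0, 1.0).detach()
        with torch.no_grad():
            correct += (model(adv).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```

An EA-based robust NAS loop can then use this adversarial accuracy, optionally combined with clean accuracy, as the fitness of each individual.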
7.6. Joint Hyperparameter and Architecture Optimization
Most NAS studies have considered HPO and AO as two separate processes. However, as already noted in Section 4, there is a tremendous overlap between the methods used in HPO and AO; for example, both apply RS, BO, and GO methods. In other words, it is feasible to jointly optimize hyperparameters and architectures, which has been experimentally confirmed by several studies [234, 233, 235]. Thus, how to solve the problem of joint hyperparameter and architecture optimization (HAO) elegantly remains a question worth studying.
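A minimal way to see what joint HAO means in practice is to sample architecture choices and training hyperparameters from one joint space inside a single search loop, as in the random-search sketch below; the search space and the train_and_evaluate callback are illustrative placeholders, not a method from the cited studies.

```python
import random

# Architecture choices and training hyperparameters live in one joint space.
JOINT_SPACE = {
    "n_cells":       [8, 14, 20],                           # architecture
    "operation":     ["conv3x3", "conv5x5", "sep_conv3x3"],
    "init_channels": [16, 36, 48],
    "learning_rate": [0.1, 0.025, 0.01],                    # hyperparameters
    "weight_decay":  [1e-4, 3e-4],
    "use_cutout":    [True, False],
}

def sample_configuration(rng):
    return {name: rng.choice(values) for name, values in JOINT_SPACE.items()}

def joint_random_search(train_and_evaluate, n_trials=20, seed=0):
    """Random search over the joint space; the callback returns a score
    (e.g., validation accuracy) for a sampled configuration."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_configuration(rng)
        score = train_and_evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

More sample-efficient HAO methods replace the random sampling with BO or gradient-based optimizers, but the key design choice is the single, shared search space.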
7.7. Complete AutoML Pipeline
So far, many AutoML pipeline libraries have been proposed, but most of them focus on only some parts of the AutoML pipeline (Figure 1). For instance, TPOT [298], Auto-WEKA [177], and Auto-Sklearn [299] are built on top of scikit-learn [300] for building classification and regression pipelines, but they search only over traditional ML models (such as SVM and KNN). Although TPOT involves neural networks (using a PyTorch [301] backend), it only supports an MLP network. Besides, Auto-Keras [22] is an open-source library developed based on Keras [302], which focuses more on searching for deep learning models and supports multi-modal and multi-task learning. NNI [303] is a more powerful and lightweight AutoML toolkit, as its built-in capabilities cover automated feature engineering, hyperparameter optimization, and neural architecture search. Additionally, the NAS module in NNI supports both PyTorch [301] and TensorFlow [304] and reproduces many SOTA NAS methods [13, 17, 132, 128, 197, 180, 224], which is very friendly for NAS researchers and developers. Besides, NNI also integrates scikit-learn features [300], which is one step closer to achieving a complete pipeline. Similarly, Vega [305] is another AutoML pipeline tool that constructs a complete pipeline covering a set of highly decoupled functions: data augmentation, HPO, NAS, model compression, and full training. In summary, designing an easy-to-use and complete AutoML pipeline system is a promising research direction.
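As a small usage sketch of one of these libraries, the snippet below runs TPOT's evolutionary pipeline search on a scikit-learn toy dataset and exports the winning pipeline as plain Python; the argument values are arbitrary, and exact option names may differ across TPOT versions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Evolutionary search over scikit-learn preprocessing + model pipelines.
automl = TPOTClassifier(generations=5, population_size=20,
                        random_state=0, verbosity=2)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
automl.export("best_pipeline.py")  # emits the winning pipeline as plain code
```

Auto-Sklearn exposes a similar fit/predict interface on top of scikit-learn models, which is part of what makes these tools easy to adopt.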
7.8. Lifelong Learning
Finally, most AutoML algorithms focus only on solving a specific task on some fixed datasets, e.g., image classification on CIFAR-10 and ImageNet. However, a high-quality AutoML system should have the capability of lifelong learning, i.e., it should be able to 1) efficiently learn new data and 2) remember old knowledge.

7.8.1. Learn New Data
First, the system should be able to reuse prior knowledge to solve new tasks (i.e., learning to learn). For example, a child can quickly identify tigers, rabbits, and elephants after seeing several pictures of these animals, whereas current DL models must be trained on considerable data before they can correctly identify images. A hot topic in this area is meta-learning, which aims to design models for new tasks using previous experience.

Meta-learning. Most of the existing NAS methods can search a well-performing architecture for a single task. However, they have to search for a new architecture on a new task; otherwise, the old architecture might not be optimal. Several studies [306, 307, 308, 309] have combined meta-learning and NAS to solve this problem. Recently, Lian et al. [308] proposed a novel meta-learning-based transferable neural architecture search method that generates a meta-architecture, which can adapt to new tasks easily and quickly through a few gradient steps. Another challenge of learning new data is the few-shot learning scenario, where only limited data are available for the new tasks. To overcome this challenge, the authors of [307] and [306] applied NAS to few-shot learning, where they only searched for the most promising architecture and optimized it to work on multiple few-shot learning tasks. Elsken et al. [309] proposed a gradient-based meta-learning NAS method, namely METANAS, which can generate task-specific architectures more efficiently as it does not require meta-retraining.

Unsupervised learning. Meta-learning-based NAS methods focus more on labeled data, but in some cases only a portion of the data may have labels, or none at all. Liu et al. [310] proposed a general problem setup, namely unsupervised neural architecture search (UnNAS), to explore whether labels are necessary for NAS. They experimentally demonstrated that the architectures searched without labels are competitive with those searched with labels; therefore, labels are not necessary for NAS, which has provoked some reflection among researchers about which factors do affect NAS.
7.8.2. Remember Old Knowledge
An AutoML system must be able to constantly learn from new data without forgetting the knowledge learned from old data. However, when we use new datasets to train a pretrained model, the model's performance on the previous datasets is substantially reduced. Incremental learning can alleviate this problem. For example, Li and Hoiem [311] proposed the learning without forgetting (LwF) method, which trains a model using only new data while preserving its original capabilities. In addition, iCaRL [312] makes progress based on LwF: it uses only a small proportion of old data for pretraining and then gradually increases the proportion of the new classes of data used to train the model.
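The core of LwF can be written compactly as a loss that combines the usual cross-entropy on the new task with a distillation term that keeps the updated model's old-task outputs close to those of the frozen pretrained model. The PyTorch sketch below is a simplified illustration of this idea; the temperature and weighting are free hyperparameters here, not the settings of [311].

```python
import torch.nn.functional as F

def lwf_loss(new_task_logits, old_head_logits, frozen_old_head_logits,
             new_labels, temperature=2.0, lambda_old=1.0):
    """Learning-without-forgetting-style objective (simplified sketch).

    new_task_logits:        updated model's outputs for the new task
    old_head_logits:        updated model's outputs on its old-task head
    frozen_old_head_logits: outputs of the frozen pretrained model on the
                            same new-task images, recorded before training
    """
    # 1) Learn the new task from its labels.
    ce_new = F.cross_entropy(new_task_logits, new_labels)
    # 2) Preserve old knowledge by distilling the frozen model's responses,
    #    so no old data is needed during training.
    target = F.softmax(frozen_old_head_logits / temperature, dim=1)
    log_pred = F.log_softmax(old_head_logits / temperature, dim=1)
    distill = F.kl_div(log_pred, target, reduction="batchmean") * temperature ** 2
    return ce_new + lambda_old * distill
```

iCaRL additionally keeps a small exemplar set of old data, trading LwF's data-free setting for better retention of old classes.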
8. Conclusions

This paper provides a detailed and systematic review of AutoML studies according to the DL pipeline (Figure 1), ranging from data preparation to model evaluation. Additionally, we compare the performance and efficiency of existing NAS algorithms on the CIFAR-10 and ImageNet datasets, and provide an in-depth discussion of different research directions on NAS: one/two-stage NAS, one-shot NAS, and joint HAO. We also describe several interesting open problems and discuss some important future research directions. Although research on AutoML is in its infancy, we believe that future researchers will effectively solve these problems. In this context, this review provides a comprehensive and clear understanding of AutoML for the benefit of those new to this area and will thus assist with their future research endeavors.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, December 3-6, 2012, Lake Tahoe, Nevada, United States, 2012, pp. 1106–1114. URL https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90
[3] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 779–788. doi:10.1109/CVPR.2016.91. URL https://doi.org/10.1109/CVPR.2016.91
[4] C. Gong, D. He, X. Tan, T. Qin, L. Wang, T. Liu, FRAGE: frequency-agnostic word representation, in: S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 2018, pp. 1341–1352. URL https://proceedings.neurips.cc/paper/2018/hash/e555ebe0ce426f7f9b2bef0706315e0c-Abstract.html
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2978–2988. doi:10.18653/v1/P19-1285. URL https://www.aclweb.org/anthology/P19-1285
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (3) (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[7] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556
[8] M. Zoller, M. F. Huber, Benchmark and survey of automated machine learning frameworks, arXiv preprint arXiv:1904.12054.
[9] Q. Yao, M. Wang, Y. Chen, W. Dai, H. Yi-Qi, L. Yu-Feng, T. Wei-Wei, Y. Qiang, Y. Yang, Taking human out of learning applications: A survey on automated machine learning, arXiv preprint arXiv:1810.13306.
[10] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: A survey, arXiv preprint arXiv:1808.05377.
[11] K. Yu, C. Sciuto, M. Jaggi, C. Musat, M. Salzmann, Evaluating the search phase of neural architecture search, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL https://openreview.net/forum?id=H1loF2NFwr
[12] B. Zoph, Q. V. Le, Neural architecture search with reinforcement learning, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg
[13] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, J. Dean, Efficient neural architecture search via parameter sharing, in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 4092–4101. URL http://proceedings.mlr.press/v80/pham18a.html
[14] A. Brock, T. Lim, J. M. Ritchie, N. Weston, SMASH: one-shot model architecture search through hypernetworks, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018. URL https://openreview.net/forum?id=rydeCEhs-
[15] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society, 2018, pp. 8697–8710. doi:10.1109/CVPR.2018.00907. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zoph_Learning_Transferable_Architectures_CVPR_2018_paper.html
[16] Z. Zhong, J. Yan, W. Wu, J. Shao, C. Liu, Practical block-wise neural network architecture generation, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society, 2018, pp. 2423–2432. doi:10.1109/CVPR.2018.00257. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zhong_Practical_Block-Wise_Neural_CVPR_2018_paper.html
[17] H. Liu, K. Simonyan, Y. Yang, DARTS: differentiable archi- 2018, July 13-19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp.
tecture search, in: 7th International Conference on Learning 5369–5373. doi:10.24963/ijcai.2018/755.
Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, URL https://doi.org/10.24963/ijcai.2018/755
2019, OpenReview.net, 2019. [29] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink,
URL https://openreview.net/forum?id=S1eYHoC5FX O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy,
[18] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, et al., Evolving deep neural networks (2019) 293–312.
L. Fei-Fei, A. Yuille, J. Huang, K. Murphy, Progressive neural [30] L. Xie, A. L. Yuille, Genetic CNN, in: IEEE International
architecture search (2018) 19–34. Conference on Computer Vision, ICCV 2017, Venice, Italy,
[19] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, K. Kavukcuoglu, October 22-29, 2017, IEEE Computer Society, 2017, pp. 1388–
Hierarchical representations for efficient architecture search, 1397. doi:10.1109/ICCV.2017.154.
in: 6th International Conference on Learning Representations, URL https://doi.org/10.1109/ICCV.2017.154
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, [31] K. Ahmed, L. Torresani, Maskconnect: Connectivity learning
Conference Track Proceedings, OpenReview.net, 2018. by gradient descent (2018) 349–365.
URL https://openreview.net/forum?id=BJQRKzbA- [32] R. Shin, C. Packer, D. Song, Differentiable neural network
[20] T. Chen, I. J. Goodfellow, J. Shlens, Net2net: Accelerating architecture search.
learning via knowledge transfer, in: Y. Bengio, Y. LeCun [33] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, F. Hutter,
(Eds.), 4th International Conference on Learning Represen- Towards automatically-tuned neural networks (2016) 58–65.
tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, [34] A. Zela, A. Klein, S. Falkner, F. Hutter, Towards automated
Conference Track Proceedings, 2016. deep learning: Efficient joint neural architecture and hyperpa-
URL http://arxiv.org/abs/1511.05641 rameter search, arXiv preprint arXiv:1807.06906.
[21] T. Wei, C. Wang, Y. Rui, C. W. Chen, Network morphism, in: [35] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, Fast
M. Balcan, K. Q. Weinberger (Eds.), Proceedings of the 33nd bayesian optimization of machine learning hyperparameters on
International Conference on Machine Learning, ICML 2016, large datasets, in: A. Singh, X. J. Zhu (Eds.), Proceedings of
New York City, NY, USA, June 19-24, 2016, Vol. 48 of JMLR the 20th International Conference on Artificial Intelligence and
Workshop and Conference Proceedings, JMLR.org, 2016, pp. Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale,
564–572. FL, USA, Vol. 54 of Proceedings of Machine Learning Research,
URL http://proceedings.mlr.press/v48/wei16.html PMLR, 2017, pp. 528–536.
[22] H. Jin, Q. Song, X. Hu, Auto-keras: An efficient neural URL http://proceedings.mlr.press/v54/klein17a.html
architecture search system, in: A. Teredesai, V. Kumar, [36] S. Falkner, A. Klein, F. Hutter, Practical hyperparameter
Y. Li, R. Rosales, E. Terzi, G. Karypis (Eds.), Proceed- optimization for deep learning.
ings of the 25th ACM SIGKDD International Conference on [37] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-
Knowledge Discovery & Data Mining, KDD 2019, Anchor- based optimization for general algorithm configuration, in: In-
age, AK, USA, August 4-8, 2019, ACM, 2019, pp. 1946–1956. ternational conference on learning and intelligent optimization,
doi:10.1145/3292500.3330648. 2011, pp. 507–523.
URL https://doi.org/10.1145/3292500.3330648 [38] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient
[23] B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural hyperparameter optimization at scale, in: J. G. Dy, A. Krause
network architectures using reinforcement learning, in: 5th (Eds.), Proceedings of the 35th International Conference on
International Conference on Learning Representations, ICLR Machine Learning, ICML 2018, Stockholmsmässan, Stockholm,
2017, Toulon, France, April 24-26, 2017, Conference Track Sweden, July 10-15, 2018, Vol. 80 of Proceedings of Machine
Proceedings, OpenReview.net, 2017. Learning Research, PMLR, 2018, pp. 1436–1445.
URL https://openreview.net/forum?id=S1c2cvqee URL http://proceedings.mlr.press/v80/falkner18a.html
[24] K. O. Stanley, R. Miikkulainen, Evolving neural networks [39] J. Bergstra, D. Yamins, D. D. Cox, Making a science of model
through augmenting topologies, Evolutionary computation search: Hyperparameter optimization in hundreds of dimen-
10 (2) (2002) 99–127. sions for vision architectures, in: Proceedings of the 30th Inter-
[25] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, national Conference on Machine Learning, ICML 2013, Atlanta,
Q. V. Le, A. Kurakin, Large-scale evolution of image classifiers, GA, USA, 16-21 June 2013, Vol. 28 of JMLR Workshop and
in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th Inter- Conference Proceedings, JMLR.org, 2013, pp. 115–123.
national Conference on Machine Learning, ICML 2017, Sydney, URL http://proceedings.mlr.press/v28/bergstra13.html
NSW, Australia, 6-11 August 2017, Vol. 70 of Proceedings of [40] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian,
Machine Learning Research, PMLR, 2017, pp. 2902–2911. C. Xu, CARS: continuous evolution for efficient neural ar-
URL http://proceedings.mlr.press/v70/real17a.html chitecture search, in: 2020 IEEE/CVF Conference on Com-
[26] E. Real, A. Aggarwal, Y. Huang, Q. V. Le, Regularized evo- puter Vision and Pattern Recognition, CVPR 2020, Seattle,
lution for image classifier architecture search, in: The Thirty- WA, USA, June 13-19, 2020, IEEE, 2020, pp. 1826–1835.
Third AAAI Conference on Artificial Intelligence, AAAI 2019, doi:10.1109/CVPR42600.2020.00190.
The Thirty-First Innovative Applications of Artificial Intel- URL https://doi.org/10.1109/CVPR42600.2020.00190
ligence Conference, IAAI 2019, The Ninth AAAI Sympo- [41] K. Maziarz, M. Tan, A. Khorlin, M. Georgiev, A. Gesmundo,
sium on Educational Advances in Artificial Intelligence, EAAI Evolutionary-neural hybrid agents for architecture searcharXiv:
2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 1811.09828.
AAAI Press, 2019, pp. 4780–4789. doi:10.1609/aaai.v33i01. [42] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu,
33014780. X. Wang, Reinforced evolutionary neural architecture search,
URL https://doi.org/10.1609/aaai.v33i01.33014780 arXiv preprint arXiv:1808.00193.
[27] T. Elsken, J. H. Metzen, F. Hutter, Efficient multi-objective [43] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, M. Zhang,
neural architecture search via lamarckian evolution, in: 7th Surrogate-assisted evolutionary deep learning using an end-to-
International Conference on Learning Representations, ICLR end random forest-based performance predictor, IEEE Trans-
2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, actions on Evolutionary Computation.
2019. [44] B. Wang, Y. Sun, B. Xue, M. Zhang, A hybrid differential evolu-
URL https://openreview.net/forum?id=ByME42AqK7 tion approach to designing deep convolutional neural networks
[28] M. Suganuma, S. Shirakawa, T. Nagao, A genetic programming for image classification, in: Australasian Joint Conference on
approach to designing convolutional neural network architec- Artificial Intelligence, Springer, 2018, pp. 237–250.
tures, in: J. Lang (Ed.), Proceedings of the Twenty-Seventh [45] M. Wistuba, A. Rawat, T. Pedapati, A survey on neural archi-
International Joint Conference on Artificial Intelligence, IJCAI tecture search, arXiv preprint arXiv:1905.01392.
[46] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, Computer Society, 2015, pp. 1431–1439. doi:10.1109/ICCV.
X. Wang, A comprehensive survey of neural architecture search: 2015.168.
Challenges and solutions (2020). arXiv:2006.02903. URL https://doi.org/10.1109/ICCV.2015.168
[47] R. Elshawi, M. Maher, S. Sakr, Automated machine learn- [65] Z. Xu, S. Huang, Y. Zhang, D. Tao, Augmenting strong super-
ing: State-of-the-art and open challenges, arXiv preprint vision using web data for fine-grained categorization, in: 2015
arXiv:1906.02287. IEEE International Conference on Computer Vision, ICCV
[48] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based 2015, Santiago, Chile, December 7-13, 2015, IEEE Computer
learning applied to document recognition, Proceedings of the Society, 2015, pp. 2524–2532. doi:10.1109/ICCV.2015.290.
IEEE 86 (11) (1998) 2278–2324. URL https://doi.org/10.1109/ICCV.2015.290
[49] A. Krizhevsky, V. Nair, G. Hinton, The cifar-10 dataset, online: [66] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer,
http://www. cs. toronto. edu/kriz/cifar. html. Smote: synthetic minority over-sampling technique, Journal of
[50] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Ima- artificial intelligence research 16 (2002) 321–357.
genet: A large-scale hierarchical image database, in: 2009 [67] H. Guo, H. L. Viktor, Learning from imbalanced data sets
IEEE Computer Society Conference on Computer Vision and with boosting and data generation: the databoost-im approach,
Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, ACM Sigkdd Explorations Newsletter 6 (1) (2004) 30–39.
Florida, USA, IEEE Computer Society, 2009, pp. 248–255. [68] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schul-
doi:10.1109/CVPR.2009.5206848. man, J. Tang, W. Zaremba, Openai gym, arXiv preprint
URL https://doi.org/10.1109/CVPR.2009.5206848 arXiv:1606.01540.
[51] J. Yang, X. Sun, Y.-K. Lai, L. Zheng, M.-M. Cheng, Recog- [69] Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, X. Chu, Irs: A
nition from web data: a progressive filtering approach, IEEE large synthetic indoor robotics stereo dataset for disparity and
Transactions on Image Processing 27 (11) (2018) 5303–5315. surface normal estimation, arXiv preprint arXiv:1912.09678.
[52] X. Chen, A. Shrivastava, A. Gupta, NEIL: extracting visual [70] N. Ruiz, S. Schulter, M. Chandraker, Learning to simulate,
knowledge from web data, in: IEEE International Conference in: 7th International Conference on Learning Representations,
on Computer Vision, ICCV 2013, Sydney, Australia, December ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenRe-
1-8, 2013, IEEE Computer Society, 2013, pp. 1409–1416. doi: view.net, 2019.
10.1109/ICCV.2013.178. URL https://openreview.net/forum?id=HJgkx2Aqt7
URL https://doi.org/10.1109/ICCV.2013.178 [71] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
[53] Y. Xia, X. Cao, F. Wen, J. Sun, Well begun is half done: Farley, S. Ozair, A. C. Courville, Y. Bengio, Generative
Generating high-quality seeds for automatic image dataset adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes,
construction from web, in: European Conference on Computer N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural
Vision, Springer, 2014, pp. 387–400. Information Processing Systems 27: Annual Conference on
[54] N. H. Do, K. Yanai, Automatic construction of action datasets Neural Information Processing Systems 2014, December 8-13
using web videos with density-based cluster analysis and outlier 2014, Montreal, Quebec, Canada, 2014, pp. 2672–2680.
detection, in: Pacific-Rim Symposium on Image and Video URL https://proceedings.neurips.cc/paper/2014/hash/
Technology, Springer, 2015, pp. 160–172. 5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[55] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, [72] T.-H. Oh, R. Jaroensri, C. Kim, M. Elgharib, F. Durand,
J. Philbin, L. Fei-Fei, The unreasonable effectiveness of noisy W. T. Freeman, W. Matusik, Learning-based video motion
data for fine-grained recognition, in: European Conference on magnification, in: Proceedings of the European Conference on
Computer Vision, Springer, 2016, pp. 301–320. Computer Vision (ECCV), 2018, pp. 633–648.
[56] P. D. Vo, A. Ginsca, H. Le Borgne, A. Popescu, Harnessing [73] L. Sixt, Rendergan: Generating realistic labeled data–with an
noisy web images for deep representation, Computer Vision application on decoding bee tags, unpublished Bachelor Thesis,
and Image Understanding 164 (2017) 68–81. Freie Universität, Berlin.
[57] B. Collins, J. Deng, K. Li, L. Fei-Fei, Towards scalable dataset [74] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Ham-
construction: An active learning approach, in: European con- mers, D. A. Dickie, M. V. Hernández, J. Wardlaw, D. Rueckert,
ference on computer vision, Springer, 2008, pp. 86–98. Gan augmentation: Augmenting training data using generative
[58] Y. Roh, G. Heo, S. E. Whang, A survey on data collection for adversarial networks, arXiv preprint arXiv:1810.10863.
machine learning: a big data-ai integration perspective, IEEE [75] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park,
Transactions on Knowledge and Data Engineering. Y. Kim, Data synthesis based on generative adversarial net-
[59] D. Yarowsky, Unsupervised word sense disambiguation rivaling works, Proceedings of the VLDB Endowment 11 (10) (2018)
supervised methods, in: 33rd Annual Meeting of the Association 1071–1083.
for Computational Linguistics, Association for Computational [76] L. Xu, K. Veeramachaneni, Synthesizing tabular data using gen-
Linguistics, Cambridge, Massachusetts, USA, 1995, pp. 189– erative adversarial networks, arXiv preprint arXiv:1811.11264.
196. doi:10.3115/981658.981684. [77] D. Donahue, A. Rumshisky, Adversarial text generation without
URL https://www.aclweb.org/anthology/P95-1026 reinforcement learning, arXiv preprint arXiv:1810.06640.
[60] I. Triguero, J. A. Sáez, J. Luengo, S. Garcı́a, F. Herrera, On the [78] T. Karras, S. Laine, T. Aila, A style-based generator ar-
characterization of noise filters for self-training semi-supervised chitecture for generative adversarial networks, in: IEEE
in nearest neighbor classification, Neurocomputing 132 (2014) Conference on Computer Vision and Pattern Recognition,
30–41. CVPR 2019, Long Beach, CA, USA, June 16-20, 2019,
[61] M. F. A. Hady, F. Schwenker, Combining committee-based semi- Computer Vision Foundation / IEEE, 2019, pp. 4401–4410.
supervised learning and active learning, Journal of Computer doi:10.1109/CVPR.2019.00453.
Science and Technology 25 (4) (2010) 681–698. URL http://openaccess.thecvf.com/content_CVPR_2019/
[62] A. Blum, T. Mitchell, Combining labeled and unlabeled data html/Karras_A_Style-Based_Generator_Architecture_for_
with co-training, in: Proceedings of the eleventh annual con- Generative_Adversarial_Networks_CVPR_2019_paper.html
ference on Computational learning theory, ACM, 1998, pp. [79] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data cleaning:
92–100. Overview and emerging challenges, in: F. Özcan, G. Koutrika,
[63] Y. Zhou, S. Goldman, Democratic co-learning, in: Tools with S. Madden (Eds.), Proceedings of the 2016 International Con-
Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE Interna- ference on Management of Data, SIGMOD Conference 2016,
tional Conference on, IEEE, 2004, pp. 594–602. San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016,
[64] X. Chen, A. Gupta, Webly supervised learning of convolutional pp. 2201–2206. doi:10.1145/2882903.2912574.
networks, in: 2015 IEEE International Conference on Computer URL https://doi.org/10.1145/2882903.2912574
Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, IEEE [80] M. Jesmeen, J. Hossen, S. Sayeed, C. Ho, K. Tawsif, A. Rahman,
E. Arif, A survey on cleaning dirty data using machine learning arXiv preprint arXiv:1609.08764.
paradigm for big data analytics, Indonesian Journal of Electrical [97] Z. Xie, S. I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, A. Y. Ng,
Engineering and Computer Science 10 (3) (2018) 1234–1243. Data noising as smoothing in neural network language models,
[81] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, in: 5th International Conference on Learning Representations,
Y. Ye, KATARA: A data cleaning system powered by knowledge ICLR 2017, Toulon, France, April 24-26, 2017, Conference
bases and crowdsourcing, in: T. K. Sellis, S. B. Davidson, Track Proceedings, OpenReview.net, 2017.
Z. G. Ives (Eds.), Proceedings of the 2015 ACM SIGMOD URL https://openreview.net/forum?id=H1VyHY9gg
International Conference on Management of Data, Melbourne, [98] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi,
Victoria, Australia, May 31 - June 4, 2015, ACM, 2015, pp. Q. V. Le, Qanet: Combining local convolution with global
1247–1261. doi:10.1145/2723372.2749431. self-attention for reading comprehension, in: 6th International
URL https://doi.org/10.1145/2723372.2749431 Conference on Learning Representations, ICLR 2018, Vancou-
[82] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska, ver, BC, Canada, April 30 - May 3, 2018, Conference Track
T. Milo, E. Wu, Sampleclean: Fast and reliable analytics on Proceedings, OpenReview.net, 2018.
dirty data., IEEE Data Eng. Bull. 38 (3) (2015) 59–75. URL https://openreview.net/forum?id=B14TlG-RW
[83] S. Krishnan, M. J. Franklin, K. Goldberg, J. Wang, E. Wu, Ac- [99] E. Ma, Nlp augmentation, https://github.com/makcedward/
tiveclean: An interactive data cleaning framework for modern nlpaug (2019).
machine learning, in: F. Özcan, G. Koutrika, S. Madden (Eds.), [100] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, Q. V. Le,
Proceedings of the 2016 International Conference on Manage- Autoaugment: Learning augmentation strategies from data,
ment of Data, SIGMOD Conference 2016, San Francisco, CA, in: IEEE Conference on Computer Vision and Pattern
USA, June 26 - July 01, 2016, ACM, 2016, pp. 2117–2120. Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20,
doi:10.1145/2882903.2899409. 2019, Computer Vision Foundation / IEEE, 2019, pp. 113–123.
URL https://doi.org/10.1145/2882903.2899409 doi:10.1109/CVPR.2019.00020.
[84] S. Krishnan, M. J. Franklin, K. Goldberg, E. Wu, Boostclean: URL http://openaccess.thecvf.com/content_CVPR_
Automated error detection and repair for machine learning, 2019/html/Cubuk_AutoAugment_Learning_Augmentation_
arXiv preprint arXiv:1711.01299. Strategies_From_Data_CVPR_2019_paper.html
[85] S. Krishnan, E. Wu, Alphaclean: Automatic generation of data [101] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson,
cleaning pipelines, arXiv preprint arXiv:1904.11827. Y. Yang, Dada: Differentiable automatic data augmentation,
[86] I. Gemp, G. Theocharous, M. Ghavamzadeh, Automated data arXiv preprint arXiv:2003.03780.
cleansing through meta-learning, in: S. P. Singh, S. Markovitch [102] R. Hataya, J. Zdenek, K. Yoshizoe, H. Nakayama, Faster au-
(Eds.), Proceedings of the Thirty-First AAAI Conference on toaugment: Learning augmentation strategies using backpropa-
Artificial Intelligence, February 4-9, 2017, San Francisco, Cali- gation, arXiv preprint arXiv:1911.06987.
fornia, USA, AAAI Press, 2017, pp. 4760–4761. [103] S. Lim, I. Kim, T. Kim, C. Kim, S. Kim, Fast autoaugment, in:
URL http://aaai.org/ocs/index.php/IAAI/IAAI17/paper/ H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc,
view/14236 E. B. Fox, R. Garnett (Eds.), Advances in Neural Information
[87] I. F. Ilyas, Effective data cleaning with continuous evaluation., Processing Systems 32: Annual Conference on Neural Informa-
IEEE Data Eng. Bull. 39 (2) (2016) 38–46. tion Processing Systems 2019, NeurIPS 2019, December 8-14,
[88] M. Mahdavi, F. Neutatz, L. Visengeriyeva, Z. Abedjan, Towards 2019, Vancouver, BC, Canada, 2019, pp. 6662–6672.
automated data cleaning workflows, Machine Learning 15 (2019) URL https://proceedings.neurips.cc/paper/2019/hash/
16. 6add07cf50424b14fdf649da87843d01-Abstract.html
[89] T. DeVries, G. W. Taylor, Improved regularization of con- [104] A. Naghizadeh, M. Abavisani, D. N. Metaxas, Greedy autoaug-
volutional neural networks with cutout, arXiv preprint ment, arXiv preprint arXiv:1908.00704.
arXiv:1708.04552. [105] D. Ho, E. Liang, X. Chen, I. Stoica, P. Abbeel, Population
[90] H. Zhang, M. Cissé, Y. N. Dauphin, D. Lopez-Paz, mixup: based augmentation: Efficient learning of augmentation policy
Beyond empirical risk minimization, in: 6th International Con- schedules, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceed-
ference on Learning Representations, ICLR 2018, Vancouver, ings of the 36th International Conference on Machine Learning,
BC, Canada, April 30 - May 3, 2018, Conference Track Pro- ICML 2019, 9-15 June 2019, Long Beach, California, USA,
ceedings, OpenReview.net, 2018. Vol. 97 of Proceedings of Machine Learning Research, PMLR,
URL https://openreview.net/forum?id=r1Ddp1-Rb 2019, pp. 2731–2741.
[91] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving, URL http://proceedings.mlr.press/v97/ho19b.html
C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft, [106] T. Niu, M. Bansal, Automatically learning data augmenta-
Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer, tion policies for dialogue tasks, in: Proceedings of the 2019
B. Cook, I. Fernández, F.-M. De Rainville, C.-H. Weng, Conference on Empirical Methods in Natural Language Pro-
A. Ayala-Acevedo, R. Meudec, M. Laporte, et al., imgaug, cessing and the 9th International Joint Conference on Natural
https://github.com/aleju/imgaug, online; accessed 01-Feb- Language Processing (EMNLP-IJCNLP), Association for Com-
2020 (2020). putational Linguistics, Hong Kong, China, 2019, pp. 1317–1323.
[92] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, A. A. doi:10.18653/v1/D19-1132.
Kalinin, Albumentations: fast and flexible image augmenta- URL https://www.aclweb.org/anthology/D19-1132
tions, ArXiv e-printsarXiv:1809.06839. [107] M. Geng, K. Xu, B. Ding, H. Wang, L. Zhang, Learning data
[93] A. Mikolajczyk, M. Grochowski, Data augmentation for im- augmentation policies using augmented random search, arXiv
proving deep learning in image classification problem, in: 2018 preprint arXiv:1811.04768.
international interdisciplinary PhD workshop (IIPhDW), IEEE, [108] X. Zhang, Q. Wang, J. Zhang, Z. Zhong, Adversarial autoaug-
2018, pp. 117–122. ment, in: 8th International Conference on Learning Represen-
[94] A. Mikolajczyk, M. Grochowski, Style transfer-based image tations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020,
synthesis as an efficient regularization technique in deep learn- OpenReview.net, 2020.
ing, in: 2019 24th International Conference on Methods and URL https://openreview.net/forum?id=ByxdUySKvS
Models in Automation and Robotics (MMAR), IEEE, 2019, [109] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin,
pp. 42–47. W. Ouyang, Online hyper-parameter learning for auto-
[95] A. Antoniou, A. Storkey, H. Edwards, Data augmentation gen- augmentation strategy, in: 2019 IEEE/CVF International Con-
erative adversarial networks, arXiv preprint arXiv:1711.04340. ference on Computer Vision, ICCV 2019, Seoul, Korea (South),
[96] S. C. Wong, A. Gatt, V. Stamatescu, M. D. McDonnell, Under- October 27 - November 2, 2019, IEEE, 2019, pp. 6578–6587.
standing data augmentation for classification: when to warp?, doi:10.1109/ICCV.2019.00668.
URL https://doi.org/10.1109/ICCV.2019.00668 [128] X. Chen, L. Xie, J. Wu, Q. Tian, Progressive differentiable
[110] T. C. LingChen, A. Khonsari, A. Lashkari, M. R. Nazari, architecture search: Bridging the depth gap between search and
J. S. Sambee, M. A. Nascimento, Uniformaugment: A search- evaluation, in: 2019 IEEE/CVF International Conference on
free probabilistic data augmentation approach, arXiv preprint Computer Vision, ICCV 2019, Seoul, Korea (South), October
arXiv:2003.14348. 27 - November 2, 2019, IEEE, 2019, pp. 1294–1303. doi:
[111] H. Motoda, H. Liu, Feature selection, extraction and construc- 10.1109/ICCV.2019.00138.
tion, Communication of IICM (Institute of Information and URL https://doi.org/10.1109/ICCV.2019.00138
Computing Machinery, Taiwan) Vol 5 (67-72) (2002) 2. [129] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille,
[112] M. Dash, H. Liu, Feature selection for classification, Intelligent F. Li, Auto-deeplab: Hierarchical neural architecture search
data analysis 1 (1-4) (1997) 131–156. for semantic image segmentation, in: IEEE Conference on
[113] M. J. Pazzani, Constructive induction of cartesian product Computer Vision and Pattern Recognition, CVPR 2019, Long
attributes, in: Feature Extraction, Construction and Selection, Beach, CA, USA, June 16-20, 2019, Computer Vision Founda-
Springer, 1998, pp. 341–354. tion / IEEE, 2019, pp. 82–92. doi:10.1109/CVPR.2019.00017.
[114] Z. Zheng, A comparison of constructing different types of new URL http://openaccess.thecvf.com/content_CVPR_
feature for decision tree learning, in: Feature Extraction, Con- 2019/html/Liu_Auto-DeepLab_Hierarchical_Neural_
struction and Selection, Springer, 1998, pp. 239–255. Architecture_Search_for_Semantic_Image_Segmentation_
[115] J. Gama, Functional trees, Machine Learning 55 (3) (2004) CVPR_2019_paper.html
219–250. [130] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler,
[116] H. Vafaie, K. De Jong, Evolutionary feature space transfor- A. Howard, Q. V. Le, Mnasnet: Platform-aware neural archi-
mation, in: Feature Extraction, Construction and Selection, tecture search for mobile, in: IEEE Conference on Computer
Springer, 1998, pp. 307–323. Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,
[117] P. Sondhi, Feature construction methods: a survey, sifaka. cs. USA, June 16-20, 2019, Computer Vision Foundation / IEEE,
uiuc. edu 69 (2009) 70–71. 2019, pp. 2820–2828. doi:10.1109/CVPR.2019.00293.
[118] D. Roth, K. Small, Interactive feature space construction using URL http://openaccess.thecvf.com/content_CVPR_2019/
semantic information, in: Proceedings of the Thirteenth Con- html/Tan_MnasNet_Platform-Aware_Neural_Architecture_
ference on Computational Natural Language Learning (CoNLL- Search_for_Mobile_CVPR_2019_paper.html
2009), Association for Computational Linguistics, Boulder, [131] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian,
Colorado, 2009, pp. 66–74. P. Vajda, Y. Jia, K. Keutzer, Fbnet: Hardware-aware efficient
URL https://www.aclweb.org/anthology/W09-1110 convnet design via differentiable neural architecture search,
[119] Q. Meng, D. Catchpoole, D. Skillicom, P. J. Kennedy, Rela- in: IEEE Conference on Computer Vision and Pattern
tional autoencoder for feature extraction, in: 2017 International Recognition, CVPR 2019, Long Beach, CA, USA, June
Joint Conference on Neural Networks (IJCNN), IEEE, 2017, 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp.
pp. 364–371. 10734–10742. doi:10.1109/CVPR.2019.01099.
[120] O. Irsoy, E. Alpaydın, Unsupervised feature extraction with URL http://openaccess.thecvf.com/content_CVPR_
autoencoder trees, Neurocomputing 258 (2017) 63–73. 2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_
[121] C. Cortes, V. Vapnik, Support-vector networks, Machine learn- Design_via_Differentiable_Neural_Architecture_Search_
ing 20 (3) (1995) 273–297. CVPR_2019_paper.html
[122] N. S. Altman, An introduction to kernel and nearest-neighbor [132] H. Cai, L. Zhu, S. Han, Proxylessnas: Direct neural architecture
nonparametric regression, The American Statistician 46 (3) search on target task and hardware, in: 7th International
(1992) 175–185. Conference on Learning Representations, ICLR 2019, New
[123] A. Yang, P. M. Esperança, F. M. Carlucci, NAS evaluation is Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.
frustratingly hard, in: 8th International Conference on Learning URL https://openreview.net/forum?id=HylVB3AqYm
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26- [133] M. Courbariaux, Y. Bengio, J. David, Binaryconnect: Training
30, 2020, OpenReview.net, 2020. deep neural networks with binary weights during propagations,
URL https://openreview.net/forum?id=HygrdpVKvr in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
[124] F. Chollet, Xception: Deep learning with depthwise separable R. Garnett (Eds.), Advances in Neural Information Processing
convolutions, in: 2017 IEEE Conference on Computer Vision Systems 28: Annual Conference on Neural Information
and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, Processing Systems 2015, December 7-12, 2015, Montreal,
July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800–1807. Quebec, Canada, 2015, pp. 3123–3131.
doi:10.1109/CVPR.2017.195. URL https://proceedings.neurips.cc/paper/2015/hash/
URL https://doi.org/10.1109/CVPR.2017.195 3e15cc11f979ed25912dff5b0669f2cd-Abstract.html
[125] F. Yu, V. Koltun, Multi-scale context aggregation by dilated [134] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a
convolutions, in: Y. Bengio, Y. LeCun (Eds.), 4th International neural network, arXiv preprint arXiv:1503.02531.
Conference on Learning Representations, ICLR 2016, San Juan, [135] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable
Puerto Rico, May 2-4, 2016, Conference Track Proceedings, are features in deep neural networks?, in: Z. Ghahramani,
2016. M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger
URL http://arxiv.org/abs/1511.07122 (Eds.), Advances in Neural Information Processing Systems 27:
[126] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Annual Conference on Neural Information Processing Systems
in: 2018 IEEE Conference on Computer Vision and Pattern 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014,
Recognition, CVPR 2018, Salt Lake City, UT, USA, June pp. 3320–3328.
18-22, 2018, IEEE Computer Society, 2018, pp. 7132–7141. URL https://proceedings.neurips.cc/paper/2014/hash/
doi:10.1109/CVPR.2018.00745. 375c71349b295fbe2dcdca9206f20a06-Abstract.html
URL http://openaccess.thecvf.com/content_cvpr_2018/ [136] T. Wei, C. Wang, C. W. Chen, Modularized morphing of neural
html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_ networks, arXiv preprint arXiv:1701.03281.
paper.html [137] H. Cai, T. Chen, W. Zhang, Y. Yu, J. Wang, Efficient
[127] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely architecture search by network transformation, in: S. A.
connected convolutional networks, in: 2017 IEEE Conference McIlraith, K. Q. Weinberger (Eds.), Proceedings of the
on Computer Vision and Pattern Recognition, CVPR 2017, Thirty-Second AAAI Conference on Artificial Intelligence,
Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, (AAAI-18), the 30th innovative Applications of Artificial
2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243. Intelligence (IAAI-18), and the 8th AAAI Symposium on
URL https://doi.org/10.1109/CVPR.2017.243 Educational Advances in Artificial Intelligence (EAAI-18),
New Orleans, Louisiana, USA, February 2-7, 2018, AAAI URL https://www.aclweb.org/anthology/H94-1020
Press, 2018, pp. 2787–2794. [153] C. He, H. Ye, L. Shen, T. Zhang, Milenas: Efficient neu-
URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/ ral architecture search via mixed-level reformulation, in: 2020
paper/view/16755 IEEE/CVF Conference on Computer Vision and Pattern Recog-
[138] A. Kwasigroch, M. Grochowski, M. Mikolajczyk, Deep neu- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE,
ral network architecture search using network morphism, in: 2020, pp. 11990–11999. doi:10.1109/CVPR42600.2020.01201.
2019 24th International Conference on Methods and Models in URL https://doi.org/10.1109/CVPR42600.2020.01201
Automation and Robotics (MMAR), IEEE, 2019, pp. 30–35. [154] X. Dong, Y. Yang, Searching for a robust neural architecture
[139] H. Cai, J. Yang, W. Zhang, S. Han, Y. Yu, Path-level network in four GPU hours, in: IEEE Conference on Computer Vision
transformation for efficient architecture search, in: J. G. Dy, and Pattern Recognition, CVPR 2019, Long Beach, CA, USA,
A. Krause (Eds.), Proceedings of the 35th International Con- June 16-20, 2019, Computer Vision Foundation / IEEE, 2019,
ference on Machine Learning, ICML 2018, Stockholmsmässan, pp. 1761–1770. doi:10.1109/CVPR.2019.00186.
Stockholm, Sweden, July 10-15, 2018, Vol. 80 of Proceedings URL http://openaccess.thecvf.com/content_CVPR_2019/
of Machine Learning Research, PMLR, 2018, pp. 677–686. html/Dong_Searching_for_a_Robust_Neural_Architecture_
URL http://proceedings.mlr.press/v80/cai18a.html in_Four_GPU_Hours_CVPR_2019_paper.html
[140] J. Fang, Y. Sun, K. Peng, Q. Zhang, Y. Li, W. Liu, X. Wang, [155] S. Xie, H. Zheng, C. Liu, L. Lin, SNAS: stochastic neural ar-
Fast neural network adaptation via parameter remapping and chitecture search, in: 7th International Conference on Learning
architecture search, in: 8th International Conference on Learn- Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
ing Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2019, OpenReview.net, 2019.
26-30, 2020, OpenReview.net, 2020. URL https://openreview.net/forum?id=rylqooRqK7
URL https://openreview.net/forum?id=rklTmyBKPH [156] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, K. Keutzer,
[141] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, Mixed precision quantization of convnets via differentiable
E. Choi, Morphnet: Fast & simple resource-constrained neural architecture search (2018). arXiv:1812.00090.
structure learning of deep networks, in: 2018 IEEE Conference [157] E. Jang, S. Gu, B. Poole, Categorical reparameterization with
on Computer Vision and Pattern Recognition, CVPR 2018, gumbel-softmax, in: 5th International Conference on Learning
Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Society, 2018, pp. 1586–1595. doi:10.1109/CVPR.2018.00171. Conference Track Proceedings, OpenReview.net, 2017.
URL http://openaccess.thecvf.com/content_cvpr_2018/ URL https://openreview.net/forum?id=rkE3y85ee
html/Gordon_MorphNet_Fast__CVPR_2018_paper.html [158] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution:
[142] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for A continuous relaxation of discrete random variables, in: 5th
convolutional neural networks, in: K. Chaudhuri, R. Salakhut- International Conference on Learning Representations, ICLR
dinov (Eds.), Proceedings of the 36th International Conference 2017, Toulon, France, April 24-26, 2017, Conference Track
on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, Proceedings, OpenReview.net, 2017.
California, USA, Vol. 97 of Proceedings of Machine Learning URL https://openreview.net/forum?id=S1jE5L5gl
Research, PMLR, 2019, pp. 6105–6114. [159] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, Z. Li,
URL http://proceedings.mlr.press/v97/tan19a.html Darts+: Improved differentiable architecture search with early
[143] J. F. Miller, S. L. Harding, Cartesian genetic programming, stopping, arXiv preprint arXiv:1909.06035.
in: Proceedings of the 10th annual conference companion on [160] K. Kandasamy, W. Neiswanger, J. Schneider, B. Póczos, E. P.
Genetic and evolutionary computation, ACM, 2008, pp. 2701– Xing, Neural architecture search with bayesian optimisation
2726. and optimal transport, in: S. Bengio, H. M. Wallach,
[144] J. F. Miller, S. L. Smith, Redundancy and computational H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett
efficiency in cartesian genetic programming, IEEE Transactions (Eds.), Advances in Neural Information Processing Systems 31:
on Evolutionary Computation 10 (2) (2006) 167–174. Annual Conference on Neural Information Processing Systems
[145] F. Gruau, Cellular encoding as a graph grammar, in: IEEE 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada,
Colloquium on Grammatical Inference: Theory, Applications 2018, pp. 2020–2029.
& Alternatives, 1993. URL https://proceedings.neurips.cc/paper/2018/hash/
[146] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, f33ba15effa5c10e873bf3842afb46a6-Abstract.html
M. Jaderberg, M. Lanctot, D. Wierstra, Convolution by evolu- [161] R. Negrinho, G. Gordon, Deeparchitect: Automatically design-
tion: Differentiable pattern producing networks, in: Proceed- ing and training deep architectures (2017). arXiv:1704.08792.
ings of the Genetic and Evolutionary Computation Conference [162] R. Negrinho, M. R. Gormley, G. J. Gordon, D. Patil, N. Le,
2016, ACM, 2016, pp. 109–116. D. Ferreira, Towards modular and programmable architecture
[147] M. Kim, L. Rigazio, Deep clustered convolutional kernels, in: search, in: H. M. Wallach, H. Larochelle, A. Beygelzimer,
Feature Extraction: Modern Questions and Challenges, 2015, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances in
pp. 160–172. Neural Information Processing Systems 32: Annual Conference
[148] J. K. Pugh, K. O. Stanley, Evolving multimodal controllers on Neural Information Processing Systems 2019, NeurIPS
with hyperneat, in: Proceedings of the 15th annual conference 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
on Genetic and evolutionary computation, ACM, 2013, pp. 13715–13725.
735–742. URL https://proceedings.neurips.cc/paper/2019/hash/
[149] H. Zhu, Z. An, C. Yang, K. Xu, E. Zhao, Y. Xu, Eena: Efficient 4ab50afd6dcc95fcba76d0fe04295632-Abstract.html
evolution of neural architecture (2019). arXiv:1905.07320. [163] G. Dikov, J. Bayer, Bayesian learning of neural network ar-
[150] R. J. Williams, Simple statistical gradient-following algorithms chitectures, in: K. Chaudhuri, M. Sugiyama (Eds.), The 22nd
for connectionist reinforcement learning, Machine learning 8 (3- International Conference on Artificial Intelligence and Statis-
4) (1992) 229–256. tics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan,
[151] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Vol. 89 of Proceedings of Machine Learning Research, PMLR,
Proximal policy optimization algorithms, arXiv preprint 2019, pp. 730–738.
arXiv:1707.06347. URL http://proceedings.mlr.press/v89/dikov19a.html
[152] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, [164] C. White, W. Neiswanger, Y. Savani, Bananas: Bayesian op-
A. Bies, M. Ferguson, K. Katz, B. Schasberger, The Penn timization with neural architectures for neural architecture
Treebank: Annotating predicate argument structure, in: Hu- search (2019). arXiv:1910.11858.
man Language Technology: Proceedings of a Workshop held at [165] M. Wistuba, Bayesian optimization combined with incremen-
Plainsboro, New Jersey, March 8-11, 1994, 1994. tal evaluation for neural network architecture optimization,
[166] J. Perez-Rua, M. Baccouche, S. Pateux, Efficient progressive neural architecture search, in: British Machine Vision Conference (BMVC 2018), BMVA Press, 2018, p. 150. URL http://bmvc2018.org/contents/papers/0291.pdf
[167] C. E. Rasmussen, Gaussian processes in machine learning, Lecture Notes in Computer Science (2003) 63–71.
[168] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011, pp. 2546–2554. URL https://proceedings.neurips.cc/paper/2011/hash/86e8f7ab32cfd12577bc2619bc635690-Abstract.html
[169] R. Luo, F. Tian, T. Qin, E. Chen, T. Liu, Neural architecture optimization, in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018, pp. 7827–7838. URL https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html
[170] I. Dewancker, M. McCourt, S. Clark, Bayesian optimization primer. URL https://app.sigopt.com/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf
[171] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of the loop: A review of bayesian optimization, Proceedings of the IEEE 104 (1) (2016) 148–175.
[172] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, R. P. Adams, Scalable bayesian optimization using deep neural networks, in: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Vol. 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 2171–2180. URL http://proceedings.mlr.press/v37/snoek15.html
[173] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization of machine learning algorithms, in: Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012, pp. 2960–2968. URL https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html
[174] J. Stork, M. Zaefferer, T. Bartz-Beielstein, Improving neuroevolution efficiency by surrogate model-based optimization with phenotypic distance kernels, arXiv preprint arXiv:1902.03419, 2019.
[175] K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, M. A. Osborne, Raiders of the lost architecture: Kernels for bayesian optimization in conditional parameter spaces, arXiv preprint arXiv:1409.4011, 2014.
[176] A. Camero, H. Wang, E. Alba, T. Bäck, Bayesian neural architecture search using a training-free performance metric, arXiv preprint arXiv:2001.10726, 2020.
[177] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-weka: combined selection and hyperparameter optimization of classification algorithms, in: The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2013), ACM, 2013, pp. 847–855. doi:10.1145/2487575.2487629.
[178] A. Hundt, V. Jain, G. D. Hager, sharpdarts: Faster and more accurate differentiable architecture search, Tech. rep. (2019).
[179] Y. Geifman, R. El-Yaniv, Deep active learning with a neural architecture search, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 5974–5984. URL https://proceedings.neurips.cc/paper/2019/hash/b59307fdacf7b2db12ec4bd5ca1caba8-Abstract.html
[180] L. Li, A. Talwalkar, Random search and reproducibility for neural architecture search, in: Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2019), Vol. 115 of Proceedings of Machine Learning Research, AUAI Press, 2019, pp. 367–377. URL http://proceedings.mlr.press/v115/li20c.html
[181] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, Journal of machine learning research 13 (Feb) (2012) 281–305.
[182] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al., A practical guide to support vector classification.
[183] J. Y. Hesterman, L. Caucci, M. A. Kupinski, H. H. Barrett, L. R. Furenlid, Maximum-likelihood estimation with a contracting-grid search algorithm, IEEE transactions on nuclear science 57 (3) (2010) 1077–1084.
[184] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, The Journal of Machine Learning Research 18 (1) (2017) 6765–6816.
[185] M. Feurer, F. Hutter, Hyperparameter Optimization, Springer International Publishing, Cham, 2019, pp. 3–33. URL https://doi.org/10.1007/978-3-030-05318-5_1
[186] T. Yu, H. Zhu, Hyper-parameter optimization: A review of algorithms and applications, arXiv preprint arXiv:2003.05689.
[187] Y. Bengio, Gradient-based optimization of hyperparameters, Neural computation 12 (8) (2000) 1889–1900.
[188] J. Domke, Generic methods for optimization-based modeling, in: Artificial Intelligence and Statistics, 2012, pp. 318–326.
[189] D. Maclaurin, D. Duvenaud, R. P. Adams, Gradient-based hyperparameter optimization through reversible learning, in: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Vol. 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 2113–2122. URL http://proceedings.mlr.press/v37/maclaurin15.html
[190] F. Pedregosa, Hyperparameter optimization with approximate gradient, in: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), Vol. 48 of JMLR Workshop and Conference Proceedings, JMLR.org, 2016, pp. 737–746. URL http://proceedings.mlr.press/v48/pedregosa16.html
[191] L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and reverse gradient-based hyperparameter optimization, in: Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Vol. 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 1165–1173. URL http://proceedings.mlr.press/v70/franceschi17a.html
[192] K. Chandra, E. Meijer, S. Andow, E. Arroyo-Fang, I. Dea, J. George, M. Grueter, B. Hosmer, S. Stumpos, A. Tempest, et al., Gradient descent: The ultimate optimizer, arXiv preprint arXiv:1909.13371.
[193] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations (ICLR 2015), 2015. URL http://arxiv.org/abs/1412.6980
[194] P. Chrabaszcz, I. Loshchilov, F. Hutter, A downsampled variant of imagenet as an alternative to the CIFAR datasets, CoRR abs/1707.08819. arXiv:1707.08819.
[195] Y. Hu, Y. Yu, W. Tu, Q. Yang, Y. Chen, W. Dai, Multi-fidelity automatic hyper-parameter tuning via transfer series expansion, in: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), AAAI Press, 2019, pp. 3846–3853. doi:10.1609/aaai.v33i01.33013846.
[196] C. Wong, N. Houlsby, Y. Lu, A. Gesmundo, Transfer learning with neural automl, in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018, pp. 8366–8375. URL https://proceedings.neurips.cc/paper/2018/hash/bdb3c278f45e6734c35733d24299d3f4-Abstract.html
[197] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, D. Marculescu, Single-path nas: Designing hardware-efficient convnets in less than 4 hours, arXiv preprint arXiv:1904.02877.
[198] K. Eggensperger, F. Hutter, H. H. Hoos, K. Leyton-Brown, Surrogate benchmarks for hyperparameter optimization, in: MetaSel @ ECAI, 2014, pp. 24–31.
[199] C. Wang, Q. Duan, W. Gong, A. Ye, Z. Di, C. Miao, An evaluation of adaptive surrogate modeling based optimization with two benchmark problems, Environmental Modelling & Software 60 (2014) 167–179.
[200] K. Eggensperger, F. Hutter, H. H. Hoos, K. Leyton-Brown, Efficient benchmarking of hyperparameter optimizers via surrogates, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015), AAAI Press, 2015, pp. 1114–1120. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9993
[201] K. K. Vu, C. D'Ambrosio, Y. Hamadi, L. Liberti, Surrogate-based methods for black-box optimization, International Transactions in Operational Research 24 (3) (2017) 393–424.
[202] R. Luo, X. Tan, R. Wang, T. Qin, E. Chen, T.-Y. Liu, Semi-supervised neural architecture search, arXiv preprint arXiv:2002.10389, 2020.
[203] A. Klein, S. Falkner, J. T. Springenberg, F. Hutter, Learning curve prediction with bayesian neural networks, in: 5th International Conference on Learning Representations (ICLR 2017), OpenReview.net, 2017. URL https://openreview.net/forum?id=S11KBYclx
[204] B. Deng, J. Yan, D. Lin, Peephole: Predicting network performance before training, arXiv preprint arXiv:1712.03351.
[205] T. Domhan, J. T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), AAAI Press, 2015, pp. 3460–3468. URL http://ijcai.org/Abstract/15/487
[206] M. Mahsereci, L. Balles, C. Lassner, P. Hennig, Early stopping without a validation set, arXiv preprint arXiv:1703.09580.
[207] D. Han, J. Kim, J. Kim, Deep pyramidal residual networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), IEEE Computer Society, 2017, pp. 6307–6315. doi:10.1109/CVPR.2017.668.
[208] J. Cui, P. Chen, R. Li, S. Liu, X. Shen, J. Jia, Fast and practical neural architecture search, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 6508–6517. doi:10.1109/ICCV.2019.00661.
[209] X. Dong, Y. Yang, One-shot neural architecture search via self-evaluated template network, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 3680–3689. doi:10.1109/ICCV.2019.00378.
[210] H. Zhou, M. Yang, J. Wang, W. Pan, Bayesnas: A bayesian approach for neural architecture search, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 7603–7613. URL http://proceedings.mlr.press/v97/zhou19e.html
[211] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, H. Xiong, PC-DARTS: partial channel connections for memory-efficient architecture search, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=BJlS634tPr
[212] G. Li, G. Qian, I. C. Delgadillo, M. Müller, A. K. Thabet, B. Ghanem, SGAS: sequential greedy architecture search, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 1617–1627. doi:10.1109/CVPR42600.2020.00169.
[213] M. Zhang, H. Li, S. Pan, X. Chang, S. W. Su, Overcoming multi-model forgetting in one-shot NAS with diversity maximization, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 7806–7815. doi:10.1109/CVPR42600.2020.00783.
[214] C. Zhang, M. Ren, R. Urtasun, Graph hypernetworks for neural architecture search, in: 7th International Conference on Learning Representations (ICLR 2019), OpenReview.net, 2019. URL https://openreview.net/forum?id=rkgW0oA9FX
[215] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), IEEE Computer Society, 2018, pp. 4510–4520. doi:10.1109/CVPR.2018.00474. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.html
[216] S. You, T. Huang, M. Yang, F. Wang, C. Qian, C. Zhang, Greedynas: Towards fast one-shot NAS with greedy supernet, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 1996–2005. doi:10.1109/CVPR42600.2020.00207.
[217] H. Cai, C. Gan, T. Wang, Z. Zhang, S. Han, Once-for-all: Train one network and specialize it for efficient deployment, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=HylxE1HKwS
[218] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. L. Yuille, J. Yang, Atomnas: Fine-grained end-to-end neural architecture search, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=BylQSxHFwr
[219] S. Hu, S. Xie, H. Zheng, C. Liu, J. Shi, X. Liu, D. Lin, DSNAS: direct neural architecture search without parameter retraining, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 12081–12089. doi:10.1109/CVPR42600.2020.01210.
[220] J. Fang, Y. Sun, Q. Zhang, Y. Li, W. Liu, X. Wang, Densely connected search space for more flexible neural architecture search, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 10625–10634. doi:10.1109/CVPR42600.2020.01064.
[221] A. Wan, X. Dai, P. Zhang, Z. He, Y. Tian, S. Xie, B. Wu, M. Yu, T. Xu, K. Chen, P. Vajda, J. E. Gonzalez, Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 12962–12971. doi:10.1109/CVPR42600.2020.01298.
[222] R. Istrate, F. Scheidegger, G. Mariani, D. S. Nikolopoulos, C. Bekas, A. C. I. Malossi, TAPAS: train-less accuracy predictor for architecture search, in: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), AAAI Press, 2019, pp. 3927–3934. doi:10.1609/aaai.v33i01.33013927.
[223] M. G. Kendall, A new measure of rank correlation, Biometrika 30 (1/2) (1938) 81–93. URL http://www.jstor.org/stable/2332226
[224] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, F. Hutter, Nas-bench-101: Towards reproducible neural architecture search, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 7105–7114. URL http://proceedings.mlr.press/v97/ying19a.html
[225] X. Dong, Y. Yang, Nas-bench-201: Extending the scope of reproducible neural architecture search, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=HJxyZkBKDr
[226] N. Klyuchnikov, I. Trofimov, E. Artemova, M. Salnikov, M. Fedorov, E. Burnaev, Nas-bench-nlp: Neural architecture search benchmark for natural language processing, arXiv preprint arXiv:2006.07116, 2020.
[227] X. Zhang, Z. Huang, N. Wang, You only search once: Single shot neural architecture search via direct sparse optimization, arXiv preprint arXiv:1811.01567.
[228] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, Q. Le, Bignas: Scaling up neural architecture search with big single-stage models, arXiv preprint arXiv:2003.11142.
[229] X. Chu, B. Zhang, R. Xu, J. Li, Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search, arXiv preprint arXiv:1907.01845.
[230] Y. Benyahia, K. Yu, K. Bennani-Smires, M. Jaggi, A. C. Davison, M. Salzmann, C. Musat, Overcoming multi-model forgetting, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 594–603. URL http://proceedings.mlr.press/v97/benyahia19a.html
[231] M. Zhang, H. Li, S. Pan, X. Chang, S. W. Su, Overcoming multi-model forgetting in one-shot NAS with diversity maximization, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 7806–7815. doi:10.1109/CVPR42600.2020.00783.
[232] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, Q. V. Le, Understanding and simplifying one-shot architecture search, in: Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 549–558. URL http://proceedings.mlr.press/v80/bender18a.html
[233] X. Dong, M. Tan, A. W. Yu, D. Peng, B. Gabrys, Q. V. Le, Autohas: Differentiable hyper-parameter and architecture search, arXiv preprint arXiv:2006.03656, 2020.
[234] A. Klein, F. Hutter, Tabular benchmarks for joint architecture and hyperparameter optimization, arXiv preprint arXiv:1905.04970.
[235] X. Dai, A. Wan, P. Zhang, B. Wu, Z. He, Z. Wei, K. Chen, Y. Tian, M. Yu, P. Vajda, et al., Fbnetv3: Joint architecture-recipe search using neural acquisition function, arXiv preprint arXiv:2006.02049.
[236] C.-H. Hsu, S.-H. Chang, J.-H. Liang, H.-P. Chou, C.-H. Liu, S.-C. Chang, J.-Y. Pan, Y.-T. Chen, W. Wei, D.-C. Juan, Monas: Multi-objective neural architecture search using reinforcement learning, arXiv preprint arXiv:1806.10332.
[237] X. He, S. Wang, S. Shi, X. Chu, J. Tang, X. Liu, C. Yan, J. Zhang, G. Ding, Benchmarking deep learning models and automated model design for covid-19 detection with chest ct scans, medRxiv.
[238] L. Faes, S. K. Wagner, D. J. Fu, X. Liu, E. Korot, J. R. Ledsam, T. Back, R. Chopra, N. Pontikos, C. Kern, et al., Automated deep learning design for medical image classification by healthcare professionals with no coding experience: a feasibility study, The Lancet Digital Health 1 (5) (2019) e232–e242.
[239] X. He, S. Wang, X. Chu, S. Shi, J. Tang, X. Liu, C. Yan, J. Zhang, G. Ding, Automated model design and benchmarking of 3d deep learning models for covid-19 detection with chest ct scans, arXiv preprint arXiv:2101.05442, 2021.
[240] G. Ghiasi, T. Lin, Q. V. Le, NAS-FPN: learning scalable feature pyramid architecture for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Computer Vision Foundation / IEEE, 2019, pp. 7036–7045. doi:10.1109/CVPR.2019.00720. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Ghiasi_NAS-FPN_Learning_Scalable_Feature_Pyramid_Architecture_for_Object_Detection_CVPR_2019_paper.html
[241] H. Xu, L. Yao, Z. Li, X. Liang, W. Zhang, Auto-fpn: Automatic network architecture adaptation for object detection beyond classification, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 6648–6657. doi:10.1109/ICCV.2019.00675.
[242] M. Tan, R. Pang, Q. V. Le, Efficientdet: Scalable and efficient object detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 10778–10787. doi:10.1109/CVPR42600.2020.01079.
[243] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, J. Sun, Detnas: Neural architecture search on object detection, arXiv preprint arXiv:1903.10979 1 (2) (2019) 4–1.
[244] J. Guo, K. Han, Y. Wang, C. Zhang, Z. Yang, H. Wu, X. Chen, C. Xu, Hit-detector: Hierarchical trinity architecture search for object detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 11402–11411. doi:10.1109/CVPR42600.2020.01142.
[245] C. Jiang, H. Xu, W. Zhang, X. Liang, Z. Li, SP-NAS: serial-to-parallel backbone search for object detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 11860–11869. doi:10.1109/CVPR42600.2020.01188.
[246] Y. Weng, T. Zhou, Y. Li, X. Qiu, Nas-unet: Neural architecture search for medical image segmentation, IEEE Access 7 (2019) 44247–44257.
[247] V. Nekrasov, H. Chen, C. Shen, I. D. Reid, Fast neural architecture search of compact semantic segmentation models via auxiliary cells, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Computer Vision Foundation / IEEE, 2019, pp. 9126–9135. doi:10.1109/CVPR.2019.00934. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Nekrasov_Fast_Neural_Architecture_Search_of_Compact_Semantic_Segmentation_Models_via_CVPR_2019_paper.html
[248] W. Bae, S. Lee, Y. Lee, B. Park, M. Chung, K.-H. Jung, Resource optimized neural architecture search for 3d medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 228–236.
[249] D. Yang, H. Roth, Z. Xu, F. Milletari, L. Zhang, D. Xu, Searching learning strategy with reinforcement learning for 3d medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 3–11.
[250] N. Dong, M. Xu, X. Liang, Y. Jiang, W. Dai, E. Xing, Neural architecture search for adversarial medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 828–836.
[251] S. Kim, I. Kim, S. Lim, W. Baek, C. Kim, H. Cho, B. Yoon, T. Kim, Scalable neural architecture search for 3d medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 220–228.
[252] R. Quan, X. Dong, Y. Wu, L. Zhu, Y. Yang, Auto-reid: Searching for a part-aware convnet for person re-identification, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 3749–3758. doi:10.1109/ICCV.2019.00385.
[253] D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, Y. Wang, Efficient residual dense block search for image super-resolution, in: AAAI, 2020, pp. 12007–12014.
[254] X. Chu, B. Zhang, H. Ma, R. Xu, J. Li, Q. Li, Fast, accurate and lightweight super-resolution with neural architecture search, arXiv preprint arXiv:1901.07261.
[255] Y. Guo, Y. Luo, Z. He, J. Huang, J. Chen, Hierarchical neural architecture search for single image super-resolution, arXiv preprint arXiv:2003.04619.
[256] H. Zhang, Y. Li, H. Chen, C. Shen, Ir-nas: Neural architecture search for image restoration, arXiv preprint arXiv:1909.08228.
[257] X. Gong, S. Chang, Y. Jiang, Z. Wang, Autogan: Neural architecture search for generative adversarial networks, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 3223–3233. doi:10.1109/ICCV.2019.00332.
[258] Y. Fu, W. Chen, H. Wang, H. Li, Y. Lin, Z. Wang, Autogan-distiller: Searching to compress generative adversarial networks, in: Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3292–3303. URL http://proceedings.mlr.press/v119/fu20b.html
[259] M. Li, J. Lin, Y. Ding, Z. Liu, J. Zhu, S. Han, GAN compression: Efficient architectures for interactive conditional gans, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 5283–5293. doi:10.1109/CVPR42600.2020.00533.
[260] C. Gao, Y. Chen, S. Liu, Z. Tan, S. Yan, Adversarialnas: Adversarial neural architecture search for gans, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 5679–5688. doi:10.1109/CVPR42600.2020.00572.
[261] T. Saikia, Y. Marrakchi, A. Zela, F. Hutter, T. Brox, Autodispnet: Improving disparity estimation with automl, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 1812–1823. doi:10.1109/ICCV.2019.00190.
[262] W. Peng, X. Hong, G. Zhao, Video action recognition via neural architecture searching, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 11–15.
[263] M. S. Ryoo, A. J. Piergiovanni, M. Tan, A. Angelova, Assemblenet: Searching for multi-stream neural connectivity in video architectures, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgMK64Ywr
[264] V. Nekrasov, H. Chen, C. Shen, I. Reid, Architecture search of dynamic cells for semantic video segmentation, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1970–1979.
[265] A. J. Piergiovanni, A. Angelova, A. Toshev, M. S. Ryoo, Evolving space-time neural architectures for videos, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 1793–1802. doi:10.1109/ICCV.2019.00188.
[266] Y. Fan, F. Tian, Y. Xia, T. Qin, X.-Y. Li, T.-Y. Liu, Searching better architectures for neural machine translation, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[267] Y. Jiang, C. Hu, T. Xiao, C. Zhang, J. Zhu, Improved differentiable architecture search for language modeling and named entity recognition, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 3585–3590. doi:10.18653/v1/D19-1367. URL https://www.aclweb.org/anthology/D19-1367
[268] J. Chen, K. Chen, X. Chen, X. Qiu, X. Huang, Exploring shared structures and hierarchies for multiple nlp tasks, arXiv preprint arXiv:1808.07658.
[269] H. Mazzawi, X. Gonzalvo, A. Kracun, P. Sridhar, N. Subrahmanya, I. Lopez-Moreno, H.-J. Park, P. Violette, Improving keyword spotting and language identification via neural architecture search at scale, in: INTERSPEECH, 2019, pp. 1278–1282.
[270] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, S. Han, Amc: Automl for model compression and acceleration on mobile devices, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
[271] X. Xiao, Z. Wang, S. Rajasekaran, Autoprune: Automatic network pruning by regularizing auxiliary parameters, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 13681–13691. URL https://proceedings.neurips.cc/paper/2019/hash/4efc9e02abdab6b6166251918570a307-Abstract.html
[272] R. Zhao, W. Luk, Efficient structured pruning and architecture searching for group convolution, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[273] T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, H. Wang, Y. Lin, S. Han, APQ: joint search for network architecture, pruning and quantization policy, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 2075–2084. doi:10.1109/CVPR42600.2020.00215.
[274] X. Dong, Y. Yang, Network pruning via transformable architecture search, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 759–770. URL https://proceedings.neurips.cc/paper/2019/hash/a01a0380ca3c61428c26a231f0e49a09-Abstract.html
[275] Q. Huang, K. Zhou, S. You, U. Neumann, Learning to prune filters in convolutional neural networks, arXiv preprint arXiv:1801.07365, 2018.
[276] Y. He, P. Liu, L. Zhu, Y. Yang, Meta filter pruning to accelerate deep convolutional neural networks, arXiv preprint arXiv:1904.03961, 2019.
[277] T.-W. Chin, C. Zhang, D. Marculescu, Layer-compensated pruning for resource-constrained convolutional neural networks, arXiv preprint arXiv:1810.00518, 2018.
[278] K. Zhou, Q. Song, X. Huang, X. Hu, Auto-gnn: Neural architecture search of graph neural networks, arXiv preprint arXiv:1909.03184.
[279] C. He, M. Annavaram, S. Avestimehr, Fednas: Federated deep learning via neural architecture search, arXiv preprint arXiv:2004.08546, 2020.
[280] H. Zhu, Y. Jin, Real-time federated evolutionary neural architecture search, arXiv preprint arXiv:2003.02793.
[281] C. Li, X. Yuan, C. Lin, M. Guo, W. Wu, J. Yan, W. Ouyang, AM-LFS: automl for loss function search, in: IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 8409–8418. doi:10.1109/ICCV.2019.00850.
[282] B. Ru, C. Lyle, L. Schut, M. van der Wilk, Y. Gal, Revisiting the train loss: an efficient performance estimator for neural architecture search, arXiv preprint arXiv:2006.04492.
[283] P. Ramachandran, B. Zoph, Q. V. Le, Searching for activation functions, arXiv preprint arXiv:1710.05941, 2017.
[284] H. Wang, H. Wang, K. Xu, Evolutionary recurrent neural network for image captioning, Neurocomputing.
[285] L. Wang, Y. Zhao, Y. Jinnai, Y. Tian, R. Fonseca, Neural architecture search using deep neural networks and monte carlo tree search, arXiv preprint arXiv:1805.07440.
[286] P. Zhao, K. Xiao, Y. Zhang, K. Bian, W. Yan, Amer: Automatic behavior modeling and interaction exploration in recommender system, arXiv preprint arXiv:2006.05933.
[287] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, J. Tang, Autoemb: Automated embedding dimensionality search in streaming recommendations, arXiv preprint arXiv:2002.11252.
[288] W. Cheng, Y. Shen, L. Huang, Differentiable neural input search for recommender systems, arXiv preprint arXiv:2006.04466.
[289] E. Real, C. Liang, D. R. So, Q. V. Le, Automl-zero: Evolving machine learning algorithms from scratch, in: Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 8007–8019. URL http://proceedings.mlr.press/v119/real20a.html
[290] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 8.
[291] D. Wang, C. Gong, Q. Liu, Improving neural language modeling via adversarial training, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 6555–6565. URL http://proceedings.mlr.press/v97/wang19f.html
[292] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, F. Hutter, Understanding and robustifying differentiable architecture search, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=H1gDNyrKDS
[293] S. Kotyan, D. V. Vargas, Is neural architecture search a way forward to develop robust neural networks?, Proceedings of the Annual Conference of JSAI, JSAI2020 (2020) 2K1ES203.
[294] M. Guo, Y. Yang, R. Xu, Z. Liu, D. Lin, When NAS meets robustness: In search of robust architectures against adversarial attacks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 628–637. doi:10.1109/CVPR42600.2020.00071.
[295] Y. Chen, Q. Song, X. Liu, P. S. Sastry, X. Hu, On robustness of neural architecture search under label noise, in: Frontiers in Big Data, 2020.
[296] D. V. Vargas, S. Kotyan, Evolving robust neural architectures to defend from adversarial attacks, arXiv preprint arXiv:1906.11667.
[297] J. Yim, D. Joo, J. Bae, J. Kim, A gift from knowledge distillation: Fast optimization, network minimization and transfer learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), IEEE Computer Society, 2017, pp. 7130–7138. doi:10.1109/CVPR.2017.754.
[298] G. Squillero, P. Burelli, Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30–April 1, 2016, Proceedings, Vol. 9597, Springer, 2016.
[299] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, F. Hutter, Efficient and robust automated machine learning, in: Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015, pp. 2962–2970. URL https://proceedings.neurips.cc/paper/2015/hash/11d0e6287202fced83f79975ec59a3a6-Abstract.html
[300] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[301] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 8024–8035. URL https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
[302] F. Chollet, et al., Keras, https://github.com/fchollet/keras
(2015).
[303] NNI (Neural Network Intelligence), 2020. URL https://github.com/microsoft/nni
[304] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, Tensorflow: A system for large-scale machine learning, arXiv preprint arXiv:1605.08695, 2016.
[305] Vega, 2020. URL https://github.com/huawei-noah/vega
[306] R. Pasunuru, M. Bansal, Continual and multi-task architecture search, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 1911–1922. doi:10.18653/v1/P19-1185. URL https://www.aclweb.org/anthology/P19-1185
[307] J. Kim, S. Lee, S. Kim, M. Cha, J. K. Lee, Y. Choi, Y. Choi, D.-Y. Cho, J. Kim, Auto-meta: Automated gradient based meta learner search, arXiv preprint arXiv:1806.06927.
[308] D. Lian, Y. Zheng, Y. Xu, Y. Lu, L. Lin, P. Zhao, J. Huang, S. Gao, Towards fast adaptation of neural architectures with meta learning, in: 8th International Conference on Learning Representations (ICLR 2020), OpenReview.net, 2020. URL https://openreview.net/forum?id=r1eowANFvr
[309] T. Elsken, B. Staffler, J. H. Metzen, F. Hutter, Meta-learning of neural architectures for few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 12362–12372. doi:10.1109/CVPR42600.2020.01238.
[310] C. Liu, P. Dollár, K. He, R. Girshick, A. Yuille, S. Xie, Are labels necessary for neural architecture search?, arXiv preprint arXiv:2003.12056, 2020.
[311] Z. Li, D. Hoiem, Learning without forgetting, IEEE transactions on pattern analysis and machine intelligence 40 (12) (2018) 2935–2947.
[312] S. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, icarl: Incremental classifier and representation learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), IEEE Computer Society, 2017, pp. 5533–5542. doi:10.1109/CVPR.2017.587.