AutoML - A Survey of The State-Of-The-Art
Abstract
Deep learning (DL) techniques have obtained remarkable achievements on various tasks, such as image
recognition, object detection, and language modeling. However, building a high-quality DL system for a specific task
highly relies on human expertise, hindering its wide application. Meanwhile, automated machine learning (AutoML)
is a promising solution for building a DL system without human assistance and is being extensively studied. This
paper presents a comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. According to the
DL pipeline, we introduce AutoML methods –– covering data preparation, feature engineering, hyperparameter optimization,
and neural architecture search (NAS) –– with a particular focus on NAS, as it is currently a hot sub-topic of AutoML.
We summarize the representative NAS algorithms’ performance on the CIFAR-10 and ImageNet datasets and further discuss
the following subjects of NAS methods: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture
optimization, and resource-aware NAS. Finally, we discuss some open problems related to the existing AutoML methods
for future research.
Keywords: deep learning, automated machine learning (AutoML), neural architecture search (NAS), hyperparameter
optimization (HPO)
Figure 1: An overview of AutoML pipeline covering data preparation (Section 2), feature engineering (Section 3), model generation (Section 4)
and model evaluation (Section 5).
Another novel technique for deriving synthetic data is Generative Adversarial Networks (GANs) [71], which can be used to generate image [71, 72, 73, 74], tabular [75, 76], and text [77] data. Karras et al. [78] applied the GAN technique to generate realistic human face images. Oh and Jaroensri et al. [72] built a synthetic dataset that captures small motion for video-motion magnification. Bowles et al. [74] demonstrated the feasibility of using GANs to generate medical images for brain segmentation tasks. In the case of textual data, applying GANs to text has proved difficult, because the commonly used approach is to update the generator with reinforcement learning: the text is discrete, so the gradient cannot propagate from the discriminator to the generator. To solve this problem, Donahue et al. [77] used an autoencoder to encode sentences into a smooth sentence representation, removing the barrier to reinforcement learning. Park et al. [75] applied GANs to synthesize fake tables that are statistically similar to the original table but do not cause information leakage. Similarly, in [76], GANs are applied to generate tabular data such as medical or educational records.
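As a compact illustration of the GAN idea for data synthesis, the sketch below trains a generic generator/discriminator pair on a toy two-dimensional "tabular" distribution. It is a minimal generic GAN sketch under these stated assumptions, not the specific method of any of the cited works.

import torch
import torch.nn as nn

# Toy "tabular" target distribution standing in for real records.
real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.3]) + torch.tensor([2.0, -1.0])

G = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(500):
    real = real_data[torch.randint(0, 1000, (64,))]
    fake = G(torch.randn(64, 4))                      # noise -> synthetic rows
    # Discriminator update: label real rows 1 and generated rows 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic_rows = G(torch.randn(500, 4)).detach()      # augmented samples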
3. Feature Engineering
Figure 4: The iterative process of feature selection. A subset of features is selected from the original feature set based on a search strategy and then evaluated. A validation procedure is then implemented to determine whether the subset is valid. The above steps are repeated until the stop criterion is satisfied.
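A minimal sketch of the loop described in Figure 4, using a greedy forward search strategy: the synthetic data, the logistic-regression evaluator, and the stopping criterion are illustrative assumptions rather than a prescribed method.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                     # original feature set (8 features)
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)     # only features 0 and 3 are informative

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:                                  # generation: grow the subset greedily
    scores = {}
    for f in remaining:
        subset = selected + [f]                   # candidate subset
        scores[f] = cross_val_score(LogisticRegression(), X[:, subset], y, cv=3).mean()
    f_best = max(scores, key=scores.get)          # evaluation of candidate subsets
    if scores[f_best] <= best_score + 1e-3:       # stop criterion: no real improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print(selected, best_score)                       # validated feature subset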
Regularization, decision tree, and deep learning are all embedded methods.

3.2. Feature Construction
Feature construction is a process that constructs new features from the basic features or raw data.
4. Model Generation
It can also be referred to as the “search strategy [10, 123]”,
“search policy [11]”, or “optimization method [45, 9]”.
• Architecture Optimization Method. The architecture optimization (AO) method defines how to guide the search to efficiently find the model architecture with high performance after the search space is defined.
• Model Evaluation Method. Once a model is generated, its performance needs to be evaluated. The simplest approach is to train the model to converge on the training set and then estimate its performance on the validation set; however, this method is time-consuming and resource-intensive. Some advanced methods can accelerate the evaluation process but lose fidelity in the process. Thus, how to balance the efficiency and effectiveness of the evaluation is a problem worth studying.

The search space and AO methods are presented in this section, while the methods of model evaluation are presented in the next section.

where Nk indicates the indegree of node Zk, Ii and oi represent the i-th input tensor and its associated operation, respectively, and O is the set of candidate operations, such as convolution, pooling, activation functions, skip connection, concatenation, and addition. To further enhance the model performance, many NAS methods use certain advanced human-designed modules as primitive operations, such as depth-wise separable convolution [124], dilated convolution [125], and squeeze-and-excitation (SE) blocks [126]. The selection and combination of these operations vary with the design of the search space. In other words, the search space defines the structural paradigm that AO methods can explore; thus, designing a good search space is a vital but challenging problem. In general, a good search space is expected to exclude human bias and be flexible enough to cover a wide variety of model architectures. Based on the existing NAS studies, we detail the commonly used search spaces as follows.
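Before turning to the specific search spaces, the toy sketch below illustrates the node computation described above: each incoming tensor Ii is transformed by an operation oi chosen from a candidate set O, and the results are combined (here by addition, one of the combination operations mentioned above). The operation set and dimensions are illustrative assumptions, not code from any cited paper.

import torch
import torch.nn as nn

def candidate_ops(C):
    # A small candidate-operation set O (channel count C is kept fixed here).
    return nn.ModuleDict({
        "conv3x3":  nn.Conv2d(C, C, 3, padding=1),
        "conv1x1":  nn.Conv2d(C, C, 1),
        "maxpool":  nn.MaxPool2d(3, stride=1, padding=1),
        "identity": nn.Identity(),                 # skip connection
    })

class Node(nn.Module):
    """Computes Z_k by applying one chosen operation o_i to each input I_i and adding the results."""
    def __init__(self, chosen_ops, C):
        super().__init__()
        ops = candidate_ops(C)
        self.edge_ops = nn.ModuleList([ops[name] for name in chosen_ops])

    def forward(self, inputs):                     # inputs: list of tensors I_1 ... I_Nk
        return sum(op(x) for op, x in zip(self.edge_ops, inputs))

node = Node(["conv3x3", "identity"], C=16)
z_k = node([torch.randn(1, 16, 8, 8), torch.randn(1, 16, 8, 8)])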
The cell-based search space enables the transferability of the generated model, and most of the cell-based methods [13, 15, 23, 16, 25, 26] follow a two-level hierarchy: the inner is the cell level, which selects the operation and connection for each node in the cell, and the outer is the network level, which controls the spatial-resolution changes. However, these approaches focus on the cell level and ignore the network level. As shown in Figure 7, whenever a fixed number of normal cells are stacked, the spatial dimension of the feature maps is halved by adding a reduction cell. To jointly learn a suitable combination of repeatable cell and network structures, Liu et al. [129] defined a general formulation for a network-level structure, depicted in Figure 9, from which many existing good network designs can be reproduced. In this way, we can fully explore the different numbers of channels and sizes of feature maps of each layer in the network.

Although the search phase finds the best cell structure for the shallow model, this does not mean that it is still suitable for the deeper model in the evaluation phase. In other words, simply adding more cells may deteriorate the model performance. To bridge this gap, Chen et al. [128] proposed an improved method based on DARTS, namely progressive DARTS (P-DARTS), which divides the search phase into multiple stages and gradually increases the depth of the searched networks at the end of each stage, hence bridging the gap between search and evaluation. However, increasing the number of cells in the search phase may result in heavier computational overhead. Thus, to reduce the computational consumption, P-DARTS gradually reduces the number of candidate operations from 5 to 3, and then 2, through search space approximation methods, as shown in Figure 8. Experimentally, P-DARTS obtains a 2.50% error rate on the CIFAR-10 test dataset, outperforming the 2.83% error rate achieved by DARTS.

Figure 8: Difference between DARTS [17] and P-DARTS [128]. Both methods search and evaluate networks on the CIFAR-10 dataset. As the number of cell structures increases from 5 to 11 and then 17, the number of candidate operations is gradually reduced accordingly.
Figure 10: Example of a three-level hierarchical architecture representation. The level-one primitive operations are assembled into level-two cells. The level-two cells are viewed as primitive operations and assembled into a level-three cell.
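A minimal sketch of how a cell-based network is typically assembled, matching the description around Figure 7: a fixed number of normal cells is stacked, and a reduction cell halves the spatial resolution between stages. NormalCell and ReductionCell here are hypothetical placeholders, not a searched cell from any specific paper.

import torch.nn as nn

class NormalCell(nn.Module):        # placeholder cell: keeps channels and resolution
    def __init__(self, C):
        super().__init__()
        self.op = nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU())
    def forward(self, x):
        return self.op(x) + x       # residual-style output

class ReductionCell(nn.Module):     # placeholder cell: halves resolution, doubles channels
    def __init__(self, C):
        super().__init__()
        self.op = nn.Conv2d(C, 2 * C, 3, stride=2, padding=1)
    def forward(self, x):
        return self.op(x)

def build_network(C=16, n_normal=3, n_stages=3):
    layers, channels = [], C
    for stage in range(n_stages):
        layers += [NormalCell(channels) for _ in range(n_normal)]   # repeat the searched cell
        if stage < n_stages - 1:
            layers.append(ReductionCell(channels))                  # reduce spatial dimension
            channels *= 2
    return nn.Sequential(*layers)

net = build_network()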
To achieve both high performance and low latency, some studies [130, 131] proposed to search for complex and fragmented cell structures. For example, Tan et al. [130] proposed MnasNet, which uses a novel factorized hierarchical search space to generate different cell structures, namely MBConv, for different layers of the final network. Figure 11 presents the factorized hierarchical search space of MnasNet, which comprises a predefined number of cell structures. Each cell has a different structure and contains a variable number of blocks: all blocks in the same cell exhibit the same structure, while those in other cells exhibit different structures. As this design method can achieve a suitable balance between model performance and latency, many subsequent studies [131, 132] have referred to it.

Owing to the large computational consumption, most differentiable NAS (DNAS) techniques (e.g., DARTS) first search for a suitable cell structure on a proxy dataset (e.g., CIFAR-10), and then transfer it to a larger target dataset (e.g., ImageNet). Han et al. [132] proposed ProxylessNAS, which can directly search for neural networks on the target dataset and hardware platform by using BinaryConnect [133], which addresses the high memory consumption issue.

Net2Net introduces identity morphism (IdMorph) transformations between the neural network layers. An IdMorph transformation is function-preserving and can be classified into two types, depth and width IdMorph (shown in Figure 12), which makes it possible to replace the original model with an equivalent model that is deeper or wider. However, IdMorph is limited to width and depth changes and can only modify them separately; moreover, the sparsity of its identity layer can create problems [2]. Therefore, an improved method is proposed, namely network morphism [21], which allows a child network to inherit all knowledge from its well-trained parent network and continue to grow into a more robust network within a shortened training time. Compared with Net2Net, network morphism exhibits the following advantages: 1) it can embed nonidentity layers and handle arbitrary nonlinear activation functions, and 2) it can simultaneously perform depth, width, and kernel-size morphing in a single operation, whereas Net2Net has to consider depth and width changes separately. The experimental results in [21] show that network morphism can substantially accelerate the training process, as it uses one-fifteenth of the training time and achieves better results than the original VGG16.
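A minimal sketch of the depth-morphing idea (Net2Deeper/IdMorph style): a new linear layer initialized to the identity is inserted into a toy MLP, so the deeper child computes the same function as its parent before further training. This is an illustrative toy under the stated assumptions, not the exact algorithm of [21].

import torch
import torch.nn as nn

def deepen_with_identity(mlp, position, width):
    """Insert a linear layer initialized to the identity mapping at `position`."""
    new_layer = nn.Linear(width, width)
    with torch.no_grad():
        new_layer.weight.copy_(torch.eye(width))    # identity weight matrix
        new_layer.bias.zero_()                      # zero bias keeps outputs unchanged
    layers = list(mlp.children())
    layers.insert(position, new_layer)
    return nn.Sequential(*layers)

parent = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
child = deepen_with_identity(parent, position=2, width=16)

x = torch.randn(5, 8)
assert torch.allclose(parent(x), child(x))          # the deeper child preserves the function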
Many EA-based NAS methods use a direct encoding scheme that represents the architecture as a sequence of nodes, which is fixed-length, i.e., predefined manually. For representing variable-length neural networks, DAG encoding is a promising solution [28, 25, 19]. For example, Suganuma et al. [28] used the Cartesian genetic programming (CGP) [143, 144] encoding scheme to represent a neural network built from a list of sub-modules that are defined as a DAG. Similarly, in [25], the neural architecture is also encoded as a graph, whose vertices indicate rank-3 tensors or activations (with batch normalization performed with rectified linear units (ReLUs) or plain linear units) and edges indicate identity connections or convolutions. Neuroevolution of augmenting topologies (NEAT) [24, 25] also uses a direct encoding scheme, where each node and connection is stored.

Indirect encoding specifies a generation rule to build the network and allows for a more compact representation. Cellular encoding (CE) [145] is an example of a system that utilizes indirect encoding of network structures: it encodes a family of neural networks into a set of labeled trees and is based on a simple graph grammar. Some recent studies [146, 147, 148, 27] have described the use of indirect encoding schemes to represent a network. For example, the network in [27] can be encoded by a function.

During evolution, networks are selected for reproduction in one of several ways. The first is fitness selection, in which the probability of a network h_i being selected is proportional to its fitness value, i.e.,

P(h_i) = \frac{\mathrm{Fitness}(h_i)}{\sum_{j=1}^{N} \mathrm{Fitness}(h_j)}
where hi indicates the i-th network. The second is
rank selection, which is similar to fitness selection,
but with the network’s selection probability being
proportional to its relative fitness rather than its
absolute fitness. The third method is tournament
selection [25, 27, 26, 19]. Here, in each iteration, k
(tournament size) networks are randomly selected
from the population and sorted according to their
performance; then, the best network is selected with
a probability of p, the second-best network has a
probability of p × (1 − p), and so on.
• Crossover After selection, every two networks are se-
lected to generate a new offspring network, inheriting
half of the genetic information of each of its parents.
This process is analogous to the genetic recombina-
tion, which occurs during biological reproduction and
crossover. The particular manner of crossover varies
and depends on the encoding scheme. In binary en-
coding, networks are encoded as a linear string of
bits, where each bit represents a unit, such that two
parent networks can be combined through one- or
multiple-point crossover. However, the crossover of
the data arranged in such a fashion can sometimes damage the data. Thus, Xie et al. [30] denoted the basic unit in a crossover as a stage rather than a bit, where a stage is a higher-level structure constructed by a binary string. For cellular encoding, a randomly selected sub-tree is cut from one parent tree to replace a sub-tree cut from the other parent tree. In another approach, NEAT performs an artificial synapsis based on historical markings, adding a new structure without losing track of the gene present throughout the simulation.

• Mutation As the genetic information of the parents is copied and inherited by the next generation, gene mutation also occurs. A point mutation [28, 30] is one of the most widely used operations and involves randomly and independently flipping each bit. Two types of mutations have been described in [29]: one enables or disables a connection between two layers, and the other adds or removes skip connections between two nodes or layers. Meanwhile, Real and Moore et al. [25] predefined a set of mutation operators, such as altering the learning rate and removing skip connections between the nodes. By analogy with the biological process, although a mutation may appear as a mistake that causes damage to the network structure and leads to a loss of functionality, it also enables the exploration of more novel structures and ensures diversity.

Figure 14: Overview of neural architecture search using reinforcement learning. The controller (an RNN) takes the action A_t of sampling an architecture; the environment trains and evaluates it and returns the reward R_t and the next state S_t+1 to the controller.

After the sampled architecture is trained and evaluated, the validation results (such as accuracy) are returned. Many follow-up approaches [23, 15, 16, 13] have used this framework, but with different controller policies and neural-architecture encodings. Zoph et al. [12] first used the policy gradient algorithm [150] to train the controller, and sequentially sampled a string to encode the entire neural architecture. In a subsequent study [15], they used the proximal policy optimization (PPO) algorithm [151] to update the controller, and proposed the method shown in Figure 15 to build a cell-based neural architecture. MetaQNN [23] is a meta-modeling algorithm that uses Q-learning with an ε-greedy exploration strategy and experience replay to sequentially search for neural architectures.
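As a concrete, deliberately simplified illustration of the policy-gradient idea in Figure 14, the sketch below updates a controller with REINFORCE. The RNN controller of [12] is replaced by a plain table of logits, and evaluate_architecture is a hypothetical placeholder for training the sampled child network and returning its validation accuracy.

import torch
import torch.nn as nn

L, K = 6, 5                                      # 6 decisions, 5 candidate operations each
logits = nn.Parameter(torch.zeros(L, K))         # controller parameters (no RNN, for brevity)
optimizer = torch.optim.Adam([logits], lr=0.05)
baseline = 0.0                                   # moving-average baseline to reduce variance

def evaluate_architecture(arch):
    # Hypothetical stand-in: a real system would train the sampled child network
    # and return its validation accuracy as the reward.
    return torch.rand(()).item()

for step in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    arch = dist.sample()                         # action A_t: sample an architecture
    reward = evaluate_architecture(arch)         # reward R_t
    baseline = 0.9 * baseline + 0.1 * reward
    loss = -dist.log_prob(arch).sum() * (reward - baseline)   # REINFORCE policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()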
ENAS [13] forces all child models to share parameters, obviating the need to train each child model from scratch. Thus, ENAS took only approximately 10 hours using one GPU to search for the best architecture on the CIFAR-10 dataset, which is nearly 1000× faster than [12].

4.2.3. Gradient Descent
The above-mentioned search strategies sample neural architectures from a discrete search space. A pioneering algorithm, namely DARTS [17], was among the first gradient descent (GD)-based methods to search for neural architectures over a continuous and differentiable search space, using a softmax function to relax the discrete space, as outlined below:

\bar{o}_{i,j}(x) = \sum_{k=1}^{K} \frac{\exp(\alpha_{i,j}^{k})}{\sum_{l=1}^{K} \exp(\alpha_{i,j}^{l})} \, o_k(x)    (4)

where o_k(x) indicates the operation performed on input x, \alpha_{i,j}^{k} indicates the weight assigned to the operation o_k between a pair of nodes (i, j), and K is the number of predefined candidate operations. After the relaxation, the task of searching for architectures is transformed into a joint optimization of the neural architecture α and the weights θ of this neural architecture. These two types of parameters are optimized alternately, indicating a bilevel optimization problem. Specifically, α and θ are optimized with the validation and the training sets, respectively. The training and validation losses are denoted by L_train and L_val, respectively. Hence, the total loss function can be derived as follows:

\min_{\alpha} \; L_{val}(\theta^{*}, \alpha) \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta} L_{train}(\theta, \alpha)    (5)

Figure 16 presents an overview of DARTS, where a cell is composed of N (here N = 4) ordered nodes, and the node z_k (k starts from 0) is connected to the node z_i, i ∈ {k + 1, ..., N}. The operation on each edge e_{i,j} is initially a mixture of candidate operations, each being of equal weight. Therefore, the neural architecture α is a supernet that contains all possible child neural architectures. At the end of the search, the final architecture is derived by retaining only the maximum-weight operation among all mixed operations.

Although DARTS substantially reduces the search time, it incurs several problems. First, as Eq. 5 shows, DARTS describes the joint optimization of the neural architecture and weights as a bilevel optimization problem. However, this problem is difficult to solve directly, because both the architecture α and the weights θ are high-dimensional parameters. Another solution is single-level optimization, which can be formalized as

\min_{\theta, \alpha} \; L_{train}(\theta, \alpha)    (6)

which optimizes both the neural architecture and the weights together. Although the single-level optimization problem can be efficiently solved as regular training, the searched architecture α commonly overfits the training set and its performance on the validation set cannot be guaranteed. The authors in [153] proposed mixed-level optimization:

\min_{\alpha, \theta} \; \big[ L_{train}(\theta^{*}, \alpha) + \lambda L_{val}(\theta^{*}, \alpha) \big]    (7)

where α indicates the neural architecture, θ is the weight assigned to it, and λ is a non-negative regularization variable that controls the weights of the training loss and the validation loss. When λ = 0, Eq. 7 reduces to single-level optimization (Eq. 6); in contrast, Eq. 7 becomes a bilevel optimization (Eq. 5). The experimental results presented in [153] showed that mixed-level optimization not only overcomes the overfitting issue of single-level optimization but also avoids the gradient error of bilevel optimization.

Second, in DARTS, the output of each edge is the weighted sum of all candidate operations (shown in Eq. 4) during the whole search stage, which leads to a linear increase in GPU memory requirements with the number of candidate operations. To reduce resource consumption, many subsequent studies [154, 155, 153, 156, 131] have developed a differentiable sampler to sample a child architecture from the supernet by using a reparameterization trick, namely Gumbel Softmax [157]. The neural architecture is fully factorized and modeled with a concrete distribution [158], which provides an efficient approach to sampling a child architecture and allows gradient backpropagation. Therefore, Eq. 4 is re-formulated as

\bar{o}_{i,j}(x) = \sum_{k=1}^{K} \frac{\exp\big( (\log \alpha_{i,j}^{k} + G_{i,j}^{k}) / \tau \big)}{\sum_{l=1}^{K} \exp\big( (\log \alpha_{i,j}^{l} + G_{i,j}^{l}) / \tau \big)} \, o_k(x)    (8)

where G_{i,j}^{k} = -\log(-\log(u_{i,j}^{k})) is the k-th Gumbel sample, u_{i,j}^{k} is a uniform random variable, and τ is the Softmax
temperature. When τ → 0, the probability distribution over the operations between each node pair approaches a one-hot distribution. In GDAS [154], only the operation with the maximum probability for each edge is selected during the forward pass, while the gradient is backpropagated according to Eq. 8. In other words, only one path of the supernet is selected for training, thereby reducing the GPU memory usage. Besides, ProxylessNAS [132] alleviates the huge resource consumption through path binarization. Specifically, it transforms the real-valued path weights [17] into binary gates, which activate only one path of the mixed operations and hence solve the memory issue.
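The sketch below illustrates Eq. 8 on a single edge: Gumbel noise is added to the log-weights and a temperature-controlled softmax produces the relaxed mixing weights. It is a simplified illustration with an assumed operation set, not the reference implementation of [154] or [132].

import torch
import torch.nn as nn
import torch.nn.functional as F

K, C = 4, 16
ops = nn.ModuleList([
    nn.Conv2d(C, C, 3, padding=1),
    nn.Conv2d(C, C, 1),
    nn.MaxPool2d(3, stride=1, padding=1),
    nn.Identity(),
])
log_alpha = nn.Parameter(torch.zeros(K))   # architecture parameters for one edge (log alpha in Eq. 8)

def edge_forward(x, tau=1.0):
    u = torch.rand(K).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))                       # G_k = -log(-log(u_k))
    weights = F.softmax((log_alpha + gumbel) / tau, dim=0)   # Eq. 8 mixing weights
    # GDAS-style single-path training would execute only ops[weights.argmax()];
    # here the full relaxed mixture of Eq. 8 is returned for clarity.
    return sum(w * op(x) for w, op in zip(weights, ops))

y = edge_forward(torch.randn(1, C, 8, 8))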
Another problem is the optimization of different operations together, as they may compete with each other, leading to a negative influence. For example, several studies [159, 128] have found that the skip-connect operation dominates at a later search stage in DARTS, which causes the network to be shallower and leads to a marked deterioration in performance. To solve this problem, DARTS+ [159] uses an additional early-stop criterion, such that the search process stops when two or more skip-connects occur in a normal cell. In another example, P-DARTS [128] regularizes the search space by executing operation-level dropout to control the proportion of skip-connect operations occurring during training and evaluation.

Figure 16: Overview of DARTS. (a) The data can only flow from lower-level nodes to higher-level nodes, and the operations on edges are initially unknown. (b) The initial operation on each edge is a mixture of candidate operations, each having equal weight. (c) The weight of each operation is learnable and ranges from 0 to 1, whereas for previous discrete sampling methods, the weight could only be 0 or 1. (d) The final neural architecture is constructed by preserving the maximum weight-value operation on each edge.

4.2.4. Surrogate Model-based Optimization
Another group of architecture optimization methods is surrogate model-based optimization (SMBO) algorithms [33, 34, 160, 161, 162, 163, 164, 165, 166, 18]. The core concept of SMBO is that it builds a surrogate model of the objective function by iteratively keeping a record of past evaluation results, and uses the surrogate model to predict the most promising architecture. Thus, these methods can substantially shorten the search time and improve efficiency.

SMBO algorithms differ in their surrogate models, which can be broadly divided into Bayesian optimization (BO) methods (including Gaussian process (GP) [167], random forest (RF) [37], and tree-structured Parzen estimator (TPE) [168]) and neural networks [164, 169, 18, 166].

BO [170, 171] is one of the most popular methods for hyperparameter optimization. Many recent studies [33, 34, 160, 161, 162, 163, 164, 165] have attempted to apply these SOTA BO methods to AO. For example, in [172, 173, 160, 165, 174, 175], the validation results of the generated neural architectures were modeled as a Gaussian process, which guides the search for the optimal neural architectures. However, in GP-based BO methods, the inference time scales cubically in the number of observations, and they cannot effectively handle variable-length neural networks. Camero et al. [176] proposed three fixed-length encoding schemes to cope with variable-length problems by using RF as the surrogate model. Similarly, both [33] and [176] used RF as the surrogate model, and [177] showed that it works better in high-dimensional settings than GP-based methods. Instead of using BO, some studies have used a neural network as the surrogate model. For example, in PNAS [18] and EPNAS [166], an LSTM is derived as the surrogate model to progressively predict variable-sized architectures. Meanwhile, NAO [169] uses a simpler surrogate model, i.e., a multilayer perceptron (MLP), and NAO is more efficient and achieves better results on CIFAR-10 than does PNAS [18]. White et al. [164] trained an ensemble of neural networks to predict the mean and variance of the validation results for candidate neural architectures.

4.2.5. Grid and Random Search
Both grid search (GS) and random search (RS) are simple optimization methods applied in several NAS studies [178, 179, 180, 11]. For instance, Geifman et al. [179] proposed a modular architecture search space A = {A(B, i, j) | i ∈ {1, 2, ..., N_cells}, j ∈ {1, 2, ..., N_blocks}} that is spanned by the grid defined by the two corners A(B, 1, 1) and A(B, N_cells, N_blocks), where B is a searched block structure. Evidently, a larger value of N_cells × N_blocks leads to the exploration of a larger space but requires more resources. The authors in [180] conducted an effectiveness comparison between SOTA NAS methods and RS. The results showed that RS is a competitive NAS baseline. Specifically, RS with an early-stopping strategy performs as well as ENAS [13], which is a leading RL-based NAS method. Besides, Yu et al. [11] demonstrated that the SOTA NAS techniques are not significantly better than random search.

4.2.6. Hybrid Optimization Method
The above-mentioned architecture optimization methods have their own advantages and disadvantages. 1) EA is a mature global optimization method with high robustness. However, it requires considerable computational resources [26, 25], and its evolution operations (such as crossover and mutation) are performed randomly. 2) Although RL-based methods (e.g., ENAS [13]) can learn complex architectural patterns, the searching efficiency and stability
of the RL controller are not guaranteed because it may
take several
actions to obtain a positive reward. 3) The GD-based methods (e.g., DARTS [17]) substantially improve the searching efficiency by relaxing the categorical candidate operations to continuous variables. Nevertheless, in essence, they all search for a child network from a supernet, which limits the diversity of neural architectures. Therefore, some methods have been proposed to incorporate different optimization methods to capture the best of their advantages; these methods are summarized as follows.

EA+RL. Chen et al. [42] integrated reinforced mutations into an EA, which avoids the randomness of evolution and improves the searching efficiency. Another similar method developed in parallel is the evolutionary-neural hybrid controller (Evo-NAS) [41], which also captures the merits of both RL-based methods and EA. The Evo-NAS controller's mutations are guided by an RL-trained neural network, which can explore a vast search space and sample architectures efficiently.
EA+GD. Yang et al. [40] combined the EA and GD-
based method. The architectures share parameters within
one supernet and are tuned on the training set with a few
epochs. Then, the populations and the supernet are di-
rectly inherited in the next generation, which substantially
accelerates the evolution. The authors in [40] only took 0.4
GPU days for searching, which is more efficient than early
EA methods (e.g., AmoebaNet [26] took 3150 GPU days
and 450 GPUs for searching).
EA+SMBO. The authors in [43] used RF as a surro-
gate to predict model performance, which accelerates the
fitness evaluation in EA.
GD+SMBO. Unlike DARTS, which learns weights
for candidate operations, NAO [169] proposes a variational
autoencoder to generate neural architectures and further
build a regression model as a surrogate to predict the
performance of the generated architecture. The
encoder maps the representations of the neural
architecture to continuous space, and then a predictor
network takes the continuous representations of the
neural architecture as input and predicts the
corresponding accuracy. Finally, the decoder is used to
derive the final architecture from a continuous network
representation.
BO builds a probabilistic model mapping from the hyperparameters to the objective metrics evaluated on the validation set. It well balances exploration (evaluating as many hyperparameter sets as possible) and exploitation (allocating more resources to promising hyperparameters).

Algorithm 1 Sequential Model-Based Optimization
INPUT: f, Θ, S, M
D ← INITSAMPLES(f, Θ)
for i in [1, 2, ..., T] do
    p(y|θ, D) ← FITMODEL(M, D)
    θi ← arg max_{θ∈Θ} S(θ, p(y|θ, D))
    yi ← f(θi)    ▷ Expensive step
    D ← D ∪ (θi, yi)
end for

The steps of SMBO are expressed in Algorithm 1 (adopted from [170]). Here, several inputs need to be predefined initially, including an evaluation function f, a search space Θ, an acquisition function S, a probabilistic model M, and a record dataset D. Specifically, D is a dataset that records many sample pairs (θi, yi), where θi ∈ Θ indicates a sampled neural architecture and yi indicates its evaluation result. After the initialization, the SMBO steps are described as follows:

1. The first step is to tune the probabilistic model M to fit the record dataset D.
2. The acquisition function S is used to select the next promising neural architecture from the probabilistic model M.
3. The performance of the selected neural architecture is evaluated by f, which is an expensive step, as it involves training the neural network on the training set and evaluating it on the validation set.
4. The record dataset D is updated by appending the new pair of results (θi, yi).

The above four steps are repeated T times, where T needs to be specified according to the total time or resources available. The commonly used surrogate models for the BO method are GP, RF, and TPE. Table 2 summarizes the existing open-source BO methods, where GP is one of the most popular surrogate models. However, GP scales cubically with the number of data samples, while RF can natively handle large spaces and scales better to many data samples. Besides, Falkner and Klein et al. [38] proposed the BO-based hyperband (BOHB) algorithm, which combines the strengths of TPE-based BO and hyperband and hence performs much better than standard BO methods. Furthermore, FABOLAS [35] is a faster BO procedure, which maps the validation loss and training time as functions of the dataset size, i.e., trains a generative model on a sub-dataset that gradually increases in size. FABOLAS is 10-100 times faster than other SOTA BO algorithms and identifies the most promising hyperparameters.

Library | Model | URL
Spearmint | GP | https://github.com/HIPS/Spearmint
MOE | GP | https://github.com/Yelp/MOE
PyBO | GP | https://github.com/mwhoffman/pybo
Bayesopt | GP | https://github.com/rmcantin/bayesopt
Scikit-Optimize | GP | https://scikit-optimize.github.io
GPyOpt | GP | http://sheffieldml.github.io/GPyOpt
SMAC | RF | https://github.com/automl/SMAC3
Hyperopt | TPE | http://hyperopt.github.io/hyperopt
BOHB | TPE | https://github.com/automl/HpBandSter

Table 2: Open-source Bayesian optimization libraries. GP, RF, and TPE represent Gaussian process [167], random forest [37], and tree-structured Parzen estimator [168], respectively.
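To make Algorithm 1 concrete, the sketch below runs a minimal SMBO loop with a random-forest surrogate and a simple "best predicted value over random candidates" acquisition rule. The objective f and the search space are toy stand-ins, so this illustrates the loop rather than re-implementing any library in Table 2.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def f(theta):                          # expensive objective (toy stand-in: validation error)
    return (theta[0] - 0.3) ** 2 + (theta[1] - 0.7) ** 2

def sample_space(n):                   # search space: two hyperparameters in [0, 1]^2
    return np.random.rand(n, 2)

D_theta = list(sample_space(5))        # D <- INITSAMPLES(f, Theta)
D_y = [f(t) for t in D_theta]

for i in range(20):                    # T = 20 SMBO iterations
    model = RandomForestRegressor(n_estimators=50).fit(np.array(D_theta), D_y)  # FITMODEL
    candidates = sample_space(256)
    pred = model.predict(candidates)                   # surrogate predictions
    theta_i = candidates[int(np.argmin(pred))]         # acquisition: best predicted value
    y_i = f(theta_i)                                   # the expensive evaluation step
    D_theta.append(theta_i)
    D_y.append(y_i)                                    # D <- D U {(theta_i, y_i)}

best = D_theta[int(np.argmin(D_y))]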
4.3.3. Gradient-based Optimization
Another group of HPO methods are gradient-based optimization (GO) algorithms [187, 188, 189, 190, 191, 192]. Unlike the above blackbox HPO methods (e.g., GS, RS, and BO), GO methods use the gradient information to optimize the hyperparameters and substantially improve the efficiency of HPO. Maclaurin et al. [189] proposed a reversible-dynamics memory-tape approach to handle thousands of hyperparameters efficiently through the gradient information. However, optimizing many hyperparameters is computationally challenging. To alleviate this issue, the authors in [190] used approximate gradient information rather than the true gradient to optimize continuous hyperparameters, where the hyperparameters can be updated before the model is trained to converge. Franceschi et al. [191] studied both reverse- and forward-mode GO methods. The reverse-mode method differs from the method proposed in [189] and does not require reversible dynamics; however, it needs to store the entire training history for computing the gradient with respect to the hyperparameters. The forward-mode method overcomes this problem by updating the hyperparameters in real time, and is demonstrated to significantly improve the efficiency of HPO on large datasets. Chandra [192] proposed a gradient-based ultimate optimizer, which can optimize not only the regular hyperparameters (e.g., the learning rate) but also those of the optimizer (e.g., the moment coefficients β1 and β2 of the Adam optimizer [193]).
5. Model Evaluation

Once a new neural network has been generated, its performance must be evaluated. An intuitive method is to train the network to convergence and then evaluate its performance. However, this method requires extensive time and computing resources. For example, [12] took 800 K40 GPUs and 28 days in total to search. Additionally, NASNet [15] and AmoebaNet [26] required 500 P100 GPUs and 450 K40 GPUs, respectively. In this section, we summarize several algorithms for accelerating the process of model evaluation.

5.1. Low fidelity
As the model training time is highly related to the dataset and model size, model evaluation can be accelerated in different ways. First, the number of images or the resolution of images (for image-classification tasks) can be decreased. For example, FABOLAS [35] trains the model on a subset of the training set to accelerate model evaluation. In [194], ImageNet64×64 and its variants 32×32 and 16×16 are provided; these lower-resolution datasets retain characteristics similar to those of the original ImageNet dataset. Second, low-fidelity model evaluation can be realized by reducing the model size, such as by training with fewer filters per layer [15, 26]. By analogy to ensemble learning, [195] proposes the Transfer Series Expansion (TSE), which constructs an ensemble estimator by linearly combining a series of basic low-fidelity estimators, hence avoiding the bias that can derive from using a single low-fidelity estimator. Furthermore, Zela et al. [34] empirically demonstrated that there is only a weak correlation between performance after short and long training times, thus confirming that a prolonged search for network configurations is unnecessary.
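The sketch below shows the low-fidelity idea in its simplest form: candidate models are ranked with a cheap proxy (a subset of the data at reduced resolution, trained for very few epochs), and only the top-ranked candidate would then be trained fully. The data and candidate models are toy placeholders, not the protocol of any cited study.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: random "images" replace a real dataset; each candidate is a small CNN
# whose width is the searched choice. Only the proxy setup matters here: fewer samples,
# lower resolution, and fewer epochs.
full_x, full_y = torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,))

def make_candidate(width):
    return nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, 10))

def proxy_score(model, n_samples=128, resolution=16, epochs=1):
    x = F.interpolate(full_x[:n_samples], size=(resolution, resolution))  # low-res subset
    y = full_y[:n_samples]
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                                   # short, cheap training
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        return (model(x).argmax(1) == y).float().mean().item()

candidates = {w: make_candidate(w) for w in (8, 16, 32)}
scores = {w: proxy_score(m) for w, m in candidates.items()}   # cheap low-fidelity ranking
best_width = max(scores, key=scores.get)                      # only this candidate is trained fully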
5.2. Weight sharing
In [12], once a network has been evaluated, it is dropped. Hence, the technique of weight sharing is used to accelerate the process of NAS. For example, Wong and Lu et al. [196] proposed transfer neural AutoML, which uses knowledge from prior tasks to accelerate network design. ENAS [13] shares parameters among child networks, leading to a thousand-fold faster network design than [12]. Network morphism based algorithms [20, 21] can also inherit the weights of previous architectures, and single-path NAS [197] uses a single-path over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters.

5.3. Surrogate
The surrogate-based method [198, 199, 200, 43] is another powerful tool that approximates the black-box function. In general, once a good approximation has been obtained, it is trivial to find the configurations that directly optimize the original expensive objective. For example, Progressive Neural Architecture Search (PNAS) [18] introduces a surrogate model to control the method of searching. Although ENAS has been proven to be very efficient, PNAS is even more efficient, as the number of models evaluated by PNAS is over five times that evaluated by ENAS, and PNAS is eight times faster in terms of total computational speed. A well-performing surrogate usually requires a large number of labeled architectures, while the optimization space is too large and hard to quantify, and the evaluation of each configuration is extremely expensive [201]. To alleviate this issue, Luo et al. [202] proposed SemiNAS, a semi-supervised NAS method, which leverages a large number of unlabeled architectures to train the surrogate, a controller that is used to predict the accuracy of architectures without evaluation. Initially, the surrogate is trained with only a small number of labeled data pairs (architecture, accuracy); then, the generated data pairs are gradually added to the original data to further improve the surrogate.

5.4. Early stopping
Early stopping was first used to prevent overfitting in classical ML, and it has been used in several recent studies [203, 204, 205] to accelerate model evaluation by stopping evaluations that are predicted to perform poorly on the validation set. For example, [205] proposes a learning-curve model that is a weighted combination of a set of parametric curve models selected from the literature, thereby enabling the performance of the network to be predicted. Furthermore, [206] presents a novel approach for early stopping based on fast-to-compute local statistics of the computed gradients, which no longer relies on the validation set and allows the optimizer to make full use of all of the training data.

6. NAS Discussion

In Section 4, we reviewed the various search space and architecture optimization methods, and in Section 5, we summarized commonly used model evaluation methods. These two sections introduced many NAS studies, which may cause the readers to get lost in details. Therefore, in this section, we summarize and compare these NAS algorithms' performance from a global perspective to provide readers a clearer and more comprehensive understanding of NAS methods' development. Then, we discuss some major topics of the NAS technique.

6.1. NAS Performance Comparison
Many NAS studies have proposed several neural architecture variants, where each variant is designed for a different scenario. For instance, some architecture variants perform better but are larger, while some are lightweight for mobile devices but with a performance penalty. Therefore, we only report the representative results of each study. Besides, to ensure a valid comparison, we consider the accuracy and algorithm efficiency as comparison indices. As the number
Reference | Published in | #Params (Millions) | Top-1 Acc (%) | GPU Days | #GPUs | AO
ResNet-110 [2] ECCV16 1.7 93.57 - - Manually
PyramidNet [207] CVPR17 26 96.69 - - designed
DenseNet [127] CVPR17 25.6 96.54 - -
GeNet#2 (G-50) [30] ICCV17 - 92.9 17 -
Large-scale ensemble [25] ICML17 40.4 95.6 2,500 250
Hierarchical-EAS [19] ICLR18 15.7 96.25 300 200
CGP-ResSet [28] IJCAI18 6.4 94.02 27.4 2
AmoebaNet-B (N=6, F=128)+c/o [26] AAAI19 34.9 97.87 3,150 450 K40 EA
AmoebaNet-B (N=6, F=36)+c/o [26] AAAI19 2.8 97.45 3,150 450 K40
Lemonade [27] ICLR19 3.4 97.6 56 8 Titan
EENA [149] ICCV19 8.47 97.44 0.65 1 Titan Xp
EENA (more channels)[149] ICCV19 54.14 97.79 0.65 1 Titan Xp
NASv3[12] ICLR17 7.1 95.53 22,400 800 K40
NASv3+more filters [12] ICLR17 37.4 96.35 22,400 800 K40
MetaQNN [23] ICLR17 - 93.08 100 10
NASNet-A (7 @ 2304)+c/o [15] CVPR18 87.6 97.60 2,000 500 P100
NASNet-A (6 @ 768)+c/o [15] CVPR18 3.3 97.35 2,000 500 P100
Block-QNN-Connection more filter [16] CVPR18 33.3 97.65 96 32 1080Ti
Block-QNN-Depthwise, N=3 [16] CVPR18 3.3 97.42 96 32 1080Ti RL
ENAS+macro [13] ICML18 38.0 96.13 0.32 1
ENAS+micro+c/o [13] ICML18 4.6 97.11 0.45 1
Path-level EAS [139] ICML18 5.7 97.01 200 -
Path-level EAS+c/o [139] ICML18 5.7 97.51 200 -
ProxylessNAS-RL+c/o[132] ICLR19 5.8 97.70 - -
FPNAS[208] ICCV19 5.76 96.99 - -
DARTS(first order)+c/o[17] ICLR19 3.3 97.00 1.5 4 1080Ti
DARTS(second order)+c/o[17] ICLR19 3.3 97.23 4 4 1080Ti
sharpDARTS [178] ArXiv19 3.6 98.07 0.8 1 2080Ti
P-DARTS+c/o[128] ICCV19 3.4 97.50 0.3 -
P-DARTS(large)+c/o[128] ICCV19 10.5 97.75 0.3 -
SETN[209] ICCV19 4.6 97.31 1.8 -
GD
GDAS+c/o [154] CVPR19 2.5 97.18 0.17 1
SNAS+moderate constraint+c/o [155] ICLR19 2.8 97.15 1.5 1
BayesNAS[210] ICML19 3.4 97.59 0.1 1
ProxylessNAS-GD+c/o[132] ICLR19 5.7 97.92 - -
PC-DARTS+c/o [211] CVPR20 3.6 97.43 0.1 1 1080Ti
MiLeNAS[153] CVPR20 3.87 97.66 0.3 -
SGAS[212] CVPR20 3.8 97.61 0.25 1 1080Ti
GDAS-NSAS[213] CVPR20 3.54 97.27 0.4 -
NASBOT[160] NeurIPS18 - 91.31 1.7 -
PNAS [18] ECCV18 3.2 96.59 225 -
SMBO
EPNAS[166] BMVC18 6.6 96.29 1.8 1
GHN[214] ICLR19 5.7 97.16 0.84 -
NAO+random+c/o[169] NeurIPS18 10.6 97.52 200 200 V100
SMASH [14] ICLR18 16 95.97 1.5 -
Hierarchical-random [19] ICLR18 15.7 96.09 8 200
RS
RandomNAS [180] UAI19 4.3 97.15 2.7 -
DARTS - random+c/o [17] ICLR19 3.2 96.71 4 1
RandomNAS-NSAS[213] CVPR20 3.08 97.36 0.7 -
NAO+weight sharing+c/o [169] NeurIPS18 2.5 97.07 0.3 1 V100 GD+SMBO
RENASNet+c/o[42] CVPR19 3.5 91.12 1.5 4 EA+RL
CARS[40] CVPR20 3.6 97.38 0.4 - EA+GD
Table 3: Performance of different NAS algorithms on CIFAR-10. The “AO” column indicates the architecture optimization method. The
dash (-) indicates that the corresponding information is not provided in the original paper. “c/o” indicates the use of Cutout [89]. RL,
EA, GD, RS, and SMBO indicate reinforcement learning, evolution-based algorithm, gradient descent, random search, and surrogate model-
based optimization, respectively.
Reference | Published in | #Params (Millions) | Top-1/Top-5 Acc (%) | GPU Days | #GPUs | AO
ResNet-152 [2] CVPR16 230 70.62/95.51 - -
PyramidNet [207] CVPR17 116.4 70.8/95.3 - -
SENet-154 [126] CVPR17 - 71.32/95.53 - - Manually
DenseNet-201 [127] CVPR17 76.35 78.54/94.46 - - designed
MobileNetV2 [215] CVPR18 6.9 74.7/- - -
GeNet#2[30] ICCV17 - 72.13/90.26 17 -
AmoebaNet-C(N=4,F=50)[26] AAAI19 6.4 75.7/92.4 3,150 450 K40
Hierarchical-EAS[19] ICLR18 - 79.7/94.8 300 200
EA
AmoebaNet-C(N=6,F=228)[26] AAAI19 155.3 83.1/96.3 3,150 450 K40
GreedyNAS [216] CVPR20 6.5 77.1/93.3 1 -
NASNet-A(4@1056) ICLR17 5.3 74.0/91.6 2,000 500 P100
NASNet-A(6@4032) ICLR17 88.9 82.7/96.2 2,000 500 P100
Block-QNN[16] CVPR18 91 81.0/95.42 96 32 1080Ti
Path-level EAS[139] ICML18 - 74.6/91.9 8.3 -
ProxylessNAS(GPU) [132] ICLR19 - 75.1/92.5 8.3 -
RL
ProxylessNAS-RL(mobile) [132] ICLR19 - 74.6/92.2 8.3 -
MnasNet[130] CVPR19 5.2 76.7/93.3 1,666 -
EfficientNet-B0[142] ICML19 5.3 77.3/93.5 - -
EfficientNet-B7[142] ICML19 66 84.4/97.1 - -
FPNAS[208] ICCV19 3.41 73.3/- 0.8 -
DARTS (searched on CIFAR-10)[17] ICLR19 4.7 73.3/81.3 4 -
sharpDARTS[178] Arxiv19 4.9 74.9/92.2 0.8 -
P-DARTS[128] ICCV19 4.9 75.6/92.6 0.3 -
SETN[209] ICCV19 5.4 74.3/92.0 1.8 -
GDAS [154] CVPR19 4.4 72.5/90.9 0.17 1
SNAS[155] ICLR19 4.3 72.7/90.8 1.5 -
ProxylessNAS-G[132] ICLR19 - 74.2/91.7 - -
BayesNAS[210] ICML19 3.9 73.5/91.1 0.2 1
FBNet[131] CVPR19 5.5 74.9/- 216 -
OFA[217] ICLR20 7.7 77.3/- - - GD
AtomNAS[218] ICLR20 5.9 77.6/93.6 - -
MiLeNAS[153] CVPR20 4.9 75.3/92.4 0.3 -
DSNAS[219] CVPR20 - 74.4/91.54 17.5 4 Titan X
SGAS[212] CVPR20 5.4 75.9/92.7 0.25 1 1080Ti
PC-DARTS [211] CVPR20 5.3 75.8/92.7 3.8 8 V100
DenseNAS[220] CVPR20 - 75.3/- 2.7 -
FBNetV2-L1[221] CVPR20 - 77.2/- 25 8 V100
PNAS-5(N=3,F=54)[18] ECCV18 5.1 74.2/91.9 225 -
PNAS-5(N=4,F=216)[18] ECCV18 86.1 82.9/96.2 225 -
SMBO
GHN[214] ICLR19 6.1 73.0/91.3 0.84 -
SemiNAS[202] CVPR20 6.32 76.5/93.2 4 -
Hierarchical-random[19] ICLR18 - 79.6/94.7 8.3 200
RS
OFA-random[217] CVPR20 7.7 73.8/- - -
RENASNet[42] CVPR19 5.36 75.7/92.6 - - EA+RL
Evo-NAS[41] Arxiv20 - 75.43/- 740 - EA+RL
CARS[40] CVPR20 5.1 75.2/92.5 0.4 - EA+GD
Table 4: Performance of different NAS algorithms on ImageNet. The “AO” column indicates the architecture optimization method. The
dash (-) indicates that the corresponding information is not provided in the original paper. RL, EA, GD, RS, and SMBO indicate
reinforcement learning, evolution-based algorithm, gradient descent, random search, and surrogate model-based optimization,
respectively.
where N_C and N_D indicate the numbers of concordant and discordant pairs, and τ is a number in the range [-1, 1].

Such NAS benchmark datasets enable NAS researchers to focus solely on verifying the effectiveness and efficiency of their AO algorithms, avoiding repetitive training for selected architectures and substantially helping the NAS community to develop.
6.2. One-stage vs. Two-stage
The NAS methods can be roughly divided into two classes according to their flow, two-stage and one-stage, as shown in Figure 18.
Two-stage NAS comprises a searching stage and an evaluation stage. The searching stage involves two processes: architecture optimization, which aims to find the optimal architecture, and parameter training, which trains the found architecture's parameters. The simplest idea is to train the parameters of all possible architectures from scratch and then choose the optimal architecture. However, this is resource-consuming (e.g., NAS-RL [12] took 22,400 GPU days with 800 K40 GPUs for searching), which is infeasible for most companies and institutes. Therefore, most NAS methods (such as ENAS [13] and DARTS [17]) sample and train many candidate architectures in the searching stage, and then further retrain the best-performing architecture in the evaluation stage.
One-stage NAS refers to a class of NAS methods that can export a well-designed and well-trained neural architecture without extra retraining, by running AO and parameter training simultaneously. In this way, the efficiency can be substantially improved. However, the model architecture and its weight parameters are highly coupled, and it is difficult to optimize them simultaneously. Several recent studies [217, 227, 228, 218] have attempted to overcome this challenge. For instance, the authors in [217] proposed the progressive shrinking algorithm to post-process the weights after the training was completed. They first pretrained the entire neural network, and then progressively fine-tuned the smaller networks that shared weights with the complete network. Based on well-designed constraints, the performance of all subnetworks was guaranteed. Thus, given a target deployment device, a specialized subnetwork can be directly exported without fine-tuning. However, [217] was still computationally resource-intensive, as the whole process took 1,200 GPU hours with V100 GPUs. BigNAS [228] revisited the conventional training techniques of stand-alone networks, and empirically proposed several techniques to handle a wider set of models, ranging in size from 200M to 1G FLOPs, whereas [217] only handled models under 600M FLOPs. Both AtomNAS [218] and DSNAS [219] proposed an end-to-end one-stage NAS framework to further boost the performance and simplify the flow.
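The sketch below shows the weight-sharing mechanism that underlies one-shot and most one-stage methods in its simplest form: a supernet holds one set of weights per candidate operation on each layer, and every training step samples a random single path and updates only the weights along that path. The data, operation set, and dimensions are toy placeholders.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

K, C, LAYERS = 3, 16, 4

def make_edge_ops():
    return nn.ModuleList([nn.Conv2d(C, C, 3, padding=1),
                          nn.Conv2d(C, C, 1),
                          nn.Identity()])

class Supernet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, C, 3, padding=1)
        self.layers = nn.ModuleList([make_edge_ops() for _ in range(LAYERS)])  # shared weights
        self.head = nn.Linear(C, 10)

    def forward(self, x, path):                  # path: one operation index per layer
        x = self.stem(x)
        for ops, k in zip(self.layers, path):
            x = F.relu(ops[k](x))
        return self.head(x.mean(dim=(2, 3)))

net = Supernet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
for step in range(10):
    path = [random.randrange(K) for _ in range(LAYERS)]       # sample one sub-network
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    loss = F.cross_entropy(net(x, path), y)
    opt.zero_grad()
    loss.backward()
    opt.step()                                   # only the sampled path receives gradients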
6.3. One-shot/Weight-sharing
One-shot ≠ one-stage. Note that one-shot is not exactly equivalent to one-stage. As mentioned above, we divide the NAS studies into one- and two-stage methods according to their flow (Figure 18), whereas whether a NAS algorithm belongs to the one-shot methods depends on whether the candidate architectures share the same weights (Figure 19). However, we observe that most one-stage NAS methods are based on the one-shot paradigm.
Another line of work drew on the forgetting problem of continual learning and proposed a novel search-based architecture selection (NSAS) loss function. The authors applied the proposed method to RandomNAS [180] and GDAS [154], where the experimental results demonstrated that the method effectively reduces multi-model forgetting and boosts the predictive ability of the supernet as an evaluator.

Decoupled optimization. The second category of one-shot NAS methods [209, 232, 229, 217] decouples the optimization of architecture and weights into two sequential phases: 1) training the supernet, and 2) using the trained supernet as a predictive performance estimator of different architectures to select the most promising architecture.

In terms of the supernet training phase, the supernet cannot be directly trained as a regular neural network, because its weights are also deeply coupled [197]. Yu et al. [11] experimentally showed that the weight-sharing strategy degrades the individual architecture's performance and negatively impacts the real performance ranking of the candidate architectures. To reduce the weight coupling, many one-shot NAS methods [197, 209, 14, 214] adopt a random sampling policy, which randomly samples an architecture from the supernet, activating and optimizing only the weights of this architecture. Meanwhile, RandomNAS [180] demonstrates that a random search policy is a competitive baseline method. Although some one-shot approaches [154, 13, 155, 132, 131] have adopted the strategy of sampling and training only one path of the supernet at a time, they sample the path according to the RL controller [13], Gumbel Softmax [154, 155, 131], or the BinaryConnect network [132], which instead highly couples the architecture and supernet weights. SMASH [14] adopts an auxiliary hypernetwork to generate weights for randomly sampled architectures. Similarly, Zhang et al. [214] proposed a computation graph representation, and used the graph hypernetwork (GHN) to predict the weights for all possible architectures faster and more accurately than regular hypernetworks [14]. However, through a careful experimental analysis conducted to understand the weight-sharing strategy's mechanism, Bender et al. [232] showed that neither a hypernetwork nor an RL controller is required to find the optimal architecture. They proposed a path dropout strategy to alleviate the problem of weight coupling. During supernet training, each path of the supernet is randomly dropped with gradually increasing probability. GreedyNAS [216] adopts a multipath sampling strategy to train the greedy supernet. This strategy focuses on more potentially suitable paths, and is demonstrated to effectively achieve a fairly high rank correlation of candidate architectures compared with RS.

The second phase involves the selection of the most promising architecture from the trained supernet, which is the primary purpose of most NAS tasks. Both SMASH [14] and [232] randomly selected a set of architectures from the supernet and ranked them according to their performance. SMASH can obtain the validation performance of all selected architectures at the cost of a single training run for each architecture, as these architectures are assigned the weights generated by the hypernetwork. Besides, the authors in [232] observed that the architectures with a smaller symmetrized KL divergence value are more likely to perform better. This can be expressed as follows:

D_{SKL} = D_{KL}(p \| q) + D_{KL}(q \| p), \quad \text{s.t.} \quad D_{KL}(p \| q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}    (11)

where (p1, ..., pn) and (q1, ..., qn) indicate the predictions of the sampled architecture and the one-shot model, respectively, and n indicates the number of classes. The cost of calculating the KL value is very small; in [232], only 64 random training data examples were used. Meanwhile, EA is also a promising search solution [197, 216]. For instance, SPOS [197] uses EA to search for architectures from the supernet. It is more efficient than the EA methods introduced in Section 4, because each sampled architecture only performs inference. The self-evaluated template network (SETN) [209] proposes an estimator to predict the probability of each architecture having a lower validation loss. The experimental results show that SETN can potentially find an architecture with better performance than RS-based methods [232, 14].
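A small sketch of the symmetrized-KL criterion in Eq. 11, computed here on toy random predictions: p and q stand for the class-probability outputs of the sampled architecture and of the one-shot model on the same small batch, and a smaller D_SKL suggests a more promising architecture. This is an illustration of the formula, not the evaluation pipeline of [232].

import torch
import torch.nn.functional as F

def symmetrized_kl(p, q, eps=1e-8):
    """D_SKL = KL(p||q) + KL(q||p), averaged over a batch of predictions (Eq. 11)."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    kl_pq = (p * (p / q).log()).sum(dim=1)
    kl_qp = (q * (q / p).log()).sum(dim=1)
    return (kl_pq + kl_qp).mean()

# Toy stand-ins for predictions on the same 64 examples with n = 10 classes.
p = F.softmax(torch.randn(64, 10), dim=1)   # sampled (stand-alone) architecture
q = F.softmax(torch.randn(64, 10), dim=1)   # one-shot (supernet) model
score = symmetrized_kl(p, q)                # smaller value -> more promising architecture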
rate is defined as lr = w1 × 0.1 + w2 × 0.2 + w3 × 0.3. Meanwhile, FBNetv3 [235] jointly searches both architectures and the corresponding training recipes (i.e., hyperparameters). The architectures are represented with one-hot categorical variables and integral (min-max normalized) range variables, and the representation is fed to an encoder network to generate the architecture embedding. Then, the concatenation of the architecture embedding and the training hyperparameters is used to train the accuracy predictor, which is applied to search for promising architectures and hyperparameters at a later stage.

6.5. Resource-aware NAS
Early NAS studies [12, 15, 26] pay more attention to searching for neural architectures that achieve higher performance (e.g., classification accuracy), regardless of the associated resource consumption (i.e., the number of GPUs and time required). Therefore, many follow-up studies investigate resource-aware algorithms that trade off performance against the resource budget. To do so, these algorithms add the computational cost to the loss function as a resource constraint. The algorithms differ in the type of computational cost considered, which may be 1) the parameter size; 2) the number of multiply-accumulate (MAC) operations; 3) the number of floating-point operations (FLOPs); or 4) the real latency. For example, MONAS [236] considers MAC as the constraint, and as MONAS uses a policy-based reinforcement-learning algorithm to search, the constraint can be directly added to the reward function. MnasNet [130] proposes a customized weighted product to approximate a Pareto-optimal solution:

\underset{m}{\text{maximize}} \;\; ACC(m) \times \left[ \frac{LAT(m)}{T} \right]^{w}    (12)

where LAT(m) denotes the measured inference latency of the model m on the target device, T is the target latency, and w is the weight variable defined as:

w = \begin{cases} \alpha, & \text{if } LAT(m) \le T \\ \beta, & \text{otherwise} \end{cases}    (13)

where the recommended value for both α and β is −0.07. In terms of a differentiable neural architecture search (DNAS) framework, the constraint (i.e., the loss function) should be differentiable. For this purpose, FBNet [131] uses a latency lookup table model to estimate the overall latency of a network based on the runtime of each operator, and its loss function combines the cross-entropy loss with the latency term. In SNAS [155], the cost of time for the generated child network is linear in the one-hot random variables, such that the differentiability of the resource constraint is ensured.
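To make Eqs. 12 and 13 concrete, the following minimal sketch evaluates the customized weighted product for two hypothetical candidate models; the accuracy and latency numbers are placeholders, with α = β = −0.07 as recommended.

def mnasnet_reward(acc, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """ACC(m) * (LAT(m)/T)^w with w = alpha if LAT(m) <= T else beta (Eqs. 12-13)."""
    w = alpha if latency_ms <= target_ms else beta
    return acc * (latency_ms / target_ms) ** w

# Two hypothetical candidates under an 80 ms latency target:
fast = mnasnet_reward(acc=0.75, latency_ms=60.0, target_ms=80.0)    # ~0.765
slow = mnasnet_reward(acc=0.76, latency_ms=120.0, target_ms=80.0)   # ~0.739
# The slower model's higher accuracy is discounted for exceeding the target latency.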
7. Open Problems and Future Directions

This section discusses several open problems of the existing AutoML methods and proposes some future research directions.

7.1. Flexible Search Space
As summarized in Section 4, there are various search spaces whose primitive operations can be roughly classified into pooling and convolution. Some spaces even use a more complex module (e.g., MBConv [130]) as the primitive operation. Although these search spaces have been proven effective for generating well-performing neural architectures, all of them are based on human knowledge and experience, which inevitably introduces human bias and hence still does not break away from the human design paradigm. AutoML-Zero [289] uses very simple mathematical operations (e.g., cos, sin, mean, std) as the primitive operations of the search space to minimize human bias, and applies EA to discover complete machine learning algorithms. AutoML-Zero successfully designs two-layer neural networks based on these basic mathematical operations. Although the networks searched by AutoML-Zero are much simpler than both human-designed and NAS-designed networks, the experimental results show the potential to discover a new model design paradigm with minimal human design. Therefore, the design of a more general, flexible search space that is free of human bias, and the discovery of novel neural architectures based on this search space, would be challenging and advantageous.
7.2. Exploring More Areas
As described in Section 6, the models designed by NAS algorithms have achieved results comparable to those of manually designed models on image classification tasks (CIFAR-10 and ImageNet). Additionally, many recent studies have applied NAS to other CV tasks (Table 5).

However, in terms of NLP tasks, most NAS studies have only conducted experiments on the PTB dataset. Besides, some NAS studies have attempted to apply NAS to other NLP tasks (shown in Table 5). However, Figure 20 shows that, even on the PTB dataset, there is still a big gap in performance between the NAS-designed models ([13, 17, 12]) and human-designed models (GPT-2 [290], FRAGE AWD-LSTM-MoS [4], adversarial AWD-LSTM-MoS [291], and Transformer-XL [5]). Therefore, the NAS community still has a long way to go to achieve results comparable to those of the models designed by experts on NLP tasks.

Besides the CV and NLP tasks, Table 5 also shows that the AutoML technique has been applied to other tasks, such as network compression, federated learning, image captioning, recommendation systems, and the search for loss and activation functions. Therefore, these interesting studies have indicated the potential of AutoML to be applied in more areas.
Table 5 (Category / Application / References)

Computer Vision (CV)
  Medical Image Recognition [237, 238, 239]
  Object Detection [240, 241, 242, 243, 244, 245]
  Semantic Segmentation [246, 129, 247, 248, 249, 250, 251]
  Person Re-identification [252]
  Super-Resolution [253, 254, 255]
  Image Restoration [256]
  Generative Adversarial Network (GAN) [257, 258, 259, 260]
  Disparity Estimation [261]
  Video Task [262, 263, 264, 265]

Natural Language Processing (NLP)
  Translation [266]
  Language Modeling [267]
  Entity Recognition [267]
  Text Classification [268]
  Sequential Labeling [268]
  Keyword Spotting [269]

Others
  Network Compression [270, 271, 272, 273, 274, 275, 276, 277]
  Graph Neural Network (GNN) [278]
  Federated Learning [279, 280]
  Loss Function Search [281, 282]
  Activation Function Search [283]
  Image Captioning [284, 285]
  Text to Speech (TTS) [202]
  Recommendation System [286, 287, 288]
Figure 20: State-of-the-art models on the PTB dataset. The lower the perplexity, the better the performance. The green bar represents the automatically generated model, and the yellow bar represents the model designed by human experts. Best viewed in color.

7.3. Interpretability
Although AutoML algorithms can find promising configuration settings more efficiently than humans, there is a lack of scientific evidence to illustrate why the found settings perform better. For example, in BlockQNN [16], it is unclear why the NAS algorithm tends to select the concatenation operation.

7.4. Reproducibility
A major challenge with ML is reproducibility, and AutoML is no exception, especially for NAS, because most existing NAS algorithms still have many parameters that need to be set manually at the implementation level, yet the original papers do not cover much of this detail. For instance, Yang et al. [123] experimentally demonstrated that the seed plays an important role in NAS experiments; however, most NAS studies do not mention the seed used in their experiments. Besides, considerable resource consumption is another obstacle to reproduction. In this context, several NAS-Bench datasets have been proposed, such as NAS-Bench-101 [224], NAS-Bench-201 [225], and NAS-Bench-NLP [226]. These datasets allow NAS researchers to focus on the design of optimization algorithms without wasting much time on model evaluation.
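The appeal of these tabular benchmarks is that an architecture's training statistics can simply be looked up instead of re-computed, so search strategies can be compared in seconds. The minimal sketch below illustrates this usage pattern with a hypothetical in-memory table and toy random search; the loader, record fields, and numbers are stand-ins, not the actual NAS-Bench-101/201 API:

```python
import random

# Hypothetical pre-computed benchmark: architecture encoding -> metrics.
# Real NAS-Bench datasets ship such tables for tens of thousands of architectures.
benchmark = {
    "arch_000": {"val_acc": 0.901, "test_acc": 0.893, "train_seconds": 1520},
    "arch_001": {"val_acc": 0.914, "test_acc": 0.909, "train_seconds": 1742},
    "arch_002": {"val_acc": 0.887, "test_acc": 0.880, "train_seconds": 1311},
}

def random_search(benchmark, budget=2, seed=0):
    """Toy search loop: 'evaluating' an architecture is just a table lookup,
    so comparing search strategies costs no GPU time."""
    rng = random.Random(seed)
    sampled = rng.sample(sorted(benchmark), k=budget)
    return max(sampled, key=lambda a: benchmark[a]["val_acc"])

best = random_search(benchmark, budget=2, seed=42)
print(best, benchmark[best]["test_acc"])  # report test accuracy of the chosen architecture
```

Because every candidate was trained ahead of time by the benchmark's authors, different optimization algorithms can be compared under identical conditions and with fixed seeds at negligible cost.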
7.5. Robustness
NAS has been proven effective in searching for promising architectures on many open datasets (e.g., CIFAR-10 and ImageNet). These datasets are generally used for research;
therefore, most of the images are well-labeled. However,
in real-world situations, the data inevitably contain noise
(e.g., mislabeling and inadequate information). Even
worse, the data might be modified to be adversarial with
carefully designed noises. Deep learning models can be
easily fooled by adversarial data, and so can NAS.
So far, a few studies [293, 294, 295, 296] have attempted to boost the robustness of NAS against adversarial data. Guo et al. [294] experimentally explored the intrinsic impact of network architectures on network robustness against adversarial attacks, and observed that densely connected architectures tend to be more robust. They also found that the flow of solution procedure (FSP) matrix [297] is a good indicator of network robustness, i.e., the lower the FSP matrix loss, the more robust the network. Chen et al. [295] proposed a robust loss function for effectively alleviating the performance degradation under symmetric label noise. The authors of [296] adopted EA to search for robust architectures in a well-designed and vast search space, where various adversarial attacks are used as the fitness function for evaluating the robustness of neural architectures.
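As a rough illustration of the FSP-based indicator mentioned above (our own sketch following the FSP definition of [297], not the code of [294]), the FSP matrix is the spatially averaged channel-wise inner product between the feature maps of two layers, and an FSP loss measures the distance between two sets of such matrices (e.g., computed from clean versus perturbed inputs):

```python
import torch

def fsp_matrix(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """FSP matrix between feature maps of shapes (B, C1, H, W) and (B, C2, H, W):
    a (B, C1, C2) Gram-style matrix averaged over the H*W spatial positions."""
    b, c1, h, w = f1.shape
    c2 = f2.shape[1]
    f1 = f1.reshape(b, c1, h * w)
    f2 = f2.reshape(b, c2, h * w)
    return torch.bmm(f1, f2.transpose(1, 2)) / (h * w)

def fsp_loss(feats_a, feats_b):
    """Mean squared distance between corresponding FSP matrices of two
    collections of layer-pair feature maps."""
    losses = [
        torch.mean((fsp_matrix(a1, a2) - fsp_matrix(b1, b2)) ** 2)
        for (a1, a2), (b1, b2) in zip(feats_a, feats_b)
    ]
    return torch.stack(losses).mean()

# Toy usage with random feature maps standing in for real activations.
pair_a = [(torch.randn(2, 16, 8, 8), torch.randn(2, 32, 8, 8))]
pair_b = [(torch.randn(2, 16, 8, 8), torch.randn(2, 32, 8, 8))]
print(fsp_loss(pair_a, pair_b))
```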
8. Conclusions
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classi-
fication with deep convolutional neural networks, in: P. L.
Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, K. Q.
Weinberger (Eds.), Advances in Neural Information
Processing Systems 25: 26th Annual Conference on Neural
Information Processing Systems 2012. Proceedings of a
meeting held December 3-6, 2012, Lake Tahoe, Nevada,
United States, 2012,
pp. 1106–1114.
URL https://proceedings.neurips.cc/paper/2012/hash/
c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for
image recognition, in: 2016 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016, IEEE Computer Society, 2016, pp.
770–778. doi:10.1109/CVPR.2016.90.
URL https://doi.org/10.1109/CVPR.2016.90
[3] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You
only look once: Unified, real-time object detection, in:
2016 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30,
2016, IEEE
Computer Society, 2016, pp. 779–788. doi:10.1109/CVPR.2016.
91.
URL https://doi.org/10.1109/CVPR.2016.91
[4] C. Gong, D. He, X. Tan, T. Qin, L. Wang, T. Liu, FRAGE:
frequency-agnostic word representation, in: S. Bengio, H. M.
Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
R. Garnett (Eds.), Advances in Neural Information Processing
Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 2018, pp. 1341–1352.
URL https://proceedings.neurips.cc/paper/2018/hash/e555ebe0ce426f7f9b2bef0706315e0c-Abstract.html
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language
models beyond a fixed- length context, in: Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, Association for Computational
Linguistics, Florence, Italy, 2019, pp. 2978– 2988.
doi:10.18653/v1/P19-1285.
URL https://www.aclweb.org/anthology/P19-1285
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg,
L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge,
International Journal of Computer Vision (IJCV) 115 (3) (2015)
211–252. doi:10.1007/s11263-015-0816-y.
[7] K. Simonyan, A. Zisserman, Very deep convolutional
networks for large-scale image recognition, in: Y. Bengio,
Y. LeCun (Eds.), 3rd International Conference on Learning
Represen- tations, ICLR 2015, San Diego, CA, USA, May 7-
9, 2015, Conference Track Proceedings, 2015.
URL http://arxiv.org/abs/1409.1556
[8] M. Zoller, M. F. Huber, Benchmark and survey of
automated machine learning frameworks, arXiv preprint
arXiv:1904.12054.
[9] Q. Yao, M. Wang, Y. Chen, W. Dai, H. Yi-Qi, L. Yu-Feng,
T. Wei-Wei, Y. Qiang, Y. Yang, Taking human out of
learning applications: A survey on automated machine
learning, arXiv preprint arXiv:1810.13306.
[10] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture
search: A survey, arXiv preprint arXiv:1808.05377.
[11] K. Yu, C. Sciuto, M. Jaggi, C. Musat, M. Salzmann,
Evaluating the search phase of neural architecture
search, in: 8th Inter- national Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020, OpenReview.net, 2020.
URL https://openreview.net/forum?id=H1loF2NFwr
[12] B. Zoph, Q. V. Le, Neural architecture search with reinforce-
ment learning, in: 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26,
2017, Conference Track Proceedings, OpenReview.net,
2017.
URL https://openreview.net/forum?id=r1Ue8Hcxg
[13] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, J. Dean, Efficient
neural architecture search via parameter sharing, in: J. G.
Dy,
A. Krause (Eds.), Proceedings of the 35th International Con-
ference on Machine Learning, ICML 2018,
Stockholmsma¨ssan, Stockholm, Sweden, July 10-15, 2018,
Vol. 80 of Proceedings of Machine Learning Research,
PMLR, 2018, pp. 4092–4101.
URL http://proceedings.mlr.press/v80/pham18a.html
[14] A. Brock, T. Lim, J. M. Ritchie, N. Weston, SMASH: one-shot
model architecture search through hypernetworks, in: 6th
Inter- national Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings, OpenReview.net, 2018.
URL https://openreview.net/forum?id=rydeCEhs-
[15] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning
transferable architectures for scalable image recognition,
in: 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT,
USA, June 18-22, 2018, IEEE Computer Society, 2018, pp.
8697–8710. doi:10.1109/CVPR.2018.00907.
URL
http://openaccess.thecvf.com/content_cvpr_2018/
html/Zoph_Learning_Transferable_Architectures_CVPR_
2018_paper.html
[16] Z. Zhong, J. Yan, W. Wu, J. Shao, C. Liu, Practical block-wise neural network architecture generation, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society, 2018, pp. 2423–2432. doi:10.1109/CVPR.2018.00257.
URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zhong_Practical_Block-Wise_Neural_CVPR_2018_paper.html
[17] H. Liu, K. Simonyan, Y. Yang, DARTS: differentiable archi-
tecture search, in: 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
2019, OpenReview.net, 2019.
URL https://openreview.net/forum?id=S1eYHoC5FX
[18] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li,
L. Fei-Fei, A. Yuille, J. Huang, K. Murphy, Progressive
neural architecture search (2018) 19–34.
[19] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, K. Kavukcuoglu,
Hierarchical representations for efficient architecture search,
in: 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30
- May 3, 2018, Conference Track Proceedings,
OpenReview.net, 2018.
URL https://openreview.net/forum?id=BJQRKzbA-
[20] T. Chen, I. J. Goodfellow, J. Shlens, Net2net: Accelerating
learning via knowledge transfer, in: Y. Bengio, Y. LeCun
(Eds.), 4th International Conference on Learning Represen-
tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016,
Conference Track Proceedings, 2016.
URL http://arxiv.org/abs/1511.05641
[21] T. Wei, C. Wang, Y. Rui, C. W. Chen, Network morphism, in:
M. Balcan, K. Q. Weinberger (Eds.), Proceedings of the
33nd International Conference on Machine Learning, ICML
2016, New York City, NY, USA, June 19-24, 2016, Vol. 48 of
JMLR Workshop and Conference Proceedings, JMLR.org,
2016, pp. 564–572.
URL http://proceedings.mlr.press/v48/wei16.html
[22] H. Jin, Q. Song, X. Hu, Auto-keras: An efficient neural
architecture search system, in: A. Teredesai, V. Kumar,
Y. Li, R. Rosales, E. Terzi, G. Karypis (Eds.), Proceed-
ings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, KDD 2019, Anchor-
age, AK, USA, August 4-8, 2019, ACM, 2019, pp. 1946–
1956. doi:10.1145/3292500.3330648.
URL https://doi.org/10.1145/3292500.3330648
[23] B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural
network architectures using reinforcement learning, in: 5th
International Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings, OpenReview.net, 2017.
URL https://openreview.net/forum?id=S1c2cvqee
[24] K. O. Stanley, R. Miikkulainen, Evolving neural networks
through augmenting topologies, Evolutionary computation
10 (2) (2002) 99–127.
[25] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan,
Q. V. Le, A. Kurakin, Large-scale evolution of image classifiers,
in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th Inter-
national Conference on Machine Learning, ICML 2017, Sydney,
NSW, Australia, 6-11 August 2017, Vol. 70 of Proceedings of
Machine Learning Research, PMLR, 2017, pp. 2902–2911.
URL http://proceedings.mlr.press/v70/real17a.html
[26] E. Real, A. Aggarwal, Y. Huang, Q. V. Le, Regularized evo-
lution for image classifier architecture search, in: The
Thirty- Third AAAI Conference on Artificial Intelligence,
AAAI 2019, The Thirty-First Innovative Applications of
Artificial Intel- ligence Conference, IAAI 2019, The Ninth
AAAI Sympo- sium on Educational Advances in Artificial
Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27
- February 1, 2019,
AAAI Press, 2019, pp. 4780–4789. doi:10.1609/aaai.v33i01.
33014780.
URL https://doi.org/10.1609/aaai.v33i01.33014780
[27] T. Elsken, J. H. Metzen, F. Hutter, Efficient multi-objective
neural architecture search via lamarckian evolution, in: 7th
International Conference on Learning Representations, ICLR
2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net,
2019.
URL https://openreview.net/forum?id=ByME42AqK7
[28] M. Suganuma, S. Shirakawa, T. Nagao, A genetic
programming approach to designing convolutional neural
network architec- tures, in: J. Lang (Ed.), Proceedings of the
Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp. 5369–5373. doi:10.24963/ijcai.2018/755.
URL https://doi.org/10.24963/ijcai.2018/755
[29] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al., Evolving deep neural networks (2019) 293–312.
[30] L. Xie, A. L. Yuille, Genetic CNN, in: IEEE International
Conference on Computer Vision, ICCV 2017, Venice,
Italy, October 22-29, 2017, IEEE Computer Society, 2017,
pp. 1388–
1397. doi:10.1109/ICCV.2017.154.
URL https://doi.org/10.1109/ICCV.2017.154
[31] K. Ahmed, L. Torresani, Maskconnect: Connectivity
learning by gradient descent (2018) 349–365.
[32] R. Shin, C. Packer, D. Song, Differentiable neural
network architecture search.
[33] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, F.
Hutter, Towards automatically-tuned neural networks
(2016) 58–65.
[34] A. Zela, A. Klein, S. Falkner, F. Hutter, Towards
automated deep learning: Efficient joint neural
architecture and hyperpa- rameter search, arXiv preprint
arXiv:1807.06906.
[35] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, Fast
bayesian optimization of machine learning hyperparameters
on large datasets, in: A. Singh, X. J. Zhu (Eds.),
Proceedings of the 20th International Conference on
Artificial Intelligence and Statistics, AISTATS 2017, 20-22
April 2017, Fort Lauderdale, FL, USA, Vol. 54 of
Proceedings of Machine Learning Research, PMLR, 2017,
pp. 528–536.
URL http://proceedings.mlr.press/v54/klein17a.html
[36] S. Falkner, A. Klein, F. Hutter, Practical hyperparameter
optimization for deep learning.
[37] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-
based optimization for general algorithm configuration, in:
In- ternational conference on learning and intelligent
optimization, 2011, pp. 507–523.
[38] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient
hyperparameter optimization at scale, in: J. G. Dy, A.
Krause (Eds.), Proceedings of the 35th International
Conference on Machine Learning, ICML 2018,
Stockholmsma¨ssan, Stockholm, Sweden, July 10-15, 2018,
Vol. 80 of Proceedings of Machine Learning Research,
PMLR, 2018, pp. 1436–1445.
URL http://proceedings.mlr.press/v80/falkner18a.html
[39] J. Bergstra, D. Yamins, D. D. Cox, Making a science of
model search: Hyperparameter optimization in hundreds
of dimen- sions for vision architectures, in: Proceedings of
the 30th Inter- national Conference on Machine Learning,
ICML 2013, Atlanta, GA, USA, 16-21 June 2013, Vol. 28 of
JMLR Workshop and Conference Proceedings,
JMLR.org, 2013, pp. 115–123.
URL http://proceedings.mlr.press/v28/bergstra13.html
[40] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian,
C. Xu, CARS: continuous evolution for efficient neural
ar- chitecture search, in: 2020 IEEE/CVF Conference on
Com- puter Vision and Pattern Recognition, CVPR 2020,
Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp.
1826–1835. doi:10.1109/CVPR42600.2020.00190.
URL https://doi.org/10.1109/CVPR42600.2020.00190
[41] K. Maziarz, M. Tan, A. Khorlin, M. Georgiev, A.
Gesmundo, Evolutionary-neural hybrid agents for
architecture search, arXiv preprint arXiv:1811.09828.
[42] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu,
X. Wang, Reinforced evolutionary neural architecture
search, arXiv preprint arXiv:1808.00193.
[43] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, M. Zhang,
Surrogate-assisted evolutionary deep learning using an end-
to- end random forest-based performance predictor, IEEE
Trans- actions on Evolutionary Computation.
[44] B. Wang, Y. Sun, B. Xue, M. Zhang, A hybrid differential evolution approach to designing deep convolutional neural networks for image classification, in: Australasian Joint Conference on Artificial Intelligence, Springer, 2018, pp. 237–250.
[45] M. Wistuba, A. Rawat, T. Pedapati, A survey on neural architecture search, arXiv preprint arXiv:1905.01392.
[46] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen,
X. Wang, A comprehensive survey of neural architecture search:
Challenges and solutions (2020). arXiv:2006.02903.
[47] R. Elshawi, M. Maher, S. Sakr, Automated machine learn-
ing: State-of-the-art and open challenges, arXiv preprint
arXiv:1906.02287.
[48] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based
learning applied to document recognition, Proceedings of
the IEEE 86 (11) (1998) 2278–2324.
[49] A. Krizhevsky, V. Nair, G. Hinton, The cifar-10 dataset, online:
http://www. cs. toronto. edu/kriz/cifar. html.
[50] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Ima-
genet: A large-scale hierarchical image database, in: 2009
IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2009), 20-25 June 2009,
Miami, Florida, USA, IEEE Computer Society, 2009, pp.
248–255. doi:10.1109/CVPR.2009.5206848.
URL https://doi.org/10.1109/CVPR.2009.5206848
[51] J. Yang, X. Sun, Y.-K. Lai, L. Zheng, M.-M. Cheng, Recog-
nition from web data: a progressive filtering approach,
IEEE Transactions on Image Processing 27 (11) (2018) 5303–
5315.
[52] X. Chen, A. Shrivastava, A. Gupta, NEIL: extracting visual
knowledge from web data, in: IEEE International
Conference on Computer Vision, ICCV 2013, Sydney,
Australia, December 1-8, 2013, IEEE Computer Society, 2013,
pp. 1409–1416. doi: 10.1109/ICCV.2013.178.
URL https://doi.org/10.1109/ICCV.2013.178
[53] Y. Xia, X. Cao, F. Wen, J. Sun, Well begun is half done:
Generating high-quality seeds for automatic image dataset
construction from web, in: European Conference on Computer
Vision, Springer, 2014, pp. 387–400.
[54] N. H. Do, K. Yanai, Automatic construction of action datasets
using web videos with density-based cluster analysis and outlier
detection, in: Pacific-Rim Symposium on Image and Video
Technology, Springer, 2015, pp. 160–172.
[55] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig,
J. Philbin, L. Fei-Fei, The unreasonable effectiveness of noisy
data for fine-grained recognition, in: European Conference
on Computer Vision, Springer, 2016, pp. 301–320.
[56] P. D. Vo, A. Ginsca, H. Le Borgne, A. Popescu, Harnessing
noisy web images for deep representation, Computer Vision
and Image Understanding 164 (2017) 68–81.
[57] B. Collins, J. Deng, K. Li, L. Fei-Fei, Towards scalable dataset
construction: An active learning approach, in: European con-
ference on computer vision, Springer, 2008, pp. 86–98.
[58] Y. Roh, G. Heo, S. E. Whang, A survey on data collection for
machine learning: a big data-ai integration perspective, IEEE
Transactions on Knowledge and Data Engineering.
[59] D. Yarowsky, Unsupervised word sense disambiguation rivaling
supervised methods, in: 33rd Annual Meeting of the
Association for Computational Linguistics, Association for
Computational Linguistics, Cambridge, Massachusetts, USA,
1995, pp. 189–
196. doi:10.3115/981658.981684.
URL https://www.aclweb.org/anthology/P95-1026
[60] I. Triguero, J. A. Sa´ez, J. Luengo, S. Garc´ıa, F. Herrera, On the
characterization of noise filters for self-training semi-supervised
in nearest neighbor classification, Neurocomputing 132 (2014)
30–41.
[61] M. F. A. Hady, F. Schwenker, Combining committee-based semi-
supervised learning and active learning, Journal of Computer
Science and Technology 25 (4) (2010) 681–698.
[62] A. Blum, T. Mitchell, Combining labeled and unlabeled data
with co-training, in: Proceedings of the eleventh annual
con- ference on Computational learning theory, ACM,
1998, pp. 92–100.
[63] Y. Zhou, S. Goldman, Democratic co-learning, in: Tools
with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE
Interna- tional Conference on, IEEE, 2004, pp. 594–602.
[64] X. Chen, A. Gupta, Webly supervised learning of
convolutional networks, in: 2015 IEEE International
Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, IEEE
Computer Society, 2015, pp. 1431–1439. doi:10.1109/ICCV. URL http://openaccess.thecvf.com/content_CVPR_2019/
2015.168. html/Karras_A_Style-Based_Generator_Architecture_for_
URL https://doi.org/10.1109/ICCV.2015.168 Generative_Adversarial_Networks_CVPR_2019_paper.html
[65] Z. Xu, S. Huang, Y. Zhang, D. Tao, Augmenting strong [79] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data cleaning:
super- vision using web data for fine-grained Overview and emerging challenges, in: F. O¨ zcan, G.
categorization, in: 2015 IEEE International Conference on Koutrika,
Computer Vision, ICCV 2015, Santiago, Chile, December S. Madden (Eds.), Proceedings of the 2016 International Con-
7-13, 2015, IEEE Computer ference on Management of Data, SIGMOD Conference 2016,
Society, 2015, pp. 2524–2532. doi:10.1109/ICCV.2015.290. San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016,
URL https://doi.org/10.1109/ICCV.2015.290 pp. 2201–2206. doi:10.1145/2882903.2912574.
[66] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. URL https://doi.org/10.1145/2882903.2912574
Kegelmeyer, Smote: synthetic minority over-sampling [80] M. Jesmeen, J. Hossen, S. Sayeed, C. Ho, K. Tawsif, A. Rahman,
technique, Journal of artificial intelligence research 16
(2002) 321–357.
[67] H. Guo, H. L. Viktor, Learning from imbalanced data
sets with boosting and data generation: the databoost-im
approach, ACM Sigkdd Explorations Newsletter 6 (1)
(2004) 30–39.
[68] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J.
Schul- man, J. Tang, W. Zaremba, Openai gym, arXiv
preprint arXiv:1606.01540.
[69] Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, X. Chu, Irs: A
large synthetic indoor robotics stereo dataset for disparity
and surface normal estimation, arXiv preprint
arXiv:1912.09678.
[70] N. Ruiz, S. Schulter, M. Chandraker, Learning to
simulate, in: 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May
6-9, 2019, OpenRe- view.net, 2019.
URL https://openreview.net/forum?id=HJgkx2Aqt7
[71] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.
Warde- Farley, S. Ozair, A. C. Courville, Y. Bengio,
Generative adversarial nets, in: Z. Ghahramani, M.
Welling, C. Cortes,
N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in
Neural Information Processing Systems 27: Annual
Conference on Neural Information Processing Systems
2014, December 8-13 2014, Montreal, Quebec, Canada,
2014, pp. 2672–2680.
URL
https://proceedings.neurips.cc/paper/2014/hash/
5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[72] T.-H. Oh, R. Jaroensri, C. Kim, M. Elgharib, F. Durand,
W. T. Freeman, W. Matusik, Learning-based video
motion magnification, in: Proceedings of the European
Conference on Computer Vision (ECCV), 2018, pp. 633–
648.
[73] L. Sixt, Rendergan: Generating realistic labeled data–with an
application on decoding bee tags, unpublished Bachelor
Thesis, Freie Universit¨at, Berlin.
[74] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A.
Ham- mers, D. A. Dickie, M. V. Herna´ndez, J. Wardlaw, D.
Rueckert, Gan augmentation: Augmenting training data
using generative adversarial networks, arXiv preprint
arXiv:1810.10863.
[75] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park,
Y. Kim, Data synthesis based on generative adversarial net-
works, Proceedings of the VLDB Endowment 11 (10) (2018)
1071–1083.
[76] L. Xu, K. Veeramachaneni, Synthesizing tabular data using
gen- erative adversarial networks, arXiv preprint
arXiv:1811.11264.
[77] D. Donahue, A. Rumshisky, Adversarial text generation
without reinforcement learning, arXiv preprint
arXiv:1810.06640.
[78] T. Karras, S. Laine, T. Aila, A style-based generator ar-
chitecture for generative adversarial networks, in: IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA, USA, June
16-20, 2019,
Computer Vision Foundation / IEEE, 2019, pp. 4401–
4410.
doi:10.1109/CVPR.2019.00453.
E. Arif, A survey on cleaning dirty data using machine [96] S. C. Wong, A. Gatt, V. Stamatescu, M. D. McDonnell, Under-
learning paradigm for big data analytics, Indonesian Journal of standing data augmentation for classification: when to warp?,
Electrical Engineering and Computer Science 10 (3) (2018)
1234–1243.
[81] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N.
Tang,
Y. Ye, KATARA: A data cleaning system powered by
knowledge bases and crowdsourcing, in: T. K. Sellis, S.
B. Davidson,
Z. G. Ives (Eds.), Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data, Melbourne,
Victoria, Australia, May 31 - June 4, 2015, ACM, 2015, pp.
1247–1261. doi:10.1145/2723372.2749431.
URL https://doi.org/10.1145/2723372.2749431
[82] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska,
T. Milo, E. Wu, Sampleclean: Fast and reliable analytics on
dirty data., IEEE Data Eng. Bull. 38 (3) (2015) 59–75.
[83] S. Krishnan, M. J. Franklin, K. Goldberg, J. Wang, E. Wu, Ac-
tiveclean: An interactive data cleaning framework for
modern machine learning, in: F. O¨ zcan, G. Koutrika, S.
Madden (Eds.),
Proceedings of the 2016 International Conference on Manage-
ment of Data, SIGMOD Conference 2016, San Francisco, CA,
USA, June 26 - July 01, 2016, ACM, 2016, pp. 2117–2120.
doi:10.1145/2882903.2899409.
URL https://doi.org/10.1145/2882903.2899409
[84] S. Krishnan, M. J. Franklin, K. Goldberg, E. Wu, Boostclean:
Automated error detection and repair for machine learning,
arXiv preprint arXiv:1711.01299.
[85] S. Krishnan, E. Wu, Alphaclean: Automatic generation of data
cleaning pipelines, arXiv preprint arXiv:1904.11827.
[86] I. Gemp, G. Theocharous, M. Ghavamzadeh, Automated data
cleansing through meta-learning, in: S. P. Singh, S.
Markovitch (Eds.), Proceedings of the Thirty-First AAAI
Conference on Artificial Intelligence, February 4-9, 2017, San
Francisco, Cali- fornia, USA, AAAI Press, 2017, pp. 4760–4761.
URL http://aaai.org/ocs/index.php/IAAI/IAAI17/paper/
view/14236
[87] I. F. Ilyas, Effective data cleaning with continuous evaluation.,
IEEE Data Eng. Bull. 39 (2) (2016) 38–46.
[88] M. Mahdavi, F. Neutatz, L. Visengeriyeva, Z. Abedjan,
Towards automated data cleaning workflows, Machine
Learning 15 (2019) 16.
[89] T. DeVries, G. W. Taylor, Improved regularization of con-
volutional neural networks with cutout, arXiv preprint
arXiv:1708.04552.
[90] H. Zhang, M. Ciss´e, Y. N. Dauphin, D. Lopez-Paz, mixup:
Beyond empirical risk minimization, in: 6th International Con-
ference on Learning Representations, ICLR 2018, Vancouver,
BC, Canada, April 30 - May 3, 2018, Conference Track Pro-
ceedings, OpenReview.net, 2018.
URL https://openreview.net/forum?id=r1Ddp1-Rb
[91] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving,
C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft,
Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer,
B. Cook, I. Fern´andez, F.-M. De Rainville, C.-H. Weng,
A. Ayala-Acevedo, R. Meudec, M. Laporte, et al., imgaug,
https://github.com/aleju/imgaug, online; accessed 01-Feb-
2020 (2020).
[92] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, A. A.
Kalinin, Albumentations: fast and flexible image augmenta-
tions, arXiv preprint arXiv:1809.06839.
[93] A. Miko-lajczyk, M. Grochowski, Data augmentation for im-
proving deep learning in image classification problem, in:
2018 international interdisciplinary PhD workshop (IIPhDW),
IEEE, 2018, pp. 117–122.
[94] A. Miko-lajczyk, M. Grochowski, Style transfer-based image
synthesis as an efficient regularization technique in deep learn-
ing, in: 2019 24th International Conference on Methods and
Models in Automation and Robotics (MMAR), IEEE, 2019,
pp. 42–47.
[95] A. Antoniou, A. Storkey, H. Edwards, Data augmentation gen-
erative adversarial networks, arXiv preprint arXiv:1711.04340.
arXiv preprint arXiv:1609.08764. autoaug- ment, in: 8th International Conference on Learning
[97] Z. Xie, S. I. Wang, J. Li, D. L´evy, A. Nie, D. Jurafsky, A. Y. Represen- tations, ICLR 2020, Addis Ababa, Ethiopia, April 26-
Ng, Data noising as smoothing in neural network 30, 2020,
language models, in: 5th International Conference on OpenReview.net, 2020.
Learning Representations, ICLR 2017, Toulon, France, URL https://openreview.net/forum?id=ByxdUySKvS
April 24-26, 2017, Conference Track Proceedings, [109] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin,
OpenReview.net, 2017. W. Ouyang, Online hyper-parameter learning for auto-
URL https://openreview.net/forum?id=H1VyHY9gg augmentation strategy, in: 2019 IEEE/CVF International Con-
[98] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. ference on Computer Vision, ICCV 2019, Seoul, Korea (South),
Norouzi, October 27 - November 2, 2019, IEEE, 2019, pp. 6578–
Q. V. Le, Qanet: Combining local convolution with 6587. doi:10.1109/ICCV.2019.00668.
global self-attention for reading comprehension, in: 6th
International Conference on Learning Representations,
ICLR 2018, Vancou- ver, BC, Canada, April 30 - May 3,
2018, Conference Track Proceedings, OpenReview.net,
2018.
URL https://openreview.net/forum?id=B14TlG-RW
[99] E. Ma, Nlp augmentation,
https://github.com/makcedward/nlpaug (2019).
[100] E. D. Cubuk, B. Zoph, D. Man´e, V. Vasudevan, Q. V.
Le, Autoaugment: Learning augmentation strategies
from data, in: IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2019, Long Beach, CA,
USA, June 16-20, 2019, Computer Vision Foundation /
IEEE, 2019, pp. 113–123.
doi:10.1109/CVPR.2019.00020.
URL
http://openaccess.thecvf.com/content_CVPR_
2019/html/Cubuk_AutoAugment_Learning_Augmentation_
Strategies_From_Data_CVPR_2019_paper.html
[101] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M.
Robertson,
Y. Yang, Dada: Differentiable automatic data
augmentation, arXiv preprint arXiv:2003.03780.
[102] R. Hataya, J. Zdenek, K. Yoshizoe, H. Nakayama, Faster
au- toaugment: Learning augmentation strategies using
backpropa- gation, arXiv preprint arXiv:1911.06987.
[103] S. Lim, I. Kim, T. Kim, C. Kim, S. Kim, Fast autoaugment,
in:
H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alch´e-Buc,
E. B. Fox, R. Garnett (Eds.), Advances in Neural
Information Processing Systems 32: Annual Conference
on Neural Informa- tion Processing Systems 2019,
NeurIPS 2019, December 8-14,
2019, Vancouver, BC, Canada, 2019, pp. 6662–6672.
URL
https://proceedings.neurips.cc/paper/2019/hash/
6add07cf50424b14fdf649da87843d01-Abstract.html
[104] A. Naghizadeh, M. Abavisani, D. N. Metaxas, Greedy
autoaug- ment, arXiv preprint arXiv:1908.00704.
[105] D. Ho, E. Liang, X. Chen, I. Stoica, P. Abbeel,
Population based augmentation: Efficient learning of
augmentation policy schedules, in: K. Chaudhuri, R.
Salakhutdinov (Eds.), Proceed- ings of the 36th
International Conference on Machine Learning, ICML
2019, 9-15 June 2019, Long Beach, California, USA,
Vol. 97 of Proceedings of Machine Learning Research,
PMLR, 2019, pp. 2731–2741.
URL http://proceedings.mlr.press/v97/ho19b.html
[106] T. Niu, M. Bansal, Automatically learning data
augmenta- tion policies for dialogue tasks, in:
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Pro- cessing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Com-
putational Linguistics, Hong Kong, China, 2019, pp. 1317–
1323. doi:10.18653/v1/D19-1132.
URL https://www.aclweb.org/anthology/D19-1132
[107] M. Geng, K. Xu, B. Ding, H. Wang, L. Zhang, Learning
data augmentation policies using augmented random
search, arXiv preprint arXiv:1811.04768.
[108] X. Zhang, Q. Wang, J. Zhang, Z. Zhong, Adversarial
URL https://doi.org/10.1109/ICCV.2019.00668 [128] X. Chen, L. Xie, J. Wu, Q. Tian, Progressive differentiable
[110] T. C. LingChen, A. Khonsari, A. Lashkari, M. R. Nazari, architecture search: Bridging the depth gap between search and
J. S. Sambee, M. A. Nascimento, Uniformaugment: A search- evaluation, in: 2019 IEEE/CVF International Conference on
free probabilistic data augmentation approach, arXiv preprint Computer Vision, ICCV 2019, Seoul, Korea (South), October
arXiv:2003.14348. 27 - November 2, 2019, IEEE, 2019, pp. 1294–1303. doi:
[111] H. Motoda, H. Liu, Feature selection, extraction and construc- 10.1109/ICCV.2019.00138.
tion, Communication of IICM (Institute of Information and URL https://doi.org/10.1109/ICCV.2019.00138
Computing Machinery, Taiwan) Vol 5 (67-72) (2002) 2. [129] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille,
[112] M. Dash, H. Liu, Feature selection for classification, Intelligent F. Li, Auto-deeplab: Hierarchical neural architecture search
data analysis 1 (1-4) (1997) 131–156. for semantic image segmentation, in: IEEE Conference on
[113] M. J. Pazzani, Constructive induction of cartesian product Computer Vision and Pattern Recognition, CVPR 2019, Long
attributes, in: Feature Extraction, Construction and Selection, Beach, CA, USA, June 16-20, 2019, Computer Vision Founda-
Springer, 1998, pp. 341–354. tion / IEEE, 2019, pp. 82–92. doi:10.1109/CVPR.2019.00017.
[114] Z. Zheng, A comparison of constructing different types of new URL http://openaccess.thecvf.com/content_CVPR_
feature for decision tree learning, in: Feature Extraction, Con- 2019/html/Liu_Auto-DeepLab_Hierarchical_Neural_
struction and Selection, Springer, 1998, pp. 239–255. Architecture_Search_for_Semantic_Image_Segmentation_
[115] J. Gama, Functional trees, Machine Learning 55 (3) (2004) CVPR_2019_paper.html
219–250. [130] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler,
[116] H. Vafaie, K. De Jong, Evolutionary feature space transfor- A. Howard, Q. V. Le, Mnasnet: Platform-aware neural archi-
mation, in: Feature Extraction, Construction and Selection, tecture search for mobile, in: IEEE Conference on Computer
Springer, 1998, pp. 307–323. Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,
[117] P. Sondhi, Feature construction methods: a survey, sifaka. cs. USA, June 16-20, 2019, Computer Vision Foundation / IEEE,
uiuc. edu 69 (2009) 70–71. 2019, pp. 2820–2828. doi:10.1109/CVPR.2019.00293.
[118] D. Roth, K. Small, Interactive feature space construction using URL http://openaccess.thecvf.com/content_CVPR_2019/
semantic information, in: Proceedings of the Thirteenth Con- html/Tan_MnasNet_Platform-Aware_Neural_Architecture_
ference on Computational Natural Language Learning (CoNLL- Search_for_Mobile_CVPR_2019_paper.html
2009), Association for Computational Linguistics, Boulder, [131] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian,
Colorado, 2009, pp. 66–74. P. Vajda, Y. Jia, K. Keutzer, Fbnet: Hardware-aware efficient
URL https://www.aclweb.org/anthology/W09-1110 convnet design via differentiable neural architecture search,
[119] Q. Meng, D. Catchpoole, D. Skillicom, P. J. Kennedy, Rela- in: IEEE Conference on Computer Vision and Pattern
tional autoencoder for feature extraction, in: 2017 International Recognition, CVPR 2019, Long Beach, CA, USA, June
Joint Conference on Neural Networks (IJCNN), IEEE, 2017, 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp.
pp. 364–371. 10734–10742. doi:10.1109/CVPR.2019.01099.
[120] O. Irsoy, E. Alpaydın, Unsupervised feature extraction with URL http://openaccess.thecvf.com/content_CVPR_
autoencoder trees, Neurocomputing 258 (2017) 63–73. 2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_
[121] C. Cortes, V. Vapnik, Support-vector networks, Machine learn- Design_via_Differentiable_Neural_Architecture_Search_
ing 20 (3) (1995) 273–297. CVPR_2019_paper.html
[122] N. S. Altman, An introduction to kernel and nearest-neighbor [132] H. Cai, L. Zhu, S. Han, Proxylessnas: Direct neural architecture
nonparametric regression, The American Statistician 46 (3) search on target task and hardware, in: 7th International
(1992) 175–185. Conference on Learning Representations, ICLR 2019, New
[123] A. Yang, P. M. Esperan¸ca, F. M. Carlucci, NAS evaluation is Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.
frustratingly hard, in: 8th International Conference on Learning URL https://openreview.net/forum?id=HylVB3AqYm
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26- [133] M. Courbariaux, Y. Bengio, J. David, Binaryconnect: Training
30, 2020, OpenReview.net, 2020. deep neural networks with binary weights during propagations,
URL https://openreview.net/forum?id=HygrdpVKvr in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
[124] F. Chollet, Xception: Deep learning with depthwise separable R. Garnett (Eds.), Advances in Neural Information Processing
convolutions, in: 2017 IEEE Conference on Computer Vision Systems 28: Annual Conference on Neural Information
and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, Processing Systems 2015, December 7-12, 2015, Montreal,
July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800–1807. Quebec, Canada, 2015, pp. 3123–3131.
doi:10.1109/CVPR.2017.195. URL https://proceedings.neurips.cc/paper/2015/hash/
URL https://doi.org/10.1109/CVPR.2017.195 3e15cc11f979ed25912dff5b0669f2cd-Abstract.html
[125] F. Yu, V. Koltun, Multi-scale context aggregation by dilated [134] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a
convolutions, in: Y. Bengio, Y. LeCun (Eds.), 4th International neural network, arXiv preprint arXiv:1503.02531.
Conference on Learning Representations, ICLR 2016, San Juan, [135] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable
Puerto Rico, May 2-4, 2016, Conference Track Proceedings, are features in deep neural networks?, in: Z. Ghahramani,
2016. M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger
URL http://arxiv.org/abs/1511.07122 (Eds.), Advances in Neural Information Processing Systems 27:
[126] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Annual Conference on Neural Information Processing Systems
in: 2018 IEEE Conference on Computer Vision and Pattern 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014,
Recognition, CVPR 2018, Salt Lake City, UT, USA, June pp. 3320–3328.
18-22, 2018, IEEE Computer Society, 2018, pp. 7132–7141. URL https://proceedings.neurips.cc/paper/2014/hash/
doi:10.1109/CVPR.2018.00745. 375c71349b295fbe2dcdca9206f20a06-Abstract.html
URL http://openaccess.thecvf.com/content_cvpr_2018/ [136] T. Wei, C. Wang, C. W. Chen, Modularized morphing of neural
html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_ networks, arXiv preprint arXiv:1701.03281.
paper.html [137] H. Cai, T. Chen, W. Zhang, Y. Yu, J. Wang, Efficient
[127] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, architecture search by network transformation, in: S. A.
Densely
connected convolutional networks, in: 2017 IEEE Conference McIlraith, K. Q. Weinberger (Eds.), Proceedings of the
on Computer Vision and Pattern Recognition, CVPR 2017, Thirty-Second AAAI Conference on Artificial Intelligence,
Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, (AAAI-18), the 30th innovative Applications of Artificial
2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243. Intelligence (IAAI-18), and the 8th AAAI Symposium on
URL https://doi.org/10.1109/CVPR.2017.243 Educational Advances in Artificial Intelligence (EAAI-18),
New Orleans, Louisiana, USA, February 2-7, 2018, AAAI URL https://www.aclweb.org/anthology/H94-1020
Press, 2018, pp. 2787–2794. [153] C. He, H. Ye, L. Shen, T. Zhang, Milenas: Efficient neu-
URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/ ral architecture search via mixed-level reformulation, in: 2020
paper/view/16755 IEEE/CVF Conference on Computer Vision and Pattern Recog-
[138] A. Kwasigroch, M. Grochowski, M. Mikolajczyk, Deep neu- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE,
ral network architecture search using network morphism, in: 2020, pp. 11990–11999. doi:10.1109/CVPR42600.2020.01201.
2019 24th International Conference on Methods and Models in URL https://doi.org/10.1109/CVPR42600.2020.01201
Automation and Robotics (MMAR), IEEE, 2019, pp. 30–35. [154] X. Dong, Y. Yang, Searching for a robust neural architecture
[139] H. Cai, J. Yang, W. Zhang, S. Han, Y. Yu, Path-level network in four GPU hours, in: IEEE Conference on Computer Vision
transformation for efficient architecture search, in: J. G. Dy, and Pattern Recognition, CVPR 2019, Long Beach, CA, USA,
A. Krause (Eds.), Proceedings of the 35th International Con- June 16-20, 2019, Computer Vision Foundation / IEEE,
2019,
ference on Machine Learning, ICML 2018, Stockholmsma¨ssan, pp. 1761–1770. doi:10.1109/CVPR.2019.00186.
Stockholm, Sweden, July 10-15, 2018, Vol. 80 of Proceedings URL http://openaccess.thecvf.com/content_CVPR_2019/
of Machine Learning Research, PMLR, 2018, pp. 677–686. html/Dong_Searching_for_a_Robust_Neural_Architecture_
URL http://proceedings.mlr.press/v80/cai18a.html in_Four_GPU_Hours_CVPR_2019_paper.html
[140] J. Fang, Y. Sun, K. Peng, Q. Zhang, Y. Li, W. Liu, X. Wang, [155] S. Xie, H. Zheng, C. Liu, L. Lin, SNAS: stochastic neural ar-
Fast neural network adaptation via parameter remapping and chitecture search, in: 7th International Conference on Learning
architecture search, in: 8th International Conference on Learn- Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
ing Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2019, OpenReview.net, 2019.
26-30, 2020, OpenReview.net, 2020. URL https://openreview.net/forum?id=rylqooRqK7
URL https://openreview.net/forum?id=rklTmyBKPH [156] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, K. Keutzer,
[141] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, Mixed precision quantization of convnets via differentiable
E. Choi, Morphnet: Fast & simple resource-constrained neural architecture search (2018). arXiv:1812.00090.
structure learning of deep networks, in: 2018 IEEE Conference [157] E. Jang, S. Gu, B. Poole, Categorical reparameterization with
on Computer Vision and Pattern Recognition, CVPR 2018, gumbel-softmax, in: 5th International Conference on Learning
Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Society, 2018, pp. 1586–1595. doi:10.1109/CVPR.2018.00171. Conference Track Proceedings, OpenReview.net, 2017.
URL http://openaccess.thecvf.com/content_cvpr_2018/ URL https://openreview.net/forum?id=rkE3y85ee
html/Gordon_MorphNet_Fast CVPR_2018_paper.html [158] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution:
[142] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for A continuous relaxation of discrete random variables, in: 5th
convolutional neural networks, in: K. Chaudhuri, R. Salakhut- International Conference on Learning Representations, ICLR
dinov (Eds.), Proceedings of the 36th International Conference 2017, Toulon, France, April 24-26, 2017, Conference Track
on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, Proceedings, OpenReview.net, 2017.
California, USA, Vol. 97 of Proceedings of Machine Learning URL https://openreview.net/forum?id=S1jE5L5gl
Research, PMLR, 2019, pp. 6105–6114. [159] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, Z. Li,
URL http://proceedings.mlr.press/v97/tan19a.html Darts+: Improved differentiable architecture search with early
[143] J. F. Miller, S. L. Harding, Cartesian genetic programming, stopping, arXiv preprint arXiv:1909.06035.
in: Proceedings of the 10th annual conference companion on [160] K. Kandasamy, W. Neiswanger, J. Schneider, B. Po´czos, E. P.
Genetic and evolutionary computation, ACM, 2008, pp. 2701– Xing, Neural architecture search with bayesian optimisation
2726. and optimal transport, in: S. Bengio, H. M. Wallach,
[144] J. F. Miller, S. L. Smith, Redundancy and computational H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett
efficiency in cartesian genetic programming, IEEE Transactions (Eds.), Advances in Neural Information Processing Systems 31:
on Evolutionary Computation 10 (2) (2006) 167–174. Annual Conference on Neural Information Processing Systems
[145] F. Gruau, Cellular encoding as a graph grammar, in: IEEE 2018, NeurIPS 2018, December 3-8, 2018, Montr´eal, Canada,
Colloquium on Grammatical Inference: Theory, Applications 2018, pp. 2020–2029.
& Alternatives, 1993. URL https://proceedings.neurips.cc/paper/2018/hash/
[146] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, f33ba15effa5c10e873bf3842afb46a6-Abstract.html
M. Jaderberg, M. Lanctot, D. Wierstra, Convolution by evolu- [161] R. Negrinho, G. Gordon, Deeparchitect: Automatically design-
tion: Differentiable pattern producing networks, in: Proceed- ing and training deep architectures (2017). arXiv:1704.08792.
ings of the Genetic and Evolutionary Computation Conference [162] R. Negrinho, M. R. Gormley, G. J. Gordon, D. Patil, N. Le,
2016, ACM, 2016, pp. 109–116. D. Ferreira, Towards modular and programmable architecture
[147] M. Kim, L. Rigazio, Deep clustered convolutional kernels, in: search, in: H. M. Wallach, H. Larochelle, A. Beygelzimer,
Feature Extraction: Modern Questions and Challenges, 2015, F. d’Alch´e-Buc, E. B. Fox, R. Garnett (Eds.), Advances in
pp. 160–172. Neural Information Processing Systems 32: Annual Conference
[148] J. K. Pugh, K. O. Stanley, Evolving multimodal controllers on Neural Information Processing Systems 2019, NeurIPS
with hyperneat, in: Proceedings of the 15th annual conference 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
on Genetic and evolutionary computation, ACM, 2013, pp. 13715–13725.
735–742. URL https://proceedings.neurips.cc/paper/2019/hash/
[149] H. Zhu, Z. An, C. Yang, K. Xu, E. Zhao, Y. Xu, Eena: Efficient 4ab50afd6dcc95fcba76d0fe04295632-Abstract.html
evolution of neural architecture (2019). arXiv:1905.07320. [163] G. Dikov, J. Bayer, Bayesian learning of neural network ar-
[150] R. J. Williams, Simple statistical gradient-following algorithms chitectures, in: K. Chaudhuri, M. Sugiyama (Eds.), The 22nd
for connectionist reinforcement learning, Machine learning 8 (3- International Conference on Artificial Intelligence and Statis-
4) (1992) 229–256. tics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan,
[151] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Vol. 89 of Proceedings of Machine Learning Research, PMLR,
Proximal policy optimization algorithms, arXiv preprint 2019, pp. 730–738.
arXiv:1707.06347. URL http://proceedings.mlr.press/v89/dikov19a.html
[152] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, [164] C. White, W. Neiswanger, Y. Savani, Bananas: Bayesian op-
A. Bies, M. Ferguson, K. Katz, B. Schasberger, The Penn timization with neural architectures for neural architecture
Treebank: Annotating predicate argument structure, in: Hu- search (2019). arXiv:1910.11858.
man Language Technology: Proceedings of a Workshop held at [165] M. Wistuba, Bayesian optimization combined with incremen-
Plainsboro, New Jersey, March 8-11, 1994, 1994. tal evaluation for neural network architecture optimization,
in: Proceedings of the International Workshop on Automatic [179] Y. Geifman, R. El-Yaniv, Deep active learning with a
Selection, Configuration and Composition of Machine Learning neural architecture search, in: H. M. Wallach, H. Larochelle,
Algorithms, 2017. A. Beygelzimer, F. d’Alch´e-Buc, E. B. Fox, R. Garnett (Eds.),
[166] J. Perez-Rua, M. Baccouche, S. Pateux, Efficient progressive Advances in Neural Information Processing Systems 32:
neural architecture search, in: British Machine Vision Confer- Annual Conference on Neural Information Processing Systems
ence 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
BMVA Press, 2018, p. 150. Canada, 2019, pp. 5974–5984.
URL http://bmvc2018.org/contents/papers/0291.pdf URL https://proceedings.neurips.cc/paper/2019/hash/
[167] C. E. Rasmussen, Gaussian processes in machine learning, b59307fdacf7b2db12ec4bd5ca1caba8-Abstract.html
Lecture Notes in Computer Science (2003) 63–71. [180] L. Li, A. Talwalkar, Random search and reproducibility for
[168] J. Bergstra, R. Bardenet, Y. Bengio, B. K´egl, Algorithms neural architecture search, in: A. Globerson, R. Silva (Eds.),
for hyper-parameter optimization, in: J. Shawe-Taylor, R. S. Proceedings of the Thirty-Fifth Conference on Uncertainty in
Zemel, P. L. Bartlett, F. C. N. Pereira, K. Q. Weinberger Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25,
(Eds.), Advances in Neural Information Processing Systems 2019, Vol. 115 of Proceedings of Machine Learning Research,
24: 25th Annual Conference on Neural Information Processing AUAI Press, 2019, pp. 367–377.
Systems 2011. Proceedings of a meeting held 12-14 December URL http://proceedings.mlr.press/v115/li20c.html
2011, Granada, Spain, 2011, pp. 2546–2554. [181] J. Bergstra, Y. Bengio, Random search for hyper-parameter
URL https://proceedings.neurips.cc/paper/2011/hash/ optimization, Journal of machine learning research 13 (Feb)
86e8f7ab32cfd12577bc2619bc635690-Abstract.html (2012) 281–305.
[169] R. Luo, F. Tian, T. Qin, E. Chen, T. Liu, Neural architecture [182] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al., A practical guide to
optimization, in: S. Bengio, H. M. Wallach, H. Larochelle, support vector classification.
K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in [183] J. Y. Hesterman, L. Caucci, M. A. Kupinski, H. H. Barrett, L. R.
Neural Information Processing Systems 31: Annual Conference Furenlid, Maximum-likelihood estimation with a contracting-
on Neural Information Processing Systems 2018, NeurIPS 2018, grid search algorithm, IEEE transactions on nuclear science
December 3-8, 2018, Montr´eal, Canada, 2018, pp. 7827–7838. 57 (3) (2010) 1077–1084.
URL https://proceedings.neurips.cc/paper/2018/hash/ [184] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A.
Talwalkar,
933670f1ac8ba969f32989c312faba75-Abstract.html Hyperband: A novel bandit-based approach to hyperparameter
[170] M. M. Ian Dewancker, S. Clark, Bayesian optimization primer. optimization, The Journal of Machine Learning Research 18 (1)
URL https://app.sigopt.com/static/pdf/SigOpt_ (2017) 6765–6816.
Bayesian_Optimization_Primer.pdf [185] M. Feurer, F. Hutter, Hyperparameter Optimization, Springer
[171] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Fre- International Publishing, Cham, 2019, pp. 3–33.
itas, Taking the human out of the loop: A review of bayesian URL https://doi.org/10.1007/978-3-030-05318-5_1
optimization, Proceedings of the IEEE 104 (1) (2016) 148–175. [186] T. Yu, H. Zhu, Hyper-parameter optimization: A review of
[172] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sun- algorithms and applications, arXiv preprint arXiv:2003.05689.
daram, M. M. A. Patwary, Prabhat, R. P. Adams, Scalable [187] Y. Bengio, Gradient-based optimization of hyperparameters,
bayesian optimization using deep neural networks, in: F. R. Neural computation 12 (8) (2000) 1889–1900.
Bach, D. M. Blei (Eds.), Proceedings of the 32nd International [188] J. Domke, Generic methods for optimization-based modeling,
Conference on Machine Learning, ICML 2015, Lille, France, in: Artificial Intelligence and Statistics, 2012, pp. 318–326.
6-11 July 2015, Vol. 37 of JMLR Workshop and Conference [189] D. Maclaurin, D. Duvenaud, R. P. Adams, Gradient-based hy-
Proceedings, JMLR.org, 2015, pp. 2171–2180. perparameter optimization through reversible learning, in: F. R.
URL http://proceedings.mlr.press/v37/snoek15.html Bach, D. M. Blei (Eds.), Proceedings of the 32nd International
[173] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian Conference on Machine Learning, ICML 2015, Lille, France,
optimization of machine learning algorithms, in: P. L. Bartlett, 6-11 July 2015, Vol. 37 of JMLR Workshop and Conference
F. C. N. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger Proceedings, JMLR.org, 2015, pp. 2113–2122.
(Eds.), Advances in Neural Information Processing Systems URL http://proceedings.mlr.press/v37/maclaurin15.html
25: 26th Annual Conference on Neural Information Processing [190] F. Pedregosa, Hyperparameter optimization with approximate
Systems 2012. Proceedings of a meeting held December 3-6, gradient, in: M. Balcan, K. Q. Weinberger (Eds.), Proceedings
2012, Lake Tahoe, Nevada, United States, 2012, pp. 2960–2968. of the 33nd International Conference on Machine Learning,
URL https://proceedings.neurips.cc/paper/2012/hash/ ICML 2016, New York City, NY, USA, June 19-24, 2016, Vol. 48
05311655a15b75fab86956663e1819cd-Abstract.html of JMLR Workshop and Conference Proceedings, JMLR.org,
[174] J. Stork, M. Zaefferer, T. Bartz-Beielstein, Improving 2016, pp. 737–746.
neuroevo-
lution efficiency by surrogate model-based optimization with URL http://proceedings.mlr.press/v48/pedregosa16.html
phenotypic distance kernels (2019). arXiv:1902.03419. [191] L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward
[175] K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, M. A. and reverse gradient-based hyperparameter optimization,
Osborne,
Raiders of the lost architecture: Kernels for bayesian optimiza- in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th
tion in conditional parameter spaces (2014). arXiv:1409.4011. International Conference on Machine Learning, ICML 2017,
[176] A. Camero, H. Wang, E. Alba, T. Ba¨ck, Bayesian neural archi- Sydney, NSW, Australia, 6-11 August 2017, Vol. 70 of
tecture search using a training-free performance metric (2020). Proceedings of Machine Learning Research, PMLR, 2017, pp.
arXiv:2001.10726. 1165–1173.
[177] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto- URL http://proceedings.mlr.press/v70/franceschi17a.
weka: combined selection and hyperparameter optimization of html
classification algorithms, in: I. S. Dhillon, Y. Koren, R. Ghani, [192] K. Chandra, E. Meijer, S. Andow, E. Arroyo-Fang, I. Dea,
T. E. Senator, P. Bradley, R. Parekh, J. He, R. L. Grossman, J. George, M. Grueter, B. Hosmer, S. Stumpos, A. Tempest,
R. Uthurusamy (Eds.), The 19th ACM SIGKDD International et al., Gradient descent: The ultimate optimizer, arXiv preprint
Conference on Knowledge Discovery and Data Mining, KDD arXiv:1909.13371.
2013, Chicago, IL, USA, August 11-14, 2013, ACM, 2013, pp. [193] D. P. Kingma, J. Ba, Adam: A method for stochastic opti-
847–855. doi:10.1145/2487575.2487629. mization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International
URL https://doi.org/10.1145/2487575.2487629 Conference on Learning Representations, ICLR 2015, San Diego,
[178] A. Hundt, V. Jain, G. D. Hager, sharpDARTS: Faster and
more accurate differentiable architecture search, Tech. rep. URL http://arxiv.org/abs/1412.6980
(2019). [194] P. Chrabaszcz, I. Loshchilov, F. Hutter, A downsampled variant
of imagenet as an alternative to the CIFAR datasets, CoRR Conference on Computer Vision, ICCV 2019, Seoul, Korea
abs/1707.08819. arXiv:1707.08819. (South), October 27 - November 2, 2019, IEEE, 2019, pp. 6508–
URL http://arxiv.org/abs/1707.08819 6517. doi:10.1109/ICCV.2019.00661.
[195] Y. Hu, Y. Yu, W. Tu, Q. Yang, Y. Chen, W. Dai, Multi- URL https://doi.org/10.1109/ICCV.2019.00661
fidelity automatic hyper-parameter tuning via transfer series [209] X. Dong, Y. Yang, One-shot neural architecture search via self-
expansion, in: The Thirty-Third AAAI Conference on Arti- evaluated template network, in: 2019 IEEE/CVF International
ficial Intelligence, AAAI 2019, The Thirty-First Innovative Conference on Computer Vision, ICCV 2019, Seoul, Korea
Applications of Artificial Intelligence Conference, IAAI 2019, (South), October 27 - November 2, 2019, IEEE, 2019, pp. 3680–
The Ninth AAAI Symposium on Educational Advances in 3689. doi:10.1109/ICCV.2019.00378.
Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, Jan- URL https://doi.org/10.1109/ICCV.2019.00378
uary 27 - February 1, 2019, AAAI Press, 2019, pp. 3846–3853. [210] H. Zhou, M. Yang, J. Wang, W. Pan, Bayesnas: A bayesian
doi:10.1609/aaai.v33i01.33013846. approach for neural architecture search, in: K. Chaudhuri,
URL https://doi.org/10.1609/aaai.v33i01.33013846 R. Salakhutdinov (Eds.), Proceedings of the 36th International
[196] C. Wong, N. Houlsby, Y. Lu, A. Gesmundo, Transfer learning Conference on Machine Learning, ICML 2019, 9-15 June 2019,
with neural automl, in: S. Bengio, H. M. Wallach, H. Larochelle, Long Beach, California, USA, Vol. 97 of Proceedings of Machine
K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Learning Research, PMLR, 2019, pp. 7603–7613.
Neural Information Processing Systems 31: Annual Conference URL http://proceedings.mlr.press/v97/zhou19e.html
on Neural Information Processing Systems 2018, NeurIPS 2018, [211] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, H. Xiong,
December 3-8, 2018, Montr´eal, Canada, 2018, pp. 8366–8375. PC-DARTS: partial channel connections for memory-efficient
URL https://proceedings.neurips.cc/paper/2018/hash/ architecture search, in: 8th International Conference on Learn-
bdb3c278f45e6734c35733d24299d3f4-Abstract.html ing Representations, ICLR 2020, Addis Ababa, Ethiopia, April
[197] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyan- 26-30, 2020, OpenReview.net, 2020.
tha, J. Liu, D. Marculescu, Single-path nas: Designing URL https://openreview.net/forum?id=BJlS634tPr
hardware-efficient convnets in less than 4 hours, arXiv preprint [212] G. Li, G. Qian, I. C. Delgadillo, M. Mu¨ller, A. K. Thabet,
arXiv:1904.02877. B. Ghanem, SGAS: sequential greedy architecture search, in:
[198] K. Eggensperger, F. Hutter, H. H. Hoos, K. Leyton-Brown, 2020 IEEE/CVF Conference on Computer Vision and Pat-
Surrogate benchmarks for hyperparameter optimization., in: tern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19,
MetaSel@ ECAI, 2014, pp. 24–31. 2020, IEEE, 2020, pp. 1617–1627. doi:10.1109/CVPR42600.
[199] C. Wang, Q. Duan, W. Gong, A. Ye, Z. Di, C. Miao, An 2020.00169.
evaluation of adaptive surrogate modeling based optimization URL https://doi.org/10.1109/CVPR42600.2020.00169
with two benchmark problems, Environmental Modelling & [213] M. Zhang, H. Li, S. Pan, X. Chang, S. W. Su, Overcom-
Software 60 (2014) 167–179. ing multi-model forgetting in one-shot NAS with diversity
[200] K. Eggensperger, F. Hutter, H. H. Hoos, K. Leyton-Brown, maximization, in: 2020 IEEE/CVF Conference on Computer
Efficient benchmarking of hyperparameter optimizers via Vision and Pattern Recognition, CVPR 2020, Seattle, WA,
surrogates, in: B. Bonet, S. Koenig (Eds.), Proceedings of USA, June 13-19, 2020, IEEE, 2020, pp. 7806–7815. doi:
the Twenty-Ninth AAAI Conference on Artificial Intelligence, 10.1109/CVPR42600.2020.00783.
January 25-30, 2015, Austin, Texas, USA, AAAI Press, 2015, URL https://doi.org/10.1109/CVPR42600.2020.00783
pp. 1114–1120. [214] C. Zhang, M. Ren, R. Urtasun, Graph hypernetworks for neu-
URL http://www.aaai.org/ocs/index.php/AAAI/AAAI15/ ral architecture search, in: 7th International Conference on
paper/view/9993 Learning Representations, ICLR 2019, New Orleans, LA, USA,
[201] K. K. Vu, C. D’Ambrosio, Y. Hamadi, L. Liberti, Surrogate- May 6-9, 2019, OpenReview.net, 2019.
based methods for black-box optimization, International Trans- URL https://openreview.net/forum?id=rkgW0oA9FX
actions in Operational Research 24 (3) (2017) 393–424. [215] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen,
[202] R. Luo, X. Tan, R. Wang, T. Qin, E. Chen, T.-Y. Liu, Mobilenetv2: Inverted residuals and linear bottlenecks, in:
Semi-supervised neural architecture search (2020). arXiv: 2018 IEEE Conference on Computer Vision and Pattern
2002.10389. Recognition, CVPR 2018, Salt Lake City, UT, USA, June
[203] A. Klein, S. Falkner, J. T. Springenberg, F. Hutter, Learning 18-22, 2018, IEEE Computer Society, 2018, pp. 4510–4520.
curve prediction with bayesian neural networks, in: 5th Inter- doi:10.1109/CVPR.2018.00474.
national Conference on Learning Representations, ICLR 2017, URL http://openaccess.thecvf.com/content_cvpr_2018/
Toulon, France, April 24-26, 2017, Conference Track Proceed- html/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_
ings, OpenReview.net, 2017. paper.html
URL https://openreview.net/forum?id=S11KBYclx [216] S. You, T. Huang, M. Yang, F. Wang, C. Qian, C. Zhang,
[204] B. Deng, J. Yan, D. Lin, Peephole: Predicting network perfor- Greedynas: Towards fast one-shot NAS with greedy supernet,
mance before training, arXiv preprint arXiv:1712.03351. in: 2020 IEEE/CVF Conference on Computer Vision and Pat-
[205] T. Domhan, J. T. Springenberg, F. Hutter, Speeding up auto- tern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19,
matic hyperparameter optimization of deep neural networks by 2020, IEEE, 2020, pp. 1996–2005. doi:10.1109/CVPR42600.
extrapolation of learning curves, in: Q. Yang, M. J. Wooldridge 2020.00207.
(Eds.), Proceedings of the Twenty-Fourth International Joint URL https://doi.org/10.1109/CVPR42600.2020.00207
Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, [217] H. Cai, C. Gan, T. Wang, Z. Zhang, S. Han, Once-for-all:
Argentina, July 25-31, 2015, AAAI Press, 2015, pp. 3460–3468. Train one network and specialize it for efficient deployment,
URL http://ijcai.org/Abstract/15/487 in: 8th International Conference on Learning Representations,
[206] M. Mahsereci, L. Balles, C. Lassner, P. Hennig, Early stopping ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Open-
without a validation set, arXiv preprint arXiv:1703.09580. Review.net, 2020.
[207] D. Han, J. Kim, J. Kim, Deep pyramidal residual networks, URL https://openreview.net/forum?id=HylxE1HKwS
in: 2017 IEEE Conference on Computer Vision and Pattern [218] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. L. Yuille, J. Yang,
Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, Atomnas: Fine-grained end-to-end neural architecture search,
2017, IEEE Computer Society, 2017, pp. 6307–6315. doi: in: 8th International Conference on Learning Representations,
10.1109/CVPR.2017.668. ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Open-
URL https://doi.org/10.1109/CVPR.2017.668 Review.net, 2020.
[208] J. Cui, P. Chen, R. Li, S. Liu, X. Shen, J. Jia, Fast and practical URL https://openreview.net/forum?id=BylQSxHFwr
neural architecture search, in: 2019 IEEE/CVF International [219] S. Hu, S. Xie, H. Zheng, C. Liu, J. Shi, X. Liu, D. Lin,
55
DSNAS: direct neural architecture search without param- maximization, in: 2020 IEEE/CVF Conference on Computer
eter retraining, in: 2020 IEEE/CVF Conference on Com- Vision and Pattern Recognition, CVPR 2020, Seattle, WA,
puter Vision and Pattern Recognition, CVPR 2020, Seattle, USA, June 13-19, 2020, IEEE, 2020, pp. 7806–7815. doi:
WA, USA, June 13-19, 2020, IEEE, 2020, pp. 12081–12089. 10.1109/CVPR42600.2020.00783.
doi:10.1109/CVPR42600.2020.01210. URL https://doi.org/10.1109/CVPR42600.2020.00783
URL https://doi.org/10.1109/CVPR42600.2020.01210 [232] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, Q. V. Le,
[220] J. Fang, Y. Sun, Q. Zhang, Y. Li, W. Liu, X. Wang, Densely Understanding and simplifying one-shot architecture search,
connected search space for more flexible neural architecture in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th Inter-
search, in: 2020 IEEE/CVF Conference on Computer Vision national Conference on Machine Learning, ICML 2018, Stock-
and Pattern Recognition, CVPR 2020, Seattle, WA, USA, holmsma¨ssan, Stockholm, Sweden, July 10-15, 2018, Vol. 80
June 13-19, 2020, IEEE, 2020, pp. 10625–10634. doi:10.1109/ of Proceedings of Machine Learning Research, PMLR, 2018,
CVPR42600.2020.01064. pp. 549–558.
URL https://doi.org/10.1109/CVPR42600.2020.01064 URL http://proceedings.mlr.press/v80/bender18a.html
[221] A. Wan, X. Dai, P. Zhang, Z. He, Y. Tian, S. Xie, B. Wu, [233] X. Dong, M. Tan, A. W. Yu, D. Peng, B. Gabrys, Q. V.
M. Yu, T. Xu, K. Chen, P. Vajda, J. E. Gonzalez, Fbnetv2: Le, Autohas: Differentiable hyper-parameter and architecture
Differentiable neural architecture search for spatial and channel search (2020). arXiv:2006.03656.
dimensions, in: 2020 IEEE/CVF Conference on Computer [234] A. Klein, F. Hutter, Tabular benchmarks for joint archi-
Vision and Pattern Recognition, CVPR 2020, Seattle, WA, tecture and hyperparameter optimization, arXiv preprint
USA, June 13-19, 2020, IEEE, 2020, pp. 12962–12971. doi: arXiv:1905.04970.
10.1109/CVPR42600.2020.01298. [235] X. Dai, A. Wan, P. Zhang, B. Wu, Z. He, Z. Wei, K. Chen,
URL https://doi.org/10.1109/CVPR42600.2020.01298 Y. Tian, M. Yu, P. Vajda, et al., Fbnetv3: Joint architecture-
[222] R. Istrate, F. Scheidegger, G. Mariani, D. S. Nikolopoulos, recipe search using neural acquisition function, arXiv preprint
C. Bekas, A. C. I. Malossi, TAPAS: train-less accuracy predic- arXiv:2006.02049.
tor for architecture search, in: The Thirty-Third AAAI Con- [236] C.-H. Hsu, S.-H. Chang, J.-H. Liang, H.-P. Chou, C.-H. Liu, S.-
ference on Artificial Intelligence, AAAI 2019, The Thirty- C. Chang, J.-Y. Pan, Y.-T. Chen, W. Wei, D.-C. Juan, Monas:
First Innovative Applications of Artificial Intelligence Multi-objective neural architecture search using reinforcement
Conference, IAAI 2019, The Ninth AAAI Symposium on learning, arXiv preprint arXiv:1806.10332.
Educational Ad- vances in Artificial Intelligence, EAAI 2019, [237] X. He, S. Wang, S. Shi, X. Chu, J. Tang, X. Liu, C. Yan,
Honolulu, Hawaii, USA, January 27 - February 1, 2019, J. Zhang, G. Ding, Benchmarking deep learning models and
AAAI Press, 2019, pp. automated model design for covid-19 detection with chest ct
3927–3934. doi:10.1609/aaai.v33i01.33013927. scans, medRxiv.
URL https://doi.org/10.1609/aaai.v33i01.33013927 [238] L. Faes, S. K. Wagner, D. J. Fu, X. Liu, E. Korot, J. R. Ledsam,
[223] M. G. Kendall, A new measure of rank correlation, Biometrika T. Back, R. Chopra, N. Pontikos, C. Kern, et al., Automated
30 (1/2) (1938) 81–93. deep learning design for medical image classification by health-
URL http://www.jstor.org/stable/2332226 care professionals with no coding experience: a feasibility study,
[224] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, F. Hut- The Lancet Digital Health 1 (5) (2019) e232–e242.
ter, Nas-bench-101: Towards reproducible neural [239] X. He, S. Wang, X. Chu, S. Shi, J. Tang, X. Liu, C. Yan,
architecture search, in: K. Chaudhuri, R. Salakhutdinov (Eds.), J. Zhang, G. Ding, Automated model design and
Proceed- ings of the 36th International Conference on Machine benchmarking of 3d deep learning models for covid-19
Learning, ICML 2019, 9-15 June 2019, Long Beach, detection with chest ct scans (2021). arXiv:2101.05442.
California, USA, Vol. 97 of Proceedings of Machine Learning [240] G. Ghiasi, T. Lin, Q. V. Le, NAS-FPN: learning scalable
Research, PMLR, 2019, pp. 7105–7114. feature pyramid architecture for object detection, in: IEEE
URL http://proceedings.mlr.press/v97/ying19a.html Conference on Computer Vision and Pattern Recognition,
[225] X. Dong, Y. Yang, Nas-bench-201: Extending the scope of CVPR 2019, Long Beach, CA, USA, June 16-20, 2019,
reproducible neural architecture search, in: 8th International Computer Vision Foundation / IEEE, 2019, pp. 7036–7045.
Conference on Learning Representations, ICLR 2020, Addis doi:10.1109/CVPR.2019.00720.
Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL http://openaccess.thecvf.com/content_CVPR_
URL https://openreview.net/forum?id=HJxyZkBKDr 2019/html/Ghiasi_NAS-FPN_Learning_Scalable_Feature_
[226] N. Klyuchnikov, I. Trofimov, E. Artemova, M. Salnikov, M. Fe- Pyramid_Architecture_for_Object_Detection_CVPR_2019_
dorov, E. Burnaev, Nas-bench-nlp: Neural architecture search paper.html
benchmark for natural language processing (2020). arXiv: [241] H. Xu, L. Yao, Z. Li, X. Liang, W. Zhang, Auto-fpn:
2006.07116. Automatic network architecture adaptation for object
[227] X. Zhang, Z. Huang, N. Wang, You only search once: Single detection beyond classification, in: 2019 IEEE/CVF
shot neural architecture search via direct sparse International Conference on Computer Vision, ICCV 2019,
optimization, arXiv preprint arXiv:1811.01567. Seoul, Korea (South), October
[228] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, 27 - November 2, 2019, IEEE, 2019, pp. 6648–6657. doi:
T. Huang, X. Song, R. Pang, Q. Le, Bignas: Scaling up neural 10.1109/ICCV.2019.00675.
architecture search with big single-stage models, arXiv preprint URL https://doi.org/10.1109/ICCV.2019.00675
arXiv:2003.11142. [242] M. Tan, R. Pang, Q. V. Le, Efficientdet: Scalable and
[229] X. Chu, B. Zhang, R. Xu, J. Li, Fairnas: Rethinking evaluation efficient object detection, in: 2020 IEEE/CVF Conference on
fairness of weight sharing neural architecture search, arXiv Computer Vision and Pattern Recognition, CVPR 2020,
preprint arXiv:1907.01845. Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp. 10778–
[230] Y. Benyahia, K. Yu, K. Bennani-Smires, M. Jaggi, A. C. Davi- 10787. doi: 10.1109/CVPR42600.2020.01079.
son, M. Salzmann, C. Musat, Overcoming multi-model forget- URL https://doi.org/10.1109/CVPR42600.2020.01079
ting, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of [243] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, J. Sun, Detnas:
the 36th International Conference on Machine Learning, Neural architecture search on object detection, arXiv
ICML 2019, 9-15 June 2019, Long Beach, California, USA, preprint arXiv:1903.10979 1 (2) (2019) 4–1.
Vol. 97 of Proceedings of Machine Learning Research, PMLR, [244] J. Guo, K. Han, Y. Wang, C. Zhang, Z. Yang, H. Wu, X. Chen,
2019, pp. 594–603. C. Xu, Hit-detector: Hierarchical trinity architecture search for
URL http://proceedings.mlr.press/v97/benyahia19a.html object detection, in: 2020 IEEE/CVF Conference on
[231] M. Zhang, H. Li, S. Pan, X. Chang, S. W. Su, Overcom- Computer Vision and Pattern Recognition, CVPR 2020,
ing multi-model forgetting in one-shot NAS with diversity Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp.
11402–11411. doi:
56
10.1109/CVPR42600.2020.01142. URL http://proceedings.mlr.press/v119/fu20b.html
URL https://doi.org/10.1109/CVPR42600.2020.01142 [259] M. Li, J. Lin, Y. Ding, Z. Liu, J. Zhu, S. Han, GAN
compression:
[245] C. Jiang, H. Xu, W. Zhang, X. Liang, Z. Li, SP-NAS: serial- Efficient architectures for interactive conditional gans, in: 2020
to-parallel backbone search for object detection, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recog-
IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE,
nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp. 5283–5293. doi:10.1109/CVPR42600.2020.00533.
2020, pp. 11860–11869. doi:10.1109/CVPR42600.2020.01188. URL https://doi.org/10.1109/CVPR42600.2020.00533
URL https://doi.org/10.1109/CVPR42600.2020.01188 [260] C. Gao, Y. Chen, S. Liu, Z. Tan, S. Yan, Adversarialnas: Adver-
[246] Y. Weng, T. Zhou, Y. Li, X. Qiu, Nas-unet: Neural sarial neural architecture search for gans, in: 2020 IEEE/CVF
architecture
search for medical image segmentation, IEEE Access 7 (2019) Conference on Computer Vision and Pattern Recognition,
44247–44257. CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE, 2020,
[247] V. Nekrasov, H. Chen, C. Shen, I. D. Reid, Fast neural pp. 5679–5688. doi:10.1109/CVPR42600.2020.00572.
architecture search of compact semantic segmentation models URL https://doi.org/10.1109/CVPR42600.2020.00572
via auxiliary cells, in: IEEE Conference on Computer Vision [261] T. Saikia, Y. Marrakchi, A. Zela, F. Hutter, T. Brox, Au-
and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, todispnet: Improving disparity estimation with automl, in:
June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, 2019 IEEE/CVF International Conference on Computer Vision,
pp. 9126–9135. doi:10.1109/CVPR.2019.00934. ICCV 2019, Seoul, Korea (South), October 27 - November
URL http://openaccess.thecvf.com/content_CVPR_2019/ 2, 2019, IEEE, 2019, pp. 1812–1823. doi:10.1109/ICCV.2019.
html/Nekrasov_Fast_Neural_Architecture_Search_of_ 00190.
Compact_Semantic_Segmentation_Models_via_CVPR_2019_ URL https://doi.org/10.1109/ICCV.2019.00190
paper.html [262] W. Peng, X. Hong, G. Zhao, Video action recognition via neural
[248] W. Bae, S. Lee, Y. Lee, B. Park, M. Chung, K.-H. Jung, Re- architecture searching, in: 2019 IEEE International Conference
source optimized neural architecture search for 3d medical on Image Processing (ICIP), IEEE, 2019, pp. 11–15.
image segmentation, in: International Conference on Medi- [263] M. S. Ryoo, A. J. Piergiovanni, M. Tan, A. Angelova, Assem-
cal Image Computing and Computer-Assisted Intervention, blenet: Searching for multi-stream neural connectivity in video
Springer, 2019, pp. 228–236. architectures, in: 8th International Conference on Learning
[249] D. Yang, H. Roth, Z. Xu, F. Milletari, L. Zhang, D. Xu, Search- Representations, ICLR 2020, Addis Ababa, Ethiopia, April
ing learning strategy with reinforcement learning for 3d medical 26-30, 2020, OpenReview.net, 2020.
image segmentation, in: International Conference on Medi- URL https://openreview.net/forum?id=SJgMK64Ywr
cal Image Computing and Computer-Assisted Intervention, [264] V. Nekrasov, H. Chen, C. Shen, I. Reid, Architecture search of
Springer, 2019, pp. 3–11. dynamic cells for semantic video segmentation, in: The IEEE
[250] N. Dong, M. Xu, X. Liang, Y. Jiang, W. Dai, E. Xing, Neural Winter Conference on Applications of Computer Vision, 2020,
architecture search for adversarial medical image segmentation, pp. 1970–1979.
in: International Conference on Medical Image Computing and [265] A. J. Piergiovanni, A. Angelova, A. Toshev, M. S. Ryoo,
Computer-Assisted Intervention, Springer, 2019, pp. 828–836. Evolving space-time neural architectures for videos, in: 2019
[251] S. Kim, I. Kim, S. Lim, W. Baek, C. Kim, H. Cho, B. Yoon, IEEE/CVF International Conference on Computer Vision,
T. Kim, Scalable neural architecture search for 3d medical ICCV 2019, Seoul, Korea (South), October 27 - November
image segmentation, in: International Conference on Medi- 2, 2019, IEEE, 2019, pp. 1793–1802. doi:10.1109/ICCV.2019.
cal Image Computing and Computer-Assisted Intervention, 00188.
Springer, 2019, pp. 220–228. URL https://doi.org/10.1109/ICCV.2019.00188
[252] R. Quan, X. Dong, Y. Wu, L. Zhu, Y. Yang, Auto-reid: Search- [266] Y. Fan, F. Tian, Y. Xia, T. Qin, X.-Y. Li, T.-Y. Liu, Searching
ing for a part-aware convnet for person re-identification, in: better architectures for neural machine translation, IEEE/ACM
2019 IEEE/CVF International Conference on Computer Vision, Transactions on Audio, Speech, and Language Processing.
ICCV 2019, Seoul, Korea (South), October 27 - November [267] Y. Jiang, C. Hu, T. Xiao, C. Zhang, J. Zhu, Improved dif-
2, 2019, IEEE, 2019, pp. 3749–3758. doi:10.1109/ICCV.2019. ferentiable architecture search for language modeling and
00385. named entity recognition, in: Proceedings of the 2019 Confer-
URL https://doi.org/10.1109/ICCV.2019.00385 ence on Empirical Methods in Natural Language Processing
[253] D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, Y. Wang, Efficient and the 9th International Joint Conference on Natural Lan-
residual dense block search for image super-resolution., in: guage Processing (EMNLP-IJCNLP), Association for Compu-
AAAI, 2020, pp. 12007–12014. tational Linguistics, Hong Kong, China, 2019, pp. 3585–3590.
[254] X. Chu, B. Zhang, H. Ma, R. Xu, J. Li, Q. Li, Fast, accurate doi:10.18653/v1/D19-1367.
and lightweight super-resolution with neural architecture search, URL https://www.aclweb.org/anthology/D19-1367
arXiv preprint arXiv:1901.07261. [268] J. Chen, K. Chen, X. Chen, X. Qiu, X. Huang, Exploring
[255] Y. Guo, Y. Luo, Z. He, J. Huang, J. Chen, Hierarchical neural shared structures and hierarchies for multiple nlp tasks, arXiv
architecture search for single image super-resolution, arXiv preprint arXiv:1808.07658.
preprint arXiv:2003.04619. [269] H. Mazzawi, X. Gonzalvo, A. Kracun, P. Sridhar, N. Subrah-
[256] H. Zhang, Y. Li, H. Chen, C. Shen, Ir-nas: Neural architecture manya, I. Lopez-Moreno, H.-J. Park, P. Violette, Improving
search for image restoration, arXiv preprint arXiv:1909.08228. keyword spotting and language identification via neural ar-
[257] X. Gong, S. Chang, Y. Jiang, Z. Wang, Autogan: Neural chitecture search at scale., in: INTERSPEECH, 2019, pp.
architecture search for generative adversarial networks, in: 1278–1282.
2019 IEEE/CVF International Conference on Computer Vision, [270] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, S. Han, Amc: Automl
ICCV 2019, Seoul, Korea (South), October 27 - November for model compression and acceleration on mobile devices, in:
2, 2019, IEEE, 2019, pp. 3223–3233. doi:10.1109/ICCV.2019. Proceedings of the European Conference on Computer Vision
00332. (ECCV), 2018, pp. 784–800.
URL https://doi.org/10.1109/ICCV.2019.00332 [271] X. Xiao, Z. Wang, S. Rajasekaran, Autoprune: Automatic
[258] Y. Fu, W. Chen, H. Wang, H. Li, Y. Lin, Z. Wang, Autogan- network pruning by regularizing auxiliary parameters, in:
distiller: Searching to compress generative adversarial networks, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alch´e-Buc,
in: Proceedings of the 37th International Conference on Ma- E. B. Fox, R. Garnett (Eds.), Advances in Neural Information
chine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Processing Systems 32: Annual Conference on Neural Informa-
Vol. 119 of Proceedings of Machine Learning Research, PMLR, tion Processing Systems 2019, NeurIPS 2019, December 8-14,
2020, pp. 3292–3303. 2019, Vancouver, BC, Canada, 2019, pp. 13681–13691.
57
URL https://proceedings.neurips.cc/paper/2019/hash/ learners, OpenAI
4efc9e02abdab6b6166251918570a307-Abstract.html
[272] R. Zhao, W. Luk, Efficient structured pruning and architecture
searching for group convolution, in: Proceedings of the IEEE
International Conference on Computer Vision Workshops, 2019,
pp. 0–0.
[273] T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, H. Wang, Y. Lin,
S. Han, APQ: joint search for network architecture, pruning
and quantization policy, in: 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR 2020,
Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp. 2075–
2084. doi: 10.1109/CVPR42600.2020.00215.
URL https://doi.org/10.1109/CVPR42600.2020.00215
[274] X. Dong, Y. Yang, Network pruning via transformable
architec- ture search, in: H. M. Wallach, H. Larochelle, A.
Beygelzimer,
F. d’Alch´e-Buc, E. B. Fox, R. Garnett (Eds.), Advances in
Neural Information Processing Systems 32: Annual Conference
on Neural Information Processing Systems 2019, NeurIPS
2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
759–770.
URL https://proceedings.neurips.cc/paper/2019/hash/
a01a0380ca3c61428c26a231f0e49a09-Abstract.html
[275] Q. Huang, K. Zhou, S. You, U. Neumann, Learning to prune
filters in convolutional neural networks (2018). arXiv:1801.
07365.
[276] Y. He, P. Liu, L. Zhu, Y. Yang, Meta filter pruning to accelerate
deep convolutional neural networks (2019). arXiv:1904.03961.
[277] T.-W. Chin, C. Zhang, D. Marculescu, Layer-compensated
pruning for resource-constrained convolutional neural networks
(2018). arXiv:1810.00518.
[278] K. Zhou, Q. Song, X. Huang, X. Hu, Auto-gnn: Neural ar-
chitecture search of graph neural networks, arXiv preprint
arXiv:1909.03184.
[279] C. He, M. Annavaram, S. Avestimehr, Fednas: Federated
deep learning via neural architecture search (2020). arXiv:
2004.08546.
[280] H. Zhu, Y. Jin, Real-time federated evolutionary neural archi-
tecture search, arXiv preprint arXiv:2003.02793.
[281] C. Li, X. Yuan, C. Lin, M. Guo, W. Wu, J. Yan, W. Ouyang,
AM-LFS: automl for loss function search, in: 2019 IEEE/CVF
International Conference on Computer Vision, ICCV 2019,
Seoul, Korea (South), October 27 - November 2, 2019,
IEEE, 2019, pp. 8409–8418. doi:10.1109/ICCV.2019.00850.
URL https://doi.org/10.1109/ICCV.2019.00850
[282] B. Ru, C. Lyle, L. Schut, M. van der Wilk, Y. Gal, Revisiting
the train loss: an efficient performance estimator for neural
architecture search, arXiv preprint arXiv:2006.04492.
[283] P. Ramachandran, B. Zoph, Q. V. Le, Searching for
activation functions (2017). arXiv:1710.05941.
[284] H. Wang, H. Wang, K. Xu, Evolutionary recurrent neural
network for image captioning, Neurocomputing.
[285] L. Wang, Y. Zhao, Y. Jinnai, Y. Tian, R. Fonseca, Neural
architecture search using deep neural networks and monte
carlo tree search, arXiv preprint arXiv:1805.07440.
[286] P. Zhao, K. Xiao, Y. Zhang, K. Bian, W. Yan, Amer: Automatic
behavior modeling and interaction exploration in recommender
system, arXiv preprint arXiv:2006.05933.
[287] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, J. Tang, Au-
toemb: Automated embedding dimensionality search in stream-
ing recommendations, arXiv preprint arXiv:2002.11252.
[288] W. Cheng, Y. Shen, L. Huang, Differentiable neural
input search for recommender systems, arXiv preprint
arXiv:2006.04466.
[289] E. Real, C. Liang, D. R. So, Q. V. Le, Automl-zero: Evolving
machine learning algorithms from scratch, in: Proceedings of
the 37th International Conference on Machine Learning,
ICML 2020, 13-18 July 2020, Virtual Event, Vol. 119 of
Proceedings of Machine Learning Research, PMLR, 2020, pp.
8007–8019.
URL http://proceedings.mlr.press/v119/real20a.html
[290] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I.
Sutskever, Language models are unsupervised multitask
58
Blog 1 (2019) 8. 32: Annual Conference on Neural Information Processing
[291] D. Wang, C. Gong, Q. Liu, Improving neural language Systems 2019, NeurIPS 2019, December 8-14, 2019,
modeling via adversarial training, in: K. Chaudhuri, R. Vancouver, BC,
Salakhutdinov (Eds.), Proceedings of the 36th Canada, 2019, pp. 8024–8035.
International Conference on Machine Learning, ICML URL https://proceedings.neurips.cc/paper/2019/hash/
2019, 9-15 June 2019, Long Beach, California, USA, Vol. bdbca288fee7f92f2bfa9f7012727740-Abstract.html
97 of Proceedings of Machine Learning Research, PMLR, [302] F. Chollet, et al., Keras, https://github.com/fchollet/keras
2019, pp. 6555–6565.
URL http://proceedings.mlr.press/v97/wang19f.html
[292] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, F.
Hut- ter, Understanding and robustifying differentiable
architecture search, in: 8th International Conference on
Learning Represen- tations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020,
OpenReview.net, 2020.
URL https://openreview.net/forum?id=H1gDNyrKDS
[293] S. KOTYAN, D. V. VARGAS, Is neural architecture search
a way forward to develop robust neural networks?,
Proceedings of the Annual Conference of JSAI JSAI2020
(2020) 2K1ES203– 2K1ES203.
[294] M. Guo, Y. Yang, R. Xu, Z. Liu, D. Lin, When NAS meets
robustness: In search of robust architectures against
adversarial attacks, in: 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR 2020,
Seattle, WA, USA, June 13-19, 2020, IEEE, 2020, pp. 628–
637. doi:10.1109/CVPR42600.
2020.00071.
URL https://doi.org/10.1109/CVPR42600.2020.00071
[295] Y. Chen, Q. Song, X. Liu, P. S. Sastry, X. Hu, On
robustness of neural architecture search under label
noise, in: Frontiers in Big Data, 2020.
[296] D. V. Vargas, S. Kotyan, Evolving robust neural architec-
tures to defend from adversarial attacks, arXiv preprint
arXiv:1906.11667.
[297] J. Yim, D. Joo, J. Bae, J. Kim, A gift from knowledge distil-
lation: Fast optimization, network minimization and
transfer learning, in: 2017 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2017,
Honolulu, HI, USA, July 21-26, 2017, IEEE Computer
Society, 2017, pp. 7130–7138.
doi:10.1109/CVPR.2017.754.
URL https://doi.org/10.1109/CVPR.2017.754
[298] G. Squillero, P. Burelli, Applications of Evolutionary Com-
putation: 19th European Conference, EvoApplications
2016, Porto, Portugal, March 30–April 1, 2016,
Proceedings, Vol. 9597, Springer, 2016.
[299] M. Feurer, A. Klein, K. Eggensperger, J. T. Springen-
berg, M. Blum, F. Hutter, Efficient and robust
automated machine learning, in: C. Cortes, N. D.
Lawrence, D. D. Lee, M. Sugiyama, R. Garnett (Eds.),
Advances in Neural Information Processing Systems 28:
Annual Conference on Neural Information Processing
Systems 2015, December 7-12, 2015, Montreal, Quebec,
Canada, 2015, pp. 2962–2970.
URL https://proceedings.neurips.cc/paper/2015/hash/
11d0e6287202fced83f79975ec59a3a6-Abstract.html
[300] F. Pedregosa, G. Varoquaux, A. Gramfort, V.
Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn:
Machine learning in Python, Journal of Machine Learning
Research 12 (2011) 2825–2830.
[301] A. Paszke, S. Gross, F. Massa, A. Lerer, J.
Bradbury,
G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga,
A. Desmaison, A. K¨opf, E. Yang, Z. DeVito, M. Raison,
A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai,
S. Chintala, Pytorch: An imperative style, high-
performance deep learning library, in: H. M.
Wallach, H. Larochelle,
A. Beygelzimer, F. d’Alch´e-Buc, E. B. Fox, R. Garnett
(Eds.), Advances in Neural Information Processing Systems
59
(2015).
[303] NNI (Neural Network Intelligence), 2020.
URL https://github.com/microsoft/nni
[304] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean,
M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur,
J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner,
P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu,
X. Zheng, Tensorflow: A system for large-scale machine learning
(2016). arXiv:1605.08695.
[305] Vega, 2020.
URL https://github.com/huawei-noah/vega
[306] R. Pasunuru, M. Bansal, Continual and multi-task architec-
ture search, in: Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, Association for
Computational Linguistics, Florence, Italy, 2019, pp. 1911–1922.
doi:10.18653/v1/P19-1185.
URL https://www.aclweb.org/anthology/P19-1185
[307] J. Kim, S. Lee, S. Kim, M. Cha, J. K. Lee, Y. Choi, Y. Choi,
D.-Y. Cho, J. Kim, Auto-meta: Automated gradient based
meta learner search, arXiv preprint arXiv:1806.06927.
[308] D. Lian, Y. Zheng, Y. Xu, Y. Lu, L. Lin, P. Zhao, J. Huang,
S. Gao, Towards fast adaptation of neural architectures with
meta learning, in: 8th International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020, OpenReview.net, 2020.
URL https://openreview.net/forum?id=r1eowANFvr
[309] T. Elsken, B. Staffler, J. H. Metzen, F. Hutter, Meta-learning of
neural architectures for few-shot learning, in: 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE,
2020,
pp. 12362–12372. doi:10.1109/CVPR42600.2020.01238.
URL https://doi.org/10.1109/CVPR42600.2020.01238
[310] C. Liu, P. Doll´ar, K. He, R. Girshick, A. Yuille, S. Xie, Are
labels necessary for neural architecture search? (2020). arXiv:
2003.12056.
[311] Z. Li, D. Hoiem, Learning without forgetting, IEEE transactions
on pattern analysis and machine intelligence 40 (12) (2018)
2935–2947.
[312] S. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, icarl:
Incremental classifier and representation learning, in:
2017 IEEE Conference on Computer Vision and Pattern
Recogni- tion, CVPR 2017, Honolulu, HI, USA, July 21-26,
2017, IEEE
Computer Society, 2017, pp. 5533–5542. doi:10.1109/CVPR.
2017.587.
URL https://doi.org/10.1109/CVPR.2017.587
60