
Discrete Simulation Optimization for Tuning Machine Learning Method Hyperparameters

arXiv:2201.05978v2 [cs.LG] 4 May 2022

Varun Ramamohan, Shobhit Singhal, Aditya Raj Gupta and Nomesh Bhojkumar Bolia

Department of Mechanical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

ARTICLE HISTORY
Compiled May 5, 2022

ABSTRACT
Machine learning (ML) methods are used in most technical areas such as image
recognition, product recommendation, financial analysis, medical diagnosis, and pre-
dictive maintenance. An important aspect of implementing ML methods involves
controlling the learning process for the ML method so as to maximize the perfor-
mance of the method under consideration. Hyperparameter tuning is the process of
selecting a suitable set of ML method parameters that control its learning process.
In this work, we demonstrate the use of discrete simulation optimization methods
such as ranking and selection (R&S) and random search for identifying a hyperpa-
rameter set that maximizes the performance of a ML method. Specifically, we use
the KN R&S method and the stochastic ruler random search method and one of its
variations for this purpose. We also construct the theoretical basis for applying the
KN method, which determines the optimal solution with a statistical guarantee via
solution space enumeration. In comparison, the stochastic ruler method asymptot-
ically converges to global optima and incurs smaller computational overheads. We
demonstrate the application of these methods to a wide variety of machine learning
models, including deep neural network models used for time series prediction and
image classification. We benchmark our application of these methods with state-
of-the-art hyperparameter optimization libraries such as hyperopt and mango. The
KN method consistently outperforms hyperopt’s random search (RS) and Tree of
Parzen Estimators (TPE) methods. The stochastic ruler method outperforms the
hyperopt RS method and offers statistically comparable performance with respect
to hyperopt’s TPE method and the mango algorithm.

KEYWORDS
Hyperparameter tuning; Simulation optimization; Ranking and selection; Random
search; Stochastic ruler; Tree of Parzen Estimators

1. Introduction & Literature Review

Most machine learning (ML) algorithms or methods are characterized by multiple pa-
rameters that can be selected by the analyst prior to starting the training process and
are used to determine the ML model architecture and control its training process (Jor-
dan & Mitchell, 2015; Kuhn, Johnson, et al., 2013). Such parameters are referred to as
‘hyperparameters’ of the ML method, in contrast to parameters of the ML method or
model itself that are estimated during the training process, such as the slope and the

CONTACT Varun Ramamohan. Email: varunr@mech.iitd.ac.in


intercept of a linear regression model estimated via a maximum likelihood estimation
process applied on the training dataset for the problem. Examples of ML method hyperparameters include the following: the support vector machine (SVM) classifier in the Scikit-learn
Python ML library (Pedregosa et al., 2011) is controlled via hyperparameters
such as the kernel type, the regularization parameter and the kernel coefficient, among others.
Similarly, a feed-forward artificial neural network (ANN) method is controlled
by the learning rate, the learning rate type, the optimization solver used (e.g., stochastic gradient
descent, ADAM), the number of layers, and the number of nodes in each layer.
Finding the optimal set of hyperparameters - that is, the hyperparameter
set that maximizes a measure of the performance of the ML method appropriate to
the prediction problem at hand - thus becomes important. Brute force enumeration
of the solution space may not be a computationally tractable approach towards de-
termining the optimal hyperparameter set, even if hyperparameters with continuous
search spaces are discretized. Hence many approaches for hyperparameter optimization
or ‘tuning’ have been proposed, such as grid searches, manual searches and Bayesian
optimization (Yang & Shami, 2020). In this paper, we propose the use of discrete
simulation optimization methods for the purpose of hyperparameter optimization.
Simulation optimization methods are used to solve optimization problems wherein
the objective function values are estimated via a simulation (Henderson & Nelson,
2006). They can be expressed as follows.

max_{x∈S} E[h(x, Yx)]        (1)

In the above equation, h(x, Yx ) represents the simulation outcome from one replica-
tion of the simulation, and we are interested in maximizing its expected value. x ∈ Rn
represents the vector of decision variables taking values in the feasible search space S
(S ⊆ Rn ), and the Yx are random variables representing the stochasticity in the prob-
lem. In this paper, we conceptualize the training and validation process of the machine
learning method as the simulation, and its hyperparameters as the decision variables.
The outcome of a single replication of the ‘simulation’ is represented by a performance
measure - such as the classification accuracy or the area under the receiver operating
characteristic curve - of the machine learning method on a validation dataset, and thus
the objective function becomes the average value of the performance measure that we
wish to maximize. In Section 2, we describe how we model stochasticity in the ML
method training and validation process. As part of this work, we wish to demonstrate
that simulation optimization methods can be leveraged to perform hyperparameter
optimization. A key advantage of simulation optimization methods, over many hy-
perparameter tuning methods such as grid search, random search, etc., is that they
provide statistical guarantees of finding the optimal hyperparameter set, or provably
possess the property of asymptotic convergence to the optimal hyperparameter set.
This implies that if sufficient computational overheads are available for the machine
learning method training process, then simulation optimization methods are highly
likely to find the optimal hyperparameter set.
As a first step towards leveraging simulation optimization methods for this purpose,
we use discrete simulation optimization methods for hyperparameter optimization.
Specifically, we demonstrate the use of two types of discrete simulation optimization
methods: ranking and selection (R&S) methods and random search methods. We il-
lustrate the use of the KN R&S method (Kim & Nelson, 2001), and the stochastic ruler
random search method (Yan & Mukai, 1992) for optimizing the hyperparameters of
SVMs, feedforward ANNs, long short term memory (LSTM) and convolutional neural
networks (CNNs). We benchmark the performance of the KN and the stochastic ruler
methods against multiple algorithms within existing state-of-the-art hyperparameter
tuning packages such as hyperopt and mango. We also provide the theoretical basis
for applying the KN method, which requires that the simulation outcomes from each
replication possess the iid property.
We now discuss the relevant literature and our associated research contributions.
We begin by considering the hyperparameter tuning method known as ‘babysitting’,
which is essentially a trial and error method. This method is used by analysts with prior
experience with the prediction problem at hand, and is useful when severe limitations
on computational infrastructure are present (Yang & Shami, 2020).
The grid search (GS) method is one of the first systematic procedures used for hy-
perparameter optimization (Bergstra, Bardenet, Bengio, & Kégl, 2011; Yang & Shami,
2020). The method involves discretizing the hyperparameter space, and then exhaus-
tively evaluating the average performance of each solution in the discretized hyper-
parameter space via, for example, cross validation. However, the method does not
provide a statistical guarantee of selecting the optimal hyperparameter set.
To address the limitations of the GS method, Bergstra and Bengio (2012) pro-
posed the random search hyperparameter optimization approach, and showed that it
outperformed GS for the same computational budget. The method involves defining
the hyperparameter space as a bounded set - typically discrete - and then sampling
hyperparameter sets randomly from said space. The methodology does not consider
information from prior iterations, and thus poorer performing regions of the space also
receive equal consideration in the search. Similar to GS, the approach does not provide
statistical optimal solution selection guarantees.
Another popular technique for optimizing hyperparameter selection is the Bayesian
Optimization (BO) method. BO typically requires relatively few function evaluations because it uses the observations obtained in previous iterations to determine the next evaluation points. Snoek, Larochelle, and Adams (2012) introduced BO for hyperparameter
optimization and demonstrated its application for several ML algorithms. BO uses a
surrogate model, typically a Gaussian process, to model the objective function, and
determines the next candidate for the optimal solution by maximizing an acquisition
function defined over this surrogate. The surrogate model and acquisition function are updated at each
iteration based on information from previous objective function evaluations. Some
drawbacks of BO include sensitivity to the choice of acquisition function and its pa-
rameters, and cubic time complexity in the number of objective function evaluations. Moreover, it has
difficulty handling categorical and integer-valued hyperparameters. Examples of BO
based methods include the random forest based and Tree of Parzen Estimators based
hyperparameter optimization methods (Bergstra, Yamins, & Cox, 2013; Eggensperger,
Hutter, Hoos, & Leyton-Brown, 2015).
Metaheuristic algorithms have been used for hyperparameter optimization as well.
Among these, the genetic algorithm (GA) and particle swarm optimization (PSO)
methods have been used most widely. For example, Lessmann, Stahlbock, and Crone
(2005), Hutter, Hoos, and Leyton-Brown (2011) and Peng-Wei Chen, Jung-Ying Wang,
and Hahn-Ming Lee (2004) applied a GA to optimize support vector machine hyper-
parameters, and Lorenzo, Nalepa, Kawulok, Ramos, and Pastor (2017) applied the
PSO approach to optimize hyperparameters of deep neural networks.
Stochastic gradient based techniques have been used to explore continuous hyperpa-
rameter spaces. For example, Maclaurin, Duvenaud, and Adams (2015) and Pedregosa
(2016) demonstrated the use of the stochastic gradient descent method for finding op-
timal values of continuous hyperparameters of various machine learning algorithms.
A few Python libraries for hyperparameter optimization exist. Widely used among
these include the hyperopt package, which implements the Tree of Parzen Estimators
(TPE) based BO approach. It also implements discrete and continuous versions of
random search (RS) hyperparameter optimization methods (Bergstra & Bengio, 2012;
Bergstra et al., 2013). Another widely used library is mango, which implements an
efficient and effective batch Gaussian process bandit search approach using upper con-
fidence bound as the acquisition function (Sandha, Aggarwal, Fedorov, & Srivastava,
2020).
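For orientation, a minimal example of how such a library is typically invoked is sketched below. The discretized SVM search space and the use of three-fold cross-validated accuracy as the objective are our own illustrative assumptions, not the benchmarking setup used later in this paper; only documented hyperopt calls (hp.choice, fmin, tpe.suggest) are used.

from hyperopt import fmin, tpe, hp
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Illustrative discrete search space for an SVM classifier
space = {
    "kernel": hp.choice("kernel", ["rbf", "linear"]),
    "C": hp.choice("C", [0.1, 1.0, 10.0, 100.0]),
}

def objective(params):
    # hyperopt minimizes, so return 1 - mean cross-validated accuracy
    acc = cross_val_score(SVC(**params), X, y, cv=3).mean()
    return 1.0 - acc

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # indices of the selected values within each hp.choice list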
In our search of the literature, we have not identified another study that utilizes dis-
crete simulation optimization methods for ML method hyperparameter optimization.
As mentioned earlier, the key advantage of applying R&S methods over methods such
as GS and random search is that they provide a probabilistic guarantee of selecting the
best hyperparameter set - for example, they guarantee that the optimal solution will be
selected with 1 - α probability as long as the difference between its objective function
value and that of the next best solution is at least δ. Similarly, random search methods
such as the stochastic ruler method provide asymptotic guarantees of converging to
the optimal solution. We provide a conceptualization of the ML training and holdout
validation process as a simulation, and also construct the theoretical grounding for
applying the KN ranking & selection procedure for hyperparameter optimization. Fi-
nally, we provide a detailed benchmarking of the KN R&S method and the stochastic
ruler random search method against popular hyperparameter optimization methods
implemented in the hyperopt and mango libraries.
The remainder of the paper is organized as follows. In Section 2, we set up the
theoretical framework for applying simulation optimization methods to hyperparam-
eter optimization. In Sections 3 and 4, we describe the results of applying the KN
R&S method and the stochastic ruler random search, respectively, to optimize the
hyperparameters of various ML methods. We also provide benchmarking results for
these methods against hyperopt and mango hyperparameter optimization methods.
We conclude the paper in Section 5 with a discussion of the potential impact of our
work, its limitations, and avenues for future research.

2. Theoretical Framework

Consider a machine learning problem where we are attempting to find the best-fit
function f that predicts an outcome y as a function of features x. Here x ∈ Rn, y ∈ R.
Let the training dataset be denoted by (X, Y ), where X is an m × n matrix and Y
is an m-dimensional vector of labels. Note that we do not make a distinction between
classification and regression at this juncture, and hence we let y ∈ R, and do not
impose any other constraint on y. For example, if our focus was only on classification
problems, we would specify that y ∈ {0, 1, .., k − 1}, where k represents the number of
classes.
Let θ ∈ Rd be a set of d hyperparameters that characterize the machine learning
model architecture and training process, and let P be the performance measure used
to judge the quality of the fit of f (x) (without loss of generality, we assume increasing
P denotes increasing quality of fit). Typically, P is a random variable whose value
depends upon the organization of the training and holdout (test) validation datasets,
and hence hyperparameter optimization attempts to find θ that maximizes E[P ]. E[P ]
may be estimated, for example, via a cross-validation exercise. We propose to use
discrete simulation optimization methods for selecting optimal hyperparameter values.
More specifically, we demonstrate here the use of ranking and selection methods such
as the KN method (Kim & Nelson, 2001) and random search methods such as the
stochastic ruler method (Yan & Mukai, 1992) and one of its variants (Ramamohan,
Agrawal, & Goyal, 2020) for this purpose.

2.1. Simulation Algorithm


In order to apply simulation optimization methods to find optimal hyperparameter
values for ML methods, we first need to define the corresponding ‘simulation’ and the
associated optimization problem. We propose considering the process of finding the
optimal parameter estimates of the function f given a particular organization of the
training set (X, Y )i and a specific kth set of hyperparameters θk , and evaluating the
quality of the fitted function f via holdout validation by generating an estimate of the
performance measure, as the ith replication of the simulation. The particular version
of the training set used in the ith replication is generated by a single permutation of
the dataset, denoted by σi (X, Y ), where σ is the permutation function that operates
on the dataset. The permuted dataset (X, Y )i can then be divided into the training
and holdout validation datasets (e.g., first 80% used for training, next 20% for holdout
validation). Thus the output of one replication (e.g., the ith replication) of this simula-
tion can be considered to be the performance measure Pi (θk ), denoted in short as Pki .
The simulation algorithm can thus be written as follows, depicted in Algorithm 1. In
Algorithm 1 below, in order to make the simulation conceptualization clear, we also
explicitly describe the permutation process (Step 3).
Algorithm 1: Simulation algorithm for generating an estimate of E[P] for a given hyperparameter set θk.
1. Initialize with dataset (X, Y), number of replications I used to estimate E[P] for a given hyperparameter set θk, set of indices of the dataset samples M = {1, 2, ..., m}, hyperparameter set θk, P̄(θk) = 0.
2. Set random number seed sk ∼ gs(s). /* Sample the random number seed from its cdf gs(s) for generation of P̄(θk). */
3. Permute the dataset and estimate E[P] (as P̄(θk)) given hyperparameter set θk:
for i = 1 to I do
    (X, Y)i = {}, M = {1, 2, ..., m}, J = |M| /* Reset the index set for each replication. */
    for j = 1 to J do
        m ← |M|
        u ← sample from U(0, 1) /* Sample u from a uniform random variable on (0, 1). */
        l ← ⌈u × m⌉ /* Randomly sample the lth element from M. */
        (X, Y)i ← (X, Y)i ∪ (xl, yl) /* (xl, yl) is the lth sample from dataset (X, Y). */
        M ← M − {l}
    end
    P̄(θk) = P̄(θk) + fit((X, Y)i, f, θk)
end
P̄(θk) ← P̄(θk)/I
In Algorithm 1, the term fit((X, Y)i, f, θk) can be thought of as the training and validation subroutine that takes as input the ith permutation of the dataset (X, Y), the function to be fit f and the hyperparameters θk. The subroutine then divides this dataset into training and validation subdatasets, estimates the parameters of f using the training subdataset, and then outputs the performance measure Pki for the dataset (X, Y)i by applying the fitted function on the validation dataset. We note here that while the generation of the training and validation subdatasets from a given dataset (say, dataset (X, Y)i) can itself involve some stochasticity, we assume otherwise - for example, we assume that the first, say, X% of the dataset is always used as the training subdataset and the remaining (100 − X)% of the dataset is then used as the validation subdataset. Thus the generation of the training and validation subdatasets is also determined by the random number seed sk specified in Algorithm 1.
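To make this conceptualization concrete, the sketch below implements one version of Algorithm 1 in Python using scikit-learn. The function name simulate, its arguments and the use of the breast cancer dataset of Section 3 are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def simulate(theta, seed, n_reps=10, train_frac=0.8):
    # Estimate E[P] for hyperparameter set theta by averaging the holdout-validation
    # accuracy over n_reps permutations of the dataset, all driven by one seed s_k.
    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(seed)            # seed s_k drives all permutations
    accuracies = []
    for _ in range(n_reps):                      # one loop pass = one replication
        perm = rng.permutation(len(y))           # sigma_i(X, Y): permute the dataset
        split = int(train_frac * len(y))         # fixed split: first 80% train, rest holdout
        train, test = perm[:split], perm[split:]
        model = MLPClassifier(**theta, max_iter=500)
        model.fit(X[train], y[train])            # the fit(.) subroutine of Algorithm 1
        accuracies.append(accuracy_score(y[test], model.predict(X[test])))
    return float(np.mean(accuracies))            # sample-mean estimate of E[P(theta)]

# Example: one hyperparameter set theta_k evaluated with seed s_k = 42
print(simulate({"hidden_layer_sizes": (10,), "learning_rate_init": 0.01}, seed=42))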
Now that we have conceptualized the generation of the ith estimate of the per-
formance measure Pki as one replication of the simulation, we can write down the
optimization problem as follows.
max_{θ∈S} E[P(θ)]

Here S represents the set of allowable values for θ. E[P(θ)] is estimated as P̄(θ) by
the simulation described above.
We now discuss how the set of I replicate values of Pki generated by Algorithm 1
used to estimate E[P (θk )] can be considered to be an iid sample despite the fact that
they are generated from effectively the same dataset (X, Y ).

2.2. IID Requirements


In order to apply common R&S methods such as the KN or NSGS procedures (Hender-
son & Nelson, 2006), or random search methods such as the stochastic ruler algorithm
(Yan & Mukai, 1992), a standard iid requirement must be satisfied by the simulation
replications associated with a given hyperparameter set θk . That is, the I replications
Pk1 , Pk2 , ..., PkI must be iid. Note that, if |S| = K (that is, there are K allowable values
of θ), some procedures such as the NSGS method impose the additional requirement
that the random variables P (θk ), k = (1, 2, ..., K) are also independent; however, other
methods such as the KN method do not, as they permit the use of common random
numbers (Henderson & Nelson, 2006; Kim & Nelson, 2001). Hence our focus in this
section will be the iid requirement that the Pki, i = 1, . . . , I, have to satisfy.
For typical machine learning exercises, the samples in a given organization of the
dataset (X, Y )i are assumed to be iid. However, this is typically difficult to verify,
especially if the analyst is not involved in the generation of the dataset. In such sit-
uations, the operational definition of the iid property to be applied is derived from
the class of representation theorems discussed in Ressel (1985) and O’Neill (2009).
Broadly, according to these theorems, a sequence (typically infinite) of random vari-
ables (e.g., X1 , X2 , ...) consisting of terms that may not be independent of each other
can be shown to be conditionally independent given a realization of a random variable
Q driving the generation of the sequence as long as the terms in the sequence are ex-
changeable. The key requirement here is exchangeability - that is, given a realization
of Q, the joint distribution of the sample must not change if the order in which the
Xi are generated changes (Ressel, 1985).
For example, consider a sequence of random variables X1 , X2 , . . . that are generated
independently from the uniform distribution U (a, b). Now if a is a constant, and b
is itself a Bernoulli random variable that takes value b1 with probability p and b2
with probability 1 − p, then the distribution of the sequence of random variables
is driven by the realization of the random variable b. The representation theorems
assert that the {Xi } can be considered to be an iid sample as long as the order
in which the Xi are generated does not change their joint distribution. Note that
while the original representation theorem due to De Finetti was developed for infinite
sequences of Bernoulli random variables, a version of the representation theorem for
finite sequences was developed by Diaconis and Freedman (1980).
In order to demonstrate the applicability of this class of theorems to our case,
we must identify the distribution that drives the generation of the particular set of
replications Pki , i = 1 to I and then demonstrate that these replicate values of Pki
are exchangeable.
To this end, we first note that each value of Pki is a function of the ith permutation
of the training set. Each permutation of the training set (X, Y )i can be considered
to be governed by the sequence of uniform random numbers that are used to sample
(without replacement) from the training set (X, Y ). For example, per Algorithm 1, the
ith permuted dataset (X, Y )i is generated by the sequence of random numbers {Ui } =
Ui1 , Ui2 , ..., UiJ . However, the {Ui } are a subsequence of the stream of pseudorandom
numbers generated for the entire simulation in algorithm 1, which can be thought
of as a sequence in itself, given by {U } = U1 , U2 , ..., UI . This sequence is governed
by a specific seed sk that is supplied to the random number generator at the start
of the simulation. That is, if we denote the seed used to generate U as sk , then
U in effect becomes a function of sk . Now, if the sk are themselves generated from
some distribution gs (s) - for example the sk are randomly sampled (with replacement)
integers from 1 to L, where L is a very large number when compared to K to minimize
the likelihood of sampling the same sk twice or more - then the underlying distribution
governing the generation of the Pki becomes this uniform random integer distribution.
Thus we have constructed the distribution driving the generation of the Pki .
Exchangeability is now easy to demonstrate, given that each Pki is a function of the
random number seed sk . Therefore, the joint distribution of the Pki depends only on
the distribution of the sk , and not on the order in which the Pki are generated. This
completes the argument.
We now describe the numerical experiments involving implementation of the KN
ranking and selection method to the hyperparameter optimization problem.

3. Numerical Evaluation: KN Ranking and Selection Method

In this section, we demonstrate the application of the KN R&S method for ML hy-
perparameter optimization. We begin by providing an overview of the KN procedure,
then demonstrate the application of the KN method to optimize the hyperparame-
ters of various ML methods, and finally describe the results of benchmarking the KN
method against widely used hyperparameter optimization methods implemented in
the hyperopt hyperparameter tuning package.
The KN algorithm provides a guarantee that, out of N possible hyperparameter
sets (or feasible solutions) θn (n ∈ {1, 2, . . . , N }), the optimal set will be selected with
a probability 1 − p, given that the difference between the best solution θ∗ and the next
best θn is at least δ. The values of p and δ are parameters of the algorithm. As part
of the KN method, we generate an initial set of replications R0 for each θn , and then
using δ and p, select a subset D of the original N feasible solutions. This is referred
to as the screening phase. Only the variances of the pairwise differences between the
replications of each hyperparameter set are used in determining the subset D using
the first set of R0 replications.
Once D is identified after the screening phase, we generate an additional replication
for each θn , and the subset D is updated using the screening criteria until D contains a
single hyperparameter set. This is the optimal θ∗ . This phase is referred to as ranking
and selection. We provide the details of the procedure below, adapted from Henderson
and Nelson (2006).
Step 1. Initialize the method with R0, δ and p. Calculate η and h² as:

η = (1/2) [ (2p/(N − 1))^(−2/(R0 − 1)) − 1 ], and

h² = 2η(R0 − 1)

Step 2. Generate R0 replicate values of P(θn) for each θn. Estimate their sample means P̄(θn) for n ∈ {1, 2, ..., N}.
Step 3. Estimate the sample variances of the paired differences as follows:

S²nl = (1/(R0 − 1)) Σ_{j=1}^{R0} ( Pnj − Plj − [ P̄(θn) − P̄(θl) ] )²,

where Pnj denotes the jth replicate value of P(θn).

Step 4. Screening phase: construct the subset D using the following criterion:

D = { n : n ∈ Dold and P̄n(k) ≥ P̄l(k) − Gnl(k) ∀ l ∈ Dold, l ≠ n },        (2)

where Gnl(k) = max{ 0, (δ/(2k)) · ( h²S²nl/δ² − k ) }

Step 5. If |D| = 1, then stop, and the θn ∈ D is the optimal θ with probability 1 − p.
If not, generate one extra replicate value of the performance measure for each n ∈ D,
set k = k + 1, Dold = D, and repeat steps 4 and 5.
In Step 4 above, P̄n(k) represents the average value of the performance measure
calculated with k replications and generated with the nth hyperparameter set. Note
that in Step 3, k = R0 .
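To make the procedure concrete, a compact Python sketch of the KN method is given below. The function name kn_select and its arguments are our own; the code assumes the caller supplies a simulate routine that returns one replicate value of the performance measure for a given hyperparameter set, as in Algorithm 1 of Section 2.

import numpy as np

def kn_select(simulate, systems, R0=10, p=0.05, delta=0.1):
    # Sketch of the KN ranking-and-selection procedure (Kim & Nelson, 2001).
    # systems is the enumerated list of hyperparameter sets theta_1, ..., theta_N.
    N = len(systems)
    eta = 0.5 * ((2.0 * p / (N - 1)) ** (-2.0 / (R0 - 1)) - 1.0)
    h2 = 2.0 * eta * (R0 - 1)

    # Step 2: generate R0 initial replications for every system
    obs = {n: [simulate(systems[n]) for _ in range(R0)] for n in range(N)}

    # Step 3: variances of the pairwise differences, from the first R0 replications
    S2 = np.zeros((N, N))
    for n in range(N):
        for l in range(n + 1, N):
            diffs = np.array(obs[n][:R0]) - np.array(obs[l][:R0])
            S2[n, l] = S2[l, n] = np.var(diffs, ddof=1)

    D, k = set(range(N)), R0
    while len(D) > 1:
        # Steps 4 and 5: screen the surviving systems using k replications each
        means = {n: np.mean(obs[n]) for n in D}
        D = {n for n in D
             if all(means[n] >= means[l] -
                    max(0.0, (delta / (2 * k)) * (h2 * S2[n, l] / delta**2 - k))
                    for l in D if l != n)}
        if len(D) > 1:
            for n in D:
                obs[n].append(simulate(systems[n]))   # one extra replication per survivor
            k += 1
    return systems[D.pop()]   # the surviving system is optimal with probability 1 - p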
With the KN method, the number of evaluations of the ML model is relatively
larger than that of the grid search or the random search method, as the number of
‘replicate’ values of the performance measure to be generated is not known a priori.
So, the KN method is ideally suited for cases where the hyperparameter space is
relatively small, and where generating replicate estimates of the performance measure
is not expensive. We demonstrate such use cases in Section 3.1.

3.1. KN Method: Implementation for Hyperparameter Optimization


We applied the KN method to optimize the hyperparameters for support vector ma-
chines and feedforward neural network ML methods. We discretized the hyperparam-
eter spaces for each of these ML methods so that the feasible hyperparameter space
is suitable for application of the KN method. We chose the ‘breast cancer Wisconsin
dataset’ from the Scikit-learn repository (Pedregosa et al., 2011) for all computational
experiments in this section. The dataset contained 569 samples in total, where each
sample consists of 30 features and a binary label. Thus the objective of the ML exercise
here is to train and test an ML model on the dataset such that it is able to classify
a sample into one of two categories: ‘malignant’ or ‘benign’, and the simulation op-
timization problem in turn involves finding the hyperparameter set that maximizes
average classification accuracy.
We utilized the public version of the Google Colaboratory platform, using CPUs
with a 2.2 gigaHertz clock cycle speed and 12 gigabytes of memory. The results of
applying the KN method, implemented as described above, are provided below in
Table 1. We provide the results in Table 1 to illustrate how the KN method can be
applied for hyperparameter optimization - that is, to illustrate how the parameters
of the KN method can be specified, how the search space is constructed, etc. We do
not describe the ML methods in question themselves in detail - that is, the SVM and
neural network models used in this manuscript, as well as their hyperparameters - as
we assume sufficient familiarity with these ML methods on part of the reader.
Table 1. KN method implementation for optimizing support vector machine and artificial neural network method hyperparameters. Notes. num = number.

Support vector machine
  Hyperparameters: Kernel ∈ {'rbf', 'linear'}; Gamma ∈ {0.001, 0.01, 0.1, 0.5, 1, 10, 30, 50, 80, 100}; C ∈ {0.01, 0.1, 1, 10, 100, 300, 500, 700, 800, 1000}
  KN method parameters: N = 200, n0 = 10, p = 0.05, δ = 10%
  Results: Optimal accuracy: 0.968; num function evaluations: 2001; net runtime: 6709 seconds; total num iterations (k): 10

Artificial neural network
  Hyperparameters: Hidden layer sizes ∈ {3, 5, 8, 12, 15, 20, 25, 30, 50, 80}; Activation function ∈ {relu, logistic, tanh}; Solver ∈ {adam, sgd}; Learning rate type ∈ {constant, adaptive}; Learning rate value (discretized) ∈ {0.0005, 0.001, 0.01, 0.05, 0.1}
  KN method parameters: N = 600, n0 = 10, p = 0.05, δ = 10%
  Results: Optimal accuracy: 0.934; num function evaluations: 9871; net runtime: 3093 seconds; total num iterations (k): 195
We can see from Table 1 that the search space for the simulation optimization
algorithm is constructed by taking the Cartesian product of the sets of allowable
values for the hyperparameters. For some continuous parameters, such as the ‘Gamma’
parameter for the SVM, the range may theoretically be unbounded in R; in such cases,
the set of allowable values must initially be chosen as a reasonable range, and then
discretized. Other parameters are inherently discrete-valued, such as the kernel type
for the SVM, or the activation function type for the ANN. Further, we specified the
value of δ for the above computational experiments to be 10%, implying that we are
indifferent to any hyperparameter set that yields a classification accuracy within 10%
of the classification accuracy yielded by the optimal solution.
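As an illustration (not the authors' code), the snippet below enumerates the SVM search space of Table 1 as such a Cartesian product.

from itertools import product

kernels = ["rbf", "linear"]
gammas = [0.001, 0.01, 0.1, 0.5, 1, 10, 30, 50, 80, 100]
Cs = [0.01, 0.1, 1, 10, 100, 300, 500, 700, 800, 1000]

# Cartesian product of the allowable values gives the enumerated solution space
search_space = [dict(kernel=k, gamma=g, C=c) for k, g, c in product(kernels, gammas, Cs)]
print(len(search_space))   # 2 x 10 x 10 = 200 feasible hyperparameter sets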
A drawback of applying the KN method for this purpose is that it requires complete
enumeration of the solution space; however, as we see from Table 1, the method iden-
tifies an optimal solution in reasonable runtimes when the cardinality of the search
space is in the order of hundreds.
We note here that the KN method parameters p and δ control the evolution of the
search process for the optimal hyperparameter set. In this sense, they are themselves
‘hyperparameters’ of the KN method; however, unlike ML method hyperparameters,
these parameters control the point at which the KN method terminates its search,
and the solution identified at this termination step remains the optimal solution with
a corresponding statistical guarantee. We investigated how the number of iterations
(the final value of k in the execution of the KN procedure) and correspondingly the
computational runtime required to arrive at the optimal hyperparameter set change
with δ. We would expect that as the value of δ decreases, the final value of k would
increase given that more precision would be required of the KN method to distinguish
the best hyperparameter set from the next best set. This is indeed the case, as depicted
in Figure 1 below.

Figure 1. KN method: number of function evaluations required to arrive at optimal hyperparameter set
versus δ.

The experiments summarized in Figure 1 and in the following Section 3.2, where we
describe the process of benchmarking the KN method with respect to hyperparameter
optimization methods in the hyperopt package, were conducted with an ANN with the
following hyperparameter space.
• Hidden layer sizes ∈ {3, 10, 25, 50, 80}
• Learning rate value (discretized) ∈ {0.0005,0.001,0.01}
• Activation function ∈ {relu, logistic, tanh}
• Solver ∈ {adam, sgd}
• Learning rate type ∈ {adaptive}

3.2. Benchmarking the KN results


We now discuss the benchmarking of the KN method against the RS and the Tree
of Parzen Estimators (TPE) methods implemented by the hyperopt library. We have
provided a brief introduction to the hyperopt RS method previously. The hyperopt
TPE algorithm (Bergstra et al., 2013) is a Bayesian optimization algorithm, which
constructs a Gaussian mixture model of the classification performance as a function
of hyperparameter sets, and then performs a Bayesian update of the classification
performance given a hyperparameter set at each iteration.
For the benchmarking case, we consider a realistic hyperparameter tuning situation,
where the optimal hyperparameter set is not known previously, and hence the optimal
hyperparameter set is determined in conjunction with constraints such as computa-
tional budgets specified in terms of the number of function evaluations or computa-
tional runtime. Thus, for the KN method, we let it run its course to find the optimal
solution given that the solution space appears computationally tractable, and for the
RS and TPE methods, we specify a computational budget in terms of the number of
function evaluations. Here each function evaluation involves generation of one repli-
cate value of the performance measure for a hyperparameter set. Note that for the KN
method, multiple function evaluations will be generated for the same hyperparameter
set, whereas the hyperopt RS and TPE methods evaluate a hyperparameter set via
generation of only one replicate value of the performance measure. However, the same
hyperparameter set may be evaluated more than once if it is randomly sampled more
than once. Thus for the RS and TPE methods, we compare the performance of their
optimal hyperparameter set for multiple computational budgets, where said budgets
are specified in terms of the number of function evaluations.
In our benchmarking exercise, we begin by comparing the classification accuracies of
the best system returned by the KN and the hyperopt methods as well as the number
of function evaluations required to generate the optimal solution. However, given that
the RS and TPE methods use only a single replicate value of the performance measure
to evaluate a hyperparameter set, it is advisable to conduct a more rigorous compari-
son of the optimal solutions returned by the hyperopt methods and the KN method.
Therefore, we conduct a Student’s t-test, at a 5% level of significance, to compare the
mean performances of the optimal hyperparameter sets as returned by the KN method
and the RS and TPE methods, respectively. 25 replicate values of the classification
accuracies were generated for the optimal hyperparameter set identified by the KN
method and the sets identified by the RS and TPE methods for each computational
budget. t-tests for equality of the mean classification accuracies for both hyperparam-
eter sets were conducted using these replications, and the results are documented in
Table 2 below.
Table 2. Benchmarking: the KN method versus the hyperopt random search and hyperopt TPE implementations. Notes. RS = random search; TPE = Tree of Parzen Estimators; N/A = not applicable.

Method | Number of function evaluations | Runtime (seconds) | Optimal objective function value | Mean accuracy from optimal hyperparameter set | p value from t-test for equality of means | Inference
KN | 1381 | 635 | 0.940 | 0.932 | N/A | N/A
hyperopt RS | 1000 | 463 | 0.982 | 0.815 | 0.001 | KN is better
hyperopt RS | 500 | 209 | 0.982 | 0.89 | 0.018 | KN is better
hyperopt RS | 300 | 124 | 0.982 | 0.728 | 0.000 | KN is better
hyperopt RS | 200 | 85 | 0.972 | 0.655 | 0.000 | KN is better
hyperopt RS | 100 | 42 | 0.965 | 0.941 | 0.187 | KN is comparable
hyperopt TPE | 1000 | 463 | 0.982 | 0.815 | 0.001 | KN is better
hyperopt TPE | 500 | 306 | 0.991 | 0.927 | 0.724 | KN is comparable
hyperopt TPE | 300 | 248 | 0.982 | 0.911 | 0.044 | KN is better
hyperopt TPE | 200 | 138 | 0.982 | 0.85 | 0.069 | KN is comparable
hyperopt TPE | 100 | 88 | 0.973 | 0.882 | 0.001 | KN is better

It is evident from Table 2 that the KN method performs at least as well as the hyperopt RS and TPE methods in every case. More specifically, the KN method appears to perform significantly better than the hyperopt RS method, as it yields statistically better performance than the RS method in almost all comparisons. With respect to the hyperopt TPE method, the KN method offers statistically better performance in three out of five comparisons, and is comparable in the remaining two cases. Importantly, we see that for the comparisons with both methods, the KN method is found to perform better even for the highest computational budget specified for the hyperopt methods. Finally, the computational runtime for the KN method, while higher than the corresponding runtimes for the hyperopt methods, is still reasonable, and is on the same order as that for the hyperopt methods with the highest computational budgets.
Overall, given that the KN method provides a statistical guarantee of optimality for the solution selected as the best, it is evident that ranking and selection methods such as the KN method, where theoretically feasible, can be considered as a serious
alternative to the hyperopt RS and TPE methods for hyperparameter optimization.
This is particularly the case when the size of the hyperparameter search space is
computationally tractable. For example, certain ranking and selection methods have
been used for solution spaces of sizes numbering in the thousands (Nelson, 2010).
We now discuss the application of random search simulation optimization methods -
specifically the stochastic ruler method - to the hyperparameter optimization problem.

4. Numerical Evaluation: Stochastic Ruler Random Search Method

The KN method requires complete enumeration and evaluation of the search space.
This can become a computationally expensive exercise when the search space car-
dinality is relatively high, or when generation of replicate values of the objective
function via the simulation is expensive. In such situations, random search meth-
ods are useful. In this section, we demonstrate the application of one of the simplest
random search methods proposed for discrete simulation optimization, the stochastic
ruler random search method (Yan & Mukai, 1992), for hyperparameter optimization.
The method involves comparing each replicate objective function value h(x), where
x belongs to the solution space S, with a uniform random variable defined on the pos-
sible range of the values within which h(x) lies. This uniform random variable is used
as a scale - that is, the stochastic ruler - against which the observations h(x) are
compared. The method converges asymptotically (in probability) to a global optimal
solution. In this paper, in addition to the original version of the stochastic ruler method
(Yan & Mukai, 1992), we also implement its modification proposed by Ramamohan et
al. (2020).
We begin by providing an overview of the stochastic ruler method. The stochastic
ruler discrete simulation optimization method was developed by Yan and Mukai (1992)
to solve problems similar to that expressed in equation 1.
The method is initialized with a feasible solution, and then proceeds by searching the
neighborhood of the solution for a more suitable solution. A neighbourhood structure
is constructed for all x ∈ S satisfying notions of ‘reachability’ (Yan & Mukai, 1992)
such that each solution in the feasible space can be reached from another solution x′, which may or may not be in the neighborhood N(x) of x. Further, the neighborhood structure must also satisfy a condition of symmetry: if x′ ∈ N(x) then x ∈ N(x′). From a given estimate of the solution x, the next estimate is identified by checking whether every replicate value of h(x′) (x′ ∈ N(x)) generated by the simulation is greater than (for a maximization problem) a sample from a uniform random variable U(a, b) (a < b), where (a, b) typically encompasses the range of possible values of h(x). Hence U(a, b) becomes the stochastic ruler. Such a comparison is performed a maximum of Mk times for each x′ ∈ N(x). In the kth iteration, an x′k ∈ N(xk) is set to be the next estimate xk+1 of the optimal solution x∗ if every one of the Mk replicate values of the objective function is greater than every one of the corresponding Mk samples of the stochastic ruler U(a, b). The x′k ∈ N(xk) are selected from N(xk) with probability 1/|N(xk)|. Note that Mk is specified to be a nondecreasing function of k. While the algorithm converges in
probability to the global optimum, in practice, it is terminated when a preset compu-
tational budget is exhausted, or an acceptable increase (for a maximization problem)
in the objective function value is attained. The specific steps involved in the algorithm
are described below. All assumptions and definitions associated with the method can
be found in Yan and Mukai (1992).
Initialization. Construction of neighborhood structure N (x) for x ∈ S, specification
of nondecreasing Mk , U (a, b) and initial solution x0 , and setting k := 0.
Step 1. For xk = x, choose a new candidate solution z from the neighborhood N(x) with probability P{z | x} = 1/|N(x)|, z ∈ N(x).
Step 2. For z, generate one replicate value h(z). Then generate a single realization u from U(a, b). If h(z) < u, then let xk+1 = xk and go to Step 3. If not, generate another replicate value h(z) and another realization u from U(a, b). If h(z) < u, then let xk+1 = xk and go to Step 3. Otherwise, continue to generate replicates h(z) and realizations u from U(a, b) and conduct the comparisons. If none of the Mk tests yields h(z) < u (that is, h(z) exceeds the ruler sample in every one of the Mk comparisons), then accept z as the next candidate solution and set xk+1 = z.
Step 3. Set k = k + 1 and go to Step 1.
Note that in addition to applying the original version of the SR method (described
above), we also applied a recent modification proposed by Ramamohan et al. (2020)
to the hyperparameter optimization problem. The modification involves requiring that
only a fraction α (0 < α < 1, with α ideally set to be greater than 0.5) of the Mk tests
be successful for picking a candidate solution from the neighborhood of the current
solution as the next estimate of the optimal solution, unlike the original version where
every one of the Mk tests is required to be successful. The advantage of the modification
is that it prevents promising candidate solutions in the neighborhood of the current
solution from being rejected as the next estimate of the optimal solution if they fail only a small fraction of the Mk tests (e.g., a single test).
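The sketch below illustrates both versions in Python for a maximization problem. The function name and its arguments, the omission of early stopping within the Mk tests, and the use of a ceiling to obtain an integer Mk are our own simplifying assumptions; alpha = 1 recovers the original method of Yan and Mukai (1992), while 0.5 < alpha < 1 corresponds to the modification of Ramamohan et al. (2020).

import numpy as np

def stochastic_ruler(simulate, neighbors, x0, a=0.0, b=1.0, alpha=1.0,
                     max_stages=1000, rng=None):
    # simulate(x) returns one replicate value of h(x); neighbors(x) returns the list N(x).
    rng = rng or np.random.default_rng()
    x = x0
    for k in range(max_stages):
        Mk = int(np.ceil(np.log(k + 10) / np.log(5)))   # nondecreasing test count M_k
        nbrs = neighbors(x)
        z = nbrs[rng.integers(len(nbrs))]               # candidate chosen w.p. 1/|N(x)|
        wins = sum(simulate(z) > rng.uniform(a, b)      # compare h(z) against the ruler U(a, b)
                   for _ in range(Mk))
        if wins >= alpha * Mk:                          # accept z as x_{k+1} if enough tests succeed
            x = z                                       # otherwise x_{k+1} = x_k (no change)
    return x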
We now describe the application of the stochastic ruler method to various types of
hyperparameter optimization problems.

4.1. Computational Experiments


In this section, we demonstrate the application of the stochastic ruler (SR) method
to the problem of optimizing the hyperparameters of a variety of neural network ma-
chine learning methods and problems: for standard feedforward NNs as described
in Section 3, and for ‘deep’ neural network models such as long short term memory
(LSTM) networks for time series prediction and convolutional neural networks (CNNs)
for image classification.
All computational experiments described in this section were executed on the public
version of the Google Colaboratory platform, with a 2.2 gigaHertz clock cycle speed
CPU and 12 gigabytes of memory.
In the first set of computational experiments, we consider two types of stopping
criteria for the SR method. First, we consider an ‘optimal performance’ based criterion:
we identify the optimal solution for the hyperparameter optimization problem under
consideration on an a priori basis by either applying the KN method or running
the stochastic ruler method itself for a large duration and terminating the method
when it remains in the same neighborhood for a sufficiently long duration. The SR
method is then terminated when it reaches this predetermined optimal solution. The
second termination criterion we consider resembles a more practical situation when the
optimal solution is unknown, and involves terminating the SR method when a preset
computational budget is exhausted or after a sufficient improvement in the objective
function value is attained.
Further, we employ two types of neighborhood structures for the solution spaces of
the hyperparameter optimization problems considered in this paper. The first involves
considering the entire feasible solution space, except for the solution under considera-
tion, to be the neighborhood of each solution. That is, for x ∈ S, N (x) = S − {x}. We
refer to this neighborhood structure as N1(SR) . The second neighborhood structure
involves considering points in the set of allowable values for each hyperparameter that
are adjacent to the hyperparameter value under consideration as neighbors, where the
elements of the set are ordered according to some criterion. For example, for the ith
hyperparameter taking value xij in its set of allowable values Si , its neighborhood is
constructed as N(xij) = {xi,j+l | xi,j+l ∈ Si; l ∈ {−1, 0, 1}}, where xij denotes the jth element of Si. The neighborhood of a solution x is then constructed as the Cartesian product of the N(xi), excluding x itself. That is, N(x) = ∏_{i=1}^{N} N(xi) − {x}, where the dimension of the hyperparameter search space is N. We refer to this neighborhood structure as N2(SR) . Neighborhoods for hy-
perparameters at the boundary of their ordered set of allowable values are constructed
by just taking the corresponding value from the beginning or end of the set. For exam-
ple, for the ith hyperparameter with set of allowable values Si , where |Si | = m, then
N (xim ) = {xi,m−1 , xim , xi1 }.
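A minimal sketch of constructing this N2(SR) neighborhood in Python is given below; the function name and the small illustrative search space are our own assumptions.

from itertools import product

def n2_neighborhood(x, allowable):
    # x is a tuple of hyperparameter values; allowable is a list of ordered value
    # lists, one per hyperparameter. Adjacent values (wrapping at the boundaries)
    # are combined via a Cartesian product, and x itself is excluded.
    per_dim = []
    for xi, Si in zip(x, allowable):
        j = Si.index(xi)
        per_dim.append({Si[(j - 1) % len(Si)], Si[j], Si[(j + 1) % len(Si)]})
    return [cand for cand in product(*per_dim) if cand != x]

# Example with a small three-hyperparameter search space
allowable = [[2, 4, 6, 8, 10], [1, 2, 3, 4, 5], [0.001, 0.004, 0.007, 0.01]]
print(n2_neighborhood((6, 3, 0.004), allowable))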
For all hyperparameter optimization problems discussed in this section, we set Mk = ln(k + 10) / ln 5.
We begin by demonstrating the implementation of the SR method with the first
termination criterion - i.e., the optimal performance criterion. We consider four
hyperparameter optimization problems as part of this, and these are listed below.

Optimal Performance Termination Criterion: Demonstration.

(1) We use a feedforward ANN with two hidden layers to solve the breast cancer
classification problem described in Section 3.1. We consider the following
hyperparameters for the problem, and apply the N2(SR) neighborhood structure
for the solution space. We set U (a, b) = U (0, 1).

Neurons in hidden layer 1 ∈ {2, 4, 6, 8, 10}


Neurons in hidden layer 2 ∈ {1, 2, 3, 4, 5}
Learning rate ∈ {0.001, 0.004, 0.007, 0.01}

(2) We considered a larger hyperparameter space for the feedforward ANN,
considering the activation function and learning rate type hyperparameters as
well. We used the N2(SR) neighborhood structure, and set U (a, b) = U (0, 1).

Neurons in hidden layer 2 ∈ {2, 4, 6, 8, 10}


Neurons in hidden layer 1 ∈ {1, 2, 3, 4, 5}
Learning rate ∈ {0.001, 0.004, 0.007, 0.01}
Activation function ∈ {relu, tanh, logistic}
Learning rate type ∈ {constant, invscaling, adaptive}

(3) We then considered a more complex ‘deep’ neural network model - LSTMs, which
are typically used for time series prediction (Hochreiter & Schmidhuber, 1997).
For this application of the SR method, we considered a different dataset. This is
because each independent variable that forms part of an LSTM feature is a time
series in itself. For example, if an LSTM model uses n features, then a single
sample in the training dataset for the model would be an n × t matrix, where
each of the n variables is a t-dimensional vector. Thus the entire dataset, if it
consists of m samples, will be three dimensional.
For our hyperparameter optimization model, we applied an LSTM neural
network model for stock price prediction. Thus this serves to demonstrate the
application of discrete simulation optimization methods for regression problems,
as opposed to the classification problems that we address in the remainder of
this manuscript. We used the Indian National Stock Exchange opening price
data for a single stock. We collected the opening stock prices (therefore, a
dataset with a single feature) for this stock for a period of 700 days, and divided
the data into training and test datasets in a 5:2 ratio. The opening stock price
data for a period of 30 consecutive days is used as a single sample used to
predict the stock price on the 31st day. Thus a training set consisting of 500
30-day stock price sequences and corresponding regression labels (i.e., the stock
prices on the 31st days corresponding to each of the 500 30-day sequences)
was used to train the LSTM model. The trained LSTM model’s regression
accuracy was then estimated using the testing dataset; that is, using 200 30-day
sequences and their corresponding labels (the stock price on the 31st day). The
hyperparameters for the LSTM model are given below.

Neurons per layer ∈ {10, 50}


Dropout rate ∈ {0.2, 0.5}
Epochs ∈ {5,10,50}
Batch size ∈ {47, 97, 470}

The N2(SR) neighborhood structure along with an U (0, 1) stochastic ruler was
used. Note that many of the hyperparameters controlling the learning process of
LSTM models are different from those of the feedforward ANNs. This is due to
the substantial difference in the architecture of LSTMs when compared to feed-
forward ANNs. We do not describe in detail LSTM model architecture because
the focus of this study is demonstrating the application of random search simu-
lation optimization methods for deep learning models, and not the deep learning
models themselves.
(4) We also considered the application of the SR method to optimize the hypepa-
rameters of ANN models commonly used for image processing and computer
vision applications - convolutional neural networks (CNNs). CNNs typically
take images as input and assign the image into one of two or more predefined
categories. We used the MNIST handwritten digit recognition dataset for our
demonstration, which is a widely used dataset for benchmarking the perfor-
mance of CNN models (Deng, 2012). The classification problem here involves
classifying each image of a handwritten digit into one of 10 categories: that
is, whether it is one of {0, 1, 2, . . . , 9}. We used 400 samples from the MNIST
dataset available from the Scikit-learn machine learning repository for training
the CNN and 100 images for validation. Each sample in the dataset was a 28×28
pixel image. The N2(SR) neighborhood structure with U (a, b) = U (0, 0.08) was
used for this hyperparameter optimization problem. The hyperparameter set
for the problem is given below.

Neurons per layer ∈ {32, 64}


Epochs ∈ {3, 5}
Batch size ∈ {50, 100, 400, 800}
The results from applying the stochastic ruler method with the optimal performance
stopping criterion to the above four neural network hyperparameter optimization prob-
lems are summarized in Table 3. Note that ‘stages’ in Table 3 and in the remainder
of this article refer to the value of k from the description of the SR method - i.e., an
evaluation of a specific solution x ∈ S. The above problems are referred to as Problems
1 through 4 in Table 3.
Table 3. Stochastic ruler method for hyperparameter optimization: application to neural networks with
optimal performance stopping criterion.
α Stages Runtime (seconds)
Problem 1
1 1354 285
0.8 1017 212
Problem 2
1 3273 1463
0.8 2712 1045
Problem 3
1 13 88
0.8 13 83
Problem 4
1 10 83
0.8 9 80

We see from Table 3 that in all cases the SR method arrives at the optimal solu-
tion within reasonable runtimes. The runtimes for Problem 2 are significantly higher
because the number of hyperparameters is also correspondingly higher. We also see
that the modified version of the SR method (proposed by Ramamohan et al. (2020))
performs at least as well as the original method, and outperforms the original method
both in terms of runtimes and stages when the dimensionality of the search space is
also higher (Problems 1 and 2).
We now describe the implementation of the SR method for the more realistic
hyperparameter optimization case when the optimal solution is not known - that is,
when exhaustion of the computational budget is used as the termination criterion.

Computational Budget Termination Criterion: Demonstration.

We consider two hyperparameter optimization problems (Problems 5 and 6 be-
low) implemented using this termination criterion in this section, and discuss the
implementation of this criterion in more detail in the following section where
we discuss benchmarking the SR method against state-of-the-art hyperparameter
optimization packages.
The computational budgets for Problems 5 and 6 below are specified in terms of
maximum computational runtimes, measured in seconds. We measured the average
and standard deviation of the improvement in the classification accuracy when starting
with a randomly selected hyperparameter set from the search space. In these problems
also, we applied both the original as well as the modified versions of the SR method.
5. We considered the application of the feedforward ANN for the classification
problem as described for Problems 1 and 2. We used the N2(SR) neighborhood
structure, set U (a, b) = U (0, 1), and specified a computational budget of 1,500
seconds. The hyperparameter space is described below.

Neurons in hidden layer 2 ∈ {2, 4, 6, 8, 10}


Neurons in hidden layer 1 ∈ {1, 2, 3, 4, 5}
Learning rate ∈ {1e − 6, 5e − 6, 1e − 5, 4e − 5, 7e − 5, 1e − 4, 4e − 4, 7e − 4, 1e − 3}

6. We then considered the same hyperparameter optimization problem as in prob-
lem 5, but with a discrete-valued search space alone. This is in contrast to previ-
ous examples where we discretized the otherwise continuous-valued learning rate
search space. We used the N2(SR) neighborhood structure, set U (a, b) = U (0, 1)
and specified a computational budget of 250 seconds.
Neurons in hidden layer 2 ∈ {2, 4, 6, 8, 10}
Neurons in hidden layer 1 ∈ {1, 2, 3, 4, 5}

The results of applying the SR method to Problems 5 and 6 are provided in Table 4
below.
Table 4. Stochastic ruler method for hyperparameter optimization: application to neural networks with
computational budget based termination criterion.
α    Stages    Mean classification performance improvement (SD)
Problem 5, budget: 1,500 seconds
1      2699      0.4982 (0.083)
0.8    2649      0.4730 (0.107)
0.6    1946      0.4999 (0.084)
Problem 6, budget: 250 seconds
1      599       0.5747 (0.074)
0.8    620       0.5334 (0.024)
0.6    585       0.5581 (0.065)

We see from Table 4 that a significant improvement in classification accuracy is
observed on average within the computational budget. We see similar trends regarding
the performance of the original and modified versions of the SR algorithm, and for
these problems, we see that setting α = 0.6 also yields acceptable results.
We now discuss the benchmarking of the SR method with existing widely used
hyperparameter optimization packages. Before we do so, we would like to note that
the SR method is one of the simplest and one of the first developed random search
simulation optimization methods, and hence our use of this method for hyperparameter
optimization. Many other random search methods that can be used in a similar manner
have been developed, such as other modifications to the SR method itself (Alrefaei &
Andradóttir, 2001, 2005), COMPASS (Hong & Nelson, 2006), etc.

4.2. Benchmarking Stochastic Ruler results
We benchmark the SR method against hyperparameter optimization algorithms in the
hyperopt and mango packages. Specifically, we benchmark the SR method against the
TPE algorithm discussed in Section 3.2, the RS algorithm, and the continuous RS
algorithm from the hyperopt package. The hyperopt continuous RS algorithm is the
continuous version of the hyperopt RS method (Bergstra & Bengio, 2012) and, unlike
its discrete counterpart, does not discretize the search space.
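
For reference, a benchmark run with hyperopt can be set up roughly as sketched below.
The search space mirrors Problem 5 of the previous section; the digits dataset and
MLPClassifier from scikit-learn stand in for the actual dataset and feedforward ANN
used in our experiments, so the snippet is an illustrative sketch rather than our exact
benchmarking code.

from hyperopt import fmin, hp, rand, tpe
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder objective: train a small feedforward ANN and return its test accuracy.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def train_and_score(hidden1, hidden2, lr):
    clf = MLPClassifier(hidden_layer_sizes=(hidden1, hidden2),
                        learning_rate_init=lr, max_iter=200)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

# Discrete search space mirroring Problem 5.
space = {
    "hidden1": hp.choice("hidden1", [1, 2, 3, 4, 5]),
    "hidden2": hp.choice("hidden2", [2, 4, 6, 8, 10]),
    "lr": hp.choice("lr", [1e-6, 5e-6, 1e-5, 4e-5, 7e-5, 1e-4, 4e-4, 7e-4, 1e-3]),
}

def objective(params):
    return -train_and_score(**params)    # hyperopt minimizes, so negate the accuracy

best_tpe = fmin(objective, space, algo=tpe.suggest, max_evals=200)   # TPE
best_rs = fmin(objective, space, algo=rand.suggest, max_evals=200)   # discrete RS

For the continuous RS variant, the learning rate entry can instead be defined with a
continuous prior such as hp.loguniform.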
mango is a recently developed hyperparameter optimization package that implements
a state-of-the-art batch Gaussian process bandit search algorithm (Sandha et al.,
2020). The algorithm implements an adaptive exploration vs. exploitation trade-off
search strategy, parameterized by the size of the solution space and the parallel batch
size, among other factors.
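
A corresponding mango run can be configured roughly as follows. The configuration key
num_iteration and the scheduler.serial decorator reflect our reading of the mango
documentation and should be checked against the package; train_and_score is the same
placeholder objective sketched above for hyperopt.

from mango import Tuner, scheduler

param_space = {
    "hidden1": [1, 2, 3, 4, 5],
    "hidden2": [2, 4, 6, 8, 10],
    "lr": [1e-6, 5e-6, 1e-5, 4e-5, 7e-5, 1e-4, 4e-4, 7e-4, 1e-3],
}

@scheduler.serial
def objective(**params):
    return train_and_score(**params)     # mango maximizes the returned value

tuner = Tuner(param_space, objective, {"num_iteration": 200})
results = tuner.maximize()
print(results["best_params"], results["best_objective"])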
The benchmarking process that we implement for the SR method is similar to that
followed for the KN method. However, the KN method was allowed to run its course,
given that the solution space of the hyperparameter optimization problem explored in
its benchmarking process was tractable with the computational infrastructure at hand.
The SR method only promises asymptotic convergence (in probability) to the optimal
solution, and hence in practical hyperparameter optimization settings it must be
implemented with a computational budget specified in terms of runtime or function
evaluations. Given that we have already demonstrated the runtime-based version of this
termination criterion in the previous section, we adopt the latter approach - a budget
on function evaluations - in this benchmarking exercise. Thus we specify computational
budgets of 100, 200 and 300 function evaluations and compare the performance of the
SR method to that of the hyperopt and mango methods for each of these cases. All
benchmarking experiments were carried out taking Problem 5, described in the previous
section, as the test case. Second, we use two parameterizations of U(a, b): U(0.5, 1)
and U(0.7, 1), and provide benchmarking results for both cases. In each case, the SR
method was parameterized with a randomly chosen initial solution.
After identifying the hyperparameter set that each method (SR or the benchmark
method) converges to within the computational budget, we conduct Student’s t-tests
to check for statistically significant differences between the means of the classification
accuracies from both methods. We conduct 10 such tests (by changing random number
seeds appropriately, and generating a different set of training and testing dataset
permutations corresponding to each random number seed) for each comparison. We
then report the number of times the null hypothesis (both means are equal) was
rejected, and in such cases, which method’s mean was higher. We do not conduct such
a rigorous evaluation for the benchmarking of the KN method because of the nature
of the method - if it runs to completion (i.e., the set of solutions is iteratively reduced
to a singleton set with the optimal solution), the optimal solution is identified with a
built-in statistical guarantee.
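
The comparison procedure can be summarized by the sketch below. Here
retrain_and_score_sr and retrain_and_score_benchmark are placeholders that, for a given
random number seed, regenerate the training and testing dataset permutation and return
arrays of replicated classification accuracies for the hyperparameter sets selected by
the SR method and by the benchmark method, respectively.

import numpy as np
from scipy import stats

def compare_selected_sets(retrain_and_score_sr, retrain_and_score_benchmark,
                          n_seeds=10, alpha=0.05):
    # For each seed, test whether the mean accuracies of the two selected
    # hyperparameter sets differ; record the winner whenever the null
    # hypothesis of equal means is rejected.
    winners = []
    for seed in range(n_seeds):
        acc_sr = np.asarray(retrain_and_score_sr(seed))
        acc_bm = np.asarray(retrain_and_score_benchmark(seed))
        _, p_value = stats.ttest_ind(acc_sr, acc_bm)
        if p_value < alpha:
            winners.append("SR" if acc_sr.mean() > acc_bm.mean() else "benchmark")
    # Proportion of tests with p < alpha (as reported in Table 5), plus winners.
    return len(winners) / n_seeds, winners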
The results from the benchmarking experiments are provided in Table 5 below.
Overall, we observe from Table 5 that the SR method offers comparable performance
with respect to the hyperopt and mango hyperparameter optimization methods. In
particular, we consistently observe that with higher computational budgets (200 func-
tion evaluations and above), the SR method identifies optimal hyperparameter sets
that offer classification accuracies that are statistically comparable to every hyperopt
and mango method evaluated. The advantage that simulation optimization methods
offer over hyperopt’s RS method is clear.

Table 5. Benchmarking of the stochastic ruler method against the hyperopt and mango hyperparameter
optimization packages. Notes. Max. = maximum; SR = stochastic ruler; TPE = tree of Parzen estimators; RS
= random search; N/A = not applicable.

Stochastic ruler versus:    Max. function   Proportion of tests   If null rejected, which model
                            evaluations     with p < 0.05         has higher mean performance?
U(a, b) = U(0.7, 1.0)
hyperopt TPE                100             0.3                   hyperopt TPE
                            200             0.0                   N/A
                            300             0.0                   N/A
hyperopt RS                 100             0.7                   SR
                            200             1.0                   SR
hyperopt continuous RS      100             0.2                   hyperopt continuous RS
                            200             0.0                   N/A
mango                       200             0.0                   N/A
U(a, b) = U(0.5, 1.0)
hyperopt TPE                100             0.4                   hyperopt TPE
                            200             0.2                   hyperopt TPE
                            300             0.0                   N/A
hyperopt RS                 100             0.6                   SR
                            200             0.8                   SR
hyperopt continuous RS      100             0.4                   hyperopt continuous RS
                            200             0.3                   SR
mango                       200             0.2                   mango
                            300             0.0                   N/A

We also see that even for smaller computational budgets (100 function evaluations),
the proportion of tests where the hyperopt or mango methods appear to perform better
is less than 0.5. Finally, we note that a more ‘informatively’ parameterized stochastic
ruler - that is, U(0.7, 1.0), which reflects the information that the classification
performance with most hyperparameter sets is unlikely to fall below 70% - appears to
lead to better performance for the SR method.

5. Concluding Remarks

In this work, we demonstrate the use of simulation optimization algorithms - in
particular, those developed for discrete search spaces - for selecting hyperparameter
sets that maximize the performance of machine learning methods. We provide a
comprehensive demonstration of the use of two discrete simulation optimization methods
in particular: (a) the KN ranking and selection method, which considers every single
solution in the discrete solution space and provides a statistical guarantee of
selecting the optimal hyperparameter set, and (b) the stochastic ruler random search
method, which moves between neighborhoods of promising candidate solutions. We
demonstrate the use of these methods across a wide variety of machine learning methods,
datasets and their associated hyperparameter sets - we consider both standard machine
learning methods such as support vector machines as well as neural network based ‘deep’
learning models.
Through our computational experiments, we provide evidence that the KN and
stochastic ruler methods perform at least as well as state-of-the-art hyperparameter
tuning packages such as hyperopt and mango. More specifically, we show consistently
better performance when compared to hyperopt’s implementations of both the discrete
and continuous random search (Bergstra & Bengio, 2012) hyperparameter tuning methods.
With the KN method, we observe statistical evidence of superior performance with
respect to more sophisticated methods such as hyperopt’s Tree of Parzen Estimators
algorithm, and with the stochastic ruler method, we observe statistically comparable
performance. We also observe statistically comparable performance for the stochastic
ruler method with respect to the relatively recently developed mango Gaussian process
bandit search method. We note that such performance is observed even though our
implementations of the KN and stochastic ruler methods are not professionally optimized
from a software development standpoint.
Our work has a few limitations, which also suggest avenues for future research. As a
first step towards exploring the use of simulation optimization methods for
hyperparameter optimization, we have only considered discrete simulation optimization
methods. Investigating whether simulation optimization methods developed for continuous
solution spaces can be adapted to hyperparameter search spaces may therefore serve as
one avenue of future research. Similarly, we have only considered two relatively simple
discrete simulation optimization methods in this work: the KN and the stochastic ruler
methods. Another avenue of future research can thus involve the use of more advanced
discrete simulation optimization methods, including more advanced versions of the KN
and stochastic ruler methods themselves. For example, efficient ranking and selection
methods for parallel computing environments have been developed (Ni, Ciocan, Henderson,
& Hunter, 2017), and adapting these to hyperparameter optimization problems may prove
useful.
The dimensionality of the hyperparameter search spaces that we have considered
has also been small to moderate from the perspective of simulation optimization. For
example, the largest space we have considered is of five dimensions, and for this search
space, we find that the stochastic ruler method and its modification required approxi-
mately 15-25 minutes of computational runtime to arrive at the optimal hyperparam-
eter set. For most machine learning exercises, the number of realistically controllable
hyperparameters may be of this order; however, for spaces of higher dimensions (for
example, if more than 10-15 hyperparameters are required to be optimized), the use
of methods such as the adaptive hyperbox algorithm (Xu, Nelson, & Hong, 2013) may
be required.
Finally, while we have demonstrated the use of simulation optimization methods
for optimizing the hyperparameters of a reasonably wide variety of machine learning
methods, ranging from support vector machines to ‘deep’ neural network models such
as convolutional neural networks and long short-term memory networks, the application
of such methods to even more complex deep learning models, such as generative
adversarial networks, remains to be explored and offers another avenue of future
research.

References

Alrefaei, M. H., & Andradóttir, S. (2001). A modification of the stochastic ruler method
for discrete stochastic optimization. European Journal of Operational Research, 133 (1),
160–182.
Alrefaei, M. H., & Andradóttir, S. (2005). Discrete stochastic optimization using variants of
the stochastic ruler method. Naval Research Logistics (NRL), 52 (4), 344–360.
Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter
optimization. Advances in neural information processing systems, 24 .
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. The
Journal of Machine Learning Research, 13 (1), 281–305.
Bergstra, J., Yamins, D., & Cox, D. (2013). Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures. In International conference
on machine learning (pp. 115–123).
Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research.
IEEE Signal Processing Magazine, 29(6), 141–142.
Diaconis, P., & Freedman, D. (1980). Finite exchangeable sequences. The Annals of Probability,
8(4), 745–764. Retrieved from https://doi.org/10.1214/aop/1176994663
Eggensperger, K., Hutter, F., Hoos, H., & Leyton-Brown, K. (2015). Efficient benchmarking of
hyperparameter optimizers via surrogates. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 29).
Henderson, S. G., & Nelson, B. L. (2006). Handbooks in operations research and management
science: simulation. Elsevier.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation,
9 (8), 1735–1780.
Hong, L. J., & Nelson, B. L. (2006). Discrete optimization via simulation using COMPASS.
Operations Research, 54(1), 115–129.
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization
for general algorithm configuration. In International conference on learning and intelligent
optimization (pp. 507–523).
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects.
Science, 349 (6245), 255–260.
Kim, S.-H., & Nelson, B. L. (2001). A fully sequential procedure for indifference-zone selection
in simulation. ACM Transactions on Modeling and Computer Simulation (TOMACS),
11 (3), 251–273.

22
Kuhn, M., Johnson, K., et al. (2013). Applied predictive modeling (Vol. 26). Springer.
Lessmann, S., Stahlbock, R., & Crone, S. F. (2005). Optimizing hyperparameters of support
vector machines by genetic algorithms. In IC-AI (pp. 74–82).
Lorenzo, P. R., Nalepa, J., Kawulok, M., Ramos, L. S., & Pastor, J. R. (2017). Particle
swarm optimization for hyper-parameter selection in deep neural networks. In Proceed-
ings of the genetic and evolutionary computation conference (p. 481–488). New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/
3071178.3071208
Maclaurin, D., Duvenaud, D., & Adams, R. P. (2015). Gradient-based hyperparameter opti-
mization through reversible learning.
Nelson, B. L. (2010). Optimization via simulation over discrete decision variables. In Risk
and optimization in an uncertain world (pp. 193–207). Informs.
Ni, E. C., Ciocan, D. F., Henderson, S. G., & Hunter, S. R. (2017). Efficient ranking and
selection in parallel computing environments. Operations Research, 65 (3), 821–836.
O’Neill, B. (2009). Exchangeability, correlation, and Bayes’ effect. International Statistical
Review, 77(2), 241–250. Retrieved from https://onlinelibrary.wiley.com/doi/abs/
10.1111/j.1751-5823.2008.00059.x
Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. arXiv preprint
arXiv:1602.02355 .
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay,
E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research,
12 , 2825–2830.
Chen, P.-W., Wang, J.-Y., & Lee, H.-M. (2004). Model selection of SVMs using GA approach.
In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541)
(Vol. 3, pp. 2035–2040).
Ramamohan, V., Agrawal, U., & Goyal, M. (2020). A note on the stochastic ruler method
for discrete simulation optimization. arXiv. Retrieved from https://arxiv.org/abs/
2010.06909
Ressel, P. (1985). De Finetti-type theorems: An analytical approach. The Annals of Probability,
13, 898–922.
Sandha, S. S., Aggarwal, M., Fedorov, I., & Srivastava, M. (2020). Mango: A Python library
for parallel hyperparameter tuning. In ICASSP 2020 - 2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 3987–3991).
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical bayesian optimization of machine
learning algorithms. Advances in neural information processing systems, 25 , 2951–2959.
Xu, J., Nelson, B. L., & Hong, L. J. (2013). An adaptive hyperbox algorithm for high-
dimensional discrete optimization via simulation problems. INFORMS Journal on Com-
puting, 25 (1), 133–146.
Yan, D., & Mukai, H. (1992). Stochastic discrete optimization. SIAM Journal on Control and
Optimization, 30(3), 594–612.
Yang, L., & Shami, A. (2020). On hyperparameter optimization of machine learning algo-
rithms: Theory and practice. Neurocomputing, 415 , 295–316.

