
Computers and Chemical Engineering 108 (2018) 250–267

Contents lists available at ScienceDirect

Computers and Chemical Engineering


journal homepage: www.elsevier.com/locate/compchemeng

Review

Advances in surrogate based modeling, feasibility analysis, and optimization: A review

Atharv Bhosekar, Marianthi Ierapetritou ∗
Dept. of Chemical and Biochemical Engineering, Rutgers, The State University of New Jersey, 98 Brett Road, Piscataway 08901, United States

Article history: Received 5 June 2017; Received in revised form 5 September 2017; Accepted 19 September 2017; Available online 21 September 2017

Keywords: Surrogate models; Derivative-free optimization; Feasibility analysis; Sampling; Model selection

Abstract: The idea of using a simpler surrogate to represent a complex phenomenon has gained increasing popularity over the past three decades. Due to their ability to exploit the black-box nature of the problem and their attractive computational simplicity, surrogates have been studied by researchers in multiple scientific and engineering disciplines. Successful use of surrogates can result in significant savings in terms of computational time and resources. However, with a wide variety of approaches available in the literature, the correct choice of surrogate is a difficult task. An important aspect of this choice is based on the type of problem at hand. This paper reviews recent advances in the area of surrogate models for problems in modeling, feasibility analysis, and optimization. Two of the frequently used surrogates, radial basis functions and Kriging, are tested on a variety of test problems. Finally, guidelines for the choice of an appropriate surrogate model are discussed.

© 2017 Elsevier Ltd. All rights reserved.

Contents

1. Introduction
2. Surrogate models
   2.1. Linear regression
      2.1.1. Subset selection
      2.1.2. Regularization
   2.2. Support vector regression
   2.3. Radial basis functions
   2.4. Kriging
      2.4.1. Variants of Kriging
      2.4.2. Nugget effect
      2.4.3. Computational aspects of Kriging
   2.5. Mixture of surrogates
3. Derivative-free optimization and surrogates
   3.1. Model-based local DFO
   3.2. Model-based global DFO
4. Feasibility analysis and surrogates
5. Sampling
   5.1. Expected improvement function
   5.2. Bumpiness function
   5.3. Other approaches
6. Surrogate model validation
7. Software implementations
8. Computational results and discussion
   8.1. Test problems
   8.2. Test setup and basis for comparison
   8.3. Surrogate model settings
   8.4. Computational results
      8.4.1. Choice of Kriging regression term
      8.4.2. Choice of Kriging correlation model
      8.4.3. Effect of initial sample size
      8.4.4. Effect of sample design
9. Summary
Acknowledgement
Appendix A
References

∗ Corresponding author. E-mail address: mierapetritou@gmail.com (M. Ierapetritou).

http://dx.doi.org/10.1016/j.compchemeng.2017.09.017
0098-1354/© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

The problem discussed in the paper is assessing the performance of surrogates on the deterministic function f : R^d → R, where the input vector is X = (x1, x2, . . ., xd), d is the number of dimensions of the problem, and there is a single output y. The input vector X has known lower and upper bounds X^L ≤ X ≤ X^U. Additionally, some constraints fj ≤ 0, j ∈ J, where J is the set of all constraints, may be present. It is assumed that evaluation of the function as well as the constraints is computationally expensive and that the symbolic form of the function and that of one or more constraints is unknown. From this assumption, it follows that the analytical form of the derivatives is also unavailable. Surrogate modeling addresses this problem by obtaining a function f̂(X) that approximates the function f.

This problem occurs frequently in multiple engineering and scientific disciplines where complex computer simulations or physical experiments are used. In these cases, obtaining more data means additional experiments, which results in significant material or economic cost as well as highly non-trivial computational expense. As a result, it is difficult to obtain an analytical form of the objective function or that of the derivatives. Deriving this information from the surrogate f̂(x) is relatively easier because its analytical form is known and it is cheaper to evaluate. Several applications of surrogates to address this type of problem can be found in the literature. For example, (Anthony et al., 1997) and (Balabanov and Haftka, 1998) use polynomial, linear response surfaces in aircraft design. Artificial neural networks (ANN) are used for process modeling (Meert and Rijckaert, 1998), process control (Bloch and Denoeux, 2003), (Mujtaba et al., 2006), and for optimization (Fernandes, 2006), (Henao and Maravelias, 2011). Kriging is used for process flowsheet simulations (Palmer and Realff, 2002), design simulations (Yang et al., 2005), (Prebeg et al., 2014), pharmaceutical process simulations (Jia et al., 2009), and feasibility analysis (Rogers and Ierapetritou, 2015). Radial basis functions (RBF) are used for feasibility analysis (Wang and Ierapetritou, 2016) and parameter estimation (Müller et al., 2015). It can be observed from the applications listed above that there are multiple approaches proposed in the literature to obtain a surrogate.

Several prior reviews discuss these approaches and related developments in the field of surrogate models. Surrogate models and their potential use in simulations are discussed by (Barton, 1992), covering polynomial response surfaces, spline interpolation, radial basis functions, regression models, and Kriging surrogates. With the focus on modeling and prediction for engineering design, (Simpson et al., 1997) review stationary sampling designs, polynomial response surface methods, Kriging and robust methods. (Jin et al., 2001) studied the performance of polynomial regression, multivariate adaptive regression splines, radial basis functions, and Kriging surrogates under multiple criteria such as efficiency, robustness and model simplicity. Motivated from applications in aerospace systems, (Queipo et al., 2005) discuss surrogate based optimization and sensitivity analysis, sampling strategies and surrogate model validation. (Barton and Meckesheimer, 2006) discuss surrogates for guiding optimization of simulations. In this context of guiding the search towards the optimum, they classify surrogates as local surrogates that are updated within an iterative framework and global surrogates that are fitted only once, with the search proceeding using the same surrogate thereafter. For the purposes of design optimization, (Wang and Shan, 2007) provide an overview of surrogate models. Their focus is mainly on solving optimization problems such as global optimization, multi-objective optimization, and probabilistic design optimization. Motivated from computationally intensive aerospace designs, (Forrester and Keane, 2009) discuss details of surrogate modeling methodology focusing on sampling, surrogate model building, and validation. They discuss surrogates such as polynomial interpolation, RBF, Kriging and support vector regression and their advantages and disadvantages for achieving better prediction accuracy. (Razavi et al., 2012) investigate the potential of surrogate modeling techniques with a focus on the use of surrogates in water resources applications and provide an excellent review of the use of surrogates in that field. (Nippgen et al., 2016) review surrogate modeling strategies from a broader point of view by classifying them as data-driven, projection-based and multi-fidelity surrogate modeling strategies, focusing on the potential of using surrogates for applications in groundwater modeling. (Haftka et al., 2016) discuss in detail several strategies for global optimization using surrogates and criteria for local and global searches from the point of view of parallelization. It is important to note that, with respect to applications, the problems requiring surrogates can be classified into three classes. The first class of problems is the most fundamental use of surrogates, i.e. prediction and modeling. The second class of problems is commonly known as derivative-free optimization (DFO), where the objective function to be optimized is expensive and thus derivative information is unavailable. The third class of problems is feasibility analysis, where the objective is also to satisfy design constraints. Prior reviews discuss applications of surrogate models for only one or two of these three classes. This review emphasizes that there is a significant difference between using surrogates for each of these three classes of problems and provides a comprehensive understanding of surrogate models for all three classes of problems mentioned. There has been a growing interest in model selection methodologies for regression models where the aim is to choose the best model from a given set of models. This problem has many practical uses in cases where the surrogate model does not generalize well on the test set, a phenomenon commonly known as overfitting, or when there are too many input variables that might contain redundant information. In such cases, it is important to select the most relevant variables in order to build simple yet effective surrogate models. Even though model selection has been popular in the field of statistics for over 50 years, prior reviews in the context of surrogate modeling do not address this issue.

As avoiding overfitting is an important aspect to consider while building surrogate models, an extensive review of model selection strategies is provided.

The rest of this paper is organized as follows. Section 2 discusses different types of surrogates and their underlying mathematical formulations. Sections 3 and 4 describe surrogates in relation to DFO problems and feasibility analysis, respectively. Section 5 is devoted to sampling, which is an important component of building the right surrogate model. Section 6 describes approaches for validating surrogates. Section 7 describes some of the existing software implementations. In Section 8, a detailed comparison of the performance of RBF and Kriging surrogates on a set of 47 test problems is provided, whereas Section 9 provides a summary of the manuscript.

2. Surrogate models

In this section, frequently used approaches for obtaining the surrogate f̂(x) are discussed with a focus on recent advances. The models that are designed to yield unbiased predictions at the sampled data are referred to as interpolation models, whereas models that are built by minimizing the error between given data and model prediction under a certain criterion are referred to as regression models. In this section, regression models such as linear regression and support vector regression are discussed, followed by interpolation models such as RBF and Kriging. Finally, approaches utilizing more than one of these surrogates are discussed. With their ability to provide a quantitative measure of uncertainty in prediction, RBF and Kriging surrogates are the most popular choices for optimization and feasibility analysis. Therefore, special emphasis is given to these surrogates.

2.1. Linear regression

This is a commonly used approach where a surrogate is represented as a linear combination of the input variables as given by Eq. (1).

f̂(x) = w_0 + Σ_{j=1}^{d} x_j w_j    (1)

where x is a vector of size d; d is the number of variables; w is the vector of length d + 1. To obtain the weight vector, the sum of squared errors between the actual data and the surrogate predicted value is minimized. The unconstrained minimization problem can be formulated as given by Eq. (2).

min_w ||Xw − y||_2^2    (2)

where X is a matrix of size n by d + 1, where n is the number of sample points, all elements in the first column of X are 1, and columns 2 through d + 1 correspond to the input vector; y is a vector of size n that represents function values at the sample points. For the case of ordinary least squares, the solution in analytical form is w = (X^T X)^{-1} X^T y. When one or more of the independent variables are perfectly correlated, the matrix X^T X becomes near singular. As a result, the coefficients w are not uniquely defined. This kind of rank deficiency can occur in high dimensional problems where the number of data points is less than the number of variables. This is usually addressed by reducing the number of variables by screening or by utilizing regularization techniques. As the number of variables (d) in this problem increases, either inherently from the problem or from a combination of existing variables, this system is susceptible to producing high variance. Even though addition of extra variables leads to low bias on the data points used for building the model, high variance makes it difficult to have better predictions on new data points. This phenomenon is known as overfitting. To avoid this issue, the effect of unnecessary variables is either removed using subset selection or suppressed using regularization. Subset selection and regularization strategies are explained below.
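Before turning to those strategies, the least-squares fit of Eqs. (1)-(2) can be made concrete with a short sketch. The snippet below is illustrative only: the data are synthetic, and NumPy's least-squares routine is used instead of forming (X^T X)^{-1} explicitly, which is numerically safer when X^T X is near singular.

```python
import numpy as np

# synthetic data: n sample points in d dimensions (illustrative only)
rng = np.random.default_rng(0)
n, d = 50, 3
X_raw = rng.uniform(-1.0, 1.0, size=(n, d))
y = 1.0 + X_raw @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(n)

# design matrix of size n x (d+1): first column of ones, columns 2..d+1 are the inputs
X = np.hstack([np.ones((n, 1)), X_raw])

# solve min_w ||Xw - y||_2^2 (Eq. (2)) without forming (X^T X)^-1 explicitly
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def f_hat(x):
    """Linear surrogate of Eq. (1): w0 + sum_j x_j w_j."""
    return w[0] + np.dot(w[1:], x)

print(w, f_hat(np.array([0.2, -0.3, 0.1])))
```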
2.1.1. Subset selection
Subset selection refers to addressing the trade-off between prediction error and regression model complexity by selecting a subset of variables. This step is followed by least squares regression for determining the coefficients of the regression model. Several approaches for subset selection exist in the literature; they are classified here as exhaustive search methods, heuristic methods, methods based on integer programming, methods relying on model fitness measures, Bayesian variable selection methods, and methods based on analyzing correlations between input variables and the output.

Exhaustive search methods try to exhaustively explore all possible subsets of features and select the subset with minimum prediction error. An advantage of exhaustive search is that a number of regression models are obtained with comparable prediction accuracy. Even though these methods guarantee the selection of the best possible model, the computational complexity of exhaustive search increases rapidly as the number of subsets increases. An implementation of this approach is the leaps and bounds algorithm (Furnival and Wilson, 1974).

Heuristic methods try to overcome this drawback by using greedy approaches such as forward-stepwise regression, backward-stepwise regression, and forward-stagewise regression. In forward-stepwise regression, variable selection starts from an empty set of variables and proceeds by sequentially adding the variable that improves the fit by the largest magnitude. Improvement in the fit is usually measured using the F-statistic. Based on the sum of squared errors, the F-statistic quantifies the improvement achieved by addition of a new variable. Backward-stepwise regression is the opposite approach: it starts from including all variables and sequentially removes the variables that have the least impact on the fit. Forward-stagewise regression is similar to forward-stepwise regression; however, in this case, only the coefficient of the newly added variable is adjusted, keeping the other coefficients constant.

Approaches that use integer programming for subset selection formulate the subset selection problem as an optimization problem. In these formulations, an error measure (EM) is minimized subject to constraints that ensure subset selection. One way to impose such a constraint is by having an upper bound on the number of nonzero entries (Konno and Yamamoto, 2009a). In addition to limiting the number of nonzero entries, these formulations can be adapted to ensure statistical properties such as robustness, selective and general sparsity of the model (Bertsimas and King, 2016). These approaches, however, need a prespecified value for the number of variables that might not be known a priori. A review of these approaches can be found in (Liu and Motoda, 2007). For a known number of variables to be selected, an example of a problem formulation is given by Eqs. (3)-(6) (Konno and Yamamoto, 2009b).

min EM    (3)
s.t.
Σ_{l=1}^{d} z_l = k    (4)
w_l^L z_l ≤ w_l ≤ w_l^U z_l,  l = 1, . . ., d    (5)
z_l ∈ {0, 1},  l = 1, . . ., d    (6)

where z_l is a binary variable for selection of variable l; k is the number of variables to be selected; for the coefficient w_l in the regression model, w_l^L and w_l^U represent the lower and upper bounds respectively. Eq. (4) limits the number of nonzero coefficients used in the model. Eq. (5) imposes bounds on w_l and forces w_l to be 0 when z_l is 0.
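For small d, the effect of Eqs. (3)-(6) can be reproduced by brute force: enumerate all subsets of size k, fit each by least squares, and keep the subset with the smallest error measure. The sketch below is a toy illustration of that idea, not the integer-programming implementation of Konno and Yamamoto; the data, function names, and the use of the sum of squared errors as EM are choices made here for illustration.

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exhaustive best-subset selection of k columns of X,
    scored by the sum of squared errors as the error measure EM."""
    n, d = X.shape
    best = (np.inf, None, None)
    for subset in itertools.combinations(range(d), k):
        Xs = X[:, subset]
        w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        sse = float(np.sum((Xs @ w - y) ** 2))   # EM in Eq. (3)
        if sse < best[0]:
            best = (sse, subset, w)
    return best

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(40)
sse, subset, w = best_subset(X, y, k=2)
print(subset, w, sse)   # expected to recover columns (0, 3)
```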

Table 1
Model fitness measures.

Model fitness measure | Definition
AICc | N log((1/N) Σ_{i=1}^{N} (y_i − X_i w)^2) + 2p + 2p(p+1)/(N − p − 1)
HQIC | N log((1/N) Σ_{i=1}^{N} (y_i − X_i w)^2) + 2p log(log(N))
BICc | Σ_{i=1}^{N} (y_i − X_i w)^2 / σ̂^2 + p log(N)
RIC | Σ_{i=1}^{N} (y_i − X_i w)^2 / σ̂^2 + 2p log(k)
Cp | Σ_{i=1}^{N} (y_i − X_i w)^2 / σ̂^2 + 2p − N

Methods utilizing model fitness measures tackle the issue of prespecifying the number of selected variables by including a penalty for the number of nonzero variables. In this way, these methods address the tradeoff between model complexity and prediction accuracy. Several fitness measures exist in the literature. One such measure is the mean absolute error (MAE); an algorithm to minimize MAE is proposed and used for finding a subset of variables (Park and Klabjan, 2013). Other such measures include Mallows' Cp (Mallows, 1973), the Akaike information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978), the Hannan-Quinn information criterion (HQIC) (Hannan and Quinn, 1979), the risk inflation criterion (RIC) (Foster and George, 1994), and the mean squared error (MSE). These measures are shown in Table 1. These fitness measures can be used as EM in Eq. (3) to form a mixed integer quadratic program (Wilson and Sahinidis, 2017), (Cozad et al., 2014). AIC is based on the idea of minimizing the discrepancy between the original distribution of the data and the distribution given by the linear regression model. A well-known discrepancy measure called the Kullback-Leibler divergence is used. Other discrepancy measures include the Kolmogorov-Smirnov and Hellinger discrepancies (Linhart and Zucchini, 1986). AICc represents a correction term added to AIC for finite sample sizes (Hurvich and Tsai, 1989). Cp simply tries to minimize prediction error where mean squared error is the error measure. BICc seeks to maximize the approximate posterior probability. These metrics are tabulated in Table 1, where p < k is the number of coefficients, N is the number of sampled points, and σ̂^2 is an estimate of the error variance.

The Bayesian approach models the uncertainty over the unknowns in a surrogate model using probability theory, treating those unknowns as random variables. The probability distribution that represents this uncertainty before obtaining samples is referred to as the prior distribution and that after obtaining samples is referred to as the posterior distribution. Suppose that there are M models under consideration, where the ith model is represented by f̂_i and the unknowns corresponding to each model are represented as θ_m. The aim is to select the model with the highest posterior probability, which is given by Eq. (7).

Pr(f̂_m | X) = Pr(X | f̂_m) Pr(f̂_m) / Σ_{k ∈ M} Pr(X | f̂_k) Pr(f̂_k)    (7)

where

Pr(X | f̂_k) = ∫ Pr(X | θ_k, f̂_k) Pr(θ_k | f̂_k) dθ_k

and X is the sampled data set and θ_k represents the unknowns in surrogate model f̂_k. The Bayesian variable selection problem is usually considered as a special case of the model selection problem where each model consists of a subset of variables. It should be noted that for comparison between two candidate models, the denominator on the right-hand side of Eq. (7) is the same and therefore only the numerator is usually considered for comparison. Finding the model with the highest posterior probability is the fundamental motivation behind BIC.

Another class of subset selection approaches is the one that relies on learning the correlation between input variables and the output. One such method is sure independence screening (SIS), which relies on ranking input variables according to their marginal correlation with the output y. After standardizing the columns of matrix X, where each column corresponds to an input variable, a vector X^T y is obtained which directly signifies the marginal correlations of the input variables with the output. With this method, input variables having the least impact on the output can be filtered out. Another well-known approach involving a similar strategy of assessing the impact of an input variable by monitoring its correlation with the output is least angle regression (Efron et al., 2004). In this approach, coefficients are added in a similar fashion to forward-stepwise regression. However, instead of obtaining a least squares solution, the correlation with the output is monitored and new variables are added sequentially.

Finally, subset selection is extremely important especially when the number of input variables is much higher than the size of the available data set. A few of the approaches to address this class of problems include the Dantzig selector (Candes and Tao, 2007), adaptive lasso and sure independence screening (Fan and Lv, 2008). For high dimensional problems, (Cadima et al., 2004) review heuristic algorithms for subset selection. An extension of BIC for high dimensional problems known as extended BIC has also been proposed (Chen, 2014).

2.1.2. Regularization
Subset selection methods lead to a discrete decision of either accepting or discarding a certain variable. This leads to high variance in prediction and does not reduce the prediction error of the regression model. Regularization, in contrast, leads to a continuous reduction of the regression model coefficients.

Regularization penalizes the magnitude of the regression coefficients w to modify the problem given in Eq. (2) to the form given in Eq. (8) (Hastie et al., 2009).

min_w ||Xw − y||_2^2 + C ||w||_q    (8)

where

||w||_q = (Σ_{i=1}^{d} |w_i|^q)^{1/q}    (9)

and C is the parameter that decides the magnitude of regularization. Eq. (9) is the expression of the qth norm. The value of q has a significant effect on the properties of the regression model. Values 1 and 2 represent two commonly used variants of regularization known as lasso and ridge regression respectively (Seber and Lee, 2003). In contrast to ridge regression, lasso has the ability to set regression coefficients exactly to 0. Values of q between 1 and 2 describe a mix between the properties of lasso and ridge regression.
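As a quick illustration of Eq. (8), the sketch below fits ridge (q = 2) and lasso (q = 1) surrogates with scikit-learn. The data, the regularization strength alpha (playing the role of C), and the use of scikit-learn itself are choices made here for illustration and are not prescribed by the text.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 8))
# only three of the eight variables actually matter
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.5 * X[:, 5] + 0.1 * rng.standard_normal(60)

ridge = Ridge(alpha=1.0).fit(X, y)   # q = 2: shrinks all coefficients continuously
lasso = Lasso(alpha=0.1).fit(X, y)   # q = 1: can set coefficients exactly to zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))  # expect several exact zeros
```

The lasso output illustrates the variable-suppression behavior discussed above: coefficients of the irrelevant columns are driven exactly to zero, while ridge only shrinks them.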

Another approach to obtain a similar mix is known as elastic-net regression, where a linear combination of the lasso and ridge regression penalty terms is used.

min_w ||Xw − y||_2^2 + C Σ_{j=1}^{d} (α w_j^2 + (1 − α) |w_j|)    (10)

where α is a tuning parameter. Other extensions of lasso include the adaptive lasso (Zou, 2006), where the penalty term is a weighted summation in which the weight depends on the magnitude of the coefficient itself. Another approach for reducing the absolute value of the regression coefficients uses a non-negative garrotte estimator (Breiman, 1995). This estimator is obtained by scaling the coefficients of least squares regression. A penalty is associated with the scaling parameters and the problem is to find these scaling parameters. In this case, a closed form expression for these parameters is available as a function of the coefficients obtained using ordinary least squares.

2.2. Support vector regression

The Support Vector Regression (SVR) surrogates are represented as the weighted sum of basis functions added to a constant term. A general form of the SVR surrogate is given in Eq. (11).

f̂(X) = μ + Σ_{i=1}^{n} w_i ψ(X, X_i)    (11)

Assuming a simple basis function ψ(.) = X, the surrogate can be written as per Eq. (12).

f̂(x) = μ + w^T X    (12)

This form of the surrogate is similar to that of RBF as well as Kriging. However, the way the unknown parameters are calculated for this surrogate differs significantly from that of RBF and Kriging surrogates. The unknown parameters μ and w in the model are obtained by formulating a mathematical optimization problem given by Eqs. (13)-(16).

min (1/2) ||w||^2 + C (1/n) Σ_{i=1}^{n} (ξ^{+(i)} + ξ^{−(i)})    (13)
s.t.
w·x_i + μ − y_i ≤ ε + ξ^{−(i)}    (14)
y_i − w·x_i − μ ≤ ε + ξ^{+(i)}    (15)
ξ^{+(i)}, ξ^{−(i)} ≥ 0    (16)

Eqs. (14) and (15) allow the sample points to lie within a ±ε deviation from the value at the sampled points without affecting the surrogate model. This band of allowed deviation is referred to as the ε-insensitive tube. The slack variables ξ^{+(i)} and ξ^{−(i)} ensure feasibility of the problem by allowing outliers that do not fall within the ε-insensitive tube. The trade-off between model complexity and fit is achieved by penalizing outliers by a pre-defined constant C ≥ 0. The combined contribution of the model complexity and the penalty for outliers (Eq. (13)) is minimized.

The above-mentioned formulation is obtained under the assumption of a linear basis function. Using a different basis function might require determining additional hyper-parameters associated with that specific basis function. Details and mathematical derivations related to SVR can be found in (Smola and Schölkopf, 2004).

Finally, SVR has been shown to achieve accuracy comparable with that of other surrogates (Clarke et al., 2005). SVR models are accurate as well as fast in prediction; however, the time required to build the model is high because finding the unknown parameters requires solving a quadratic programming problem. This added complexity hinders the popularity of SVR (Forrester and Keane, 2009).
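A minimal usage sketch of an ε-insensitive SVR surrogate is shown below, using scikit-learn's quadratic-programming-based solver. The kernel, C, and epsilon values, as well as the test function, are illustrative choices rather than recommendations from the text.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(-2.0, 2.0, size=(80, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.standard_normal(80)

# C penalizes points outside the eps-insensitive tube (cf. Eqs. (13)-(16))
svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)

x_new = np.array([[0.3, -1.0]])
print(svr.predict(x_new))
```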

2.3. Radial basis functions

Given n distinct sample points, RBF surrogates can be represented as given in Eq. (17).

f̂(x) = Σ_{i=1}^{n} λ_i ψ(||x − x_i||_2) + p(x)    (17)

where λ_1, . . ., λ_n ∈ R are the weights to be determined; ||.|| is the Euclidean norm; ψ(.) is the basis function. There are several options for choosing the basis function ψ(.), as shown in Table 2. In the case of the multi-quadratic and Gaussian basis functions, r ≥ 0 and γ is a positive constant.

Table 2
Commonly used basis functions.

Type | Function ψ(.)
Linear | ψ(r) = r
Cubic | ψ(r) = r^3
Thin plate spline | ψ(r) = r^2 log(r)
Multi-quadratic | ψ(r) = sqrt(r^2 + γ^2)
Gaussian | ψ(r) = e^{−γ r^2}

There is no solid conclusion in the literature that decisively establishes one of these basis functions as better than the others. However, use of the cubic basis function with a linear tail has been found to be successful (Björkman and Holmström, 2000). It can be represented by Eq. (18).

f̂(x) = Σ_{i=1}^{n} λ_i ψ(||x − x_i||_2) + a^T x + b    (18)

The weights λ, a and b in Eq. (18) can be determined uniquely by solving the system of equations given by Eq. (19).

[ Φ    P ] [ λ ]   [ F ]
[ P^T  0 ] [ c ] = [ 0 ]    (19)

where Φ is an n by n matrix with Φ_ij = ψ(||x_j − x_i||_2); P is the n by (d + 1) matrix whose ith row is [x_i^T 1]; λ = (λ_1, . . ., λ_n)^T; c = (a_1, . . ., a_d, b)^T collects the coefficients of the linear tail; and F = (f(x_1), . . ., f(x_n))^T.

An extension of RBF for the purposes of global optimization, using a function called the bumpiness function (described in Section 5.2), is proposed in (Gutmann, 2001). Several variations of this approach are discussed in Section 5 (Björkman and Holmström, 2000), (Regis and Shoemaker, 2007).
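The augmented linear system in Eq. (19) can be assembled and solved directly. The sketch below uses the cubic basis with a linear tail (Eq. (18)); it is a bare-bones illustration on synthetic data and omits the safeguards (e.g. checks that the sample points are distinct and well-poised) a production implementation would need.

```python
import numpy as np

def fit_cubic_rbf(X, f):
    """Fit the RBF surrogate of Eq. (18): cubic basis plus linear tail.
    X is n x d (distinct sample points), f holds the n function values."""
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = dist ** 3                                   # cubic basis psi(r) = r^3
    P = np.hstack([X, np.ones((n, 1))])               # rows [x_i^T, 1]
    A = np.block([[Phi, P], [P.T, np.zeros((d + 1, d + 1))]])
    rhs = np.concatenate([f, np.zeros(d + 1)])
    sol = np.linalg.solve(A, rhs)                     # solve Eq. (19)
    lam, a, b = sol[:n], sol[n:n + d], sol[-1]

    def predict(x):
        r = np.linalg.norm(X - x, axis=1)
        return float(lam @ r ** 3 + a @ x + b)
    return predict

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
f = np.cos(3 * X[:, 0]) + X[:, 1] ** 2
f_hat = fit_cubic_rbf(X, f)
print(f_hat(np.array([0.1, 0.2])))
```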

2.4. Kriging

Kriging is based on the idea that a surrogate can be represented as a realization of a stochastic process. This idea was first proposed in the field of geostatistics by (Krige, 1951) and (Matheron, 1963). It gained popularity after being used for the design and analysis of computer experiments by (Sacks et al., 1989). Kriging is also known as Gaussian process regression in the field of machine learning (Rasmussen and Williams, 2006a).

A Kriging surrogate in its most general form can be formulated as:

f̂(x) = Σ_{j=1}^{m} β_j f_j(x) + ε(x)    (20)

where f_j(x) are m known independent basis functions that define the trend of the mean prediction at location x; β_j are unknown parameters; ε(x) is a random error at location x that is normally distributed with zero mean. The Kriging predictor is written as

f̂(x) = f(x)^T β* + r(x)^T γ*    (21)

where f(x) = [f_1(x), . . ., f_m(x)]^T; β* is the vector of generalized least-squares estimates of β = [β_1, . . ., β_m]^T; and r(x) is the correlation vector of size n x 1 between ε(x) and ε(x_i). β* and γ* are given as:

β* = (F^T R^{-1} F)^{-1} F^T R^{-1} y    (22)
γ* = R^{-1} (y − F β*)    (23)

where R is the covariance matrix of size n x n whose (i, j) element is the correlation between ε(x_i) and ε(x_j); F = [f(x^(1)), . . ., f(x^(n))]^T is an n x m matrix; y are the observations at the available data.

An estimate of the variance of the prediction is given by:

s^2(x) = σ̂^2 (1 − r^T R^{-1} r)    (24)

where σ̂^2 = (1/n) (y − F β*)^T R^{-1} (y − F β*).

The correlation model used for obtaining R and r depends on a set of unknowns also known as hyper-parameters θ. These hyper-parameters are estimated using maximum likelihood (ML). For convenience, the log ML estimate (Eq. (25)) is often used:

log ML(θ) = −(1/2) [ n ln σ^2(θ) + ln det R(θ) + (y − F β*)^T R(θ)^{-1} (y − F β*) / σ^2(θ) ]    (25)

Having a random error term allows Kriging surrogates to provide an estimate of uncertainty in addition to the predicted value at a specific location.
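A compact ordinary-Kriging sketch of Eqs. (20)-(24) is given below, using a constant trend f(x) = 1 and the squared exponential correlation from Table 3. The correlation hyper-parameter theta is fixed rather than estimated by maximizing Eq. (25), and a small nugget is added for numerical conditioning (see Section 2.4.2); both are simplifications made for illustration.

```python
import numpy as np

def fit_ordinary_kriging(X, y, theta=5.0, nugget=1e-10):
    """Ordinary Kriging: constant trend, squared exponential correlation.
    theta is fixed here; in practice it is found by maximizing Eq. (25)."""
    n = X.shape[0]
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    R = np.exp(-theta * sqdist) + nugget * np.eye(n)   # correlation matrix R
    ones = np.ones(n)                                  # constant regression term F
    Ri_y = np.linalg.solve(R, y)
    Ri_one = np.linalg.solve(R, ones)
    beta = (ones @ Ri_y) / (ones @ Ri_one)             # Eq. (22)
    resid = y - beta
    gamma = np.linalg.solve(R, resid)                  # Eq. (23)
    sigma2 = (resid @ gamma) / n                       # process variance estimate

    def predict(x):
        r = np.exp(-theta * np.sum((X - x) ** 2, axis=1))   # correlation vector r(x)
        mean = beta + r @ gamma                              # Eq. (21)
        var = sigma2 * (1.0 - r @ np.linalg.solve(R, r))     # Eq. (24)
        return mean, max(var, 0.0)
    return predict

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, size=(25, 2))
y = np.sin(6 * X[:, 0]) + X[:, 1]
predict = fit_ordinary_kriging(X, y)
print(predict(np.array([0.4, 0.6])))
```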
In blind Kriging, the unknown trend is identified using a Bayesian
2.4.1. Variants of Kriging variable selection technique. From a given set of candidate mod-
Kriging surrogate shown in Eq. (20) consists of regression com- els, Bayesian approach tries to select models that have maximum

m posterior probability (Section 2.1.1). Several other approaches for
ponent given by ˇj fj (x) and correlation component implied by variable selection exist in the literature for blind Kriging. For exam-
i=1 ple, (Huang and Chen, 2007) propose a metric known as generalized
ε. Several choices for both of these components are proposed in lit- degrees of freedom which is an estimator of mean squared error.
erature combinations of which lead to multiple variants of Kriging. Variable selection is done by trying to minimize this estimator.
There are a few strategies developed based on penalized like-
2.4.1.1. Correlation models. The random variables ε (x) in a Kriging lihood function for variable selection in Kriging where the idea of
surrogate are assumed to be correlated according to a correlation adding a penalty term in regularization (discussed in Section 2.1.1)
model. For a deterministic and continuous function, if two sam- is implemented in the context of likelihood functions (Fan and
ples are close to each other, their predicted values are close. As a Li, 2001). Unlike penalized least squares approaches discussed in
result, the correlation between random variables is high. Correla- Section 2.1.2, algorithms involving penalized likelihood functions
tion models consider the effect that the correlation decreases as involve operations with covariance matrix which is of a size of the
the distance between two distinct samples increases. Commonly order of size of sampled data set (discussed in Section 2.4.3). As this
used correlation models are depicted in Table 3 where mj is the dis- problem occurs frequently in building Kriging surrogates, efficient
tance between two points;  j and pj are hyper-parameters; d is the optimization algorithms are developed specifically for this problem
number of dimensions of the original problem. In case of Matern (Zou and Li, 2008; Chu et al., 2011; Furrer et al., 2006; Kaufman et al.,
correlation model, is the Gamma function, K j is the modified 2008).
Bessel function of order j . The parameter j > 0 provides control By having different combinations of mean prediction terms f (x)
over the differentiability of correlation model with respect to input and correlation models used for random error ε (x), multiple Krig-
variable xj and therefore that of the Kriging predictor. (Chen et al., ing models can be obtained. One such comparison between Kriging
2013) compare some of these correlation models and their results models was made by (Chen et al., 2013). For regression terms,
show that the squared exponential correlation performs worse than their results reveal that adding complex regression terms to Kriging
the exponential correlation. However, it is important to note that might not be of advantage over ordinary Kriging in terms of predic-

2.4.2. Nugget effect
Kriging, by its fundamental problem formulation, is an exact interpolation technique. This means that the Kriging surrogate predicted value matches exactly the underlying black-box function at the sample points used to build the Kriging model. This nature of Kriging might lead to highly oscillating behavior of the prediction. To suppress this, Kriging regression is an approach that attempts to add a regression component to Kriging.

In this approach, the covariance matrix is augmented by a term known as the nugget. The effect of this added term on Kriging surrogates is known as the nugget effect. Mathematically, the correlation matrix obtained after adding the nugget term ε is shown in Eq. (26).

R_mod = R + εI    (26)

Because of this modification, as the distance between two points approaches zero, the correlation no longer equals 1. A singular or ill-conditioned covariance matrix occurs when two of the sample locations are very close to each other or when the hyper-parameters in the covariance model are near zero (Ranjan et al., 2011). Incorporating the nugget effect in such cases helps in maintaining the conditioning of the covariance matrix. The remainder of the procedure to obtain the Kriging predictor remains the same as before.

2.4.3. Computational aspects of kriging
A few key computational aspects of Kriging need to be understood before choosing a Kriging surrogate for the problem at hand. First, obtaining a Kriging surrogate involves inversion of a covariance matrix. The size of this matrix depends on the number of samples and thus its inversion may become computationally demanding as the number of samples grows. Second, to obtain the hyper-parameters of the correlation model, the Kriging maximum likelihood (ML) estimator is optimized. This ML estimator is highly non-convex and has a strong dependence on the inverse of the covariance matrix. The non-convex nature of this function demands multiple evaluations to search for the global optimum. A couple of approaches proposed to tackle this problem with likelihood maximization can be found in (Toal et al., 2011; Toal et al., 2009). From the equations, one can observe that getting stuck at a local optimum affects the Kriging surrogate prediction as well as the quantified uncertainty at unsampled locations. However, with simple covariance functions, experience shows that getting stuck at a local optimum is not a serious problem and often there is no point in finding the minimizer with great accuracy (Rasmussen and Williams, 2006b), (Lophaven et al., 2002a,b).

The non-convex and computationally intensive nature of the ML estimator becomes a bigger problem as the dimensionality of the problem increases. It can be observed from Table 3 that, irrespective of the correlation model chosen, the number of hyper-parameters depends on the dimensionality of the problem. To reduce the number of hyper-parameters, (Bouhlel et al., 2015) use partial least squares. In this way they address problems of up to 100 dimensions more efficiently than other existing approaches. Another way to optimize hyper-parameters is to use cross validation instead of maximum likelihood. Use of cross-validation is found to be more robust with respect to correlation model misspecification compared to using maximum likelihood. However, the variance obtained by Kriging surrogates employing cross validation is larger (Bachoc, 2013).

For problems with a large number of data points, there are several successful applications of Kriging in the literature. One way to handle such problems is by representing the covariance matrix in terms of small matrices of size r, where r is the number of basis functions used (Cressie and Johannesson, 2008). A similar approach of reducing the size of the covariance matrix from the number of sample points (n) to a smaller number r is used by (Nychka et al., 2015) and (Banerjee et al., 2008). Another approach is covariance tapering, where a sparse covariance matrix is obtained by setting the majority of the insignificant elements to zero. Sparse matrix inversion techniques are then used to achieve attractive computational complexity (Furrer et al., 2006). Another way is to choose only a subset of the data for building the Kriging model (Liang et al., 2013). There exists a large amount of literature on using Kriging on large datasets by combinations of the above-mentioned approaches (Snelson and Ghahramani, 2007), (Sang and Huang, 2012) or by other frameworks (Tajbakhsh et al., 2014).

Finally, even though inversion of the covariance matrix is a computationally intensive task, the positive definiteness of the covariance matrix helps current software implementations reduce the computational complexity by a significant factor. For obtaining the maximum likelihood within a limited number of function evaluations, some software implementations make use of DFO algorithms.

2.5. Mixture of surrogates

Realizing the fact that no single type of surrogate outperforms all other types for all types of problems, choosing the best type of surrogate for the problem at hand is a challenging task. It is not always possible to try multiple choices of surrogate models and choose the surrogate model that shows the best performance. This motivates approaches utilizing a combination of surrogates. In general, prediction using a mixture of surrogates can be given by Eq. (27).

f̂(x) = Σ_{i=1}^{N} w_i(x) f̂_i(x)    (27)

where w_i(x) is the weight associated with the ith surrogate at design point x. Finally, the summation of the weights is set to one, Σ_{i=1}^{N} w_i = 1. This implies that if all surrogate predictions f̂_i(x) are equal, the weighted mixture will predict the same value.

Different approaches to determine the weights w_i are used in the literature. For example, (Zerpa et al., 2005) use a mixture of surrogates to optimize alkaline-surfactant-polymer flooding processes. They use a weighted combination of Kriging, RBF and polynomial regression where the weights are determined based on the variance of the individual surrogates. Weights can also be determined using a global cross validation metric called PRESS (discussed in Section 6) (Goel et al., 2007). Another approach for identifying weights is by weighing the surrogates with the help of an error metric proposed by (Müller and Piché, 2011). They assign probabilities to surrogates with the help of an error metric; these probability assignments are then used to determine the weights. A variant of efficient global optimization utilizes a mixture of surrogates (Viana et al., 2013). They propose a multiple surrogate efficient global optimization approach that is able to add multiple candidate points for global optimization in a single iteration. Use of multiple surrogates, in general, provides the flexibility to put more emphasis on good surrogates and less emphasis on bad surrogates as per the need.
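The weighted combination in Eq. (27) is straightforward to implement once the individual surrogates and a weighting rule are chosen. The sketch below combines two surrogates using fixed, illustrative weights that sum to one; the stand-in surrogate functions are invented here, and in practice the weights would come from one of the schemes discussed above (e.g. variance-based or PRESS-based).

```python
import numpy as np

def mixture_predict(x, surrogates, weights):
    """Eq. (27): weighted sum of individual surrogate predictions.
    `surrogates` is a list of callables x -> prediction; weights must sum to 1."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)
    preds = np.array([s(x) for s in surrogates], dtype=float)
    return float(weights @ preds)

# illustrative stand-ins for fitted surrogates (e.g. an RBF and a Kriging model)
surrogate_a = lambda x: np.sin(x[0]) + x[1]
surrogate_b = lambda x: 0.9 * np.sin(x[0]) + 1.1 * x[1]

x_new = np.array([0.5, 0.2])
print(mixture_predict(x_new, [surrogate_a, surrogate_b], weights=[0.6, 0.4]))
```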
3. Derivative-free optimization and surrogates

The optimization problems for which function derivative information is not symbolically or numerically available are classified as DFO problems. There are two sub-categories of algorithms addressing DFO problems: one is local search (referred to as local DFO) algorithms and the other is global search algorithms (referred to as global DFO).

Local search algorithms are effective in refining the solution or reaching a local optimum from an initial guess. Global search algorithms, on the other hand, have a component that allows escaping from a local minimum. For the purposes of this paper, it is convenient to classify DFO algorithms as algorithms that do not use surrogate models and model-based algorithms. A major class of local DFO algorithms that do not rely on surrogate models is direct search algorithms. Direct search algorithms sequentially examine candidate points generated by a certain strategy, sometimes recognizing geometric patterns. Well known examples of direct search are Hooke and Jeeves' algorithm (Hooke and Jeeves, 1960) and the simplex method (Nelder and Mead, 1965). Model-based approaches, as the name suggests, rely on surrogate models to guide the search. For the case of global DFO algorithms, the majority of the algorithms that do not use surrogate models use an approach such as partitioning of the feasible space or a stochastic approach. An example of a partitioning algorithm is the DIviding RECTangles (DIRECT) algorithm (Jones et al., 1993). Examples of stochastic algorithms include several approaches such as simulated annealing or genetic algorithms. For details on advances in DFO and an extensive comparative study on box-bounded problems, the readers are referred to (Rios and Sahinidis, 2013).

As model-based search algorithms have been shown to display superior performance compared to these algorithms, it is important to discuss the role of surrogate models in the context of DFO. A major class of model-based local DFO methods known as trust-region methods is discussed first, followed by model-based global DFO methods.

3.1. Model-based local DFO

Trust-region methods are local search methods that rely on a surrogate model in a neighborhood of a given sample location. This neighborhood is called the trust region and the model is presumed to be accurate within the trust region. The size of the trust region is defined with the help of a radius which is adjusted based on a measure of the accuracy of the surrogate. A sufficiently small value of the trust-region radius usually indicates termination. Because of the general nature of the trust-region framework, several surrogates have been used in the literature to achieve local approximation. For example, (Powell, 1994) uses linear interpolation models to approximate objective and constraint functions in the algorithm COBYLA (Constrained Optimization BY Linear Approximation). Linear interpolating surrogates are easy to construct, but these surrogates face difficulty in capturing the curvature of the original problem. Use of quadratic models of the form given in Eq. (28) has been proposed (Conn et al., 1997).

f̂(x_k + s) = f(x_k) + s^T g_k + (1/2) s^T H_k s    (28)

where k corresponds to iteration k, x_k is the current iterate, g_k ∈ R^d, and H_k is a matrix of size d by d. Uniquely determining g_k and H_k requires (d+1)(d+2)/2 sample points. This number becomes significantly high as the number of dimensions increases. For example, for a 30 dimensional problem, this number becomes nearly 500. To avoid this high sampling requirement, (Powell, 2006) proposed underdetermined quadratic interpolation models. These models are proven to attain a stationary local optimum and thus are called locally convergent. Further, (Oeuvray and Bierlaire, 2009) use RBF interpolation models with cubic basis functions and a linear tail. With modifications to the set of points used for building the RBF models, (Wild et al., 2008) proposed the Optimization by RBF Interpolation in Trust-regions (ORBIT) algorithm. This algorithm was later extended to handle constrained optimization problems (Regis and Wild, 2015). RBF based trust region algorithms are proven to be globally convergent (Wild and Shoemaker, 2013). A similar strategy was used recently in Kriging-based efficient global optimization (EGO), where the Kriging surrogate was used inside a trust-region framework (Regis, 2016).
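To make the sampling requirement of Eq. (28) concrete, the sketch below fits a fully determined quadratic model by interpolation: it builds the (d+1)(d+2)/2 monomial basis (constant, linear, and quadratic terms in the step s) and solves the resulting square linear system. The sampling rule and test function are invented for illustration; no poisedness safeguards or trust-region management loop are included.

```python
import itertools
import numpy as np

def quadratic_basis(s):
    """Monomials 1, s_i, s_i*s_j (i <= j): (d+1)(d+2)/2 terms, as in Eq. (28)."""
    d = len(s)
    terms = [1.0] + list(s)
    terms += [s[i] * s[j] for i, j in itertools.combinations_with_replacement(range(d), 2)]
    return np.array(terms)

def fit_quadratic_model(f, x_k, radius=0.1):
    """Interpolate f around x_k on (d+1)(d+2)/2 random points inside the trust region."""
    d = len(x_k)
    m = (d + 1) * (d + 2) // 2
    rng = np.random.default_rng(6)
    steps = rng.uniform(-radius, radius, size=(m, d))
    A = np.vstack([quadratic_basis(s) for s in steps])
    rhs = np.array([f(x_k + s) for s in steps])
    coef = np.linalg.solve(A, rhs)
    return lambda s: float(coef @ quadratic_basis(s))

f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2   # smooth test function
model = fit_quadratic_model(f, x_k=np.array([0.0, 0.0]))
print(model(np.array([0.05, -0.05])), f(np.array([0.05, -0.05])))
```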
3.1. Model-based local DFO proposed that suggests starting from a new sample design if algo-
rithm gets stuck in a local minimum (Regis and Shoemaker, 2007).
Trust-region methods are local search methods that rely on a A more recent example of successful use of surrogates to address
surrogate model in a neighborhood of a given sample location. This constrained global DFO problems can be found in (Boukouvala and
neighborhood is called as trust region and the model is presumed Floudas, 2016). They developed a framework for constrained grey
to be accurate within trust region. The size of the trust region is box optimization named Algorithms for Global Optimization of
defined with the help of radius which is adjusted based on a mea- coNstrAined grey-box compUTational problems (ARGONAUT) that
sure of the accuracy of the surrogate. The sufficiently small value was shown to address a difficult class of problems successfully. In
of trust-region radius usually indicates termination. Because of the this framework, the surrogate is chosen from a set containing lin-
general nature of trust-region framework, several surrogates have ear, general quadratic, sigmoidal, RBF and Kriging models based on
been used in the literature to achieve local approximation. For the accuracy of prediction. Details of this framework are discussed
example, (Powell, 1994) use linear interpolation models to approx- in Section 7.
imate objective and constraint functions in the algorithm COBYLA
(Constrained Optimization BY Linear Approximation). Linear Inter- 4. Feasibility analysis and surrogates
polating surrogates are easy to construct but these surrogates face
difficulty in capturing curvature of the original problem. Use of The ability of a process to satisfy all relevant constraints is
quadratic models of the form given in Eq. (28) is proposed (Conn referred to as feasibility. Feasibility analysis relates to identifying
et al., 1997). conditions under which the process is feasible. The problem of fea-
1 T sibility analysis arises because of several constraints on operation
f̂ (xk + s) = f (xk ) + sT gk + s HK s (28) such as product demand, environmental conditions, the safety of
2
operation, and material properties to name a few. The same prob-
where, k corresponds to iteration k, xk is the current iterate, gk ∈ Rd , lem arises in designing a product or a new material where the
Hk is a matrix of size d by d. Uniquely determining gk and Hk design is restricted by factors such as target quality. To conduct
requires (d+1)(d+2)
2
sample points. This number becomes signifi- a systematic study of profit maximization or to compare multi-
cantly high as the number of dimensions increase. For example, for a ple design alternatives, a precise estimation of feasible operation
30 dimensional problem, this number becomes nearly 500. To avoid regions is crucial. Therefore, understanding the problem of feasi-
this high sampling requirement, (Powell, 2006) proposed underde- bility analysis is extremely important.
termined quadratic interpolation models. These models are proven
to attain stationary local optimum and thus are called locally con-
Mathematically, feasibility is quantified with
 the help of a mea-
sure known as the feasibility function d,  as given in Eq. (29).
vergent. Futher, (Oeuvray and Bierlaire, 2009) use RBF interpolation A positive value of the feasibility function implies that the design
models with cubic basis functions and a linear tail. With modifi- is infeasible.
cations to the set of points used for building RBF models, (Wild
et al., 2008) proposed Optimization by RBF Interpolation in Trust- (d, ) = min max{fj (d, z, )} (29)
z j∈J
regions (ORBIT) algorithm. This algorithm was later extended to
handle constrained optimization problems (Regis and Wild, 2015). where d and z represent design variables and control variables 
RBF based trust region algorithms are proven to be globally con- respectively, bounds on z are written as z ∈ Z = z : z L ≤ z ≤ z U ;
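A minimal sketch of a trust-region iteration built on a fully determined quadratic surrogate of the form of Eq. (28) is given below. It is only an illustration of the idea, not an implementation of COBYLA, NEWUOA, or ORBIT; the objective function, the number of samples per iteration, and the radius-update constants are arbitrary choices made for the example.

import numpy as np
from scipy.optimize import minimize

def f(x):  # expensive black-box objective (toy example)
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def quad_features(X):  # monomials of a full quadratic model in d = 2
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

rng = np.random.default_rng(0)
xk, radius = np.array([-1.5, 1.0]), 0.5
for _ in range(50):
    # (d+1)(d+2)/2 = 6 points determine the quadratic uniquely; sample a few extra for robustness
    X = xk + radius * rng.uniform(-1.0, 1.0, size=(8, 2))
    y = np.array([f(p) for p in X])
    beta, *_ = np.linalg.lstsq(quad_features(X), y, rcond=None)
    model = lambda x: float(quad_features(np.atleast_2d(x)) @ beta)
    step = minimize(model, xk, bounds=[(v - radius, v + radius) for v in xk])
    pred_red = model(xk) - step.fun                 # reduction predicted by the surrogate
    actual_red = f(xk) - f(step.x)                  # reduction actually achieved
    rho = actual_red / pred_red if pred_red > 1e-12 else -1.0
    if rho > 0.1:
        xk = step.x                                 # accept the step
    radius = 1.5 * radius if rho > 0.75 else 0.5 * radius
    if radius < 1e-4:
        break

The ratio of actual to predicted reduction drives both step acceptance and the radius update, which is the mechanism shared by the trust-region methods cited above.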
...bumpiness function (Eq. (33)) (Gutmann, 2001). Minimizing the bumpiness function for a given target value can be used to focus on global and local search. This property of the bumpiness function is exploited for global DFO (Holmström et al., 2008). The EI function, as well as the bumpiness function, is discussed in more detail in Section 5. Another approach using an RBF surrogate relies on optimization and sequential updating of the RBF surrogate over the feasible space (Regis and Shoemaker, 2005). Finally, global search is also achieved by conducting local searches from multiple starting points obtained using a certain strategy; a complete restart strategy has been proposed that suggests starting from a new sample design if the algorithm gets stuck in a local minimum (Regis and Shoemaker, 2007). A more recent example of the successful use of surrogates to address constrained global DFO problems can be found in (Boukouvala and Floudas, 2016). They developed a framework for constrained grey-box optimization named Algorithms for Global Optimization of coNstrAined grey-box compUTational problems (ARGONAUT) that was shown to address a difficult class of problems successfully. In this framework, the surrogate is chosen from a set containing linear, general quadratic, sigmoidal, RBF, and Kriging models based on the accuracy of prediction. Details of this framework are discussed in Section 7.

4. Feasibility analysis and surrogates

The ability of a process to satisfy all relevant constraints is referred to as feasibility, and feasibility analysis relates to identifying conditions under which the process is feasible. The problem of feasibility analysis arises because of several constraints on operation such as product demand, environmental conditions, safety of operation, and material properties, to name a few. The same problem arises in designing a product or a new material where the design is restricted by factors such as target quality. To conduct a systematic study of profit maximization or to compare multiple design alternatives, a precise estimation of feasible operating regions is crucial. Therefore, understanding the problem of feasibility analysis is extremely important.

Mathematically, feasibility is quantified with the help of a measure known as the feasibility function \psi(d, \theta), as given in Eq. (29). A positive value of the feasibility function implies that the design is infeasible.

\psi(d, \theta) = \min_{z} \max_{j \in J} \{ f_j(d, z, \theta) \}   (29)

where d and z represent design variables and control variables respectively; bounds on z are written as z \in Z = \{z : z^L \le z \le z^U\}; \theta represents uncertain parameters, \theta \in T = \{\theta : \theta^L \le \theta \le \theta^U\}; and f_j(d, z, \theta) represents the constraints. The problem is to check if all constraints f_j can be satisfied for a given design d by adjusting the control variables z. Thus, \psi(d, \theta) > 0 implies that one or more constraints are violated, and \psi(d, \theta) = 0 implies the boundary of the feasible region.

Uncertainty in the process input parameters \theta has a great impact on the feasibility of the process design. The ability of a process to remain feasible when subject to nominal deviations of uncertain parameters is referred to as process flexibility. Process flexibility is quantified by solving the flexibility test problem, which checks whether the feasibility function \psi(d, \theta) is non-positive over the entire range of uncertain parameters \theta. The flexibility test problem is usually represented as a max-min-max problem, as shown in Eq. (30).

\chi(d) = \max_{\theta \in T} \min_{z} \max_{j \in J} \{ f_j(d, z, \theta) \}   (30)
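The following sketch evaluates the feasibility function of Eq. (29) and the flexibility test of Eq. (30) for a hypothetical design with toy constraints. The constraint expressions, bounds, and parameter grid are illustrative assumptions, and the inner minimization uses a crude multistart with a local solver rather than a rigorous global method.

import numpy as np
from scipy.optimize import minimize

# hypothetical constraints f_j(d, z, theta) <= 0 for a toy single-control design problem
def constraints(d, z, theta):
    return np.array([z**2 + theta - d,
                     -z + 0.5 * theta - 0.25 * d])

def psi(d, theta, z_bounds=(-2.0, 2.0)):
    """Feasibility function of Eq. (29): minimize the worst constraint over the controls z."""
    worst = lambda z: np.max(constraints(d, z[0], theta))
    starts = np.linspace(z_bounds[0], z_bounds[1], 5)      # crude multistart for the inner problem
    return min(minimize(worst, [z0], bounds=[z_bounds]).fun for z0 in starts)

# flexibility test of Eq. (30): chi(d) = max over theta in T of psi(d, theta), here on a grid
theta_grid = np.linspace(0.0, 1.0, 21)
chi = max(psi(1.0, th) for th in theta_grid)
print("feasible over T" if chi <= 0 else "infeasible for some theta in T")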
Several approaches utilizing the concepts of the feasibility function and flexibility are available in the literature. (Swaney and Grossmann, 1985) quantify the feasible region by means of the largest hyper-rectangle that inscribes the feasible region. Other approaches for feasibility analysis given the closed form of the process include (Straub and Grossmann, 1993), (Floudas and Gümüş, 2001), and (Goyal and Ierapetritou, 2002); a good review of these approaches can be found in (Grossmann et al., 2014). In these approaches, the analytical form of the feasibility function is assumed to be known. However, simulation models for which the feasibility needs to be assessed often require substantial computational effort, and for this reason the closed form of these models is not available or cannot be utilized. This motivates approaches involving black-box feasibility analysis. The black-box nature of the problem and the computational complexity of the underlying simulations make surrogate-based methods an intuitive choice for black-box feasibility analysis.

The key idea of surrogate based feasibility analysis is building a surrogate to represent the feasibility function given a set of input parameters and black-box simulation outputs for the feasibility function. Two key factors for this class of problems are the choice of surrogate model and the sampling strategy, and several approaches exist in the literature to address both. For example, (Banerjee et al., 2010) use high dimensional model representation (HDMR), (Boukouvala and Ierapetritou, 2012) use Kriging surrogates, and (Wang and Ierapetritou, 2016) use RBF to represent the feasibility function over the entire domain. In addition to these, for mixed-integer programming frameworks, (Zhang et al., 2016) propose the convex region surrogate (CRS) for representing a nonlinear and nonconvex feasible region by a combination of convex regions; they approximate the cost function for each region by a linear approximation. (Adi et al., 2016) use a random line search for detecting boundary points of the feasible region.

As described in Section 5, the quality of surrogates has a strong dependence on the required quantity and quality of the sampling set. Increasing the sample size may lead to a better prediction, but it will result in increased sampling cost. For feasibility analysis problems, the sampling requirement is higher than that of single objective prediction due to the presence of constraints. To control the sampling cost, approaches employing adaptive sampling are used. Kriging surrogates and a modified version of the EI function given in Eq. (31) for adaptive sampling are proposed in (Boukouvala and Ierapetritou, 2014).

\max_{x} EI_{feas}(x) = s \left( \frac{1}{\sqrt{2\pi}} e^{-0.5\, \hat{y}^2 / s^2} \right)   (31)

where EI_feas(x) is the modified EI function value at x; \hat{y} is the surrogate model predictor; s is the standard error of the predictor; \phi(.) is the normal probability distribution function.

There is a difference between a search for global optimization and one for feasibility analysis. For feasibility analysis, the problem is to find a surface defining the boundary of the feasible space within the box bounded design space, as opposed to finding a single optimum in global optimization (Boukouvala and Ierapetritou, 2012). To address this problem, (Boukouvala and Ierapetritou, 2012) use an approach that utilizes the Kriging variance to ensure exploration. To guide the search towards better defining the boundary of the feasible space, they use the product of feasibility function values of nearby samples; samples on the same side of the feasible boundary result in a positive product. More recently, (Wang and Ierapetritou, 2016) used an adaptive sampling strategy based on RBF surrogates. They used the bumpiness measure (explained in Section 5.2) to obtain a prediction error; substituting this prediction error in the EI_feas function and maximizing EI_feas with respect to x, they chose new sample points for evaluation. Their results show that the accuracy obtained from both Kriging and RBF is comparable.

5. Sampling

The process of generating data points to be able to build surrogates is referred to as sampling. The performance of surrogate models depends strongly on the quality as well as the number of samples. However, as generating data demands evaluation of the true function, sampling contributes a significant computational cost. To maintain the quality of surrogates without incurring excessive sampling cost, studying sampling strategies is of immense importance.

Sampling strategies are broadly classified as adaptive sampling and stationary sampling. Stationary sampling consists of methods that rely on geometry or a pattern, such as grid sampling and full and half factorial designs, and methods derived from the design of experiments literature, such as orthogonal sampling and the Box-Behnken design. Some of the widely used stationary sampling strategies are Latin Hypercube Sampling (LHS) (McKay et al., 1979), Sobol (1967), and Halton sampling. LHS is a stratified sampling strategy where a sample is drawn from each stratum once; to provide better space-filling properties, LHS is done subject to projection filters. Sobol and Halton sampling are quasi-random strategies where samples are drawn from the Sobol and Halton low-discrepancy sequences respectively.
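As an illustration, stationary designs of this kind can be generated with SciPy's quasi-Monte Carlo module (available in recent SciPy versions); the dimension, bounds, and sample counts below are arbitrary, and the discrepancy values are only a rough space-filling diagnostic.

from scipy.stats import qmc

d = 3                                                   # problem dimension
lhs    = qmc.LatinHypercube(d, seed=1).random(10 * d)   # a "10 x dimensions" style design
sobol  = qmc.Sobol(d, seed=1).random_base2(m=5)         # 2^5 = 32 quasi-random points
halton = qmc.Halton(d, seed=1).random(10 * d)

# scale the unit-hypercube designs to the box-bounded domain of interest
lower, upper = [-5.0, 0.0, 1.0], [5.0, 10.0, 2.0]
X_lhs = qmc.scale(lhs, lower, upper)

# centered discrepancy: smaller values indicate a more uniform spread of the points
print(qmc.discrepancy(lhs), qmc.discrepancy(sobol), qmc.discrepancy(halton))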
In adaptive sampling, starting from a limited number of samples that are generally obtained from stationary sampling, new sample locations are decided sequentially. This strategy aims to minimize the sampling requirement by obtaining more samples that benefit the quality of the surrogate. Most of the newer adaptive sampling strategies rely on some criterion to tackle the trade-off between exploring the most unexplored regions (exploration) and refining the region near existing samples for better understanding (exploitation). This approach is most common in the context of global optimization, where exploration is required to escape local optima and exploitation is required to improve the available optimum. For Kriging surrogates, a popular approach is making use of the EI function; for RBF surrogates, a similar quantitative measure is obtained using a function known as the bumpiness function. Other approaches employing adaptive sampling make use of different strategies to address this trade-off. In general, these methods have been shown to achieve better accuracy with fewer samples (Provost et al., 1999).

5.1. Expected improvement function

One commonly used approach to handling exploration and exploitation is using the EI function (Eq. (32)) as used by (Donald R Jones et al., 1998).

EI(x) = \left( f_{min} - \hat{f}(x) \right) \Phi\!\left( \frac{f_{min} - \hat{f}(x)}{s} \right) + s\, \phi\!\left( \frac{f_{min} - \hat{f}(x)}{s} \right)   (32)

where \phi(.) represents the standard normal density function; \Phi(.) represents the probability distribution function; \hat{f} is the surrogate model predictor; f_min is the current minimum function value; and s is the standard deviation. EI(x) represents the expected improvement at sample location x. The function increases with a decreasing predicted value \hat{f}(x) and with an increasing standard deviation s; achieving low \hat{f}(x) and high s corresponds to exploitation and exploration, respectively. As both contribute positively to the EI function, the trade-off between exploration and exploitation is addressed by maximizing the EI function. The EI function exhibits multiple local optima, which might cause numerical problems.
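A small, self-contained implementation of Eq. (32) is sketched below. The commented usage lines assume a fitted Kriging-type model whose predict method returns both the mean and its standard error (as in scikit-learn); X_candidates and y_train are placeholders for the candidate points and the sampled responses.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, s, f_min):
    """Eq. (32) for a minimization problem; mu and s are the surrogate prediction and its standard error."""
    mu, s = np.asarray(mu, float), np.asarray(s, float)
    ei = np.zeros_like(mu)
    ok = s > 0                               # EI is zero wherever the predictor has no uncertainty
    z = (f_min - mu[ok]) / s[ok]
    ei[ok] = (f_min - mu[ok]) * norm.cdf(z) + s[ok] * norm.pdf(z)
    return ei

# mu, s = gp.predict(X_candidates, return_std=True)
# x_next = X_candidates[np.argmax(expected_improvement(mu, s, y_train.min()))]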
5.2. Bumpiness function

A similar approach of having a single function to balance exploration and exploitation was proposed by (Gutmann, 2001) for RBF surrogates. It relies on the fact that the RBF surrogate obtained by solving the system of equations given by Eq. (19) is the one that minimizes bumpiness. A quantitative measure of bumpiness is given by the bumpiness function (Eq. (33)).

\min g_n(y) = (-1)^{m_0 + 1} \mu_n(y) \left[ \hat{f}(y) - f_n^{*} \right]^2, \quad y \in D \setminus \{x_1, x_2, \ldots, x_n\}   (33)

where y is an unsampled point; f_n^* is the target value; m_0 is a constant whose value depends on the basis function used (1 for cubic and thin plate splines, 0 for linear and multiquadric, and -1 for Gaussian); and \mu_n(y) is the coefficient of the new term \phi(||x - y||_2) in the surrogate \hat{f}_n(x) if the unsampled point y is added. It is calculated as the nth element of the vector v, and v is calculated by solving the system of equations given by Eq. (34).

\begin{pmatrix} \Phi_y & P_y \\ P_y^T & 0 \end{pmatrix} v = \begin{pmatrix} 0_n \\ 1 \\ 0_{d+1} \end{pmatrix}   (34)

\Phi_y = \begin{pmatrix} \Phi & \phi_y \\ \phi_y^T & 0 \end{pmatrix}; \quad P_y = \begin{pmatrix} P \\ y^T \; 1 \end{pmatrix}; \quad (\phi_y)_i = \phi(||y - x_i||_2), \; i = 1, \ldots, n.

Minimizing the bumpiness function emphasizes exploration as well as exploitation depending on f_n^*. A large negative value of f_n^* makes the search global and focuses on exploration, whereas a value close to the current optimal solution makes the search local and focuses on exploitation. Evaluation of the bumpiness function is computationally expensive because obtaining \mu_n by solving the system given by Eq. (34) is an O(n^3) operation. However, the cost of this step can be improved by exploiting the structure of \Phi, after which the operation becomes O(n^2) (Björkman and Holmström, 2000).
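For reference, a bare-bones construction of the RBF interpolant with a cubic basis and a linear polynomial tail (the surrogate form referred to above and reused in Section 8.3) is sketched below. It solves the dense interpolation system directly and assumes the training inputs X and outputs y are NumPy arrays with no duplicated points; it is a sketch, not the implementation used by the authors.

import numpy as np

def fit_cubic_rbf(X, y):
    """Cubic RBF interpolant with a linear tail, built by solving the dense interpolation system."""
    n, d = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Phi = r**3                                    # cubic basis phi(r) = r^3
    P = np.hstack([np.ones((n, 1)), X])           # linear polynomial tail [1, x]
    A = np.block([[Phi, P], [P.T, np.zeros((d + 1, d + 1))]])
    coef = np.linalg.solve(A, np.concatenate([y, np.zeros(d + 1)]))
    lam, b = coef[:n], coef[n:]
    def predict(Xnew):
        rn = np.linalg.norm(Xnew[:, None, :] - X[None, :, :], axis=-1)
        return rn**3 @ lam + np.hstack([np.ones((len(Xnew), 1)), Xnew]) @ b
    return predict

# rbf = fit_cubic_rbf(X_train, y_train); y_hat = rbf(X_test)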
5.3. Other approaches

The problem of adaptive sampling can be formulated as a DFO problem with the objective function being the difference between the true function and the surrogate (Cozad et al., 2014). The objective function is given in Eq. (35).

\max_{x^L \le x \le x^U} \left( \frac{f(x) - \hat{f}(x)}{f(x)} \right)^2   (35)

where \hat{f}(x) is the surrogate, f(x) is the true function, and x^L and x^U are the bounds within which the error is to be maximized.

Some approaches rely on ranking exploration and exploitation and weighing both as per need. One such approach was recently proposed by (Garud et al., 2017). They propose a metric consisting of two separate measures for exploration and exploitation. For exploration, they use the sum of squares of the distances of a new sample from all the previous samples. For exploitation, the impact of a new sample added near an already sampled location is quantified with the help of a departure function.

\Delta_j(x) = \hat{f}(x) - \hat{f}_j(x), \quad j \in S   (36)

where \hat{f}(x) is the surrogate built using all points in the sampled set S, and \hat{f}_j(x) is the surrogate built using all points except point j. Estimating the prediction variance of the surrogate model using a technique called jackknifing, (Eason and Cremaschi, 2014) propose an adaptive sampling strategy. They use this strategy with a surrogate model built using ANN and choose sample locations that have a high prediction variance. The advantage of this type of adaptive sampling is that it is not specific to the choice of surrogate model. A study of space filling sequential design methods can be found in (Crombecq et al., 2011); they propose a set of sequential sampling methods that show comparable performance with stationary, or one-shot, experimental designs.

6. Surrogate model validation

Assessing the reliability of a surrogate model is one of the major concerns because an inaccurate surrogate model can lead to a waste of resources and have a bad effect on optimization, prediction, or feasibility analysis. Surrogate model validation is the process of assessing the reliability of the surrogate model. In addition to assessing accuracy, validation techniques can be used to select a surrogate model from a set of candidate models and to tune hyper-parameters (such as correlation model parameters in Kriging). For problems of lower dimensions, a visual comparison between predictions and true values is possible. However, the difficulty in having enough data for visual comparison and the inability to visualize predictions for problems over two dimensions necessitate more sophisticated approaches. As surrogate models cannot be validated on the same data with which they were built, surrogate models are built with the help of only a part of the available data, and the remaining part of the data is used for testing the accuracy. The data set on which the model is built is referred to as the training set, and the set on which the model is tested is referred to as the test set. The metrics used for quantifying the error on the test set are referred to as validation metrics.

One possible approach is using resampling strategies such as cross validation and bootstrapping. In cross-validation, the available data is divided into k blocks containing an equal number of data points. Data from (k-1) blocks are used as the training set and data from the remaining block are used as the test set; the process is repeated for all possible combinations of (k-1) blocks. Finally, an appropriate metric to quantify the error on the test set, such as the sum of squared errors, is evaluated, and the accuracy of the model on the test data can act as an indicator of model adequacy. However, with limited data available, holding out a large part of the data is not always possible, which motivates splits that leave out very little data. One such approach, known as leave-one-out cross validation, was used by (Donald R Jones et al., 1998); in this approach, the number of subsets k equals the number of data points or observations, thus leaving only one data point out each time a surrogate is built. A sampling set is considered inadequate to build a quality surrogate if the removal of one data point significantly affects the new model. A similar approach, but allowing repeated samples in the training set, is known as bootstrapping. By allowing repeated samples in the set used to build models, one can have a training set of a size equal to the size of the actual data. Usually, the number of subsets k chosen for bootstrapping is much higher than that for cross validation. Details on resampling methods for validation of surrogates can be found in (Bischl et al., 2012).
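A compact way to compute such resampling error estimates is sketched below using scikit-learn's splitters; the arrays X and y are assumed to hold the available data, and a Gaussian process model is used only as an example surrogate. The leave-one-out variant corresponds to the PRESS-type RMSE estimate of Eq. (37) discussed below.

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.gaussian_process import GaussianProcessRegressor

def resampling_rmse(model, X, y, splitter):
    errs = []
    for train, test in splitter.split(X):
        pred = model.fit(X[train], y[train]).predict(X[test])
        errs.append((y[test] - pred) ** 2)          # held-out squared errors for this split
    return np.sqrt(np.mean(np.concatenate(errs)))

gp = GaussianProcessRegressor()
kfold_rmse = resampling_rmse(gp, X, y, KFold(n_splits=5, shuffle=True, random_state=0))
press_rmse = resampling_rmse(gp, X, y, LeaveOneOut())   # leave-one-out (PRESS-style) estimate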
Validation metrics that are commonly used to quantify the error using the above-mentioned resampling strategies are the explained variance score, the mean absolute error, the mean squared error, the median absolute error, the R^2 score, the relative average absolute error, and the relative maximum absolute error. These metrics, with their respective mathematical equations, are shown in Table 4, where y, \hat{y}, n_samples, and \bar{y} denote the true value, the surrogate predicted value, the number of samples, and the mean predicted value, respectively.

Table 4
Commonly used surrogate validation metrics.

Validation metric | Formula
Explained variance score | 1 - Var\{y - \hat{y}\} / Var\{y\}
Mean absolute error | (1/n_{samples}) \sum_i |y_i - \hat{y}_i|
Mean squared error | (1/n_{samples}) \sum_i (y_i - \hat{y}_i)^2
Median absolute error | median(|y_1 - \hat{y}_1|, \ldots, |y_n - \hat{y}_n|)
R^2 score | 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2
Relative average absolute error | \sum_i |y_i - \hat{y}_i| / (n_{samples} \cdot STD)
Relative maximum absolute error | max(|y_1 - \hat{y}_1|, \ldots, |y_n - \hat{y}_n|) / STD

The relative maximum absolute error indicates the error in one part of the feasible space; however, it is not a good indicator of the overall performance. The explained variance score equals the R^2 score if the mean of the prediction error is zero. (Kersting et al., 2007) use the normalized mean squared error as well as the average Negative Log estimated Predictive Density (NLPD) for heteroscedastic Gaussian process regression, which penalizes over-confident as well as under-confident predictions. In the same area, (Boukouvalas and Cornford, 2009) use the Mahalanobis error, which utilizes the full predictive covariance and avoids the assumption of uncorrelated errors. (Yin et al., 2011) use the mean absolute error as well as the maximum absolute error for validation. For the case of multiple surrogates, (Felipe A C Viana et al., 2009) used the prediction sum of squares (PRESS) as an estimator of the root mean square error (RMSE) to pick the best surrogate. Their computational results reveal that PRESS becomes more and more useful for identifying the best surrogate as the number of sample points increases. The PRESS vector \tilde{e} is the vector of errors obtained from carrying out leave-one-out cross validation, and the RMSE is predicted using Eq. (37).

PRESS_{RMS} = \sqrt{ \frac{1}{n_{samples}} \tilde{e}^T \tilde{e} }   (37)
7. Software implementations

Several software implementations exist for DFO; a list of these can be found in (Rios and Sahinidis, 2013). Relatively fewer software implementations exist for surrogate model building for prediction and feasibility analysis. A few of these implementations are listed below, followed by a recent DFO framework.

1) Automated learning of algebraic models for optimization (ALAMO) (Cozad et al., 2014)
ALAMO is a regression and classification methodology that builds simple and accurate surrogates from a minimal set of data points. It makes use of an integer programming based subset technique to choose from a large number of possible input variables and their transformations. Starting from an initial stationary sample, ALAMO adaptively improves the surrogate model by error maximization (discussed in Section 5.3); DFO is used for finding the location at which the error between the original model and the surrogate is maximum. Finally, ALAMO allows the user to make use of a priori knowledge about the model response and utilizes it to refine the model. This is achieved using constrained regression, where a constraint is placed on the response of the regression model. This kind of problem is usually nonconvex in nature, and therefore the deterministic global optimization solver BARON (Tawarmalani and Sahinidis, 2005) is used to solve this subproblem. In addition to providing surrogate models from static data sets, ALAMO also implements the best subset selection methodology (discussed in Section 2.1.1). Details about constrained regression in ALAMO can be found in (Cozad et al., 2015), and a detailed description of the ALAMO approach in the context of machine learning approaches is given in (Wilson and Sahinidis, 2017).

2) Eureqa (Schmidt and Lipson, 2009)
The commercially available software Eureqa starts with an initial dataset and follows by carrying out a symbolic regression where the search is not restricted just to the coefficients but also covers the form of the regression model. It starts with a set of candidate regression models. The accuracy of these models is assessed by carrying out symbolic differentiation of the models and comparing the derivatives with the underlying data set. Based on the discrepancy, these models are then iteratively updated till a convergence criterion is met. The search through candidate regression models is carried out using algorithms similar to genetic programming, details of which can be found in (Riolo et al., 2010) and (Riolo et al., 2011).

3) Surrogates Toolbox (Viana, 2010)
This is a general-purpose toolbox for surrogate modeling that consists of a number of third party software packages. The toolbox has four major capabilities. For designing experiments, it supports full factorial and LHS designs. For surrogate prediction, it supports surrogate models of the type Kriging, Gaussian process for machine learning (GPML), radial basis neural network, and linear Shepard model. For surrogate model validation, it allows classical error validation, including the coefficient of determination and RMSE, as well as cross validation. For optimization, it allows sensitivity analysis, contour estimation, and variants of the EGO algorithm. The list of third party software used for the above-mentioned purposes is as follows:

– Gaussian processes for machine learning (GPML) (Rasmussen and Williams, 2006a)
– Kriging (Lophaven et al., 2002a,b)
– SVR (Gunn, 1998)
– Linear Shepard model (Thacker et al., 2010)

4) Surrogate Modeling (SUMO) Toolbox for surrogate modeling and adaptive sampling
SUMO is a toolbox for surrogate modeling and adaptive sampling proposed by (Gorissen et al., 2010). The user is allowed to choose from surrogates such as Kriging, splines, support vector machines, and ANN. For hyper-parameter optimization, the toolbox supports particle swarm optimization, efficient global optimization, simulated annealing, genetic algorithms, etc. The toolbox has model selection criteria such as cross validation, the Akaike information criterion (AIC), leave-one-out cross validation, etc. Additionally, it supports adaptive sample selection, also known as active learning. Support for LHS and Box-Behnken sampling designs is provided.

5) Scikit-learn (Pedregosa et al., 2011)
scikit-learn is an open source library that implements a range of algorithms for machine learning, preprocessing, cross validation, and visualization. For surrogate models, scikit-learn provides multiple regression models, such as linear and support vector regression, and interpolation models such as Kriging, along with cross validation metrics to test the performance of surrogates.
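As an example of the scikit-learn route, the snippet below fits a Kriging-type surrogate with a squared exponential kernel and length-scale bounds of [0.1, 10] (the settings later used in Section 8.3). Note that current scikit-learn versions expose this through GaussianProcessRegressor and kernel objects rather than the older gaussian_process interface with explicit regression/correlation arguments; the data here are synthetic and only illustrate the calls.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# squared-exponential correlation with a constant trend; hyper-parameters bounded in [0.1, 10]
kernel = ConstantKernel(1.0) * RBF(length_scale=1.1, length_scale_bounds=(0.1, 10.0))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

X = np.random.default_rng(0).uniform(-2.0, 2.0, size=(40, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
gp.fit(X, y)
mu, std = gp.predict(np.array([[0.5, -0.5]]), return_std=True)   # prediction and its standard error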
6) Statistics and machine learning toolbox (MATLAB)
This toolbox supports linear and nonlinear regression, subset selection using goodness-of-fit measures and regularization, and support vector machines. It also includes validation metrics for testing the performance of surrogates. In addition to these, the toolbox provides classification algorithms and non-parametric regression algorithms.

7) Software packages in R
R is a well-known open source platform for machine learning and data analysis and offers a number of packages to choose from. A few of the packages are mentioned in Table 5.

Table 5
Relevant packages in R.

Package name | URL | Description
leaps | https://CRAN.R-project.org/package=leaps | subset selection using the leaps and bounds algorithm (Furnival and Wilson, 1974)
SIS | https://CRAN.R-project.org/package=SIS | sure independence screening (discussed in Section 2.1.1) for high dimensional problems
stats | https://stat.ethz.ch/R-manual/R-devel/library/stats/html/stats-package.html | linear and nonlinear regression model building and prediction, goodness-of-fit measures, model validation metrics
e1071 | https://CRAN.R-project.org/package=e1071 | support vector regression and classification
neuralnet | https://CRAN.R-project.org/package=neuralnet | building surrogates and prediction using neural networks
subselect | https://CRAN.R-project.org/package=subselect | variable selection (Cadima et al., 2012)
lars | https://CRAN.R-project.org/package=lars | least angle regression, forward stagewise regression, and lasso
performanceEstimation | https://CRAN.R-project.org/package=performanceEstimation | validation metrics
lhs | https://CRAN.R-project.org/package=lhs | Latin hypercube sampling
randtoolbox | https://CRAN.R-project.org/package=randtoolbox | Sobol and Halton sequences
mlr (Bischl et al., 2016) | https://CRAN.R-project.org/package=mlr | machine learning in R, regression and classification

8) ARGONAUT (Boukouvala and Floudas, 2016)
ARGONAUT addresses DFO problems where there is a partial or total lack of analytical expressions or closed forms for the objective function as well as the constraints. The steps of this iterative framework include bound tightening, variable selection, sampling (initial stationary sampling followed by adaptive sampling at locations suggested by the framework), surrogate model selection, and global optimization of the system to obtain new candidate points. The deterministic global optimization solver ANTIGONE is used for bound tightening as well as for global optimization. After bound tightening, variable selection is done using sure independence screening (SIS) (discussed in Section 2.1.1). An important part of this framework is surrogate model selection, where surrogates are chosen from a set of regression models such as linear, quadratic, and polynomial models that are easy to optimize, and interpolation models such as Kriging and RBF that are better at predicting data accurately.
8. Computational results and discussion

The purpose of the computational experiments in this section is to illustrate the effect of the following choices on the performance of surrogates in terms of their predictive ability:

1. For Kriging models, the choice of regression and correlation terms
2. Initial sample size
3. Initial sample design scheme

8.1. Test problems

In this comparison, the performance of Kriging and RBF surrogates is tested on problems of dimensions ranging from 2 to 30 over a box bounded domain of interest. The distribution of the problem sizes is shown in Table 6. The problems are from princetonlib (Princeton Library (2017)) and globallib (GLOBAL Library (2017)); the C version of the problems was obtained from http://archimedes.cheme.cmu.edu/?q=dfocomp. All the problems chosen are nonconvex and smooth. Problem data such as problem size and variable bounds are obtained from the same web site.

Table 6
Problem size range and number of test problems.

Problem size | Number of problems
1–2 | 5
3–8 | 27
9–30 | 15

8.2. Test setup and basis for comparison

For comparing the performance of RBF and Kriging surrogates, three different sampling schemes (Latin Hypercube Sampling (LHS), Sobol sampling, and Halton sampling) are used to obtain the initial design. Of these three sampling schemes, LHS is a random sampling method whereas Sobol and Halton are quasi-random sampling methods; therefore, 10 different random starting designs were obtained for LHS. The initial designs were kept constant across all problems having the same number of dimensions to maintain consistency in the comparison. To study the effect of the initial sample size on the quality of the surrogate, two different initial sample sizes (shown in Table 7) were used.

Table 7
Initial sample sizes and their denoted name.

Initial sample size | Size name
10 x problem dimensions | Size 1
20 x problem dimensions | Size 2

For two-dimensional problems, a grid of 2500 samples was used as a test set for prediction for the RBF and Kriging surrogates. For problems with more than two dimensions, an LHS design with 1000 samples was used as a test set for prediction. For the 10 random LHS designs, the median is considered as a representative of the performance. For analyzing the test results, the error factor (EF) is used as a measure of relative model performance.

EF = \frac{RMSE}{RMSE_{best}}

where RMSE is the root mean squared error on the test set and RMSE_best is the minimum RMSE shown by the surrogate models under comparison for a particular problem. This error factor is then used to obtain Dolan-Moré plots (Dolan and Moré, 2002), or performance profiles, of the surrogates.
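A short sketch of how such error factors and performance profiles can be computed from a matrix of test-set RMSE values is given below; the shape of the input array and the range of factors are assumptions for illustration, not the exact post-processing used by the authors.

import numpy as np

def performance_profile(rmse, eps=1e-6):
    """rmse: array of shape (n_problems, n_models). Returns error factors and profile curves."""
    rmse = np.maximum(np.asarray(rmse, float), eps)      # floor very small RMSE values (see Section 8.4.1)
    ef = rmse / rmse.min(axis=1, keepdims=True)          # EF = RMSE / RMSE_best for each problem
    taus = np.logspace(0, 3, 200)                        # error factors from 1 to 1000
    curves = [(ef[:, m][:, None] <= taus).mean(axis=0) for m in range(ef.shape[1])]
    return ef, taus, np.array(curves)                    # curves[m, k] = fraction of problems with EF <= taus[k]

Plotting taus (on a logarithmic axis) against each row of curves reproduces a Dolan-Moré-style profile of the kind shown in Figs. 1–5.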
8.3. Surrogate model settings

For Kriging surrogates, three different types of correlation models (absolute exponential (p_j = 1 in the exponential correlation model), linear, and squared exponential) and three different models for the mean prediction term (constant, linear, and quadratic) were used, as shown in Table 3. A description of the Kriging surrogate models and the abbreviations used is provided in Table 8. In addition to the choice of correlation model and mean prediction term, the Kriging model needs information for hyper-parameter optimization: bounds on the hyper-parameters, the initial guess for the hyper-parameter optimization, and an appropriate optimization algorithm for minimizing the ML. In this comparison, the bounds on all hyper-parameters were [0.1, 10] with an initial guess of 1.1, and the DFO algorithm COBYLA (Powell, 1994) is used for optimizing the ML function. The Kriging model is used from the toolbox scikit-learn (Pedregosa et al., 2011).

Table 8
Kriging surrogate models and their abbreviations.

Kriging trend | Correlation | Abbreviation
constant | squared exponential | KR11
constant | absolute exponential | KR12
constant | linear | KR13
linear | squared exponential | KR21
linear | absolute exponential | KR22
linear | linear | KR23
quadratic | squared exponential | KR31
quadratic | absolute exponential | KR32
quadratic | linear | KR33

For the case of RBF, a cubic basis function with a linear tail is used; to build the RBF model, the system of equations given by Eq. (19) is solved.

8.4. Computational results

8.4.1. Choice of Kriging regression term
Figs. 1–3 present Dolan-Moré performance profiles of the Kriging surrogates on the test problems in terms of prediction accuracy. The value on the vertical axis for each point in these plots represents the fraction of problems for which the surrogates have an EF less than the corresponding value on the horizontal axis. The best case scenario for a surrogate would be represented by a vertical line at an EF of 1, which would indicate that a particular surrogate model outperformed all other surrogate models on all test problems. For some of the problems, a surrogate may have an RMSE close to 0; to avoid numerical difficulties in evaluating EF, values of RMSE less than 0.000001 are approximated as 0.000001.

Fig. 1 compares all Kriging surrogates starting with a sample of size 1 obtained using an LHS design. KR31, KR32, and KR33 show a superior performance compared to the other models. For a significant fraction of problems, all the surrogates having constant and linear regression terms have an EF of greater than 1000. This indicates that having a quadratic regression term helps in improving the prediction accuracy of Kriging models. However, including a quadratic regression term needs at least (d+1)(d+2)/2 sample points, as it requires building a fully quadratic model. This number gets large as the problem size (d) increases; for example, for the problems of size 30, an initial sample size of 10d is insufficient to build the model. This factor, along with the added computational expense of complex regression terms (discussed in Section 2.4.3), makes simpler regression terms an attractive choice. Another observation is that there is little difference between KR31, KR32, and KR33, which indicates that as the regression terms become more complex, the effect of the correlation terms decreases. Because of these reasons, Kriging surrogates with quadratic regression terms are omitted in the subsequent illustrations.

Fig. 1. Fraction of problems solved vs error factor for LHS design of size 1.

8.4.2. Choice of Kriging correlation model
Fig. 2 compares Kriging models with constant and linear regression terms starting from samples generated from LHS with a size of size 1, and highlights the effect of the correlation models on simpler regression terms. It can be observed that the performance of KR11, KR12, and KR13 is comparable with that of KR21, KR22, and KR23 respectively; this indicates that the prediction accuracy using constant regression terms is comparable to that using linear regression terms. A second observation is that KR11 and KR21 show a superior performance compared to the other Kriging variants, which indicates that squared exponential correlation terms show a superior performance compared to linear and absolute exponential correlation terms.

Fig. 2. Fraction of problems solved vs error factor for LHS design of size 1.

8.4.3. Effect of initial sample size
In the next comparison, the KR11, KR21, and RBF surrogate models are compared as the sample size is changed from size 1 to size 2 for an LHS design. Fig. 3 shows that, irrespective of the surrogate, increasing the initial sample size improves the prediction accuracy. This is intuitive because the surrogate models are built utilizing more information.

8.4.4. Effect of sample design
Finally, it is important to illustrate the surrogate prediction accuracy as the initial sampling design is changed. Fig. 4 shows that, for an initial sample of size ‘size 1’, Halton sampling shows the worst performance for the KR11 surrogate. Fig. 5 shows that the performance of surrogates starting from the three different sampling schemes is similar.
This indicates that as the sample size increases, the sensitivity to the sampling design scheme decreases.

Fig. 3. Fraction of problems solved vs error factor for different sampling sizes.
Fig. 4. Fraction of problems solved vs error factor for different sampling designs starting with size of ‘size 1’.

Summarizing the discussion, a number of conclusions can be derived regarding the regression and correlation terms in Kriging surrogates, initial sample sizes, and initial sample design schemes. For correlation models, Kriging surrogates with a squared exponential correlation model show superior performance. For the regression term, for an initial sample size of size 1, Kriging surrogates with a quadratic regression term outperform the Kriging models with constant and linear regression terms; between the other two, a model with a constant regression term performs equally well as one with a linear regression term. Kriging models with quadratic regression terms require the determination of a full quadratic model and therefore have an increasingly high sampling requirement as the problem size increases. Increasing the initial sample size helps achieve better prediction accuracy irrespective of the surrogate model. Finally, for an initial sample size of size 1, surrogates built with initial samples obtained from the LHS and Sobol sampling schemes perform better than those obtained from the Halton sampling scheme. However, as the sample size increases, the prediction accuracy becomes less dependent on the initial sample design scheme.

9. Summary

Surrogate models have attracted interest from multiple engineering and scientific disciplines and have found a wide range of application domains. However, the choice of surrogate model for the problem at hand is not straightforward because of the trade-offs associated with each surrogate. This choice is better understood if the problem at hand is classified as a prediction, optimization, or feasibility analysis problem based on the use of the surrogate model. The difference of approaches for each of these classes from the surrogate modeling point of view is emphasized, and relevant recent advances in each of the three classes are discussed with a focus on the use of surrogate models.
Fig. 5. Fraction of problems solved vs error factor for different sampling designs starting with size of ‘size 2’.

For prediction using surrogates, the mathematical formulation and recent advances in aspects such as prediction accuracy, numerical stability, and other computational aspects are discussed. Regression approaches such as linear and support vector regression are discussed, followed by interpolation approaches such as RBF and Kriging. In the context of linear regression, a commonly used idea for preventing overfitting is subset selection, and several subset selection methodologies and the ideas behind them are reviewed. In the context of blind Kriging, where there is no assumed form for the regression term, the Bayesian variable selection technique is introduced. For derivative-free optimization, model-based local and global search strategies are reviewed. Approaches where surrogates are used within a framework are discussed in addition to approaches that make direct use of surrogates; the literature displays successful application of surrogates inside trust-region-like frameworks or frameworks that exploit a certain quantitative measure such as the EI function or the bumpiness function. The idea of using the EI function or the bumpiness function has also been shown to be successfully applied to feasibility analysis problems. The quality of surrogates, as well as the total computational cost of surrogate prediction, depends strongly on the sampling quality and efficiency. Adaptive sampling algorithms improve the sampling cost by sequentially choosing the sample locations starting from a stationary design; details of adaptive sampling algorithms and the use of surrogates for adaptive sampling are discussed. An important part of surrogate model building is assessing the quality of the surrogate model at hand, and several fitness metrics are discussed. There are a number of software implementations that provide the user with easy access to the ideas discussed in this review; these are reviewed with a brief description and their respective capabilities.

Finally, because of their wide applicability for all three classes of problems discussed in the paper, a computational study was carried out on Kriging and RBF surrogates. The results obtained agree with the common observation that having more samples improves the accuracy of surrogates. Additionally, Kriging surrogates with squared exponential correlation models were found to perform better than those with linear and absolute exponential correlation models. Kriging surrogates with a quadratic regression model perform better than the corresponding constant and linear regression models; however, Kriging surrogates with a constant trend term show a comparable performance with those with linear regression models.

Acknowledgement

Financial support from NSF under grants 1159244 and 1434548 is gratefully acknowledged.

Appendix A.

Problem | Problem size | Smoothness | Convexity | Collection
3pk | 30 | smooth | nonconvex | princetonlib
allinit | 3 | smooth | nonconvex | princetonlib
arglinb | 10 | smooth | nonconvex | princetonlib
arglinc | 8 | smooth | nonconvex | princetonlib
biggs6 | 6 | smooth | nonconvex | princetonlib
branin | 2 | smooth | nonconvex | princetonlib
brownbs | 2 | smooth | nonconvex | princetonlib
camel6 | 2 | smooth | nonconvex | princetonlib
dixon3dq | 10 | smooth | nonconvex | princetonlib
extrosnb | 10 | smooth | nonconvex | princetonlib
genhumps | 5 | smooth | nonconvex | princetonlib
griewank | 2 | smooth | nonconvex | princetonlib
hart6 | 6 | smooth | nonconvex | princetonlib
heart6ls | 6 | smooth | nonconvex | princetonlib
heart8ls | 8 | smooth | nonconvex | princetonlib
hilberta | 10 | smooth | nonconvex | princetonlib
hs038 | 4 | smooth | nonconvex | princetonlib
hs045 | 5 | smooth | nonconvex | princetonlib
hs110 | 10 | smooth | nonconvex | princetonlib
least | 3 | smooth | nonconvex | globallib
nonmsqrt | 9 | smooth | nonconvex | princetonlib
oslbqp | 8 | smooth | nonconvex | princetonlib
palmer1c | 8 | smooth | nonconvex | princetonlib
palmer1d | 7 | smooth | nonconvex | princetonlib
palmer2a | 6 | smooth | nonconvex | princetonlib
palmer2c | 8 | smooth | nonconvex | princetonlib
palmer2e | 8 | smooth | nonconvex | princetonlib
palmer3c | 8 | smooth | nonconvex | princetonlib
palmer3e | 8 | smooth | nonconvex | princetonlib
palmer5b | 9 | smooth | nonconvex | princetonlib
qudlin | 12 | smooth | nonconvex | princetonlib
s211 | 2 | smooth | nonconvex | princetonlib
s271 | 6 | smooth | nonconvex | princetonlib
s272 | 6 | smooth | nonconvex | princetonlib
s273 | 6 | smooth | nonconvex | princetonlib
s276 | 6 | smooth | nonconvex | princetonlib
s281 | 10 | smooth | nonconvex | princetonlib
s282 | 10 | smooth | nonconvex | princetonlib
s283 | 10 | smooth | nonconvex | princetonlib
s291 | 10 | smooth | nonconvex | princetonlib

s294 6 smooth nonconvex princetonlib Chu, T., Zhu, J., Wang, H., 2011. Penalized maximum likelihood estimation and
s295 10 smooth nonconvex princetonlib variable selection in geostatistics. Ann. Stat. 39 (5), 2607–2625, http://dx.doi.
s368 8 smooth nonconvex princetonlib org/10.1214/11-AOS919.
s370 6 smooth nonconvex princetonlib Clarke, S.M., Griebsch, J.H., Simpson, T.W., 2005. Analysis of support vector
s371 9 smooth nonconvex princetonlib regression for approximation of complex engineering analyses. J. Mech. Des.
st bsj3 6 smooth nonconvex globallib 127 (6), 1077, http://dx.doi.org/10.1115/1.1897403.
st cqpjk2 3 smooth nonconvex globallib Conn, a. R., Scheinberg, K., Toint, P.L., 1997. Recent progress in unconstrained
nonlinear optimization without derivatives. Math. Program. 79 (1–3),
397–414, http://dx.doi.org/10.1007/BF02614326.
Cozad, A., Sahinidis, N.V., Miller, D.C., 2014. Learning surrogate models for
References simulation-based optimization. AIChE J. 60 (6), http://dx.doi.org/10.1002/aic.
14418.
Adi, V.S.K., Laxmidewi, R., Chang, C.-T., 2016. An effective computation strategy for Cozad, A., Sahinidis, N.V., Miller, D.C., 2015. A combined first-principles and
assessing operational flexibility of high-dimensional systems with data-driven approach to model building. Comput. Chem. Eng. 73, 116–127,
complicated feasible regions. Chem. Eng. Sci. 147, 137–149, http://dx.doi.org/ http://dx.doi.org/10.1016/j.compchemeng.2014.11.010.
10.1016/j.ces.2016.03.028. Cressie, N., Johannesson, G., 2008. Fixed rank kriging for very large spatial data
Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. sets. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70 (1), 209–226, http://dx.doi.org/
Autom. Control 19 (6), 716–723, http://dx.doi.org/10.1109/TAC.1974.1100705. 10.1111/j.1467-9868.2007.00633.x.
Anthony, A.G., Vladimir, B., Dan, H., Bernard, G., William, H.M., Layne, T.W., Crombecq, K., Laermans, E., Dhaene, T., 2011. Efficient space-filling and
Raphael, T.H., 1997. Multidisciplinary Optimization of a Supersonic Transport non-collapsing sequential design strategies for simulation-based modeling.
Using Design of Experiments Theory and Response Surface Modeling. Virginia Eur. J. Oper. Res. 214 (3), 683–696, http://dx.doi.org/10.1016/j.ejor.2011.05.
Polytechnic Institute & State University, Blacksburg,VA, USA. 032.
Bachoc, F., 2013. Cross validation and maximum likelihood estimations of Dolan, E., Moré, J.J., 2002. Benchmarcking optimization software with performance
hyper-parameters of Gaussian processes with model misspecification. Comput. profiles. Math. Programm. 91 (2), 201–213, http://dx.doi.org/10.1007/
Stat. Data Anal. 66, 55–69, http://dx.doi.org/10.1016/j.csda.2013.03.016. s101070100263.
Balabanov, V., Haftka, R., 1998. Multifidelity response surface model for HSCT wing Eason, J., Cremaschi, S., 2014. Adaptive sequential sampling for surrogate model
bending material weight. Proceedings of 7th . . ., 1–18, http://dx.doi.org/10. generation with artificial neural networks. Comput. Chem. Eng. 68, 220–232,
2514/6.1998-4804. http://dx.doi.org/10.1016/j.compchemeng.2014.05.021.
Banerjee, S., Gelfand, A.E., Finley, A.O., Sang, H., 2008. Gaussian predictive process Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The
models for large spatial data sets. J. R. Stat. Soc. Ser. B: Stat. Methodol. 70 (4), Ann. Stat. 32 (2), 407–499 (Retrieved from) http://statweb.stanford.edu/∼tibs/
825–848, http://dx.doi.org/10.1111/j.1467-9868.2008.00663.x. ftp/lars.pdf.
Banerjee, I., Pal, S., Maiti, S., 2010. Computationally efficient black-box modeling Fan, J., Li, R., 2001. Variable selection via nonconcave penalized. J. Am. Stat. Assoc.
for feasibility analysis. Comput. Chem. Eng. 34 (9), 1515–1521, http://dx.doi. 96 (456), 1348–1360, http://dx.doi.org/10.1198/016214501753382273.
org/10.1016/j.compchemeng.2010.02.016. Fan, J., Lv, J., 2008. Sure independence screening for ultrahigh dimensional feature
Barton, R.R., Meckesheimer, M., 2006. Chapter 18 metamodel-based simulation space. J. R. Stat. Soc. Ser. B: Stat. Methodol. 70 (5), 849–911, http://dx.doi.org/
optimization. Handbooks Oper. Res. Manag. Sci. 13 (C), 535–574, http://dx.doi. 10.1111/j.1467-9868.2008.00674.x.
org/10.1016/S0927-0507(06)13018-2. Fernandes, F.A.N., 2006. Optimization of fischer-tropsch synthesis using neural
Barton, R.R., 1992. Metamodels for simulation input-output relations. In: networks. Chem. Eng. Technol. 29 (4), 449–453, http://dx.doi.org/10.1002/ceat.
Proceedings of the 24th Conference on Winter Simulation, New York, NY, USA: 200500310.
ACM, pp. 289–299, http://dx.doi.org/10.1145/167293.167352. Floudas, C.A., Gümüş, Z.H., 2001. Global optimization in design under uncertainty:
Bertsimas, D., King, A., 2016. OR Forum—an algorithmic approach to linear feasibility test and flexibility index problems. Ind. Eng. Chem. Res. 40 (20),
regression. Oper. Res. 64 (1), 2–16, http://dx.doi.org/10.1287/opre.2015.1436. 4267–4282, http://dx.doi.org/10.1021/ie001014g.
Bischl, B., Mersmann, O., Trautmann, H., Weihs, C., 2012. Resampling methods for Forrester, A.I.J., Keane, A.J., 2009. Recent advances in surrogate-based optimization.
meta-model validation with recommendations for evolutionary computation. Progr. Aerospace Sci. 45 (1-3), 50–79, http://dx.doi.org/10.1016/j.paerosci.
Evol. Comput. 20 (2), 249–275, http://dx.doi.org/10.1162/EVCO a 00069. 2008.11.001.
Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Jones, Z.M., Foster, D.P., George, E.I., 1994. The risk inflation criterion for multiple regression.
2016. mlr: Machine learning in r. J. Mach. Learn. Res. 17, 1–5. The Ann. Stat. 22 (4), 1947–1975 (Retrieved from) http://www.jstor.org/stable/
Björkman, M., Holmström, K., 2000. Global optimization of costly nonconvex 2242493.
functions using radial basis functions. Optim. Eng. 1 (4), 373–397, http://dx. Furnival, G.M., Wilson, R.W., 1974. Regressions by leaps and bounds.
doi.org/10.1023/A:1011584207202. Technometrics 16 (4), 499–511 (Retrieved from) http://www.jstor.org/stable/
Bloch, G., Denoeux, T., 2003. Neural networks for process control and optimization: 1267601.
two industrial applications. ISA Trans. 42 (1), 39–51, http://dx.doi.org/10.1016/ Furrer, R., Genton, M.G., Nychka, D., 2006. Covariance tapering for interpolation of
S0019-0578(07)60112-8. large spatial datasets. J. Comput. Graph. Stat. 15 (3), 502–523, http://dx.doi.
Bouhlel, M., Bartoli, N., Otsmane, A., Bouhlel, M., Bartoli, N., Otsmane, A., Morlier, J., org/10.1198/106186006X132178.
2015. Improving Kriging Surrogates of High-dimensional Design Models by GLOBAL Library. http://www.gamsworld.org/global/globallib.htm.
Partial Least Squares Dimension Reduction To Cite This Version: Improving Garud, S.S., Karimi, I.A., Kraft, M., 2017. Smart sampling algorithm for surrogate
Kriging Surrogates of High-dimensional Design Models by Partial Least model development. Comput. Chem. Eng. 96, 103–114, http://dx.doi.org/10.
Squares Dimension Reduction. Structural and Multidisciplinary Optimization, 1016/j.compchemeng.2016.10.006.
pp. 935–952, http://dx.doi.org/10.1007/s00158-015-1395-9. Goel, T., Haftka, R.T., Shyy, W., Queipo, N.V., 2007. Ensemble of surrogates. Struct.
Boukouvala, F., Floudas, C.A., 2016. ARGONAUT: AlgoRithms for Global Multidiscip. Optim. 33 (3), 199–216, http://dx.doi.org/10.1007/s00158-006-
Optimization of coNstrAined grey-box compUTational problems. Optim. Lett., 0051-9.
http://dx.doi.org/10.1007/s11590-016-1028-2. Gorissen, D., Couckuyt, I., Demeester, P., Dhaene, T., Crombecq, K., 2010. A
Boukouvala, F., Ierapetritou, M.G., 2012. Feasibility analysis of black-box processes surrogate modeling and adaptive sampling toolbox for computer based design.
using an adaptive sampling Kriging-based method. Comput. Chem. Eng. 36 (1), J. Mach. Learn. Res. 11, 2051–2055 (Retrieved from) http://dl.acm.org/citation.
358–368, http://dx.doi.org/10.1016/j.compchemeng.2011.06.005. cfm?id=1859919.
Boukouvala, F., Ierapetritou, M.G., 2014. Derivative-free optimization for expensive Goyal, V., Ierapetritou, M.G., 2002. Determination of operability limits using
constrained problems using a novel expected improvement objective function. simplicial approximation. AIChE J. 48 (12), 2902–2909, http://dx.doi.org/10.
AIChE J. 60 (7), 2462–2474, http://dx.doi.org/10.1002/aic.14442. 1002/aic.690481217.
Boukouvalas, A., Cornford, D., 2009. Learning heteroscedastic gaussian processes Grossmann, I.E., Calfa, B.A., Garcia-Herreros, P., 2014. Evolution of concepts and
for complex datasets. Group 44 (0). models for quantifying resiliency and flexibility of chemical processes. Comput.
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Chem. Eng. 70, 22–34, http://dx.doi.org/10.1016/j.compchemeng.2013.12.013.
Technometrics, http://dx.doi.org/10.1080/00401706.1995.10484371. Gunn, S.R., 1998. Support Vector Machines for Classification and Regression. In
Cadima, J., Cerdeira, J.O., Minhoto, M., 2004. Computational aspects of algorithms Group. University of Southampton.
for variable selection in the context of principal components. Comput. Stat. Gutmann, H.-M., 2001. A radial basis function method for global optimization. J.
Data Anal. 47, 225–236, http://dx.doi.org/10.1016/j.csda.2003.11.001. Global Optim. 19 (3), 201–227, http://dx.doi.org/10.1023/A:1011255519438.
Cadima, J., Cerdeira, J. O., Silva, P. D., & Minhoto, M., 2012. The subselect R package. Haftka, R.T., Villanueva, D., Chaudhuri, A., 2016. Parallel surrogate-assisted global
Candes, E., Tao, T., 2007. The Dantzig selector: statistical estimation when p is optimization with expensive functions? A survey. Struct. Multidiscip. Optim.
much larger than n. Ann. Stat. 35 (6), 2313–2351, http://dx.doi.org/10.1214/ 54 (1), 3–13, http://dx.doi.org/10.1007/s00158-016-1432-3.
009053606000001523. Hannan, E.J., Quinn, B.G., 1979. The determination of the order of an
Chen, H., Loeppky, J.L., Sacks, J., Welch, W.J., 2013. Analysis methods for computer autoregression. J. R. Stat. Soc. Ser. B (Methodol.) 41 (2), 190–195 (Retrieved
experiments: how to assess and what counts? Stat. Sci. 31 (1), 40–60, http:// from) http://www.jstor.org/stable/2985032.
dx.doi.org/10.1214/15-STS531. Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning.
Chen, Z., 2014. Extended bayesian information criteria for model selection with Elements 1, 337–387, http://dx.doi.org/10.1007/b94608.
large model spaces model selection extended Bayesian information criteria for Henao, C.A., Maravelias, C.T., 2011. Surrogate-based superstructure optimization
with large model spaces. Biometrika 95 (3), 759–771, http://dx.doi.org/10. framework. AIChE J. 57 (5), 1216–1232, http://dx.doi.org/10.1002/aic.12341.
1093/biomet/asnO34.

Holmström, K., Quttineh, N.H., Edvall, M.M., 2008. An adaptive radial basis Palmer, K., Realff, M., 2002. Metamodeling approach to optimization of
algorithm (ARBF) for expensive black-box mixed-integer constrained global steady-state flowsheet simulations. Chem. Eng. Res. Des. 80 (7), 760–772,
optimization. Optim. Eng. 9, 311–339, http://dx.doi.org/10.1007/s11081-008- http://dx.doi.org/10.1205/026387602320776830.
9037-3. Park, Y. W., & Klabjan, D., 2013. Subset Selection for Multiple Linear Regression via
Hooke, R., Jeeves, T.A., 1960. “Direct S E a R C H ” Solution of Numerical and Optimization, (i), 1–27.
Statistical Problems **., pp. 212–229. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Huang, H.-C., Chen, C.-S., 2007. Optimal geostatistical model selection. J. Am. Stat. Duchesnay, E., et al., 2011. Scikit-learn: machine learning in {P}ython. J. Mach.
Assoc. 102 (479), 1009–1024, http://dx.doi.org/10.1198/ Learn. Res. 12, 2825–2830.
016214507000000491. Powell, M.J.D., 1994. A direct search optimization method that models the
objective and constraint functions by linear interpolation. In: Gomez, S., Hennart, J.-P. (Eds.), Advances in Optimization and Numerical Analysis. Springer Netherlands, Dordrecht, pp. 51–67, http://dx.doi.org/10.1007/978-94-015-8330-5_4.
Hurvich, C.M., Tsai, C.-L., 1989. Regression and time series model selection in small samples. Biometrika 76 (2), 297–307, http://dx.doi.org/10.2307/2336663.
Jia, Z., Davis, E., Muzzio, F.J., Ierapetritou, M.G., 2009. Predictive modeling for pharmaceutical processes using Kriging and response surface. J. Pharm. Innov. 4 (4), 174–186, http://dx.doi.org/10.1007/s12247-009-9070-6.
Jin, R., Chen, W., Simpson, T.W., 2001. Comparative studies of metamodelling techniques under multiple modelling criteria. Struct. Multidiscip. Optim. 23 (1), 1–13, http://dx.doi.org/10.1007/s00158-001-0160-4.
Jones, D.R., Perttunen, C.D., Stuckman, B.E., 1993. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl. 79 (1), 157–181, http://dx.doi.org/10.1007/BF00941892.
Jones, D.R., Schonlau, M., Welch, W.J., 1998. Efficient global optimization of expensive black-box functions. J. Global Optim. 13 (4), 455–492, http://dx.doi.org/10.1023/A:1008306431147.
Joseph, V.R., Hung, Y., Sudjianto, A., 2008. Blind kriging: a new method for developing metamodels. J. Mech. Des. 130 (3), 1–8, http://dx.doi.org/10.1115/1.2829873.
Kaufman, C.G., Schervish, M.J., Nychka, D., 2008. Covariance tapering for likelihood-based estimation in large spatial data sets. J. Am. Stat. Assoc. 103 (484), 1545–1555, http://dx.doi.org/10.1198/016214508000000959.
Kersting, K., Plagemann, C., Pfaff, P., Burgard, W., 2007. Most likely heteroscedastic Gaussian process regression. 24th International Conference on Machine Learning (ICML 2007), 393–400, http://dx.doi.org/10.1145/1273496.1273546.
Konno, H., Yamamoto, R., 2009a. Choosing the best set of variables in regression analysis using integer programming. J. Global Optim. 44 (2), 273–282, http://dx.doi.org/10.1007/s10898-008-9323-9.
Konno, H., Yamamoto, R., 2009b. Choosing the best set of variables in regression analysis using integer programming. J. Global Optim. 44 (2), 273–282, http://dx.doi.org/10.1007/s10898-008-9323-9.
Krige, D.G., 1951. A Statistical Approach to Some Mine Valuation and Allied Problems on the Witwatersrand.
Liang, F., Cheng, Y., Song, Q., Park, J., Yang, P., 2013. A resampling-based stochastic approximation method for analysis of large geostatistical data. J. Am. Stat. Assoc. 108 (501), 325–339, http://dx.doi.org/10.1080/01621459.2012.746061.
Linhart, H., Zucchini, W., 1986. Model Selection. John Wiley & Sons, Inc., New York, NY, USA.
Liu, H., Motoda, H., 2007. Computational Methods of Feature Selection (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Chapman & Hall/CRC.
Lophaven, S., Nielsen, H., Søndergaard, J., 2002. DACE: A MATLAB Kriging Toolbox, Version 2.0. Tech. Rep. IMM-TR-2002-12, Informatics and Mathematical Modelling.
Lophaven, S., Nielsen, H., Søndergaard, J., 2002. Aspects of the MATLAB toolbox DACE. Technical Report IMM-REP-2002-13.
Müller, J., Piché, R., 2011. Mixture surrogate models based on Dempster-Shafer theory for global optimization problems. J. Global Optim. 51 (1), 79–104, http://dx.doi.org/10.1007/s10898-010-9620-y.
Müller, J., Paudel, R., Shoemaker, C.A., Woodbury, J., Wang, Y., Mahowald, N., 2015. CH4 parameter estimation in CLM4.5bgc using surrogate global optimization. Geosci. Model Dev. 8 (10), 3285–3310, http://dx.doi.org/10.5194/gmd-8-3285-2015.
Mallows, C.L., 1973. Some comments on Cp. Technometrics 15 (4), 661–675, http://dx.doi.org/10.2307/1267380.
Matheron, G., 1963. Principles of geostatistics. Econ. Geol.
McKay, M.D., Beckman, R.J., Conover, W.J., 1979. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21 (2), 239–245, http://dx.doi.org/10.1080/00401706.1979.10489755.
Meert, K., Rijckaert, M., 1998. Intelligent modelling in the chemical process industry with neural networks: a case study. Comput. Chem. Eng. 22 (98), S587–S593, http://dx.doi.org/10.1016/s0098-1354(98)00104-5.
Misener, R., Floudas, C.A., 2014. ANTIGONE: algorithms for coNTinuous/Integer global optimization of nonlinear equations. J. Global Optim. 59 (2–3), 503–526, http://dx.doi.org/10.1007/s10898-014-0166-2.
Mujtaba, I.M., Aziz, N., Hussain, M.A., 2006. Neural network based modelling and control in batch reactor. Chem. Eng. Res. Des. 84 (8), 635–644, http://dx.doi.org/10.1205/cherd.05096.
Nelder, J.A., Mead, R., 1965. A simplex method for function minimization. Comput. J. 7 (4), 308–313, http://dx.doi.org/10.1093/comjnl/7.4.308.
Nippgen, F., McGlynn, B., Emanuel, R., Vose, J., 2016. Water resources research. Water Resour. Res., 1–20, http://dx.doi.org/10.1002/2014WR015716.
Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., Sain, S., 2015. A multiresolution Gaussian process model for the analysis of large spatial datasets. J. Comput. Graph. Stat. 24 (2), 579–599, http://dx.doi.org/10.1080/10618600.2014.914946.
Oeuvray, R., Bierlaire, M., 2009. Boosters: a derivative-free algorithm based on radial basis functions. Int. J. Model. Simul. 29 (1), 26–36, http://dx.doi.org/10.1080/02286203.2009.11442507.
Powell, M.J.D., 1994. The NEWUOA software for unconstrained optimization without derivatives. In: Di Pillo, G., Roma, M. (Eds.), Large-Scale Nonlinear Optimization. Springer US, Boston, MA, pp. 255–297, http://dx.doi.org/10.1007/0-387-30065-1_16.
Prebeg, P., Zanic, V., Vazic, B., 2014. Application of a surrogate modeling to the ship structural design. Ocean Eng. 84, 259–272, http://dx.doi.org/10.1016/j.oceaneng.2014.03.032.
Princeton Library. http://www.gamsworld.org/performance/princetonlib/princetonlib.htm.
Provost, F., Jensen, D., Oates, T., 1999. Efficient progressive sampling. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, pp. 23–32, http://dx.doi.org/10.1145/312129.312188.
Queipo, N.V., Haftka, R.T., Shyy, W., Goel, T., Vaidyanathan, R., Kevin Tucker, P., 2005. Surrogate-based analysis and optimization. Prog. Aerosp. Sci. 41 (1), 1–28, http://dx.doi.org/10.1016/j.paerosci.2005.02.001.
Ranjan, P., Haynes, R., Karsten, R., 2011. A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data. Technometrics 53 (4), 366–378, http://dx.doi.org/10.1198/TECH.2011.09141.
Rasmussen, C.E., Williams, C., 2006a. Gaussian Processes for Machine Learning. MIT Press (Retrieved from) http://www.gaussianprocess.org/gpml/.
Rasmussen, C.E., Williams, C., 2006b. Gaussian Processes for Machine Learning. MIT Press.
Razavi, S., Tolson, B.A., Burn, D.H., 2012. Review of surrogate modeling in water resources. Water Resour. Res. 48 (7), http://dx.doi.org/10.1029/2011wr011527.
Regis, R.G., Shoemaker, C.A., 2005. Constrained global optimization of expensive black box functions using radial basis functions. J. Global Optim. 31 (1), 153–171, http://dx.doi.org/10.1007/s10898-004-0570-0.
Regis, R.G., Shoemaker, C.A., 2007. Improved strategies for radial basis function methods for global optimization. J. Global Optim. 37 (1), 113–135, http://dx.doi.org/10.1007/s10898-006-9040-1.
Regis, R.G., Wild, S.M., 2015. CONORBIT: constrained optimization by radial basis function interpolation in trust regions (October).
Regis, R.G., 2016. Trust regions in Kriging-based optimization with expected improvement. Eng. Optim. 48 (6), 1037–1059, http://dx.doi.org/10.1080/0305215X.2015.1082350.
Riolo, R., O'Reilly, U.-M., McConaghy, T. (Eds.), 2010. Genetic Programming Theory and Practice VII, http://dx.doi.org/10.1007/978-1-4419-1626-6.
Riolo, R., McConaghy, T., Vladislavleva, E. (Eds.), 2011. Genetic Programming Theory and Practice VIII, http://dx.doi.org/10.1007/978-1-4419-7747-2.
Rios, L.M., Sahinidis, N.V., 2013. Derivative-free optimization: a review of algorithms and comparison of software implementations. J. Global Optim. 56 (3), 1247–1293, http://dx.doi.org/10.1007/s10898-012-9951-y.
Rogers, A., Ierapetritou, M., 2015. Feasibility and flexibility analysis of black-box processes part 2: Surrogate-based flexibility analysis. Chem. Eng. Sci. 137, 1005–1013, http://dx.doi.org/10.1016/j.ces.2015.06.026.
Sacks, J., Welch, W.J., Mitchell, T.J., Wynn, H.P., 1989. Design and analysis of computer experiments. Stat. Sci. 4, 409–435.
Sang, H., Huang, J., 2012. A full scale approximation of covariance functions for large spatial data sets. J. R. Stat. Soc. Ser. B 74 (1), 111–132, http://dx.doi.org/10.1111/j.1467-9868.2011.01007.x.
Schmidt, M., Lipson, H., 2009. Distilling natural laws. Science 324 (April), 81–85, http://dx.doi.org/10.1126/science.1165893.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Stat. 6 (2), 461–464 (Retrieved from) http://www.jstor.org/stable/2958889.
Seber, G.A.F., Lee, A.J., 2003. Linear Regression Analysis. Wiley-Interscience.
Simpson, T.W., Peplinski, J.D., Koch, P.N., Allen, J.K., 1997. On the use of statistics in design and the implications for deterministic computer experiments. Proceedings of DETC'97 1997 ASME Design Engineering Technical Conferences, 1–14.
Smola, A.J., Schölkopf, B., 2004. A tutorial on support vector regression. Stat. Comput. 14 (3), 199–222, http://dx.doi.org/10.1023/B:STCO.0000035301.49549.88.
Snelson, E., Ghahramani, Z., 2007. Local and global sparse Gaussian process approximations. In: Meila, M., Shen, X. (Eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), vol. 2, pp. 524–531. Journal of Machine Learning Research – Proceedings Track. Retrieved from http://jmlr.csail.mit.edu/proceedings/papers/v2/snelson07a/snelson07a.pdf.

Sobol, I.M., 1967. On the distribution of points in a cube and the approximate evaluation of integrals. Zh. Vychisl. Mat. i Mat. Fiz. 7 (7), 784–802, http://dx.doi.org/10.1016/0041-5553(67)90144-9.
Straub, D.A., Grossmann, I.E., 1993. Design optimization of stochastic flexibility. Comput. Chem. Eng. 17 (4), 339–354, http://dx.doi.org/10.1016/0098-1354(93)80025-I.
Swaney, R.E., Grossmann, I.E., 1985. An index for operational flexibility in chemical process design. Part I: Formulation and theory. AIChE J. 31 (4), 621–630, http://dx.doi.org/10.1002/aic.690310412.
Tajbakhsh, S., Aybat, N., Del Castillo, E., 2014. Sparse precision matrix selection for fitting Gaussian random field models to large data sets. arXiv preprint arXiv:1405.5576, 1–18, http://doi.org/10.1007/s10107-014-0826-5.
Tawarmalani, M., Sahinidis, N.V., 2005. A polyhedral branch-and-cut approach to global optimization. Math. Program. 103 (2), 225–249, http://dx.doi.org/10.1007/s10107-005-0581-8.
Thacker, W.I., Zhang, J., Watson, L.T., Birch, J.B., Iyer, M.A., Berry, M.W., 2010. Algorithm 905: SHEPPACK: modified Shepard algorithm for interpolation of scattered multivariate data. ACM Trans. Math. Softw. 37 (3), 34:1–34:20, http://dx.doi.org/10.1145/1824801.1824812.
Toal, D.J.J., Forrester, A.I.J., Bressloff, N.W., Keane, A.J., Holden, C., 2009. An adjoint for likelihood maximization. Proc. R. Soc. A Math. Phys. Eng. Sci. 465 (2011), 3267–3287, http://dx.doi.org/10.1098/rspa.2009.0096.
Toal, D.J.J., Bressloff, N.W., Keane, A.J., Holden, C.M.E., 2011. The development of a hybridized particle swarm for kriging hyperparameter tuning. Eng. Optim. 43 (6), 675–699, http://dx.doi.org/10.1080/0305215X.2010.508524.
Viana, F.A.C., Haftka, R.T., Steffen, V., 2009. Multiple surrogates: how cross-validation errors can help us to obtain the best predictor. Struct. Multidiscip. Optim. 39 (4), 439–457, http://dx.doi.org/10.1007/s00158-008-0338-0.
Viana, F.A.C., Haftka, R.T., Watson, L.T., 2013. Efficient global optimization algorithm assisted by multiple surrogate techniques. J. Global Optim. 56 (2), 669–689, http://dx.doi.org/10.1007/s10898-012-9892-5.
Viana, F.A.C., 2010. SURROGATES Toolbox User's Guide, version 2.1.
Wang, Z., Ierapetritou, M., 2016. A novel feasibility analysis method for black-box processes using a radial basis function adaptive sampling approach. AIChE J., http://dx.doi.org/10.1002/aic.15362.
Wang, G.G., Shan, S., 2007. Review of metamodeling techniques in support of engineering design optimization. J. Mech. Des. 129 (4), 370, http://dx.doi.org/10.1115/1.2429697.
Wild, S.M., Shoemaker, C., 2013. Global convergence of radial basis function trust-region algorithms for derivative-free optimization. SIAM Rev. 55 (2), 349–371, http://dx.doi.org/10.1137/120902434.
Wild, S.M., Regis, R.G., Shoemaker, C.A., 2008. ORBIT: optimization by radial basis function interpolation in trust regions. SIAM J. Sci. Comput. 30 (6), 3197–3219, http://dx.doi.org/10.1137/070691814.
Wilson, Z.T., Sahinidis, N.V., 2017. The ALAMO approach to machine learning. Comput. Chem. Eng., http://dx.doi.org/10.1016/j.compchemeng.2017.02.010.
Yang, R.J., Wang, N., Tho, C.H., Bobineau, J.P., Wang, B.P., 2005. Metamodeling development for vehicle frontal impact simulation. J. Mech. Des. 127 (5), 1014, http://dx.doi.org/10.1115/1.1906264.
Yin, J., Ng, S.H., Ng, K.M., 2011. Kriging metamodel with modified nugget-effect: the heteroscedastic variance case. Comput. Ind. Eng. 61 (3), 760–777, http://dx.doi.org/10.1016/j.cie.2011.05.008.
Zerpa, L.E., Queipo, N.V., Pintos, S., Salager, J.-L., 2005. An optimization methodology of alkaline-surfactant-polymer flooding processes using field scale numerical simulation and multiple surrogates. J. Petrol. Sci. Eng. 47 (3–4), 197–208, http://dx.doi.org/10.1016/j.petrol.2005.03.002.
Zhang, Q., Grossmann, I.E., Sundaramoorthy, A., Pinto, J.M., 2016. Data-driven construction of convex region surrogate models. Optim. Eng. 17, Springer US, http://dx.doi.org/10.1007/s11081-015-9288-8.
Zou, H., Li, R., 2008. One-step sparse estimates in nonconcave penalized likelihood models.
Zou, H., 2006. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101 (476), 1418–1429, http://dx.doi.org/10.1198/016214506000000735.
