MS Thesis Proposal (Revised)
Zhiyun Gong
15 December 2020
Background
Machine learning is a family of artificial intelligence algorithms that train computational models to learn patterns from data, enabling predictions and new insights into a problem, and it has been widely applied across areas of study. In biological research, researchers often need to assess the quality of experimental results; however, a single numeric quality measurement is sometimes not comprehensive enough, and when the data are high-dimensional and hard to visualize, making decisions based on them becomes challenging. For example, in 2D Differential In-Gel Electrophoresis (2D-DIGE) experiments, the result of the first dimension of separation by isoelectric focusing (IEF) directly affects the quality of the final gel electrophoresis image, yet there is no consensus on what the IEF time series that give rise to high-quality protein separation should look like. We therefore believe that supervised machine learning may be a good way to classify these time series and to decide whether the more expensive and time-consuming second dimension would be worthwhile.
Another type of problem faced in scientific research is the optimization of experimental parameters, which in real-world wet-lab work is usually approached empirically, since an exhaustive search would be too expensive. We propose that Bayesian optimization (BO) could be a good solution to this problem: it uses historical data as prior knowledge, fits a probabilistic surrogate model to the objective function, and, according to the posterior, suggests a new candidate likely to improve on the observations, as sketched in the example below.
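As a concrete illustration, the following R sketch runs one BO round for a single parameter on [0, 1], using gausspr from kernlab as the surrogate and Expected Improvement as the acquisition function. The toy history and the one-dimensional search space are assumptions made for the example, not part of the proposed package.

    library(kernlab)  # gausspr: Gaussian process regression

    # Toy history of evaluated parameter values and objective values
    history <- data.frame(x = c(0.1, 0.4, 0.7, 0.9),
                          y = c(0.3, 1.2, 0.8, 0.2))

    # Fit the surrogate; variance.model = TRUE enables predictive std. dev.
    gp <- gausspr(y ~ x, data = history, kernel = "rbfdot",
                  variance.model = TRUE)

    grid  <- data.frame(x = seq(0, 1, length.out = 200))  # candidates
    mu    <- predict(gp, grid)                       # posterior mean
    sigma <- predict(gp, grid, type = "sdeviation")  # posterior std. dev.

    # Expected Improvement over the best value observed so far
    best <- max(history$y)
    z    <- (mu - best) / sigma
    ei   <- (mu - best) * pnorm(z) + sigma * dnorm(z)

    x_next <- grid$x[which.max(ei)]  # candidate suggested for the next run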
BO is conventionally a sequential process, in which a single new candidate is suggested in each round after the previous evaluation has completed. In the wet lab, however, researchers often need to tweak several parameters to obtain results of satisfactory quality while multiple evaluations can be performed in parallel: a lab may have access to several instruments of the same model, or a single instrument may have the capacity to process multiple samples at the same time. A parallel version of Bayesian optimization, which incorporates multiple evaluations and suggests a batch of new experiments to run next, is therefore a more reasonable and efficient choice and may save a significant amount of time. The two cases call for different variants. In the first case, one instrument may finish its evaluation earlier than the others even if they all started at the same time, so it is reasonable to incorporate the newly available data point into the prior knowledge as soon as possible rather than let it sit while the others finish; this is the natural setting for asynchronous parallel optimization. In the second case, the synchronous version, which waits for the whole batch to complete, is a better fit. The sketch below illustrates the difference.
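The scheduling difference can be sketched with the future package (used here purely to illustrate the logic; it is not necessarily what the proposed package will build on). In the synchronous sketch the history is updated only after the whole batch finishes, while in the asynchronous one each finished evaluation is folded in immediately and a replacement point is launched.

    library(future)
    plan(multisession, workers = 3)  # e.g., three identical instruments

    # Synchronous: wait for the whole batch before updating the surrogate.
    run_sync_batch <- function(evaluate, batch) {
      jobs <- lapply(batch, function(x) future(evaluate(x)))
      data.frame(x = batch,
                 y = vapply(jobs, value, numeric(1)))  # blocks until all finish
    }

    # Asynchronous: fold in each result as soon as its worker finishes,
    # then immediately launch a new point proposed from the updated history.
    run_async <- function(evaluate, propose, init, n_total, workers = 3) {
      history  <- data.frame(x = numeric(0), y = numeric(0))
      running  <- lapply(init, function(x) list(x = x, job = future(evaluate(x))))
      launched <- length(init)
      while (length(running) > 0) {
        done <- which(vapply(running, function(r) resolved(r$job), logical(1)))
        if (length(done) == 0) { Sys.sleep(0.1); next }
        for (r in running[done])
          history <- rbind(history, data.frame(x = r$x, y = value(r$job)))
        running <- running[-done]
        while (launched < n_total && length(running) < workers) {
          x_new    <- propose(history)  # uses everything observed so far
          running  <- c(running, list(list(x = x_new, job = future(evaluate(x_new)))))
          launched <- launched + 1
        }
      }
      history
    }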
Several Bayesian optimization packages are currently available in the R community (Yan, 2016; Wilson, 2020; Bischl et al., 2017; Roustant, Ginsbourger & Deville, 2012; Kuhn, 2020). Among them, ParBayesianOptimization, mlrMBO, and DiceOptim support batch parallel optimization, while the others work only in sequential mode. However, all of these packages require users to be comfortable writing objective functions in R and incorporating them into the specific optimization framework, which may not be convenient for many
experimental researchers. Also, asynchronous parallel optimization is not supported by any of the existing
libraries.
Objectives
This thesis has two main objectives, summarized below. The first aim is already complete, apart from adding more data to the model in the future to improve its performance.
1. We aimed to classify the time-series data generated by IEF experiments into good and bad groups using supervised machine learning methods, which may provide researchers with a prediction of whether an IEF experiment would yield a good-quality image if the second-dimension SDS-PAGE separation were performed.
2. We also aim to implement and integrate different variants of Bayesian optimization and to provide a product that users can run either programmatically, by calling the functions directly, or through a user-friendly interface. Using this tool, the user will be able to easily pass in a tabular file containing experimental results, specify the ranges and types of all the parameters to be optimized, choose the suitable variant of the algorithm and the surrogate model, and set the batch size and the maximum number of experiments allowed. The duration of each round of the optimization depends on the nature of the actual experimental instrument, and in many cases this time can be long. So another feature of the proposed product is the ability to save interim results and load them back, resuming the optimization from where it stopped without losing the historical steps. A hypothetical session is sketched below.
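To make the intended workflow concrete, a session might look like the following sketch. The function suggest_next and all of its argument names are placeholders for the interface we propose, not an existing API, and the parameter names are invented for illustration.

    # Hypothetical usage of the proposed tool (all names are placeholders).
    results <- read.csv("experiment_runs.csv")  # tabular file of past results

    out <- suggest_next(
      data       = results,
      params     = list(voltage  = c(500, 8000),      # numeric range
                        ph_range = c("3-10", "4-7")),  # categorical levels
      variant    = "asynchronous",  # or "sequential" / "synchronous"
      surrogate  = "gp_rbf",
      batch_size = 4,
      max_evals  = 40
    )

    saveRDS(out$state, "bo_state.rds")  # save interim results ...
    state <- readRDS("bo_state.rds")    # ... and resume from them later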
Current results
1. Classification of Isoelectric Focusing data of 2D-DIGE (Spring 2020)
We used Random Forest as a baseline model and implemented a 1-dimensional convolutional neural network (CNN) and a long short-term memory (LSTM) model to map the time-series data to binary labels derived from the resulting 2D images: good or bad. Our preliminary results suggest that it may be possible to build a classifier that achieves moderate to high accuracy (80-85%) with good AUCs (~0.8) using CNNs or RNNs; both of these models outperform the Random Forest baseline.
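For reference, a minimal version of such a 1-D CNN can be written with the keras R package as below. The layer sizes and the assumed trace length of 200 time points are illustrative, not the exact architecture we trained.

    library(keras)  # requires the keras R package with a TensorFlow backend

    n_steps <- 200  # assumed length of one IEF time series (illustrative)

    model <- keras_model_sequential() %>%
      layer_conv_1d(filters = 16, kernel_size = 5, activation = "relu",
                    input_shape = c(n_steps, 1)) %>%
      layer_max_pooling_1d(pool_size = 2) %>%
      layer_flatten() %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")  # P(good separation)

    model %>% compile(optimizer = "adam",
                      loss = "binary_crossentropy",
                      metrics = "accuracy")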
We also constructed a prototype web app using the Shiny R package (see Fig 1). This app enables users to upload raw IEF experiment data and obtain, within seconds, predictions for all lanes of the experiment from the two trained neural network models.
Methods
1. Algorithms to implement
Sequential, synchronous, and asynchronous parallel versions of the Bayesian optimization algorithm will all be included in the package (Kandasamy et al., 2017, 2018). For the surrogate function, a Gaussian process regression model with a radial basis function (RBF) kernel is implemented (via kernlab in R and GPy in Python) to approximate the objective function from prior knowledge with uncertainty estimates (Karatzoglou et al., 2004; GPy, 2012). The next combination(s) of parameters to be evaluated will then be chosen according to an acquisition function (Expected Improvement, Upper Confidence Bound, or Probability of Improvement) or by Thompson sampling from the posterior, as sketched below. If time allows, we will also incorporate into our package a re-implementation of the IMGPO algorithm, which is currently available in Matlab, together with a batch version of it derived by Trevor (Kawaguchi, Kaelbling & Lozano-Pérez, 2016).
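To illustrate the Thompson sampling step, the sketch below draws one function from a GP posterior over a candidate grid and evaluates next wherever the draw is maximal; in the parallel variants, each idle worker simply makes its own independent draw. The tiny hand-rolled GP (RBF kernel with a fixed length-scale) is for exposition only, not the kernlab/GPy implementation the package will use.

    library(MASS)  # mvrnorm: sample from a multivariate normal

    rbf <- function(a, b, l = 0.2) exp(-outer(a, b, "-")^2 / (2 * l^2))

    gp_posterior <- function(x, y, grid, noise = 1e-6) {
      K  <- rbf(x, x) + diag(noise, length(x))
      Ks <- rbf(grid, x)
      list(mu  = drop(Ks %*% solve(K, y)),
           cov = rbf(grid, grid) - Ks %*% solve(K, t(Ks)))
    }

    x <- c(0.1, 0.5, 0.9); y <- c(0.2, 1.1, 0.4)  # toy observations
    grid <- seq(0, 1, length.out = 100)
    post <- gp_posterior(x, y, grid)

    # One Thompson sample; its maximizer is the next point to evaluate.
    draw   <- mvrnorm(1, post$mu, post$cov + diag(1e-8, length(grid)))
    x_next <- grid[which.max(draw)]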
2. Deliverables
The product will have two components. The first is the package itself, whose optimization functions the user can call programmatically. The second is a fully functional Shiny app with a more visually intuitive interface, in which users can complete multiple rounds of optimization without writing any code. A dashboard on the user interface will show the sets of parameters already evaluated together with the next set, or batch, of parameters suggested by the algorithm. Once satisfactory results are obtained, the user can stop the evaluation and download a report containing all of the parameter combinations and their corresponding objective values. A skeleton of such an interface is sketched below.
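A bare-bones skeleton of the interface, with placeholder control names and no call to the optimizer yet, might look like this:

    library(shiny)

    ui <- fluidPage(
      fileInput("results", "Upload experimental results (CSV)"),
      selectInput("variant", "Algorithm variant",
                  c("sequential", "synchronous", "asynchronous")),
      numericInput("batch_size", "Batch size", value = 4, min = 1),
      actionButton("suggest", "Suggest next batch"),
      tableOutput("dashboard")  # evaluated sets plus the suggested next batch
    )

    server <- function(input, output) {
      output$dashboard <- renderTable({
        req(input$results)
        read.csv(input$results$datapath)  # placeholder: echo the history
      })
    }

    # shinyApp(ui, server)  # launch locally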
If time allows, we also plan to deploy the Shiny app to an online server, so that users do not have to install the package and run it locally and can re-upload their previous results and resume from where they left off, which matters because a single round can take a relatively long time in real-world wet-lab settings.
Timeline
Mid December: Proposal final draft (to the committee and academic advisor)
Winter break (23 Dec 2020 – 31 Jan 2021): Documentation of the functions in the package; writing the introduction and methods sections of the thesis; comparison of different methods on existing datasets
1 – 28 February: Testing on real-world experiments; writing the results section of the thesis
Beginning of March: Progression report submission
1 – 31 March: Trying to publish the app online; running additional experiments and making adjustments according to the report feedback
1 – 30 April: Writing up, final revisions
Beginning of May: Thesis defense; thesis document submission
References
Bischl, B., Richter, J., Bossek, J., Horn, D., et al. (2017) mlrMBO: A Modular Framework for Model-
Based Optimization of Expensive Black-Box Functions. [Online]. Available from:
http://arxiv.org/abs/1703.03373.
GPy (2012) GPy: A Gaussian process framework in Python. [Online]. Available from: http://github.com/SheffieldML/GPy.
Kandasamy, K., Krishnamurthy, A., Schneider, J. & Poczos, B. (2017) Asynchronous Parallel Bayesian
Optimisation via Thompson Sampling. International Conference on Artificial Intelligence and
Statistics, AISTATS 2018. [Online] 133–142. Available from: http://arxiv.org/abs/1705.09236.
Kandasamy, K., Krishnamurthy, A., Schneider, J. & Poczos, B. (2018) Parallelised Bayesian Optimisation via Thompson Sampling. International Conference on Artificial Intelligence and Statistics. 84, 133–142.
Karatzoglou, A., Smola, A., Hornik, K. & Zeileis, A. (2004) kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software. [Online] 11 (9), 1–20. Available from: http://www.jstatsoft.org/v11/i09/.
Kawaguchi, K., Kaelbling, L.P. & Lozano-Pérez, T. (2016) Bayesian Optimization with Exponential Convergence. Advances in Neural Information Processing Systems. [Online] 2809–2817. Available from: http://arxiv.org/abs/1604.01348.
Kuhn, M. (2020) tune: Tidy Tuning Tools. [Online]. Available from:
https://cran.r-project.org/package=tune.
Roustant, O., Ginsbourger, D. & Deville, Y. (2012) DiceKriging, DiceOptim: Two R Packages for the
Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization. Journal of
Statistical Software. [Online] 51 (1), 1–55. Available from: doi:10.18637/jss.v051.i01.
Wilson, S. (2020) ParBayesianOptimization: Parallel Bayesian Optimization of Hyperparameters.
[Online]. Available from: https://github.com/AnotherSamWilson/ParBayesianOptimization.
Wu, N.C., Dai, L., Olson, C.A., Lloyd-Smith, J.O., et al. (2016) Adaptation in protein fitness landscapes
is facilitated by indirect paths. eLife. [Online] 5 (July). Available from: doi:10.7554/eLife.16965
[Accessed: 24 June 2020].
Yan, Y. (2016) rBayesianOptimization: Bayesian Optimization of Hyperparameters. [Online]. Available
from: https://cran.r-project.org/package=rBayesianOptimization.