
Zhiyun Gong MS Thesis Proposal 15 December 2020

Application of Machine Learning and Bayesian Optimization in Scientific Research

Background
Machine learning is a family of artificial intelligence algorithms that train computational
models to learn patterns from data, helping researchers make predictions or generate new insights into a
problem, and it has been widely applied across areas of study. In biological research, there are
circumstances where researchers need to assess the quality of experimental results; however, a simple
numeric quality measure is sometimes not comprehensive enough, and when the data are high-dimensional
and hard to visualize, making decisions based on them becomes challenging for experimental researchers.
For example, in 2D Differential In-Gel Electrophoresis (2D-DIGE) experiments, the results of the first
dimension of separation by isoelectric focusing (IEF) directly affect the quality of the final gel
electrophoresis image. However, there is no consensus on what an IEF time series that gives rise to
high-quality protein separation should look like. Therefore, we believe that supervised machine learning
is a good candidate for classifying these time-series data and for deciding whether the more expensive and
time-consuming second dimension would be worthwhile.

Another type of problem faced in scientific research is the optimization of experimental parameters. In
real-world wet-lab experiments, the solution is usually empirical, since an exhaustive search is often too
expensive. We propose that Bayesian optimization (BO) could be a good solution to this problem: BO uses
historical data as prior knowledge, fits a probabilistic surrogate model of the objective function, and,
according to the posterior, suggests for evaluation a new candidate that is likely to improve the
objective.
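
As a rough illustration of the sequential workflow (not this project's implementation), the toy example below runs BO with the existing rBayesianOptimization package (Yan, 2016) on a simulated quality score; the `simulated_experiment` function is a made-up stand-in for a real wet-lab evaluation.

```r
# Toy run of sequential BO using the existing rBayesianOptimization package
# (Yan, 2016); the "experiment" here is only a simulated quality score.
library(rBayesianOptimization)

simulated_experiment <- function(temperature, ph) {
  score <- -(temperature - 37)^2 / 100 - (ph - 7.4)^2   # best quality near 37 C, pH 7.4
  list(Score = score, Pred = 0)                         # return format required by the package
}

result <- BayesianOptimization(
  simulated_experiment,
  bounds      = list(temperature = c(20, 60), ph = c(5, 9)),
  init_points = 5,     # random experiments used to seed the surrogate
  n_iter      = 10,    # sequential BO rounds, one new suggestion per round
  acq         = "ei"   # Expected Improvement acquisition function
)
result$Best_Par        # parameter combination with the best observed score
```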

BO is conventionally a sequential process, in which only one new candidate is suggested per round
after the previous evaluation has completed. In the wet-lab research environment, however, there are
scenarios where researchers need to tweak different parameters to obtain results of satisfactory
quality while multiple evaluations can be performed in parallel. For example, the researchers may
have access to multiple instruments of the same model, or an instrument may have the capacity to run
experiments on multiple samples at the same time. In such settings, a parallel version of Bayesian
optimization that incorporates multiple evaluations and suggests a batch of new experiments to run
next is a more reasonable and efficient choice, and may save a significant amount of time.
Specifically, in the first case, one instrument may finish an evaluation earlier than the others even
if they all started at the same time. It is then reasonable to incorporate the newly available data
point into the prior knowledge as soon as possible rather than let it sit idle while waiting for the
others to finish; this is a suitable circumstance for asynchronous parallel optimization. In the
second case, by contrast, the synchronous version of the optimization is a better fit.
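
To make the distinction concrete, the sketch below contrasts the two update policies using the future package for parallel evaluation in R; `evaluate()` and `suggest_next()` are toy stand-ins for a wet-lab run and an acquisition step, not part of any existing package.

```r
# Sketch of synchronous vs. asynchronous batch evaluation using the 'future'
# package; evaluate() and suggest_next() are toy stand-ins for illustration only.
library(future)
plan(multisession, workers = 3)

evaluate     <- function(x) { Sys.sleep(runif(1, 0.5, 2)); -(x - 0.6)^2 }  # toy "experiment"
suggest_next <- function(history) runif(1)                                 # toy acquisition step

history <- data.frame(x = numeric(), y = numeric())

# Synchronous: wait for the whole batch to finish, then update the history once.
xs   <- replicate(3, suggest_next(history))
jobs <- lapply(xs, function(x) future(evaluate(x)))
history <- rbind(history, data.frame(x = xs, y = sapply(jobs, value)))     # value() blocks

# Asynchronous: fold in each result as soon as its worker finishes, so the
# freed worker can immediately receive a new suggestion based on updated data.
jobs <- lapply(replicate(3, suggest_next(history)),
               function(x) list(x = x, f = future(evaluate(x))))
while (length(jobs) > 0) {
  done <- which(vapply(jobs, function(j) resolved(j$f), logical(1)))[1]
  if (is.na(done)) { Sys.sleep(0.1); next }
  history <- rbind(history, data.frame(x = jobs[[done]]$x, y = value(jobs[[done]]$f)))
  jobs[[done]] <- NULL   # in a real run, a replacement future would be launched here
}
```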

Several packages for Bayesian optimization are currently available in the R community (Yan, 2016;
Wilson, 2020; Bischl et al., 2017; Roustant, Ginsbourger & Deville, 2012; Kuhn, 2020). Among them,
ParBayesianOptimization, mlrMBO, and DiceOptim support batch parallel optimization, while the others
work only in sequential mode. However, all of these packages require users to be comfortable with
writing objective functions in R and incorporating them into the specific optimization framework,
which may not be convenient for many experimental researchers. In addition, asynchronous parallel
optimization is not supported by any of the existing libraries.

Objectives
There are two main objectives of this thesis, summarized below. The first aim is essentially
complete, apart from adding more data to the model in the future to improve its performance.

1. We aimed to classify the time-series data generated by IEF experiments into good and bad groups
using supervised machine learning methods, which may provide researchers with a prediction of
whether an IEF experiment would give rise to good image quality if the second-dimension SDS-PAGE
separation were performed.

2. We also aim to implement and integrate different variants of Bayesian optimization and to provide
a product that users can employ either by running the functions programmatically or through a
user-friendly interface. Using this tool, the user will be able to easily pass in a tabular file
containing experimental results, specify the ranges and types of all parameters to be optimized,
choose the suitable variant of the algorithm and the surrogate model, and set the batch size and
the maximum number of experiments allowed. The duration of each round of optimization depends on
the nature of the actual experimental instrument and in many cases can be long, so another proposed
feature is the ability to save interim results and load them back later, resuming the optimization
process from where it stopped without losing the historical steps.
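
As an illustration of the kind of programmatic interface we have in mind, the sketch below shows how such a function might eventually be called; the function name `optimize_experiment()`, its arguments, and the input file are all hypothetical and not a finalized API.

```r
# Hypothetical usage sketch of the proposed package; optimize_experiment() and
# all of its arguments are illustrative names, not a finalized interface.
history <- read.csv("hplc_results.csv")   # tabular file of past experimental results

next_batch <- optimize_experiment(
  data        = history,
  objective   = "peak_resolution",                       # column to maximize
  parameters  = list(
    flow_rate   = list(type = "numeric", range = c(0.2, 1.5)),
    column_temp = list(type = "integer", range = c(25L, 60L)),
    buffer      = list(type = "categorical", levels = c("A", "B", "C"))
  ),
  mode            = "async",    # "sequential", "sync", or "async"
  surrogate       = "gp_rbf",   # Gaussian Process with an RBF kernel
  acquisition     = "ei",       # EI, UCB, PI, or Thompson sampling
  batch_size      = 4,
  max_experiments = 60
)

saveRDS(next_batch, "optimization_state.rds")   # save interim results to resume later
```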

Current results
1. Classification of Isoelectric Focusing data of 2D-DIGE (Spring 2020)
We used Random Forest as a baseline model and implemented 1-Dimensional Convolutional
Neural Network (1D CNN) and Long Short-Term Memory (LSTM) models to map the time-series data to binary
labels of the resulting 2D images: good or bad. Our preliminary results suggest that it might be
possible to build a classifier that achieves moderate to high accuracy (80–85%) with good AUCs
(~0.8) using CNNs or RNNs. Both of these models outperform the Random Forest baseline.
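
For reference, a 1D CNN of the kind described can be assembled in a few lines with the keras R interface; the layer sizes and input length below are illustrative assumptions, not the configuration of the trained models reported above.

```r
# Illustrative 1D CNN for binary classification of IEF time series, built with
# the keras R interface; layer sizes and the series length (200 time points)
# are assumptions, not the configuration of the trained models reported above.
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 16, kernel_size = 5, activation = "relu",
                input_shape = c(200, 1)) %>%             # 200 time points, 1 channel
  layer_max_pooling_1d(pool_size = 2) %>%
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 1, activation = "sigmoid")          # P(good separation)

model %>% compile(
  loss      = "binary_crossentropy",
  optimizer = "adam",
  metrics   = "accuracy"
)
# model %>% fit(x_train, y_train, epochs = 30, validation_split = 0.2)
```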

We also constructed a prototype web app using the Shiny R package (see Fig. 1). The app enables
users to upload raw IEF experiment data and obtain, within seconds, prediction results from the two
trained neural network models for all lanes of the experiment.

[Figure 1. Screenshot of the prototype Shiny app for IEF quality prediction.]
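
A minimal structural sketch of such an upload-and-predict app in Shiny is shown below; `predict_lanes()` is a hypothetical placeholder for applying the trained CNN/LSTM models to each lane.

```r
# Structural sketch of the prediction app; predict_lanes() is a hypothetical
# placeholder for applying the trained CNN/LSTM models to each lane.
library(shiny)

ui <- fluidPage(
  titlePanel("IEF quality prediction"),
  fileInput("ief_file", "Upload raw IEF time-series data (CSV)"),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$ief_file)                          # wait until a file is uploaded
    lanes <- read.csv(input$ief_file$datapath)   # one column per lane
    predict_lanes(lanes)                         # hypothetical: good/bad label per lane
  })
}

shinyApp(ui, server)
```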

2. A prototype of the Shiny-based optimization app (Fall 2020)

A prototype of the Shiny app was developed in the Fall 2020 semester and largely achieves the
interface we envision: users can either set up a new optimization job by specifying the types and
constraints of all parameters, or resume a previous job. Four infill methods are currently
available: Expected Improvement, Upper Confidence Bound, Probability of Improvement, and Thompson
Sampling. Once the suggested experimental parameters have been evaluated, users can enter the
quality measures back into the app, append the new records to the historical experiments, and
decide whether to proceed with another iteration of optimization.

Methods
1. Algorithms to implement
Sequential, synchronous parallel, and asynchronous parallel versions of Bayesian optimization
algorithms will all be included in the package (Kandasamy et al., 2017, 2018). For the
surrogate, a Gaussian Process regression model with a Radial Basis Function (RBF) kernel is
implemented (via kernlab in R and GPy in Python) to approximate the objective function from
prior knowledge with uncertainty (Karatzoglou et al., 2004; GPy, 2012). The next
combination(s) of parameters to evaluate are then selected according to an acquisition
function (Expected Improvement, Upper Confidence Bound, or Probability of Improvement) or by
Thompson Sampling from the posterior. If time allows, we will also try to incorporate into our
package the re-implemented IMGPO algorithm, which is currently available in Matlab, together
with a batch version of it derived by Trevor (Kawaguchi, Kaelbling & Lozano-Pérez, 2016).
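
For concreteness, fitting the RBF-kernel GP surrogate with kernlab and scoring candidates by Expected Improvement might look roughly like the sketch below; the observed data and candidate grid are toy illustrations rather than real experimental values.

```r
# Sketch of the surrogate + acquisition step: a GP with an RBF kernel fitted by
# kernlab, with candidates scored by Expected Improvement. The observed data and
# the candidate grid are toy values for illustration.
library(kernlab)

history    <- data.frame(x = c(0.10, 0.35, 0.60, 0.90),
                         y = c(0.21, 0.55, 0.72, 0.40))      # observed quality
candidates <- data.frame(x = seq(0, 1, length.out = 200))

gp <- gausspr(y ~ x, data = history, kernel = "rbfdot",
              variance.model = TRUE)                         # keep the predictive variance

mu  <- predict(gp, candidates)                               # posterior mean
sdv <- predict(gp, candidates, type = "sdeviation")          # posterior standard deviation

best <- max(history$y)
z    <- (mu - best) / sdv
ei   <- (mu - best) * pnorm(z) + sdv * dnorm(z)              # Expected Improvement

candidates$x[which.max(ei)]                                  # next value to evaluate
```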

2. Design of the products


There are two possible products from this project. One is an R library, which requires the user to
be comfortable with running the optimization functions manually in R. The package will encompass
a main function that can be called directly from the user's code, with parameters passed either
directly as arguments or entered through the embedded Shiny app.

The second is a fully functional Shiny app with a more visually intuitive interface, with which
users can complete multiple rounds of optimization without writing any code. A dashboard on the
user interface will show the sets of parameters evaluated so far and the next set or batch of
parameters suggested by the algorithm. Once satisfactory results are obtained, the user can stop
the evaluation and download a report containing all of the parameter combinations and their
corresponding objective values.

If time allows, we also plan to deploy the Shiny app to an online server, so that users do not
have to install the package and run it locally and can re-upload their previous results to resume
from where they left off, given that in many real-world wet-lab settings a single round can take a
relatively long time to complete.
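
The save-and-resume feature can build on base R serialization; the sketch below is a minimal illustration, and the fields stored in the state object are assumptions about what the app will track rather than a finalized format.

```r
# Minimal sketch of saving and resuming an optimization job with base R
# serialization; the fields kept in `state` are assumptions about what the
# app will track, not a finalized format.
history       <- data.frame(flow_rate = c(0.5, 0.8), quality = c(0.62, 0.71))  # toy records
pending_batch <- data.frame(flow_rate = c(1.1, 1.3))                           # awaiting results

state <- list(
  history  = history,
  pending  = pending_batch,
  settings = list(acquisition = "ei", batch_size = 4),
  saved_at = Sys.time()
)
saveRDS(state, "optimization_job.rds")

# Later, possibly in a new session or from a re-uploaded file:
state   <- readRDS("optimization_job.rds")
history <- state$history     # resume from where the job stopped
```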

3. Testing on simulated and real-world datasets


After implementing the algorithms and properly packaging the functions, we will test the package
in its different modes on several datasets to validate the optimization efficiency of the
algorithms as well as the overall user experience of the product. First, we will test the method
on historical MALDI-ToF experiment data from Prof. Alan Russell's lab, covering 120 combinations
of 4 parameters, as well as on a directed protein evolution dataset with 4 mutable loci published
by Wu et al. (2016). We may also use the asynchronous batch algorithm to help select parameters
for new High-Performance Liquid Chromatography (HPLC) experiments run on Emerald Cloud Lab.

Timeline
Time period or deadline: Tasks
Mid December: Proposal final draft (to the committee and academic advisor)
Winter break (23 Dec 2020 – 31 Jan 2021): Documentation of the functions in the package; writing the introduction and methods sections of the thesis; comparison of different methods on existing datasets
1 – 28 February: Testing on real-world experiments; writing the results section of the thesis
Beginning of March: Progression report submission
1 – 31 March: Trying to publish the app online; running additional experiments and making adjustments according to the report feedback
1 – 30 April: Writing up, final revisions
Beginning of May: Thesis defense; thesis document submission

References
Bischl, B., Richter, J., Bossek, J., Horn, D., et al. (2017) mlrMBO: A Modular Framework for Model-
Based Optimization of Expensive Black-Box Functions. [Online]. Available from:
http://arxiv.org/abs/1703.03373.
GPy (2012) GPy: A Gaussian process framework in Python.
Kandasamy, K., Krishnamurthy, A., Schneider, J. & Poczos, B. (2017) Asynchronous Parallel Bayesian
Optimisation via Thompson Sampling. International Conference on Artificial Intelligence and
Statistics, AISTATS 2018. [Online] 133–142. Available from: http://arxiv.org/abs/1705.09236.
Kandasamy, K., Krishnamurthy, A., Schneider, J. & Poczos, B. (2018) Parallelised Bayesian
Optimisation via Thompson Sampling. International Conference on Artificial
Intelligence and Statistics. 84, 133–142.


Karatzoglou, A., Smola, A., Hornik, K. & Zeileis, A. (2004) kernlab: An S4 Package for Kernel
Methods in R. Journal of Statistical Software. [Online] 11 (9), 1–20. Available from:
http://www.jstatsoft.org/v11/i09/.
Kawaguchi, K., Kaelbling, L.P. & Lozano-Pérez, T. (2016) Bayesian Optimization with Exponential
Convergence. Advances in Neural Information Processing Systems. [Online] 2015-January,
2809–2817. Available from: http://arxiv.org/abs/1604.01348.
Kuhn, M. (2020) tune: Tidy Tuning Tools. [Online]. Available from:
https://cran.r-project.org/package=tune.
Roustant, O., Ginsbourger, D. & Deville, Y. (2012) DiceKriging , DiceOptim : Two R Packages for the
Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization. Journal of
Statistical Software. [Online] 51 (1), 1–55. Available from: doi:10.18637/jss.v051.i01.
Wilson, S. (2020) ParBayesianOptimization: Parallel Bayesian Optimization of Hyperparameters.
[Online]. Available from: https://github.com/AnotherSamWilson/ParBayesianOptimization.
Wu, N.C., Dai, L., Olson, C.A., Lloyd-Smith, J.O., et al. (2016) Adaptation in protein fitness landscapes
is facilitated by indirect paths. eLife. [Online] 5 (JULY). Available from: doi:10.7554/eLife.16965
[Accessed: 24 June 2020].
Yan, Y. (2016) rBayesianOptimization: Bayesian Optimization of Hyperparameters. [Online]. Available
from: https://cran.r-project.org/package=rBayesianOptimization.
