Opportunities and Obstacles For Deep Learning in Biology and Medicine
This manuscript was automatically generated from greenelab/deep-review@a01dd71 on January 19, 2018.
Authors
Travers Ching^1,☯, Daniel S. Himmelstein^2, Brett K. Beaulieu-Jones^3, Alexandr A. Kalinin^4, Brian T. Do^5, Gregory P. Way^2, Enrico Ferrero^6, Paul-Michael Agapow^7, Michael Zietz^2, Michael M. Hoffman^8,9,10, Wei Xie^11, Gail L. Rosen^12, Benjamin J. Lengerich^13, Johnny Israeli^14, Jack Lanchantin^15, Stephen Woloszynek^12, Anne E. Carpenter^16, Avanti Shrikumar^17, Jinbo Xu^18, Evan M. Cofer^19,20, Christopher A. Lavender^21, Srinivas C. Turaga^22, Amr M. Alexandari^17, Zhiyong Lu^23, David J. Harris^24, Dave DeCaprio^25, Yanjun Qi^15, Anshul Kundaje^17,26, Yifan Peng^23, Laura K. Wiley^27, Marwin H.S. Segler^28, Simina M. Boca^29, S. Joshua Swamidass^30, Austin Huang^31, Anthony Gitter^32,33,†, Casey S. Greene^2,†
☯ — Author order was determined with a randomized algorithm
† — To whom correspondence should be addressed: gitter@biostat.wisc.edu (A.G.) and greenescientist@gmail.com (C.S.G.)
1. Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI
2. Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of
Pennsylvania, Philadelphia, PA
3. Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania,
Philadelphia, PA
4. Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI
5. Harvard Medical School, Boston, MA
6. Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, United Kingdom
7. Data Science Institute, Imperial College London, London, United Kingdom
8. Princess Margaret Cancer Centre, Toronto, ON, Canada
9. Department of Medical Biophysics, Toronto, ON, Canada
10. Department of Computer Science, Toronto, ON, Canada
11. Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN
12. Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer
Engineering, Drexel University, Philadelphia, PA
13. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
14. Biophysics Program, Stanford University, Stanford, CA
15. Department of Computer Science, University of Virginia, Charlottesville, VA
16. Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA
17. Department of Computer Science, Stanford University, Stanford, CA
18. Toyota Technological Institute at Chicago, Chicago, IL
Abstract
Deep learning, which describes a class of machine learning algorithms, has recently shown
impressive results across a variety of domains. Biology and medicine are data rich, but the data
are complex and often ill-understood. Problems of this nature may be particularly well-suited to
deep learning techniques. We examine applications of deep learning to a variety of biomedical
problems—patient classification, fundamental biological processes, and treatment of patients—and
discuss whether deep learning will transform these tasks or if the biomedical sphere poses unique
challenges. We find that deep learning has yet to revolutionize or definitively resolve any of these
problems, but promising advances have been made on the prior state of the art. Even when
improvement over a previous baseline has been modest, we have seen signs that deep learning
methods may speed or aid human investigation. More work is needed to address concerns related
to interpretability and how to best model each problem. Furthermore, the limited amount of labeled
data for training presents problems in some domains, as do legal and privacy constraints on work
with sensitive health records. Nonetheless, we foresee deep learning powering changes at both
bench and bedside with the potential to transform several areas of biology and medicine.
Introduction
The term deep learning has come to refer to a collection of new techniques that, together, have
demonstrated breakthrough gains over existing best-in-class machine learning algorithms across
several fields. For example, over the past five years these methods have revolutionized image
classification and speech recognition due to their flexibility and high accuracy [2]. More recently,
deep learning algorithms have shown promise in fields as diverse as high-energy physics [3],
dermatology [4], and translation among written languages [5]. Across fields, “off-the-shelf”
implementations of these algorithms have produced comparable or higher accuracy than previous
best-in-class methods that required years of extensive customization, and specialized
implementations are now being used at industrial scales.
Deep learning approaches grew from research in neural networks, which were first proposed in
1943 [6] as a model for how our brains process information. The history of neural networks is
interesting in its own right [7]. In neural networks, inputs are fed into the input layer, which feeds
into one or more hidden layers, which eventually link to an output layer. A layer consists of a set of
nodes, sometimes called “features” or “units,” which are connected via edges to the immediately
earlier and the immediately deeper layers. In some special neural network architectures, nodes can
connect to themselves with a delay. The nodes of the input layer generally consist of the variables
being measured in the dataset of interest—for example, each node could represent the intensity
value of a specific pixel in an image or the expression level of a gene in a specific transcriptomic
experiment. The neural networks used for deep learning have multiple hidden layers. Each layer
essentially performs feature construction for the layers before it. The training process used often
allows layers deeper in the network to contribute to the refinement of earlier layers. For this reason,
these algorithms can automatically engineer features that are suitable for many tasks and
customize those features for one or more specific tasks.
Deep learning does many of the same things as more familiar machine learning approaches. In
particular, deep learning approaches can be used both in supervised applications—where the goal
is to accurately predict one or more labels or outcomes associated with each data point—in the
place of regression approaches, as well as in unsupervised, or “exploratory” applications—where
the goal is to summarize, explain, or identify interesting patterns in a data set—as a form of
clustering. Deep learning methods may in fact combine both of these steps. When sufficient data
are available and labeled, these methods construct features tuned to a specific problem and
combine those features into a predictor. In fact, if the dataset is “labeled” with binary classes, a
simple neural network with no hidden layers and no cycles between units is equivalent to logistic
regression if the output layer is a sigmoid (logistic) function of the input layer. Similarly, for
continuous outcomes, linear regression can be seen as a simple neural network. Thus, in some
ways, supervised deep learning approaches can be seen as a generalization of regression models
that allow for greater flexibility. Recently, hardware improvements and very large training datasets
have allowed these deep learning techniques to surpass other machine learning algorithms for
many problems. In a famous and early example, scientists from Google demonstrated that a neural
network “discovered” that cats, faces, and pedestrians were important components of online videos
[8] without being told to look for them. What if, more generally, deep learning could solve the
challenges presented by the growth of data in biomedicine? Could these algorithms identify the
“cats” hidden in our data—the patterns unknown to the researcher—and suggest ways to act on
them? In this review, we examine deep learning’s application to biomedical science and discuss
the unique challenges that biomedical data pose for deep learning methods.
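To make the regression analogy concrete, the following minimal sketch (in Python with numpy; the weights and the data point are arbitrary placeholders) shows that a network with no hidden layer and a sigmoid output unit computes exactly the logistic regression function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A "network" with no hidden layer: five input nodes connect directly to a
# single sigmoid output unit. With edge weights w and bias b, its output is
# exactly the logistic regression estimate of P(y = 1 | x).
rng = np.random.default_rng(0)
x = rng.normal(size=5)   # one data point with five input features
w = rng.normal(size=5)   # edge weights (regression coefficients)
b = 0.1                  # bias term (intercept)

p_network = sigmoid(w @ x + b)            # forward pass of the network
p_logistic = sigmoid(np.dot(w, x) + b)    # logistic regression formula
assert np.isclose(p_network, p_logistic)  # identical by construction
```

Adding hidden layers between the inputs and the output generalizes this model, letting the network construct its own intermediate features.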
Several important advances make the current surge of work in this area possible. Easy-to-
use software packages have brought the techniques of the field out of the specialist’s toolkit to a
broad community of computational scientists. Additionally, new techniques for fast training have
enabled their application to larger datasets [9]. Dropout of nodes, edges, and layers makes
networks more robust, even when the number of parameters is very large. Finally, the larger
datasets now available are also sufficient for fitting the many parameters that exist for deep neural
networks. The convergence of these factors currently makes deep learning extremely adaptable
and capable of addressing the nuanced differences of each domain to which it is applied.
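As a brief illustration of dropout, a minimal PyTorch sketch (layer sizes are arbitrary) randomly zeroes half of the hidden units during each training pass and disables this behavior at evaluation time:

```python
import torch
import torch.nn as nn

# Dropout randomly zeroes a fraction of unit activations on every training
# pass, discouraging co-adaptation among units and reducing overfitting even
# when the number of parameters is very large.
layer = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Dropout(p=0.5))

x = torch.randn(1, 100)
layer.train()          # dropout active: roughly half the units are zeroed
out_train = layer(x)
layer.eval()           # dropout disabled for deterministic evaluation
out_eval = layer(x)
```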
Figure 1: Neural networks come in many different forms. Left: a key for the various types of nodes
used in neural networks. Simple FFNN: a feed forward neural network in which inputs are
connected via some function to an output node and the model is trained to produce some output
for a set of inputs. MLP: the multi-layer perceptron is a feed forward neural network in which there
is at least one hidden layer between the input and output nodes. CNN: the convolutional neural
network is a feed forward neural network in which the inputs are grouped spatially into hidden
nodes. In the case of this example, each input node is only connected to hidden nodes alongside
their neighboring input node. Autoencoder: a type of MLP in which the neural network is trained to
produce an output that matches the input to the network. RNN: a deep recurrent neural network is
used to allow the neural network to retain memory over time or sequential inputs. This figure was
inspired by the Neural Network Zoo by Fjodor Van Veen.
This review discusses recent work in the biomedical domain, and most successful applications
select neural network architectures that are well suited to the problem at hand. We sketch out a
few simple example architectures in Figure 1. If data have a natural adjacency structure, a
convolutional neural network (CNN) can take advantage of that structure by emphasizing local
relationships, especially when convolutional layers are used in early layers of the neural network.
Other neural network architectures such as autoencoders require no labels and are now regularly
used for unsupervised tasks. In this review, we do not exhaustively discuss the different types of
deep neural network architectures; an overview of the principal terms used herein is given in Table
1. Table 1 also provides select example applications, though in practice each neural network
architecture has been broadly applied across multiple types of biomedical data. A recent book from
Goodfellow et al. covers neural network architectures in detail [10], and LeCun et al. provide a
more general introduction [2].
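To illustrate the label-free training of autoencoders mentioned above, a minimal PyTorch sketch (layer sizes and the input batch are arbitrary placeholders) trains a network to reconstruct its own input, so the narrow hidden layer must learn a compressed representation:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_inputs)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                     # a batch of unlabeled inputs
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)  # reconstruction error, no labels
loss.backward()
optimizer.step()
```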
Table 1: Glossary.
Neural network (NN): A machine-learning approach inspired by biological neurons, in which inputs are fed into one or more layers, producing an output layer.

Data augmentation: A process by which transformations that do not affect relevant properties of the input data (e.g. arbitrary rotations of histopathology images) are applied to training examples to increase the size of the training set. Example: data augmentation is widely used in the analysis of images because rotation transformations for biomedical images often do not change relevant properties of the image.
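As a concrete illustration of the data augmentation entry, a minimal sketch using torchvision (the image file name is hypothetical) applies random rotations and flips so that one stored image yields many distinct training examples:

```python
from PIL import Image
from torchvision import transforms

# Rotating or flipping a histopathology image changes its pixel values but
# not its biological content, so each transformation yields a new valid
# training example.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),  # arbitrary rotation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

image = Image.open("slide_patch.png")  # hypothetical image file
augmented_examples = [augment(image) for _ in range(10)]
```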
While deep learning shows increased flexibility over other machine learning approaches, as seen
in the remainder of this review, it requires large training sets in order to fit the hidden layers, as well
as accurate labels for the supervised learning applications. For these reasons, deep learning has
recently become popular in some areas of biology and medicine, while having lower adoption in
other areas. At the same time, this highlights the potentially even larger role that it may play in
future research, given the increases in data in all biomedical fields. It is also important to see it as a
branch of machine learning and acknowledge that it has the same limitations as other approaches
in that field. In particular, the results are still dependent on the underlying study design and the
usual caveats of correlation versus causation still apply—a more precise answer is only better than
a less precise one if it answers the correct question.
There are already a number of reviews focused on applications of deep learning in biology [12–16],
healthcare [17,18], and drug discovery [19–22]. Under our guiding question, we sought to highlight
cases where deep learning enabled researchers to solve challenges that were previously
considered infeasible or made difficult, tedious analyses routine. We also identified approaches
that researchers are using to sidestep challenges posed by biomedical data. We find that domain-
specific considerations have greatly influenced how to best harness the power and flexibility of
deep learning. Model interpretability is often critical. Understanding the patterns in data may be just
as important as fitting the data. In addition, there are important and pressing questions about how
to build networks that efficiently represent the underlying structure and logic of the data. Domain
experts can play important roles in designing networks to represent data appropriately, encoding
the most salient prior knowledge and assessing success or failure. There is also great potential to
create deep learning systems that augment biologists and clinicians by prioritizing experiments or
streamlining tasks that do not require expert judgment. We have divided the large range of topics
into three broad classes: Disease and Patient Categorization, Fundamental Biological Study, and
Treatment of Patients. Below, we briefly introduce the types of questions, approaches, and data that
are typical for each class in the application of deep learning.
Disease and patient categorization
A key challenge in biomedicine is the accurate classification of diseases and disease subtypes. In
oncology, current “gold standard” approaches include histology, which requires interpretation by
experts, or assessment of molecular markers such as cell surface receptors or gene expression.
One example is the PAM50 approach to classifying breast cancer where the expression of 50
marker genes divides breast cancer patients into four subtypes. Substantial heterogeneity still
remains within these four subtypes [23,24]. Given the increasing wealth of molecular data
available, a more comprehensive subtyping seems possible. Several studies have used deep
learning methods to better categorize breast cancer patients: for instance, denoising
autoencoders, an unsupervised approach, can be used to cluster breast cancer patients [25], and
CNNs can help count mitotic divisions, a feature that is highly correlated with disease outcome in
histological images [26]. Despite these recent advances, a number of challenges exist in this area
of research, most notably the integration of molecular and imaging data with other disparate types
of data such as electronic health records (EHRs).
Fundamental biological study
Deep learning can be applied to answer more fundamental biological questions; it is especially
suited to leveraging large amounts of data from high-throughput “omics” studies. One classic
biological problem where machine learning, and now deep learning, has been extensively applied
is molecular target prediction. For example, deep recurrent neural networks (RNNs) have been
used to predict gene targets of microRNAs [27], and CNNs have been applied to predict protein
residue-residue contacts and secondary structure [28–30]. Other recent exciting applications of
deep learning include recognition of functional genomic elements such as enhancers and
promoters [31–33] and prediction of the deleterious effects of nucleotide polymorphisms [34].
Treatment of patients
Although the application of deep learning to patient treatment is just beginning, we expect new
methods to recommend patient treatments, predict treatment outcomes, and guide the
development of new therapies. One type of effort in this area aims to identify drug targets and
interactions or predict drug response. Another uses deep learning on protein structures to predict
drug interactions and drug bioactivity [35]. Drug repositioning using deep learning on transcriptomic
data is another exciting area of research [36]. Restricted Boltzmann machines (RBMs) can be
combined into deep belief networks (DBNs) to predict novel drug-target interactions and formulate
drug repositioning hypotheses [37,38]. Finally, deep learning is also prioritizing chemicals in the
early stages of drug discovery for new targets [22].
Deep learning methods applied to a large corpus of patient phenotypes may provide a meaningful
and more data-driven approach to patient categorization. For example, they may identify new
shared mechanisms that would otherwise be obscured due to ad hoc historical definitions of
disease. Perhaps deep neural networks, by reevaluating data without the context of our
assumptions, can reveal novel classes of treatable conditions.
In spite of such optimism, the ability of deep learning models to indiscriminately extract predictive
signals must also be assessed and operationalized with care. Imagine a deep neural network is
provided with clinical test results gleaned from electronic health records. Because physicians may
order certain tests based on their suspected diagnosis, a deep neural network may learn to
“diagnose” patients simply based on the tests that are ordered. For some objective functions, such
as predicting an International Classification of Diseases (ICD) code, this may offer good
performance even though it does not provide insight into the underlying disease beyond physician
activity. This challenge is not unique to deep learning approaches; however, it is important for
practitioners to be aware of these challenges and the possibility in this domain of constructing
highly predictive classifiers of questionable actual utility.
Our goal in this section is to assess the extent to which deep learning is already contributing to the
discovery of novel categories. Where it is not, we focus on barriers to achieving these goals. We
also highlight approaches that researchers are taking to address challenges within the field,
particularly with regards to data availability and labeling.
Imaging applications in healthcare
Though there are many commonalities with the analysis of natural images, there are also key
differences. In all cases that we examined, fewer than one million images were available for
training, and datasets are often many orders of magnitude smaller than collections of natural
images. Researchers have developed subtask-specific strategies to address this challenge.
Data augmentation provides an effective strategy for working with small training sets. The practice
is exemplified by a series of papers that analyze images from mammographies [40–44]. To expand
the number and diversity of images, researchers constructed adversarial training examples [43].
Adversarial training examples are constructed by applying a transformation that changes training
images but not their content—for example by rotating an image by a random amount. An
alternative in the domain is to train towards human-created features before subsequent fine-tuning
[41], which can help to sidestep this challenge though it does give up deep learning techniques’
strength as feature constructors.
A second strategy repurposes features extracted from natural images by deep learning models,
such as ImageNet [45], for new purposes. Diagnosing diabetic retinopathy through color fundus
images became an area of focus for deep learning researchers after a large labeled image set was
made publicly available during a 2015 Kaggle competition [46]. Most participants trained neural
networks from scratch [46–48], but Gulshan et al. [49] repurposed a 48-layer Inception-v3 deep
architecture pre-trained on natural images and surpassed the state-of-the-art specificity and
sensitivity. Such features were also repurposed to detect melanoma, the deadliest form of skin
cancer, from dermoscopic [50,51] and non-dermoscopic images of skin lesions [4,52,53] as well as
age-related macular degeneration [54]. Pre-training on natural images can enable very deep
networks to succeed without overfitting. For the melanoma task, reported performance was
competitive with or better than a board of certified dermatologists [4,50]. Reusing features from
natural images is also an emerging approach for radiographic images, where datasets are often
too small to train large deep neural networks without these techniques [55–58]. A deep CNN
trained on natural images boosts performance in radiographic images [57]. However, the target
task required either re-training the initial model from scratch with special pre-processing or fine-
tuning of the whole network on radiographs with heavy data augmentation to avoid overfitting.
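A minimal PyTorch sketch of this transfer learning recipe (a generic outline, not the exact procedure of any study cited above) loads a CNN pre-trained on ImageNet, freezes its feature layers, and replaces the final classifier for a new two-class medical task:

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on natural images; newer torchvision versions use
# a `weights=` argument in place of `pretrained=True`.
model = models.resnet18(pretrained=True)

for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained feature layers

# Replace only the final layer for the new task (e.g. disease vs. no disease);
# alternatively, the whole network can be fine-tuned with a small learning rate.
model.fc = nn.Linear(model.fc.in_features, 2)
```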
The technique of reusing features from a different task falls into the broader area of transfer
learning (see Discussion). Though we’ve mentioned numerous successes for the transfer of natural
image features to new tasks, we expect that a lower proportion of negative results have been
published. The analysis of magnetic resonance images (MRIs) is also faced with the challenge of
small training sets. In this domain, Amit et al. [59] investigated the tradeoff between pre-trained
models from a different domain and a small CNN trained only with MRI images. In contrast with the
other selected literature, they found that a smaller network trained with data augmentation on a few
hundred images from a few dozen patients can outperform a pre-trained out-of-domain classifier.
Another way of dealing with limited training data is to divide rich data—e.g. 3D images—into
numerous reduced projections. Shin et al. [56] compared various deep network architectures,
dataset characteristics, and training procedures for computed tomography (CT)-based abnormality
detection. They concluded that networks as deep as 22 layers could be useful for 3D data, despite
the limited size of training datasets. However, they noted that the choice of architecture, parameter
settings, and degree of model fine-tuning needed are highly problem- and dataset-specific. Moreover, this type of
task often depends on both lesion localization and appearance, which poses challenges for CNN-
based approaches. Straightforward attempts to capture useful information from full-size images in
all three dimensions simultaneously via standard neural network architectures were
computationally unfeasible. Instead, two-dimensional models were used to either process image
slices individually (2D), or aggregate information from a number of 2D projections in the native
space (2.5D).
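The 2D/2.5D decomposition can be expressed in a few lines of numpy (the volume and voxel of interest are placeholders), stacking three orthogonal slices of a 3D scan into a multi-channel 2D input:

```python
import numpy as np

volume = np.random.rand(128, 128, 128)  # placeholder 3D scan
z, y, x = 64, 64, 64                    # voxel of interest

axial    = volume[z, :, :]   # one 2D slice per anatomical plane
coronal  = volume[:, y, :]
sagittal = volume[:, :, x]

# A "2.5D" input: three orthogonal planes stacked as channels of a 2D image,
# keeping some 3D context while remaining tractable for 2D architectures.
patch_2_5d = np.stack([axial, coronal, sagittal], axis=0)  # shape (3, 128, 128)
```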
Roth et al. compared 2D, 2.5D, and 3D CNNs on a number of tasks for computer-aided detection
from CT scans and showed that 2.5D CNNs performed comparably well to 3D analogs, while
requiring much less training time, especially on augmented training sets [60]. Another advantage of
2D and 2.5D networks is the wider availability of pre-trained models. But reducing the
dimensionality is not always helpful. Nie et al. [61] showed that multimodal, multi-channel 3D deep
architecture was successful at learning high-level brain tumor appearance features jointly from
MRI, functional MRI, and diffusion MRI images, outperforming single-modality or 2D models.
Overall, the variety of imaging modalities, the properties and sizes of training sets, the dimensionality
of inputs, and the importance of end goals in medical image analysis are driving the development of
specialized deep neural network architectures, training and validation protocols, and input
representations that are not characteristic of widely-studied natural images.
Predictions from deep neural networks can be evaluated for use in workflows that also incorporate
human experts. In a large dataset of mammography images, Kooi et al. [62] demonstrated that
deep neural networks outperform the traditional computer-aided diagnosis system at low sensitivity
and perform comparably at high sensitivity. They also compared network performance to certified
screening radiologists on a patch level and found no significant difference between the network
and the readers. However, using deep methods for clinical practice is challenged by the difficulty of
assigning a level of confidence to each prediction. Leibig et al. [48] estimated the uncertainty of
deep networks for diabetic retinopathy diagnosis by linking dropout networks with approximate
Bayesian inference. Techniques that assign confidences to each prediction should aid pathologist-
computer interactions and improve uptake by physicians.
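A minimal sketch of this idea (assuming PyTorch and an arbitrary toy network, not the architecture of Leibig et al.) keeps dropout active at test time and uses the spread of repeated stochastic predictions as a per-case confidence estimate:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(50, 1), nn.Sigmoid())
model.train()  # deliberately leave dropout ON for test-time sampling

x = torch.randn(1, 100)  # one patient's input features
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])

mean_prediction = samples.mean()  # point estimate of the diagnosis
uncertainty = samples.std()       # large spread = low-confidence prediction
```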
Systems to aid in the analysis of histology slides are also promising use cases for deep learning
[63]. Ciresan et al. [26] developed one of the earliest approaches for histology slides, winning the
2012 International Conference on Pattern Recognition’s Contest on Mitosis Detection while
achieving human-competitive accuracy. In more recent work, Wang et al. [64] analyzed stained
slides of lymph node slices to identify cancers. On this task a pathologist has about a 3% error
rate. The pathologist did not produce any false positives, but did have a number of false negatives.
The algorithm had about twice the error rate of a pathologist, but the errors were not strongly
correlated. In this area, these algorithms may be ready to be incorporated into existing tools to aid
pathologists and reduce the false negative rate. Ensembles of deep learning and human experts
may help overcome some of the challenges presented by data limitations.
One source of training examples with rich phenotypic annotations is the EHR. Billing information
in the form of ICD codes provides simple annotations, but phenotypic algorithms can combine laboratory
tests, medication prescriptions, and patient notes to generate more reliable phenotypes. Recently,
Lee et al. [65] developed an approach to distinguish individuals with age-related macular
degeneration from control individuals. They trained a deep neural network on approximately
100,000 images extracted from structured electronic health records, reaching greater than 93%
accuracy. The authors used their test set to evaluate when to stop training. In other domains, this
has resulted in a minimal change in the estimated accuracy [66], but we recommend the use of an
independent test set whenever feasible.
Rich clinical information is stored in EHRs. However, manually annotating a large set requires
experts and is time consuming. For chest X-ray studies, a radiologist usually spends a few minutes
per example. Generating the number of examples needed for deep learning is infeasibly
expensive. Instead, researchers may benefit from using text mining to generate annotations [67],
even if those annotations are of modest accuracy. Wang et al. [68] proposed to build predictive
deep neural network models through the use of images with weak labels. Such labels are
automatically generated and not verified by humans, so they may be noisy or incomplete. In this
case, they applied a series of natural language processing (NLP) techniques to the associated
chest X-ray radiological reports. They first extracted all diseases mentioned in the reports using a
state-of-the-art NLP tool, then applied a new method, NegBio [69], to filter negative and equivocal
findings in the reports. Evaluation on four independent datasets demonstrated that NegBio is highly
accurate for detecting negative and equivocal findings (~90% in F₁ score, which balances precision
and recall [70]). The resulting dataset [71] consisted of 112,120 frontal-view chest X-ray images
from 30,805 patients, and each image was associated with one or more text-mined (weakly-
labeled) pathology categories (e.g. pneumonia and cardiomegaly) or “no finding” otherwise.
Further, Wang et al. [68] used this dataset with a unified weakly-supervised multi-label image
classification framework, to detect common thoracic diseases. It showed superior performance
over a benchmark using fully-labeled data.
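For reference, the F1 score is the harmonic mean of precision and recall; a short Python sketch with illustrative counts:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 90 correct findings, 10 spurious, 10 missed
# gives precision = recall = 0.9, and therefore F1 = 0.9.
print(f1_score(90, 10, 10))  # 0.9
```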
With the exception of natural image-like problems (e.g. melanoma detection), biomedical imaging
poses a number of challenges for deep learning. Datasets are typically small, annotations can be
sparse, and images are often high-dimensional, multimodal, and multi-channel. Techniques like
transfer learning, heavy dataset augmentation, and the use of multi-view and multi-stream
architectures are more common than in the natural image domain. Furthermore, high model
sensitivity and specificity can translate directly into clinical value. Thus, prediction evaluation,
uncertainty estimation, and model interpretation methods are also of great importance in this
domain (see Discussion). Finally, there is a need for better pathologist-computer interaction
techniques that will allow combining the power of deep learning methods with human expertise and
lead to better-informed decisions for patient treatment and care.
Text applications
Figure 2: Deep learning applications, tasks, and models based on NLP perspectives.
Named entity recognition (NER) is the task of identifying text spans that refer to a biological concept
of a specific class, such as disease or chemical, in a controlled vocabulary or ontology. NER is
often needed as a first step in many complex text mining systems. The current state-of-the-art
methods typically reformulate the task as a sequence labeling problem and use conditional random
fields [72–74]. In recent years, word embeddings that contain rich latent semantic information of
words have been widely used to improve the NER performance. Liu et al. studied the effect of word
embeddings on drug name recognition and compared them with traditional semantic features [75].
Tang et al. investigated word embeddings in gene, DNA, and cell line mention detection tasks [76].
Moreover, Wu et al. examined the use of neural word embeddings for clinical abbreviation
disambiguation [77]. Liu et al. exploited task-oriented resources to learn word embeddings for
clinical abbreviation expansion [78].
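A minimal PyTorch sketch of NER as sequence labeling (vocabulary, tag set, and layer sizes are placeholders; state-of-the-art systems typically add a conditional random field output layer on top of such a network):

```python
import torch
import torch.nn as nn

# Each token receives a tag such as B-Disease, I-Disease, or O. Word
# embeddings feed a bidirectional LSTM whose per-token states are
# projected onto tag scores.
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden=64, n_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.tag = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):  # (batch, sequence_length)
        states, _ = self.lstm(self.embed(token_ids))
        return self.tag(states)    # one tag distribution per token

tagger = BiLSTMTagger()
sentence = torch.randint(0, 5000, (1, 12))  # 12 placeholder token ids
tag_scores = tagger(sentence)               # shape (1, 12, 3)
```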
Relation extraction involves detecting and classifying semantic relationships between entities from
the literature. At present, kernel methods or feature-based approaches are commonly applied [79–
81]. Deep learning can relieve the feature sparsity and engineering problems. Some studies
focused on jointly extracting biomedical entities and relations [82,83], while others
applied deep learning on relation classification given the relevant entities. For example, both
multichannel dependency-based CNNs [84] and shortest path-based CNNs [85,86] are well-suited
for sentence-based protein-protein interaction extraction. Jiang et al. proposed a biomedical domain-specific
word embedding model to reduce the manual labor of designing semantic representation for the
same task [87]. Gu et al. employed a maximum entropy model and a CNN model for chemical-
induced disease relation extraction at the inter- and intra-sentence level, respectively [88]. For
drug-drug interaction, Zhao et al. used a CNN that employs word embeddings with the syntactic
information of a sentence as well as features of part-of-speech tags and dependency trees [89].
Asada et al. experimented with an attention CNN [90], and Yi et al. with an RNN model with multiple
attention layers [91]. In both cases, a single model with an attention mechanism allows the
network to focus on different parts of the source sentence, removing the need for dependency
parsing or training multiple models. The attention CNN and RNN achieve comparable
results, but the CNN model has the advantage that it can easily be computed in parallel,
making it faster on recent graphics processing units (GPUs).
For biotope event extraction, Li et al. employed CNNs and distributed representations [92] while
Mehryary et al. used long short-term memory (LSTM) networks to extract complicated relations
[93]. Li et al. applied word embedding to extract complete events from biomedical text and
achieved results comparable to the state-of-the-art systems [94]. There are also approaches that
identify event triggers rather than the complete event [95,96]. Taken together, deep learning
models outperform traditional kernel methods or feature-based approaches by 1–5% in F1 score.
Among the various deep learning approaches, CNNs stand out as the most popular model in terms of
both computational complexity and performance, while RNNs have achieved continuous
progress.
Information retrieval is the task of finding relevant text that satisfies an information need from within a
large document collection. While deep learning has not yet achieved the same level of success in
this area as seen in others, the recent surge of interest and work suggest that this may be quickly
changing. For example, Mohan et al. described a deep learning approach to modeling the
relevance of a document’s text to a query, which they applied to the entire biomedical literature
[97].
To summarize, deep learning has shown promising results in many biomedical text mining tasks
and applications. But to realize its full potential in this domain, either large amounts of labeled data
or technical advances in methods that cope with limited labeled data are required.
Electronic health records
Clinical text mining systems have traditionally needed to implement domain-specific features [99]. These features capture unique aspects of the literature
being processed. Deep learning methods are natural feature constructors. In recent work, the
authors evaluated the extent to which deep learning methods could be applied on top of generic
features for domain-specific concept extraction [100]. They found that performance was in line with,
but lower than the best domain-specific method [100]. This raises the possibility that deep learning
may impact the field by reducing the researcher time and cost required to develop specific
solutions, but it may not always lead to performance increases.
In recent work, Yoon et al. [101] analyzed simple features using deep neural networks and found
that the patterns recognized by the algorithms could be re-used across tasks. Their aim was to
analyze the free text portions of pathology reports to identify the primary site and laterality of
tumors. The only features the authors supplied to the algorithms were unigrams (counts for single
words) and bigrams (counts for two-word combinations) in a free text document. They subset the
full set of words and word combinations to the 400 most common. The machine learning algorithms
that they employed (naïve Bayes, logistic regression, and deep neural networks) all performed
relatively similarly on the task of identifying the primary site. However, when the authors evaluated
the more challenging task, evaluating the laterality of each tumor, the deep neural network
outperformed the other methods. Of particular interest, when the authors first trained a neural
network to predict primary site and then repurposed those features as a component of a secondary
neural network trained to predict laterality, the performance was higher than a laterality-trained
neural network. This demonstrates how deep learning methods can repurpose features across
tasks, improving overall predictions as the field tackles new challenges. The Discussion further
reviews this type of transfer learning.
Several authors have created reusable feature sets for medical terminologies using natural
language processing and neural embedding models, as popularized by word2vec [102]. Minarro-
Giménez et al. [103] applied the word2vec deep learning toolkit to medical corpora and evaluated
the efficiency of word2vec in identifying properties of pharmaceuticals based on mid-sized,
unstructured medical text corpora without any additional background knowledge. A goal of learning
terminologies for different entities in the same vector space is to find relationships between
different domains (e.g. drugs and the diseases they treat). It is difficult for us to provide a strong
statement on the broad utility of these methods. Manuscripts in this area tend to compare
algorithms applied to the same data but lack a comparison against overall best-practices for one or
more tasks addressed by these methods. Techniques have been developed for free text medical
notes [104], ICD and National Drug Codes [105,106], and claims data [107]. Methods for neural
embeddings learned from electronic health records have at least some ability to predict disease-
disease associations and implicate genes with a statistical association with a disease [108], but the
evaluations performed did not differentiate between simple predictions (i.e. the same disease in
different sites of the body) and non-intuitive ones. Jagannatha and Yu [109] further employed a
bidirectional LSTM structure to extract adverse drug events from electronic health records, and Lin
et al. [110] investigated using CNN to extract temporal relations. While promising, a lack of rigorous
evaluations of the real-world utility of these kinds of features makes current contributions in this
area difficult to evaluate. Comparisons need to be performed to examine the true utility against
leading approaches (i.e. algorithms and data) as opposed to simply evaluating multiple algorithms
on the same potentially limited dataset.
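As a small illustration of this family of methods, the sketch below uses the gensim implementation of word2vec (the toy corpus and query term are placeholders; `vector_size` is the parameter name in gensim 4 and later) to learn term vectors and query for related terms:

```python
from gensim.models import Word2Vec

# Tokenized sentences stand in for a corpus of clinical or biomedical text.
corpus = [["warfarin", "reduces", "clotting"],
          ["heparin", "prevents", "clotting"]]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)

# Terms used in similar contexts end up near one another in the vector
# space, which is the basis for relating drugs, diseases, and other entities.
print(model.wv.most_similar("warfarin", topn=3))
```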
Identifying consistent subgroups of individuals and individual health trajectories from clinical tests is
also an active area of research. Approaches inspired by deep learning have been used for both
unsupervised feature construction and supervised prediction. Early work by Lasko et al. [111]
combined sparse autoencoders and Gaussian processes to distinguish gout from leukemia based on
uric acid sequences. Later work showed that unsupervised feature construction of many features
via denoising autoencoder neural networks could dramatically reduce the number of labeled
examples required for subsequent supervised analyses [112]. In addition, it pointed towards
features learned during unsupervised training being useful for visualizing and stratifying subgroups
of patients within a single disease. In a concurrent large-scale analysis of EHR data from 700,000
patients, Miotto et al. [113] used a deep denoising autoencoder architecture applied to the number
and co-occurrence of clinical events to learn a representation of patients (DeepPatient). The model
was able to predict disease trajectories within one year with over 90% accuracy and patient-level
predictions were improved by up to 15% when compared to other methods. Choi et al. [114]
attempted to model the longitudinal structure of EHRs with an RNN to predict future diagnoses and
medication prescriptions on a cohort of 260,000 patients followed for 8 years (Doctor AI). Pham et
al. [115] built upon this concept by using an RNN with an LSTM architecture, enabling explicit
modelling of patient trajectories through the use of memory cells. The method, DeepCare,
performed better than shallow models or a plain RNN when tested on two independent cohorts for its
ability to predict disease progression, recommend interventions, and assess future risk.
Nguyen et al. [116] took a different approach and used word embeddings from EHRs to train a
CNN that could detect and pool local clinical motifs to predict unplanned readmission after six
months, with performance better than the baseline method (Deepr). Razavian et al. [117] used a
set of 18 common lab tests to predict disease onset using both CNN and LSTM architectures and
demonstrated an improvement over baseline regression models. However, numerous challenges
including data integration (patient demographics, family history, laboratory tests, text-based patient
records, image analysis, genomic data) and better handling of streaming temporal data with many
features, will need to be overcome before we can fully assess the potential of deep learning for this
application area.
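A minimal sketch of the denoising autoencoder idea used in several of these studies (assuming PyTorch; the binary event matrix is synthetic and all sizes are placeholders): the input is corrupted by randomly dropping recorded events, and the network is trained to reconstruct the original record:

```python
import torch
import torch.nn as nn

n_events = 1000  # placeholder vocabulary of clinical events
encoder = nn.Sequential(nn.Linear(n_events, 100), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(100, n_events), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) +
                             list(decoder.parameters()))

# Synthetic patient-by-event matrix; each row is one patient's record.
patients = (torch.rand(32, n_events) < 0.05).float()
noisy = patients * (torch.rand_like(patients) > 0.2)  # drop ~20% of events

optimizer.zero_grad()
reconstruction = decoder(encoder(noisy))
loss = nn.functional.binary_cross_entropy(reconstruction, patients)
loss.backward()
optimizer.step()

# The 100-dimensional hidden layer becomes a patient representation that
# can be used for clustering or downstream supervised prediction.
patient_representation = encoder(patients)
```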
Still, recent work has also revealed domains in which deep networks have proven superior to
traditional methods. Survival analysis models the time leading to an event of interest from a shared
starting point, and in the context of EHR data, often associates these events to subject covariates.
Exploring this relationship is difficult, however, given that EHR data types are often heterogeneous,
covariates are often missing, and conventional approaches require that the covariate-event relationship
be linear and aligned to a specific starting point [118]. Early approaches, such as the Faraggi-
Simon feed-forward network, aimed to relax the linearity assumption, but performance gains were
lacking [119]. Katzman et al. in turn developed a deep implementation of the Faraggi-Simon
network that, in addition to outperforming Cox regression, was capable of comparing the risk
between a given pair of treatments, thus potentially acting as a recommender system [120]. To
overcome the remaining difficulties, researchers have turned to deep exponential families, a class
of latent generative models that are constructed from any type of exponential family distributions
[121]. The result was a deep survival analysis model capable of overcoming challenges posed by
missing data and heterogeneous data types, while uncovering nonlinear relationships between
covariates and failure time. They showed their model more accurately stratified patients as a
function of disease risk score compared to the current clinical implementation.
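The core of these survival models is the Cox partial likelihood; the sketch below (assuming PyTorch and no tied event times) computes its negative log as a training loss from network-produced risk scores. With a linear network this reduces to standard Cox regression; deeper networks relax the linearity assumption:

```python
import torch

def cox_partial_likelihood_loss(risk_scores, times, events):
    """Negative log partial likelihood for proportional-hazards models.

    risk_scores: network output h(x) per subject; times: follow-up times;
    events: 1 if the event was observed, 0 if censored. Assumes no ties.
    """
    order = torch.argsort(times, descending=True)  # sort so that each
    risk = risk_scores[order]                      # subject's risk set is a
    events = events[order]                         # prefix of the array
    log_risk_set = torch.logcumsumexp(risk, dim=0)
    # Sum log(exp(h_i) / sum_{j in risk set} exp(h_j)) over observed events.
    return -((risk - log_risk_set) * events).sum() / events.sum()
```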
There is a computational cost for these methods, however, when compared to traditional, non-
neural network approaches. For the exponential family models, despite their scalability [122], an
important question for the investigator is whether he or she is interested in estimates of posterior
uncertainty. Given that these models are effectively Bayesian neural networks, much of their utility
simplifies to whether a Bayesian approach is warranted for a given increase in computational cost.
Moreover, as with all variational methods, future work must continue to explore just how well the
posterior distributions are approximated, especially as model complexity increases [123].
A dearth of true labels is perhaps among the biggest obstacles for EHR-based analyses that
employ machine learning. Popular deep learning (and other machine learning) methods are often
used to tackle classification tasks and thus require ground-truth labels for training. For EHRs this
can mean that researchers must hire multiple clinicians to manually read and annotate individual
patients’ records through a process called chart review. This allows researchers to assign “true”
labels, i.e. those that match our best available knowledge. Depending on the application,
sometimes the features constructed by algorithms also need to be manually validated and
interpreted by clinicians. This can be time consuming and expensive [124]. Because of these costs,
much of this research, including the work cited in this review, skips the process of expert review.
Clinicians’ skepticism for research without expert review may greatly dampen their enthusiasm for
the work and consequently reduce its impact. To date, even well-resourced large national consortia
have been challenged by the task of acquiring enough expert-validated labeled data. For instance,
in the eMERGE consortia and PheKB database [125], most samples with expert validation contain
only 100 to 300 patients. These datasets are quite small even for simple machine learning
algorithms. The challenge is greater for deep learning models with many parameters. While
unsupervised and semi-supervised approaches can help with small sample sizes, the field would
benefit greatly from large collections of anonymized records in which a substantial number of
records have undergone expert review. This challenge is not unique to EHR-based studies. Work
on medical images, omics data in applications for which detailed metadata are required, and other
applications for which labels are costly to obtain will be hampered as long as abundant curated
data are unavailable.
Successful approaches to date in this domain have sidestepped this challenge by making
methodological choices that either reduce the need for labeled examples or that use
transformations to training data to increase the number of times it can be used before overfitting
occurs. For example, the unsupervised and semi-supervised methods that we have discussed
reduce the need for labeled examples [112]. The anchor and learn framework [126] uses expert
knowledge to identify high-confidence observations from which labels can be inferred. The
strategies of adversarial training mentioned above can reduce overfitting, if transformations are
available that preserve the meaningful content of the data while transforming irrelevant features
[43]. While adversarial training examples can be easily imagined for certain methods that operate
on images, it is more challenging to figure out what an equivalent transformation would be for a
patient’s clinical test results. Consequently, it may be hard to employ adversarial training examples
with other applications. Finally, approaches that transfer features can also help use valuable
training data most efficiently. Rajkomar et al. trained a deep neural network using generic images
before tuning using only radiology images [57]. Datasets that require many of the same types of
features might be used for initial training, before fine tuning takes place with the more sparse
biomedical examples. Though the analysis has not yet been attempted, it is possible that
analogous strategies may be possible with electronic health records. For example, features learned
from the electronic health record for one type of clinical test (e.g. a decrease over time in a lab
value) may transfer across phenotypes. Methods to accomplish more with little high-quality labeled
data arose in other domains and may also be adapted to this challenge, e.g. data programming
[127], in which noisy automated labeling functions are integrated, as sketched below.
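A minimal Python sketch of data programming (the labeling heuristics are hypothetical, and a simple majority vote stands in for the generative label model used in full frameworks):

```python
ABSTAIN, NEGATIVE, POSITIVE = None, 0, 1

# Hypothetical labeling functions: cheap, noisy heuristics that vote on
# whether a record reflects a phenotype (here, type 2 diabetes).
def lf_icd_code(record):
    return POSITIVE if "250.00" in record["icd"] else ABSTAIN

def lf_medication(record):
    return POSITIVE if "metformin" in record["meds"] else ABSTAIN

def lf_note_negation(record):
    return NEGATIVE if "no diabetes" in record["note"] else ABSTAIN

def weak_label(record, lfs=(lf_icd_code, lf_medication, lf_note_negation)):
    votes = [v for v in (lf(record) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if sum(votes) >= len(votes) / 2 else NEGATIVE

record = {"icd": ["250.00"], "meds": ["metformin"], "note": "stable"}
print(weak_label(record))  # 1: weakly labeled as a diabetes case
```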
Numerous commentators have described data as the new oil [128,129]. The idea behind this
metaphor is that data are available in large quantities, valuable once refined, and this underlying
resource will enable a data-driven revolution in how work is done. Contrasting with this perspective,
Ratner, Bach, and Ré described labeled training data, instead of data, as “The New New Oil” [130].
In this framing, data are abundant and not a scarce resource. Instead, new approaches to solving
problems arise when labeled training data become sufficient to enable them. Based on our review
of research on deep learning methods to categorize disease, the latter framing rings true.
We expect improved methods for domains with limited data to play an important role if deep
learning is going to transform how we categorize states of human health. We don’t expect that
deep learning methods will replace expert review. We expect them to complement expert review by
allowing more efficient use of the costly practice of manual annotation.
To construct the types of very large datasets that deep learning methods thrive on, we need robust
sharing of large collections of data. This is in part a cultural challenge. We touch on this challenge
in Discussion. Beyond the cultural hurdles around data sharing, there are also technological and
legal hurdles related to sharing individual health records or deep models built from such records.
This subsection deals primarily with these challenges.
EHRs are designed chiefly for clinical, administrative and financial purposes, such as patient care,
insurance and billing [131]. Science is at best a tertiary priority, presenting challenges to EHR-
based research in general and to deep learning research in particular. Although there is significant
work in the literature around EHR data quality and the impact on research [132], we focus on three
types of challenges: local bias, wider standards, and legal issues. Note these problems are not
restricted to EHRs but can also apply to any large biomedical dataset, e.g. clinical trial data.
Even within the same healthcare system, EHRs can be used differently [133,134]. Individual users
have unique documentation and ordering patterns, with different departments and different
hospitals having different priorities that code patients and introduce missing data in a non-random
fashion [135]. Patient data may be kept across several “silos” within a single health system
(e.g. separate nursing documentation, registries, etc.). Even the most basic task of matching
patients across systems can be challenging due to data entry issues [136]. The situation is further
exacerbated by the ongoing introduction, evolution, and migration of EHR systems, especially
where reorganized and acquired healthcare facilities have to merge. Further, even the ostensibly
least-biased data type, laboratory measurements, can be biased by both the healthcare
process and patient health state [137]. As a result, EHR data can be less complete and less
objective than expected.
In the wider picture, standards for EHRs are numerous and evolving. Proprietary systems,
indifferent and scattered use of health information standards, and controlled terminologies make
combining and comparing data across systems challenging [138]. Further diversity arises from
variation in languages, healthcare practices, and demographics. Merging EHRs gathered in
different systems (and even under different assumptions) is challenging [139].
Combining or replicating studies across systems thus requires controlling for both the above biases
and dealing with mismatching standards. This has the practical effect of reducing cohort size,
limiting statistical significance, preventing the detection of weak effects [140], and restricting the
number of parameters that can be trained in a model. Further, rules-based algorithms have been
popular in EHR-based research, but because these are developed at a single institution and
trained with a specific patient population, they do not transfer easily to other healthcare systems
[141]. Genetic studies using EHR data are subject to even more bias, as the differences in
population ancestry across health centers (e.g. proportion of patients with African or Asian
ancestry) can affect algorithm performance. For example, Wiley et al. [142] showed that warfarin
dosing algorithms often under-perform in African Americans, illustrating that some of these issues
are unresolved even at a treatment best practices level. Lack of standardization also makes it
challenging for investigators skilled in deep learning to enter the field, as numerous data
processing steps must be performed before algorithms are applied.
Finally, even if data were perfectly consistent and compatible across systems, attempts to share
and combine EHR data face considerable legal and ethical barriers. Patient privacy can severely
restrict the sharing and use of EHR data [143]. Here again, standards are heterogeneous and
evolving, but EHR data often cannot be exported or even accessed directly for research
purposes without appropriate consent. In the United States, research use of EHR data is subject
both to the Common Rule and the Health Insurance Portability and Accountability Act (HIPAA).
Ambiguity in the regulatory language and individual interpretation of these rules can hamper use of
EHR data [144]. Once again, this has the effect of making data gathering more laborious and
expensive, reducing sample size and study power.
Several technological solutions have been proposed in this direction, allowing access to sensitive
data while satisfying privacy and legal concerns. Software like DataShield [145] and ViPAR [146],
although not EHR-specific, allow querying and combining of datasets and calculation of summary
statistics across remote sites by “taking the analysis to the data”. The computation is carried out at
the remote site. Conversely, the EH4CR project [138] allows analysis of private data by use of an
intermediation layer that interprets remote queries across internal formats and datastores and
returns the results in a de-identified standard form, thus giving real-time consistent but secure
access. Continuous Analysis [147] can allow reproducible computing on private data. Using such
techniques, intermediate results can be automatically tracked and shared without sharing the
original data. While none of these have been used in deep learning, the potential is there.
Even without sharing data, algorithms trained on confidential patient data may present security
risks or accidentally allow for the exposure of individual-level patient data. Tramer et al. [148]
showed the ability to steal trained models via public application programming interfaces (APIs).
Dwork and Roth [149] demonstrated the ability to expose individual-level information from accurate
answers in a machine learning model. Attackers can use similar attacks to find out if a particular
data instance was present in the original training set for the machine learning model [150], in this
case, whether a person's record was present. To protect against these attacks, Simmons et al.
[151] developed the ability to perform genome-wide association studies (GWASs) in a differentially
private manner, and Abadi et al. [152] showed the ability to train deep learning classifiers under the
differential privacy framework.
These attacks also present a potential hazard for approaches that aim to generate data. Choi et
al. propose generative adversarial networks (GANs) as a tool to make sharable EHR data
[153], and Esteban et al. [154] showed that recurrent GANs could be used for time series data.
In both cases, however, the authors did not take steps to protect the models from such attacks.
There are approaches to protect models, but they pose their own challenges. Training in a
differentially private manner provides a limited guarantee that an algorithm’s output will be almost
equally likely to occur regardless of the participation of any one individual. The strength of this
guarantee is set by parameters that quantify the privacy afforded. Beaulieu-Jones et al. demonstrated the ability
to generate data that preserved properties of the SPRINT clinical trial with GANs under the
differential privacy framework [155]. Both Beaulieu-Jones et al. and Esteban et al. trained models on
synthetic data generated under differential privacy and observed transfer learning performance
only slightly below that of models trained on the original, real data. Taken together, these results
suggest that differentially private GANs may be an attractive way to generate sharable datasets
for downstream reanalysis.
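To make the training-time protection concrete, the following is a minimal sketch of the core mechanism behind differentially private training procedures such as DP-SGD [152]: per-example gradient clipping plus calibrated Gaussian noise. The logistic-regression setup, parameter values, and function names are our illustration, not the published implementations; a real deployment must also account for the cumulative privacy loss (ε) over many steps.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1,
                rng=np.random.default_rng(0)):
    """One differentially private update for logistic regression (a toy
    stand-in for a deep network). Each example's gradient is clipped to
    L2 norm `clip`; Gaussian noise scaled by `noise_mult * clip` is added
    to the summed gradient, bounding any single record's influence."""
    grads = []
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))           # predicted probability
        g = (p - yi) * xi                            # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip)   # clip to norm <= clip
        grads.append(g)
    noisy = np.sum(grads, axis=0) + rng.normal(0.0, noise_mult * clip, size=w.shape)
    return w - lr * noisy / len(X)

# Toy usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
for _ in range(200):
    w = dp_sgd_step(w, X, y)
```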
Federated learning [156] and secure aggregation [157] are complementary approaches that
reinforce differential privacy. Both aim to maintain privacy by training deep learning models from
decentralized data sources, such as personal mobile devices, without transferring actual training
instances. This is increasingly important given the rapid growth of mobile health
applications. However, the training process in these approaches places constraints on the
algorithms used and can make fitting a model substantially more challenging. It can be trivial to
train a model without differential privacy, but quite difficult to train one within the differential privacy
framework [155]. This problem can be particularly pronounced with small sample sizes.
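The federated averaging idea can be sketched in a few lines: each site fits the shared model on its local data, and only the resulting parameters travel to a central server, which averages them. The toy least-squares task and function names below are our own; production systems add secure aggregation [157] so the server never sees individual site updates.

```python
import numpy as np

def local_update(w, X, y, lr=0.01, epochs=5):
    """A few steps of least-squares gradient descent on one site's data."""
    for _ in range(epochs):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def federated_round(w_global, sites):
    """One FedAvg-style round: sites train locally, the server averages the
    returned weights (weighted by sample count). Raw data never leaves a site."""
    updates, sizes = [], []
    for X, y in sites:
        updates.append(local_update(w_global.copy(), X, y))
        sizes.append(len(y))
    sizes = np.array(sizes, dtype=float)
    return np.average(updates, axis=0, weights=sizes / sizes.sum())

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):                      # three hospitals with private data
    X = rng.normal(size=(50, 2))
    sites.append((X, X @ true_w + 0.1 * rng.normal(size=50)))
w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, sites)
```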
While none of these problems are insurmountable or restricted to deep learning, they present
challenges that cannot be ignored. Technical evolution in EHRs and data standards will doubtless
ease—although not solve—the problems of data sharing and merging. More problematic are the
privacy issues. Those applying deep learning to the domain should consider the potential of
inadvertently disclosing the participants’ identities. Techniques that enable training on data without
sharing the raw data may have a part to play. Training within a differential privacy framework may
often be warranted.
In April 2016, the European Union adopted new rules regarding the use of personal information,
the General Data Protection Regulation [159]. A component of these rules can be summed up by
the phrase “right to an explanation”. Those who use machine learning algorithms must be able to
explain how a decision was reached. For example, a clinician treating a patient who is aided by a
machine learning algorithm may be expected to explain decisions that use the patient’s data. The
new rules were designed to target categorization or recommendation systems, which inherently
profile individuals. Such systems can do so in ways that are discriminatory and unlawful.
As datasets become larger and more complex, we may begin to identify relationships in data that
are important for human health but difficult to understand. The algorithms described in this review
and others like them may become highly accurate and useful for various purposes, including within
medical practice. However, to discover and avoid discriminatory applications it will be important to
consider interpretability alongside accuracy. A number of properties of genomic and healthcare
data will make this difficult.
First, research samples are frequently non-representative of the general population of interest; they
tend to be disproportionately sick [160], male [161], and European in ancestry [162]. One well-
known consequence of these biases in genomics is that penetrance is consistently lower in the
general population than would be implied by case-control data, as reviewed in [160]. Moreover,
real genetic associations found in one population may not hold in other populations with different
patterns of linkage disequilibrium (even when population stratification is explicitly controlled for
[163]). As a result, many genomic findings are of limited value for people of non-European ancestry
[162] and may even lead to worse treatment outcomes for them. Methods have been developed for
mitigating some of these problems in genomic studies [160,163], but it is not clear how easily they
can be adapted for deep models that are designed specifically to extract subtle effects from high-
dimensional data. For example, differences in the equipment that tended to be used for cases
versus controls have led to spurious genetic findings (e.g. Sebastiani et al.’s retraction [164]). In
some contexts, it may not be possible to correct for all such differences to the degree that a
deep network cannot exploit them. Moreover, the complexity of deep networks makes it difficult to
determine when their predictions are likely to be based on such nominally-irrelevant features of the
data (called “leakage” in other fields [165]). When we are not careful with our data and models, we
may inadvertently say more about the way the data was collected (which may involve a history of
unequal access and discrimination) than about anything of scientific or predictive value. This fact
can undermine the privacy of patient data [165] or lead to severe discriminatory consequences
[166].
There is a small but growing literature on the prevention and mitigation of data leakage [165], as
well as a closely-related literature on discriminatory model behavior [167], but it remains difficult to
predict when these problems will arise, how to diagnose them, and how to resolve them in practice.
There is even disagreement about which kinds of algorithmic outcomes should be considered
discriminatory [168]. Despite the difficulties and uncertainties, machine learning practitioners (and
particularly those who use deep neural networks, which are challenging to interpret) must remain
cognizant of these dangers and make every effort to prevent harm from discriminatory predictions.
To reach their potential in this domain, deep learning methods will need to be interpretable.
Researchers need to consider the extent to which biases may be learned by the model and
whether a model is sufficiently interpretable to identify bias; we discuss the challenge of model
interpretability more thoroughly in the Discussion.
Longitudinal analysis follows a population across time, for example, prospectively from birth or from
the onset of particular conditions. In large patient populations, longitudinal analyses such as the
Framingham Heart Study [169] and the Avon Longitudinal Study of Parents and Children [170]
have yielded important discoveries about the development of disease and the factors contributing
to health status. Yet, a common practice in EHR-based research is to take a snapshot at a point in
time and convert patient data to a traditional vector for machine learning and statistical analysis.
This results in loss of information as timing and order of events can provide insight into a patient’s
disease and treatment [171]. Efforts to model sequences of events have shown promise [172] but
require exceedingly large patient cohorts because events are bucketed into discrete combinations.
Lasko et al. [111] used autoencoders on longitudinal sequences of serum uric acid measurements
to identify population subtypes. More recently, deep learning has shown promise in modeling event
sequences with CNNs [173] and in incorporating past and current state with RNNs and LSTMs [115].
This may be a particular area of opportunity for deep neural networks. The ability to recognize
relevant sequences of events from a large number of trajectories requires powerful and flexible
feature construction methods—an area in which deep neural networks excel.
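As a sketch of how such sequence models are typically set up (a hypothetical illustration, not the architecture of [115] or [173]), discrete clinical events can be embedded and summarized by an LSTM whose final hidden state feeds a patient-level prediction:

```python
import torch
import torch.nn as nn

class EventSequenceModel(nn.Module):
    """Embed discrete clinical events, run an LSTM over them in time order,
    and predict a patient-level outcome from the final hidden state."""
    def __init__(self, n_event_types=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_event_types, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, event_ids):                 # (batch, seq_len) int codes
        x = self.embed(event_ids)                 # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)                # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_n[-1]))   # outcome probability

model = EventSequenceModel()
fake_visits = torch.randint(0, 1000, (8, 20))     # 8 patients, 20 events each
print(model(fake_visits).shape)                   # torch.Size([8, 1])
```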
Progress has been rapid in genomics and imaging, fields where important tasks are readily
adapted to well-established deep learning paradigms. One-dimensional convolutional and
recurrent neural networks are well-suited for tasks related to DNA- and RNA-binding proteins,
epigenomics, and RNA splicing. Two-dimensional CNNs are ideal for segmentation, feature
extraction, and classification in fluorescence microscopy images [16]. Other areas, such as cellular
signaling, are biologically important but have been studied less frequently to date, with some exceptions [176].
This may be a consequence of data limitations or greater challenges in adapting neural network
architectures to the available data. Here, we highlight several areas of investigation and assess
how deep learning might move these fields forward.
Gene expression
Gene expression technologies characterize the abundance of many thousands of RNA transcripts
within a given organism, tissue, or cell. This characterization can represent the underlying state of
the given system and can be used to study heterogeneity across samples as well as how the
system reacts to perturbation. While gene expression measurements were traditionally made by
quantitative polymerase chain reaction (qPCR), low-throughput fluorescence-based methods, and
microarray technologies, the field has shifted in recent years to primarily performing RNA
sequencing (RNA-seq) to catalog whole transcriptomes. As RNA-seq continues to fall in price and
rise in throughput, sample sizes will increase and training deep models to study gene expression
will become even more useful.
Already several deep learning approaches have been applied to gene expression data with varying
aims. For instance, many researchers have applied unsupervised deep learning models to extract
meaningful representations of gene modules or sample clusters. Denoising autoencoders have
been used to cluster yeast expression microarrays into known modules representing cell cycle
processes [177] and to stratify yeast strains based on chemical and mutational perturbations [178].
Shallow (one hidden layer) denoising autoencoders have also been fruitful in extracting biological
insight from thousands of Pseudomonas aeruginosa experiments [179,180] and in aggregating
features relevant to specific breast cancer subtypes [25]. These unsupervised approaches applied
to gene expression data are powerful methods for identifying gene signatures that may otherwise
be overlooked. An additional benefit of unsupervised approaches is that ground truth labels, which
are often difficult to acquire or are incorrect, are nonessential. However, the genes that have been
aggregated into features must be interpreted carefully. Attributing each node to a single specific
biological function risks over-interpreting models. Batch effects could cause models to discover
non-biological features, and downstream analyses should take this into consideration.
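In outline, the denoising autoencoders used in this line of work share a simple recipe: corrupt each expression profile, reconstruct the original, and inspect the learned hidden features. The sketch below is a generic one-hidden-layer version with illustrative dimensions, not the exact models of [177,179]:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One-hidden-layer denoising autoencoder: corrupt the expression
    vector, then reconstruct the original; the hidden layer learns
    compressed features that often align with gene modules."""
    def __init__(self, n_genes=5000, n_hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_genes), nn.Sigmoid())

    def forward(self, x, corruption=0.1):
        noisy = x * (torch.rand_like(x) > corruption)  # randomly zero inputs
        hidden = self.encoder(noisy)
        return self.decoder(hidden), hidden

model = DenoisingAutoencoder()
x = torch.rand(32, 5000)                  # 32 samples, scaled to [0, 1]
recon, features = model(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruct the uncorrupted input
```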
Deep learning approaches are also being applied to gene expression prediction tasks. For
example, a deep neural network with three hidden layers outperformed linear regression in
inferring the expression of over 20,000 target genes based on a representative, well-connected set
of about 1,000 landmark genes [181]. However, while the deep learning model outperformed
existing algorithms in nearly every scenario, its absolute performance remained modest. The study
was also limited by computational bottlenecks that required the target genes to be split randomly
across two distinct models trained separately. It is unclear how much performance would have
improved without these computational restrictions.
Epigenomic data, combined with deep learning, may have sufficient explanatory power to infer
gene expression. For instance, the DeepChrome CNN [182] improved prediction accuracy of high
or low gene expression from histone modifications over existing methods. AttentiveChrome [183]
added a deep attention model to further enhance DeepChrome. Deep learning can also integrate
different data types. For example, Liang et al. combined RBMs to integrate gene expression, DNA
methylation, and miRNA data to define ovarian cancer subtypes [184]. While these approaches are
promising, many convert gene expression measurements to categorical or binary variables, thus
discarding the complex gene expression signatures present in intermediate and relative expression levels.
Deep learning applied to gene expression data is still in its infancy, but the future is bright. Many
previously untestable hypotheses can now be interrogated as deep learning enables analysis of
increasing amounts of data generated by new technologies. For example, the effects of cellular
heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq
and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit
immensely from deep learning approaches.
Splicing
Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of
exons or including parts of introns, creating enormous spatiotemporal flexibility to generate multiple
distinct proteins from a single gene. This remarkable complexity can lend itself to defects that
underlie many diseases. For instance, splicing mutations in the lamin A (LMNA) gene can lead to
specific variants of dilated cardiomyopathy and limb girdle muscular dystrophy [185]. A recent study
found that quantitative trait loci that affect splicing in lymphoblastoid cell lines are enriched within
risk loci for schizophrenia, multiple sclerosis, and other immune diseases, implicating mis-splicing
as a more widespread feature of human pathologies than previously thought [186]. Therapeutic
strategies that aim to modulate splicing are also currently being considered for disorders such as
Duchenne muscular dystrophy and spinal muscular atrophy [185].
Sequencing studies routinely return thousands of unannotated variants, but which cause functional
changes in splicing and how are those changes manifested? Prediction of a “splicing code” has
been a goal of the field for the past decade. Initial machine learning approaches used a naïve
Bayes model and a 2-layer Bayesian neural network with thousands of hand-derived sequence-
based features to predict the probability of exon skipping [187,188]. With the advent of deep
learning, more complex models provided better predictive accuracy [189,190]. Importantly, these
new approaches can take in multiple kinds of epigenomic measurements as well as tissue identity
and RNA binding partners of splicing factors. Deep learning is critical in furthering these kinds of
integrative studies where different data types and inputs interact in unpredictable (often nonlinear)
ways to create higher-order features. Moreover, as in gene expression network analysis,
interrogating the hidden nodes within neural networks could potentially illuminate important aspects
of splicing behavior. For instance, tissue-specific splicing mechanisms could be inferred by training
networks on splicing data from different tissues, then searching for common versus distinctive
hidden nodes, a technique employed by Qin et al. for tissue-specific transcription factor (TF)
binding predictions [191].
A parallel effort has been to use more data with simpler models. An exhaustive study using
readouts of splicing for millions of synthetic intronic sequences uncovered motifs that influence the
strength of alternative splice sites [192]. The authors built a simple linear model using hexamer
motif frequencies that successfully generalized to exon skipping. In a limited analysis using single
nucleotide polymorphisms (SNPs) from three genes, it predicted exon skipping with three times the
accuracy of an existing deep learning-based framework [189]. This case is instructive in that clever
sources of data, not just more descriptive models, are still critical.
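For intuition, a model in the spirit of [192] can be approximated in a few lines: count hexamer frequencies in a sequence window and fit a regularized linear model against measured splice-site strength. The feature extraction, toy data, and regression choice below are a simplified sketch, not the authors' exact pipeline:

```python
from itertools import product
import numpy as np
from sklearn.linear_model import Ridge

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]  # 4096 motifs
INDEX = {h: i for i, h in enumerate(HEXAMERS)}

def hexamer_frequencies(seq):
    """Sliding-window hexamer counts, normalized to frequencies."""
    counts = np.zeros(len(HEXAMERS))
    for i in range(len(seq) - 5):
        counts[INDEX[seq[i:i + 6]]] += 1
    return counts / max(1, len(seq) - 5)

# Hypothetical training data: intronic windows with measured splice strength
seqs = ["ACGTACGTAGGTAAGTACGTACGT", "TTTTCCCCTAGGTTTTAGGTCCCC"]
splice_strength = np.array([0.8, 0.2])
X = np.vstack([hexamer_frequencies(s) for s in seqs])
model = Ridge(alpha=1.0).fit(X, splice_strength)
```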
We already understand how mis-splicing of a single gene can cause diseases such as limb girdle
muscular dystrophy. The challenge now is to uncover how genome-wide alternative splicing
underlies complex, non-Mendelian diseases such as autism, schizophrenia, Type 1 diabetes, and
multiple sclerosis [193]. As a proof of concept, Xiong et al. [189] sequenced five autism spectrum
disorder and 12 control samples, each with an average of 42,000 rare variants, and identified mis-
splicing in 19 genes with neural functions. Such methods may one day enable scientists and
clinicians to rapidly profile thousands of unannotated variants for functional effects on splicing and
nominate candidates for further investigation. Moreover, these nonlinear algorithms can
deconvolve the effects of multiple variants on a single splice event without the need to perform
combinatorial in vitro experiments. The ultimate goal is to predict an individual’s tissue-specific,
exon-specific splicing patterns from their genome sequence and other measurements to enable a
new branch of precision diagnostics that also stratifies patients and suggests targeted therapies to
correct splicing defects. However, to achieve this we expect that methods to interpret the “black
box” of deep neural networks and integrate diverse data sources will be required.
Transcription factors
Transcription factors are proteins that bind regulatory DNA in a sequence-specific manner to
modulate the activation and repression of gene transcription. High-throughput in vitro experimental
assays that quantitatively measure the binding specificity of a TF to a large library of short
oligonucleotides [194] provide rich datasets to model the naked DNA sequence affinity of individual
TFs in isolation. However, in vivo TF binding is affected by a variety of other factors beyond
sequence affinity, such as competition and cooperation with other TFs, TF concentration, and
chromatin state (chemical modifications to DNA and to the packaging proteins, such as histones, that
DNA is wrapped around) [194]. TFs can thus exhibit highly variable binding landscapes across the same genomic
DNA sequence across diverse cell types and states. Several experimental approaches such as
chromatin immunoprecipitation followed by sequencing (ChIP-seq) have been developed to profile
in vivo binding maps of TFs [194]. Large reference compendia of ChIP-seq data are now freely
available for a large collection of TFs in a small number of reference cell states in humans and a
few other model organisms [195]. Due to fundamental material and cost constraints, it is infeasible
to perform these experiments for all TFs in every possible cellular state and species. Hence,
predictive computational models of TF binding are essential to understand gene regulation in
diverse cellular contexts.
Several machine learning approaches have been developed to learn generative and discriminative
models of TF binding from in vitro and in vivo TF binding datasets that associate collections of
DNA sequences with measured binding affinities or bound/unbound labels.
In 2015, Alipanahi et al. developed DeepBind, the first CNN to classify DNA sequences bound
in in vitro and in vivo assays against random DNA sequences matched for dinucleotide
composition [201]. The convolutional layers learn pattern detectors reminiscent of position weight
matrices (PWMs) from a one-hot encoding of the raw input DNA sequences. DeepBind outperformed several state-
of-the-art methods from the DREAM5 in vitro TF-DNA motif recognition challenge [198]. Although
DeepBind was also applied to RNA-binding proteins, in general RNA binding is a separate problem
[202] and accurate models will need to account for RNA secondary structure. Following DeepBind,
several optimized convolutional and recurrent neural network architectures as well as novel hybrid
approaches that combine kernel methods with neural networks have been proposed that further
improve performance [203–206]. Specialized layers and regularizers have also been proposed to
reduce parameters and learn more robust models by taking advantage of specific properties of
DNA sequences such as their reverse complement equivalence [207,208].
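Schematically, these sequence-based models pair a one-hot DNA encoding with 1D convolutional filters that act like learned PWMs, followed by pooling that asks whether a motif occurred anywhere in the window. The sketch below conveys that structure; the layer sizes and names are illustrative, not those of DeepBind [201]:

```python
import torch
import torch.nn as nn

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (4, length) tensor."""
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[BASE_INDEX[base], i] = 1.0
    return x

class TFBindingCNN(nn.Module):
    """Convolutional filters scan the sequence like PWMs; max-pooling
    asks 'did this motif occur anywhere?'; a linear layer scores binding."""
    def __init__(self, n_filters=16, motif_len=12):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=motif_len)
        self.out = nn.Linear(n_filters, 1)

    def forward(self, x):                       # (batch, 4, length)
        h = torch.relu(self.conv(x))            # motif match scores
        h = h.max(dim=2).values                 # best match per filter
        return torch.sigmoid(self.out(h))       # P(bound)

model = TFBindingCNN()
batch = torch.stack([one_hot("ACGT" * 25), one_hot("TTGA" * 25)])
print(model(batch))                             # two binding probabilities
```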
While most of these methods learn independent models for different TFs, in vivo multiple TFs
compete or cooperate to occupy DNA binding sites, resulting in complex combinatorial co-binding
landscapes. To take advantage of this shared structure in in vivo TF binding data, multi-task neural
network architectures have been developed that explicitly share parameters across models for
multiple TFs [206,209,210]. Some of these multi-task models train and evaluate classification
performance relative to an unbound background set of regulatory DNA sequences sampled from
the genome rather than using synthetic background sequences with matched dinucleotide
composition.
The above-mentioned TF binding prediction models that use only DNA sequences as inputs have a
fundamental limitation. Because the DNA sequence of a genome is the same across different cell
types and states, a sequence-only model of TF binding cannot predict different in vivo TF binding
landscapes in new cell types not used during training. One approach for generalizing TF binding
predictions to new cell types is to learn models that integrate DNA sequence inputs with other cell-
type-specific data modalities that modulate in vivo TF binding such as surrogate measures of TF
concentration (e.g. TF gene expression) and chromatin state. Arvey et al. showed that combining
the predictions of SVMs trained on DNA sequence inputs and cell-type specific DNase-seq data,
which measures genome-wide chromatin accessibility, improved in vivo TF binding prediction
within and across cell types [211]. Several “footprinting” based methods have also been developed
that learn to discriminate bound from unbound instances of known canonical motifs of a target TF
based on high-resolution footprint patterns of chromatin accessibility that are specific to the target
TF [212]. However, the genome-wide predictive performance of these methods in new cell types
and states has not been evaluated.
Singh et al. developed transfer string kernels for SVMs to achieve cross-context TF binding prediction [216]. Domain
adaptation methods that allow training neural networks which are transferable between differing
training and test set distributions of sequence features could be a promising avenue going forward
[217,218]. These approaches may also be useful for transferring TF binding models across
species.
Another class of imputation-based methods for cross-cell-type in vivo TF binding prediction
leverages the strong correlation between combinatorial binding landscapes of multiple TFs. Given a partially
complete panel of binding profiles of multiple TFs in multiple cell types, a deep learning method
called TFImpute learns to predict the missing binding profile of a target TF in some target cell type
in the panel based on the binding profiles of other TFs in the target cell type and the binding profile
of the target TF in other cell types in the panel [191]. However, TFImpute cannot generalize
predictions beyond the training panel of cell types and requires TF binding profiles of related TFs.
It is worth noting that TF binding prediction methods in the literature based on neural networks and
other machine learning approaches choose to sample the set of bound and unbound sequences in
a variety of different ways. These choices and the choice of performance evaluation measures
significantly confound systematic comparison of model performance (see Discussion).
Several methods have also been developed to interpret neural network models of TF binding.
Alipanahi et al. visualize convolutional filters to obtain insights into the sequence preferences of
TFs [201]. They also introduced in silico mutation maps for identifying important predictive
nucleotides in input DNA sequences by exhaustively forward propagating perturbations to
individual nucleotides to record the corresponding change in output prediction. Shrikumar et al.
[219] proposed efficient backpropagation based approaches to simultaneously score the
contribution of all nucleotides in an input DNA sequence to an output prediction. Lanchantin et al.
[204] developed tools to visualize TF motifs learned from TF binding site classification tasks. These
and other general interpretation techniques (see Discussion) will be critical for improving our understanding of what these models learn about the sequence determinants of TF binding.
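The in silico mutation map idea is simple to express: substitute each position to each alternative base, rerun the model, and record the change in output, at a cost of roughly 4L forward passes for a length-L sequence. A generic sketch (our own, applicable to any sequence model exposed as a callable):

```python
import torch

def mutation_map(model, x):
    """For a one-hot sequence x of shape (4, L), return a (4, L) tensor of
    prediction changes when each position is mutated to each base.
    `model` is any callable mapping a (1, 4, L) tensor to a scalar score."""
    baseline = model(x.unsqueeze(0)).item()
    effects = torch.zeros_like(x)
    for pos in range(x.shape[1]):
        original = x[:, pos].clone()
        for base in range(4):
            x[:, pos] = 0.0
            x[base, pos] = 1.0
            effects[base, pos] = model(x.unsqueeze(0)).item() - baseline
        x[:, pos] = original                  # restore the original sequence
    return effects                            # large |value| = important site
```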
Promoters and enhancers
Multiple TFs act in concert to coordinate changes in gene regulation at the genomic regions known
as promoters and enhancers. Each gene has an upstream promoter, essential for initiating that
gene’s transcription. The gene may also interact with multiple enhancers, which can amplify
transcription in particular cellular contexts. These contexts include different cell types in
development or environmental stresses.
Promoters and enhancers provide a nexus where clusters of TFs and binding sites mediate
downstream gene regulation, starting with transcription. The gold standard to identify an active
promoter or enhancer requires demonstrating its ability to affect transcription or other downstream
gene products. Even extensive biochemical TF binding data has thus far proven insufficient on its
own to accurately and comprehensively locate promoters and enhancers. We lack sufficient
understanding of these elements to derive a mechanistic “promoter code” or “enhancer code”. But
extensive labeled data on promoters and enhancers lends itself to probabilistic classification. The
complex interplay of TFs and chromatin leading to the emergent properties of promoter and
enhancer activity seems particularly apt for representation by deep neural networks.
Promoters
Enhancers
Several neural network approaches have yielded promising results in enhancer prediction. Both Basset
[227] and DeepEnhancer [228] used CNNs to predict enhancers. DECRES used a feed-forward
neural network [229] to distinguish between different kinds of regulatory elements, such as active
enhancers and promoters. DECRES had difficulty distinguishing between inactive enhancers and
promoters. Its authors also investigated the power of sequence features to drive classification, finding
that beyond CpG islands, few were useful.
Comparing the performance of enhancer prediction methods illustrates the problems in using
metrics created with different benchmarking procedures. Both the Basset and DeepEnhancer
studies include comparisons to a baseline SVM approach, gkm-SVM [200]. The Basset study
reports gkm-SVM attains a mean area under the precision-recall curve (AUPR) of 0.322 over 164
cell types [227]. The DeepEnhancer study reports for gkm-SVM a dramatically different AUPR of
0.899 on nine cell types [228]. This large difference makes it impossible to directly compare the
performance of Basset and DeepEnhancer based solely on their reported metrics. DECRES used
a different set of metrics altogether. To drive further progress in enhancer identification, we must
develop a common and comparable benchmarking procedure (see Discussion).
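The sensitivity of AUPR to benchmark design is easy to demonstrate: for an uninformative classifier, AUPR roughly equals the fraction of positives in the evaluation set, so changing only the negative set changes the metric. A small illustration (our own, not drawn from either study):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_positives = 10000
for n_negatives in (100000, 10000):          # two benchmark designs
    labels = np.concatenate([np.ones(n_positives), np.zeros(n_negatives)])
    scores = rng.random(labels.size)         # an uninformative classifier
    print(n_negatives, round(average_precision_score(labels, scores), 3))
# Roughly 0.091 vs. 0.5: the same (useless) classifier looks far better
# on the benchmark with fewer negatives.
```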
Promoter-enhancer interactions
Micro-RNA binding
Prediction of microRNAs (miRNAs) and miRNA targets is of great interest, as they are critical
components of gene regulatory networks and are often conserved across great evolutionary
distance [231,232]. While many machine learning algorithms have been applied to these tasks,
they currently require extensive feature selection and optimization. For instance, one of the most
widely adopted tools for miRNA target prediction, TargetScan, trained multiple linear regression
models on 14 hand-curated features including structural accessibility of the target site on the
mRNA, the degree of site conservation, and predicted thermodynamic stability of the miRNA-
mRNA complex [233]. Some of these features, including structural accessibility, are imperfect or
empirically derived. In addition, current algorithms suffer from low specificity [234].
Deep learning approaches hold promise here because they can learn relevant interaction features,
including seed matches, mismatches, and wobble base pairing, directly from sequence, without
requiring the user to input secondary structure predictions or thermodynamic calculations.
Further incremental advances in deep learning for
miRNA and target prediction will likely be sufficient to meet the current needs of systems biologists
and other researchers who use prediction tools mainly to nominate candidates that are then tested
experimentally.
Protein secondary and tertiary structure
Here we focus on deep learning methods for two representative sub-problems: secondary structure
prediction and contact map prediction. Secondary structure refers to local conformation of a
sequence segment, while a contact map contains information on all residue-residue contacts.
Secondary structure prediction is a basic problem and an almost essential module of any protein
structure prediction package. Contact prediction is much more challenging than secondary
structure prediction, but it has a much larger impact on tertiary structure prediction. In recent years,
the accuracy of contact prediction has greatly improved [28,237–239].
One can represent protein secondary structure with three different states (alpha helix, beta strand,
and loop regions) or eight finer-grained states. Accuracy of a three-state prediction is called Q3,
and accuracy of an 8-state prediction is called Q8. Several groups [29,240,241] applied deep
learning to protein secondary structure prediction but were unable to achieve significant
improvement over the de facto standard method PSIPRED [242], which uses two shallow
feedforward neural networks. In 2014, Zhou and Troyanskaya demonstrated that they could
improve Q8 accuracy by using a deep supervised and convolutional generative stochastic network
[243]. In 2016 Wang et al. developed a DeepCNF model that improved Q3 and Q8 accuracy as
well as prediction of solvent accessibility and disordered regions [30,236]. DeepCNF achieved a
Q3 accuracy exceeding the standard that PSIPRED had maintained for more than 10 years. This
improvement may be mainly due to the ability of convolutional neural fields to capture long-range
sequential information, which is important for beta strand prediction. Nevertheless, the
improvements in secondary structure prediction from DeepCNF are unlikely to result in a
commensurate improvement in tertiary structure prediction since secondary structure mainly
reflects coarse-grained local conformation of a protein structure.
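Q3 itself is just per-residue accuracy over the three states, as in this small example (our own):

```python
import numpy as np

def q3(predicted, actual):
    """Fraction of residues whose 3-state label (H = helix, E = strand,
    C = coil/loop) is predicted correctly."""
    predicted = np.array(list(predicted))
    actual = np.array(list(actual))
    return float(np.mean(predicted == actual))

print(q3("HHHEEECCC", "HHHEECCCC"))   # 8 of 9 residues match: ~0.889
```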
Protein contact prediction and contact-assisted folding (i.e. folding proteins using predicted
contacts as restraints) represent a promising new direction for ab initio folding of proteins without
good templates in the Protein Data Bank (PDB). Co-evolution analysis is effective for proteins with
a very large number (>1000) of sequence homologs [239], but fares poorly for proteins without many sequence
homologs. By combining co-evolution information with a few other protein features, shallow neural
network methods such as MetaPSICOV [237] and CoinDCA-NN [244] have shown some
advantage over pure co-evolution analysis for proteins with few sequence homologs, but their
accuracy is still far from satisfactory. In recent years, deeper architectures have been explored for
contact prediction, such as CMAPpro [245], DNCON [246] and PConsC [247]. However, blindly
tested in the well-known CASP competitions, these methods did not show any advantage over
MetaPSICOV [237].
Recently, Wang et al. proposed the deep learning method RaptorX-Contact [28], which significantly
improves contact prediction over MetaPSICOV and pure co-evolution methods, especially for
proteins without many sequence homologs. It employs a network architecture formed by one 1D
residual neural network and one 2D residual neural network. Blindly tested in the latest CASP
competition (i.e. CASP12 [248]), RaptorX-Contact ranked first in F₁ score on free-modeling targets
as well as the whole set of targets. In CAMEO (which can be interpreted as a fully-automated
CASP) [249], its predicted contacts were also able to fold proteins with a novel fold and only 65–
330 sequence homologs. This technique also worked well on membrane proteins even when
trained on non-membrane proteins [250]. RaptorX-Contact performed better mainly due to
introduction of residual neural networks and exploitation of contact occurrence patterns by
simultaneously predicting all the contacts in a single protein.
Taken together, ab initio folding is becoming much easier with the advent of direct evolutionary
coupling analysis and deep learning techniques. We expect further improvements in contact
prediction for proteins with fewer than 1000 homologs by studying new deep network architectures.
However, it is unclear if there is an effective way to use deep learning to improve prediction for
proteins with few or no sequence homologs. Finally, the deep learning methods summarized above
also apply to interfacial contact prediction for protein complexes but may be less effective since on
average protein complexes have fewer sequence homologs.
Structure determination and cryo-EM
Some components of cryo-EM image processing remain difficult to automate. For instance, in
particle picking, micrographs are scanned to identify individual molecular images that will be used
in downstream 3D reconstruction.
Downstream of particle picking, deep learning is being applied to other aspects of cryo-EM image
processing. Statistical manifold learning has been implemented in the software package ROME to
classify selected particles and elucidate the different conformations of the subject molecule
necessary for accurate 3D structures [257]. These recent tools highlight the general applicability of
deep learning approaches for image processing to increase the throughput of high-resolution cryo-
EM.
Protein-protein interactions
Protein-protein interactions (PPIs) are highly specific and non-accidental physical contacts
between proteins, which occur for purposes other than generic protein production or degradation
[258]. Abundant interaction data have been generated thanks in part to advances in high-
throughput screening methods, such as yeast two-hybrid and affinity-purification with mass
spectrometry. However, because many PPIs are transient or dependent on biological context, high-
throughput methods can fail to capture a number of interactions. The imperfections and costs
associated with many experimental PPI screening methods have motivated an interest in high-
throughput computational prediction.
Many machine learning approaches to PPI have focused on text mining the literature [259,260], but
these approaches can fail to capture context-specific interactions, motivating de novo PPI
prediction. Early de novo prediction approaches used a variety of statistical and machine learning
tools on structural and sequential data, sometimes with reference to the existing body of protein
structure knowledge. In the context of PPIs—as in other domains—deep learning shows promise
both for exceeding current predictive performance and for circumventing limitations from which
other approaches suffer.
One of the key difficulties in applying deep learning techniques to binding prediction is the task of
representing peptide and protein sequences in a meaningful way. DeepPPI [261] made PPI
predictions from a set of sequence and composition protein descriptors using a two-stage deep
neural network that trained a separate subnetwork for each protein and combined the two into a
single network. Sun et al. [262] applied autocovariances, a coding scheme that returns uniform-size
vectors describing the covariance between physicochemical properties of the protein sequence at
various positions. Wang et al. [263] used deep learning as an intermediate step in PPI prediction:
they examined 70-amino-acid protein sequences, extracting 1,260 features from each.
A stacked sparse autoencoder with two hidden layers was then used to reduce feature dimensions
and noisiness before a novel type of classification vector machine made PPI predictions.
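The autocovariance coding used by Sun et al. has a compact definition: for each physicochemical property P and lag d, average (P_i − P̄)(P_{i+d} − P̄) over the sequence, yielding a fixed-length vector regardless of protein length. A sketch with a single property (values from the Kyte-Doolittle hydrophobicity scale; the function name and lag choice are ours):

```python
import numpy as np

# Per-residue property values (Kyte-Doolittle hydrophobicity, subset)
HYDROPHOBICITY = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
                  "G": -0.4, "L": 3.8, "K": -3.9, "S": -0.8, "T": -0.7}

def autocovariance(seq, max_lag=30):
    """AC(lag) = mean over i of (P_i - mean)(P_{i+lag} - mean): a
    fixed-length (max_lag,) vector regardless of sequence length."""
    p = np.array([HYDROPHOBICITY[aa] for aa in seq])
    centered = p - p.mean()
    return np.array([
        np.mean(centered[:-lag] * centered[lag:])
        for lag in range(1, max_lag + 1)
    ])

features = autocovariance("ARNDCGLKSTARNDCGLKSTARNDCGLKSTAR", max_lag=10)
```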
Beyond predicting whether or not two proteins interact, Du et al. [264] employed a deep learning
approach to predict the residue contacts between two interacting proteins. Using features that
describe how similar a protein’s residue is relative to similar proteins at the same position, the
authors extracted uniform-length features for each residue in the protein sequence. A stacked
autoencoder took two such vectors as input for the prediction of contact between two residues. The
authors evaluated the performance of this method with several classifiers and showed that a deep
neural network classifier paired with the stacked autoencoder significantly exceeded classical
machine learning accuracy.
Because many studies used predefined higher-level features, one of the benefits of deep learning
—automatic feature extraction—is not fully leveraged. More work is needed to determine the best
ways to represent raw protein sequence information so that the full benefits of deep learning as an
automatic feature extractor can be realized.
MHC-peptide binding
An important type of PPI involves the immune system’s ability to recognize the body’s own cells.
The major histocompatibility complex (MHC) plays a key role in regulating this process by binding
antigens and displaying them on the cell surface to be recognized by T cells. Due to its importance
in immunity and immune response, peptide-MHC binding prediction is a useful problem in
computational biology, and one that must account for the allelic diversity in the MHC-encoding gene
region.
Shallow, feed-forward neural networks are competitive methods and have made progress toward
pan-allele and pan-length peptide representations. Sequence alignment techniques are useful for
representing variable-length peptides as uniform-length features [265,266]. For pan-allelic
prediction, NetMHCpan [267,268] used a pseudo-sequence representation of the MHC class I
molecule, which included only polymorphic peptide contact residues. The sequences of the peptide
and MHC were then represented using both sparse vector encoding and Blosum encoding, in
which amino acids are encoded by matrix score vectors. A comparable method to the NetMHC
tools is MHCflurry [269], a method which shows superior performance on peptides of lengths other
than nine. MHCflurry adds placeholder amino acids to transform variable-length peptides to length
15 peptides. In training the MHCflurry feed-forward neural network [270], the authors imputed
missing MHC-peptide binding affinities using a Gibbs sampling method, showing that imputation
improves performance for datasets with roughly 100 or fewer training examples. MHCflurry’s
imputation method increases its performance on poorly characterized alleles, making it competitive
with NetMHCpan for this task. Kuksa et al. [271] developed a shallow, higher-order neural network
(HONN) comprised of both mean and covariance hidden units to capture some of the higher-order
dependencies between amino acid locations. Pretraining this HONN with a semi-restricted
Boltzmann machine, the authors found that the performance of the HONN exceeded that of a
simple deep neural network, as well as that of NetMHC.
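The fixed-length transformation that these feed-forward models rely on can be sketched simply: pad a variable-length peptide to 15 residues with a placeholder symbol and one-hot encode each position. The middle-padding layout and placeholder token below are our simplification of the MHCflurry scheme [269]:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"           # X = placeholder/padding
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_peptide(peptide, target_len=15):
    """Pad a short peptide to 15 residues by inserting placeholder 'X'
    symbols in the middle, then one-hot encode each position."""
    n_pad = target_len - len(peptide)
    mid = len(peptide) // 2
    padded = peptide[:mid] + "X" * n_pad + peptide[mid:]
    onehot = np.zeros((target_len, len(AMINO_ACIDS)))
    for i, aa in enumerate(padded):
        onehot[i, AA_INDEX[aa]] = 1.0
    return onehot.flatten()                     # uniform-length feature vector

x = encode_peptide("SIINFEKL")                  # an 8-mer becomes 15 * 21 values
```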
Deep learning’s unique flexibility was recently leveraged by Bhattacharya et al. [272], who used a
gated RNN method called MHCnuggets to overcome the difficulty of variable-length peptides.
Under this framework, they used smoothed sparse encoding to represent amino acids individually.
Because MHCnuggets had to be trained for every MHC allele, performance was far better for
alleles with abundant, balanced training data. Vang et al. [273] developed HLA-CNN, a method
which maps amino acids onto a 15-dimensional vector space based on their context relation to
other amino acids before making predictions with a CNN. In a comparison of several current
methods, Bhattacharya et al. found that the top methods—NetMHC, NetMHCpan, MHCflurry, and
MHCnuggets—showed comparable performance, but large differences in speed. Convolutional
neural networks (in this case, HLA-CNN) showed comparatively poor performance, while shallow
and recurrent neural networks performed the best. They found that MHCnuggets—the recurrent
neural network—was by far the fastest-training among the top performing methods.
An important challenge in PPI network prediction is the task of combining different networks and
types of networks. Gligorijevic et al. [276] developed a multimodal deep autoencoder, deepNF, to
find a feature representation common among several different PPI networks. This common lower-
level representation allows for the combination of various PPI data sources towards a single
predictive task. An SVM classifier trained on the compressed features from the middle layer of the
autoencoder outperformed previous methods in predicting protein function.
Hamilton et al. addressed the issue of large, heterogeneous, and changing networks with an
inductive approach called GraphSAGE [277]. By finding node embeddings through learned
aggregator functions that describe the node and its neighbors in the network, the GraphSAGE
approach allows for the generalization of the model to new graphs. In a classification task for the
prediction of protein function, Chen and Zhu [278] optimized this approach and enhanced the
graph convolutional network with a preprocessing step that uses an approximation to the dropout
operation. This preprocessing effectively reduces the number of graph convolutional layers and it
significantly improves both training time and prediction accuracy.
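The central GraphSAGE idea fits in one function: update each node's embedding from its own features and an aggregate of its neighbors' features, so the learned weights apply to nodes and graphs never seen in training. Below is a sketch of a single mean-aggregator layer (one of several aggregators proposed in [277]); the weight shapes and toy graph are ours:

```python
import numpy as np

def graphsage_mean_layer(features, adjacency, W_self, W_neigh):
    """One mean-aggregator GraphSAGE-style layer.

    features:  (n_nodes, d_in) node feature matrix
    adjacency: dict mapping node -> list of neighbor indices
    Returns (n_nodes, d_out) embeddings; because the learned weights act on
    features rather than node identities, they transfer to unseen graphs."""
    out = []
    for v in range(features.shape[0]):
        neigh = adjacency.get(v, [])
        agg = features[neigh].mean(axis=0) if neigh else np.zeros(features.shape[1])
        h = features[v] @ W_self + agg @ W_neigh
        out.append(np.maximum(h, 0.0))          # ReLU
    out = np.array(out)
    return out / np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-8)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 proteins, 8 features each
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}    # a tiny PPI graph
H = graphsage_mean_layer(X, adj, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```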
Morphological phenotypes
A field poised for dramatic revolution by deep learning is bioimage analysis. Thus far, the primary
use of deep learning for biological images has been for segmentation—that is, for the identification
of biologically relevant structures in images such as nuclei, infected cells, or vasculature—in
fluorescence or even brightfield channels [279]. Once so-called regions of interest have been
identified, it is often straightforward to measure biological properties of interest, such as
fluorescence intensities, textures, and sizes. Given the dramatic successes of deep learning in
biological imaging, we simply refer to articles that review recent advancements [16,279,280]. For
deep learning to become commonplace for biological image segmentation, we need user-friendly
tools.
We anticipate an additional paradigm shift in bioimaging that will be brought about by deep
learning: what if images of biological samples, from simple cell cultures to three-dimensional
organoids and tissue samples, could be mined for much more extensive biologically meaningful
information than is currently standard? For example, a recent study demonstrated the ability to
predict lineage fate in hematopoietic cells up to three generations in advance of differentiation
[281]. In biomedical research, most often biologists decide in advance what feature to measure in
images from their assay system. Although classical methods of segmentation and feature
extraction can produce hundreds of metrics per cell in an image, deep learning is unconstrained by
human intuition and can in theory extract more subtle features through its hidden nodes. Already,
there is evidence that deep learning can surpass the efficacy of classical methods [282], even when
reusing generic deep convolutional networks trained on natural images [283], an approach known as transfer learning.
Recent work by Johnson et al. [284] demonstrated how the use of a conditional adversarial
autoencoder allows for a probabilistic interpretation of cell and nuclear morphology and structure
localization from fluorescence images. The proposed model is able to generalize well to a wide
range of subcellular localizations. The generative nature of the model allows it to produce high-
quality synthetic images predicting localization of subcellular structures by directly modeling the
localization of fluorescent labels. Notably, this approach reduces the modeling time by omitting the
subcellular structure segmentation step.
The impact of further improvements on biomedicine could be enormous. Comparing cell population
morphologies using conventional methods of segmentation and feature extraction has already
proven useful for functionally annotating genes and alleles, identifying the cellular target of small
molecules, and identifying disease-specific phenotypes suitable for drug screening [285–287].
Deep learning would bring to these new kinds of experiments—known as image-based profiling or
morphological profiling—a higher degree of accuracy, stemming from the freedom from human-
tuned feature extraction strategies.
Single-cell data
Single-cell methods are generating excitement as biologists characterize the vast heterogeneity
within unicellular species and between cells of the same tissue type in the same organism [288].
For instance, tumor cells and neurons can both harbor extensive somatic variation [289].
However, large challenges exist in studying single cells. Relatively few cells can be assayed at
once using current droplet, imaging, or microwell technologies, and low-abundance molecules or
modifications may not be detected by chance due to a phenomenon known as dropout, not to be
confused with the dropout layer of deep learning. To solve this problem, Angermueller et al. [293]
trained a neural network to predict the presence or absence of methylation of a specific CpG site in
single cells based on surrounding methylation signal and underlying DNA sequence, achieving
several percentage points of improvement compared to random forests or deep networks trained
only on CpG or sequence information. Similar deep learning methods have been applied to impute
low-resolution ChIP-seq signal from bulk tissue with great success, and they could easily be
adapted to single-cell data [191,294]. Deep learning has also been useful for dealing with batch
effects [295].
Examining populations of single cells can reveal biologically meaningful subsets of cells as well as
their underlying gene regulatory networks [296]. Unfortunately, machine learning methods
generally struggle with imbalanced data—when there are many more examples of class 1 than
class 2—because prediction accuracy is usually evaluated over the entire dataset. To tackle this
challenge, Arvaniti et al. [297] classified healthy and cancer cells expressing 25 markers by using
the most discriminative filters from a CNN trained on the data as a linear classifier. They achieved
impressive performance, even for cell types comprising only 0.1% to 1% of the population,
significantly outperforming logistic regression and distance-based outlier detection methods.
However, they did not benchmark against random forests, which tend to work better for imbalanced
data, and their data was relatively low dimensional.
Neural networks can also learn low-dimensional representations of single-cell gene expression
data for visualization, clustering, and other tasks. Both scvis [298] and scVI [299] are unsupervised
approaches based on VAEs. Whereas scvis primarily focuses on single-cell visualization as a
replacement for t-Distributed Stochastic Neighbor Embedding [300], the scVI model accounts for
zero-inflated expression distributions and can impute zero values that are due to technical effects.
Beyond VAEs, Lin et al. developed a supervised model to predict cell type [301]. Similar to transfer
learning approaches for microscopy images [283], they demonstrated that the hidden layer
representations were informative in general and could be used to identify cellular subpopulations or
match new cells to known cell types. The supervised neural network’s representation was better
overall at retrieving cell types than alternatives, but all methods struggled to recover certain cell
types such as hematopoietic stem cells and inner cell mass cells. As the Human Cell Atlas [302]
and related efforts generate more single-cell expression data, there will be opportunities to assess
how well these low-dimensional representations generalize to new cell types as well as abundant
training data to learn broadly-applicable representations.
The sheer quantity of omic information that can be obtained from each cell, as well as the number
of cells in each dataset, uniquely position single-cell data to benefit from deep learning. In the
future, lineage tracing could be revolutionized by using autoencoders to reduce the feature space
of transcriptomic or variant data followed by algorithms to learn optimal cell differentiation
trajectories [303] or by feeding cell morphology and movement into neural networks [281].
Reinforcement learning algorithms [304] could be trained on the evolutionary dynamics of cancer
cells or bacterial cells undergoing selection pressure and reveal whether patterns of adaptation are
random or deterministic, allowing us to develop therapeutic strategies that forestall resistance. We
are excited to see the creative applications of deep learning to single-cell biology that emerge over
the next few years.
Metagenomics
Metagenomics, which refers to the study of genetic material—16S rRNA or whole-genome shotgun
DNA—from microbial communities, has revolutionized the study of micro-scale ecosystems within
and around us. In recent years, machine learning has proved to be a powerful tool for
metagenomic analysis. 16S rRNA sequencing has long been used to deconvolve mixtures of microbial
genomes, yet it ignores more than 99% of the genomic content. Subsequent tools aimed to
classify 300–3000 bp reads from complex mixtures of microbial genomes based on tetranucleotide
frequencies, which differ across organisms [305], using supervised [306,307] or unsupervised
methods [308]. Then, researchers began to use techniques that could estimate relative
abundances from an entire sample faster than classifying individual reads [309–312]. There is also
great interest in identifying and annotating sequence reads [313,314]. However, the focus on
taxonomic and functional annotation is just the first step. Several groups have proposed methods
to determine host or environment phenotypes from the organisms that are identified [315–318] or
overall sequence composition [319]. Also, researchers have looked into how feature selection can
improve classification [318,320], and techniques have been proposed that are classifier-
independent [321,322].
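Tetranucleotide composition features of the kind these classifiers consume are straightforward to compute; a sketch (function name and normalization are our own):

```python
from itertools import product
import numpy as np

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 features
INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def tetranucleotide_frequencies(read):
    """Sliding-window 4-mer counts normalized to frequencies; reads from
    different organisms tend to occupy different regions of this space."""
    counts = np.zeros(len(TETRAMERS))
    for i in range(len(read) - 3):
        kmer = read[i:i + 4]
        if kmer in INDEX:                        # skip windows containing N, etc.
            counts[INDEX[kmer]] += 1
    return counts / max(1.0, counts.sum())

x = tetranucleotide_frequencies("ACGTACGTGGGGCCCCAAAATTTT")
```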
Most neural networks are used for phylogenetic classification or functional annotation from
sequence data where there is ample data for training. Neural networks have been applied
successfully to gene annotation (e.g. Orphelia [323] and FragGeneScan [324]). Representations
(similar to Word2Vec [102] in natural language processing) for protein family classification have
been introduced and classified with a skip-gram neural network [325]. Recurrent neural networks
show good performance for homology and protein family identification [326,327].
One of the first techniques of de novo genome binning used self-organizing maps, a type of neural
network [308]. Essinger et al. [328] used Adaptive Resonance Theory to cluster similar genomic
fragments and showed that it had better performance than k-means. However, other methods
based on interpolated Markov models [329] have performed better than these early genome
binners. Neural networks can be slow and therefore have had limited use for reference-based
taxonomic classification, with TAC-ELM [330] being the only neural network-based algorithm to
taxonomically classify massive amounts of metagenomic data. An initial study successfully applied
neural networks to taxonomic classification of 16S rRNA genes, with convolutional networks
providing roughly a 10% improvement in genus-level accuracy over RNNs and random forests [331].
However, this study evaluated only 3000 sequences.
Applications of neural networks to classifying phenotype from microbial composition are just beginning. A
simple multi-layer perceptron (MLP) was able to classify wound severity from microbial species
present in the wound [332]. Recently, Ditzler et al. associated soil samples with pH level using
MLPs, DBNs, and RNNs [333]. Besides classifying samples appropriately, internal phylogenetic
tree nodes inferred by the networks represented features for low and high pH. Thus, hidden nodes
might provide biological insight as well as new features for future metagenomic sample
comparison. Also, an initial study has shown promise of these networks for diagnosing disease
[334].
Challenges remain in applying deep neural networks to metagenomics problems. They are not yet
ideal for phenotype classification because most studies contain tens of samples and hundreds or
thousands of features (species). Such underdetermined, or ill-conditioned, problems are still a
challenge for deep neural networks that require many training examples. Also, due to convergence
issues [335], taxonomic classification of reads from whole genome sequencing seems out of reach
at the moment for deep neural networks. Only thousands of fully sequenced genomes are available
for training, compared to the hundreds of thousands of available 16S rRNA sequences.
However, because RNNs have been applied with some success to base calling for the Oxford
Nanopore long-read sequencer [336] (discussed below), one day the entire pipeline, from denoising
to functional classification, may be combined into one step using powerful LSTMs [337]. For
example, metagenomic assembly usually requires binning then assembly, but could deep neural
nets accomplish both tasks in one network? We believe the greatest potential in deep learning is to
learn the complete characteristics of a metagenomic sample in one complex network.
Sequencing and variant calling
Current methods achieve relatively high (>99%) precision at 90% recall for SNP and indel calls
from Illumina short-read data [338], yet this still leaves a large number of potentially clinically
important false positives and false negatives. These methods have so far relied on experts to build
probabilistic models that reliably separate signal from noise. However, this process is time
consuming and fundamentally limited by how well we understand and can model the factors that
contribute to noise. Recently, two groups have applied deep learning to construct data-driven
unbiased noise models. One of these models, DeepVariant, leverages Inception, a neural network
trained for image classification by Google Brain, by encoding reads around a candidate SNP as a
221x100 bitmap image, where each column is a nucleotide and each row is a read from the
sample library [338]. The top 5 rows represent the reference, and the bottom 95 rows represent
randomly sampled reads that overlap the candidate variant. Each RGBA (red/green/blue/alpha)
image pixel encodes the base (A, C, G, T) as a different red value, quality score as a green value,
strand as a blue value, and variation from the reference as the alpha value. The neural network
outputs genotype probabilities for each candidate variant. They were able to achieve better
performance than GATK [339], a leading genotype caller, even when GATK was given information
about population variation for each candidate variant. Another method, still in its infancy, used 62
hand-developed features for each candidate variant and fed these vectors into a fully connected
deep neural network [340]. Unfortunately, this feature set required at least 15 iterations of software
development to fine-tune, which suggests that these models may not generalize.
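To make the encoding concrete, the following is a rough schematic in the spirit of the description above, not DeepVariant's actual implementation; the read handling, window padding, and quality scaling are illustrative assumptions.

```python
# Schematic pileup encoding: a 100x221x4 tensor where, per pixel, channel 0
# encodes base identity, channel 1 base quality, channel 2 strand, and
# channel 3 disagreement with the reference (the "alpha" channel above).
import numpy as np

BASE_VALUES = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
WINDOW = 221  # columns: positions centered on the candidate variant
ROWS = 100    # rows: 5 reference rows plus 95 sampled reads

def encode_pileup(reference, reads):
    """reference: WINDOW-length string; reads: (seq, quals, is_reverse) tuples,
    assumed already aligned to and padded across the window."""
    tensor = np.zeros((ROWS, WINDOW, 4), dtype=np.float32)
    for row in range(5):  # the top rows repeat the reference sequence
        for col, base in enumerate(reference):
            tensor[row, col] = (BASE_VALUES.get(base, 0.0), 1.0, 0.0, 0.0)
    for row, (seq, quals, is_reverse) in enumerate(reads[:ROWS - 5], start=5):
        for col, (base, q) in enumerate(zip(seq, quals)):
            tensor[row, col, 0] = BASE_VALUES.get(base, 0.0)
            tensor[row, col, 1] = min(q, 40) / 40.0  # scaled base quality
            tensor[row, col, 2] = 1.0 if is_reverse else 0.0
            tensor[row, col, 3] = float(base != reference[col])
    return tensor

pileup = encode_pileup("A" * WINDOW, [("A" * WINDOW, [30] * WINDOW, False)])
```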
Variant calling will benefit more from optimizing neural network architectures than from developing
features by hand. An interesting and informative next step would be to rigorously test if encoding
raw sequence and quality data as an image, tensor, or some other mixed format produces the best
variant calls. Because many of the latest neural network architectures (ResNet, Inception,
Xception, and others) are already optimized for and pre-trained on generic, large-scale image
datasets [341], encoding genomic data as images could prove to be a generally effective and
efficient strategy.
In limited experiments, DeepVariant was robust to sequencing depth, read length, and even
species [338]. However, a model built on Illumina data, for instance, may not be optimal for Pacific
Biosciences long-read data or MinION nanopore data, which have vastly different specificity and
sensitivity profiles and signal-to-noise characteristics. Recently, Boza et al. used bidirectional
recurrent neural networks to infer the E. coli sequence from MinION nanopore electric current data
with higher per-base accuracy than the proprietary hidden Markov model-based algorithm
Metrichor [336]. Unfortunately, training any neural network requires a large amount of data, which
is often not available for new sequencing technologies. To circumvent this, one very preliminary
study simulated mutations and spiked them into somatic and germline RNA-seq data, then trained
and tested a neural network on simulated paired RNA-seq and exome sequencing data [342].
However, because this model was not subsequently tested on ground-truth datasets, it is unclear
whether simulation can produce sufficiently realistic data to produce reliable models.
Method development for interpreting new types of sequencing data has historically taken two
steps: first, easily implemented hard cutoffs that prioritize specificity over sensitivity, then expert
development of probabilistic models with hand-developed inputs [342]. We anticipate that these
steps will be replaced by deep learning, which will infer features simply by its ability to optimize a
complex model against data.
Neuroscience
Artificial neural networks were originally conceived as a model for computation in the brain [6].
Although deep neural networks have evolved to become a workhorse across many fields, there is
still a strong connection between deep networks and the study of the brain. The rich parallel history
of artificial neural networks in computer science and neuroscience is reviewed in [343–345].
Convolutional neural networks were originally conceived as faithful models of visual information
processing in the primate visual system, and are still considered so [346]. The activations of hidden
units in consecutive layers of deep convolutional networks have been found to parallel the activity
of neurons in consecutive brain regions involved in processing visual scenes. Such models of
neural computation are called “encoding” models, as they predict how the nervous system might
encode sensory information in the world.
Even when they are not directly modeling biological neurons, deep networks have been a useful
computational tool in neuroscience. They have been developed as statistical time series models of
neural activity in the brain. And in contrast to the encoding models described earlier, these models
are used for decoding neural activity, for instance in brain machine interfaces [347]. They have
been crucial to the field of connectomics, which is concerned with mapping the connectivity of
biological neural networks in the brain. In connectomics, deep networks are used to segment the
shapes of individual neurons and to infer their connectivity from 3D electron microscopic images
[348], and they have also been used to infer causal connectivity from optical measurement and
perturbation of neural activity [349].
It is an exciting time for neuroscience. Recent rapid progress in deep networks continues to inspire
new machine learning based models of brain computation [343]. And neuroscience continues to
inspire new models of artificial intelligence [345].
Clinical decision making
Although neural networks have several advantages in representational power, the difficulties in
interpretation may limit clinical applications, a limitation that still remains today. In addition, the
challenges faced by physicians parallel those encountered by deep learning. For a given patient,
the number of possible diseases is very large, with a long tail of rare diseases, and patients are
highly heterogeneous, presenting with very different signs and symptoms for the same disease.
Still, in 2006 Lisboa and Taktak [355] examined the use of artificial neural networks in medical
journals, concluding that they improved healthcare relative to traditional screening methods in 21 of
27 studies.
While further progress has been made in using deep learning for clinical decision making, it is
hindered by a challenge common to many deep learning applications: it is much easier to predict
an outcome than to suggest an action to change the outcome. Several attempts [118,120] at
recasting the clinical decision-making problem into a prediction problem (i.e. prediction of which
treatment will most improve the patient’s health) have accurately predicted survival patterns, but
technical and medical challenges remain for clinical adoption (similar to those for categorization). In
particular, remaining barriers include actionable interpretability of deep learning models, fitting
deep models to limited and heterogeneous data, and integrating complex predictive models into a
dynamic clinical environment.
A common challenge for deep learning is the interpretability of the models and their predictions.
The task of clinical decision making is necessarily risk-averse, so model interpretability is key.
Without clear reasoning, it is difficult to establish trust in a model. As described above, there has
been some work to directly assign treatment plans without interpretability; however, removing
human experts from the decision-making loop makes the models difficult to integrate with clinical
practice. To alleviate this challenge, several studies have attempted to create more interpretable
deep models, either specifically for healthcare or as a general procedure for deep learning (see
Discussion).
A common application of deep learning in this domain is modeling the temporal structure of
healthcare records. Many studies [359–362] have used RNNs to categorize patients, but most stop short of
suggesting clinical decisions. Nemati et al. [363] used deep reinforcement learning to optimize a
heparin dosing policy for intensive care patients. However, because the ideal dosing policy is
unknown, the model’s predictions must be evaluated on counter-factual data. This represents a
common challenge when bridging the gap between research and clinical practice. Because the
ground-truth is unknown, researchers struggle to evaluate model predictions in the absence of
interventional data, but clinical application is unlikely until the model has been shown to be
effective. The impressive applications of deep reinforcement learning to other domains [304] have
relied on knowledge of the underlying processes (e.g. the rules of the game). Some models have
been developed for targeted medical problems [364], but a generalized engine is beyond current
capabilities.
A clinical deep learning task that has been more successful is the assignment of patients to clinical
trials. Ithapu et al. [365] used a randomized denoising autoencoder to learn a multimodal imaging
marker that predicts future cognitive and neural decline from positron emission tomography (PET),
amyloid florbetapir PET, and structural magnetic resonance imaging. By accurately predicting
which cases will progress to dementia, they were able to efficiently assign patients to a clinical trial
and reduce the required sample sizes by a factor of five. Similarly, Artemov et al. [366] applied
deep learning to predict which clinical trials were likely to fail and which were likely to succeed. By
predicting the side effects and pathway activations of each drug and translating these activations to
a success probability, their deep learning-based approach was able to significantly outperform a
random forest classifier trained on gene expression changes. These approaches suggest
promising directions to improve the efficiency of clinical trials and accelerate drug development.
Drug repositioning
Drug repositioning (or repurposing) is an attractive option for delivering new drugs to the market
because of the high costs and failure rates associated with more traditional drug discovery
approaches [367,368]. A decade ago, the Connectivity Map [369] had a sizeable impact. Reverse
matching disease gene expression signatures with a large set of reference compound profiles
allowed researchers to formulate repurposing hypotheses at scale using a simple non-parametric
test. Since then, several advanced computational methods have been applied to formulate and
validate drug repositioning hypotheses [370–372]. Using supervised learning and collaborative
filtering to tackle this type of problem is proving successful, especially when coupling disease or
compound omic data with topological information from protein-protein or protein-compound
interaction networks [373–375].
For example, Menden et al. [376] used a shallow neural network to predict sensitivity of cancer cell
lines to drug treatment using both cell line and drug features, opening the door to precision
medicine and drug repositioning opportunities in cancer. More recently, Aliper et al. [36] used gene-
and pathway-level drug perturbation transcriptional profiles from the Library of Network-Based
Cellular Signatures [377] to train a fully connected deep neural network to predict drug therapeutic
uses and indications. By using confusion matrices and leveraging misclassification, the authors
formulated a number of interesting hypotheses, including repurposing cardiovascular drugs such
as otenzepad and pinacidil for neurological disorders.
Drug repositioning can also be approached by attempting to predict novel drug-target interactions
and then repurposing the drug for the associated indication [378,379]. Wang et al. [380] devised a
pairwise input neural network with two hidden layers that takes two inputs, a drug and a target
binding site, and predicts whether they interact. Wang et al. [37] trained individual RBMs for each
target in a drug-target interaction network and used these models to predict novel interactions
pointing to new indications for existing drugs. Wen et al. [38] extended this concept to deep
learning by creating a DBN called DeepDTIs, which predicts interactions using chemical structure
and protein sequence features.
Drug repositioning appears to be an obvious candidate for deep learning both because of the large
amount of high-dimensional data available and the complexity of the question being asked.
However, perhaps the most promising piece of work in this space [36] is more of a proof of concept
than a real-world hypothesis-generation tool; notably, deep learning was used to predict drug
indications but not for the actual repositioning. At present, some of the most popular state-of-the-art
methods for signature-based drug repurposing [381] do not use predictive modeling. A mature and
production-ready framework for drug repositioning via deep learning is currently missing.
Drug development
Computational work in this domain aims to identify sufficient candidate active compounds without
exhaustively screening libraries of hundreds of thousands or millions of chemicals. Predicting
chemical activity computationally is known as virtual screening. An ideal algorithm will rank a
sufficient number of active compounds before the inactives, but the rankings of actives relative to
other actives and inactives are less important [384]. Computational modeling also has the potential
to predict absorption, distribution, metabolism, excretion, and toxicity (ADMET) traits for lead
generation [385] and how drugs are metabolized [386].
Subsequent work (reviewed by Goh et al. [20]) explored the effects of jointly modeling far more
targets than the Merck challenge [390,391],
with Ramsundar et al. [391] showing that the benefits of multi-task networks had not yet saturated
even with 259 targets. Although DeepTox [392], a deep learning approach, won another
competition, the Toxicology in the 21st Century (Tox21) Data Challenge, it did not dominate
alternative methods as thoroughly as in other domains. DeepTox was the top performer on 9 of 15
targets and highly competitive with the top performer on the others. However, for many targets
there was little separation between the top two or three methods.
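The multi-task architecture underlying these results is conceptually simple: a trunk of shared hidden layers consumes a chemical feature vector, and one output head per assay shares that trunk. The PyTorch sketch below uses illustrative layer sizes and dropout rates, not values from the cited studies.

```python
# Minimal multi-task sketch: shared hidden layers over a fingerprint vector
# with one binary-activity logit per target assay; sharing the trunk is
# what allows transfer across related tasks.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features=2048, n_tasks=259, hidden=1024):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.25),
        )
        self.heads = nn.Linear(hidden, n_tasks)

    def forward(self, x):
        return self.heads(self.trunk(x))

model = MultiTaskNet()
logits = model(torch.rand(32, 2048))  # a batch of 32 fingerprinted compounds
# train with nn.BCEWithLogitsLoss, masking assays a compound was never tested in
```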
The nuanced Tox21 performance may be more reflective of the practical challenges encountered in
ligand-based chemical screening than the extreme enthusiasm generated by the Merck
competition. A study of 22 ADMET tasks demonstrated that there are limitations to multi-task
transfer learning that are in part a consequence of the degree to which tasks are related [385].
Some of the ADMET datasets showed superior performance in multi-task models with only 22
ADMET tasks compared to multi-task models with over 500 less-similar tasks. In addition, the
training datasets encountered in practical applications may be tiny relative to what is available in
public datasets and organized competitions. A study of BACE-1 inhibitors included only 1547
compounds [393]. Machine learning models were able to train on this limited dataset, but overfitting
was a challenge and the differences between random forests and a deep neural network were
negligible, especially in the classification setting. Overfitting is still a problem in larger chemical
screening datasets with tens or hundreds of thousands of compounds because the number of
active compounds can be very small, on the order of 0.1% of all tested chemicals for a typical
target [394]. This has motivated low-parameter neural networks that emphasize compound-
compound similarity, such as influence-relevance voter [384,395], instead of predicting compound
activity directly from chemical features.
Much of the recent excitement in this domain has come from what could be considered a creative
experimentation phase, in which deep learning has offered novel possibilities for feature
representation and modeling of chemical compounds. A molecular graph, where atoms are labeled
nodes and bonds are labeled edges, is a natural way to represent a chemical structure. Chemical
features can be represented as a list of molecular descriptors such as molecular weight, atom
counts, functional groups, charge representations, summaries of atom-atom relationships in the
molecular graph, and more sophisticated derived properties [396]. Traditional machine learning
approaches relied on preprocessing the graph into a feature vector of molecular descriptors or a
fixed-width bit vector known as a fingerprint [397]. The same fingerprints have been used by some
drug-target interaction methods discussed above [38]. An overly simplistic but approximately
correct view of chemical fingerprints is that each bit represents the presence or absence of a
particular chemical substructure in the molecular graph. Instead of using molecular descriptors or
fingerprints as input, modern neural networks can represent chemicals as textual strings [398] or
images [399] or operate directly on the molecular graph, which has enabled strategies for learning
novel chemical representations.
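For readers unfamiliar with these representations, the snippet below computes a circular (Morgan/ECFP-style) fingerprint with the open-source RDKit toolkit, assuming RDKit is installed; the radius and bit width shown are conventional defaults, not values prescribed by the text.

```python
# A circular fingerprint: each set bit marks (approximately) the presence
# of a particular substructure in the molecular graph.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
on_bits = list(fp.GetOnBits())  # indices of substructure-like features present
```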
Virtual screening and chemical property prediction have emerged as one of the major application
areas for graph-based neural networks. Duvenaud et al. [400] generalized standard circular
fingerprints by substituting the discrete operations of the fingerprinting algorithm with differentiable
neural network operations, yielding real-valued feature vectors in place of fixed bit vectors.
Advances in chemical representation learning have also enabled new strategies for learning
chemical-chemical similarity functions. Altae-Tran et al. developed a one-shot learning network
[403] to address the reality that most practical chemical screening studies are unable to provide the
thousands or millions of training compounds that are needed to train larger multi-task networks.
Using graph convolutions to featurize chemicals, the network learns an embedding from
compounds into a continuous feature space such that compounds with similar activities in a set of
training tasks have similar embeddings. The approach is evaluated in an extremely challenging
setting. The embedding is learned from a subset of prediction tasks (e.g. activity assays for
individual proteins), and only one to ten labeled examples are provided as training data on a new
task. On Tox21 targets, even when trained with one task-specific active compound and one
inactive compound, the model is able to generalize reasonably well because it has learned an
informative embedding function from the related tasks. Random forests, which cannot take
advantage of the related training tasks, trained in the same setting are only slightly better than a
random classifier. Despite the success on Tox21, performance on the MUV datasets, which contain
assays designed to be challenging for chemical informatics algorithms, is considerably worse. The
authors also demonstrate the limitations of transfer learning as embeddings learned from the Tox21
assays have little utility for a drug adverse reaction dataset.
These novel, learned chemical feature representations may prove to be essential for accurately
predicting why some compounds with similar structures yield similar target effects and others
produce drastically different results. Currently, these methods are enticing but do not necessarily
outperform classic approaches by a large margin. The neural fingerprints [400] were narrowly
beaten by regression using traditional circular fingerprints on a drug efficacy prediction task but
were superior for predicting solubility or photovoltaic efficiency. In the original study, graph
convolutions [402] performed comparably to a multi-task network using standard fingerprints and
slightly better than the neural fingerprints [400] on the drug efficacy task but were slightly worse
than the influence-relevance voter method on an HIV dataset [384]. Broader recent benchmarking
has shown that the relative merits of these methods depend on the dataset and cross-validation
strategy [407], though evaluation in this domain often uses area under the receiver operating
characteristic curve (AUROC) [408], which has limited utility due to the large class imbalance (see
Discussion).
We remain optimistic for the potential of deep learning and specifically representation learning in
drug discovery. Rigorous benchmarking on broad and diverse prediction tasks will be as important
as novel neural network architectures to advance the state of the art and convincingly demonstrate
superiority over traditional cheminformatics techniques. Fortunately, there has recently been much
progress in this direction. The DeepChem software [403,409] and MoleculeNet benchmarking suite
[407] built upon it contain chemical bioactivity and toxicity prediction datasets, multiple compound
featurization approaches including graph convolutions, and various machine learning algorithms
ranging from standard baselines like logistic regression and random forests to recent neural
network architectures. Independent research groups have already contributed additional datasets
and prediction algorithms to DeepChem. Adoption of common benchmarking evaluation metrics,
datasets, and baseline algorithms has the potential to establish the practical utility of deep learning
in chemical bioactivity prediction and lower the barrier to entry for machine learning researchers
without biochemistry expertise.
One open question in ligand-based screening pertains to the benefits and limitations of transfer
learning. Multi-task neural networks have shown the advantages of jointly modeling many targets
[390,391]. Other studies have shown the limitations of transfer learning when the prediction tasks
are insufficiently related [385,403]. This has important implications for representation learning. The
typical approach to improve deep learning models by expanding the dataset size may not be
applicable if only “related” tasks are beneficial, especially because task-task relatedness is ill-
defined. The massive chemical state space will also influence the development of unsupervised
representation learning methods [398,410]. Future work will establish whether it is better to train on
massive collections of diverse compounds, drug-like small molecules, or specialized subsets.
When protein structure is available, virtual screening has traditionally relied on docking programs to
predict how a compound best fits in the target’s binding site and score the predicted ligand-target
complex [411]. Recently, deep learning approaches have been developed to model protein
structure, which is expected to improve upon the simpler drug-target interaction algorithms
described above that represent proteins with feature vectors derived from amino acid sequences
[38,380].
There are two established options for representing a protein-compound complex. One option, a 3D
grid, can featurize the input complex [35,416]. Each entry in the grid tracks the types of protein and
ligand atoms in that region of the 3D space or descriptors derived from those atoms. Alternatively,
DeepVS [415] and atomic convolutions [412] offer greater flexibility in their convolutions by
eschewing the 3D grid. Instead, they each implement techniques for executing convolutions over
atoms’ neighboring atoms in the 3D space. Gomes et al. demonstrate that, at present, a random
forest trained on a 1D feature vector describing the 3D ligand-target structure generally
outperforms neural networks on the same feature vector, as well as atomic convolutions and
ligand-based neural networks, when predicting the continuous-valued inhibition constant on the PDBBind refined
dataset [412]. However, in the long term, atomic convolutions may ultimately overtake grid-based
methods, as they provide greater freedom to model atom-atom interactions and the forces that
govern binding affinity.
De novo drug design attempts to model the typical design-synthesize-test cycle of drug discovery
[417,418]. It explores an estimated 10^60 synthesizable organic molecules with drug-like properties
without explicit enumeration [394]. To test or score structures, algorithms like those discussed
earlier are used. To “design” and “synthesize”, traditional de novo design software relied on
classical optimizers such as genetic algorithms. Unfortunately, this often leads to overfit, “weird”
molecules, which are difficult to synthesize in the lab. Current programs have settled on rule-based
virtual chemical reactions to generate molecular structures [418]. Deep learning models that
generate realistic, synthesizable molecules have been proposed as an alternative. In contrast to
the classical, symbolic approaches, generative models learned from data would not depend on
laboriously encoded expert knowledge. The challenge of generating molecules has parallels to the
generation of syntactically and semantically correct text [419].
As deep learning models that directly output (molecular) graphs remain under-explored, generative
neural networks for drug design typically represent chemicals with the simplified molecular-input
line-entry system (SMILES), a standard string-based representation with characters that represent
atoms, bonds, and rings [420]. This allows treating molecules as sequences and leveraging recent
progress in recurrent neural networks. Gómez-Bombarelli et al. designed a SMILES-to-SMILES
autoencoder to learn a continuous latent feature space for chemicals [398]. In this learned
continuous space it was possible to interpolate between continuous representations of chemicals
in a manner that is not possible with discrete (e.g. bit vector or string) features or in symbolic,
molecular graph space. Even more interesting is the prospect of performing gradient-based or
Bayesian optimization of molecules within this latent space. The strategy of constructing simple,
continuous features before applying supervised learning techniques is reminiscent of autoencoders
trained on high-dimensional EHR data [112]. A drawback of the SMILES-to-SMILES autoencoder is
that not all SMILES strings produced by the autoencoder’s decoder correspond to valid chemical
structures. Recently, the Grammar Variational Autoencoder, which takes the SMILES grammar into
account and is guaranteed to produce syntactically valid SMILES, has been proposed to alleviate
this issue [421].
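Once such an autoencoder is trained, the interpolation described above is simple vector arithmetic in the latent space. In the sketch below, encode and decode are hypothetical stand-ins for a trained model's encoder and decoder.

```python
# Linear interpolation between two chemicals in a learned continuous
# latent space; this has no analogue for discrete strings or bit vectors.
import numpy as np

def interpolate(z_start, z_end, steps=10):
    return [z_start + t * (z_end - z_start) for t in np.linspace(0.0, 1.0, steps)]

# z_a, z_b = encode(smiles_a), encode(smiles_b)   # hypothetical trained model
# candidates = [decode(z) for z in interpolate(z_a, z_b)]
# Some decoded strings may not be valid SMILES, the drawback noted above.
path = interpolate(np.zeros(56), np.ones(56))     # toy 56-dimensional example
```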
An alternative approach trains character-based recurrent neural networks on large collections of
molecules; these models learn the grammar of SMILES representations, with 94% [423] or nearly
98% [420] of generated SMILES corresponding to valid
molecular structures. The initial RNN is then fine-tuned to generate molecules that are likely to be
active against a specific target by either continuing training on a small set of positive examples
[420] or adopting reinforcement learning strategies [423,424]. Both the fine-tuning and
reinforcement learning approaches can rediscover known, held-out active molecules. The great
flexibility of neural networks and the progress in generative models offer many opportunities for
deep architectures in de novo design (e.g. the adaptation of GANs for molecules).
Discussion
Despite the disparate types of data and scientific goals in the learning tasks covered above,
several challenges are broadly important for deep learning in the biomedical domain. Here we
examine these factors that may impede further progress, ask what steps have already been taken
to overcome them, and suggest future research directions.
Although the bias-variance tradeoff is common to all machine learning applications, recent
empirical and theoretical observations suggest that deep learning models may have uniquely
advantageous generalization properties [425,426]. Nevertheless, additional advances will be
needed to establish a coherent theoretical foundation that enables practitioners to better reason
about their models from first principles.
Making predictions in the presence of high class imbalance and differences between training and
generalization data is a common feature of many large biomedical datasets, including deep
learning models of genomic features, patient classification, disease detection, and virtual
screening. Prediction of transcription factor binding sites exemplifies the difficulties with learning
from highly imbalanced data. The human genome has 3 billion base pairs, and only a small fraction
of them are implicated in specific biochemical activities. Less than 1% of the genome can be
confidently labeled as bound for most transcription factors.
Estimating the false discovery rate (FDR) is a standard method of evaluation in genomics that can
also be applied to deep learning model predictions of genomic features. Using deep learning
predictions for targeted validation experiments of specific biochemical activities necessitates a
more stringent FDR (typically 5–25%). However, when predicted biochemical activities are used as
features in other models, such as gene expression models, a low FDR may not be necessary.
What is the correspondence between FDR metrics and commonly used classification metrics such
as AUPR and AUROC? AUPR evaluates the average precision, or equivalently, the average FDR
across all recall thresholds. This metric provides an overall estimate of performance across all
possible use cases, which can be misleading for targeted validation experiments. For example,
classification of TF binding sites can exhibit a recall of 0% at 10% FDR and AUPR greater than 0.6.
In this case, the AUPR may be competitive, but the predictions are ill-suited for targeted validation
that can only examine a few of the highest-confidence predictions. Likewise, AUROC evaluates the
average recall across all false positive rate (FPR) thresholds, which is often a highly misleading
metric in class-imbalanced domains [70,427]. Consider a classification model with recall of 0% at
FDR less than 25% and 100% recall at FDR greater than 25%. In the context of TF binding
predictions where only 1% of genomic regions are bound by the TF, this is equivalent to a recall of
100% for FPR greater than 0.33%. In other words, the AUROC would be 0.9967, but the classifier
would be useless for targeted validation. It is not unusual to obtain a chromosome-wide AUROC
greater than 0.99 for TF binding predictions but a recall of 0% at 10% FDR. Consequently,
practitioners must select the metric most tailored to their subsequent use case to use these
methods most effectively.
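The sketch below makes this contrast concrete on simulated scores with roughly 1% positives: AUROC looks excellent while recall at a 10% FDR is far lower. The labels, scores, and class separation are synthetic and purely illustrative.

```python
# Comparing AUROC, AUPR, and recall at 10% FDR (i.e. 90% precision)
# on the same imbalanced synthetic data.
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

rng = np.random.default_rng(0)
y = rng.random(100_000) < 0.01            # ~1% positives, as in TF binding
scores = rng.normal(loc=3.0 * y, scale=1.0)

precision, recall, _ = precision_recall_curve(y, scores)
fdr = 1.0 - precision
recall_at_10_fdr = recall[fdr <= 0.10].max()

print("AUROC:", roc_auc_score(y, scores))           # can look excellent
print("AUPR:", average_precision_score(y, scores))  # averages FDR over all recalls
print("Recall at 10% FDR:", recall_at_10_fdr)       # what targeted validation needs
```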
Genome-wide continuous signals are commonly formulated into classification labels through signal
peak detection. ChIP-seq peaks are used to identify locations of TF binding and histone
modifications. Such procedures rely on thresholding criteria to define what constitutes a peak in the
signal. This inevitably results in a set of signal peaks that are close to the threshold, not sufficient
to constitute a positive label but too similar to positively labeled examples to constitute a negative
label. To avoid an arbitrary label for these examples they may be labeled as “ambiguous”.
Ambiguously labeled examples can then be ignored during model training and evaluation of recall
and FDR. The correlation between model predictions on these examples and their signal values
can be used to evaluate if the model correctly ranks these examples between positive and
negative examples.
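A minimal sketch of this three-way labeling, with illustrative thresholds: near-threshold regions receive an ambiguous label and are masked out of training and evaluation.

```python
# Label genomic regions from a continuous signal: confident positives,
# confident negatives, and an "ambiguous" class (-1) that is masked out.
import numpy as np

def label_regions(signal, pos_threshold=10.0, neg_threshold=2.0):
    labels = np.full(signal.shape, -1, dtype=int)  # -1 marks "ambiguous"
    labels[signal >= pos_threshold] = 1
    labels[signal <= neg_threshold] = 0
    return labels

signal = np.array([0.5, 1.8, 6.0, 12.3, 9.5, 0.2])
labels = label_regions(signal)
mask = labels != -1                  # train only on confident examples
x_train, y_train = signal[mask], labels[mask]
```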
In assessing the upper bound on the predictive performance of a deep learning model, it is
necessary to account for the between-study variation inherent to biomedical research [428].
Study-level variability limits classification performance and can lead to underestimating prediction
error if the generalization error is estimated by splitting a single dataset. Analyses can incorporate
data from multiple labs and experiments to capture between-study variation within the prediction
model, mitigating some of these issues.
Uncertainty quantification
Deep learning based solutions for biomedical applications could substantially benefit from
guarantees on the reliability of predictions and a quantification of uncertainty. Due to biological
variability and precision limits of equipment, biomedical data do not consist of precise
measurements but of estimates with noise. Hence, it is crucial to obtain uncertainty measures that
capture how noise in input values propagates through deep neural networks. Such measures can
be used for reliability assessment of automated decisions in clinical and public health applications,
and for guarding against model vulnerabilities in the face of rare or adversarial cases [429].
Moreover, in fundamental biological research, measures of uncertainty help researchers distinguish
between true regularities in the data and patterns that are false or merely anecdotal. There are two
main uncertainties that one can calculate: epistemic and aleatoric [430]. Epistemic uncertainty
describes uncertainty about the model, its structure, or its parameters. This uncertainty is caused
by insufficient training data or by a difference in the training set and testing set distributions, so it
vanishes in the limit of infinite data. On the other hand, aleatoric uncertainty describes uncertainty
inherent in the observations. This uncertainty is due to noisy or missing data, so it vanishes with
the ability to observe all independent variables with infinite precision. A good way to represent
aleatoric uncertainty is to design an appropriate loss function with an uncertainty variable. In the
case of data-dependent aleatoric uncertainty, one can train the model to increase its uncertainty
when it is incorrect due to noisy or missing data, and in the case of task-dependent aleatoric
uncertainty, one can optimize for the best uncertainty parameter for each task [431]. Meanwhile,
there are various methods for modeling epistemic uncertainty, outlined below.
In classification tasks, confidence calibration is the problem of using classifier scores to predict
class membership probabilities that match the true membership likelihoods. These membership
probabilities can be used to assess the uncertainty associated with assigning the example to each
of the classes. Guo et al. [432] observed that contemporary neural networks are poorly calibrated
and provided a simple recommendation for calibration: temperature scaling, a single parameter
special case of Platt scaling [433]. In addition to confidence calibration, there is early work from
Chryssolouris et al. [434] that described a method for obtaining confidence intervals with the
assumption of normally distributed error for the neural network. More recently, Hendrycks and
Gimpel discovered that incorrect or out-of-distribution examples usually have lower maximum
softmax probabilities than correctly classified examples, allowing for effective detection of
misclassified examples [435]. Liang et al. used temperature scaling and small perturbations to
further separate the softmax scores of correctly classified examples and the scores of out-of-
distribution examples, allowing for more effective detection [436]. This approach outperformed the
baseline approaches by a large margin, establishing a new state-of-the-art performance.
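A minimal sketch of temperature scaling as just described: a single scalar T is fit on held-out logits by minimizing the negative log likelihood, then divides test-time logits, softening probabilities without changing the predicted classes. The tensors here are random placeholders.

```python
# Fit a single temperature on validation logits, then calibrate new
# predictions with softmax(logits / T).
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

val_logits = torch.randn(500, 10)          # placeholder held-out logits
val_labels = torch.randint(0, 10, (500,))  # placeholder held-out labels
T = fit_temperature(val_logits, val_labels)
calibrated = F.softmax(torch.randn(8, 10) / T, dim=1)
```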
An alternative approach for obtaining principled uncertainty estimates from deep learning models is
to use Bayesian neural networks. Deep learning models are usually trained to obtain the most
likely parameters given the data. However, choosing the single most likely set of parameters
ignores the uncertainty about which set of parameters (among the possible models that explain the
given dataset) should be used. This can lead to overconfident predictions, in which the chosen
parameters produce high-confidence but incorrect results. On the other hand, the parameters
of Bayesian neural networks are modeled as full probability distributions. This Bayesian approach
comes with a whole host of benefits, including better calibrated confidence estimates [437] and
more robustness to adversarial and out-of-distribution examples [438]. Unfortunately, modeling the
full posterior distribution for the model’s parameters given the data is usually computationally
intractable. One popular method for circumventing this high computational cost is called test-time
dropout [439], where an approximate posterior distribution is obtained using variational inference.
Gal and Ghahramani showed that a stack of fully connected layers with dropout between the layers
is equivalent to approximate inference in a Gaussian process model [439]. The authors interpret
dropout as a variational inference method and apply their method to convolutional neural networks.
This is simple to implement and preserves the possibility of obtaining cheap samples from the
approximate posterior distribution. Operationally, obtaining model uncertainty for a given case
becomes as straightforward as leaving dropout turned on and predicting multiple times. The spread
of the different predictions is a reasonable proxy for model uncertainty. This technique has been
successfully applied in an automated system for detecting diabetic retinopathy [440], where
uncertainty-informed referrals improved diagnostic performance and allowed the model to meet the
National Health Service recommended levels of sensitivity and specificity. The authors also found
that entropy performs comparably to the spread obtained via test-time dropout for identifying
uncertain cases, and therefore it can be used instead for automated referrals.
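Operationally, test-time dropout requires only a few lines, as in the PyTorch sketch below; the architecture is a placeholder, and the number of stochastic passes is an illustrative choice.

```python
# Monte Carlo dropout: keep dropout active at prediction time and use the
# spread of repeated stochastic forward passes as an uncertainty proxy.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(128, 2))
model.train()  # leaves dropout "on" even while predicting

x = torch.rand(1, 64)
with torch.no_grad():
    samples = torch.stack([model(x).softmax(dim=1) for _ in range(100)])

mean_prediction = samples.mean(dim=0)  # the prediction itself
uncertainty = samples.std(dim=0)       # spread across stochastic passes
```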
Several other techniques have been proposed for effectively estimating predictive uncertainty as
uncertainty quantification for neural networks continues to be an active research area. Recently,
McClure and Kriegeskorte observed that test-time sampling improved calibration of the probabilistic
predictions, sampling weights led to more robust uncertainty estimates than sampling units, and
spike-and-slab sampling is superior to Gaussian dropconnect and Bernoulli dropout [441]. Krueger
et al. introduced Bayesian hypernetworks [442] as another framework for approximate Bayesian
inference in deep learning, where an invertible generative hypernetwork maps isotropic Gaussian
noise to parameters of the primary network allowing for computationally cheap sampling and
efficient estimation of the posterior. Meanwhile, Lakshminarayanan et al. proposed using deep
ensembles, which are traditionally used for boosting predictive performance, on standard (non-
Bayesian) neural networks to obtain well-calibrated uncertainty estimates that are comparable to
those obtained by Bayesian neural networks [443]. In cases where model uncertainty is known to
be caused by a difference in training and testing distributions, domain adaptation based techniques
can help mitigate the problem [218].
Despite the success and popularity of deep learning, some deep learning models can be
surprisingly brittle. Researchers are actively working on modifications to deep learning frameworks
to enable them to handle probability and embrace uncertainty. Most notably, Bayesian modeling
and deep learning are being integrated with renewed enthusiasm. As a result, several opportunities
for innovation arise: understanding the causes of model uncertainty can lead to novel optimization
and regularization techniques, assessing the utility of uncertainty estimation techniques on various
model architectures and structures can be very useful to practitioners, and extending Bayesian
deep learning to unsupervised settings can be a significant breakthrough [444]. Unfortunately,
uncertainty quantification techniques are underutilized in the computational biology communities
and largely ignored in the current deep learning for biomedicine literature. Thus, the practical value
of uncertainty quantification in biomedical domains is yet to be appreciated.
Interpretation
As deep learning models achieve state-of-the-art performance in a variety of domains, there is a
growing need to make the models more interpretable. Interpretability matters for two main reasons.
First, a model that achieves breakthrough performance may have identified patterns in the data
that practitioners in the field would like to understand. However, this would not be possible if the
model is a black box. Second, interpretability is important for trust. If a model is making medical
diagnoses, it is important to ensure the model is making decisions for reliable reasons and is not
focusing on an artifact of the data. A motivating example of this can be found in Ba and Caruana
[445], where a model trained to predict the likelihood of death from pneumonia assigned lower risk
to patients with asthma, but only because such patients were treated as higher priority by the
hospital. In the context of deep learning, understanding the basis of a model’s output is particularly
important as deep learning models are unusually susceptible to adversarial examples [446] and
can output confidence scores over 99.99% for samples that resemble pure noise.
As the concept of interpretability is quite broad, many methods described as improving the
interpretability of deep learning models take disparate and often complementary approaches.
Several approaches ascribe importance on an example-specific basis to the parts of the input that
are responsible for a particular output. These can be broadly divided into perturbation-based
approaches and backpropagation-based approaches.
Perturbation-based approaches change parts of the input and observe the impact on the output of
the network. Alipanahi et al. [201] and Zhou & Troyanskaya [209] scored genomic sequences by
introducing virtual mutations at individual positions in the sequence and quantifying the change in
the output. Umarov et al. [222] used a similar strategy, but with sliding windows where the
sequence within each sliding window was substituted with a random sequence. Kelley et al. [227]
inserted known protein-binding motifs into the centers of sequences and assessed the change in
predicted accessibility. Ribeiro et al. [447] introduced LIME, which constructs a linear model to
locally approximate the output of the network on perturbed versions of the input and assigns
importance scores accordingly. For analyzing images, Zeiler and Fergus [448] applied constant-
value masks to different input patches. More recently, marginalizing over the plausible values of an
input has been suggested as a way to more accurately estimate contributions [449].
A related approach [450] directly optimizes the perturbation to maximally decrease the score of the
selected class. This method converges in many fewer iterations but requires the perturbation to
have a differentiable form.
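A minimal sketch of this in silico mutagenesis for sequence models: every single-nucleotide substitution is applied in turn and the change in model output is recorded. Here predict stands in for any trained sequence model; the toy motif detector in the example is purely illustrative.

```python
# Perturbation-based importance: score each position by how much single-
# nucleotide substitutions there change the model output.
import numpy as np

BASES = "ACGT"

def mutagenesis_scores(sequence, predict):
    baseline = predict(sequence)
    scores = np.zeros((len(sequence), len(BASES)))
    for pos in range(len(sequence)):
        for j, base in enumerate(BASES):
            if base != sequence[pos]:
                mutant = sequence[:pos] + base + sequence[pos + 1:]
                scores[pos, j] = predict(mutant) - baseline
    return scores  # large |score| marks positions the model relies on

# Toy stand-in model that "detects" a TATA motif:
scores = mutagenesis_scores("GGTATAGG", lambda s: float("TATA" in s))
```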
Backpropagation-based methods, in which the signal from a target output neuron is propagated
backwards to the input layer, are another way to interpret deep networks that sidestep
inefficiencies of the perturbation-based methods. A classic example of this is calculating the
gradients of the output with respect to the input [451] to compute a “saliency map”. Bach et al.
[452] proposed a strategy called Layerwise Relevance Propagation, which was shown to be
equivalent to the element-wise product of the gradient and input [219,453]. Networks with Rectified
Linear Units (ReLUs) create nonlinearities that must be addressed. Several variants exist for
handling this [448,454]. Backpropagation-based methods are a highly active area of research.
Researchers are still actively identifying weaknesses [455], and new methods are being developed
to address them [219,456,457]. Lundberg and Lee [458] noted that several importance scoring
methods including integrated gradients and LIME could all be considered approximations to
Shapely values [459], which have a long history in game theory for assigning contributions to
players in cooperative games.
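A minimal PyTorch sketch of the saliency map and the gradient-times-input variant mentioned above, using a random placeholder network and input.

```python
# Backpropagation-based importance: gradient of the predicted class score
# with respect to the input.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 2))
model.eval()

x = torch.rand(1, 100, requires_grad=True)
logits = model(x)
logits[0, logits.argmax()].backward()  # score of the predicted class

saliency = x.grad.abs().squeeze()                   # per-feature importance
grad_times_input = (x.grad * x.detach()).squeeze()  # element-wise product variant
```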
Another approach to understanding the network’s predictions is to find artificial inputs that produce
similar hidden representations to a chosen example. This can elucidate the features that the
network uses for prediction and drop the features that the network is insensitive to. In the context
of natural images, Mahendran and Vedaldi [460] introduced the “inversion” visualization, which
uses gradient descent and backpropagation to reconstruct the input from its hidden representation.
The method required placing a prior on the input to favor results that resemble natural images. For
genomic sequence, Finnegan and Song [461] used a Markov chain Monte Carlo algorithm to find
the maximum-entropy distribution of inputs that produced a similar hidden representation to the
chosen input.
A related idea is “caricaturization”, where an initial image is altered to exaggerate patterns that the
network searches for [462]. This is done by maximizing the response of neurons that are active in
the network, subject to some regularizing constraints. Mordvintsev et al. [463] leveraged
caricaturization to generate aesthetically pleasing images using neural networks.
Activation maximization
Activation maximization can reveal patterns detected by an individual neuron in the network by
generating images which maximally activate that neuron, subject to some regularizing constraints.
This technique was first introduced by Erhan et al. [464] and applied in subsequent work
[451,462,463,465]. Lanchantin et al. [204] applied class-based activation maximization to genomic
sequence data. One drawback of this approach is that neural networks often learn highly
distributed representations where several neurons cooperatively describe a pattern of interest.
Thus, visualizing patterns learned by individual neurons may not always be informative.
RNN-specific approaches
Several interpretation methods are specifically tailored to recurrent neural network architectures.
The most common form of interpretability provided by RNNs is through attention mechanisms,
which have been used in diverse problems such as image captioning and machine translation to
select portions of the input to focus on generating a particular output [466,467]. Deming et al. [468]
applied the attention mechanism to models trained on genomic sequence. Attention mechanisms
provide insight into the model’s decision-making process by revealing which portions of the input
are used by different outputs. Singh et al. used a hierarchy of attention layers to locate important
genome positions and signals for predicting gene expression from histone modifications [183]. In
the clinical domain, Choi et al. [469] leveraged attention mechanisms to highlight which aspects of
a patient’s medical history were most relevant for making diagnoses. Choi et al. [470] later
extended this work to take into account the structure of disease ontologies and found that the
concepts represented by the model aligned with medical knowledge. Note that interpretation
strategies that rely on an attention mechanism do not provide insight into the logic used by the
attention layer.
Visualizing the activation patterns of the hidden state of a recurrent neural network can also be
instructive. Early work by Ghosh and Karamcheti [471] used cluster analysis to study hidden states
of comparatively small networks trained to recognize strings from a finite state machine. More
recently, Karpathy et al. [472] showed the existence of individual cells in LSTMs that kept track of
quotes and brackets in character-level language models. To facilitate such analyses, LSTMVis
[473] allows interactive exploration of the hidden state of LSTMs on different inputs.
Another strategy, adopted by Lanchantin et al. [204], looks at how the output of a recurrent neural
network changes as longer and longer subsequences are supplied as input to the network, where
the subsequences begin with just the first position and end with the entire sequence. In a binary
classification task, this can identify those positions which are responsible for flipping the output of
the network from negative to positive. If the RNN is bidirectional, the same process can be
repeated on the reverse sequence. As noted by the authors, this approach was less effective at
identifying motifs compared to the gradient-based backpropagation approach of Simonyan et al.
[451], illustrating the need for more sophisticated strategies to assign importance scores in
recurrent neural networks.
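A minimal sketch of this prefix-scoring idea; predict_proba stands in for a trained RNN classifier (hypothetical here), and the toy scoring function exists only to make the snippet runnable.

```python
# Feed progressively longer prefixes and locate where the output jumps.
def prefix_scores(sequence, predict_proba):
    scores = [predict_proba(sequence[:end]) for end in range(1, len(sequence) + 1)]
    deltas = [scores[0]] + [b - a for a, b in zip(scores, scores[1:])]
    return scores, deltas  # large deltas mark positions that flip the output

scores, deltas = prefix_scores("ACGTTGCA", lambda s: s.count("T") / 8)
```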
Murdoch and Szlam [474] showed that the output of an LSTM can be decomposed into a product
of factors, where each factor can be interpreted as the contribution at a particular timestep. The
contribution scores were then used to identify key phrases from a model trained for sentiment
analysis and obtained superior results compared to scores derived via a gradient-based approach.
Latent space manipulation
Interpreting the latent spaces learned by generative models can reveal patterns that are otherwise
masked in the original input. For example, Way and Greene trained a variational autoencoder
(VAE) on gene expression from The Cancer Genome Atlas (TCGA) [476] and used latent space
arithmetic to rapidly isolate and interpret gene expression features descriptive of high grade serous
ovarian cancer subtypes [477].
The most differentiating VAE features were representative of biological processes that are known
to distinguish the subtypes. Latent space arithmetic with features derived using other compression
algorithms were not as informative in this context [478]. Embedding discrete chemical structures
with autoencoders and interpreting the learned continuous representations with latent space
arithmetic has also facilitated predicting drug-like compounds [398]. Furthermore, embedding
biomedical text into lower-dimensional latent spaces has improved named entity recognition in a
variety of tasks, including annotating clinical abbreviations, genes, cell lines, and drug names
[75–78].
Other approaches have used interpolation through latent space embeddings learned by GANs to
interpret unobserved intermediate states. For example, Osokin et al. trained GANs on two-channel
fluorescent microscopy images to interpret intermediate states of protein localization in yeast cells
[479]. Goldsborough et al. trained a GAN on fluorescent microscopy images and used latent space
interpolation and arithmetic to reveal underlying responses to small molecule perturbations in cell
lines [480].
Miscellaneous approaches
It can often be informative to understand how the training data affects model learning. Toward this
end, Koh and Liang [481] used influence functions, a technique from robust statistics, to trace a
model’s predictions back through the learning algorithm to identify the datapoints in the training set
that had the most impact on a given prediction. A more free-form approach to interpretability is to
visualize the activation patterns of the network on individual inputs and on subsets of the data.
ActiVis and CNNvis [482,483] are two frameworks that enable interactive visualization and
exploration of large-scale deep learning models. An orthogonal strategy is to use a knowledge
distillation approach to replace a deep learning model with a more interpretable model that
achieves comparable performance. Towards this end, Che et al. [484] used gradient boosted trees
to learn interpretable healthcare features from trained deep models.
Finally, it is sometimes possible to train the model to provide justifications for its predictions. Lei et
al. [485] used a generator to identify “rationales”, which are short and coherent pieces of the input
text that produce similar results to the whole input when passed through an encoder. The authors
applied their approach to a sentiment analysis task and obtained substantially superior results
compared to an attention-based method.
Future outlook
While deep learning lags behind most Bayesian models in terms of interpretability, the
interpretability of deep learning is comparable to or exceeds that of many other widely-used
machine learning methods such as random forests or SVMs. Just as it is possible to obtain
importance scores for different inputs in a random forest, the same is true for deep learning.
Similarly, SVMs trained with a nonlinear kernel are not easily interpretable because the use of the
kernel means that one does not obtain an explicit weight matrix. Finally, it is worth noting that some
simple machine learning methods are less interpretable in practice than one might expect. A linear
model trained on heavily engineered features might be difficult to interpret as the input features
themselves are difficult to interpret. Similarly, a decision tree with many nodes and branches may
also be difficult for a human to make sense of.
There are several directions that might benefit the development of interpretability techniques. The
first is the introduction of gold standard benchmarks that different interpretability approaches could
be compared against, similar in spirit to how the ImageNet [45] and CIFAR [486] datasets spurred
the development of deep learning for computer vision. It would also be helpful if the community
placed more emphasis on domains outside of computer vision. Computer vision is often used as
the example application of interpretability methods, but it is not the domain with the most pressing
need. Finally, closer integration of interpretability approaches with popular deep learning
frameworks would make it easier for practitioners to apply and experiment with different
approaches to understanding their deep learning models.
Data limitations
A lack of large-scale, high-quality, correctly labeled training data has impacted deep learning in
nearly all applications we have discussed. The challenges of training complex, highly parameterized
neural networks from few examples are obvious, but uncertainty in the labels of those examples
can be just as problematic. In genomics, labeled data may be derived from an experimental assay
with known and unknown technical artifacts, biases, and error profiles. It is possible to weight
training examples or construct Bayesian models to account for uncertainty or non-independence in
the data, as described in the TF binding example above. As another example, Park et al. [487]
estimated shared non-biological signal between datasets to correct for non-independence related
to assay platform or other factors in a Bayesian integration of many datasets. However, such
techniques are rarely placed front and center in any description of methods and may be easily
overlooked.
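A minimal sketch of one such technique, per-example weighting: scale each training example's contribution to the loss by a confidence derived from the assay (all values below are illustrative).

```python
import numpy as np

# Down-weight training examples with uncertain labels, e.g. from a noisy
# assay, by scaling each example's contribution to the loss.
def weighted_cross_entropy(p_pred, y, label_confidence):
    eps = 1e-12
    ce = -(y * np.log(p_pred + eps) + (1 - y) * np.log(1 - p_pred + eps))
    return np.mean(label_confidence * ce)

y = np.array([1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.6])
conf = np.array([1.0, 0.5, 0.8])   # assay-derived confidence in each label
print(weighted_cross_entropy(p_pred, y, conf))
```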
For some types of data, especially images, it is straightforward to augment training datasets by
deriving multiple examples from a single labeled example. For example, an image can easily be
rotated, flipped, or translated and retain its label [42]. 3D MRI and 4D fMRI (with time as a
dimension) data can be decomposed into sets of 2D images [488]. This can greatly expand the
number of training examples but artificially treats such derived images as independent instances
and sacrifices the structure inherent in the data. CellCnn trains a model to recognize rare cell
populations in single-cell data by creating training instances that consist of subsets of cells that are
randomly sampled with replacement from the full dataset [297].
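A sketch of label-preserving augmentation for images (illustrative; real pipelines typically also use translations, crops, and intensity jitter):

```python
import numpy as np

# Each labeled image yields several rotated/flipped copies that share the
# original's label, expanding the training set without new experiments.
def augment(image):
    variants = [np.rot90(image, k) for k in range(4)]   # 0/90/180/270 degrees
    variants += [np.fliplr(v) for v in variants]        # mirrored versions
    return variants

img = np.random.rand(64, 64)      # stand-in for a labeled microscopy image
print(len(augment(img)))          # 8 training instances from one example
```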
Simulated or semi-synthetic training data have been employed in multiple biomedical domains,
though many of these ideas are not specific to deep learning. Training and evaluating on simulated
data, for instance, generating synthetic TF binding sites with position weight matrices [207] or
RNA-seq reads for predicting mRNA transcript boundaries [489], is a standard practice in
bioinformatics. This strategy can help benchmark algorithms when the available gold standard
dataset is imperfect, but it should be paired with an evaluation on real data, as in the prior
examples [207,489]. In rare cases, models trained on simulated data have been successfully
applied directly to real data [489].
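As a sketch of the PWM-based simulation strategy (the matrix below is an illustrative toy, not a real motif), synthetic positive examples can be generated by sampling a binding site from the PWM and embedding it in random background sequence:

```python
import numpy as np

# Simulate synthetic TF binding sites from a position weight matrix (PWM):
# sample one base per position according to its probabilities, then embed
# the site in random background sequence.
rng = np.random.default_rng(0)
pwm = np.array([[0.80, 0.10, 0.05, 0.05],   # illustrative 3-position PWM;
                [0.10, 0.70, 0.10, 0.10],   # columns correspond to A, C, G, T
                [0.05, 0.05, 0.10, 0.80]])

def sample_site(pwm):
    bases = np.array(list("ACGT"))
    return "".join(rng.choice(bases, p=row) for row in pwm)

background = "".join(rng.choice(list("ACGT"), size=20))
positive_example = background[:10] + sample_site(pwm) + background[10:]
print(positive_example)
```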
Data can be simulated to create negative examples when only positive training instances are
available. DANN [34] adopts this approach to predict the pathogenicity of genetic variants using
semi-synthetic training data from Combined Annotation-Dependent Depletion (CADD) [490].
Though our emphasis here is on the training strategy, it should be noted that logistic regression
outperformed DANN when distinguishing known pathogenic mutations from likely benign variants
in real data. Similarly, a somatic mutation caller has been trained by injecting mutations into real
sequencing datasets [342]. This method detected mutations in other semi-synthetic datasets but
was not validated on real data.
In settings where the experimental observations are biased toward positive instances, such as
MHC protein and peptide ligand binding affinity [270], or the negative instances vastly outnumber
the positives, such as high-throughput chemical screening [395], training datasets have been
augmented by adding additional instances and assuming they are negative. There is some
evidence that this can improve performance [395], but in other cases it was only beneficial when
the real training datasets were extremely small [270]. Overall, training with simulated and semi-
synthetic data is a valuable idea for overcoming limited sample sizes but one that requires more
rigorous evaluation on real ground-truth datasets before we can recommend it for widespread use.
There is a risk that a model will easily discriminate synthetic examples but not generalize to real
data.
Multimodal, multi-task, and transfer learning, discussed in detail below, can also combat data
limitations to some degree. There are also emerging network architectures, such as Diet Networks
for high-dimensional SNP data [491]. These flip the problem: instead of learning the first-layer
weights directly, an auxiliary network predicts the weight vector for each input (SNP) from a per-
feature embedding. This embedding (e.g. from principal component analysis, per-class histograms,
or a Word2vec [102] generalization) can be learned directly from the input data or take advantage
of other datasets or domain knowledge. In this auxiliary task the features act as the examples, an
important advantage when it is typical to have 500 thousand or more SNPs but only a few thousand
patients. Because the embedding has a much lower dimension than the number of SNPs, the
number of free parameters drops drastically; in the example given, it fell from 30 million to 50
thousand, a factor of 600.
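A minimal sketch of this parameter-prediction idea (the class, names, and sizes are illustrative assumptions, not the published implementation):

```python
import torch
import torch.nn as nn

# Diet Networks-style sketch: an auxiliary network predicts the main network's
# first-layer weights from a fixed per-feature (per-SNP) embedding, so the
# free parameter count no longer scales with the number of SNPs.
class DietNet(nn.Module):
    def __init__(self, feature_embedding, hidden=100, n_classes=26):
        super().__init__()
        # (n_features, emb_dim) embedding, e.g. per-class histograms per SNP
        self.register_buffer("feat_emb", feature_embedding)
        self.aux = nn.Linear(feature_embedding.shape[1], hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, n_features) genotypes
        W = self.aux(self.feat_emb)       # predicted (n_features, hidden) weights
        return self.out(torch.relu(x @ W))

emb = torch.randn(5000, 100)              # stand-in per-SNP embedding
model = DietNet(emb)                      # free parameters scale with emb_dim, not n_features
```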
Hardware limitations and scaling
Training neural networks and using them to make predictions carries high computational costs in
time, memory, and energy. Many have sought to curb these costs, with methods ranging from the
very applied (e.g. reduced numerical precision [493–496]) to the exotic and theoretic (e.g. training
small networks to mimic
large networks and ensembles [445,497]). The largest gains in efficiency have come from
computation with GPUs [492,498–502], which excel at the matrix and vector operations so central
to deep learning. The massively parallel nature of GPUs allows additional optimizations, such as
accelerated mini-batch gradient descent [499,500,503,504]. However, GPUs also have limited
memory, making networks of useful size and complexity difficult to implement on a single GPU or
machine [66,498]. This restriction has sometimes forced computational biologists to use
workarounds or limit the size of an analysis. Chen et al. [181] inferred the expression level of all
genes with a single neural network, but due to memory restrictions they randomly partitioned genes
into two separately analyzed halves. In other cases, researchers limited the size of their neural
network [28] or the total number of training instances [398]. Some have also chosen to use
standard central processing unit (CPU) implementations rather than sacrifice network size or
performance [505].
While steady improvements in GPU hardware may alleviate this issue, it is unclear whether
advances will occur quickly enough to keep pace with the growing biological datasets and
increasingly complex neural networks. Much has been done to minimize the memory requirements
of neural networks [445,493–496,506,507], but there is also growing interest in specialized
hardware, such as field-programmable gate arrays (FPGAs) [502,508] and application-specific
integrated circuits (ASICs) [509]. Less software is available for such highly specialized hardware
[508], but it promises improvements in deep learning at reduced cost in time, energy, and memory
[502]. Specialized hardware may be a difficult investment for those not solely
interested in deep learning, but for those with a deep learning focus these solutions may become
popular.
Distributed computing is a general solution to intense computational requirements and has enabled
many large-scale deep learning efforts. Some types of distributed computation [510,511] are not
suitable for deep learning [512], but much progress has been made. There now exist a number of
algorithms [495,512,513], tools [514–516], and high-level libraries [517,518] for deep learning in a
distributed environment, and it is possible to train very complex networks with limited infrastructure
[519]. Besides handling very large networks, distributed or parallelized approaches offer other
advantages, such as improved ensembling [520] or accelerated hyperparameter optimization
[521,522].
Cloud computing, which has already seen wide adoption in genomics [523], could facilitate easier
sharing of the large datasets common to biology [524,525], and may be key to scaling deep
learning. Cloud computing affords researchers flexibility and enables the use of specialized
hardware (e.g. FPGAs, ASICs, GPUs) without major upfront investment, making it easier to
experiment with the many available layer types and architectures [526]. Though many are reluctant
to store sensitive data (e.g. patient electronic health
records) in the cloud, secure, regulation-compliant cloud services do exist [527].
Data, code, and model sharing
A robust culture of data, code, and model sharing would speed progress in this domain. Changing
the incentives around data sharing may be the most effective way of encouraging scientists to
share their hard-won data. It’s precisely those data that would help to
power deep learning in the domain. Efforts are underway to recognize those who promote an
ecosystem of rigorous sharing and analysis [529].
The sharing of high-quality, labeled datasets will be especially valuable. In addition, researchers
who invest time to preprocess datasets to be suitable for deep learning can make the
preprocessing code (e.g. Basset [227] and variationanalysis [340]) and cleaned data
(e.g. MoleculeNet [407]) publicly available to catalyze further research. However, there are
complex privacy and legal issues involved in sharing patient data that cannot be ignored. Solving
these issues will require increased understanding of privacy risks and standards specifying
acceptable levels. In some domains high-quality training data has been generated privately,
e.g. high-throughput chemical screening data at pharmaceutical companies. One perspective is that
there is little expectation or incentive for this private data to be shared. However, data are not
inherently valuable. Instead, the insights that we glean from them are where the value lies. Private
companies may establish a competitive advantage by releasing data sufficient for improved
methods to be developed. Recently, Ramsundar et al. did this with the open source platform
DeepChem, releasing four privately generated datasets [530].
Code sharing and open source licensing are essential for continued progress in this domain. We
strongly advocate following established best practices for sharing source code, archiving code in
repositories that generate digital object identifiers, and open licensing [531] regardless of the
minimal requirements, or lack thereof, set by journals, conferences, or preprint servers. In addition,
it is important for authors to share not only code for their core models but also scripts and code
used for data cleaning (see above) and hyperparameter optimization. These improve
reproducibility and serve as documentation of the detailed decisions that impact model
performance but may not be exhaustively captured in a manuscript’s methods text.
Because many deep learning models are often built using one of several popular software
frameworks, it is also possible to directly share trained predictive models. The availability of pre-
trained models can accelerate research, with image classifiers as an apt example. A pre-trained
neural network can be quickly fine-tuned on new data and used in transfer learning, as discussed
below. Taking this idea to the extreme, genomic data has been artificially encoded as images in
order to benefit from pre-trained image classifiers [338]. “Model zoos”—collections of pre-trained
models—are not yet common in biomedical domains but have started to appear in genomics
applications [293,532]. However, it is important to note that sharing models trained on data from
individuals requires great care because deep learning models can be attacked to identify examples
used in training. One possible solution to protect individual samples is to train models under
differential privacy [152], which has been used in the biomedical domain [155]. We discussed this
issue as well as recent techniques to mitigate these concerns in the patient categorization section.
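As a sketch of the core mechanism behind differentially private training (in the spirit of DP-SGD rather than any specific published implementation; all values are illustrative), each example's gradient is clipped and calibrated noise is added before the update, so that no single training example dominates:

```python
import numpy as np

# One differentially private aggregation step: clip each example's gradient,
# average, and add Gaussian noise scaled to the clipping norm.
def private_gradient(per_example_grads, clip_norm=1.0, noise_mult=1.1,
                     rng=np.random.default_rng(0)):
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean = np.mean(clipped, axis=0)
    sigma = noise_mult * clip_norm / len(per_example_grads)
    return mean + rng.normal(0.0, sigma, size=mean.shape)

grads = [np.random.rand(10) for _ in range(32)]  # stand-in per-example gradients
update = private_gradient(grads)
```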
DeepChem [403,407,409] and DragoNN [532] exemplify the benefits of sharing pre-trained models
and code under an open source license. DeepChem, which targets drug discovery and quantum
chemistry, has actively encouraged and received community contributions of learning algorithms
and benchmarking datasets. As a consequence, it now supports a large suite of machine learning
approaches, both deep learning and competing strategies, that can be run on diverse test cases.
This realistic, continual evaluation will play a critical role in assessing which techniques are most
promising for chemical screening and drug discovery. Like formal, organized challenges such as
the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge [213],
DeepChem provides a forum for the fair, critical evaluations that are not always conducted in
individual methodological papers, which can be biased in favor of a newly proposed algorithm.
Likewise, DragoNN (Deep RegulAtory GenOmic Neural Networks) offers not only code and a model
zoo but also a detailed tutorial and partner package for simulating training data. These resources,
especially the ability to simulate datasets that are sufficiently complex to demonstrate the
challenges of training neural networks but small enough to train quickly on a CPU, are important for
training students and attracting machine learning researchers to problems in genomics and
healthcare.
In image analysis, previous applications of deep transfer learning showed that large-scale natural
image sets [45] are useful for pre-training models that serve as generic feature extractors for
various types of biological images [14,283,534,535]. More recently, deep learning models
predicted protein sub-cellular localization for proteins not originally present in a training set [536].
Moreover, learned features performed reasonably well even when applied to images obtained
using different fluorescent labels, imaging techniques, and cell types [537]. However, there
are no established theoretical guarantees for feature transferability between distant domains such
as natural images and various modalities of biological imaging. Because learned patterns are
represented in deep neural networks in a layer-wise hierarchical fashion, this issue is usually
addressed by fixing an empirically chosen number of layers that preserve generic characteristics of
both training and target datasets. The model is then fine-tuned by re-training top layers on the
specific dataset in order to re-learn domain-specific high level concepts (e.g. fine-tuning for
radiology image classification [57]). Fine-tuning on specific biological datasets enables more
focused predictions.
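A minimal fine-tuning sketch (the backbone choice and the five-class head are illustrative assumptions): freeze the pre-trained layers that capture generic image features and retrain only a new head on the biological dataset.

```python
import torch
import torch.nn as nn
from torchvision import models

# Freeze an ImageNet-pretrained backbone and retrain only the final layer.
model = models.resnet18(pretrained=True)
for p in model.parameters():
    p.requires_grad = False                      # keep generic lower layers fixed
model.fc = nn.Linear(model.fc.in_features, 5)    # new head, e.g. 5 cell phenotypes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...standard training loop over the small domain-specific dataset...
```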
In genomics, the Basset package [227] for predicting chromatin accessibility was shown to rapidly
learn and accurately predict on new data by leveraging a model pre-trained on available public
data. To simulate this scenario, the authors set aside 15 of the 164 cell type datasets and trained
the Basset model on the remaining 149. They then fine-tuned the model with one training pass over
each of the 15 held-out datasets and achieved results close to those of a model trained on all 164
datasets together. In another example, Min et al. [228] demonstrated how training on the
experimentally-validated FANTOM5 permissive enhancer dataset followed by fine-tuning on
ENCODE enhancer datasets improved cell type-specific predictions, outperforming state-of-the-art
results. In drug design, general RNN models trained to generate molecules from the ChEMBL
database have been fine-tuned to produce drug-like compounds for specific targets [420,423].
Related to transfer learning, multimodal learning involves simultaneous learning from various
types of inputs, such as images and text. It can capture features that describe common concepts
across input modalities. Generative graphical models like RBMs, deep Boltzmann machines, and
DBNs demonstrate successful extraction of more informative features for one modality (images or
video) when jointly learned with other modalities (audio or text) [538]. Deep graphical models such
as DBNs are well-suited for multimodal learning tasks because they learn a joint probability
distribution from inputs. They can be pre-trained in an unsupervised fashion on large unlabeled
data and then fine-tuned on a smaller number of labeled examples. When labels are available,
convolutional neural networks are ubiquitously used because they can be trained end-to-end with
backpropagation and demonstrate state-of-the-art performance in many discriminative tasks [14].
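A minimal sketch of a discriminative multimodal architecture (dimensions and modality names are illustrative): each input type gets its own encoder, and a shared head learns from the concatenated representations.

```python
import torch
import torch.nn as nn

# Two-modality network: separate encoders feed a shared classification head.
class MultimodalNet(nn.Module):
    def __init__(self, dim_a=2000, dim_b=500, hidden=128, n_classes=2):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())  # e.g. expression
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())  # e.g. methylation
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, xa, xb):
        return self.head(torch.cat([self.enc_a(xa), self.enc_b(xb)], dim=1))

xa, xb = torch.randn(4, 2000), torch.randn(4, 500)
logits = MultimodalNet()(xa, xb)   # joint prediction from both modalities
```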
Jha et al. [190] showed that integrated training delivered better performance than individual
networks. They compared a number of feed-forward architectures trained on RNA-seq data with
and without an additional set of CLIP-seq, knockdown, and over-expression based input features.
The integrative deep model generalized well for combined data, offering a large performance
improvement for alternative splicing event estimation. Chaudhary et al. [539] trained a deep
autoencoder model jointly on RNA-seq, miRNA-seq, and methylation data from TCGA to predict
survival subgroups of hepatocellular carcinoma patients. This multimodal approach that treated
different omic data types as different modalities outperformed both traditional methods (principal
component analysis) and single-omic models. Interestingly, multi-omic model performance did not
improve when combined with clinical information, suggesting that the model was able to capture
redundant contributions of clinical features through their correlated genomic features. Chen et al.
[176] used deep belief networks to learn phosphorylation states of a common set of signaling
proteins in primary cultured bronchial cells collected from rats and humans treated with distinct
stimuli. By interpreting species as different modalities representing similar high-level concepts, they
showed that DBNs were able to capture cross-species representation of signaling mechanisms in
response to a common stimulus. Another application used DBNs for joint unsupervised feature
learning from cancer datasets containing gene expression, DNA methylation, and miRNA
expression data [184]. This approach captured intrinsic relationships across the different
modalities and achieved better clustering performance than conventional k-means.
Multi-task learning is complementary to multimodal and transfer learning. All three techniques can
be used together in the same model. For example, Zhang et al. [534] combined deep model-based
transfer and multi-task learning for cross-domain image annotation. One could imagine extending
that approach to multimodal inputs as well. A common characteristic of these methods is better
generalization of extracted features at various hierarchical levels of abstraction, which is attained
by leveraging relationships between various inputs and task objectives.
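A minimal multi-task sketch (task names and sizes are illustrative): a shared trunk feeds task-specific heads, and training minimizes the summed per-task losses so the shared features must generalize across tasks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared trunk with task-specific heads trained on a joint loss.
class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=100, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_binding = nn.Linear(hidden, 1)   # e.g. TF binding
        self.head_access = nn.Linear(hidden, 1)    # e.g. chromatin accessibility

    def forward(self, x):
        h = self.trunk(x)                          # shared representation
        return self.head_binding(h), self.head_access(h)

model = MultiTaskNet()
x = torch.randn(8, 100)
y1 = (torch.rand(8, 1) > 0.5).float()
y2 = (torch.rand(8, 1) > 0.5).float()
p1, p2 = model(x)
loss = F.binary_cross_entropy_with_logits(p1, y1) \
     + F.binary_cross_entropy_with_logits(p2, y2)
loss.backward()                                    # gradients flow into the shared trunk
```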
Despite demonstrated improvements, transfer learning approaches pose challenges. There are no
theoretically sound principles for pre-training and fine-tuning. Best practice recommendations are
heuristic and must account for additional hyper-parameters that depend on specific deep
architectures, sizes of the pre-training and target datasets, and similarity of domains. However,
similarity of datasets and domains in transfer learning and relatedness of tasks in multi-task
learning are difficult to assess. Most studies address these limitations by empirical evaluation of the
model. Unfortunately, negative results are typically not reported. A deep CNN trained on natural
images can boost performance on radiographic images [57]. However, due to differences between
the imaging domains, the target task required either re-training the initial model from scratch with
special pre-processing or fine-tuning the whole network on radiographs with heavy data
augmentation to avoid overfitting. Exclusively fine-tuning the top layers led to much lower
validation accuracy (81.4% versus 99.5%). Fine-tuning the aforementioned Basset model with more
than one pass resulted in
overfitting [227]. DeepChem successfully improved results for low-data drug discovery with one-
shot learning for related tasks. However, it clearly demonstrated the limitations of cross-task
generalization across unrelated tasks in one-shot models, specifically nuclear receptor assays and
patient adverse reactions [403].
In the medical domain, multimodal, multi-task, and transfer learning strategies not only inherit most
methodological issues from natural image, text, and audio domains, but also pose domain-specific
challenges. There is a compelling need for the development of privacy-preserving transfer learning
algorithms, such as Private Aggregation of Teacher Ensembles [158]. We suggest that these types
of models deserve deeper investigation to establish sound theoretical guarantees and determine
limits for the transferability of features between various closely related and distant learning tasks.
Conclusions
Deep learning-based methods now match or surpass the previous state of the art in a diverse array
of tasks in patient and disease categorization, fundamental biological study, genomics, and
treatment development. Returning to our central question: given this rapid progress, has deep
learning transformed the study of human disease? Though the answer is highly dependent on the
specific domain and problem being addressed, we conclude that deep learning has not yet realized
its transformative potential or induced a strategic inflection point. Despite its dominance over
competing machine learning approaches in many of the areas reviewed here and quantitative
improvements in predictive performance, deep learning has not yet definitively “solved” these
problems.
As an analogy, consider recent progress in conversational speech recognition. Since 2009 there
have been drastic performance improvements with error rates dropping from more than 20% to
less than 6% [542] and finally approaching or exceeding human performance in the past year
[543,544]. The phenomenal improvements on benchmark datasets are undeniable, but greatly
reducing the error rate on these benchmarks did not fundamentally transform the domain.
Widespread adoption of conversational speech technologies will require solving the problem,
i.e. methods that surpass human performance, and persuading users to adopt them [542]. We see
parallels in healthcare, where achieving the full potential of deep learning will require outstanding
predictive performance as well as acceptance and adoption by biologists and clinicians. These
experts will rightfully demand rigorous evidence that deep learning has impacted their respective
disciplines—elucidated new biological mechanisms and improved patient outcomes—to be
convinced that the promises of deep learning are more substantive than those of previous
generations of artificial intelligence.
Some of the areas we have discussed are closer to surpassing this lofty bar than others, generally
those most similar to the non-biomedical tasks now dominated by deep
learning. In medical imaging, diabetic retinopathy [49], diabetic macular edema [49], tuberculosis
[58], and skin lesion [4] classifiers are highly accurate and comparable to clinician performance.
In other domains, perfect accuracy will not be required because deep learning will primarily
prioritize experiments and assist discovery. For example, in chemical screening for drug discovery,
a deep learning system that successfully identifies dozens or hundreds of target-specific, active
small molecules from a massive search space would have immense practical value even if its
overall precision is modest. In medical imaging, deep learning can point an expert to the most
challenging cases that require manual review [58], though the risk of false negatives must be
addressed. In protein structure prediction, errors in individual residue-residue contacts can be
tolerated when using the contacts jointly for 3D structure modeling. Improved contact map
predictions [28] have led to notable improvements in fold and 3D structure prediction for some of
the most challenging proteins, such as membrane proteins [250].
Conversely, the most challenging tasks may be those in which predictions are used directly for
downstream modeling or decision-making, especially in the clinic. As an example, errors in
sequence variant calling will be amplified if they are used directly for GWAS. In addition, the
stochasticity and complexity of biological systems implies that for some problems, for instance
predicting gene regulation in disease, perfect accuracy will be unattainable.
We are witnessing deep learning models achieving human-level performance across a number of
biomedical domains. However, machine learning algorithms, including deep neural networks, are
also prone to mistakes that humans are much less likely to make, such as misclassification of
adversarial examples [545,546], a reminder that these algorithms do not understand the semantics
of the objects presented. It may be impossible to guarantee that a model is not susceptible to
adversarial examples, but work in this area is continuing [547,548]. Cooperation between human
experts and deep learning algorithms addresses many of these challenges and can achieve better
performance than either individually [64]. For sample and patient classification tasks, we expect
deep learning methods to augment clinicians and biomedical researchers.
We are optimistic about the future of deep learning in biology and medicine. It is by no means
inevitable that deep learning will revolutionize these domains, but given how rapidly the field is
evolving, we are confident that its full potential in biomedicine has not been explored. We have
highlighted numerous challenges beyond improving training and predictive accuracy, such as
preserving patient privacy and interpreting models. Ongoing research has begun to address these
problems and shown that they are not insurmountable. Deep learning offers the flexibility to model
data in its most natural form, for example, longer DNA sequences instead of k-mers for
transcription factor binding prediction and molecular graphs instead of pre-computed bit vectors for
drug discovery. These flexible input feature representations have spurred creative modeling
approaches that would be infeasible with other machine learning techniques. Unsupervised
methods are currently less developed than their supervised counterparts, but they may have the
most potential because of how expensive and time-consuming it is to label large amounts of
biomedical data. If future deep learning algorithms can summarize very large collections of input
data into interpretable models that spur scientists to ask questions that they did not know how to
ask, it will be clear that deep learning has transformed biology and medicine.
Methods
To facilitate citation, we defined a markdown citation syntax. We supported citations to the following
identifier types (in order of preference): DOIs, PubMed Central IDs, PubMed IDs, arXiv IDs, and
URLs. References were automatically generated from citation metadata by querying APIs to
generate Citation Style Language (CSL) JSON items for each reference. Pandoc and pandoc-
citeproc converted the markdown to HTML and PDF, while rendering the formatted citations and
references. In total, referenced works consisted of 369 DOIs, 6 PubMed Central records, 129 arXiv
manuscripts, and 48 URLs (webpages as well as manuscripts lacking standardized identifiers).
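For illustration, citations in the markdown source take the form of pandoc-style keys built from these identifiers (this assumes the repository's citation-key convention; the specific references below are drawn from the reference list only as examples):

```
Deep learning [@doi:10.1038/nature14539] has advanced machine translation
[@arxiv:1609.08144v2] and raises data quality questions [@pmcid:PMC3797550].
```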
Author contributions
We created an open repository on the GitHub version control platform (greenelab/deep-review)
[552]. Here, we engaged with numerous authors from papers within and outside of the area. The
manuscript was drafted via GitHub commits by 36 individuals who met the ICMJE standards of
authorship. These were individuals who contributed to the review of the literature; drafted the
manuscript or provided substantial critical revisions; approved the final manuscript draft; and
agreed to be accountable for all aspects of the work. Individuals who did not contribute in all of
these ways, but who did participate, are acknowledged below. We grouped authors into the
following four classes of approximately equal contributions and randomly ordered authors within
each contribution class. Drafted multiple sub-sections along with extensive editing, pull request
reviews, or discussion: A.A.K., B.K.B., B.T.D., D.S.H., E.F., G.P.W., M.M.H., M.Z., P.A., T.C. Drafted
one or more sub-sections: A.E.C., A.M.A., A.S., B.J.L., C.A.L., E.M.C., G.L.R., J.I., J.L., J.X.,
S.C.T., S.W., W.X., Z.L. Revised specific sub-sections or supervised drafting one or more sub-
sections: A.H., A.K., D.D., D.J.H., L.K.W., M.H.S.S., S.J.S., S.M.B., Y.P., Y.Q. Drafted sub-sections,
edited the manuscript, reviewed pull requests, and coordinated co-authors: A.G., C.S.G.
Competing interests
A.K. is on the Advisory Board of Deep Genomics Inc. E.F. is a full-time employee of
GlaxoSmithKline. The remaining authors have no competing interests to declare.
Acknowledgements
We gratefully acknowledge Christof Angermueller, Kumardeep Chaudhary, Gökcen Eraslan, Mikael
Huss, Bharath Ramsundar and Xun Zhu for their discussion of the manuscript and reviewed
papers on GitHub. We would like to thank Aaron Sheldon, who contributed text but did not formally
approve the manuscript. We would like to thank Anna Greene for a careful proofreading of the
manuscript in advance of the first submission. We would like to thank Robert Gieseke, Ruibang
Luo, Sourav Singh, and GitHub user snikumbh for correcting typos, formatting, and references.
Finally, we acknowledge funding from the Gordon and Betty Moore Foundation awards GBMF4552
(C.S.G. and D.S.H.) and GBMF4563 (D.J.H.); the Howard Hughes Medical Institute (S.C.T.); the
National Institutes of Health awards DP2GM123485 (A.K.), P30CA051008 (S.M.B.), R01AI116794
(B.K.B.), R01GM089652 (A.E.C.), R01GM089753 (J.X.), R01LM012222 (S.J.S.), R01LM012482
(S.J.S.), R21CA220398 (S.M.B.), T32GM007753 (B.T.D.), T32HG000046 (G.P.W.), and
U54AI117924 (A.G.); the National Institutes of Health Intramural Research Program and National
Library of Medicine (Y.P. and Z.L.); the National Science Foundation awards 1245632 (G.L.R.),
1531594 (E.M.C.), and 1564955 (J.X.); the Natural Sciences and Engineering Research Council of
Canada award RGPIN-2015-3948 (M.M.H.); and the Roy and Diana Vagelos Scholars Program in
the Molecular Life Sciences (M.Z.).
References
1. Big Data: Astronomical or Genomical?
Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J.
Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, Gene E. Robinson
PLOS Biology (2015-07-07) https://doi.org/10.1371/journal.pbio.1002195
2. Deep learning
Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Nature (2015-05-27) https://doi.org/10.1038/nature14539
5. Google’s Neural Machine Translation System: Bridging the Gap between Human and
Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, … Jeffrey Dean
arXiv (2016-09-26) https://arxiv.org/abs/1609.08144v2
26. Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks
Dan C. Cireşan, Alessandro Giusti, Luca M. Gambardella, Jürgen Schmidhuber
Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013 (2013) https://
doi.org/10.1007/978-3-642-40763-5_51
27. End effector target position learning using feedforward with error back-propagation and
recurrent neural networks
J. Zurada
Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94) (1994) https://
doi.org/10.1109/icnn.1994.374637
28. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, Jinbo Xu
PLOS Computational Biology (2017-01-05) https://doi.org/10.1371/journal.pcbi.1005324
29. A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction
Matt Spencer, Jesse Eickholt, Jianlin Cheng
IEEE/ACM Transactions on Computational Biology and Bioinformatics (2015-01-01) https://
doi.org/10.1109/tcbb.2014.2343960
30. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields
Sheng Wang, Jian Peng, Jianzhu Ma, Jinbo Xu
Scientific Reports (2016-01-11) https://doi.org/10.1038/srep18962
32. Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters
Yifeng Li, Chih-Yu Chen, Wyeth W. Wasserman
Lecture Notes in Computer Science (2015) https://doi.org/10.1007/978-3-319-16706-0_20
34. DANN: a deep learning approach for annotating the pathogenicity of genetic variants
Daniel Quang, Yifei Chen, Xiaohui Xie
Bioinformatics (2014-10-22) https://doi.org/10.1093/bioinformatics/btu703
35. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-
based Drug Discovery
Izhar Wallach, Michael Dzamba, Abraham Heifets
arXiv (2015-10-10) https://arxiv.org/abs/1510.02855v1
36. Deep Learning Applications for Predicting Pharmacological Properties of Drugs and
Drug Repurposing Using Transcriptomic Data
Alexander Aliper, Sergey Plis, Artem Artemov, Alvaro Ulloa, Polina Mamoshina, Alex Zhavoronkov
Molecular Pharmaceutics (2016-07-05) https://doi.org/10.1021/acs.molpharmaceut.6b00248
40. Deep Learning and Structured Prediction for the Segmentation of Mass in Mammograms
Neeraj Dhungel, Gustavo Carneiro, Andrew P. Bradley
Lecture Notes in Computer Science (2015) https://doi.org/10.1007/978-3-319-24553-9_74
41. The Automated Learning of Deep Features for Breast Mass Classification from
Mammograms
Neeraj Dhungel, Gustavo Carneiro, Andrew P. Bradley
Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016 (2016) https://
doi.org/10.1007/978-3-319-46723-8_13
42. Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram
Classification
Wentao Zhu, Qi Lou, Yeeleng Scott Vang, Xiaohui Xie
Cold Spring Harbor Laboratory (2016-12-20) https://doi.org/10.1101/095794
44. A deep learning approach for the analysis of masses in mammograms with minimal user
intervention
Neeraj Dhungel, Gustavo Carneiro, Andrew P. Bradley
Medical Image Analysis (2017-04) https://doi.org/10.1016/j.media.2017.01.009
48. Leveraging uncertainty information from deep neural networks for disease detection
Christian Leibig, Vaneeda Allken, Murat Seckin Ayhan, Philipp Berens, Siegfried Wahl
Cold Spring Harbor Laboratory (2016-10-28) https://doi.org/10.1101/084210
49. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic
Retinopathy in Retinal Fundus Photographs
Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam
Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, … Dale
R. Webster
JAMA (2016-12-13) https://doi.org/10.1001/jama.2016.17216
51. Automated Melanoma Recognition in Dermoscopy Images via Very Deep Residual
Networks
Lequan Yu, Hao Chen, Qi Dou, Jing Qin, Pheng-Ann Heng
IEEE Transactions on Medical Imaging (2017-04) https://doi.org/10.1109/tmi.2016.2642839
52. Extraction of skin lesions from non-dermoscopic images for surgical excision of
melanoma
M. Hossein Jafari, Ebrahim Nasr-Esfahani, Nader Karimi, S. M. Reza Soroushmehr, Shadrokh
Samavi, Kayvan Najarian
International Journal of Computer Assisted Radiology and Surgery (2017-03-24) https://
doi.org/10.1007/s11548-017-1567-8
53. Melanoma detection by analysis of clinical images using convolutional neural network
E. Nasr-Esfahani, S. Samavi, N. Karimi, S.M.R. Soroushmehr, M.H. Jafari, K. Ward, K. Najarian
2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology
Society (EMBC) (2016-08) https://doi.org/10.1109/embc.2016.7590963
55. Deep learning with non-medical training used for chest pathology identification
Yaniv Bar, Idit Diamant, Lior Wolf, Hayit Greenspan
Medical Imaging 2015: Computer-Aided Diagnosis (2015-03-20) https://
doi.org/10.1117/12.2083124
56. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures,
Dataset Characteristics and Transfer Learning
Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao,
Daniel Mollura, Ronald M. Summers
IEEE Transactions on Medical Imaging (2016-05) https://doi.org/10.1109/tmi.2016.2528162
59. Classification of breast MRI lesions using small-size training sets: comparison of deep
learning approaches
Guy Amit, Rami Ben-Ari, Omer Hadad, Einat Monovich, Noa Granot, Sharbell Hashoul
61. 3D Deep Learning for Multi-modal Imaging-Guided Survival Time Prediction of Brain
Tumor Patients
Dong Nie, Han Zhang, Ehsan Adeli, Luyan Liu, Dinggang Shen
Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016 (2016) https://
doi.org/10.1007/978-3-319-46723-8_25
62. Large scale deep learning for computer aided detection of mammographic lesions
Thijs Kooi, Geert Litjens, Bram van Ginneken, Albert Gubern-Mérida, Clara I. Sánchez, Ritse
Mann, Ard den Heeten, Nico Karssemeijer
Medical Image Analysis (2017-01) https://doi.org/10.1016/j.media.2016.07.007
63. Deep learning as a tool for increased accuracy and efficiency of histopathological
diagnosis
Geert Litjens, Clara I. Sánchez, Nadya Timofeeva, Meyke Hermsen, Iris Nagtegaal, Iringo Kovacs,
Christina Hulsbergen - van de Kaa, Peter Bult, Bram van Ginneken, Jeroen van der Laak
Scientific Reports (2016-05-23) https://doi.org/10.1038/srep26286
65. Deep learning is effective for the classification of OCT images of normal versus Age-
related Macular Degeneration
Cecilia S Lee, Doug M Baughman, Aaron Y Lee
Cold Spring Harbor Laboratory (2016-12-14) https://doi.org/10.1101/094276
69. NegBio: a high-performance tool for negation and uncertainty detection in radiology
reports
Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, Zhiyong Lu
arXiv (2017-12-16) https://arxiv.org/abs/1712.05898v2
72. TaggerOne: joint named entity recognition and normalization with semi-Markov Models
Robert Leaman, Zhiyong Lu
Bioinformatics (2016-06-09) https://doi.org/10.1093/bioinformatics/btw343
73. tmVar: a text mining approach for extracting sequence variants in biomedical literature
C.-H. Wei, B. R. Harris, H.-Y. Kao, Z. Lu
Bioinformatics (2013-04-05) https://doi.org/10.1093/bioinformatics/btt156
75. Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation
Expansion
Yue Liu, Tao Ge, Kusum Mathews, Heng Ji, Deborah McGuinness
Proceedings of BioNLP 15 (2015) https://doi.org/10.18653/v1/w15-3810
80. Improving chemical disease relation extraction with rich features and weakly labeled
data
Yifan Peng, Chih-Hsuan Wei, Zhiyong Lu
Journal of Cheminformatics (2016-10-07) https://doi.org/10.1186/s13321-016-0165-z
82. Joint Models for Extracting Adverse Drug Events from Biomedical Text
Fei Li, Yue Zhang, Meishan Zhang, Donghong Ji
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016)
http://dl.acm.org/citation.cfm?id=3060832.3061018
83. A neural joint model for entity and relation extraction from biomedical text
Fei Li, Meishan Zhang, Guohong Fu, Donghong Ji
BMC Bioinformatics (2017-03-31) https://doi.org/10.1186/s12859-017-1609-9
84. Deep learning for extracting protein-protein interactions from biomedical literature
Yifan Peng, Zhiyong Lu
BioNLP 2017 (2017) https://doi.org/10.18653/v1/w17-2304
85. A Shortest Dependency Path Based Convolutional Neural Network for Protein-Protein
Relation Extraction
Lei Hua, Chanqin Quan
BioMed Research International (2016) https://doi.org/10.1155/2016/8479587
89. Drug drug interaction extraction from biomedical literature using syntax convolutional
neural network
Zhehuan Zhao, Zhihao Yang, Ling Luo, Hongfei Lin, Jian Wang
Bioinformatics (2016-07-27) https://doi.org/10.1093/bioinformatics/btw486
91. Drug-drug Interaction Extraction via Recurrent Neural Network with Multiple Attention
Layers
Zibo Yi, Shasha Li, Jie Yu, Qingbo Wu
arXiv (2017-05-09) https://arxiv.org/abs/1705.03261v2
93. Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task
2016
Farrokh Mehryary, Jari Björne, Sampo Pyysalo, Tapio Salakoski, Filip Ginter
Proceedings of the 4th BioNLP Shared Task Workshop (2016) https://doi.org/10.18653/v1/
w16-3009
96. Biomedical Event Trigger Identification Using Bidirectional Recurrent Neural Network
Based Models
Patchigolla V S S Rahul, Sunil Kumar Sahu, Ashish Anand
arXiv (2017-05-26) https://arxiv.org/abs/1705.09516v1
97. Deep Learning for Biomedical Information Retrieval: Learning Textual Relevance from
Click Logs
98. Realizing the full potential of electronic health records: the role of natural language
processing
Lucila Ohno-Machado
Journal of the American Medical Informatics Association (2011-09) https://doi.org/10.1136/
amiajnl-2011-000501
99. Machine-learned solutions for three stages of clinical information extraction: the state of
the art at i2b2 2010
Berry de Bruijn, Colin Cherry, Svetlana Kiritchenko, Joel Martin, Xiaodan Zhu
Journal of the American Medical Informatics Association (2011-09) https://doi.org/10.1136/
amiajnl-2011-000150
101. Multi-task Deep Neural Networks for Automated Extraction of Primary Site and
Laterality Information from Cancer Pathology Reports
Hong-Jun Yoon, Arvind Ramanathan, Georgia Tourassi
Advances in Big Data (2016-10-08) https://doi.org/10.1007/978-3-319-47898-2_21
103. Exploring the Application of Deep Learning Techniques on Medical Text Corpora
Minarro-Giménez José Antonio, Marín-Alonso Oscar, Samwald Matthias
Studies in Health Technology and Informatics (2014) https://
doi.org/10.3233/978-1-61499-432-9-584
109. Bidirectional RNN for Medical Event Detection in Electronic Health Records
Abhyuday N Jagannatha, Hong Yu
Proceedings of the conference. Association for Computational Linguistics. North American Chapter.
Meeting (2016-06) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5119627/
111. Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy,
Sparse, and Irregular Clinical Data
Thomas A. Lasko, Joshua C. Denny, Mia A. Levy
PLoS ONE (2013-06-24) https://doi.org/10.1371/journal.pone.0066341
112. Semi-supervised learning of the electronic health record for phenotype stratification
Brett K. Beaulieu-Jones, Casey S. Greene
Journal of Biomedical Informatics (2016-12) https://doi.org/10.1016/j.jbi.2016.10.007
113. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from
the Electronic Health Records
Riccardo Miotto, Li Li, Brian A. Kidd, Joel T. Dudley
Scientific Reports (2016-05-17) https://doi.org/10.1038/srep26094
114. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, Jimeng Sun
arXiv (2015-11-18) https://arxiv.org/abs/1511.05942v11
119. Comparison of the performance of neural network methods and Cox regression for
censored survival data
Anny Xiang, Pablo Lapuerta, Alex Ryutov, Jonathan Buckley, Stanley Azen
Computational Statistics & Data Analysis (2000-08) https://doi.org/10.1016/s0167-9473(99)00098-5
126. Electronic medical record phenotyping using the anchor and learn framework
Yoni Halpern, Steven Horng, Youngduck Choi, David Sontag
Journal of the American Medical Informatics Association (2016-04-23) https://doi.org/10.1093/
jamia/ocw011
129. “Data is the New Oil” — A Ludicrous Proposition – Twenty One Hundred – Medium
Michael Haupt
Medium (2016-05-02) https://medium.com/twenty-one-hundred/data-is-the-new-oil-a-ludicrous-
proposition-1d91bba4f294
131. Mining electronic health records: towards better research applications and clinical care
Peter B. Jensen, Lars J. Jensen, Søren Brunak
Nature Reviews Genetics (2012-05-02) https://doi.org/10.1038/nrg3208
132. Methods and dimensions of electronic health record data quality assessment: enabling
reuse for clinical research
N. G. Weiskopf, C. Weng
Journal of the American Medical Informatics Association (2013-01-01) https://doi.org/10.1136/
amiajnl-2011-000681
133. Impact of Electronic Health Record Systems on Information Integrity: Quality and
Safety Implications
Sue Bowman
Perspectives in Health Information Management (2013) https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3797550/
134. Secondary Use of EHR: Data Quality Issues and Informatics Opportunities
Taxiarchis Botsis, Gunnar Hartvigsen, Fei Chen, Chunhua Weng
Summit on Translational Bioinformatics (2010) https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3041534/
135. Have DRG-based prospective payment systems influenced the number of secondary
diagnoses in health care administrative data?
Lisbeth Serdén, Rikard Lindqvist, Måns Rosén
Health Policy (2003-08) https://doi.org/10.1016/s0168-8510(02)00208-7
136. Why Patient Matching Is a Challenge: Research on Master Patient Index (MPI) Data
Discrepancies in Key Identifying Fields
Beth Haenke Just, David Marc, Megan Munns, Ryan Sandefer
138. Using electronic health records for clinical research: The case of the EHR4CR project
Georges De Moor, Mats Sundgren, Dipak Kalra, Andreas Schmidt, Martin Dugas, Brecht
Claerhout, Töresin Karakoyun, Christian Ohmann, Pierre-Yves Lastic, Nadir Ammour, … Pascal
Coorevits
Journal of Biomedical Informatics (2015-02) https://doi.org/10.1016/j.jbi.2014.10.006
145. DataSHIELD: taking the analysis to the data, not the data to the analysis
Amadou Gaye, Yannick Marcon, Julia Isaeva, Philippe LaFlamme, Andrew Turner, Elinor M Jones,
Joel Minion, Andrew W Boyd, Christopher J Newby, Marja-Liisa Nuotio, … Paul R Burton
International Journal of Epidemiology (2014-09-27) https://doi.org/10.1093/ije/dyu188
146. ViPAR: a software platform for the Virtual Pooling and Analysis of Research Data
Kim W Carter, KW Carter, RW Francis, M Bresnahan, M Gissler, TK Grønborg, R Gross, N
Gunnes, G Hammond, M Hornig, … Z Yusof
International Journal of Epidemiology (2015-10-08) https://doi.org/10.1093/ije/dyv193
153. Generating Multi-label Discrete Electronic Health Records using Generative Adversarial
Networks
Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, Jimeng Sun
arXiv (2017-03-19) https://arxiv.org/abs/1703.06490v1
154. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs
Cristóbal Esteban, Stephanie L. Hyland, Gunnar Rätsch
arXiv (2017-06-08) https://arxiv.org/abs/1706.02633v1
155. Privacy-preserving generative deep neural networks support clinical data sharing
Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, James Brian Byrd, Casey S. Greene
Cold Spring Harbor Laboratory (2017-07-05) https://doi.org/10.1101/159756
158. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, Kunal Talwar
(2016-11-02) https://openreview.net/forum?id=HkwoSDPgg
160. Overcoming the Winner’s Curse: Estimating Penetrance Parameters from Case-Control
Data
Sebastian Zöllner, Jonathan K. Pritchard
The American Journal of Human Genetics (2007-04) https://doi.org/10.1086/512821
164. Retraction
P. Sebastiani, N. Solovieff, A. Puca, S. W. Hartley, E. Melista, S. Andersen, D. A. Dworkis, J. B.
Wilk, R. H. Myers, M. H. Steinberg, … T. T. Perls
Science (2011-07-21) https://doi.org/10.1126/science.333.6041.404-a
169. The Framingham Heart Study and the epidemiology of cardiovascular disease: a
historical perspective
Syed S Mahmood, Daniel Levy, Ramachandran S Vasan, Thomas J Wang
The Lancet (2014-03) https://doi.org/10.1016/s0140-6736(13)61752-3
172. Temporal disease trajectories condensed from population-wide registry data covering
6.2 million patients
Anders Boeck Jensen, Pope L. Moseley, Tudor I. Oprea, Sabrina Gade Ellesøe, Robert Eriksson,
Henriette Schmock, Peter Bjødstrup Jensen, Lars Juhl Jensen, Søren Brunak
Nature Communications (2014-06-24) https://doi.org/10.1038/ncomms5022
174. Curiosity Creates Cures: The Value and Impact of Basic Research
NIH
(2012-05) https://www.nigms.nih.gov/Education/Documents/curiosity.pdf
175. Multi-omics integration accurately predicts cellular state in unexplored conditions for
Escherichia coli
176. Trans-species learning of cellular signaling systems with bimodal deep belief networks
Lujia Chen, Chunhui Cai, Vicky Chen, Xinghua Lu
Bioinformatics (2015-05-20) https://doi.org/10.1093/bioinformatics/btv315
177. Learning structure in gene expression data using deep architectures, with an
application to gene clustering
Aman Gupta, Haohan Wang, Madhavi Ganapathiraju
Cold Spring Harbor Laboratory (2015-11-16) https://doi.org/10.1101/031906
180. Unsupervised extraction of stable expression signatures from public compendia with
eADAGE
Jie Tan, Georgia Doing, Kimberley A Lewis, Courtney E Price, Kathleen M Chen, Kyle C Cady,
Barret Perchuk, Michael T Laub, Deborah A Hogan, Casey S Greene
Cold Spring Harbor Laboratory (2016-10-03) https://doi.org/10.1101/078659
182. DeepChrome: Deep-learning for predicting gene expression from histone modifications
Ritambhara Singh, Jack Lanchantin, Gabriel Robins, Yanjun Qi
arXiv (2016-07-07) https://arxiv.org/abs/1607.02078v1
184. Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep
Learning Approach
Muxuan Liang, Zhizhong Li, Ting Chen, Jianyang Zeng
IEEE/ACM Transactions on Computational Biology and Bioinformatics (2015-07-01) https://
doi.org/10.1109/tcbb.2014.2377729
186. RNA splicing is a primary link between genetic variation and disease
Y. I. Li, B. van de Geijn, A. Raj, D. A. Knowles, A. A. Petti, D. Golan, Y. Gilad, J. K. Pritchard
Science (2016-04-28) https://doi.org/10.1126/science.aad9417
188. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular
context
Hui Yuan Xiong, Yoseph Barash, Brendan J. Frey
Bioinformatics (2011-07-29) https://doi.org/10.1093/bioinformatics/btr444
189. The human splicing code reveals new insights into the genetic determinants of disease
H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. C. Yuen, Y. Hua, S.
Gueroussov, H. S. Najafabadi, T. R. Hughes, … B. J. Frey
Science (2014-12-18) https://doi.org/10.1126/science.1254806
191. Imputation for transcription factor binding predictions based on deep learning
Qian Qin, Jianxing Feng
PLOS Computational Biology (2017-02-24) https://doi.org/10.1371/journal.pcbi.1005403
192. Learning the Sequence Determinants of Alternative Splicing from Millions of Random
Sequences
Alexander B. Rosenberg, Rupali P. Patwardhan, Jay Shendure, Georg Seelig
Cell (2015-10) https://doi.org/10.1016/j.cell.2015.09.054
194. Absence of a simple code: how transcription factors read the genome
Matthew Slattery, Tianyin Zhou, Lin Yang, Ana Carolina Dantas Machado, Raluca Gordân, Remo
Rohs
Trends in Biochemical Sciences (2014-09) https://doi.org/10.1016/j.tibs.2014.07.002
199. High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In
Vivo Binding Predictions
Phaedra Agius, Aaron Arvey, William Chang, William Stafford Noble, Christina Leslie
PLoS Computational Biology (2010-09-09) https://doi.org/10.1371/journal.pcbi.1000916
201. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep
learning
Babak Alipanahi, Andrew Delong, Matthew T Weirauch, Brendan J Frey
Nature Biotechnology (2015-07-27) https://doi.org/10.1038/nbt.3300
202. RNA-protein binding motifs mining with a new hybrid deep learning based cross-
domain knowledge integration approach
Xiaoyong Pan, Hong-Bin Shen
BMC Bioinformatics (2017-02-28) https://doi.org/10.1186/s12859-017-1561-8
204. Deep Motif Dashboard: Visualizing and Understanding Genomic Sequences Using
Deep Neural Networks
Jack Lanchantin, Ritambhara Singh, Beilun Wang, Yanjun Qi
arXiv (2016-08-12) https://arxiv.org/abs/1608.03644v4
205. Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction
Alyssa Morrow, Vaishaal Shankar, Devin Petersohn, Anthony Joseph, Benjamin Recht, Nir Yosef
arXiv (2017-05-31) https://arxiv.org/abs/1706.00125v1
206. Predicting Transcription Factor Binding Sites with Convolutional Kernel Networks
Dexiong Chen, Laurent Jacob, Julien Mairal
Cold Spring Harbor Laboratory (2017-11-10) https://doi.org/10.1101/217257
207. Reverse-complement parameter sharing improves deep learning models for genomics
Avanti Shrikumar, Peyton Greenside, Anshul Kundaje
Cold Spring Harbor Laboratory (2017-01-27) https://doi.org/10.1101/103663
208. Separable Fully Connected Layers Improve Deep Learning Models For Genomics
Amr Mohamed Alexandari, Avanti Shrikumar, Anshul Kundaje
Cold Spring Harbor Laboratory (2017-06-05) https://doi.org/10.1101/146431
209. Predicting effects of noncoding variants with deep learning–based sequence model
Jian Zhou, Olga G Troyanskaya
Nature Methods (2015-08-24) https://doi.org/10.1038/nmeth.3547
210. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the
function of DNA sequences
Daniel Quang, Xiaohui Xie
Nucleic Acids Research (2016-04-15) https://doi.org/10.1093/nar/gkw226
214. FactorNet: a deep learning framework for predicting cell type specific transcription
factor binding from nucleotide-resolution sequential data
Daniel Quang, Xiaohui Xie
Cold Spring Harbor Laboratory (2017-06-18) https://doi.org/10.1101/151274
215. Learning from mistakes: Accurate prediction of cell type-specific transcription factor
binding
Jens Keilwagen, Stefan Posch, Jan Grau
Cold Spring Harbor Laboratory (2017-12-06) https://doi.org/10.1101/230011
221. Detection of RNA polymerase II promoters and polyadenylation sites in human DNA
sequence
Sherri Matis, Ying Xu, Manesh Shah, Xiaojun Guan, J.Ralph Einstein, Richard Mural, Edward
Uberbacher
Computers & Chemistry (1996-03) https://doi.org/10.1016/s0097-8485(96)80015-5
223. Cap analysis gene expression for high-throughput analysis of transcriptional starting
point and identification of promoter usage
T. Shiraki, S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji, R. Kodzius, A. Watahiki, M.
Nakamura, T. Arakawa, … Y. Hayashizaki
Proceedings of the National Academy of Sciences (2003-12-08) https://doi.org/10.1073/pnas.2136655100
227. Basset: learning the regulatory code of the accessible genome with deep convolutional
neural networks
David R. Kelley, Jasper Snoek, John L. Rinn
Genome Research (2016-05-03) https://doi.org/10.1101/gr.200535.115
230. Predicting Enhancer-Promoter Interaction from Genomic Sequence with Deep Neural
Networks
Shashank Singh, Yang Yang, Barnabas Poczos, Jian Ma
Cold Spring Harbor Laboratory (2016-11-02) https://doi.org/10.1101/085241
234. deepTarget: End-to-end Learning Framework for microRNA Target Prediction using
Deep Recurrent Neural Networks
Byunghan Lee, Junghwan Baek, Seunghyun Park, Sungroh Yoon
arXiv (2016-03-30) https://arxiv.org/abs/1603.09123v2
236. AUC-Maximized Deep Convolutional Neural Fields for Protein Sequence Labeling
Sheng Wang, Siqi Sun, Jinbo Xu
Machine Learning and Knowledge Discovery in Databases (2016) https://doi.org/10.1007/978-3-319-46227-1_1
237. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and
long range hydrogen bonding in proteins
David T. Jones, Tanya Singh, Tomasz Kosciolek, Stuart Tetchner
Bioinformatics (2014-11-26) https://doi.org/10.1093/bioinformatics/btu791
241. Improving prediction of secondary structure, local backbone angles and solvent
accessible surface area of proteins by iterative deep learning
Rhys Heffernan, Kuldip Paliwal, James Lyons, Abdollah Dehzangi, Alok Sharma, Jihua Wang,
Abdul Sattar, Yuedong Yang, Yaoqi Zhou
Scientific Reports (2015-06-22) https://doi.org/10.1038/srep11476
243. Deep Supervised and Convolutional Generative Stochastic Network for Protein
Secondary Structure Prediction
Jian Zhou, Olga G. Troyanskaya
arXiv (2014-03-06) https://arxiv.org/abs/1403.1347v1
244. Protein contact prediction by integrating joint evolutionary coupling analysis and
supervised learning
Jianzhu Ma, Sheng Wang, Zhiyong Wang, Jinbo Xu
Bioinformatics (2015-08-14) https://doi.org/10.1093/bioinformatics/btv472
246. Predicting protein residue–residue contacts using deep networks and boosting
Jesse Eickholt, Jianlin Cheng
Bioinformatics (2012-10-09) https://doi.org/10.1093/bioinformatics/bts598
247. Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns
Marcin J. Skwark, Daniele Raimondi, Mirco Michel, Arne Elofsson
PLoS Computational Biology (2014-11-06) https://doi.org/10.1371/journal.pcbi.1003889
250. Predicting membrane protein contacts from non-membrane proteins by deep transfer
learning
Zhen Li, Sheng Wang, Yizhou Yu, Jinbo Xu
arXiv (2017-04-24) https://arxiv.org/abs/1704.07207v1
255. DeepPicker: A deep learning approach for fully automated particle picking in cryo-EM
Feng Wang, Huichao Gong, Gaochao Liu, Meijing Li, Chuangye Yan, Tian Xia, Xueming Li,
Jianyang Zeng
Journal of Structural Biology (2016-09) https://doi.org/10.1016/j.jsb.2016.07.006
257. Massively parallel unsupervised single-particle cryo-EM data clustering via statistical
manifold learning
Jiayi Wu, Yong-Bei Ma, Charles Congdon, Bevin Brett, Shuobing Chen, Yaofang Xu, Qi Ouyang,
Youdong Mao
PLOS ONE (2017-08-07) https://doi.org/10.1371/journal.pone.0182130
260. Deep learning for extracting protein-protein interactions from biomedical literature
Yifan Peng, Zhiyong Lu
arXiv (2017-06-05) https://arxiv.org/abs/1706.01556v2
264. Prediction of residue-residue contact matrix for protein-protein interaction with Fisher
score features and deep learning
Tianchuan Du, Li Liao, Cathy H. Wu, Bilin Sun
Methods (2016-11) https://doi.org/10.1016/j.ymeth.2016.06.001
265. Reliable prediction of T-cell epitopes using neural networks with novel sequence
representations
Morten Nielsen, Claus Lundegaard, Peder Worning, Sanne Lise Lauemøller, Kasper Lamberth,
Søren Buus, Søren Brunak, Ole Lund
Protein Science (2003-05) https://doi.org/10.1110/ps.0239403
266. Gapped sequence alignment using artificial neural networks: application to the MHC
class I system
Massimo Andreatta, Morten Nielsen
Bioinformatics (2015-10-29) https://doi.org/10.1093/bioinformatics/btv639
267. NetMHCpan, a method for MHC class I binding prediction beyond humans
Ilka Hoof, Bjoern Peters, John Sidney, Lasse Eggers Pedersen, Alessandro Sette, Ole Lund, Søren
Buus, Morten Nielsen
Immunogenetics (2008-11-12) https://doi.org/10.1007/s00251-008-0341-z
271. High-order neural networks and kernel methods for peptide-MHC binding prediction
Pavel P. Kuksa, Martin Renqiang Min, Rishabh Dugar, Mark Gerstein
Bioinformatics (2015-07-23) https://doi.org/10.1093/bioinformatics/btv371
272. Evaluation of machine learning methods to predict peptide binding to MHC Class I
proteins
Rohit Bhattacharya, Ashok Sivakumar, Collin Tokheim, Violeta Beleva Guthrie, Valsamo
Anagnostou, Victor E. Velculescu, Rachel Karchin
Cold Spring Harbor Laboratory (2017-06-23) https://doi.org/10.1101/154757
279. Deep Learning Automates the Quantitative Analysis of Individual Cells in Live-Cell
Imaging Experiments
David A. Van Valen, Takamasa Kudo, Keara M. Lane, Derek N. Macklin, Nicolas T. Quach, Mialy M.
DeFelice, Inbal Maayan, Yu Tanouchi, Euan A. Ashley, Markus W. Covert
PLOS Computational Biology (2016-11-04) https://doi.org/10.1371/journal.pcbi.1005177
282. Reconstructing cell cycle and disease progression using deep learning
Philipp Eulenberg, Niklas Koehler, Thomas Blasi, Andrew Filby, Anne E. Carpenter, Paul Rees,
Fabian J. Theis, F. Alexander Wolf
Cold Spring Harbor Laboratory (2016-10-17) https://doi.org/10.1101/081364
287. Machine learning and computer vision approaches for phenotypic profiling
Ben T. Grys, Dara S. Lo, Nil Sahin, Oren Z. Kraus, Quaid Morris, Charles Boone, Brenda J.
Andrews
The Journal of Cell Biology (2016-12-09) https://doi.org/10.1083/jcb.201610026
289. Somatic mutation in single human neurons tracks developmental and transcriptional
history
M. A. Lodato, M. B. Woodworth, S. Lee, G. D. Evrony, B. K. Mehta, A. Karger, S. Lee, T. W.
Chittenden, A. M. D’Gama, X. Cai, … C. A. Walsh
Science (2015-10-01) https://doi.org/10.1126/science.aab1785
292. Joint Profiling Of Chromatin Accessibility, DNA Methylation And Transcription In Single
Cells
Stephen J. Clark, Ricard Argelaguet, Chantriolnt-Andreas Kapourani, Thomas M. Stubbs, Heather
J. Lee, Felix Krueger, Guido Sanguinetti, Gavin Kelsey, John C. Marioni, Oliver Stegle, Wolf Reik
Cold Spring Harbor Laboratory (2017-05-17) https://doi.org/10.1101/138685
293. DeepCpG: accurate prediction of single-cell DNA methylation states using deep
learning
Christof Angermueller, Heather J. Lee, Wolf Reik, Oliver Stegle
Genome Biology (2017-04-11) https://doi.org/10.1186/s13059-017-1189-z
297. Sensitive detection of rare disease-associated cell subsets via representation learning
Eirini Arvaniti, Manfred Claassen
Cold Spring Harbor Laboratory (2016-03-31) https://doi.org/10.1101/046508
298. Interpretable dimensionality reduction of single cell transcriptome data with deep
generative models
Jiarui Ding, Anne E. Condon, Sohrab P. Shah
Cold Spring Harbor Laboratory (2017-09-01) https://doi.org/10.1101/178624
299. A deep generative model for gene expression profiles from single-cell RNA sequencing
Romain Lopez, Jeffrey Regier, Michael Cole, Michael Jordan, Nir Yosef
arXiv (2017-09-07) https://arxiv.org/abs/1709.02082v3
301. Using neural networks for reducing the dimensions of single-cell RNA-Seq data
Chieh Lin, Siddhartha Jain, Hannah Kim, Ziv Bar-Joseph
Nucleic Acids Research (2017-07-31) https://doi.org/10.1093/nar/gkx681
304. Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den
Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, …
Demis Hassabis
Nature (2016-01-27) https://doi.org/10.1038/nature16961
307. NBC: the Naive Bayes Classification tool webserver for taxonomic classification of
metagenomic reads
G. L. Rosen, E. R. Reichenberger, A. M. Rosenfeld
Bioinformatics (2010-11-08) https://doi.org/10.1093/bioinformatics/btq619
309. Metagenomic microbial community profiling using unique clade-specific marker genes
Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, Curtis
Huttenhower
Nature Methods (2012-06-10) https://doi.org/10.1038/nmeth.2066
315. Utilizing Machine Learning Approaches to Understand the Interrelationship of Diet, the
Human Gastrointestinal Microbiome, and Health
Heather Guetterman, Loretta Auvil, Nate Russell, Michael Welge, Matt Berry, Lisa Gatzke, Colleen
Bushell, Hannah Holscher
The FASEB Journal (2016-04-01) http://www.fasebj.org/content/30/1_Supplement/406.3
318. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological
Insights
Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, Nicola Segata
PLOS Computational Biology (2016-07-11) https://doi.org/10.1371/journal.pcbi.1004977
320. Correction: Class Prediction and Feature Selection with Linear Optimization for
Metagenomic Count Data
Zhenqiu Liu, Dechang Chen, Li Sheng, Amy Y. Liu
PLoS ONE (2014-05-12) https://doi.org/10.1371/journal.pone.0097958
336. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads
Vladimír Boža, Broňa Brejová, Tomáš Vinař
PLOS ONE (2017-06-05) https://doi.org/10.1371/journal.pone.0178751
338. Creating a universal SNP and small indel variant caller with deep neural networks
Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam S. Gross, Cory Y.
McLean, Mark A. DePristo
Cold Spring Harbor Laboratory (2016-12-14) https://doi.org/10.1101/092890
339. A framework for variation discovery and genotyping using next-generation DNA
sequencing data
Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl,
Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, … Mark J Daly
Nature Genetics (2011-04-10) https://doi.org/10.1038/ng.806
342. Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data
Remi Torracinta, Laurent Mesnard, Susan Levine, Rita Shaknovich, Maureen Hanson, Fabien
Campagne
Cold Spring Harbor Laboratory (2016-10-04) https://doi.org/10.1101/079087
348. Machines that learn to segment images: a crucial technology for connectomics
Viren Jain, H Sebastian Seung, Srinivas C Turaga
Current Opinion in Neurobiology (2010-10) https://doi.org/10.1016/j.conb.2010.07.004
349. Model-based Bayesian inference of neural activity and connectivity from all-optical
interrogation of a neural circuit
Laurence Aitchison, Lloyd Russell, Adam M. Packer, Jinyao Yan, Philippe Castonguay, Michael
Hausser, Srinivas C. Turaga
Advances in Neural Information Processing Systems 30 (2017) http://papers.nips.cc/paper/6940-model-based-bayesian-inference-of-neural-activity-and-connectivity-from-all-optical-interrogation-of-a-neural-circuit
352. Advantages and disadvantages of using artificial neural networks versus logistic
regression for predicting medical outcomes
Jack V. Tu
Journal of Clinical Epidemiology (1996-11) https://doi.org/10.1016/s0895-4356(96)00002-9
353. Use of an Artificial Neural Network for the Diagnosis of Myocardial Infarction
William G. Baxt
Annals of Internal Medicine (1991-12-01) https://doi.org/10.7326/0003-4819-115-11-843
355. The use of artificial neural networks in decision support in cancer: A systematic review
Paulo J. Lisboa, Azzam F.G. Taktak
Neural Networks (2006-05) https://doi.org/10.1016/j.neunet.2005.10.007
360. Recurrent Neural Networks for Multivariate Time Series with Missing Values
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, Yan Liu
arXiv (2016-06-06) https://arxiv.org/abs/1606.01865v2
362. Phenotyping of Clinical Time Series with LSTM Recurrent Neural Networks
Zachary C. Lipton, David C. Kale, Randall C. Wetzel
arXiv (2015-10-26) https://arxiv.org/abs/1510.07641v2
363. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement
learning approach
Shamim Nemati, Mohammad M. Ghassemi, Gari D. Clifford
2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology
Society (EMBC) (2016-08) https://doi.org/10.1109/embc.2016.7591355
364. From vital signs to clinical outcomes for patients with sepsis: a machine learning basis
for a clinical decision support system
Eren Gultepe, Jeffrey P Green, Hien Nguyen, Jason Adams, Timothy Albertson, Ilias Tagkopoulos
Journal of the American Medical Informatics Association (2014-03) https://doi.org/10.1136/amiajnl-2013-001815
365. Imaging-based enrichment criteria using deep learning algorithms for efficient clinical
trials in mild cognitive impairment
Vamsi K. Ithapu, Vikas Singh, Ozioma C. Okonkwo, Richard J. Chappell, N. Maritza Dowling,
Sterling C. Johnson
Alzheimer’s & Dementia (2015-12) https://doi.org/10.1016/j.jalz.2015.01.010
366. Integrated deep learned transcriptomic and structure-based predictor of clinical trials
outcomes
Artem V Artemov, Evgeny Putin, Quentin Vanhaelen, Alexander Aliper, Ivan V Ozerov, Alex
Zhavoronkov
Cold Spring Harbor Laboratory (2016-12-20) https://doi.org/10.1101/095653
368. An analysis of the attrition of drug candidates from four major pharmaceutical
companies
Michael J. Waring, John Arrowsmith, Andrew R. Leach, Paul D. Leeson, Sam Mandrell, Robert M.
Owen, Garry Pairaudeau, William D. Pennie, Stephen D. Pickett, Jibo Wang, … Alex Weir
Nature Reviews Drug Discovery (2015-06-19) https://doi.org/10.1038/nrd4609
369. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules,
Genes, and Disease
J. Lamb
Science (2006-09-29) https://doi.org/10.1126/science.1132939
375. Drug repositioning for non-small cell lung cancer by using machine learning
algorithms and topological graph theory
Chien-Hung Huang, Peter Mu-Hsin Chang, Chia-Wei Hsu, Chi-Ying F. Huang, Ka-Lok Ng
BMC Bioinformatics (2016-01-11) https://doi.org/10.1186/s12859-015-0845-0
376. Machine Learning Prediction of Cancer Cell Sensitivity to Drugs Based on Genomic
and Chemical Properties
Michael P. Menden, Francesco Iorio, Mathew Garnett, Ultan McDermott, Cyril H. Benes, Pedro J.
Ballester, Julio Saez-Rodriguez
PLoS ONE (2013-04-30) https://doi.org/10.1371/journal.pone.0061318
378. Computational Discovery of Putative Leads for Drug Repositioning through Drug-Target Interaction Prediction
Edgar D. Coelho, Joel P. Arrais, José Luís Oliveira
PLOS Computational Biology (2016-11-28) https://doi.org/10.1371/journal.pcbi.1005219
379. Large-Scale Off-Target Identification Using Fast and Accurate Dual Regularized One-Class Collaborative Filtering and Its Application to Drug Repurposing
Hansaim Lim, Aleksandar Poleksic, Yuan Yao, Hanghang Tong, Di He, Luke Zhuang, Patrick Meng,
Lei Xie
PLOS Computational Biology (2016-10-07) https://doi.org/10.1371/journal.pcbi.1005135
382. A guide to drug discovery: Hit and lead generation: beyond high-throughput screening
Konrad H. Bleicher, Hans-Joachim Böhm, Klaus Müller, Alexander I. Alanine
Nature Reviews Drug Discovery (2003-05) https://doi.org/10.1038/nrd1086
384. Influence Relevance Voting: An Accurate And Interpretable Virtual High Throughput
Screening Method
S. Joshua Swamidass, Chloé-Agathe Azencott, Ting-Wan Lin, Hugo Gramajo, Shiou-Chuan Tsai,
Pierre Baldi
Journal of Chemical Information and Modeling (2009-04-27) https://doi.org/10.1021/ci8004379
399. Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the
Performance of Expert-developed QSAR/QSPR Models
Garrett B. Goh, Charles Siegel, Abhinav Vishnu, Nathan O. Hodas, Nathan Baker
arXiv (2017-06-20) https://arxiv.org/abs/1706.06689v1
413. TopologyNet: Topology based deep convolutional and multi-task neural networks for
biomolecular property predictions
Zixuan Cang, Guo-Wei Wei
PLOS Computational Biology (2017-07-27) https://doi.org/10.1371/journal.pcbi.1005690
420. Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural
Networks
Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, Mark P. Waller
arXiv (2017-01-05) https://arxiv.org/abs/1701.01329v1
424. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E.
Turner, Douglas Eck
arXiv (2016-11-09) https://arxiv.org/abs/1611.02796v9
430. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
Alex Kendall, Yarin Gal
arXiv (2017-03-15) https://arxiv.org/abs/1703.04977v2
431. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and
Semantics
Alex Kendall, Yarin Gal, Roberto Cipolla
arXiv (2017-05-19) https://arxiv.org/abs/1705.07115v1
433. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized
Likelihood Methods
John C. Platt
Advances in Large Margin Classifiers (1999) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
438. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
Nicholas Carlini, David Wagner
arXiv (2017-05-20) https://arxiv.org/abs/1705.07263v2
440. Leveraging uncertainty information from deep neural networks for disease detection
Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, Siegfried Wahl
Scientific Reports (2017-12) https://doi.org/10.1038/s41598-017-17876-z
443. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell
arXiv (2016-12-05) https://arxiv.org/abs/1612.01474v3
446. Deep Neural Networks are Easily Fooled: High Confidence Predictions for
Unrecognizable Images
Anh Nguyen, Jason Yosinski, Jeff Clune
arXiv (2014-12-05) https://arxiv.org/abs/1412.1897v4
447. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
arXiv (2016-02-16) https://arxiv.org/abs/1602.04938v3
451. Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
arXiv (2013-12-20) https://arxiv.org/abs/1312.6034v2
453. Investigating the influence of noise and distractors on the interpretation of neural
networks
Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, Sven Dähne
arXiv (2016-11-22) https://arxiv.org/abs/1611.07270v1
456. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi
Parikh, Dhruv Batra
arXiv (2016-10-07) https://arxiv.org/abs/1610.02391v3
461. Maximum Entropy Methods for Extracting the Learned Features of Deep Neural
Networks
Alex I Finnegan, Jun S Song
Cold Spring Harbor Laboratory (2017-02-03) https://doi.org/10.1101/105957
467. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard
Zemel, Yoshua Bengio
arXiv (2015-02-10) https://arxiv.org/abs/1502.03044v3
468. Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures
Laura Deming, Sasha Targ, Nate Sauder, Diogo Almeida, Chun Jimmie Ye
arXiv (2016-05-23) https://arxiv.org/abs/1605.07156v1
469. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention
Mechanism
Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart,
Jimeng Sun
arXiv (2016-08-19) https://arxiv.org/abs/1608.05745v4
473. LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in Recurrent Neural
Networks
Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, Alexander M. Rush
arXiv (2016-06-23) https://arxiv.org/abs/1606.07461v2
474. Automatic Rule Extraction from Long Short Term Memory Networks
W. James Murdoch, Arthur Szlam
arXiv (2017-02-08) https://arxiv.org/abs/1702.02540v2
477. Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with
Variational Autoencoders
Gregory P. Way, Casey S. Greene
Cold Spring Harbor Laboratory (2017-08-11) https://doi.org/10.1101/174474
484. Distilling Knowledge from Deep Networks with Applications to Healthcare Domain
Zhengping Che, Sanjay Purushotham, Robinder Khemani, Yan Liu
arXiv (2015-12-11) https://arxiv.org/abs/1512.03542v1
488. DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks
using MRI and fMRI
Saman Sarraf, Danielle D. DeSouza, John Anderson, Ghassem Tofighi
Cold Spring Harbor Laboratory (2016-08-21) https://doi.org/10.1101/070441
490. A general framework for estimating the relative pathogenicity of human genetic
variants
Martin Kircher, Daniela M Witten, Preti Jain, Brian J O’Roak, Gregory M Cooper, Jay Shendure
Nature Genetics (2014-02-02) https://doi.org/10.1038/ng.2892
496. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and
Activations
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio
arXiv (2016-09-22) https://arxiv.org/abs/1609.07061v1
503. Experiments on Parallel Training of Deep Neural Network using Model Averaging
Hang Su, Haoyu Chen
arXiv (2015-07-05) https://arxiv.org/abs/1507.01239v2
CGBVS-DNN: Prediction of Compound-protein Interactions Based on Deep Learning
… Yasushi Okuno
Molecular Informatics (2016-08-12) https://doi.org/10.1002/minf.201600045
510. MapReduce
Jeffrey Dean, Sanjay Ghemawat
Communications of the ACM (2008-01-01) https://doi.org/10.1145/1327452.1327492
520. Ensemble-Compression: A New Method for Parallel Training of Deep Neural Networks
Shizhao Sun, Wei Chen, Jiang Bian, Xiaoguang Liu, Tie-Yan Liu
arXiv (2016-06-02) https://arxiv.org/abs/1606.00575v2
524. The real cost of sequencing: scaling computation to keep pace with data generation
Paul Muir, Shantao Li, Shaoke Lou, Daifeng Wang, Daniel J Spakowicz, Leonidas Salichos, Jing
Zhang, George M. Weinstock, Farren Isaacs, Joel Rozowsky, Mark Gerstein
Genome Biology (2016-03-23) https://doi.org/10.1186/s13059-016-0917-0
A view of cloud computing
Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, … Matei Zaharia
Communications of the ACM (2010-04-01) https://doi.org/10.1145/1721654.1721672
534. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis
Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun, Sudhir Kumar, Jieping Ye, Shuiwang Ji
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining - KDD ’15 (2015) https://doi.org/10.1145/2783258.2783304
535. Deep convolutional neural networks for annotating gene expression patterns in the
mouse brain
Tao Zeng, Rongjian Li, Ravi Mukkamala, Jieping Ye, Shuiwang Ji
BMC Bioinformatics (2015-05-07) https://doi.org/10.1186/s12859-015-0553-9
539. Deep Learning based multi-omics integration robustly predicts survival in liver cancer
Kumardeep Chaudhary, Olivier B. Poirion, Liangqun Lu, Lana X. Garmire
Cold Spring Harbor Laboratory (2017-03-08) https://doi.org/10.1101/114892
540. FIDDLE: An integrative deep learning framework for functional genomic data inference
Umut Eser, L. Stirling Churchman
Cold Spring Harbor Laboratory (2016-10-17) https://doi.org/10.1101/081380
549. Proof of prespecified endpoints in medical research with the bitcoin blockchain
Benjamin Gregory Carlisle
The Grey Literature (2014-08-25) https://www.bgcarlisle.com/blog/2014/08/25/proof-of-prespecified-endpoints-in-medical-research-with-the-bitcoin-blockchain/