Bioinformatics: Missing Value Estimation Methods For DNA Microarrays
6 2001
BIOINFORMATICS Pages 520–525
Received on November 13, 2000; revised on February 22, 2001; accepted on February 26, 2001
ABSTRACT
Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data.
Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
Availability: The software is available at http://smi-web.stanford.edu/projects/helix/pubs/impute/
Contact: russ.altman@stanford.edu
∗ To whom correspondence should be addressed.
INTRODUCTION
DNA microarray technology allows for the monitoring of expression levels of thousands of genes under a variety of conditions (DeRisi et al., 1997; Spellman et al., 1998). Microarrays have been used to study a variety of biological processes, from differential gene expression in human tumors (Perou et al., 2000) to yeast sporulation (Chu et al., 1998). Various analysis techniques have been developed, aimed primarily at identifying regulatory patterns or similarities in expression under similar conditions. Commonly used analysis methods include clustering techniques (Eisen et al., 1998; Tamayo et al., 1999), techniques based on partitioning of data (Heyer et al., 1999; Tamayo et al., 1999), as well as various supervised learning algorithms (Alter et al., 2000; Brown et al., 2000; Golub et al., 1999; Raychaudhuri et al., 2000; Hastie et al., 2000).
The data from microarray experiments is usually in the form of large matrices of expression levels of genes (rows) under different experimental conditions (columns) and frequently with some values missing. Missing values occur for diverse reasons, including insufficient resolution, image corruption, or simply due to dust or scratches on the slide. Missing data may also occur systematically as a result of the robotic methods used to create them. Our informal analysis of the distribution of missing data in real samples shows a combination of all of these, but none dominating. Such suspicious data is usually manually flagged and excluded from subsequent analysis (Alizadeh et al., 2000). Many analysis methods, such as principal components analysis or singular value decomposition, require complete matrices (Alter et al., 2000; Raychaudhuri et al., 2000). Of course, one solution to the missing data problem is to repeat the experiment. This strategy can be expensive, but has been used in
© Oxford University Press 2001
Missing values in DNA microarrays
validation of microarray analysis algorithms (Butte et al., 2001). Missing log2 transformed data are often replaced by zeros (Alizadeh et al., 2000) or, less often, by an average expression over the row, or ‘row average’. This approach is not optimal, since these methods do not take into consideration the correlation structure of the data. Thus, many analysis techniques, as well as other analysis methods such as hierarchical clustering, k-means clustering, and self-organizing maps, may benefit from using more accurately estimated missing values.
There is not a large published literature concerning missing value estimation for microarray data, but much work has been devoted to similar problems in other fields. The question has been studied in contexts of non-response
Each data set was pre-processed for the evaluation by removing rows and columns containing missing expression values, yielding ‘complete’ matrices. The methods were then evaluated over each dataset as follows. Between 1 and 20% of the data were deleted at random to create test data sets. Each method was then used to recover the introduced missing values for each data set, and the estimated values were compared to those in the original data set. The metric used to assess the accuracy of estimation (henceforth referred to as normalized RMS error) was calculated as the Root Mean Squared (RMS) difference between the imputed matrix and the original matrix, divided by the average data value in the complete data set. This normalization allowed for comparison of estimation accu-
O.Troyanskaya et al.
Fig. 1. Effect of number of nearest neighbors used for KNN-based [...] (x-axis: Number of genes used as neighbors)
by sorting the eigengenes based on their corresponding eigenvalue. Although it has been shown by Alter et al. (2000) that several significant eigengenes are sufficient to describe most of the expression data, the exact fraction of eigengenes best for estimation needs to be determined
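To make the role of the eigengene fraction concrete, here is a hedged Python sketch of an SVD-based imputation loop. Only the ideas of ranking eigengenes by eigenvalue, regressing a gene against the leading eigengenes, and iterating until a threshold is reached come from the text. The row-average starting point, the total-absolute-change convergence test, the default `tol`, and every name in the code are assumptions made for illustration, so this should not be read as the SVDimpute implementation.

```python
import numpy as np

def svd_impute(data, fraction=0.2, tol=0.01, max_iter=100):
    """Sketch of SVD-based imputation; genes are rows, NaN marks missing values."""
    missing = np.isnan(data)
    filled = data.copy()
    # Assumed initialisation: start from row averages.
    row_means = np.nanmean(data, axis=1)
    rows, cols = np.where(missing)
    filled[rows, cols] = row_means[rows]
    # Number of eigengenes kept, as a fraction of those available.
    n_eig = max(1, int(round(fraction * min(data.shape))))
    for _ in range(max_iter):
        # Rows of vt are the eigengenes; NumPy returns them ordered by
        # decreasing singular value, so the first n_eig are the most significant.
        _, _, vt = np.linalg.svd(filled, full_matrices=False)
        eig = vt[:n_eig]
        updated = filled.copy()
        for g in np.where(missing.any(axis=1))[0]:
            obs = ~missing[g]
            # Least-squares regression of the gene's observed values on the
            # eigengenes, restricted to the observed columns.
            coef, *_ = np.linalg.lstsq(eig[:, obs].T, data[g, obs], rcond=None)
            # Predict the missing columns from the fitted coefficients.
            updated[g, ~obs] = coef @ eig[:, ~obs]
        change = np.abs(updated - filled).sum()
        filled = updated
        if change < tol:  # assumed stopping rule
            break
    return filled

# Hypothetical low-noise, time-series-like data built from two smooth patterns.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 14)
patterns = np.vstack([np.sin(t), np.cos(t)])
complete = rng.normal(size=(200, 2)) @ patterns + 0.05 * rng.normal(size=(200, 14))
data = complete.copy()
holes = rng.random(complete.shape) < 0.05
data[holes] = np.nan

estimate = svd_impute(data, fraction=0.2)
print("mean absolute error on deleted entries:",
      np.abs(estimate - complete)[holes].mean())
```

Varying `fraction` on a given data set is one way to explore how sensitive the estimate is to the number of eigengenes retained.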
[Figure plots not recoverable from extraction. Surviving axis labels: 'Number of arrays in data set', 'Percent eigengenes used', 'Normalized RMS error', 'Normalized error'. Surviving legend entries: KNN, SVD; 1%, 5%, 10%, 15%, and 20% entries missing.]
estimation problem. In fact, optimal selection of K likely depends on the average cluster size for the given data set. Second, there may be significant noise present in microarray data. As K increases, the contribution of noise to the estimate overwhelms the contribution of the signal, leading to a decrease in accuracy.
To assess the variance in RMS error over repeated estimations for the same file with the same percent of missing values removed, we performed 60 additional runs of missing value removal and subsequent estimation on one of the time series data sets. At 5% values missing and K = 123, the average RMS error was 0.203, with variance of 0.001. Thus, our evaluation method appears to be reliable.
Although microarray experiments typically involve a large number of arrays, sometimes experimenters need to analyze data sets with small numbers of experiments (columns in the matrix). KNNimpute can accurately estimate data for matrices with as low as six columns (Figure 3). We do not recommend using this method on matrices with less than four columns.
SVDimpute
To determine the optimal parameter set for SVDimpute, the method was evaluated using the most significant 5, 10, 20, and 30% of the eigengenes for estimation (Figure 4). The most accurate estimation is achieved when approximately 20% of the eigengenes are used for estimation. In contrast with KNNimpute, where the error curve appears relatively flat between 10 and 20 neighbors, performance of the SVD-based method deteriorates sharply as the number of eigengenes used is changed.
Although SVD-based estimation provides significantly higher accuracy than row average on all data sets, its performance is sensitive to the type of data being analyzed. SVDimpute yields best results on time-series data with low noise level (Figures 5 and 6). Under such conditions the method performs better than KNNimpute if the right number of eigengenes is used for estimation (Figure 6). This likely reflects the signal-processing nature of the SVD-based method. When the expression data is dominated by the combined effect of strong patterns of regulation over time (as in time-series data), SVD is ideally suited to estimating expression of an individual gene in terms of these constituent patterns. In contrast, the KNN-based method exhibits higher performance for both noisy time series data and non-time series data. As SVD-based estimation is essentially a linear regression method in lower-dimensional space, this deterioration in performance is not surprising for non-time series data, where a clear expression pattern is often not present. The slightly lower sensitivity to noise compared to KNNimpute is most likely due to the fact that expression patterns for smaller groups of genes can sometimes not be sufficiently represented in the dominant eigengenes used for estimation.
Row average
Estimation by row (gene) average, although an improvement upon replacing missing values with zeros, yielded drastically lower accuracy than either KNN- or SVD-based estimation (Figure 5). As expected, the method performs most poorly on non-time series data (normalized RMS error of 0.40 and more), but error on other data sets was also significantly higher than both of the other methods. This is not surprising, since this row averaging assumes that the expression of a gene in one of the experiments is similar to its expression in a different experiment, which is often not true. In contrast to SVD and KNN, row average does not take advantage of the rich information provided by the expression patterns of other genes (or even duplicate runs of the same gene) in the data set.
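The contrast between row average and a neighbor-based estimate can be illustrated with a short Python sketch. It is not the authors' C++ KNNimpute: this excerpt names the method as weighted K-nearest neighbors but does not spell out its details, so the Euclidean distance over the target gene's observed columns, the inverse-distance weights, the restriction of candidate neighbors to genes with no gaps in those columns, and all function names are assumptions.

```python
import numpy as np

def row_average_impute(data):
    """Replace each NaN with the mean of the observed values in its row (gene)."""
    filled = data.copy()
    row_means = np.nanmean(data, axis=1)
    rows, cols = np.where(np.isnan(data))
    filled[rows, cols] = row_means[rows]
    return filled

def knn_impute(data, k=10):
    """Weighted K-nearest-neighbor sketch; genes are rows, NaN marks missing."""
    filled = data.copy()
    missing = np.isnan(data)
    for g in np.where(missing.any(axis=1))[0]:
        observed = ~missing[g]
        for j in np.where(missing[g])[0]:
            # Candidate neighbors: genes with a value in column j and no gaps
            # in the columns where gene g is observed (g itself is excluded
            # because it is missing in column j).
            usable = ~missing[:, j] & ~missing[:, observed].any(axis=1)
            candidates = np.where(usable)[0]
            if candidates.size == 0:
                filled[g, j] = np.nanmean(data[g])  # fall back to row average
                continue
            diffs = data[candidates][:, observed] - data[g, observed]
            dists = np.sqrt((diffs ** 2).sum(axis=1))  # assumed: Euclidean
            nearest = np.argsort(dists)[:k]
            weights = 1.0 / (dists[nearest] + 1e-12)   # assumed: inverse distance
            filled[g, j] = np.average(data[candidates[nearest], j], weights=weights)
    return filled

def rms(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Hypothetical data: five expression patterns, 40 noisy genes following each.
rng = np.random.default_rng(0)
patterns = rng.normal(size=(5, 14))
complete = np.repeat(patterns, 40, axis=0) + 0.1 * rng.normal(size=(200, 14))
data = complete.copy()
holes = rng.random(complete.shape) < 0.05
data[holes] = np.nan

print("KNN RMS error:        ", rms(knn_impute(data, k=10), complete))
print("row average RMS error:", rms(row_average_impute(data), complete))
```

On clustered synthetic data like this, genes that share a pattern carry information about each other's missing entries, which row average ignores by construction. Rerunning with different values of `k` gives a feel for the dependence on neighborhood size discussed above.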
[Figure 5 plot not recoverable. Axes: percent of entries missing (x) against normalized RMS error (y). Series: row average, SVDimpute, KNNimpute, filled with zeros.]
Fig. 5. Comparison of KNN, SVD, and row average based estimations’ performance on a noisy time series data set. The same data set (with identical entries missing) was used to assess the [...]
[Figure 6 plot not recoverable. X-axis: percent of entries missing. Series: KNN and SVD, each on non-time series, noisy time series, and time series data.]
Fig. 6. Performance of KNNimpute and SVDimpute methods on different types of data as a function of entries missing. Best performance of each of the methods was plotted. Three sets of curves represent three data sets (non-time series: top, noisy time series: middle, time series: bottom).
Although an in-depth study was not performed on column average, some experiments were performed with this method and it does not yield satisfactory performance (results not shown).
Performance
For a matrix of m rows (genes) and n columns (experiments), the computational complexity of the KNNimpute method is approximately $O(m^2 n)$, assuming $m \gg k$ and fewer than 20% of the values missing. The computational complexity of a full SVD calculation is $O(n^2 m)$. However, SVDimpute utilizes an expectation-maximization algorithm, thus bringing the complexity to $O(n^2 m i)$, where i is the number of iterations performed before the threshold value is reached. The row average algorithm is the fastest, with computational complexity of $O(nm)$. The KNNimpute method, implemented in C++, takes 3.23 min on a Pentium III 500 MHz computer to estimate missing values for a data set with 6153 genes and 14 experiments, with 10% of the entries missing.
CONCLUSIONS
KNN- and SVD-based methods provide fast and accurate ways of estimating missing values for microarray data. Both methods far surpass the currently accepted solutions (filling missing values with zeros or row average) by taking advantage of the correlation structure of the data to estimate missing expression values. Based on the results of our study, we recommend the KNN-based method for imputation of missing values.
Although both KNN and SVD methods are robust to increasing the fraction of data missing, KNN-based imputation shows less deterioration in performance with increasing percent of missing entries. In addition, the KNNimpute method is more robust than SVD to the type [...] sharp deterioration in performance when a non-optimal fraction of missing values is used. From the biological standpoint, KNNimpute has the advantage of providing accurate estimation for missing values in genes that belong to small tight expression clusters. Missing points for such genes could be estimated poorly by SVD-based estimation if their expression pattern is not similar to any of the eigengenes used for regression.
KNN-based imputation provides for a robust and sensitive approach to estimating missing data for microarrays. However, it is important to exercise caution when drawing critical biological conclusions from data that is partially imputed. The goal of this method is to provide an accurate way of estimating missing values in order to minimally bias the performance of microarray analysis methods. However, estimated data should be flagged where possible and its significance on the discovery of biological results should be assessed in order to avoid drawing unwarranted conclusions.
ACKNOWLEDGEMENTS
We would like to thank Soumya Raychaudhuri and Joshua Stuart for thoughtful comments on the manuscript and discussions, and Orly Alter and Mike Liang for helpful suggestions. O.T. is supported by a Howard Hughes Medical Institute predoctoral fellowship and by a Stanford Graduate Fellowship. M.C. is supported by NIH training grant LM-07033. T.H. is partially supported by NSF grant DMS-9803645 and NIH grant ROI-CA-72028-01. R.T. is supported by the NIH grant 2 R01 CA72028, and NSF grant DMS-9971405. D.B. is partially supported by CA 77097 from the NCI. R.B.A. is supported by NIH-GM61374, NIH-LM06244, NSF DBI-9600637, SUN Microsystems and a grant from the Burroughs-Wellcome Foundation.