Statistics is Easy!
Case Studies on Real Scientific Datasets
Manpreet Singh Katari, New York University
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S01078ED1V01Y202102MAS039
Lecture #39
Series Editor: Steven G. Krantz, Washington University, St. Louis
Series ISSN
Print 1938-1743 Electronic 1938-1751
Statistics is Easy
Case Studies on Real Scientific Datasets
Manpreet Singh Katari
New York University
Sudarshini Tyagi
Goldman Sachs
Dennis Shasha
New York University
Morgan & Claypool Publishers
ABSTRACT
Computational analysis of natural science experiments often confronts noisy data due to natural
variability in environment or measurement. Drawing conclusions in the face of such noise entails
a statistical analysis.
Parametric statistical methods assume that the data is a sample from a population that can
be characterized by a specific distribution (e.g., a normal distribution). When the assumption
is true, parametric approaches can lead to high confidence predictions. However, in many cases
particular distribution assumptions do not hold. In that case, assuming a distribution may yield
false conclusions.
The companion book Statistics is Easy! gave a (nearly) equation-free introduction to non-
parametric (i.e., no distribution assumption) statistical methods. The present book applies data
preparation, machine learning, and nonparametric statistics to three quite different life science
datasets. We provide the code as applied to each dataset in both R and Python 3. We also include
exercises for self-study or classroom use.
KEYWORDS
scientific data, case studies, nonparametric statistics, machine learning, data cleaning, null value imputation
Readers can find Statistics is Easy! Second Edition by Dennis Shasha and Manda Wilson at
https://doi.org/10.2200/S00295ED1V01Y201009MAS008 or http://bit.ly/EasyStats2.
Contents
Acknowledgments
1 Introduction
1.1 Basic Workflow
1.2 Technology Choices
1.2.1 Data Preparation Technologies
1.2.2 Method Selection
1.2.3 Statistical Analysis Technologies
1.3 Overview of the Case Studies
2 Chick Weight and Diet
3 Breast Cancer Classification
4 RNA-seq Data Set
Bibliography
Authors' Biographies
Acknowledgments
First, we would like to thank our students whose questions over the years have shaped the narrative of this book. Second, we'd like to thank Professor Kristin Gunsalus for her careful review and thoughtful comments. Third, we'd like to thank the editorial and publishing team at Morgan & Claypool: Diane Cerra, Christine Kiilerich, Sara Kreisman, C. L. Tondo, Laurel Muller, Brent Beckley, and Sue Beckley. Last and certainly not least, we'd like to thank our families for enduring our absences while we worked on this book, which, though concise, required sustained effort over two years.
Shasha has been partially supported by NIH 1R01GM121753-01A1 and by the U.S. National Science Foundation under grants MCB-1412232, IOS-1339362, and MCB-0929339. Katari has been partially supported by NIH-NIGMS: 1 R01 GM121753-01 and DOE-BER: DE-SC0014377. This support is greatly appreciated.
A special acknowledgment to Rohini, Anjini, and Reena Katari for their continued sup-
port and inspiration.
CHAPTER 1
Introduction
The book is aimed at computer scientists, bioinformaticians, and computationally oriented scientists. We describe three case studies in order to illustrate different aspects of data analysis. All code is provided in both R and Python 3 and is available on GitHub (https://github.com/StatisticsIsEasy/StatisticsIsEasy and https://github.com/StatisticsIsEasy/CaseStudies).
In the Basic Workflow section of the present chapter, we describe the steps of any data-
driven study: data preparation, analysis (including machine learning), and statistical evaluation.
In the Technology Choices section, we describe alternative methods to achieve the objec-
tive of each step of the analysis workflow.
Finally, in the Case Studies Overview section, we briefly describe each case study, its
dataset, and the techniques applied for each of its workflow steps.
Faced with a dataset and an analysis goal of your own, we suggest that you find the most
similar case study from this book and use our code as a starting template to solve your problem.
You may then add and substitute code from elsewhere onto that basic scaffold.
1. Data preparation—Put the data in good form for analysis. This constitutes the majority
of the work for a practicing data scientist and consists of the following general steps.
(a) Format the data used in the dataset, usually through some form of parsing. The end
result is often a table. For example, for the breast cancer dataset, each row corresponds
to a patient and each column value is some tumor measurement.
(b) Normalize raw data so that different features or experimental measurements are ren-
dered comparable.
(c) Handle missing data by ignoring some inputs or imputing missing values.
2. Analysis and statistical evaluation—Analyze the prepared data (often with machine learning) and assess the results statistically. This stage entails several design choices, including the following.
(a) The choice between a paired vs. unpaired statistical analysis. For example, suppose
we are evaluating a treatment, and that one group of patients receive the treatment
and a different group does not (the control group). We must use an unpaired anal-
ysis, because they are different people. On the other hand, if the same patients are
measured before and after treatment (as when sick patients are given a medicine in
the cystic fibrosis study), then we can use a paired test, because we are comparing the
same person before and after a treatment. A paired test is better at detecting subtle
effects, because it removes the effect of confounding factors due to differences among
patient populations.
(b) The choice of accuracy measure. For classification problems, these are based on no-
tions of “precision” (roughly, the fraction of predictions of the class of interest that
are correct) and “recall” (roughly, the fraction of entities in the class of interest that
are correctly predicted). For problems in which we are trying to predict a numerical
value, the accuracy measure is based on either a relative or absolute numerical error
of a predicted value to an actual value. We discuss both of these cases later in this
chapter.
(c) The task of determining whether some causal factor (e.g., an experimental pertur-
bation) might be associated with some change in a measurement of interest more
than would be expected from random chance alone. For example, the putative causal
factor might be a change in diet and the measurement might be the final chicken
weight. If the measurement with the factor (e.g., new diet) is different enough from
the measurement without the factor (e.g., normal diet), then the difference would be
said to have a low p-value. When the p-value is low enough, then random chance is
unlikely to be an explanation of the difference.
(d) Choosing the correct multiple hypothesis testing correction procedure. When one
asks about many phenomena, e.g., which of thousands of genes is affected by cystic
fibrosis, computing a p-value is not enough. The reason is that by random chance,
some changes to genes that have a low p-value would materialize, even if health status
had no real effect on genes.
The choice then is which multiple hypothesis testing correction procedure to use.
The false discovery rate is the most common. The false discovery rate is an estimate
of the fraction of genes that might fall below a certain p-value threshold simply due
to chance (i.e., unrelated to health status).
(e) Estimating the range of the effect we can expect by choosing one course of action
rather than another. In the chicken case study, for example, we might ask about the
range of the expected magnitude of the weight gain for a chicken on a new diet
compared with a chicken on a normal diet.
(f) The determination of which features of some treatment or condition most influence the result. For example, in the breast cancer case study, we want to find the features of cell nuclei that are most diagnostic for breast cancer.
Normalization
Most analyses will have to compare different features and weight their effect on some metric of
interest. To take an example from normal life, a person’s temperature is more important than
his or her shoe size in determining whether that person feels sick. If the different features can
take vastly different ranges of values, that may skew the weighting in any analysis.
Certain computational methods therefore perform z scaling of the values: For each value
v of feature X , scaling subtracts the mean of X from v and then divides that result by the
standard deviation of X . Scaling makes sure each value of X is transformed to a number of
standard deviations away from the mean of X, resulting in what is known as the z-score:

$$z = \frac{v - \mu_X}{\sigma_X} \qquad (1.1)$$
For example, let’s take 15 randomly generated body mass index (BMI) values1 :
import numpy as np
bmi = np.random.randint(18,32,15)
bmi
array([30, 22, 21, 25, 25, 29, 23, 20, 25, 27, 20, 18, 29, 27, 19])
Now we calculate the mean and standard deviation of these array values. Then, for each
value, we will subtract the mean and divide by the standard deviation to get the z-score scaled
values.
mu = np.mean(bmi)
sigma = np.std(bmi)
(bmi - mu)/sigma
Different normalization methods can work better for different data. For example, when
preparing each RNA-seq dataset for the cystic fibrosis analysis, we may be more interested in the
relative amount of each gene’s RNA compared to other genes rather than the absolute amount.
Thus, normalization would consist of dividing each gene’s counts by the total counts for a given
RNA-seq reading. This calculates the proportion of the counts that belong to the gene for that
reading. We do this for all RNA-Seq readings so they can now be comparable [Bushel, 2020].
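As a rough sketch of that idea (ours, with an invented toy table; the case study's actual column names differ), per-sample proportions can be computed in pandas as follows:

import pandas as pd

# Hypothetical raw count table: rows are genes, columns are RNA-seq samples.
counts_df = pd.DataFrame({"sample_1": [120, 30, 850],
                          "sample_2": [95, 40, 700]},
                         index=["gene_A", "gene_B", "gene_C"])

# Divide each column by its total so that every sample sums to one
# (or to one million for "counts per million"), making samples with
# different sequencing depths comparable.
proportions = counts_df / counts_df.sum()
cpm = proportions * 1_000_000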
Normalization, whether to do it and how to do it, can influence all downstream analysis, so
it’s good to see whether the results of an analysis are robust to different normalization techniques.
Missing Data
There are two basic ways to deal with missing data within datasets: (i) remove any data item
containing missing data, or (ii) infer the missing values from other data (called imputation).
1 Our github site has both R and Python code, but we use Python in the book text.
While removing the data item requires less effort and thought, it can sometimes lead to a loss
of hard-to-obtain information. For example, suppose some patient in an Alzheimer’s study is
missing a blood plasma test during a particular visit. The data for that day would include other
informative metrics measured for the patient. Throwing out the entire day’s data because of the
one missing test value might result in an excessive loss of information. Common imputation
methods include the following.
1. Method 1: Replace the missing values of some measurement with the arithmetic mean
value of that measurement. In the example above, if the blood pressure reading is missing
for a patient for a single visit, imputation could simply take the arithmetic mean of the
blood pressure readings of all other visits of that patient. If no other measurements are
available for that patient, then take the median measurement of all patient measurements.
2. Method 2: Replace the missing values with median. As in the previous method, but use
median instead of arithmetic mean.
3. Method 3: Linear interpolation. If data comes in the form of a time series per individual, then the value at time t of some measurement for an individual may be well estimated by taking the arithmetic mean of the values of that measurement for that individual at times t − 1 and t + 1.
4. Method 4: Design a machine learning model to predict the value of a missing measure-
ment given the values of other measurements. This entails building, for each measurement
type with missing values, a model based on the values of other measurement types.
Which imputation method to use can have an important effect on the results of an analysis,
so the method and its justification should be carefully explained in any research paper.
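As a concrete sketch of Methods 1–3 (our own toy example in pandas; column names are invented):

import numpy as np
import pandas as pd

# Hypothetical per-visit measurements with missing values (NaN).
visits = pd.DataFrame({"blood_pressure": [120.0, np.nan, 118.0, 125.0],
                       "plasma_test": [1.1, 1.3, np.nan, 1.2]})

mean_imputed = visits.fillna(visits.mean())          # Method 1: column mean
median_imputed = visits.fillna(visits.median())      # Method 2: column median
interpolated = visits.interpolate(method="linear")   # Method 3: linear interpolation,
                                                     # appropriate for time series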
• Precision is the number of correct cancer classifications divided by all predicted cancer
classifications. Symbolically, suppose CancAll is the set of people who have cancer;
ClassCorrect is the set of people whom the classifier claims have cancer and do; and
ClassAll is the set of people whom the classifier claims have cancer whether they do or
not. Then:
$$\mathrm{precision} = \frac{\|ClassCorrect\|}{\|ClassAll\|} \qquad (1.2)$$
(Note that the ||S|| notation means the number of members of set S.)
Higher precision means few false positives.
• Recall is the number of correct cancer classifications divided by all patients who have
cancer. Using the symbols from the previous bullet:
$$\mathrm{recall} = \frac{\|ClassCorrect\|}{\|CancAll\|} \qquad (1.3)$$
High recall means few false negatives.
• There is usually a trade-off between precision and recall. A low acceptance threshold
may yield higher recall but lower precision and conversely. To balance the two, one can
capture both in one metric called the F-score (also known as the F1-score).
The F-score can be calculated by dividing the product of precision and recall by the
sum of precision and recall and multiplying the result by 2.
The formula for the F-score is:

$$F\text{-score} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (1.4)$$
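For concreteness, here is a small sketch (ours) computing all three measures with scikit-learn on made-up labels, where 1 marks the class of interest:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # a classifier's predictions (made up)

print(precision_score(y_true, y_pred))  # correct positive calls / all positive calls
print(recall_score(y_true, y_pred))     # correct positive calls / all actual positives
print(f1_score(y_true, y_pred))         # 2 * precision * recall / (precision + recall)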
1.3 OVERVIEW OF THE CASE STUDIES
1. Chicken Diet—This dataset was originally provided by Crowder and Hand [1990], in
Analysis of Repeated Measures (example 5.3), published by Chapman & Hall. The dataset
tracks the weight gain of chicks in four different diets: normal, 10% protein, 20% protein,
and 40% protein replacement. There are 20 chicks who take the normal diet, and 10 chicks
each for the remaining three so a total of 50 chicks. Weight for each chick was measured
on the following days for three weeks: 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, and 21. Five
chicks did not make it to the 21st day so the data contains missing values. The goal is to
see how the weight of a chicken relates to a specific diet.
Technologies used:
2. Breast Cancer—This dataset comes from digitized images of breast masses (UCI Machine Learning Repository). The features describe the cell nuclei in each image. The goal of the study was to classify tumors as
benign or malignant using predictive modeling. Our reanalysis will test different machine
learning methods to see which model gives the highest diagnostic accuracy and whether
the difference in accuracy is statistically significant. In addition, the reanalysis will test
the performance of each model with missing data to see which model can handle missing
data the best. Finally, we determine which feature(s) are most important in making correct
predictions.
Technologies used:
• Data preparation: reducing 30 cell-specific attributes into 7 useful features.
• Data preparation: impute missing data.
• Analysis: machine learning techniques including logistic regression, decision
trees, random forests, and support vector machines.
• Analysis: the application of correlation to remove similar features in order to
lower the dimensionality of the machine learning analysis.
• Statistics: precision, recall, and F-measure.
3. Cystic Fibrosis—The RNA-seq data for this analysis comes from NCBI GEO
(GSE124548). The purpose of the original study was to study the effect of a drug
(Lumacaftor/Ivacaftor) to treat cystic fibrosis [Kopp et al., 2020].
RNA-seq is a method for detecting the abundance of messenger RNA (mRNA) of each
gene present in a sample drawn from a patient. Our goal in this book is to identify the
genes whose expression might suggest cystic fibrosis. The main challenges have to do with
normalization, multiple hypothesis testing, identification of significant changes, and use
of confidence intervals to determine magnitude of change.
Technologies used:
• Data preparation: normalize RNA-seq data so that the RNA levels obtained
from different samples are meaningfully comparable.
• Analysis: F-measure based on random forest prediction of diagnosis based on
changed genes.
• Analysis: Random forest for the prediction of cystic fibrosis and importance
ranking of genes.
• Statistics: Shuffle tests to determine the statistical significance of the differences
in gene expression due to disease status and drug use. Multiple hypothesis testing
correction.
CHAPTER 2
Chick Weight and Diet
Figure 2.1: Distribution of the shuffled differences of the Diet1 and Diet2 chicks. The red line
marks the observed difference between the diet groups. The p-value is the area to the right of
and including the red line divided by the area of all the blue lines. 887 out of 10,000 experiments
had a difference of means greater than or equal to 36.95. The p-value of getting a difference of
two means greater than or equal to 36.95 is thus roughly 0.0887, which is not low enough to be
statistically significant.
The p-value is the fraction of shuffle experiments that produce a difference in weight equal to or greater than the difference observed in the study. Pictorially, that is
the ratio of the area to the right of and including the red line (which is the observed difference)
over the entire area of the blue lines.
Table 2.1 summarizes the results of the pairwise comparisons against Diet1.
The table shows that the Diet3 – Diet1 comparison has the smallest p-value, suggesting
that the difference in weight is extremely unlikely to have happened by chance (roughly 0.07%).
Diet2 – Diet1 also shows a difference, but has a roughly 8.9% chance of having happened by
chance.
Table 2.1: Significance of differences in mean final weight (grams), each diet vs. Diet1

Diets            Difference in mean weight (g)    P-value
Diet2 – Diet1    36.95                            0.0887
Diet3 – Diet1    92.55                            0.0007
Diet4 – Diet1    60.81                            0.0074
Remark: As noted in the introduction, when multiple tests are performed, one or more
tests could yield a low p-value just by chance (i.e., even if there were no effect). In Chapter 4, we
discuss two corrections for this issue. The simplest and most conservative correction is called the
Bonferroni correction. Effectively, it involves multiplying the p-values by the number of tests
performed to get a “family-wise error rate.” So, if we were interested in a family-wise error rate
of 0.05, then Diet3 – Diet1 would pass (because 3 × 0.0007 is under 0.05) and similarly for
Diet4 – Diet1.
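As a quick check of that arithmetic, here is a minimal sketch applying the Bonferroni correction to the three p-values of Table 2.1:

# P-values from Table 2.1; Bonferroni multiplies each by the number of tests.
p_values = {"Diet2 - Diet1": 0.0887, "Diet3 - Diet1": 0.0007, "Diet4 - Diet1": 0.0074}
n_tests = len(p_values)
for comparison, p in p_values.items():
    corrected = min(p * n_tests, 1.0)
    verdict = "passes" if corrected < 0.05 else "does not pass"
    print(f"{comparison}: corrected p = {corrected:.4f} ({verdict} at a family-wise rate of 0.05)")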
Table 2.2: Differences in mean weight, p-values, and 90% confidence intervals for the significant diets

Diets            Difference (g)    P-value    90% Confidence Interval (g)
Diet3 – Diet1    92.55             0.0007     50.29 .. 134.71
Diet4 – Diet1    60.81             0.0074     29.05 .. 93.35
Each confidence interval runs from the 500th (5th percentile) difference in the sorted list of 10,000 bootstrap differences up to the 9,500th difference (95th percentile). The table tells us that roughly 90% of the time, we'd expect the mean of a Diet3 group to improve on the mean of a Diet1 group by between about 50 g and 135 g.
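A minimal sketch of this percentile approach (ours; diet1_weights and diet3_weights stand for arrays of final weights and are not the book's variable names):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(group_a, group_b, n_boot=10_000):
    # 90% confidence interval for the difference in mean weights, by bootstrap.
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a = rng.choice(group_a, size=len(group_a), replace=True)
        b = rng.choice(group_b, size=len(group_b), replace=True)
        diffs[i] = a.mean() - b.mean()
    diffs.sort()
    # The 500th and 9,500th sorted differences bracket the middle 90%.
    return diffs[int(0.05 * n_boot)], diffs[int(0.95 * n_boot) - 1]

# Example: bootstrap_ci(diet3_weights, diet1_weights)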
The results are presented in Table 2.2. Note that we should compute confidence intervals
only for those diets that have a sufficiently low p-value when compared with Diet1. As it hap-
pens, Diet3 and Diet4 have a statistically significant advantage over the control diet Diet1, but
Diet2 does not.
Warning to the Unwary: A common mistake is to think that a 90% confidence interval
that does not include 0 implies a low p-value (i.e., the result is unlikely to have occurred by
chance). To see that this is not always true, consider an overly minimalist experiment in which
one chick is given diet X and another chick is given diet Y. Suppose that the diet X chick weighs
more at the end by 50 g. The confidence interval would be 50 g to 50 g (just the single number).
But the p-value would be 0.5. This would thus be an utterly insignificant result that could have
had nothing to do with the diet, e.g., the diet X chick might simply have had weightier genetics.
One might conclude from this example that there simply needs to be some minimum num-
ber of data points (chicks in this case) in order to justify relying on confidence intervals alone.
Unfortunately, this is not the case. While more data points tends to reduce the p-value when
there is an effect, there is no fixed number to use. So, please take the time to do the significance
test before measuring confidence intervals. If the p-value is high, then the confidence interval is
meaningless.
[Figure 2.2: four panels plotting weight against time (days 0–20) for Diet1 and Diet3, with linear fits on the left and quadratic fits on the right.]
Figure 2.2: The linear and quadratic regression lines for diet1 and diet3. The quadratic regression
lines curve upward toward the end, potentially giving a better fit.
2.4 REGRESSION ANALYSIS
Here we will use linear (degree 1) regression. We use the polyfit() function from the
numpy package to calculate the slope and the intercept representing the rate at which the chick
gains weight. These values are determined by minimizing the squared error.
Using the polynomials and the original values used to fit the line, we can determine the
squared error of the residual. A residual is the difference between the actual point (Y in the
code) and the point on the fitted line (y_hat in the code). We calculate the squared error for
each point. To calculate the RMSE (Root Mean Squared Error), we simply take the square root
of the mean of the squared error (assuming n data points):
$$\mathrm{RMSE} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n}}$$
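As a small illustration (ours, with made-up weights for a single chick), the linear fit and its RMSE could be computed as:

import numpy as np

time = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21])                 # measurement days
weight = np.array([41, 49, 58, 72, 84, 103, 122, 138, 162, 187, 209, 215])   # grams (made up)

z = np.polyfit(time, weight, 1)   # degree 1: z[0] is the slope, z[1] the intercept
y_hat = np.polyval(z, time)       # predicted weights on the fitted line

rmse = np.sqrt(np.mean((weight - y_hat) ** 2))
print(z, rmse)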
The RMSE for the four diets using quadratic and linear methods is summarized in Ta-
ble 2.3.
To determine whether the difference between the linear and quadratic methods is statis-
tically significant, we will perform a nonparametric paired swap test. A paired test is appropriate
here since the same measurement points are being used for linear and quadratic regression.
Here is how to do this concretely. Build a table of four columns: chick, time point, error for
linear regression, error for quadratic regression. Call that the original table, orig. Compute the
RMSE of the third column and subtract from that the RMSE of the fourth column. This gives
a value origDiff, which is the difference between the RMSE of linear regression and quadratic
regression. Quadratic regression should give a lower RMSE (less of an error) because the re-
gression line does not need to be straight.
Table 2.3: The diets have relatively small differences in RMSE (Root Mean Squared Error)
between the quadratic and linear regression models. Diet3 shows the biggest difference, but
even there the RMSE difference is under 5% of the linear RMSE.
# This code computes the sum of squared errors used to obtain the RMSE
# for both linear and quadratic regression. Using the fitted coefficients z,
# calculate y_hat (the regression prediction) for each time value X[i],
# then sum (y_hat - y)**2 over all points.
def getSS(X, Y, z):
    all_diffs = np.array([])
    for i in range(len(X)):
        if len(z) == 2:
            # linear regression: z[0] is the slope and z[1] the intercept
            y_hat = (X[i] * z[0]) + z[1]
        if len(z) == 3:
            # quadratic regression
            y_hat = (X[i] * X[i] * z[0]) + (X[i] * z[1]) + z[2]
        all_diffs = np.append(all_diffs, [(y_hat - Y[i]) ** 2])
    diff_sum = sum(all_diffs)
    return diff_sum
For every point in time for every chick, we will, with a probability of 50%, switch the linear
regression and quadratic regression errors and recalculate the difference between the RMSE.
We will do this 10,000 times and count the number of times the difference was greater than the
one observed. If this is rare, then the lower root mean squared error of quadratic regression is
statistically significant.
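The original pseudocode is not reproduced here; a sketch of the procedure (ours, where err_lin and err_quad hold the per-point squared errors of the linear and quadratic fits):

import numpy as np

rng = np.random.default_rng(0)

def rmse_from_sq_errors(sq_errors):
    return np.sqrt(np.mean(sq_errors))

def paired_swap_test(err_lin, err_quad, n_shuffles=10_000):
    observed = rmse_from_sq_errors(err_lin) - rmse_from_sq_errors(err_quad)
    count = 0
    for _ in range(n_shuffles):
        # For each chick/time point, swap the two errors with probability 0.5.
        swap = rng.random(len(err_lin)) < 0.5
        lin = np.where(swap, err_quad, err_lin)
        quad = np.where(swap, err_lin, err_quad)
        if rmse_from_sq_errors(lin) - rmse_from_sq_errors(quad) >= observed:
            count += 1
    return count / n_shuffles   # p-value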
Table 2.4: Quadratic regression does not have a statistically significantly lower Root Mean
Squared Error (RMSE) than linear regression, but comes close in Diet3. The “NA” stands for
“not applicable” which arises because the p-value > 0.05. When the p-value is high, the con-
fidence interval has no meaning. Diet3 has a p-value close to 0.05, so the confidence interval
might have meaning.
The p-value of obtaining a better fit using quadratic regression compared with linear re-
gression is everywhere larger than 0.05, though it’s close to 0.05 for Diet3 (Table 2.4). Recall that
the confidence interval of a comparison is not meaningful and should not be computed when
the comparison is not statistically significant. That is why we label the confidence intervals other
than for Diet3 as “NA” meaning not applicable.
We conclude that there is no real benefit from a goodness-of-fit perspective to change
the regression model from linear to quadratic. Different data (e.g., economic data showing the
concept of diminishing returns) might gain substantially from quadratic regression.
2.5 EXERCISE
In the text, we studied the difference in Root Mean Squared Error (RMSE) between linear and
quadratic regression and found that (i) quadratic gives a lower RMSE, (ii) the difference in their
RMSEs is generally not statistically significant, except for Diet3 which has a p-value close to
0.05, and (iii) the difference is relatively small.
In this exercise, you will compute the 90% confidence interval of the RMSE with respect
to the linear regression line L for Diet1. The goal is to gain an idea of how closely another set
of chicks given Diet1 would track the original regression line L.
Hint: Bootstrap the Diet1 chicks: resample them with replacement, compute the RMSE of each resample's weights with respect to the original line L, and take the 5th and 95th percentiles of those RMSEs.
2.6 CODE
A Python notebook with the code used for the analysis can be found here:
CaseStudies/Chick_weight_diet/Chick_Weight_Diet.ipynb.
CHAPTER 3
Breast Cancer Classification
3.1.1 APPROACH
For each classification method, we will split the data into a “training” set that we will use to
build the model, and a “test” set that we will use to evaluate performance. We will then compare
our predictions with the actual outcomes (benign or malignant). To evaluate robustness, we
will repeat each analysis 100 times using different subsets of the data for training and testing,
a process called “cross-validation.” We will use these to compute confidence intervals for the
F-scores in order to compare the quality of the results of the different methods.
The code for loading the data and doing the analysis is provided in more detail on the
Github repository https://github.com/StatisticsIsEasy/CaseStudies.
1 (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
3.2 DATA
The features of the dataset are computed from digitized images of a fine needle aspirate (FNA)
of breast masses. They describe characteristics of the cell nuclei present in the image.
# The code takes the features and labels as input and outputs the F-scores
# on the training and test sets using the logistic regression method.
# (The fitting and scoring lines are our completion sketch; see the GitHub
# repository for the full code. N holds the labels used for stratification.)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
score = np.zeros(shape=(100, 2))
for i in range(100):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                        test_size=0.2,
                                                        stratify=N)
    model = LogisticRegression(max_iter=10000).fit(X_train, Y_train)
    score[i, 0] = f1_score(Y_train, model.predict(X_train))
    score[i, 1] = f1_score(Y_test, model.predict(X_test))
Figure 3.1: Decision tree classifier. The top line in each box represents the test on the data of a patient to perform. The second line, gini, is a measure of how effective the feature branching is in splitting the data. Of the 569 samples in the training set, branching on "concave points_mean <= 0.051" assigns 349 patients to 0 (B for Benign) and 220 to 1 (M for Malignant). The blue node on the right is for the 220 samples that have a concave points_mean > 0.051 and the orange node on the left is for the 349 samples that have concave points_mean <= 0.051. Out of the 220 in the blue node, a majority (192/220) of the samples are M (Malignant). The most informative further classifying feature-value for the 220 samples is "texture_mean <= 16.395" and the separation to the next level depends on whether the value is <= (left branch) or > 16.395 (right branch). The gray leaf nodes represent further levels of the decision tree.
node, the classifier tries to choose the attribute and the value that best separates Benign from
Malignant cancers.
The Decision Tree Classifier is implemented in Python in the scikit-learn library as the DecisionTreeClassifier class. Here's an example:
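The original listing is not reproduced here; a minimal sketch (ours, reusing the X_train, Y_train, and X_test variables from the split above, with illustrative settings) might look like this:

from sklearn.tree import DecisionTreeClassifier

# max_depth and random_state are illustrative choices, not the book's settings.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, Y_train)
predictions = tree.predict(X_test)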
Figure 3.2: Simple random forest consisting of three decision trees. If the instance receives two
votes for class B and one for class A, then the random forest will output B.
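In scikit-learn, a random forest is built much like the single decision tree above; a minimal sketch (ours, with an illustrative number of trees):

from sklearn.ensemble import RandomForestClassifier

# 100 trees each vote; the majority class becomes the forest's prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, Y_train)
predictions = forest.predict(X_test)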
Figure 3.3: Optimal hyperplane. A hyperplane is a line when the data is two dimensional, a
plane for three dimensional data, etc. Optimal in this example means choosing the line among
all possible separating lines (examples of which are shown in the left panel) that separates the
stars from the triangles with as large a distance (called the margin) on either side of the line as
possible.
3.5 EVALUATION
Here we have combined precision (the number of those accurately predicted to have the dis-
ease/the number predicted to have it) and recall (the number of those accurately predicted to
have the disease/the number who have it) through the F-score.
Recall that the formula for the F-score is:

$$F\text{-score} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (3.1)$$
Table 3.1 shows the performance of the models that we've just discussed. (Because of the use of random seeds, your results may be slightly different.) We see that random forests perform best, which is in fact often the case when there are few features (on the order of a few tens). On
Table 3.1: The confidence intervals of the F-scores of different machine learning methods. The
range reflects the 90% confidence interval of each method. Each method is run 100 times on
a randomly chosen 80–20 split of the data. The 90% confidence interval is based on the fifth
F-score in the sorted order of F-scores to the 95th F-score.
the other hand, when there are thousands of features, Support Vector Machines can perform
better. We suggest trying the different methods on your problem as we are doing in this chapter.
It doesn’t take much work on your part and it can greatly enhance the quality of your results.
Note that, in Table 3.1, the methods show a larger confidence interval for the testing F-score than for the training F-score. The reason is that the test set is smaller, so there will be more variance in its F-score. This is illustrated in Figure 3.4, which shows the distribution of results from 100 logistic regression models.
Figure 3.4: Distribution of training and testing F-scores for logistic regression. The vertical lines represent 90% confidence intervals. The test set is smaller, so it will tend to have a wider confidence interval.
The test error measured by cross-validation is meant to estimate the error one might expect if the model were applied to new data. However, this is true only if the analyst is very careful.
We, the authors, have seen researchers deceive themselves into thinking they had amazingly
great results, because they effectively incorporated test data into their training data. One way
this happens is that one takes averages and/or other statistics on all the data and uses that to set
some machine learning parameters. A second way is that one uses cross-validation with many
different hyperparameter settings. A hyperparameter is commonly a configuration input to a
machine learning model (e.g., number of trees used in a random forest) or a method used for
handling missing data or some other method for handling data. If one runs cross-validation with
different hyperparameters until one gets the best results, then one is effectively incorporating test
data into the training process. We call that polluting the test set.
Polluting the test set will make the study results appear stronger, but then generalized
application of the research in question will suffer worse results. That is just bad science.
To ensure that you avoid this, you would do well to sequester some of the input data from
all other data at the very beginning. Then you can do cross-validation on the remaining data,
optimizing hyperparameters to your heart’s content. In the end, you create a model based on all
the non-sequestered data and calculate an error on the sequestered data, which is the only error
rate you should report.
In this case study, we use cross-validation alone for the sake of presentation simplicity
(and do not optimize hyperparameters). That corresponds roughly to step (ii) of the following
workflow: (i) sequester some data; (ii) perform hyperparameter tuning and cross-validation on
the non-sequestered data to create an optimized model; (iii) apply the optimized model on the
sequestered data; and (iv) report results only on that sequestered data.
def impute_missing_data(data):
    # Replace each missing value with the median of its column.
    data.fillna(data.median(), inplace=True)
    return data
To evaluate the robustness of various missing data methods, we also want to test the per-
formance of the methods with different subsets of missing values. Below is a code example of
using the logistic regression function with the impute strategy.
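The original listing is not reproduced here; a sketch of the idea (ours) imputes first and then reuses the logistic-regression loop from earlier in the chapter (same imports; X_missing is a hypothetical copy of the feature table with values removed):

X_imputed = impute_missing_data(X_missing.copy())   # impute_missing_data() defined above

X_train, X_test, Y_train, Y_test = train_test_split(
    X_imputed, Y, test_size=0.2, stratify=N)
model = LogisticRegression(max_iter=10000).fit(X_train, Y_train)
test_f1 = f1_score(Y_test, model.predict(X_test))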
Table 3.2: 35% missing data: the second column shows the F-measures for each classification
method on the test data when there is no missing data. The next two columns look at the 90%
confidence interval of the F-measure on test data (i) when records having missing data are re-
moved (ii) when missing values are imputed. Imputing gives better F-scores.
Table 3.3: 50% missing data: even when 50% of the data is missing, imputing does remarkably
well for Random Forests and Logistic Regression
The qualitative conclusion is clear: the accuracy of the machine learning methods is em-
pirically less sensitive to missing data when using imputation than when removing records, es-
pecially for high levels of missing data (Tables 3.2 and 3.3).
3.7 FEATURE SELECTION
Feature Selection is the process of selecting or constructing a set of features in the hopes of
obtaining better prediction results. In this section, we talk about the problem of selecting a
subset of a given set of features and how that affects the prediction quality of each classifier (in
the absence of missing data).
Figure 3.5: Pairwise correlation of the features in the breast cancer dataset.
Table 3.4: Feature selection improves the Support Vector Machine’s (SVM’s) testing F-measure
but doesn’t benefit the other models’ F-measures. In our experience, tree-based learning methods
don’t benefit from feature selection unless there are at least hundreds of features.
Figure 3.6: Pairwise correlations of the reduced set of features: radius_mean, texture_mean, smoothness_mean, compactness_mean, concave points_mean, symmetry_mean, and fractal_dimension_mean.
# Plot the mean importance of each feature with error bars (feature_imp_df
# holds per-feature 'mean' and 'sd' of importance over repeated runs;
# feature_plot is the bar-plot axes created earlier).
plt.errorbar(x=feature_imp_df['features'], y=feature_imp_df['mean'],
             yerr=feature_imp_df['sd']/np.sqrt(5), fmt='none')
feature_plot.set(xlabel="Features", ylabel="Mean Importance")
feature_plot.set_xticklabels(feature_imp_df['variables'], rotation=90)
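The fragment above assumes a DataFrame feature_imp_df of importance statistics; one way to build it (a sketch under our assumptions, with five repeated fits to match the sqrt(5) in the error bars) is:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train is a DataFrame of the reduced features and Y_train its labels.
importances = []
for seed in range(5):   # five repeated fits
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train, Y_train)
    importances.append(rf.feature_importances_)

imp = pd.DataFrame(importances, columns=X_train.columns)
feature_imp_df = pd.DataFrame({"features": imp.columns, "variables": imp.columns,
                               "mean": imp.mean().values, "sd": imp.std().values})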
Figure 3.7: Feature importance on a scale of 0–1 on the vertical axis. The feature con-
cave_points_mean is of primary importance in determining whether a patient has breast cancer.
Recall that concave_points_mean was the most important feature in the Decision Tree which is
why it was the feature evaluated at the root node of the tree. With more repeats, the error bars
would be narrower.
3.9 EXERCISES TO TEST YOUR UNDERSTANDING
1. Imputing missing values as opposed to deleting them seems to improve the F-measure
results for each of the machine learning methods. Evaluate the statistical significance of
the improvement nonparametrically when 40% of the data is missing.
2. Imputation and random forests: Create a chart whose x-axis is the percentage of missing
data (0%, 10%, 20%, 30%, 40%, and 50%) and whose y-axis contains the lower and upper
bound of the 90% confidence interval for random forests using imputation.
3. Restrict yourself to the top k (e.g., k = 5) most important features and rerun the classification study using only those features. How is the F-measure impacted? If the F-measure does not decrease too much, the model with fewer features may be considered better based on the philosophical rule known as Occam's razor [Gauch, 2003].2
4. If someone presents to you the test error from cross-validation results along with the as-
sertion that that error should be representative of errors on new data, what might you ask
about the analytical process?
5. Redo the analysis done here (with whatever level of missing values you choose) by keeping
out a subset of the data say 16% as sequestered. On the remaining 84% of the data, optimize
2 The approach of retrying predictions using fewer features is sometimes called model simplification. If one simplifies by choosing various subsets of features to see which gives a good F-measure, one is polluting the test set, because one is using the test set to determine the features to use. The net result is that one might get a poor fit on other data. Generalizability might suffer.
hyperparameters, missing value imputation, and anything else you choose on a series of
cross-validation experiments. Then build a model with those optimized hyperparameter
values on the 84% of the data and see how you do on the sequestered 16% of the data.
Redo this process several times on a bootstrap of the sequestered 16% of the data (but
with the same optimized hyperparameter values) to get a 90% confidence interval of the
test set results. How does the test set result compare with the best cross-validation results
on the 84% of the data?
6. Comparing the confidence intervals of the machine learning algorithms gave us some indi-
cation of which method was better, but did not establish statistical significance. Suppose
that method M1 has an overall higher accuracy than method M2. To see whether that
difference in accuracy is statistically significant, try a paired test. Compute the p-value to
determine whether the difference in the F-measure of the random forest method compared
to other methods is significant.
Hint: Here is how the paired test could work. (This is similar to what we did in the previous chapter to compare linear and quadratic regression.) Create a three-column table Tab: patient, M1 prediction, M2 prediction, for some run of M1 and some run of M2. Suppose M1's F-score is FM1 and M2's F-score is FM2 and Diff = FM1 − FM2, where Diff > 0. Then repeatedly (say, 10,000 times) swap the M1 and M2 predictions of each patient with probability 0.5, recompute the difference in F-scores, and take the fraction of shuffles whose difference is at least Diff as the p-value.
CHAPTER 4
RNA-seq Data Set
• 20 patients with cystic fibrosis, sampled before and after drug treatment; and
• 20 healthy patients.
1. Normalization — Researchers have provided the raw and normalized values. We will test
if a simple normalization will provide the same results as the sophisticated one performed
by the researchers.
2. Unpaired tests — When comparing two different populations (in this case, healthy vs.
cystic fibrosis patients), we have to use unpaired tests. We did this also when comparing
the final weights of the chicks, since each chick received only one diet.
3. Multiple hypothesis testing/Reducing false positives — When there are over 15,000
genes that could be expressed differently in healthy patients compared to sick ones, there
might be differences in mRNA expression of some gene g just by chance. We would want to
avoid calling such a gene g “differentially expressed,” because that would constitute a false
positive. So, we describe and use two multiple testing correction techniques: Bonferroni
and Benjamini–Hochberg.
4.3.1 NORMALIZATION
The authors have provided raw and normalized data in their Excel file, designated as RAW and
NORM in the prefix of the headers. The normalized values are determined using the sophisti-
cated methods of a widely accepted R package called DESeq2 [Love et al., 2014] that tries to
estimate statistical parameters based on a distribution assumption. In the spirit of nonparametric
methods, we will use a distribution-agnostic approach.
In the given dataframe, every column corresponds to a specific individual. Though the
total number of mRNA sequences generated across individuals can vary, we are primarily in-
terested in the relative expression of different kinds of mRNA. To make the data in different
columns comparable, we will simply normalize the mRNA values from the genes of each in-
dividual (corresponding to one RNAseq run and one column) so their total read count is one
million.
Simple Normalization Approach Code
# calculate the total number of reads mapped to genes from each sample
raw_cols_sums = data_df[raw_cols].sum()
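The listing above shows only the first step; a completion sketch (ours, not the repository's exact code) scales each RAW column to one million reads:

# Divide each sample's gene counts by that sample's total and rescale to
# one million, so the columns become directly comparable ("counts per million").
cpm_df = data_df[raw_cols] / raw_cols_sums * 1_000_000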
Removing all genes with at least one zero reduced the number of genes from 15,570 to
15,250.
7,361 of the 15,250 genes showed significant p-values (< 0.05). Because we have performed p-value evaluations on multiple genes, some of those low p-values could have occurred just by chance. After all, a p-value of say 0.05 means that the observed change in expression had a 1/20th probability of happening by chance if the null hypothesis (in our case that cystic fibrosis had no effect on the expression of that gene) were true. Because we perform around 15,000 tests, this could give us around 750 (15,000/20) genes that look significant just by chance. Those would be false positives.
To correct for multiple hypothesis testing (in our case, testing many genes), we can use several different methods, but we consider just two widely used ones: (i) the extremely conservative Bonferroni correction [Bonferroni, 1936] and (ii) the looser Benjamini–Hochberg FDR (false discovery rate) correction [Benjamini and Hochberg, 1995].
4.4.1 BONFERRONI
The Bonferroni corrections limits the FWER (Family-Wise Error Rate). This error rate is
the probability that at least one gene that is called differentially expressed is a false posi-
tive. In the Bonferroni method, we divide the threshold by the number of genes. So, if we
take a family-wise error threshold of 20% or 0.2, the Bonferroni-corrected threshold would be
0.2/len(norm_p_values). In this case we have 0.2/15250 which is 1.3e-05. The only way a gene
will pass this cutoff is if our shuffle test has 1 instance or fewer of obtaining a log fold change
more extreme than the one observed in the 100,000 shuffles. There were 531 such genes. For
those genes, the Bonferonni correction says that roughly 80% of the time, there won’t be any
false positives. Other researchers might choose different Bonferonni thresholds, but going much
lower than 20% might result in no genes.
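For reference, both corrections can also be obtained from statsmodels (a sketch, assuming norm_p_values is the list of per-gene shuffle-test p-values mentioned above):

from statsmodels.stats.multitest import multipletests

# Bonferroni at a family-wise error rate of 0.2.
bonf_reject, bonf_p, _, _ = multipletests(norm_p_values, alpha=0.2, method="bonferroni")

# Benjamini-Hochberg at a false discovery rate of 0.05.
bh_reject, bh_p, _, _ = multipletests(norm_p_values, alpha=0.05, method="fdr_bh")

print(bonf_reject.sum(), "genes pass Bonferroni;", bh_reject.sum(), "pass the FDR cutoff")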
Table 4.1: Differentially expressed genes. More than 7,364 genes had p-values less than 0.05. Using a False Discovery Rate cutoff (Benjamini–Hochberg) of 5%, we get 6,082 genes. Using the FWER (Family-Wise Error Rate/Bonferroni) cutoff of 0.2 (so p-values of 0.20/15,250), we get 531 genes.
Given the confidence intervals of the differentially expressed genes, we rank them as follows. Suppose gene g1 has confidence interval [c1_low, c1_high] and g2 has confidence interval [c2_low, c2_high]. Gene g1 has a higher rank (closer to 1) than g2 if max(|c1_low|, |c1_high|) > max(|c2_low|, |c2_high|). Basically, this means that g1 has a more extreme log fold change than g2, either more strongly negative at the low end or more strongly positive at the high end of the confidence interval.
Of the top 10 most differentially expressed genes based on confidence intervals, the re-
searchers [Kopp et al., 2020] discussed three genes in detail: MMP9, ANXA3, and SOCS3
(Table 4.2).1
We have 40 samples, 20 healthy and 20 with cystic fibrosis. When we perform our analysis,
cross-validation will use 20% as the test set, which is 8 samples, leaving us with 32 samples for
training. It should be expected that there will be quite a bit of variation in the results depending
on which samples are selected for training and testing. This was less of a problem for breast
cancer because there were many more patients.
Figure 4.1: Violin plots of the expression ranges of the three genes identified
as potential markers of cystic fibrosis in Kopp et al. [2020]. These genes tend to have higher
expression in cystic fibrosis patients than in healthy patients. Note that since this is based on log
base 2 values, a value of 1 means that the expression is 2 times larger and a value of 2 means 4
times larger in a cystic fibrosis patient than in a healthy patient. A value of 0 would represent no
change.
Figure 4.2: Violin plots of the range of scores for three different metrics: precision, recall, and F-score. The mean values for the three metrics are 0.91, 0.85, and 0.87,
respectively, suggesting that machine learning applied to differentially expressed genes may lead
to accurate diagnosis, though one should apply the model to sequestered data to avoid polluting
the test set.
Figure 4.3: The horizontal axis shows the names of the most influential genes and the vertical
axis shows their relative importance. Because the random forest makes strong use of randomness,
the exact importance of any specific gene varies across different runs. Nevertheless, we see that
MMP9 and IL1R2 are consistently important.
4.7 EXERCISES
1. Before studying which genes are most important in the random forest model, we should
first show that the random forest predictions (diagnosing cystic fibrosis vs. healthy people)
have a significantly better accuracy than randomly guessing the health of each person. Try
that and determine the p-value.
Hint: Remember that we have a dataset of 40 individuals, half of whom have cystic fibrosis
and half do not.
The random assignment approach would assign a patient to cystic fibrosis or healthy with probability 0.5, which corresponds to the class distribution of our sample. Then we would evaluate the F-score after that random assignment. The p-value is the fraction of random assignments whose F-score was greater than or equal to the F-score of the random forest (about 0.9).
2. We used the native random forest importance ranking to determine which genes were
most influential in diagnosing cystic fibrosis patients. There is a widely used alternative
method called permutation importance that we used in the last chapter. Describe what
permutation importance does and then apply it to determine the most influential features.
3. We have used a random forest on the differentially expressed genes to try to predict which
genes are the most influential in determining whether a patient is healthy or has cystic
fibrosis. Then we evaluated the precision and recall of our analysis. Do the same analysis
using support vector machines and compare the outcomes.
4. So far, we have done an unpaired comparison of untreated cystic fibrosis patients with
healthy patients. We had to do this, because the two sets of patients were disjoint. By
contrast, the RNA samples after treatment were taken from the same individuals as before
treatment, so a paired test (i.e., the gene expression value before treatment for a given
individual X compared to the gene expression value after treatment for that same X) is
appropriate. Paired tests often make it easier to identify changes to some given gene g due
to treatment.
Perform a paired test comparing treated vs. untreated patients and see which genes show
a significant log fold change and construct confidence intervals of that log fold change.
Rank the genes as we did before and see if you find any genes that were both differen-
tially expressed in the healthy patient vs. sick patient comparison and in the sick untreated
patient vs. the sick treated patient comparison.
5. Find a larger set of medical/genomics data. Sequester say s% of the data both from healthy
and sick data. Then perform differential expression analysis and the diagnostic analysis (i.e.,
healthy vs. sick) on the remaining (1-s)% of the data, the “non-sequestered” part. On the
non-sequestered data, optimize hyperparameter settings and anything else you choose on
a series of cross-validation experiments. Then build a model with those optimized hyper-
parameter settings on the (1-s)% of the data and see how you do on the sequestered s% of
the data. Without changing hyperparameter settings, repeat this process at least 100 times
on a bootstrap of the initially sequestered s% of the data to get a 90% confidence interval
of the test set results. How does the test set result compare with the best cross-validation
results on the (1-s)% of the data?
Bibliography
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and
powerful approach to multiple hypothesis testing. J. Roy. Statist. Soc. B, 57:289–300. DOI:
10.1111/j.2517-6161.1995.tb02031.x. 9, 48
Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni
del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, Google Scholar.
DOI: 10.4135/9781412961288.n455. 9, 48
Bushel, Pierre R., Ferguson, Stephen S., Ramaiahgari, Sreenivasa C., Paules, Richard S., and
Auerbach, Scott S. (2020). Comparison of normalization methods for analysis of TempO-
Seq targeted RNA sequencing data. Front. Genet., 11:594. DOI: 10.3389/fgene.2020.00594.
4
Crowder, M. and Hand, D. (1990). Analysis of Repeated Measures, Chapman & Hall. DOI:
10.1201/9781315137421. 9, 11
Gauch Jr., H. G. (2003). Scientific Method in Practice, Cambridge University Press. DOI:
10.1017/CBO9780511815034. 40
Hastie, T., Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, New York, Springer. DOI: 10.1007/978-0-
387-84858-7. 5, 28
Kopp, B. T., Fitch, J., Jaramillo, L., Shrestha, C. L., Robledo-Avila, F., Zhang, S., Palacios, S.,
Woodley, F., Hayes, D. Jr, Partida-Sanchez, S., Ramilo, O., White, P., and Mejias, A. (2020).
Whole-blood transcriptomic responses to lumacaftor/ivacaftor therapy in cystic fibrosis. J.
Cyst Fibros., 19(2):245–254. DOI: 10.1016/j.jcf.2019.08.021. 10, 43, 50, 52
Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. DOI: 10.1186/s13059-
014-0550-8. 45
Authors’ Biographies
SUDARSHINI TYAGI
Sudarshini Tyagi is currently a software engineer at Goldman Sachs, where she uses machine learning, particularly natural language processing, and statistics to detect anomalies in financial
regulations. She received her Master’s degree in Computer Science from Courant Institute of
Mathematical Sciences at New York University, where she wrote a thesis on visually detecting
breast cancers from mammograms. She also holds a Bachelor’s degree in Computer Science
from Rashtreeya Vidyalaya College of Engineering, Bengaluru.
DENNIS SHASHA
Dennis Shasha is a Julius Silver Professor of Computer Science at the Courant Institute of New
York University and an Associate Director of NYU Wireless. In addition to his long fascination
with nonparametric statistics, he works on meta-algorithms for machine learning to achieve
guaranteed correctness rates; with biologists on pattern discovery for network inference; with
physicists and financial people on algorithms for time series; on database tuning; and tree and
graph matching.
Because he likes to type, he has written six books of puzzles about a mathematical detec-
tive named Dr. Ecco, a biography about great computer scientists, and a book about the future
of computing. He has also written technical books about database tuning, biological pattern
recognition, time series, DNA computing, resampling statistics, and causal inference in molec-
ular networks.
He has written the puzzle column for various publications including Scientific American,
Dr. Dobb’s Journal, and currently the Communications of the ACM. He is a fellow of the ACM
and an INRIA International Chair.