Nonparametric Statistics and Model Selection
5.1 Estimating Distributions and Distribution-Free Tests
In Chapter 2, we learned about the t-test and its variations. These were designed to compare
sample means, and relied heavily on assumptions of normality. We were able to apply them to
non-Gaussian populations by using the central limit theorem, but that only really works for
the mean (since the central limit theorem holds for averages of samples). Sometimes, we’re
interested in computing other sample statistics and evaluating their distributions (remember
that all statistics computed from samples are random variables, since they’re functions of the
random samples) so that we can obtain confidence intervals for them. In other situations, we
may not be able to use the central limit theorem due to small sample sizes and/or unusual
distributions.
In this chapter, we’ll focus on techniques that don’t require these assumptions. Such methods
are usually called nonparametric or distribution-free. We’ll first look at some statistical tests,
then move to methods outside the testing framework.
So far, we’ve only used “eyeballing” and visual inspection to see if distributions are similar.
In this section, we’ll look at more quantitative approaches to this problem. Despite this,
don’t forget that visual inspection is usually an excellent place to start!
We’ve seen that it’s important to pay attention to the assumptions inherent in any test.
The methods in this section make fewer assumptions and will help us test whether our
assumptions are accurate.
Figure 5.1: Two Kolmogorov-Smirnov test plots (right column) with histograms of the data
being tested (left column). On the top row, the empirical CDF (green) matches the test
CDF (blue) closely, and the largest difference (dotted vertical red line, near 0.5) is very
small. On the bottom, the empirical CDF is quite different from the test CDF, and the
largest difference is much larger.
Recall that the cumulative distribution function (CDF) of a random variable gives the probability that the variable is less than or equal to some value. To be a bit more precise, it's a function F such that F(a) = P(x ≤ a). When talking about data, it's often useful to look at empirical CDFs: F_n(a) = (1/n) Σ_i I(x_i ≤ a)¹ is the CDF of n observed data points.
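As a quick illustration, here's a minimal sketch (in Python with NumPy; the data are made up) of computing the empirical CDF at a single point:

    import numpy as np

    def empirical_cdf(data, a):
        # F_n(a): the fraction of observed points x_i that are <= a.
        data = np.asarray(data)
        return np.mean(data <= a)

    x = np.random.normal(size=1000)   # a made-up sample
    print(empirical_cdf(x, 0.0))      # roughly 0.5 for standard normal data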
Now suppose we want to compare two CDFs, F_1 and F_2. They might be empirical CDFs (to compare two different datasets and see whether they're significantly different), or one might be a reference CDF (to see whether a particular distribution is an appropriate choice for a dataset). The Kolmogorov-Smirnov test computes the statistic D_n:

    D_n = max_a |F_1(a) − F_2(a)|.

This compares the two CDFs and looks at the point of maximum discrepancy; see Figure 5.1 for an example. We can theoretically show that if F_1 is the empirical distribution of x and F_2 is the true distribution x was drawn from, then lim_{n→∞} D_n = 0. Similarly, if the two distributions have no overlap at all, the maximum difference will be 1 (when one CDF is 1 and the other is 0). Therefore, we can test distribution equality by comparing the statistic D_n to 0 (if D_n is significantly larger than 0 and close to 1, then we might conclude that the distributions are not equal).
Notice that this method is only defined for one-dimensional random variables: although there are extensions to multiple random variables, they are more complex than simply comparing joint CDFs.

¹Remember that I is a function that returns 1 when its argument is true and 0 when its argument is false.
Also notice that this test is sensitive to any difference at all between two distributions: two distributions with the same mean but significantly different shapes will produce a large value of D_n.
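In practice, this test is available in standard libraries. Here's a rough sketch using SciPy's kstest (one sample against a reference CDF) and ks_2samp (two samples); the data are made up for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)                    # sample to test
    y = rng.uniform(low=-3, high=3, size=200)   # a differently shaped sample

    # One-sample test: compare x's empirical CDF against the standard normal CDF.
    res1 = stats.kstest(x, 'norm')
    print(res1.statistic, res1.pvalue)   # small D, large p: consistent with normality

    # Two-sample test: compare the empirical CDFs of x and y.
    res2 = stats.ks_2samp(x, y)
    print(res2.statistic, res2.pvalue)   # larger D, small p: the samples look different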
The Wilcoxon signed-rank test applies to matched pairs of observations (for example, before-and-after measurements on the same subjects)² and tests whether the median difference between the pairs is zero. Here's how it works:

(1) For each pair i, compute the difference, and keep its absolute value d_i and its sign S_i (where S_i ∈ {−1, 0, +1}). We'll exclude pairs with S_i = 0.

(2) Sort the absolute values d_i from smallest to largest, and rank them accordingly. Let R_i be the rank of pair i (for example, if the fifth pair had the third smallest absolute difference, then R_5 = 3).

(3) Compute the test statistic W = Σ_i S_i R_i, the sum of the signed ranks.

²For unmatched pairs, we can use the Mann-Whitney U test, described in the next section.
W has a known distribution. In fact, if N is greater than about 10, it’s approximately
normally distributed (if not, it still has a known form). So, we can evaluate the probability
of observing it under a null hypothesis and thereby obtain a significance level.
Intuitively, if the median difference is 0, then half the signs should be positive and half should be negative, and the signs shouldn't be related to the ranks. If the median difference is nonzero, W will be large in magnitude (the sum will be either a large negative value or a large positive value). Notice that once we constructed the rankings and defined R_i, we never used the actual differences!
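As a sketch of how this looks in practice, SciPy's wilcoxon function runs this test on paired data (it reports a closely related rank-sum statistic rather than the signed sum above, but the resulting test is equivalent); the before/after numbers here are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    before = rng.normal(loc=5.0, scale=1.0, size=30)
    after = before + rng.normal(loc=0.5, scale=1.0, size=30)   # paired measurements

    # Signed-rank test on the paired differences (before - after).
    res = stats.wilcoxon(before, after)
    print(res.statistic, res.pvalue)   # a small p-value suggests a nonzero median difference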
For unmatched samples (i.e., two independent groups), the Mann-Whitney U test³ asks whether values in one group tend to be larger than values in the other. It works as follows:

(1) Combine all data points and rank them (largest to smallest or smallest to largest).

(2) Add up the ranks for data points in the first group; call this R_1. Find the number of points in the group; call it n_1. Compute U_1 = R_1 − n_1(n_1 + 1)/2. Compute U_2 similarly for the second group.

(3) Take the test statistic U to be the smaller of U_1 and U_2.
As with W from the Wilcoxon test, U has a known distribution. If n_1 and n_2 are reasonably large, it's approximately normally distributed with mean n_1 n_2 / 2 under the null hypothesis. If the two medians are very different, U will be close to 0, and if they're similar, U will be close to n_1 n_2 / 2. Intuitively, here's why:

• If the values in the first sample were all bigger than the values in the second sample, then R_1 = n_1(n_1 + 1)/2⁴: this is the smallest possible value for R_1. U_1 would then be 0.

• If the ranks between the two groups aren't very different, then U_1 will be close to U_2. With a little algebra, you can show that the sum U_1 + U_2 will always be n_1 n_2. If they're both about the same, then they'll both be near half this value, or n_1 n_2 / 2.
³Under some reasonable assumptions about the distributions of the data (see the Mann-Whitney U article on Wikipedia for more details), this test can be used with a null hypothesis of equal medians and a corresponding alternative hypothesis of a significant difference in medians.

⁴R_1 = n_1(n_1 + 1)/2 because in this case, the ranks for all the values from the first dataset would be 1 through n_1, and the sum of these values is n_1(n_1 + 1)/2.
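Here's a corresponding sketch with SciPy's mannwhitneyu (different SciPy versions use slightly different conventions for which U they report, so rely on the p-value); again the data are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    group1 = rng.normal(loc=0.0, size=40)
    group2 = rng.normal(loc=1.0, size=55)   # unmatched groups of different sizes

    res = stats.mannwhitneyu(group1, group2, alternative='two-sided')
    print(res.statistic, res.pvalue)   # small p: values in one group tend to be larger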
All the approaches we've described involve computing a test statistic from data and measuring how unlikely our data are based on the distribution of that statistic. If we don't know enough about the distribution of our test statistic, we can use the data to tell us about that distribution: this is exactly what resampling-based methods do. Permutation tests "sample" different relabelings of the data in order to give us a sense for how significant the true labeling's result is. The bootstrap creates "new" datasets by resampling several times from the data itself, and treats those as separate samples. The next example illustrates a real-world situation where these methods are useful.
Example: Chicago teaching scandal
In 2002, economists Steven Levitt and Brian Jacob investigated cheating in Chicago public schools, but not in the way you might think: they decided to investigate cheating by teachers, which usually took the form of changing student answers after the students had taken standardized tests.ᵃ
So, how’d they do it? Using statistics! They went through test scores from thousands of classrooms in
Chicago schools, and for each classroom, computed two measures:
(1) How unexpected is that classroom’s performance? This was computed by looking at every student’s
performance the year before and the year after. If many students had an unusually high score one
year that wasn’t sustained the following year, then cheating was likely.
(2) How suspicious are the answer sheets? This was computed by looking at how similar the A-B-C-D
patterns on different students’ answer sheets were.
Unfortunately, computing measures like performance and answer sheet similarity is tricky, and results
in quantities that don’t have well-defined distributions! As a result, it isn’t easy to determine a null
distribution for these quantities, but we still want to evaluate how unexpected or suspicious they are.
To solve this problem, Levitt and Jacob used two nonparametric methods to determine appropriate null
distributions as a way of justifying these measures. In particular:
• They assume (reasonably) that most classrooms have teachers who don’t cheat, so by looking at
the 50th to 75th percentiles of both measures above, they can obtain a null distribution for the
correlation between the two.
• In order to test whether the effects they observed are because of cheating teachers, they randomly
re-assign all the students to new, hypothetical classrooms and repeat their analysis. As a type of
permutation test, this allows them to establish a baseline level for these measures by which they
can evaluate the values they observed.
While neither of these methods is exactly like what we’ll discuss here, they’re both examples of a key
idea in nonparametric statistics: using the data to generate a null hypothesis rather than assuming any
kind of distribution.
What’d they find? 3.4% of classrooms had teachers who cheated on at least one standardized test when
the two measures above were thresholded at the 95th percentile. They also used regression with a variety
of classroom demographics to determine that academically poorer classrooms were more likely to have
cheating teachers, and that policies that put more weight on test scores correlated with increased teacher
cheating.
ᵃSee Jacob and Levitt, "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating." For more economic statistics, see Steven Levitt's book with Stephen Dubner, Freakonomics.
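To make the permutation idea concrete, here's a minimal sketch (plain NumPy, made-up data) that repeatedly relabels which points belong to which group in order to build a null distribution for a difference in means:

    import numpy as np

    rng = np.random.default_rng(3)
    group_a = rng.normal(loc=0.0, size=30)
    group_b = rng.normal(loc=0.8, size=30)

    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])

    n_perm = 10000
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                 # random relabeling of the data
        diff = perm[:30].mean() - perm[30:].mean()     # statistic under that relabeling
        if abs(diff) >= abs(observed):
            count += 1

    p_value = (count + 1) / (n_perm + 1)   # how extreme is the true labeling?
    print(observed, p_value)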
5.2.2 Bootstrap
Suppose we have some complicated statistic y that we computed from our data x. If we want to provide a confidence interval for this statistic, we need to know its variance. When our statistic was simply x̄, we could compute the statistic's standard deviation (i.e., the standard error of that statistic) from our estimated standard deviation using s_x/√n. But for more complicated statistics, where we don't know the distributions, how do we provide a confidence interval?
Figure 5.2: An illustration of bootstrap sampling. The top figure shows the true distribution that
our data points are drawn from, and the second figure shows a histogram of the particular data
points we observed (N = 50). The bottom row shows various bootstrap resamplings of our data
(with n = N = 50). Even though they were obtained from our data, they can be thought of as
samples from the true distribution (top).
One approach is a method called bootstrap. The key idea here is that we can resample
points from our data, compute a statistic, and repeat several times to look at the variance
across different resamplings.
Recall that our original data (N points) are randomly generated from some true distribution.
If we randomly sample n points (n ≤ N, and often n = N) from our data with replacement⁵,
these points will also be random samples from our true distribution, as shown in Figure 5.2.
So, we can compute our statistic over this smaller random sample and repeat many times,
measuring the variance of the statistic across the different sample runs.
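For concreteness, here's a minimal sketch of bootstrapping the standard error and a rough confidence interval for a sample median (plain NumPy; the data and choice of statistic are made up):

    import numpy as np

    rng = np.random.default_rng(4)
    data = rng.exponential(scale=2.0, size=50)   # N = 50 observed points

    n_boot = 5000
    medians = np.empty(n_boot)
    for b in range(n_boot):
        # Resample n = N points from the data *with* replacement.
        resample = rng.choice(data, size=data.size, replace=True)
        medians[b] = np.median(resample)

    print(np.median(data))                        # the statistic on the original data
    print(medians.std(ddof=1))                    # bootstrap estimate of its standard error
    print(np.percentile(medians, [2.5, 97.5]))    # a simple 95% percentile interval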
Everything we’ve talked about so far has been based on the idea of trying to approximating
the true distribution of observed data with samples. Bootstrap takes this a step further and
samples from the samples to generate more data.
A related method, known as the jackknife, uses a similar process, but looks at N − 1 points taken without replacement each time instead of n points with replacement. Put more simply, we remove one point at a time and test the model. Notice that we've seen a similar idea before: our initial definition of Cook's distance was based on the idea of removing one point at a time. In practice, the bootstrap is more widely used than the jackknife; the jackknife also has very different theoretical properties.
⁵This means that a single data point can be sampled more than once.
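A corresponding jackknife sketch (leaving out one point at a time; here the statistic is just the mean, so the answer should match the usual s_x/√n formula as a sanity check):

    import numpy as np

    rng = np.random.default_rng(5)
    data = rng.exponential(scale=2.0, size=50)
    n = data.size

    # Statistic computed on each leave-one-out subset of N - 1 points.
    loo_means = np.array([np.delete(data, i).mean() for i in range(n)])

    # Jackknife estimate of the statistic's standard error.
    se_jack = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
    print(se_jack)
    print(data.std(ddof=1) / np.sqrt(n))   # identical for the mean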
When fitting models (such as regression) to real data, we’ll often have a choice to make for
model complexity: a more complex model might fit the data better but be harder to interpret,
while a simpler model might be more interpretable but produce a larger error. We’ve seen
this before when looking at polynomial regression models and LASSO in Chapters 3 and 4.
In this section, we’ll learn how to pick the “best” model for some data among several choices.
To do this, we'll typically split our data into three sets:

• The training set is what we'll use to fit the model for each possible value of any manually set parameters,

• the validation set is what we'll use to choose the values of those manually set parameters,

• and the test set is what we'll use to evaluate our results for reporting, and to get a sense for how well our model will do on new data in the real world.
It’s critically important to properly separate the test set from the training and validation
sets! At this point you may be wondering: why do we need separate test and validation
sets? The answer is that we choose a model based on its validation set performance: if we
really want to see generalization error, we need to see how it does on some new data, not
data that we used to pick it.
A good analogy is to think of model fitting and parameter determination as a student learning
and taking a practice exam respectively, and model evaluation as that student taking an
actual exam. Using the test data in any way during the training or validation process is like
giving the student an early copy of the exam: it’s cheating!
Figure 5.3 illustrates a general trend we usually see in this setup: as we increase model complexity, the training error (i.e., the error of the model on the training set) will go down, while the validation error will hit a "sweet spot" and then start increasing due to overfitting.
For example, if we’re using LASSO (linear regression with sum-of-absolute-value penalty)
as described in Chapter4, we need to choose our regularization parameter λ. Recall that λ
controls model sparsity/complexity: small values of λ lead to complex models, while large
values lead to simpler models. One approach is:
(a) Choose several possibilities for λ and, for each one, compute coefficients using the training
set.
Figure 5.3: Training and validation error from fitting a polynomial to data. The data were generated
from a fourth-order polynomial. The validation error is smallest at this level, while the training
error continues to decrease as more complex models overfit the training data.
(b) Then, look at how well each one does on the validation set. Separating training and
validation helps guard against overfitting: if a model is overfit to the training data, then
it probably won’t do very well on the validation data.
(c) Once we’ve determined the best value for λ (i.e., the one that achieves minimum error in
step (b)), we can fit the model on all the training and validation data, and then see how
well it does on the test data. The test data, which the model/parameters have never
seen before, should give a measure of how well the model will do on arbitrary new data
that it sees.
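Here is a rough sketch of steps (a)-(c) using scikit-learn's Lasso (which calls the regularization parameter alpha); the data, split sizes, and candidate values of λ are all made up:

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

    # A made-up split: 120 training, 40 validation, 40 test points.
    X_train, y_train = X[:120], y[:120]
    X_val, y_val = X[120:160], y[120:160]
    X_test, y_test = X[160:], y[160:]

    lambdas = [0.001, 0.01, 0.1, 1.0]
    val_errors = []
    for lam in lambdas:
        model = Lasso(alpha=lam).fit(X_train, y_train)                        # (a) fit on training set
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))    # (b) validate

    best = lambdas[int(np.argmin(val_errors))]

    # (c) refit on training + validation data, then evaluate once on the held-out test set.
    final = Lasso(alpha=best).fit(X[:160], y[:160])
    print(best, mean_squared_error(y_test, final.predict(X_test)))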
The procedure described above completely separates our fitting and evaluation processes,
but it does so at the cost of preventing us from using much of the data. Recall from last
week that using more data for training typically decreases the variance of our estimates, and
helps us get more accurate results. We also need to have enough data for validation, since
using too little will leave us vulnerable to overfitting.
One widely used solution that lets us use more data for training is cross-validation. Here’s
how it works:
(1) First, divide the non-test data into K uniformly sized blocks, often referred to as folds.
This gives us K training-validation pairs: in each pair, the training set consists of K − 1
blocks, and the validation set is the remaining block.
(2) For each training/validation pair, repeat steps (a) and (b) above: this gives us K different
errors for each value of λ. We can average these together to get an average error for each
λ, which we’ll then use to select a model.
(3) Repeat step (c) above to obtain the test error as an evaluation of the model.
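A sketch of the cross-validated version of steps (1) and (2), using scikit-learn's KFold on the non-test data (again with made-up data and candidate values of λ):

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(7)
    X = rng.normal(size=(160, 10))    # non-test data only
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=160)

    lambdas = [0.001, 0.01, 0.1, 1.0]
    kf = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 folds

    avg_errors = []
    for lam in lambdas:
        fold_errors = []
        for train_idx, val_idx in kf.split(X):
            model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
            fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
        avg_errors.append(np.mean(fold_errors))   # average error across the K folds

    best = lambdas[int(np.argmin(avg_errors))]
    print(best)   # refit with this value on all non-test data, then evaluate on the test set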
Although these examples were described with respect to LASSO and the parameter λ, the
procedures are much more general: we could have easily replaced “value for λ” above with
a different measure of model complexity.
Also note that we could use a bootstrap-like approach here too: instead of deterministically
dividing our dataset into K parts, we could have randomly subsampled the non-test data K
different times and applied the same procedure.
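One way to do this (a sketch, using scikit-learn's ShuffleSplit in place of KFold) is to draw K random training/validation splits:

    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    X = np.arange(40).reshape(20, 2)   # placeholder non-test data

    # K = 5 random splits, each holding out 20% of the points for validation.
    ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for train_idx, val_idx in ss.split(X):
        print(len(train_idx), len(val_idx))   # fit and validate here, as in the K-fold loop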
(Figure: the probability assigned to each possible value by two hypotheses, H1 and H2, over the range 1 to 4.)
If we observe the number 2, the likelihood of this observation under H1 is 0.5, while the likelihood under H2 is 0.25. Therefore, by choosing the model that places the highest probability on our observation, we're also choosing the simpler model.
Intuitively, the more values a model allows for, the more it has to spread out its probability, and the less it can place on each particular value.