18 CV & Model Selection
Last time we talked about how to use the F test to test between two different models, when
one is nested inside the other. However, often we have many possible models to choose from,
e.g., polynomials of different orders, or regressions using different subsets of the available predictors. The training error

\hat{R}_{train} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2    (1)

measures how well the model does on the exact same observations it was built from. This is like
seeing a test's answers before you take it! It is not a good estimate of what we really care about,
which is the out-of-sample or prediction risk (i.e., the error on predicting a new, future Y we
haven't seen yet),

R = R(\hat{m}) = E[(Y - \hat{m}(X))^2]    (2)

where (X, Y) is a new observation that was not used to construct \hat{m}. The prediction risk is
also known as the generalization error.

\hat{R}_{train} can be extremely different from R when there are many features. As you'd expect,
\hat{R}_{train} systematically underestimates R, and it can always be made smaller by fitting larger, more complex models.
Now imagine that we get a new data set at the same X_i's but with new errors:

Y' = X\beta + \epsilon'

where \epsilon and \epsilon' are independent but identically distributed. The design matrix is the same, the
true parameters \beta are the same, but the noise is different. This might seem a bit strange, but it's
similar to observing a new (X, Y).

How well can we predict these new values Y_i'? The predicted values using our model are still \hat{m}(X_i).
Define the out-of-sample prediction error by

\hat{R}_{out} = \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2 .

As we would expect,

E[\hat{R}_{train}] < E[\hat{R}_{out}].
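To see this concretely, here is a small simulation in the spirit of the setup above (a sketch; the sample size, the true curve, the noise level, and the choice of a degree-10 fit are all my own illustrative choices):

set.seed(42)
n = 50
x = runif(n,-1,1)
f = function(x){ x^3 }                # the true regression function
y = f(x) + rnorm(n,0,0.5)             # original responses
out = lm(y ~ poly(x,10))              # a deliberately flexible model
Rtrain = mean((y - fitted(out))^2)    # training error
ynew = f(x) + rnorm(n,0,0.5)          # new responses at the SAME x's
Rout = mean((ynew - fitted(out))^2)   # error on the new responses
c(Rtrain, Rout)                       # Rtrain is typically the smaller of the two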
The next theorem quantifies the gap.

Theorem 1

E[\hat{R}_{train}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2 \right] - \frac{2\sigma^2 q}{n}    (4)

where q is the number of coefficients in the model.
Proof. Y_i' is independent of Y_i, but has the same distribution. This tells us that E[Y_i'] = E[Y_i] and
Var[Y_i'] = Var[Y_i], but Cov[Y_i', \hat{m}(X_i)] = 0. So

E[(Y_i' - \hat{m}(X_i))^2] = Var[Y_i] + Var[\hat{m}(X_i)] + (E[Y_i] - E[\hat{m}(X_i)])^2    (9)
                           = E[(Y_i - \hat{m}(X_i))^2] + 2 Cov[Y_i, \hat{m}(X_i)],    (10)

where the second line uses Var[Y_i - \hat{m}(X_i)] = Var[Y_i] + Var[\hat{m}(X_i)] - 2 Cov[Y_i, \hat{m}(X_i)]. Now

Cov(Y, \hat{Y}) = Cov(Y, HY) = Cov(Y, Y) H^T = \sigma^2 H

since H is symmetric, and so Cov[Y_i, \hat{m}(X_i)] = \sigma^2 H_{ii}. So

E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2 \right] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 \right] + \frac{2}{n} \sigma^2 \operatorname{tr} H = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 \right] + \frac{2q}{n} \sigma^2,

since \operatorname{tr} H = q. Rearranging gives (4).
The term (2/n)σ 2 q is called the optimism of the model — the amount by which the training
error systematically under-estimates its true expected squared error. Notice that the optimism:
• Grows with \sigma^2: more noise gives the model more opportunities to seem to fit well by capitalizing
on chance.
• Shrinks with n: at any fixed level of noise, more data makes it harder to pretend the fit is
better than it really is.
• Grows with q: every extra parameter is another control which can be adjusted to fit the
noise.
Minimizing the training error completely ignores the bias from optimism, so it is guaranteed to
pick models which are too large and predict poorly out of sample.
We can now state an estimate of the risk. Define

C_p = \hat{R}_{train} + \frac{2 \hat{\sigma}^2 q}{n}    (11)

which is known as Mallows' C_p. We see that E[C_p] \approx E[\hat{R}_{out}]. You can compute C_p in R as follows:
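For instance, along the same lines as the Cp function given in Section 5 (a sketch; here y is the response, x the predictor, and out the fitted model):

out = lm(y ~ x)
n = length(y)
q = length(coef(out))                 # number of estimated coefficients
sigma2hat = summary(out)$sigma^2      # estimate of the noise variance
Rtrain = mean((y - fitted(out))^2)    # training error
Cp = Rtrain + 2*sigma2hat*q/n         # equation (11)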
3 Cross-Validation
Mallows' C_p is nice, but our derivation assumed that the linear model is correct. Is there a way to
estimate prediction error without assuming the model is correct? Yes! It's called cross-validation,
and it is one of the most powerful tools in statistics & data science.
The basic idea is this. The training error n^{-1} \sum_i (Y_i - \hat{m}(X_i))^2 is biased because Y_i and \hat{m}(X_i)
are dependent. Cross-validation methods break the data into a training set, used to get \hat{m}, and a test
set, used to get the Y_i's. This makes them independent and mimics what it is like to predict a new observation.
There are two main flavors of cross-validation: K-fold cross-validation and leave-one-out cross-
validation.
3.1 K-fold Cross-Validation
K-fold cross-validation goes as follows.
• Randomly divide the data into K equally-sized parts, or “folds”. A common choice is K = 5
or K = 10.
• For each fold j, fit the model using all the data except fold j, and record the average squared
error it makes on the held-out fold.
• Average these errors over the K folds; this average is the cross-validation estimate of the risk.
In other words, divide the data into K groups B_1, ..., B_K. For j \in \{1, ..., K\}, estimate \hat{m} from
the data \{B_1, ..., B_{j-1}, B_{j+1}, ..., B_K\}. Then let

\hat{G}_j = \frac{1}{n_j} \sum_{i \in B_j} (Y_i - \hat{m}(X_i))^2

where n_j is the number of observations in B_j. The K-fold cross-validation score is the average
\frac{1}{K} \sum_{j=1}^{K} \hat{G}_j.
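If you want to write your own K-fold cross-validation code (an excellent exercise, as noted below), a minimal sketch might look like this, assuming the data sit in a data frame D with columns y and x and that we fit a simple linear model:

K = 5
n = nrow(D)
fold = sample(rep(1:K, length.out = n))   # randomly assign each row to a fold
G = rep(0, K)
for(j in 1:K){
train = D[fold != j, ]                    # all the data except fold j
test = D[fold == j, ]
fit = lm(y ~ x, data = train)             # estimate m-hat without fold j
G[j] = mean((test$y - predict(fit, newdata = test))^2)   # error on fold j
}
KFoldCV = mean(G)                         # average over the K folds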
3.2 Leave-One-Out Cross-Validation
Leave-one-out cross-validation (LOOCV) is the special case K = n: each fold is a single observation,
so we refit the model n times, each time leaving out one point and predicting it from the other n - 1.
Computing LOOCV this way sounds painful. Fortunately, for least-squares fits there is a simple, amazing shortcut formula:
LOOCV = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{m}(X_i)}{1 - H_{ii}} \right)^2 .
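The shortcut is exact for least-squares fits from lm, and it is worth checking it once against the brute-force definition. Here is a small sketch (the simulated data are my own illustrative choices):

set.seed(1)
n = 50
x = runif(n,-1,1)
y = 1 + 2*x + rnorm(n,0,0.5)
out = lm(y ~ x)
## the shortcut formula
shortcut = mean(((y - fitted(out))/(1 - hatvalues(out)))^2)
## brute force: refit n times, leaving out one observation each time
brute = mean(sapply(1:n, function(i){
fit = lm(y ~ x, subset = -i)
(y[i] - predict(fit, newdata = data.frame(x = x[i])))^2
}))
c(shortcut, brute)   # the two agree up to rounding error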
To see the connection to C_p, suppose all the leverages are roughly equal, so that H_{ii} \approx \operatorname{tr} H / n = q/n \equiv \gamma.
By doing a Taylor series we see that (1 - \gamma)^{-2} \approx 1 + 2\gamma. Hence,

LOOCV \approx \frac{1 + 2\gamma}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2    (12)
      = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 + 2\gamma \cdot \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2    (13)
      \approx \text{training error} + \frac{2 \hat{\sigma}^2 q}{n} = C_p,    (14)

where the last step uses \gamma = q/n and the fact that the training error is approximately \hat{\sigma}^2.
So, Mallows' C_p can be thought of as an approximation to cross-validation.
4 R Practicalities
You can get the LOOCV from R as follows:
out = lm(y ~ x)
numerator = y - fitted(out)
denominator = 1 - hatvalues(out)
LOOCV = mean( (numerator/denominator)^2 )
To get C_p I suggest you use the code given after (11), or the Cp function defined in Section 5. To get K-fold cross-validation you can either
write your own code (excellent idea!) or use the boot package together with glm as follows:
library(boot)
out = glm(y ~ x,data = D)
cv.glm(D,out,K=5)$delta[1]
This looks pretty strange, but it will give you the value that you want: cv.glm returns two numbers
in delta, the raw cross-validation estimate of prediction error and a bias-adjusted version, and we take
the first. (There may be an easier way. Let me know if you find one.) Here is an example. We will
generate data from a quadratic, fit polynomials up to order 10, and then plot the LOOCV and the
K-fold cross-validation scores using K = 5.
library("boot")
plot(x,y)
LOOCV = rep(0,10)
KFoldCV = rep(0,10)
5
## fit polynomials and get the cross-validation scores
for(j in 1:10){
out = glm(y ~ poly(x,j),data=D)
print(summary(out))
LOOCV[j] = mean(((y - fitted(out))/(1-hatvalues(out)))^2)
KFoldCV[j] = cv.glm(D,out,K=5)$delta[1]
}
## plot them
plot(1:10,LOOCV,type="l",lwd=3)
lines(1:10,KFoldCV,lwd=3,col="blue")
[Figure 1: scatter plot of the simulated data, y against x.]
Figure 2: LOOCV is in black. K-fold is in blue. The x-axis is the order of the polynomial.
5 Another Example
Here we write small functions that compute C_p and the LOOCV score from a fitted lm object,
simulate data from a cubic, and compare the two criteria for polynomial fits of degree 1 through 5.
Cp = function(y,W){
### Cp from output of lm
n = length(W$residuals)
p = length(W$coefficients)
sigma = summary(W)$sigma
train = mean( (y - fitted(W))^2)
mallows = train + 2*sigma^2*p/n
return(mallows)
}
CV = function(y,W){
### leave-one-out cross-validation from output of lm
numerator = y - fitted(W)
denominator = 1 - hatvalues(W)
LOOCV = mean( (numerator/denominator)^2 )
return(LOOCV)
}
pdf("AnotherExample.pdf")
par(mfrow=c(2,1))
n = 100
x = runif(n,-1,1)
y = x^3 + rnorm(n,0,.1)
plot(x,y)
outcp = rep(0,5)
outcv = rep(0,5)
for(i in 1:5){
out = lm(y ~ poly(x,i))
outcp[i] = Cp(y,out)
outcv[i] = CV(y,out)
}
a = min(c(outcp,outcv))
b = max(c(outcp,outcv))
plot(1:5,outcp,type="l",lwd=3,ylim=c(a,b),xlab="Degree",ylab="Error")
lines(1:5,outcv,col="red",lwd=3)
dev.off()
[Figure: top panel, scatter plot of the cubic example data (y against x); bottom panel, Cp (black) and
LOOCV (red) as a function of polynomial degree.]
6 Inference after Selection with Data Splitting
All of the inferential statistics we have done in earlier lectures presumed that our choice of model
was completely fixed, and not at all dependent on the data. If different data sets would lead us
to use different models, and our data are (partly) random, then which model we’re using is also
random. This leads to some extra uncertainty in, say, our estimate of the slope on X1 , which is not
accounted for by our formulas for the sampling distributions, hypothesis tests, confidence sets, etc.
A very common response to this problem, among practitioners, is to ignore it, or at least hope
it doesn’t matter. This can be OK, if the data-generating distribution forces us to pick one model
with very high probability, or if all of the models we might pick are very similar to each other.
Otherwise, ignoring it leads to nonsense.
Here, for instance, I simulate 200 data points where the Y variable is a standard Gaussian, and
there are 100 independent predictor variables, all also standard Gaussians, independent of each
other and of Y :
n = 200
p = 100
y = rnorm(n)
x = matrix(rnorm(n*p),nrow=n)
df = data.frame(y=y,x)
mdl = lm(y~., data=df)
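One way to pick out the predictors whose t-statistics are significant at the 0.05 level, and to collect
their column indices in a vector stars for refitting (my own reconstruction of this step), is:

pvals = summary(mdl)$coefficients[-1, 4]   # p-values for the 100 slope t-tests
stars = 1 + which(pvals < 0.05)            # +1 because column 1 of df is y
## refit using just those columns: lm(y ~ ., data = df[, c(1, stars)])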
Of the 100 predictors, 5 have t-statistics which are significant at the 0.05 level or less. (The
expected number would be 5.) If we select the model using just those variables we get
##
## Call:
## lm(formula = y ~ ., data = df[, c(1, stars)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53035 -0.75081 0.03042 0.58347 2.63677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03084 0.07092 0.435 0.6641
## X21 -0.13821 0.07432 -1.860 0.0644
## X25 0.12472 0.06945 1.796 0.0741
## X41 0.13696 0.07279 1.882 0.0614
## X83 -0.03067 0.07239 -0.424 0.6722
## X88 0.14585 0.07040 2.072 0.0396
##
## Residual standard error: 0.9926 on 194 degrees of freedom
## Multiple R-squared: 0.06209, Adjusted R-squared: 0.03792
## F-statistic: 2.569 on 5 and 194 DF, p-value: 0.02818
Notice that final overall F statistic: it's testing whether including those variables fits better
than an intercept-only model, and saying it thinks it does, with a clearly significant p-value. This
is the case even though, by construction, the response is completely independent of all the predictors.
This is not a fluke: if you re-run my simulation many times, the p-values of the full F test will not
be uniformly distributed (as they would be if we always tested the model with all 100 predictors), but
rather will have a distribution strongly shifted over to the left. Similarly, if we looked at the
confidence intervals, they would be much too narrow.
These issues do not go away if the true model isn’t “everything is independent of everything
else”, but rather has some structure. Because we picked the model to predict well on this data,
if we then run hypothesis tests on that same data, they’ll be too likely to tell us everything is
significant, and our confidence intervals will be too narrow. Doing statistical inference on the same
data we used to select our model is just broken. It may not always be as spectacularly broken as
in my demo above, but it’s still broken.
There are three ways around this. One is to pretend the issue doesn't exist; as I said, this
is popular, but it's got nothing else to recommend it. Another is to not do tests or confidence
intervals at all. The third approach, which is in many ways the simplest, is to use data splitting.
Data splitting is (for regression) a very simple procedure:
• Randomly split the data into two halves.
• Calculate your favorite model selection criterion for all your candidate models using only the
first part of the data. Pick one model as the winner.
• Re-estimate the winner, and calculate all your inferential statistics, using only the other half
of the data.
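Here is a small sketch of data splitting in R (the simulated data, the candidate polynomial models, and
the use of LOOCV as the selection criterion are all my own illustrative choices):

set.seed(7)
n = 200
x = runif(n,-1,1)
y = x^3 + rnorm(n,0,.1)
D = data.frame(x=x,y=y)
half = sample(1:n, n/2)                     # a random half for model selection
Dsel = D[half,]
Dinf = D[-half,]
## step 1: pick the polynomial degree by LOOCV on the selection half
loocv = rep(0,5)
for(j in 1:5){
fit = lm(y ~ poly(x,j), data=Dsel)
loocv[j] = mean(((Dsel$y - fitted(fit))/(1 - hatvalues(fit)))^2)
}
best = which.min(loocv)
## step 2: re-estimate the winner on the other half and do inference there
final = lm(y ~ poly(x,best), data=Dinf)
summary(final)                              # tests and CIs come from the held-out half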
7 History
Cross-validation goes back in statistics into the 1950s, if not earlier, but did not become formalized
as a tool until the 1970s, with the work of Stone (1974). It was adopted, along with many other
statistical ideas, by computer scientists during the period in the late 1980s–early 1990s when the
modern area of “machine learning” emerged from (parts of) earlier areas called “artificial intelli-
gence”, “pattern recognition”, “connectionism”, “neural networks”, or indeed “machine learning”.
Subsequently, many of the scientific descendants of the early machine learners forgot where their
ideas came from, to the point where many people now think cross-validation is something computer
science contributed to data analysis.