
Model Selection & Cross-Validation

Last time we talked about how to use the F test to test between two different models, when
one is nested inside the other. However, often we might have many possible models to choose from,
e.g.:

• including some covariates but not others

• all possible two-way interactions

• all possible squared and cubic terms

• combinations of the above, etc.

What should be done then?

1 Training error & prediction risk


Let \hat{m}(X_i) be the predicted value from some fitted regression model. The in-sample or training error

\hat{R}_{train} = \frac{1}{n} \sum_i (Y_i - \hat{m}(X_i))^2     (1)

measures how well the model does on the exact same observations it was built from. This is like
seeing a test’s answers before you take it! It is not a good estimate of what we really care about,
which is the out-of-sample or prediction risk (i.e., the error on predicting a new, future Y we
haven’t seen yet),
R = R(\hat{m}) = E[(Y - \hat{m}(X))^2]     (2)

where (X, Y) is a new observation that was not used to construct \hat{m}(X). The prediction risk is also known as the generalization error.
\hat{R}_{train} can be extremely different from R when there are many features. And as you'd expect, \hat{R}_{train} will always be an underestimate of R, and will always be made smaller with larger, more complex models.
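
A quick simulation sketch (mine, not from the notes; the design, noise level, and polynomial degrees are arbitrary choices) illustrates the gap: adding junk polynomial terms keeps driving the training error down, while the error on fresh responses does not improve.

## Sketch: training error vs. error on new responses at the same x's
set.seed(1)
n = 50
x = runif(n)
y = 1 + 2*x + rnorm(n, 0, 0.5)
y_new = 1 + 2*x + rnorm(n, 0, 0.5)       # fresh responses, same x's, new noise
for (deg in c(1, 5, 15)) {
  fit = lm(y ~ poly(x, deg))
  cat("degree", deg,
      "  train:", round(mean((y - fitted(fit))^2), 3),
      "  new:",   round(mean((y_new - fitted(fit))^2), 3), "\n")
}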

2 Generalization and Optimism


Here we will study the precise difference between \hat{R}_{train} and R, which will lead to a practical method for correcting \hat{R}_{train} to get a better estimate of out-of-sample risk.
We assume the linear model

Y = X\beta + \epsilon.

From these data we constructed our estimator \hat{\beta}, which has dimension q = p + 1. Our fitted model is \hat{m}(X) = \hat{\beta}_0 + \sum_j \hat{\beta}_j X_j. The training error, also called the in-sample error, is as usual

\hat{R}_{train} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2.
Now imagine that we get a new data set at the same X_i's but with new errors:

Y' = X\beta + \epsilon'

where \epsilon and \epsilon' are independent but identically distributed. The design matrix is the same, the true parameters \beta are the same, but the noise is different. This might seem a bit strange. But it's similar to observing a new (X, Y).

How well can we predict these new Y_i'? The predicted values using our model are still \hat{m}(X_i). Define the out-of-sample prediction error by

\hat{R}_{out} = \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2.
We will now show that

E[\hat{R}_{train}] = E[\hat{R}_{out}] - \frac{2\sigma^2 q}{n}     (3)

which shows that \hat{R}_{train} is a biased estimate of the out-of-sample error, in that

E[\hat{R}_{train}] < E[\hat{R}_{out}],

and also suggests that we can use \hat{R}_{train} + \frac{2\hat{\sigma}^2 q}{n} to estimate the generalization error.

As we often do, we will condition on X, treating it as fixed. Notice that \hat{m}(X_i) is a function of all the Y_i, so these are dependent random variables. On the other hand, \hat{m}(X_i) and Y_i' are completely statistically independent.

Theorem 1

E[\hat{R}_{train}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2 \right] - \frac{2\sigma^2 q}{n}     (4)

Proof. Now

E[(Y_i - \hat{m}(X_i))^2] = Var[Y_i - \hat{m}(X_i)] + (E[Y_i - \hat{m}(X_i)])^2     (5)
= Var[Y_i] + Var[\hat{m}(X_i)] - 2 Cov[Y_i, \hat{m}(X_i)] + (E[Y_i] - E[\hat{m}(X_i)])^2.     (6)

On the other hand,

E[(Y_i' - \hat{m}(X_i))^2] = Var[Y_i' - \hat{m}(X_i)] + (E[Y_i' - \hat{m}(X_i)])^2     (7)
= Var[Y_i'] + Var[\hat{m}(X_i)] - 2 Cov[Y_i', \hat{m}(X_i)] + (E[Y_i'] - E[\hat{m}(X_i)])^2.     (8)

Now Y_i' is independent of Y_i, but has the same distribution. This tells us that E[Y_i'] = E[Y_i], Var[Y_i'] = Var[Y_i], but Cov[Y_i', \hat{m}(X_i)] = 0. So

E[(Y_i' - \hat{m}(X_i))^2] = Var[Y_i] + Var[\hat{m}(X_i)] + (E[Y_i] - E[\hat{m}(X_i)])^2     (9)
= E[(Y_i - \hat{m}(X_i))^2] + 2 Cov[Y_i, \hat{m}(X_i)].     (10)

Averaging over data points,

E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2 \right] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 \right] + \frac{2}{n} \sum_{i=1}^{n} Cov[Y_i, \hat{m}(X_i)].

Now

Cov(Y, \hat{Y}) = Cov(Y, HY) = \sigma^2 H

and so Cov[Y_i, \hat{m}(X_i)] = \sigma^2 H_{ii}. So

E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i' - \hat{m}(X_i))^2 \right] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 \right] + \frac{2}{n} \sigma^2 \operatorname{tr} H = E\left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 \right] + \sigma^2 \frac{2q}{n}. □

The term (2/n)\sigma^2 q is called the optimism of the model: the amount by which the training error systematically under-estimates its true expected squared error. Notice that the optimism:

• Grows with \sigma^2: more noise gives the model more opportunities to seem to fit well by capitalizing on chance.

• Shrinks with n: at any fixed level of noise, more data makes it harder to pretend the fit is
better than it really is.

• Grows with q: every extra parameter is another control which can be adjusted to fit to the
noise.

Minimizing the training error completely ignores the bias from optimism, so it is guaranteed to
pick models which are too large and predict poorly out of sample.
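
A small Monte Carlo sketch (mine, not from the notes; the design with n = 50 and five standard Gaussian predictors is arbitrary) makes the size of the optimism concrete:

## Compare the average gap between out-of-sample and training error
## with the theoretical optimism 2*sigma^2*q/n.
set.seed(2)
n = 50; p = 5; sigma = 1
X = matrix(rnorm(n*p), nrow = n)                 # fixed design across replications
beta = rep(1, p)
gaps = replicate(2000, {
  y     = drop(X %*% beta) + rnorm(n, 0, sigma)  # training responses
  y_new = drop(X %*% beta) + rnorm(n, 0, sigma)  # same X's, fresh noise
  fit = lm(y ~ X)
  mean((y_new - fitted(fit))^2) - mean((y - fitted(fit))^2)
})
mean(gaps)                # close to ...
2*sigma^2*(p + 1)/n       # ... the theoretical optimism with q = p + 1 = 6, i.e. 0.24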
We can now state an estimate of the risk. Define

C_p = \hat{R}_{train} + \frac{2\hat{\sigma}^2 q}{n}     (11)

which is known as Mallows' C_p. We see that E[C_p] \approx E[\hat{R}_{out}]. You can compute C_p in R as follows:

out = lm(y ~ x1 + x2)                        # fit the candidate model with your chosen covariates
n = length(y)
s = summary(out)
q = length(out$coefficients)                 # number of coefficients, including the intercept
sigma = s$sigma                              # estimated noise standard deviation
training_error = mean((y - fitted(out))^2)
Cp = training_error + 2*sigma^2*q/n

3 Cross-Validation
Mallow’s Cp is nice, but our derivation assumed that the linear model is correct. Is there a way to
estimate prediction error without assuming the model is correct? Yes! It’s called cross-validation,
and is one of the most powerful tools in statisticsP
& data science.
The basic idea is this. The training error n−1 i (Yi − m(X
b i ))2 is biased because Yi and m(X
b i)
are dependent. Cross-validation methods break the data into a training set to get m(X
b i ) and a test
set to get Yi . This makes them independent and mimics what it is like to predict a new observation.
There are two main flavors of cross-validation: K-fold cross-validation and leave-one-out cross-
validation.

3
3.1 K-fold Cross-Validation
K-fold cross-validation goes as follows.

• Randomly divide the data into K equally-sized parts, or “folds”. A common choice is K = 5
or K = 10.

• For each fold

– Temporarily hold out that fold, calling it the “testing set”.


– Call the other K − 1 folds, taken together, the “training set”.
– Estimate the model on the training set.
– Calculate the prediction error on the testing set.

• Average these prediction-error estimates over the folds.

In other words, divide the data into K groups B_1, \ldots, B_K. For j \in \{1, \ldots, K\}, estimate \hat{m} from the data \{B_1, \ldots, B_{j-1}, B_{j+1}, \ldots, B_K\}. Then let

\hat{G}_j = \frac{1}{n_j} \sum_{i \in B_j} (Y_i - \hat{m}(X_i))^2

where n_j is the number of points in B_j. Finally, we estimate the generalization error by

\hat{G} = \frac{1}{K} \sum_{j=1}^{K} \hat{G}_j.
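
If you want to implement this yourself (writing your own code is suggested again in Section 4 below), here is a minimal sketch, assuming squared-error loss, an lm-style model, and a data frame whose response column is named y; none of this is the notes' own code.

kfold_cv = function(formula, data, K = 5) {
  n = nrow(data)
  folds = sample(rep(1:K, length.out = n))   # random fold assignment
  G = numeric(K)
  for (j in 1:K) {
    train = data[folds != j, ]               # the other K-1 folds
    test  = data[folds == j, ]               # the held-out fold
    fit = lm(formula, data = train)
    G[j] = mean((test$y - predict(fit, newdata = test))^2)
  }
  mean(G)                                    # average over the folds
}

For a data frame D with columns x and y, something like kfold_cv(y ~ x, D, K = 5) would then give a 5-fold estimate of the generalization error.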

3.2 Leave-one-out Cross-Validation (LOOCV)


Let \hat{m}_{-i}(X_i) be the predicted value at X_i when we leave out (X_i, Y_i) from the dataset. The leave-one-out cross-validation score (LOOCV) is

LOOCV = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}_{-i}(X_i))^2.

Computing LOOCV sounds painful, since it seems to require refitting the model n times. Fortunately, there is a simple, amazing shortcut formula:

LOOCV = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{m}(X_i)}{1 - H_{ii}} \right)^2.

So computing LOOCV is actually very fast.
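
As a sanity check, here is a small sketch (mine, not the notes') comparing the shortcut against the brute-force definition, refitting the model n times on made-up data:

set.seed(3)
n = 30
x = runif(n)
y = 1 + 2*x + rnorm(n, 0, 0.3)
fit = lm(y ~ x)
shortcut = mean(((y - fitted(fit))/(1 - hatvalues(fit)))^2)
brute = mean(sapply(1:n, function(i) {
  fit_i = lm(y ~ x, subset = -i)             # refit without observation i
  (y[i] - predict(fit_i, newdata = data.frame(x = x[i])))^2
}))
c(shortcut = shortcut, brute = brute)        # the two numbers agree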


It is also interesting to note the following. We know that tr(H) = q, so the average value of the H_{ii}'s is \gamma \equiv q/n. If we approximate each H_{ii} by \gamma we have

LOOCV \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{m}(X_i)}{1 - \gamma} \right)^2.

By a Taylor expansion, (1 - \gamma)^{-2} \approx 1 + 2\gamma for small \gamma. Hence,

LOOCV \approx \frac{1 + 2\gamma}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2     (12)
= \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 + 2\gamma \cdot \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2     (13)
\approx \hat{R}_{train} + \frac{2\hat{\sigma}^2 q}{n} = C_p,     (14)

using \gamma = q/n and \hat{R}_{train} \approx \hat{\sigma}^2 in the second term. So, Mallows' Cp can be thought of as an approximation to cross-validation.

4 R Practicalities
You can get the LOOCV from R as follows:

out = lm(y ~ x)
numerator = y - fitted(out)
denominator = 1 - hatvalues(out)
LOOCV = mean( (numerator/denominator)^2 )

To get Cp I suggest you use the code after (11). To get K-fold cross-validation you can either
write your own code (excellent idea!) or use the boot package together with glm as follows:

library(boot)
out = glm(y ~ x, data = D)           # glm with the default gaussian family gives the same fit as lm
cv.glm(D, out, K=5)$delta[1]         # delta[1] is the raw K-fold CV estimate of prediction error

This looks pretty strange, but that will give you the value that you want. (There may be an
easier way. Let me know if you find one.) Here is an example. We will generate data from a
quadratic. We will then fit polynomials up to order 10. Then we will plot the LOOCV and the
K-fold cross-validation using K = 5.

library("boot")

## generate the data


n = 100
x = runif(n)
y = 2 + x - 3*x^2 + rnorm(n,0,.1)
D = data.frame(x=x,y=y)

plot(x,y)

LOOCV = rep(0,10)
KFoldCV = rep(0,10)

## fit polynomials and get the cross-validation scores
for(j in 1:10){
out = glm(y ~ poly(x,j),data=D)
print(summary(out))
LOOCV[j] = mean(((y - fitted(out))/(1-hatvalues(out)))^2)
KFoldCV[j] = cv.glm(D,out,K=5)$delta[1]
}

## plot them
plot(1:10,LOOCV,type="l",lwd=3)
lines(1:10,KFoldCV,lwd=3,col="blue")

[Scatterplot of the simulated data, y against x, omitted.]

Figure 1: The data.

[Plot of the cross-validation scores against polynomial order omitted.]

Figure 2: LOOCV is in black. K-fold is in blue. The x-axis is the order of the polynomial.

5 Another Example

Cp = function(y,W){
  ### Cp from output of lm
  n = length(W$residuals)
  q = length(W$coefficients)            # number of coefficients, including the intercept
  sigma = summary(W)$sigma
  train = mean( (y - fitted(W))^2 )     # use the fitted model W that was passed in
  mallows = train + 2*sigma^2*q/n
  return(mallows)
}

CV = function(y,W){
  ### leave-one-out cross-validation from output of lm
  numerator = y - fitted(W)
  denominator = 1 - hatvalues(W)
  LOOCV = mean( (numerator/denominator)^2 )
  return(LOOCV)
}

pdf("AnotherExample.pdf")
par(mfrow=c(2,1))
n = 100
x = runif(n,-1,1)
y = x^3 + rnorm(n,0,.1)
plot(x,y)
outcp = rep(0,5)
outcv = rep(0,5)

for(i in 1:5){
out = lm(y ~ poly(x,i))
outcp[i] = Cp(y,out)
outcv[i] = CV(y,out)
}

a = min(c(outcp,outcv))
b = max(c(outcp,outcv))
plot(1:5,outcp,type="l",lwd=3,ylim=c(a,b),xlab="Degree",ylab="Error")
lines(1:5,outcv,col="red",lwd=3)
dev.off()

[Top panel: scatterplot of the simulated data, y against x. Bottom panel: Cp (black) and LOOCV (red) against polynomial degree.]

Figure 3: Comparing Cp and LOOCV.

6 Inference after Selection with Data Splitting
All of the inferential statistics we have done in earlier lectures presumed that our choice of model
was completely fixed, and not at all dependent on the data. If different data sets would lead us
to use different models, and our data are (partly) random, then which model we’re using is also
random. This leads to some extra uncertainty in, say, our estimate of the slope on X1 , which is not
accounted for by our formulas for the sampling distributions, hypothesis tests, confidence sets, etc.
A very common response to this problem, among practitioners, is to ignore it, or at least hope
it doesn’t matter. This can be OK, if the data-generating distribution forces us to pick one model
with very high probability, or if all of the models we might pick are very similar to each other.
Otherwise, ignoring it leads to nonsense.
Here, for instance, I simulate 200 data points where the Y variable is a standard Gaussian, and
there are 100 independent predictor variables, all also standard Gaussians, independent of each
other and of Y :

n = 200
p = 100
y = rnorm(n)
x = matrix(rnorm(n*p),nrow=n)
df = data.frame(y=y,x)
mdl = lm(y~., data=df)
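
The refit shown below uses an object called stars that the notes never define; presumably it holds the indices of the columns of df whose predictors came out individually significant in mdl. A minimal sketch of one way to build it (an assumption on my part, not the notes' code):

## Collect the column indices of the individually significant predictors.
## Column 1 of df is y, so the predictor in coefficient row k+1 sits in column k+1.
pvals = summary(mdl)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
stars = which(pvals < 0.05) + 1                     # shift past the y column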

Of the 100 predictors, 5 have t-statistics which are significant at the 0.05 level or less. (The
expected number would be 5.) If we select the model using just those variables we get

##
## Call:
## lm(formula = y ~ ., data = df[, c(1, stars)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53035 -0.75081 0.03042 0.58347 2.63677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03084 0.07092 0.435 0.6641
## X21 -0.13821 0.07432 -1.860 0.0644
## X25 0.12472 0.06945 1.796 0.0741
## X41 0.13696 0.07279 1.882 0.0614
## X83 -0.03067 0.07239 -0.424 0.6722
## X88 0.14585 0.07040 2.072 0.0396
##
## Residual standard error: 0.9926 on 194 degrees of freedom
## Multiple R-squared: 0.06209,Adjusted R-squared: 0.03792
## F-statistic: 2.569 on 5 and 194 DF, p-value: 0.02818

Notice that final overall F statistic: it's testing whether including those variables fits better
than an intercept-only model, and saying it thinks it does, with a definitely significant p-value. This
is the case even though, by construction, the response is completely independent of all predictors.
This is not a fluke: if you re-run my simulation many times, your p-values in the full F test will not
be uniformly distributed (as they would be on all 100 predictors), but rather will have a distribution
strongly shifted over to the left. Similarly, if we looked at the confidence intervals, they would be
much too narrow.
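
To see this concretely, here is a sketch (mine, not from the notes) that repeats the whole select-then-test procedure many times and collects the overall F-test p-values; the 500 replications and the selection rule are my own choices, mirroring the setup above.

## Repeat: simulate pure-noise data, select "significant" predictors, refit,
## and record the overall F-test p-value of the selected model.
sim_pval = function(n = 200, p = 100) {
  y = rnorm(n)
  x = matrix(rnorm(n*p), nrow = n)
  df = data.frame(y = y, x)
  pv = summary(lm(y ~ ., data = df))$coefficients[-1, "Pr(>|t|)"]
  stars = which(pv < 0.05) + 1
  if (length(stars) == 0) return(NA)        # occasionally nothing is selected
  f = summary(lm(y ~ ., data = df[, c(1, stars)]))$fstatistic
  pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
}
set.seed(4)
fpvals = replicate(500, sim_pval())
hist(na.omit(fpvals))                       # piled up near 0, not uniform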
These issues do not go away if the true model isn’t “everything is independent of everything
else”, but rather has some structure. Because we picked the model to predict well on this data,
if we then run hypothesis tests on that same data, they’ll be too likely to tell us everything is
significant, and our confidence intervals will be too narrow. Doing statistical inference on the same
data we used to select our model is just broken. It may not always be as spectacularly broken as
in my demo above, but it’s still broken.
There are three ways around this. One is to pretend the issue doesn't exist; as I said, this is popular, but it's got nothing else to recommend it. Another is to not do tests or confidence intervals. The third approach, which is in many ways the simplest, is to use data splitting.
Data splitting is (for regression) a very simple procedure:

• Randomly divide your data set into two parts.

• Calculate your favorite model selection criterion for all your candidate models using only the
first part of the data. Pick one model as the winner.

• Re-estimate the winner, and calculate all your inferential statistics, using only the other half
of the data.

(Division into two equal halves is optional, but usual.)
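
Here is a minimal sketch of the procedure (mine, not from the notes); the data-generating model, the candidate polynomial degrees, and the use of Cp as the selection criterion are arbitrary choices for illustration.

set.seed(5)
n = 200
x = runif(n)
y = 2 + x - 3*x^2 + rnorm(n, 0, 0.1)
D = data.frame(x = x, y = y)

half = sample(1:n, n/2)                      # random split into two halves
D1 = D[half, ]                               # half used to select the model
D2 = D[-half, ]                              # half used for inference

cp = sapply(1:5, function(j) {
  fit = lm(y ~ poly(x, j), data = D1)
  q = length(fit$coefficients)
  sigma = summary(fit)$sigma
  mean(residuals(fit)^2) + 2*sigma^2*q/nrow(D1)
})
best = which.min(cp)                         # winning degree on the first half

final = lm(y ~ poly(x, best), data = D2)     # re-estimate the winner on the other half
summary(final)                               # tests and intervals can now treat the
confint(final)                               # selected model as fixed in advance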


Because the winning model is statistically independent of the second half of the data, the
confidence intervals, hypothesis tests, etc., can treat it as though that model were fixed a priori.
Since we’re only using n/2 data points to calculate confidence intervals (or whatever), they will be
somewhat wider than if we really had fixed the model in advance and used all n data points, but
that can be viewed as the price we pay for having to select a model based on data. Alternatively,
after our first split-and-test, we could then swap our two halves and average the corresponding
results to get back the full sample size n instead of n/2.

7 History
Cross-validation goes back in statistics to the 1950s, if not earlier, but did not become formalized as a tool until the 1970s, with the work of Stone (1974). It was adopted, along with many other
statistical ideas, by computer scientists during the period in the late 1980s–early 1990s when the
modern area of “machine learning” emerged from (parts of) earlier areas called “artificial intelli-
gence”, “pattern recognition”, “connectionism”, “neural networks”, or indeed “machine learning”.
Subsequently, many of the scientific descendants of the early machine learners forgot where their
ideas came from, to the point where many people now think cross-validation is something computer
science contributed to data analysis.

