Linear Model Selection and Regularization
library(ISLR)
library(MASS)
library(leaps)
library(glmnet)
library(pls)
library(caTools)
Conceptual
1. (a)
• Best subset selection has the lowest training RSS, as it fits models for every possible combination of
predictors. When p is very large, this increases the chance of finding models that fit the training data
very well.
(b)
• The best test RSS could be provided by any of the three approaches. Best subset considers more models
than the other two, and the best model found on the training set could also be the best one for a test set.
Forward and backward stepwise consider far fewer models, but they might still find a model that fits the
test set very well, as they tend to avoid overfitting compared with best subset.
(c)
• True; the (k+1)-variable model is derived by adding the predictor that gives the greatest improvement
to the k-variable model.
• True; the k-variable model is derived by removing the least useful predictor from the (k+1)-variable model.
• False; forward and backward stepwise can select different predictors.
• False; forward and backward stepwise can select different predictors.
2. (a)
(iii) Less flexible, and it will give improved prediction accuracy when its increase in bias is less than its
decrease in variance. As lambda increases, the flexibility of the fit decreases and the estimated coefficients
shrink, with some becoming exactly zero. This leads to a substantial decrease in the variance of the
predictions for a small increase in bias.
(b)
(c)
(ii) Non-linear methods will generally be more flexible, so their predictions tend to have higher variance and
lower bias. Predictions will therefore improve if the increase in variance is smaller than the decrease in bias
(the bias-variance trade-off).
3. (a)
• (i) Decrease steadily. As s increases, the constraint on beta relaxes and the training RSS decreases until we
reach the least squares solution.
(b)
• (ii) Decreases initially as the RSS is reduced from its maximum value (when beta = 0) as we move towards
the values of beta that best fit the training data. It eventually starts increasing as the beta values fit the
training set extremely well, overfitting it, and so the test RSS rises.
(c)
(d)
(e)
4. (a)
• (iii) Steadily increase. As lambda increases, the penalty on beta also increases, so beta becomes
progressively smaller. As such, the training RSS steadily increases towards its maximum value
(when beta ≈ 0 at very large lambda values).
(b)
• (ii) Decreases initially and then starts increasing, in a U shape. When lambda = 0, we get the least squares
fit, which has high variance and low bias. As lambda increases, the flexibility of the fit decreases,
reducing the variance of the predictions for a small increase in bias. This results in more accurate
predictions, so the test RSS decreases initially. As lambda increases beyond the ideal point, the
increase in bias outweighs the reduction in variance and the predictions become more biased.
Consequently, the test RSS starts to rise.
(c)
(d)
(e)
6. (a)
# Chart of Ridge Regression Eqn (6.12) against beta.
y = 10
beta = seq(-10, 10, 0.1)
lambda = 5
# Eqn (6.12) output for a single observation with no intercept.
eqn1 = (y - beta)^2 + lambda*beta^2
# Grid index of the minimum, and the analytical minimizer y/(1 + lambda).
which.min(eqn1)
## [1] 118
est.beta = y/(1 + lambda)
est.value = (y - est.beta)^2 + lambda*est.beta^2
# Plot
plot(beta, eqn1, main="Ridge Regression Optimization", xlab="beta", ylab="Ridge Eqn output", type="l")
points(beta[118], eqn1[118], col="red", pch=24, type="p")
points(est.beta, est.value, col="blue", pch=20, type="p")
[Figure: Ridge Regression Optimization, Eqn (6.12) output against beta with the minimum marked.]
• For y = 10 and lambda = 5, beta = 10/(1 + 5) ≈ 1.67 minimizes the ridge regression equation.
• As can also be seen on the graph, the minimum is at beta ≈ 1.67.
(b)
# Chart of Lasso Eqn (6.13) against beta.
# Eqn (6.13) output
eqn2 = (y - beta)^2 + lambda*(abs(beta))
# Analytical minimizer for y > lambda/2: beta = y - lambda/2.
est.beta2 = y - lambda/2
est.value2 = (y - est.beta2)^2 + lambda*abs(est.beta2)
# Plot
plot(beta, eqn2, main="Lasso Optimization", xlab="beta", ylab="Eqn 6.13 output", type="l")
points(est.beta2, est.value2, col="red", pch=20, type="p")
[Figure: Lasso Optimization, Eqn (6.13) output against beta with the minimum marked.]
• For y = 10 and lambda = 5, beta = 10 - (5/2) = 7.5 minimizes the lasso equation.
• As can also be seen on the graph, the minimum is at beta = 7.5.
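As a quick sanity check on both minimizers, base R's optimize() can minimize each criterion numerically. This snippet is an illustration rather than part of the original solution; the search interval is arbitrary.
# Numerical check of the ridge and lasso minimizers for y = 10, lambda = 5.
y = 10; lambda = 5
# Ridge criterion (6.12): minimum should be at y/(1 + lambda) = 10/6, i.e. about 1.67.
optimize(function(b) (y - b)^2 + lambda*b^2, interval = c(-10, 10))$minimum
# Lasso criterion (6.13): minimum should be at y - lambda/2 = 7.5 (since y > lambda/2).
optimize(function(b) (y - b)^2 + lambda*abs(b), interval = c(-10, 10))$minimum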
Applied
8. (a) (b)
set.seed(1)
X = c(rnorm(100))
e = rnorm(100, mean=0, sd=0.25)
b0 = 50 ; b1 = 6 ; b2 = 3 ; b3 = 1.5
Y = c(b0 + b1*X + b2*X^2 + b3*X^3 + e)
(c)
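The code that builds the polynomial data frame and runs best subset selection is not shown in this extract, so the reg.summary object below would otherwise be undefined. A minimal sketch of the presumed setup, mirroring the df2/regfit.full2 code used in part (f); the names df and regfit.full are assumptions:
# Data frame with X, X^2, ..., X^10 and the response Y, then best subset selection.
df = data.frame(X, X^2, X^3, X^4, X^5, X^6, X^7, X^8, X^9, X^10, Y)
regfit.full = regsubsets(Y ~ ., data = df, nvmax = 10)
reg.summary = summary(regfit.full)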
reg.summary$cp
reg.summary$bic
reg.summary$adjr2
• Cp falls substantially from the one- and two-variable models to the three-variable model. It decreases
slightly for the four-variable model and rises in small increments thereafter.
• The BIC value is lowest for the three-variable model.
• Adjusted R2 increases to 0.999 for the three-variable model from the two-variable value of 0.95. The
metric does not change much for the larger models.
• These statistical metrics point to the three-variable model (with the squared and cubed terms) as the
best choice. We can confirm this visually in the charts below.
[Figures: Cp, BIC and Adjusted R2 plotted against the number of variables.]
(d)
## 3  ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 4  ( 1 ) "*" "*" "*" " " "*" " " " " " " " " " "
## 5  ( 1 ) "*" "*" "*" " " "*" "*" " " " " " " " "
## 6  ( 1 ) "*" "*" "*" " " "*" "*" " " " " "*" " "
## 7  ( 1 ) "*" "*" "*" " " "*" "*" "*" " " "*" " "
## 8  ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " "
## 9  ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
regfwd.summary$cp
regfwd.summary$bic
regfwd.summary$adjr2
• The statistical metrics are very similar to those for best subset selection.
## 4  ( 1 ) "*" "*" "*" " " " " " " " " " " "*" " "
## 5  ( 1 ) "*" "*" "*" " " " " " " " " "*" "*" " "
## 6  ( 1 ) "*" "*" "*" " " " " " " " " "*" "*" "*"
## 7  ( 1 ) "*" "*" "*" " " " " "*" " " "*" "*" "*"
## 8  ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" "*"
## 9  ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
regbwd.summary$cp
regbwd.summary$bic
regbwd.summary$adjr2
• Results are very similar to best subset and forward selection. As expected, all these metrics show that
the three-variable model with the squared and cubed terms is the best.
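The forward and backward stepwise fits that produce regfwd.summary and regbwd.summary above are not shown in this extract. A minimal sketch of the presumed calls; the object names regfit.fwd and regfit.bwd are assumptions:
# Forward and backward stepwise selection on the same polynomial data frame.
regfit.fwd = regsubsets(Y ~ ., data = df, nvmax = 10, method = "forward")
regfwd.summary = summary(regfit.fwd)
regfit.bwd = regsubsets(Y ~ ., data = df, nvmax = 10, method = "backward")
regbwd.summary = summary(regfit.bwd)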
(e)
# Lasso model.
lasso = glmnet(x[train,], y[train], alpha=1)
[Figure: cross-validation mean-squared error against log(λ) for the lasso, with the number of non-zero coefficients along the top.]
bestlam = cv.out$lambda.min
• The lasso produces a sparse model with four variables. The intercept and the coefficients for X, X^2
and X^3 closely match the ones chosen in 8(b), whilst the coefficient for X^4 is very small. Therefore, this
model provides an accurate estimate of the response Y.
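The construction of x, y and train for this part, the cross-validation object cv.out, and the extraction of the lasso coefficients discussed above are not shown in the extract. A sketch under assumed names, following the glmnet pattern used elsewhere in this document; the half/half split is an assumption:
# Predictor matrix, response and a training split for the polynomial data.
x = model.matrix(Y ~ ., df)[, -1]
y = df$Y
train = sample(1:nrow(x), nrow(x)/2)   # split proportion is an assumption
test = (-train)
# Cross-validate the lasso, then inspect the coefficients at the best lambda.
cv.out = cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)
bestlam = cv.out$lambda.min
predict(lasso, type = "coefficients", s = bestlam)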
(f)
b7 = 5
Y2 = c(b0 + b7*X^7 + e)
df2 = data.frame(X,X^2,X^3,X^4,X^5,X^6,X^7,X^8,X^9,X^10,Y2)
# Best subset selection.
regfit.full2 = regsubsets(Y2~.,data=df2,nvmax=10)
reg.summary = summary(regfit.full2)
reg.summary
reg.summary$cp
reg.summary$bic
reg.summary$adjr2
• Adjusted R2 is 0.999+ for the one-variable model.
• These statistical metrics point to the one-variable model with the X^7 term as the best choice.
# Lasso model.
# Matrix of predictors for a response vector y.
x2 = model.matrix(Y2~.,df2)[,-1]
y2 = df2$Y2
## [1] 0.1960455
• The lasso with the best value of lambda results in a sparse model with one variable. It assigns a
non-zero coefficient to the variable (X^7) that best explains the response Y, and assigns zero to the
rest.
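The cross-validation and coefficient-extraction code behind this statement is not shown; whether a training split was used is also not shown, so the sketch below uses the full data for simplicity. The names cv.out2, bestlam2 and lasso2 are assumptions:
# Lasso on the X^7 data: choose lambda by cross-validation, then inspect the
# coefficients at that lambda.
cv.out2 = cv.glmnet(x2, y2, alpha = 1)
bestlam2 = cv.out2$lambda.min
bestlam2
lasso2 = glmnet(x2, y2, alpha = 1)
predict(lasso2, type = "coefficients", s = bestlam2)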
9. (a) (b)
set.seed(2)
x = model.matrix(Apps~.,College)[,-1]
y = College$Apps
grid = 10^seq(10,-2,length=100)
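The training/test split used throughout question 9 is not shown in the extract. A sketch of a typical half/half split (the proportion and the name train are assumptions; test and y.test follow the pattern used in question 11):
# Split the College data into training and test sets.
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
y.test = y[test]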
# Linear model using least squares (ridge regression with lambda=0) and test MSE.
linear.model = glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh=1e-12)
linear.pred = predict(linear.model, s=0, newx=x[test,],exact=T,x=x[train,],y=y[train])
mean((linear.pred-y.test)^2)
## [1] 1367387
# Dataframes for use with lm() function.
train.df = data.frame(College[train,])
test.df = data.frame(College[test,])
# Least squares fit with lm() and its test MSE.
lm.fit = lm(Apps ~ ., data=train.df)
lm.pred = predict(lm.fit, newdata=test.df)
mean((lm.pred - y.test)^2)
## [1] 1367393
• The MSEs from ridge regression with lambda = 0 and from the lm() function are reasonably similar; the
slight discrepancy is likely due to the approximation used by glmnet().
(c)
## [1] 1355876
• Best value of lambda is 387.9 and the test MSE is slightly lower than for least squares.
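The ridge regression code behind these figures is not shown in the extract. A sketch following the cv.glmnet pattern used for the lasso in part (d); the names cv.out and ridge.mod are assumptions, and ridge.pred is reused in part (g):
# Ridge regression: choose lambda by cross-validation, then compute the test MSE.
cv.out = cv.glmnet(x[train,], y[train], alpha = 0)
bestlam = cv.out$lambda.min
bestlam
ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid)
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2)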
(d)
lasso.mod = glmnet(x[train,],y[train],alpha=1,lambda=grid)
lasso.pred = predict(lasso.mod, s=bestlam, newx=x[test,])
err.lasso = mean((lasso.pred-y.test)^2)
err.lasso
## [1] 1357174
## [1] 18
lasso.coef
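lasso.coef itself is not defined in the extract; presumably it holds the lasso coefficients at the chosen lambda, along these lines (a sketch):
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam)
sum(lasso.coef != 0)   # number of non-zero coefficients, including the intercept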
(e)
[Figure: PCR validation plot, MSEP against the number of components (response: Apps).]
#summary(pcr.fit)
• From the graph we can observe that the MSE falls rapidly as M increases and is lowest at M=16.
However, the reduction from M=5 to M=16 is small compared with the reduction from M=1 to M=5.
• The test MSEs for both M values are shown below.
## [1] 1945054
## [1] 1230277
• The test MSE for M=5 is 1945054, which is much larger than for least squares.
• The best MSE is achieved at M=16, which gives a test MSE reasonably lower than least squares.
• Using M=17 gives the same result as least squares.
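The pcr() fit and the predictions that produce the numbers above are not shown in this extract. A sketch following the same pattern as the Boston PCR fit in question 11; pcr.pred is reused in part (g), while pcr.pred5 is an assumed name:
# Principal components regression with cross-validation on the training data,
# then test MSE at M = 5 and M = 16.
pcr.fit = pcr(Apps ~ ., data = College, subset = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred5 = predict(pcr.fit, x[test,], ncomp = 5)
mean((pcr.pred5 - y.test)^2)
pcr.pred = predict(pcr.fit, x[test,], ncomp = 16)
mean((pcr.pred - y.test)^2)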
(f)
[Figure: PLS validation plot, MSEP against the number of components (response: Apps).]
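The plsr() fit that produces the validation plot above and the pls.fit object below is not shown in the extract; a minimal sketch under the same assumptions as the PCR fit:
# Partial least squares with cross-validation on the training data.
pls.fit = plsr(Apps ~ ., data = College, subset = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")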
pls.pred = predict(pls.fit, x[test,], ncomp=8)
err.pls = mean((pls.pred-y.test)^2)
err.pls
## [1] 1298214
• The test MSE is similar to PCR and slightly lower than least squares.
(g)
[Figure: barplot comparing the test errors of the models.]
• All the models give reasonably similar results, with PCR and PLS giving slightly lower test MSEs.
test.avg = mean(y.test)
lm.r2 = 1 - mean((lm.pred - y.test)^2) / mean((test.avg - y.test)^2)
ridge.r2 = 1 - mean((ridge.pred - y.test)^2) / mean((test.avg - y.test)^2)
lasso.r2 = 1 - mean((lasso.pred - y.test)^2) / mean((test.avg - y.test)^2)
pcr.r2 = 1 - mean((pcr.pred - y.test)^2) / mean((test.avg - y.test)^2)
pls.r2 = 1 - mean((pls.pred - y.test)^2) / mean((test.avg - y.test)^2)
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), xlab="Models", ylab="R2",
names=c("lm", "ridge", "lasso", "pcr", "pls"))
[Figure: barplot of test R2 for each model.]
• Every model has an R2 of around 0.85 or above, so we can be reasonably confident about the
accuracy of the predictions.
10. (a)
n = 1000
p = 20
X = matrix(rnorm(n*p), n, p)
#B = rnorm(p)
B = sample(-10:10, 20)
B[1] = 0
B[4] = 0
B[7] = 0
B[11] = 0
B[15] = 0
B[19] = 0
e = rnorm(1000, mean=0, sd=0.1)
Y = X%*%B + e # %*% does matrix multiplication
df = data.frame(X, Y)
(b)
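The code for part (b), splitting the data into training and test sets, is not shown in the extract. The exercise asks for 100 training and 900 test observations; a sketch using the data-frame names train and test that the later code relies on (train.idx is an assumed name):
# Split into 100 training and 900 test observations.
train.idx = sample(1:n, 100)
train = df[train.idx, ]
test = df[-train.idx, ]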
(c)
[Figure: training MSE against the number of variables.]
• Minimum training MSE is at maximum number of variables: 20.
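The best subset fit and training-MSE computation for part (c) are not shown in the extract; a sketch (regfit.full is the name the later code uses, train.mse is an assumption):
# Best subset selection on the training set; training MSE at each model size.
regfit.full = regsubsets(Y ~ ., data = train, nvmax = 20)
train.mse = summary(regfit.full)$rss / nrow(train)
plot(1:20, train.mse, xlab = "Variables", ylab = "Training MSE", type = "b")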
(d)
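The loop below uses lm.regsubsets(), which is not a function from the leaps package and whose definition is not shown in the extract. It appears to be a helper that refits an ordinary lm() on the variables chosen at each model size; a sketch of one possible implementation (the details, including the default data argument, are assumptions):
lm.regsubsets = function(object, id, data = train) {
  # Refit lm() using the predictors selected by regsubsets() at model size id.
  vars = names(coef(object, id = id))[-1]
  form = as.formula(paste("Y ~", paste(vars, collapse = " + ")))
  lm(form, data = data)
}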
test.mse = rep(NA, 20)
for(i in 1:20){
  model = lm.regsubsets(regfit.full, i)
  model.pred = predict(model, newdata=test, type=c("response"))
  test.mse[i] = mean((test$Y - model.pred)^2)
}
# Plot
plot(1:20,test.mse,xlab = "Variables",ylab = "Test MSE",
main = "Test MSE v Number of Variables", pch = 1, type = "b")
[Figure: Test MSE v Number of Variables.]
• Test MSE decreases rapidly as the number of variables increases, but the minimum is not at the maximum
number of variables.
• The minimum test MSE occurs at 13 variables.
(e)
• The minimum test MSE occurs for a model with 13 variables. The test MSE decreases rapidly until it
reaches the minimum and then starts to rise thereafter.
• As the model flexibility increases, it is better able to fit the data set. This results in the test MSE
decreasing rapidly until it reaches a minimum. Thereafter, further increases in model flexibility cause
overfitting and hence an increase in the test MSE.
(f)
## (Intercept) X2 X3 X5 X6
## -0.001334982 -6.003548929 -7.008083537 -10.009342703 1.999414193
## X8 X9 X10 X13 X14
## 6.978017828 -2.989193466 4.997145917 6.005053131 3.973003175
## X16 X17 X18 X20
## 8.007345963 -7.999568356 9.993296372 3.005914845
## [1] 0 -6 -7 0 -10 2 0 7 -3 5 0 0 6 4 0 8 -8 10 0
## [20] 3
• The best model's variables exactly match the 13 non-zero variables from the original model, and their
respective coefficients are very similar.
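The code that prints the comparison above is not shown; presumably it is along these lines (a sketch):
# Coefficients of the best model found in (d), next to the true coefficient vector B.
coef(regfit.full, which.min(test.mse))
B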
(g)
# Loop to calculate the root squared errors between the true and estimated coefficients.
# Give B names matching the data frame columns so that it can be indexed by the
# names of the fitted coefficients.
names(B) = paste0("X", 1:p)
coef.err = rep(NA,20)
for (i in 1:20){
  a = coef(regfit.full, i)
  coef.err[i] = sqrt(sum((a[-1] - B[names(a)[-1]])^2))
}
[Figure: Coefficient Error v Number of Variables.]
which.min(coef.err)
## [1] 13
• The chart starts in a disjointed manner before the coefficient errors begin to fall rapidly. Eventually,
it does show a minimum at the same model size as the test MSE. However, when using a different
random seed, the coefficient-error chart does not always reach its minimum at the same model size as
the test MSE chart. Therefore, a model that minimizes the coefficient error does not necessarily
give the lowest test MSE.
11. (a)
set.seed(121)
k = 10
folds = sample(1:k, nrow(Boston), replace=TRUE)
cv.errors = matrix(NA,k,13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred = predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] = mean((Boston$crim[folds == j] - pred)^2)
}
}
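predict() has no built-in method for regsubsets objects, so the loop above relies on a helper that is not shown in the extract; the usual ISLR-lab style definition is sketched below, along with the averaging of the fold errors that produces the plot that follows (mean.cv.errors is an assumed name):
# Helper: predictions from a regsubsets fit at model size id.
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
# Average the errors over the folds and plot against model size.
mean.cv.errors = apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
which.min(mean.cv.errors)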
[Figure: cross-validation error against the number of variables.]
x = model.matrix(crim~.,Boston)[,-1]
y = Boston$crim
grid = 10^seq(10,-2,length=100)
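The training index train used below is not defined in the extract; a sketch of a typical half/half split (the proportion is an assumption):
# Training indices for the Boston data.
train = sample(1:nrow(x), nrow(x)/2)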
test = (-train)
y.test = y[test]
set.seed(121)
cv.out = cv.glmnet(x[train,], y[train], alpha=1)
bestlam = cv.out$lambda.min
lasso.mod = glmnet(x[train,],y[train],alpha=1,lambda=grid)
lasso.pred = predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 31.61342
• Test MSE of 31.6. With only age and tax having coefficients of exactly zero, we have a best model with 10 variables.
## [1] 31.87334
• The lambda chosen by cross-validation is close to zero, so both the ridge regression and lasso test MSEs are
similar to that provided by least squares.
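The ridge regression code that produces the 31.87 figure above is not shown in the extract; a sketch mirroring the lasso code (the object names cv.ridge and ridge.mod are assumptions):
# Ridge regression on the Boston data: lambda by cross-validation, then test MSE.
cv.ridge = cv.glmnet(x[train,], y[train], alpha = 0)
ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid)
ridge.pred = predict(ridge.mod, s = cv.ridge$lambda.min, newx = x[test,])
mean((ridge.pred - y.test)^2)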
# Using PCR
pcr.fit = pcr(crim~., data=Boston, subset=train, scale=T, validation="CV")
validationplot(pcr.fit, val.type="MSEP")
[Figure: PCR validation plot, MSEP against the number of components (response: crim).]
## [1] 33.74218
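The prediction step that yields the test MSE above is not shown, and the number of components used is not stated in the extract; a sketch with ncomp = 13 (all components) purely as an illustration:
# Test MSE for PCR; the ncomp value actually used is not shown in the extract.
pcr.pred = predict(pcr.fit, Boston[test, ], ncomp = 13)
mean((pcr.pred - y.test)^2)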
(b) (c)
• I would choose the lasso model, as it gives the lowest test MSE.
• Lasso models are generally more interpretable.
• It results in a sparse model with 10 variables. Two variables whose effect on the response was below
the required threshold were removed.