Linear Model Selection and Regularization
library(ISLR)
library(MASS)
library(leaps)
library(glmnet)
library(pls)
library(caTools)
Conceptual
1. (a)
• Best subset selection has the lowest training RSS, as it fits models for every possible combination of
predictors. When p is very large, this increases the chance of finding models that fit the training data
very well.
(b)
• The best test RSS could be provided by any of the three approaches. Best subset considers more models
than the other two, and the best model found on the training set could also be the best one for a test set.
Forward and backward stepwise consider far fewer models, but they might still find a model that fits the
test set very well, as they tend to avoid overfitting compared with best subset.
(c)
• True; the (k+1)-variable model is derived by adding the predictor that gives the greatest improvement
to the k-variable model.
• True; the k-variable model is derived by removing the least useful predictor from the (k+1)-variable model.
• False; forward and backward stepwise can select different predictors.
• False; forward and backward stepwise can select different predictors.
2. (a)
(iii) Less flexible, and it will give improved prediction accuracy when its increase in bias is less than its
decrease in variance. As lambda increases, the flexibility of the fit decreases and the estimated coefficients
shrink, with some becoming exactly zero. This leads to a substantial decrease in the variance of the
predictions for a small increase in bias.
(b)
(c)
(ii) Non-linear methods will generally be more flexible, so their predictions tend to have higher variance and
lower bias. Predictions will therefore improve if the increase in variance is smaller than the decrease in bias
(the bias-variance trade-off).
3. (a)
• (i) Decrease steadily. As s increases, the constraint on beta relaxes and the training RSS decreases until we
reach the least squares solution.
(b)
• (ii) Decreases initially as the RSS is reduced from its maximum value (when beta = 0) as we move towards
the values of beta that best fit the training data. It eventually starts increasing as the beta values fit the
training set extremely well, overfitting it, and so the test RSS rises.
(c)
(d)
(e)
4. (a)
• (iii) Steadily increase. As lambda increases, the penalty on beta also increases, so beta becomes
progressively smaller. As such, the training RSS steadily increases towards its maximum value
(when beta ≈ 0 at very large lambda values).
(b)
• (ii) Decreases initially and then starts increasing, in a U shape. When lambda = 0, we get the least squares
fit, which has high variance and low bias. As lambda increases, the flexibility of the fit decreases,
reducing the variance of the predictions for a small increase in bias. This results in more accurate
predictions, so the test RSS decreases initially. As lambda increases beyond the ideal point, the
increase in bias outweighs the reduction in variance and the predictions become more biased.
Consequently, the test RSS starts to rise.
(c)
(d)
(e)
6. (a)
# Chart of Ridge Regression Eqn (6.12) against beta.
y = 10
beta = seq(-10, 10, 0.1)
lambda = 5
# Eqn (6.12) output for a single observation with no intercept.
eqn1 = (y - beta)^2 + lambda*beta^2
# Grid index of the minimum, and the analytical minimizer y/(1 + lambda).
which.min(eqn1)
## [1] 118
est.beta = y/(1 + lambda)
est.value = (y - est.beta)^2 + lambda*est.beta^2
# Plot
plot(beta, eqn1, main="Ridge Regression Optimization", xlab="beta", ylab="Ridge Eqn output", type="l")
points(beta[118], eqn1[118], col="red", pch=24, type="p")
points(est.beta, est.value, col="blue", pch=20, type="p")
[Figure: Ridge Regression Optimization, Eqn (6.12) output against beta with the minimum marked.]
• For y = 10 and lambda = 5, beta = 10/(1 + 5) ≈ 1.67 minimizes the ridge regression equation.
• As can also be seen on the graph, the minimum is at beta ≈ 1.67.
(b)
# Chart of Lasso Eqn (6.13) against beta.
# Eqn (6.13) output
eqn2 = (y - beta)^2 + lambda*(abs(beta))
# Analytical minimizer for y > lambda/2: beta = y - lambda/2.
est.beta2 = y - lambda/2
est.value2 = (y - est.beta2)^2 + lambda*abs(est.beta2)
# Plot
plot(beta, eqn2, main="Lasso Optimization", xlab="beta", ylab="Eqn 6.13 output", type="l")
points(est.beta2, est.value2, col="red", pch=20, type="p")
[Figure: Lasso Optimization, Eqn (6.13) output against beta with the minimum marked.]
• For y = 10 and lambda = 5, beta = 10 - (5/2) = 7.5 minimizes the lasso equation.
• As can also be seen on the graph, the minimum is at beta = 7.5.
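As a quick sanity check on both minimizers, base R's optimize() can minimize each criterion numerically. This snippet is an illustration rather than part of the original solution; the search interval is arbitrary.
# Numerical check of the ridge and lasso minimizers for y = 10, lambda = 5.
y = 10; lambda = 5
# Ridge criterion (6.12): minimum should be at y/(1 + lambda) = 10/6, i.e. about 1.67.
optimize(function(b) (y - b)^2 + lambda*b^2, interval = c(-10, 10))$minimum
# Lasso criterion (6.13): minimum should be at y - lambda/2 = 7.5 (since y > lambda/2).
optimize(function(b) (y - b)^2 + lambda*abs(b), interval = c(-10, 10))$minimum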
Applied
8. (a) (b)
set.seed(1)
X = c(rnorm(100))
e = rnorm(100, mean=0, sd=0.25)
b0 = 50 ; b1 = 6 ; b2 = 3 ; b3 = 1.5
Y = c(b0 + b1*X + b2*X^2 + b3*X^3 + e)
(c)
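The code that builds the polynomial data frame and runs best subset selection is not shown in this extract, so the reg.summary object below would otherwise be undefined. A minimal sketch of the presumed setup, mirroring the df2/regfit.full2 code used in part (f); the names df and regfit.full are assumptions:
# Data frame with X, X^2, ..., X^10 and the response Y, then best subset selection.
df = data.frame(X, X^2, X^3, X^4, X^5, X^6, X^7, X^8, X^9, X^10, Y)
regfit.full = regsubsets(Y ~ ., data = df, nvmax = 10)
reg.summary = summary(regfit.full)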
reg.summary$cp
reg.summary$bic
reg.summary$adjr2
• Cp falls substantially from the one- and two-variable models to the three-variable model. It decreases
slightly for the four-variable model and rises in small increments thereafter.
• The BIC value is lowest for the three-variable model.
• Adjusted R2 increases to 0.999 for the three-variable model from the two-variable value of 0.95. The
metric does not change much for the larger models.
• These statistical metrics point to the three-variable model (with the squared and cubed terms) as the
best choice. We can confirm this visually in the charts below.
[Figures: Cp, BIC and Adjusted R2 plotted against the number of variables.]
(d)
## 3  ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 4  ( 1 ) "*" "*" "*" " " "*" " " " " " " " " " "
## 5  ( 1 ) "*" "*" "*" " " "*" "*" " " " " " " " "
## 6  ( 1 ) "*" "*" "*" " " "*" "*" " " " " "*" " "
## 7  ( 1 ) "*" "*" "*" " " "*" "*" "*" " " "*" " "
## 8  ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " "
## 9  ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
regfwd.summary$cp
regfwd.summary$bic
regfwd.summary$adjr2
• The statistical metrics are very similar to those for best subset selection.
## 4  ( 1 ) "*" "*" "*" " " " " " " " " " " "*" " "
## 5  ( 1 ) "*" "*" "*" " " " " " " " " "*" "*" " "
## 6  ( 1 ) "*" "*" "*" " " " " " " " " "*" "*" "*"
## 7  ( 1 ) "*" "*" "*" " " " " "*" " " "*" "*" "*"
## 8  ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" "*"
## 9  ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
regbwd.summary$cp
regbwd.summary$bic
regbwd.summary$adjr2
• Results are very similar to best subset and forward selection. As expected, all these metrics show that
the three-variable model with the squared and cubed terms is the best.
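The forward and backward stepwise fits that produce regfwd.summary and regbwd.summary above are not shown in this extract. A minimal sketch of the presumed calls; the object names regfit.fwd and regfit.bwd are assumptions:
# Forward and backward stepwise selection on the same polynomial data frame.
regfit.fwd = regsubsets(Y ~ ., data = df, nvmax = 10, method = "forward")
regfwd.summary = summary(regfit.fwd)
regfit.bwd = regsubsets(Y ~ ., data = df, nvmax = 10, method = "backward")
regbwd.summary = summary(regfit.bwd)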
(e)
# Lasso model.
lasso = glmnet(x[train,], y[train], alpha=1)
[Figure: cross-validation mean-squared error against log(λ) for the lasso, with the number of non-zero coefficients along the top.]
bestlam = cv.out$lambda.min
• The lasso produces a sparse model with four variables. The intercept and the coefficients for X, X^2
and X^3 closely match the ones chosen in 8(b), whilst the coefficient for X^4 is very small. Therefore, this
model provides an accurate estimate of the response Y.
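The construction of x, y and train for this part, the cross-validation object cv.out, and the extraction of the lasso coefficients discussed above are not shown in the extract. A sketch under assumed names, following the glmnet pattern used elsewhere in this document; the half/half split is an assumption:
# Predictor matrix, response and a training split for the polynomial data.
x = model.matrix(Y ~ ., df)[, -1]
y = df$Y
train = sample(1:nrow(x), nrow(x)/2)   # split proportion is an assumption
test = (-train)
# Cross-validate the lasso, then inspect the coefficients at the best lambda.
cv.out = cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)
bestlam = cv.out$lambda.min
predict(lasso, type = "coefficients", s = bestlam)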
(f)
b7 = 5
Y2 = c(b0 + b7*X^7 + e)
df2 = data.frame(X,X^2,X^3,X^4,X^5,X^6,X^7,X^8,X^9,X^10,Y2)
# Best subset selection.
regfit.full2 = regsubsets(Y2~.,data=df2,nvmax=10)
reg.summary = summary(regfit.full2)
reg.summary
reg.summary$cp
reg.summary$bic
reg.summary$adjr2
• Adjusted R2 is 0.999+ for the one-variable model.
• These statistical metrics point to the one-variable model with the X^7 term as the best choice.
# Lasso model.
# Matrix of predictors for a response vector y.
x2 = model.matrix(Y2~.,df2)[,-1]
y2 = df2$Y2
## [1] 0.1960455
• The lasso with the best value of lambda results in a sparse model with one variable. It assigns a
non-zero coefficient to the variable (X^7) that best explains the response Y, and assigns zero to the
rest.
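The cross-validation and coefficient-extraction code behind this statement is not shown; whether a training split was used is also not shown, so the sketch below uses the full data for simplicity. The names cv.out2, bestlam2 and lasso2 are assumptions:
# Lasso on the X^7 data: choose lambda by cross-validation, then inspect the
# coefficients at that lambda.
cv.out2 = cv.glmnet(x2, y2, alpha = 1)
bestlam2 = cv.out2$lambda.min
bestlam2
lasso2 = glmnet(x2, y2, alpha = 1)
predict(lasso2, type = "coefficients", s = bestlam2)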
9. (a) (b)
set.seed(2)
x = model.matrix(Apps~.,College)[,-1]
y = College$Apps
grid = 10^seq(10,-2,length=100)
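The training/test split used throughout question 9 is not shown in the extract. A sketch of a typical half/half split (the proportion and the name train are assumptions; test and y.test follow the pattern used in question 11):
# Split the College data into training and test sets.
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
y.test = y[test]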
# Linear model using least squares (ridge regression with lambda=0) and test MSE.
linear.model = glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh=1e-12)
linear.pred = predict(linear.model, s=0, newx=x[test,],exact=T,x=x[train,],y=y[train])
mean((linear.pred-y.test)^2)
## [1] 1367387
# Dataframes for use with lm() function.
train.df = data.frame(College[train,])
test.df = data.frame(College[test,])
# Least squares fit with lm() and its test MSE.
lm.fit = lm(Apps ~ ., data=train.df)
lm.pred = predict(lm.fit, newdata=test.df)
mean((lm.pred - y.test)^2)
## [1] 1367393
• The MSEs from ridge regression with lambda = 0 and from the lm() function are reasonably similar; the
slight discrepancy is likely due to the approximation used by glmnet().
(c)
## [1] 1355876
• Best value of lambda is 387.9 and the test MSE is slightly lower than for least squares.
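The ridge regression code behind these figures is not shown in the extract. A sketch following the cv.glmnet pattern used for the lasso in part (d); the names cv.out and ridge.mod are assumptions, and ridge.pred is reused in part (g):
# Ridge regression: choose lambda by cross-validation, then compute the test MSE.
cv.out = cv.glmnet(x[train,], y[train], alpha = 0)
bestlam = cv.out$lambda.min
bestlam
ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid)
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2)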
(d)
lasso.mod = glmnet(x[train,],y[train],alpha=1,lambda=grid)
lasso.pred = predict(lasso.mod, s=bestlam, newx=x[test,])
err.lasso = mean((lasso.pred-y.test)^2)
err.lasso
## [1] 1357174
## [1] 18
lasso.coef
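lasso.coef itself is not defined in the extract; presumably it holds the lasso coefficients at the chosen lambda, along these lines (a sketch):
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam)
sum(lasso.coef != 0)   # number of non-zero coefficients, including the intercept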
(e)
[Figure: PCR validation plot, MSEP against the number of components (response: Apps).]
#summary(pcr.fit)
• From the graph we can observe that the MSE falls rapidly as M increases and is lowest at M=16.
However, the reduction from M=5 to M=16 is small compared with the reduction from M=1 to M=5.
• The test MSEs for both M values are shown below.
## [1] 1945054
## [1] 1230277
• The test MSE for M=5 is 1945054, which is much larger than for least squares.
• The best MSE is achieved at M=16, which gives a test MSE reasonably lower than least squares.
• Using M=17 gives the same result as least squares.
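The pcr() fit and the predictions that produce the numbers above are not shown in this extract. A sketch following the same pattern as the Boston PCR fit in question 11; pcr.pred is reused in part (g), while pcr.pred5 is an assumed name:
# Principal components regression with cross-validation on the training data,
# then test MSE at M = 5 and M = 16.
pcr.fit = pcr(Apps ~ ., data = College, subset = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred5 = predict(pcr.fit, x[test,], ncomp = 5)
mean((pcr.pred5 - y.test)^2)
pcr.pred = predict(pcr.fit, x[test,], ncomp = 16)
mean((pcr.pred - y.test)^2)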
(f)
[Figure: PLS validation plot, MSEP against the number of components (response: Apps).]
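The plsr() fit that produces the validation plot above and the pls.fit object below is not shown in the extract; a minimal sketch under the same assumptions as the PCR fit:
# Partial least squares with cross-validation on the training data.
pls.fit = plsr(Apps ~ ., data = College, subset = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")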
pls.pred = predict(pls.fit, x[test,], ncomp=8)
err.pls = mean((pls.pred-y.test)^2)
err.pls
## [1] 1298214
• The test MSE is similar to PCR and slightly lower than least squares.
(g)
[Figure: barplot comparing the test errors of the models.]
• All the models give reasonably similar results, with PCR and PLS giving slightly lower test MSEs.
test.avg = mean(y.test)
lm.r2 = 1 - mean((lm.pred - y.test)^2) / mean((test.avg - y.test)^2)
ridge.r2 = 1 - mean((ridge.pred - y.test)^2) / mean((test.avg - y.test)^2)
lasso.r2 = 1 - mean((lasso.pred - y.test)^2) / mean((test.avg - y.test)^2)
pcr.r2 = 1 - mean((pcr.pred - y.test)^2) / mean((test.avg - y.test)^2)
pls.r2 = 1 - mean((pls.pred - y.test)^2) / mean((test.avg - y.test)^2)
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), xlab="Models", ylab="R2",
names=c("lm", "ridge", "lasso", "pcr", "pls"))
[Figure: barplot of test R2 for each model.]
• Every model has an R2 of around 0.85 or above, so we can be reasonably confident about the
accuracy of the predictions.
10. (a)
n = 1000
p = 20
X = matrix(rnorm(n*p), n, p)
#B = rnorm(p)
B = sample(-10:10, 20)
B[1] = 0
B[4] = 0
B[7] = 0
B[11] = 0
B[15] = 0
B[19] = 0
e = rnorm(1000, mean=0, sd=0.1)
Y = X%*%B + e # %*% does matrix multiplication
df = data.frame(X, Y)
(b)
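The code for part (b), splitting the data into training and test sets, is not shown in the extract. The exercise asks for 100 training and 900 test observations; a sketch using the data-frame names train and test that the later code relies on (train.idx is an assumed name):
# Split into 100 training and 900 test observations.
train.idx = sample(1:n, 100)
train = df[train.idx, ]
test = df[-train.idx, ]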
(c)
[Figure: training MSE against the number of variables.]
• Minimum training MSE is at maximum number of variables: 20.
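The best subset fit and training-MSE computation for part (c) are not shown in the extract; a sketch (regfit.full is the name the later code uses, train.mse is an assumption):
# Best subset selection on the training set; training MSE at each model size.
regfit.full = regsubsets(Y ~ ., data = train, nvmax = 20)
train.mse = summary(regfit.full)$rss / nrow(train)
plot(1:20, train.mse, xlab = "Variables", ylab = "Training MSE", type = "b")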
(d)
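The loop below uses lm.regsubsets(), which is not a function from the leaps package and whose definition is not shown in the extract. It appears to be a helper that refits an ordinary lm() on the variables chosen at each model size; a sketch of one possible implementation (the details, including the default data argument, are assumptions):
lm.regsubsets = function(object, id, data = train) {
  # Refit lm() using the predictors selected by regsubsets() at model size id.
  vars = names(coef(object, id = id))[-1]
  form = as.formula(paste("Y ~", paste(vars, collapse = " + ")))
  lm(form, data = data)
}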
test.mse = rep(NA, 20)
for(i in 1:20){
  model = lm.regsubsets(regfit.full, i)
  model.pred = predict(model, newdata=test, type=c("response"))
  test.mse[i] = mean((test$Y - model.pred)^2)
}
# Plot
plot(1:20,test.mse,xlab = "Variables",ylab = "Test MSE",
main = "Test MSE v Number of Variables", pch = 1, type = "b")
[Figure: Test MSE v Number of Variables.]
• Test MSE decreases rapidly as the number of variables increases, but the minimum is not at the maximum
number of variables.
• The minimum test MSE occurs at 13 variables.
(e)
• The minimum test MSE occurs for a model with 13 variables. The test MSE decreases rapidly until it
reaches the minimum and then starts to rise thereafter.
• As the model flexibility increases, it is better able to fit the data set. This results in the test MSE
decreasing rapidly until it reaches a minimum. Thereafter, further increases in model flexibility cause
overfitting and hence an increase in the test MSE.
(f)
## (Intercept) X2 X3 X5 X6
## -0.001334982 -6.003548929 -7.008083537 -10.009342703 1.999414193
## X8 X9 X10 X13 X14
## 6.978017828 -2.989193466 4.997145917 6.005053131 3.973003175
## X16 X17 X18 X20
## 8.007345963 -7.999568356 9.993296372 3.005914845
## [1] 0 -6 -7 0 -10 2 0 7 -3 5 0 0 6 4 0 8 -8 10 0
## [20] 3
• The best model's variables exactly match the 13 non-zero variables from the original model, and their
respective coefficients are very similar.
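The code that prints the comparison above is not shown; presumably it is along these lines (a sketch):
# Coefficients of the best model found in (d), next to the true coefficient vector B.
coef(regfit.full, which.min(test.mse))
B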
(g)
# Loop to calculate the root squared errors between the true and estimated coefficients.
# Give B names matching the data frame columns so that it can be indexed by the
# names of the fitted coefficients.
names(B) = paste0("X", 1:p)
coef.err = rep(NA,20)
for (i in 1:20){
  a = coef(regfit.full, i)
  coef.err[i] = sqrt(sum((a[-1] - B[names(a)[-1]])^2))
}
[Figure: Coefficient Error v Number of Variables.]
which.min(coef.err)
## [1] 13
• The chart starts in a disjointed manner before the coefficient errors begin to fall rapidly. Eventually,
it does show a minimum at the same model size as the test MSE. However, when using a different
random seed, the coefficient-error chart does not always reach its minimum at the same model size as
the test MSE chart. Therefore, a model that minimizes the coefficient error does not necessarily
give the lowest test MSE.
11. (a)
set.seed(121)
k = 10
folds = sample(1:k, nrow(Boston), replace=TRUE)
cv.errors = matrix(NA,k,13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred = predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] = mean((Boston$crim[folds == j] - pred)^2)
}
}
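predict() has no built-in method for regsubsets objects, so the loop above relies on a helper that is not shown in the extract; the usual ISLR-lab style definition is sketched below, along with the averaging of the fold errors that produces the plot that follows (mean.cv.errors is an assumed name):
# Helper: predictions from a regsubsets fit at model size id.
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
# Average the errors over the folds and plot against model size.
mean.cv.errors = apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
which.min(mean.cv.errors)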
[Figure: cross-validation error against the number of variables.]
x = model.matrix(crim~.,Boston)[,-1]
y = Boston$crim
grid = 10^seq(10,-2,length=100)
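The training index train used below is not defined in the extract; a sketch of a typical half/half split (the proportion is an assumption):
# Training indices for the Boston data.
train = sample(1:nrow(x), nrow(x)/2)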
test = (-train)
y.test = y[test]
set.seed(121)
cv.out = cv.glmnet(x[train,], y[train], alpha=1)
bestlam = cv.out$lambda.min
lasso.mod = glmnet(x[train,],y[train],alpha=1,lambda=grid)
lasso.pred = predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 31.61342
• Test MSE of 31.6. With only age and tax having coefficients of exactly zero, we have a best model with 10 variables.
## [1] 31.87334
• The lambda chosen by cross-validation is close to zero, so both the ridge regression and lasso test MSEs are
similar to that provided by least squares.
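The ridge regression code that produces the 31.87 figure above is not shown in the extract; a sketch mirroring the lasso code (the object names cv.ridge and ridge.mod are assumptions):
# Ridge regression on the Boston data: lambda by cross-validation, then test MSE.
cv.ridge = cv.glmnet(x[train,], y[train], alpha = 0)
ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid)
ridge.pred = predict(ridge.mod, s = cv.ridge$lambda.min, newx = x[test,])
mean((ridge.pred - y.test)^2)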
# Using PCR
pcr.fit = pcr(crim~., data=Boston, subset=train, scale=T, validation="CV")
validationplot(pcr.fit, val.type="MSEP")
[Figure: PCR validation plot, MSEP against the number of components (response: crim).]
## [1] 33.74218
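The prediction step that yields the test MSE above is not shown, and the number of components used is not stated in the extract; a sketch with ncomp = 13 (all components) purely as an illustration:
# Test MSE for PCR; the ncomp value actually used is not shown in the extract.
pcr.pred = predict(pcr.fit, Boston[test, ], ncomp = 13)
mean((pcr.pred - y.test)^2)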
(b) (c)
• I would choose the lasso model, as it gives the lowest test MSE.
• Lasso models are generally more interpretable.
• It results in a sparse model with 10 variables. Two variables whose effect on the response was below
the required threshold were removed.