
Statistical learning

Assignment 3
Jakub Skalski

March 11, 2025

1. The trace of a symmetric real matrix is the sum of its eigenvalues


We can easily show this by first decomposing a real symmetric matrix A into its eigenvectors
and eigenvalues, A = PΛP^T, and then using the cyclic property of the trace:

tr(A) = tr(PΛP^T) = tr(P^T PΛ) = tr(Λ) = Σ_i λ_i
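As a quick numerical sanity check, the identity can be verified in R on an arbitrary symmetric matrix (the matrix below is our own example):

set.seed(1)
B <- matrix(rnorm(25), 5, 5)
A <- (B + t(B)) / 2                                               # symmetrize to get a real symmetric matrix
all.equal(sum(diag(A)), sum(eigen(A, symmetric = TRUE)$values))   # TRUE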

2. Properties of A = X T X
2.1 Positive semi-definite
We say that a matrix A is positive semi-definite if it satisfies v^T A v ≥ 0 for all vectors
v ∈ R^n. Consider any vector v ∈ R^n; then:

v^T A v = v^T X^T X v = (Xv)^T (Xv) = ∥Xv∥^2 ≥ 0

2.2 Non-negative eigenvalues


Consider any eigenvector v ∈ R^n of A with eigenvalue λ. Since A is positive semi-definite:

v^T A v = v^T (λv) = λ v^T v ≥ 0

Since v ≠ 0, v^T v is strictly positive, therefore λ ≥ 0.

2.3 At least one eigenvalue is zero when p > n


A matrix is singular if and only if it is not of full rank, i.e. rank(A) < p.
Note that rank(A) is bounded from above by the rank of X^T:

rank(A) = rank(X^T X) ≤ rank(X^T) ≤ n

From the assumption p > n it follows that rank(A) ≤ n < p, so A is singular and at least one of its eigenvalues must be zero.
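A small numerical illustration in R (the dimensions below are chosen arbitrarily for the sketch):

set.seed(1)
n <- 5; p <- 8
X <- matrix(rnorm(n * p), n, p)
round(eigen(crossprod(X), symmetric = TRUE)$values, 10)   # the last p - n eigenvalues are numerically zero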


3. Model selection
The data consists of 100 observations of 10 variables. We fit 10 regression models,
where the k-th model includes only the first k variables. The residual sums of squares for
these 10 consecutive models are equal to (1731, 730, 49, 38.9, 32, 29, 28.5, 27.8, 27.6,
26.6). Let us consider different criteria for model selection under the assumption of a
standard normal error term (σ^2 = 1).

3.1 Akaike Information Criterion

Table 1: Values of the criterion AIC = RSS + 2kσ^2 for each model

k 1 2 3 4 5 6 7 8 9 10
AIC 1733 734 55 46.9 42 41 42.5 43.8 45.6 46.6

3.2 Bayesian Information Criterion

Table 2: Values of the criterion BIC = RSS + k log(n) σ^2 for each model

k 1 2 3 4 5 6 7 8 9 10
≈ BIC 1735 739 63 57 55 57 61 65 69 72

3.3 Risk Inflation Criterion

Table 3: Values of the criterion RIC = RSS + 2k log(p) σ^2 for each model

k 1 2 3 4 5 6 7 8 9 10
≈ RIC 1735 739 63 57 55 57 61 65 69 72
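Note that with n = 100 and p = 10 we have log(n) = 2 log(p), so the BIC and RIC penalties coincide here and Tables 2 and 3 are identical. The tables can be reproduced with a few lines of R (a sketch, assuming σ^2 = 1 as above):

rss <- c(1731, 730, 49, 38.9, 32, 29, 28.5, 27.8, 27.6, 26.6)
k <- seq_along(rss); n <- 100; p <- 10; sigma2 <- 1
aic <- rss + 2 * k * sigma2                  # Table 1
bic <- rss + k * log(n) * sigma2             # Table 2
ric <- rss + 2 * k * log(p) * sigma2         # Table 3 (equal to BIC here)
sapply(list(AIC = aic, BIC = bic, RIC = ric), which.min)   # AIC picks k = 6, BIC and RIC pick k = 5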


4. False discovery rate


Assuming an orthogonal design (X^T X = I) and n = p = 10000, we calculate the
expected number of false discoveries for AIC, BIC and RIC, using the fact that β̂_i ∼ N(0, σ^2) when
β_i = 0.

4.1 Akaike Information Criterion



AIC selects variables satisfying |β̂_i| ≥ √2 σ, so the probability of a type I error is:

P(X_i selected | β_i = 0) = 2(1 − Φ(√2)) ≈ 0.16

The expected number of false discoveries for this criterion is 0.16 · 10000 = 1600.

4.2 Bayesian Information Criterion



BIC selects variables satisfying |β̂_i| ≥ √(log n) σ, so the probability of a type I error is:

P(X_i selected | β_i = 0) = 2(1 − Φ(√(log n))) ≈ 0.0024

The expected number of false discoveries for this criterion is 0.0024 · 10000 = 24.

4.3 Risk Inflation Criterion



RIC selects variables satisfying |β̂_i| ≥ √(2 log p) σ, so the probability of a type I error is:

P(X_i selected | β_i = 0) = 2(1 − Φ(√(2 log p))) ≈ 0.00002

The expected number of false discoveries for this criterion is 0.00002 · 10000 = 0.2.
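These counts can be checked directly in R (a short sketch with σ = 1; the exact values are approximately 1573, 24 and 0.2, with the 1600 above coming from rounding the probability to 0.16):

n <- p <- 10000; sigma <- 1
thresholds <- c(AIC = sqrt(2), BIC = sqrt(log(n)), RIC = sqrt(2 * log(p))) * sigma
alpha <- 2 * (1 - pnorm(thresholds / sigma))   # P(|beta_hat_i| > threshold | beta_i = 0)
round(alpha * p, 1)                            # expected numbers of false discoveries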

5. Choosing the right criterion


5.1 Akaike Information Criterion
AIC is preferable when the primary aim is to select a model with the best predictive
performance rather than to identify the true underlying model, since minimizing AIC
coincides with minimizing an estimate of the prediction error.

5.2 Bayesian Information Criterion


BIC is preferable when one of the candidate models is believed to be the true model and the sample
size is relatively large, which makes it well suited to explanatory analysis.

5.3 Risk Inflation Criterion


RIC is mostly useful in high-dimensional settings, such as variable selection in regression
models with a large number of predictors.


6. Ridge regression formulas


The multiple regression model describes the relationship between a dependent
variable and multiple independent variables. In matrix notation it can be formulated
as follows:

Y = Xβ + ϵ, ϵ ∼ N (0, σ 2 I)

The general ridge regression solution can be obtained analytically by solving for the zero
gradient of this convex function:

∇[(Y − Xβ)T (Y − Xβ) + λβ T β] = 2X T Xβ − 2X T Y + 2λβ = 0

The closed form general solution is thus:

β̂ = (X T X + λI)−1 X T Y

Under orthogonal design this simplifies to:


β̂ = Λ X^T Y,  where Λ := 1/(1 + λ)
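For reference, a minimal R sketch of the closed-form solution (the function name and arguments are ours; X, Y and lambda are assumed given):

ridge_closed_form <- function(X, Y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))   # (X^T X + lambda I)^(-1) X^T Y
}
# Under an orthogonal design (t(X) %*% X == I) this reduces to crossprod(X, Y) / (1 + lambda).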

6.1 Bias
First, let us compute the expectation of β̂:

E[β̂] = ΛX T E[Y ] = ΛX T Xβ = Λβ

Then the ridge regression bias is equal to the following:


E[β̂] − β = Λβ − β = −(λ/(1 + λ)) β

which, in the case of OLS (λ = 0), is zero.

6.2 Variance
The ridge regression estimator variance is the following:
V[β̂] = V[Λ X^T Y] = V[Λ X^T (Xβ + ϵ)] = V[Λ X^T ϵ] = Λ X^T (σ^2 I) X Λ = σ^2 Λ^2 I

so each coordinate of β̂ has variance σ^2/(1 + λ)^2, which for OLS is simply σ^2.

6.3 Mean squared error


E[(β̂ − β)^T (β̂ − β)] = tr(V[β̂]) + ∥E[β̂] − β∥^2 = p σ^2/(1 + λ)^2 + λ^2 Λ^2 β^T β = (p σ^2 + λ^2 β^T β)/(1 + λ)^2

which reduces to p σ^2 for OLS.
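A quick simulation agrees with the bias and mean squared error formulas above (a sketch under an orthonormal design; β, σ, λ and the number of replications are arbitrary choices of ours):

set.seed(1)
n <- 200; p <- 5; sigma <- 1; lambda <- 2
beta <- c(3, 1.5, 0, 0, -2)
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))          # orthonormal columns: t(X) %*% X = I
est <- replicate(20000, drop(crossprod(X, X %*% beta + rnorm(n, 0, sigma))) / (1 + lambda))
rowMeans(est) - beta            # empirical bias, close to -lambda / (1 + lambda) * beta
mean(colSums((est - beta)^2))   # empirical MSE, close to (p * sigma^2 + lambda^2 * sum(beta^2)) / (1 + lambda)^2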


7. Prediction error
Suppose that for a given data set with 40 explanatory variables the residual sums of
squares from the least squares method and the ridge regression are equal to 4.5 and
11.6, respectively. For the ridge regression the trace of M = X(X^T X + λI)^(-1) X^T is equal
to 32. We will now compute and compare the resulting prediction errors of these two
methods.

7.1 Ordinary least squares


PE_OLS = RSS + 2σ^2 p = 4.5 + 80σ^2

7.2 Ridge regression


PE_ridge = RSS + 2σ^2 Tr(M) = 11.6 + 64σ^2

7.3 Comparison
Assuming σ = 1, PE_OLS = 84.5 is greater than PE_ridge = 75.6, so the ridge fit has the smaller estimated prediction error.
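The comparison is a one-line check in R (σ = 1 plugged in as above):

sigma2 <- 1
pe_ols   <- 4.5  + 2 * sigma2 * 40    # 84.5
pe_ridge <- 11.6 + 2 * sigma2 * 32    # 75.6
pe_ols > pe_ridge                     # TRUE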

8. LASSO
8.1 False discovery rate and power
Under orthogonal design, LASSO selects variable X_i when |β̂_i| > λ, where β̂_i ∼ N(β_i, σ^2).
The probability of a false discovery is therefore P(X_i selected | β_i = 0) = 2(1 − Φ(λ/σ)), while the
power is P(X_i selected | β_i ≠ 0) = 1 − Φ((λ − β_i)/σ) + Φ((−λ − β_i)/σ).
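Both quantities are easy to tabulate as functions of λ; below is a minimal R sketch (function names are ours, and the exact two-sided power formula is used):

lasso_false_discovery_prob <- function(lambda, sigma = 1) {
  2 * (1 - pnorm(lambda / sigma))                        # P(|beta_hat_i| > lambda | beta_i = 0)
}
lasso_power <- function(lambda, beta_i, sigma = 1) {
  1 - pnorm((lambda - beta_i) / sigma) + pnorm((-lambda - beta_i) / sigma)   # P(|beta_hat_i| > lambda | beta_i)
}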

8.2 Computing adaptive LASSO


Solving the adaptive LASSO is equivalent to solving a regular LASSO on suitably scaled
input data and then rescaling the resulting estimator.

8.2.1 Explanation
Let β_j = (1/w_j) · β_j^w. Substituting into the minimization problem we obtain:

β̂^w = argmin_{β^w} { ∥y − X · diag(1/w_j) β^w∥_2^2 + λ Σ_{j=1}^p w_j · |(1/w_j) β_j^w| }


Now we can pull the factor 1/w_j into the absolute value in the penalty term, cancelling w_j,
so that the objective becomes:

β̂^w = argmin_{β^w} { ∥y − X · diag(1/w_j) β^w∥_2^2 + λ Σ_{j=1}^p |β_j^w| }

Finally, scaling the input data to account for this change we arrive at the regular LASSO
formulation we are familiar with:
 
β̂^w = argmin_{β^w} { ∥y − X^w β^w∥_2^2 + λ Σ_{j=1}^p |β_j^w| }

8.2.2 Solution

The transformed problem can then be solved with any standard LASSO solver (for example via the LARS algorithm):


1. Define x_{i,j}^w = x_{i,j} / w_j, j = 1, 2, . . . , p.

2. Solve the LASSO problem for the scaled data.

3. Output β̂_j = β̂_j^w / w_j, j = 1, 2, . . . , p.


Typically, one would simply pass the vector of weights (for instance the absolute inverses of initial
OLS estimates) as per-coefficient penalty factors to some regular LASSO solver (like glmnet, for instance).
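A possible R sketch of this recipe using glmnet (the design matrix x, response y and the OLS-based weight choice are illustrative assumptions, not part of the assignment data, and x is assumed to have fewer columns than rows):

library(glmnet)
w  <- 1 / abs(coef(lm(y ~ x))[-1])          # one common weight choice: inverse absolute OLS estimates
xw <- sweep(x, 2, w, "/")                   # step 1: scale the columns, x_j^w = x_j / w_j
fit    <- cv.glmnet(xw, y)                  # step 2: ordinary LASSO on the scaled data
beta_w <- as.numeric(coef(fit, s = "lambda.min"))[-1]
beta_adaptive <- beta_w / w                 # step 3: undo the scaling
# Equivalently, glmnet accepts per-coefficient penalties directly: cv.glmnet(x, y, penalty.factor = w)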

8.3 Closed form solution under orthogonal design


Solving LASSO is equivalent to minimizing the following objective function:
(1/2) ∥Y − Xβ∥^2 + Σ_{i=1}^p λ_i |β_i|

Expanding and rearranging the terms we obtain:


−Y^T Xβ + (1/2) ∥β∥^2 + Σ_{i=1}^p λ_i |β_i|   (dropping the constant (1/2)∥Y∥^2 and using X^T X = I)

Recall that under orthogonal design the least squares solution is β̂ = X^T Y, and so we can simplify
further:
Σ_{i=1}^p ( −β̂_i β_i + (1/2) β_i^2 + λ_i |β_i| ) = Σ_{i=1}^p Z_i
Minimizing the above amounts to minimizing each individual Z_i. Taking the subdifferential
of Z_i with respect to β_i and setting it to zero, we arrive at:

βi = β̂i − λi


which is only valid when β̂_i − λ_i is non-negative. Adjusting for the general case we obtain
the following closed-form solution:

β_i^* = sgn(β̂_i)(|β̂_i| − λ_i)_+
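In R, this soft-thresholding rule is a one-liner (a sketch, vectorized over coordinates; the example values are ours):

soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}
soft_threshold(c(-3, 0.4, 2.5), lambda = 1)   # -2.0  0.0  1.5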

8.4 Relation to the ordinary least squares method


One could obtain weights for each variable from an OLS fit along with the estimator and
use them for the adaptive LASSO. For instance, suppose that the ordinary least squares
weight is w_1 = 1/4 and that the OLS estimator of β_1 under the orthogonal design is equal to 3. Also,
the regular LASSO estimator of this parameter is equal to 2. We can compute the tuning
parameter first:
2 = sgn(3)(3 − λ)+ =⇒ λ = 1
Then, the adaptive LASSO estimator would be:
sgn(β̂_1)(|β̂_1| − λ w_1)_+ = 3 − 1/4 = 11/4

9. Project 1
9.1 James-Stein Estimators
The most common estimation approach is maximum likelihood (MLE), which in many
contexts corresponds to the sample mean. The sample mean is an unbiased estimator,
meaning its expected value is equal to the true parameter value. However, when estimating
the mean of a multivariate normal distribution in three or more dimensions, the sample
mean is not the best estimator in terms of mean squared error (Stein's paradox). The
James-Stein estimator improves upon the sample mean by "shrinking" it towards a central
point (often the origin). This shrinkage reduces the mean squared error of the estimator.

9.1.1 Shrink to zero


 
μ̂_JS = (1 − (p − 2) σ^2 / ∥x̄∥^2) x̄

9.1.2 Shrink to the common mean


 
μ̂_JS = μ_0 + (1 − (p − 2) σ^2 / ∥x̄ − μ_0∥^2)(x̄ − μ_0)
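Both estimators are straightforward to implement; below is a minimal R sketch assuming a known noise level sigma2 and p ≥ 3 (function names are ours):

js_shrink_to_zero <- function(xbar, sigma2) {
  p <- length(xbar)
  (1 - (p - 2) * sigma2 / sum(xbar^2)) * xbar
}
js_shrink_to_point <- function(xbar, sigma2, mu0) {
  p <- length(xbar)
  mu0 + (1 - (p - 2) * sigma2 / sum((xbar - mu0)^2)) * (xbar - mu0)
}
# e.g. shrinking towards the common mean of the coordinates: js_shrink_to_point(xbar, sigma2, mean(xbar))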

9.1.3 Experiment

Here, we test the validity of the previous statements. We compute the James-Stein
estimators on a standardized gene dataset and compare them to the classical maximum
likelihood estimator (MLE). The estimators are computed from the first 5 observations, while
the remaining 205 observations are used for validation.

Table 4: Mean squared estimation errors

estimator   MLE    JS (shrink to zero)   JS (shrink to common mean)
MSE         ≈ 84   ≈ 85                  ≈ 7

Figure 1: Estimators

9.2 Multiple Regression


Here we look into the ordinary least squares method. The generated data are sampled
from N(0, 1/1000). The response variable Y is modeled as Y = Xβ + ϵ, where the first
five elements of β are 3 and the rest are 0. The error term ϵ follows N(0, I). We fit the
model for k = 2, 5, 10, 100, 500, 950 variables.


9.2.1 Prediction Errors

For each model we compute prediction error estimators:

• PE_1 = RSS + 2p s^2, where s^2 is the estimated noise variance

• PE_2 = RSS + 2p σ^2, where σ^2 is the true noise variance

• PE_3 = Σ_{i=1}^n ((Y_i − Ŷ_i)/(1 − M_ii))^2, where M_ii are the diagonal entries of the hat matrix

Figure 2: Prediction error residuals

PE_2 is clearly the superior estimator (since it uses the true variance), while PE_3 (the
leave-one-out cross-validation error) appears to struggle with high-dimensional data.
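A sketch of how the three estimates can be computed for a given submodel (assuming a design matrix X without an intercept column, a response y, and the true σ^2 where PE_2 requires it; the function name is ours):

pe_estimates <- function(X, y, sigma2 = 1) {
  fit <- lm(y ~ X - 1)                             # fit Y = X beta + eps without an intercept
  rss <- sum(residuals(fit)^2)
  k   <- ncol(X)
  s2  <- rss / (length(y) - k)                     # estimated noise variance
  H   <- X %*% solve(crossprod(X), t(X))           # hat matrix M
  pe3 <- sum((residuals(fit) / (1 - diag(H)))^2)   # leave-one-out (PRESS) estimate
  c(PE1 = rss + 2 * k * s2, PE2 = rss + 2 * k * sigma2, PE3 = pe3)
}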


Table 5: Rounded conditional prediction error estimate averages across thirty runs

k PE PE1 PE2 PE3


2 1002 1001 1001 1001
5 1005 1006 1006 1006
10 1009 1006 1006 1006
100 1098 1099 1099 1111
500 1497 1501 1500 2006
950 1950 1899 1948 20293

The bivariate model achieves the lowest prediction error, although by a very small margin.
Using prediction error as the only metric, it would be the optimal choice.

10. Project 2
The following experiment is designed to investigate the efficacy of various regression
techniques, particularly those involving regularization, such as ridge regression, LASSO,
and SLOPE. We also compare the results with the mBIC2 model selection criterion. For
each of the described methods we calculate the squared estimation errors E_1 = ∥β̂ − β∥^2
and E_2 = ∥X(β̂ − β)∥^2, as well as the false discovery proportion (FDP) and the true positive proportion (TPP).

10.1 Multiple regression methods


10.2 Ordinary least squares estimator
 2
n
X p
X
β̂ OLS = argmin yi − β0 − xij βj 
β i=1 j=1

Computed with cv.glmnet and SLOPE by setting λ to zero.

10.2.1 Ridge estimator


  2 
Xn p
X p
X 
β̂ ridge = argmin yi − β0 − xij βj  +λ 2
βj
β 
i=1 j=1

j=1

Computed with cv.glmnet by passing the alpha = 0 parameter.

10.2.2 LASSO estimator


  2 
Xn p
X p
X 
β̂ lasso = argmin yi − β0 − xij βj  + λ |βj |
β 
i=1 j=1

j=1


When fitting a cross-validated lasso model using cv.glmnet, two lambda values are
commonly reported:

• lambda.min: The value of lambda that gives the minimum mean cross-validated
error. This is often referred to as the "best" lambda because it directly minimizes
the prediction error on the validation set.

• lambda.1se: The largest value of lambda for which the mean cross-validated error
is within one standard error of the minimum. This lambda value usually results in a
sparser model (fewer non-zero coefficients), potentially improving interpretability
and generalization by favoring simpler models.
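For illustration, the ridge and LASSO fits described above can be obtained as follows (a sketch; x and y denote the simulated design and response):

library(glmnet)
cv_ridge <- cv.glmnet(x, y, alpha = 0)          # alpha = 0 selects the ridge penalty
cv_lasso <- cv.glmnet(x, y, alpha = 1)          # alpha = 1 (the default) selects the LASSO penalty
beta_min <- coef(cv_lasso, s = "lambda.min")    # coefficients at the CV-error-minimizing lambda
beta_1se <- coef(cv_lasso, s = "lambda.1se")    # sparser coefficients within one SE of the minimum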

10.3 Results

Table 6: Rounded error averages across ten runs

model E1 E2 FDP TPP


mBIC2 22 22 0 1
lasso (min) 99 87 0.81 1
lasso (1se) 126 114 0.53 1
lasso (arg) 1479 652 0.97 1
SLOPE 18493 989 0.97 1
ridge 553 405 - -
lasso (ols) 126 114 - -
SLOPE (ols) 720 691 - -

