SL 3
Assignment 3
Jakub Skalski
2. Properties of A = X^T X
2.1 Positive semi-definite
We say that a matrix $A$ is positive semi-definite if it satisfies $v^T A v \geq 0$ for all vectors $v \in \mathbb{R}^n$. Consider any vector $v \in \mathbb{R}^n$; then
$$v^T A v = v^T X^T X v = (Xv)^T (Xv) = \|Xv\|^2 \geq 0.$$
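As a quick numerical sanity check in R (illustrative only; X here is an arbitrary random matrix, not data from the assignment), the smallest eigenvalue of A = XᵀX is non-negative up to floating-point error:

```r
# For a random X, all eigenvalues of A = t(X) %*% X are non-negative (up to ~1e-12).
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
A <- crossprod(X)                                            # t(X) %*% X
min(eigen(A, symmetric = TRUE, only.values = TRUE)$values)   # smallest eigenvalue >= 0
```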
3. Model selection
The data consist of 100 observations of 10 variables. We fit 10 regression models, where the k-th model includes only the first k variables. The residual sums of squares of these 10 consecutive models are (1731, 730, 49, 38.9, 32, 29, 28.5, 27.8, 27.6, 26.6). Let us consider different criteria for model selection under the assumption of a standard normal error term (σ² = 1).
Table 1: Values of the criterion AIC = RSS + 2kσ² for each model

  k      1      2     3    4     5    6    7     8     9     10
  AIC    1733   734   55   46.9  42   41   42.5  43.8  45.6  46.6
Table 2: Values of the criterion BIC = RSS + k·log(n)·σ² for each model

  k        1      2     3    4    5    6    7    8    9    10
  BIC ≈    1735   739   63   57   55   57   61   65   69   72
Table 3: Values of the criterion RIC = RSS + 2k·log(p)·σ² for each model

  k        1      2     3    4    5    6    7    8    9    10
  RIC ≈    1735   739   63   57   55   57   61   65   69   72

Since n = 100 and p = 10, we have 2·log(p) = log(p²) = log(n), so the RIC penalty coincides exactly with the BIC penalty and the two tables are identical. The AIC is minimized at k = 6, while both BIC and RIC are minimized at k = 5.
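The criteria can be computed directly in R from the given RSS values (σ² = 1, n = 100, p = 10); the table entries above are rounded versions of these numbers:

```r
# Computing AIC, BIC and RIC from the given residual sums of squares.
rss    <- c(1731, 730, 49, 38.9, 32, 29, 28.5, 27.8, 27.6, 26.6)
k      <- 1:10
n      <- 100
p      <- 10
sigma2 <- 1
aic <- rss + 2 * k * sigma2
bic <- rss + k * log(n) * sigma2
ric <- rss + 2 * k * log(p) * sigma2
rbind(AIC = aic, BIC = bic, RIC = ric)
which.min(aic); which.min(bic); which.min(ric)   # model minimizing each criterion
```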
6. Ridge regression
Consider the linear model
$$Y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I).$$
The ridge regression estimate minimizes the convex objective $\|Y - X\beta\|^2 + \lambda\|\beta\|^2$, and its general solution can be obtained analytically by setting the gradient of this objective to zero:
$$\hat\beta = (X^TX + \lambda I)^{-1}X^TY.$$
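For completeness, the zero-gradient condition behind the formula above (a standard derivation, using the penalized objective just stated) reads:
$$\nabla_\beta\Big(\|Y - X\beta\|^2 + \lambda\|\beta\|^2\Big) = -2X^T(Y - X\beta) + 2\lambda\beta = 0 \;\;\Longrightarrow\;\; (X^TX + \lambda I)\,\hat\beta = X^TY.$$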
6.1 Bias
First, let us compute the expectation of $\hat\beta$. Write $\Lambda = (X^TX + \lambda I)^{-1}$ and assume the orthogonal design $X^TX = I$, so that $\Lambda = \frac{1}{1+\lambda}I$. Then
$$E[\hat\beta] = \Lambda X^T E[Y] = \Lambda X^T X\beta = \Lambda\beta = \frac{\beta}{1+\lambda},$$
so ridge regression is biased, with bias $E[\hat\beta] - \beta = -\frac{\lambda}{1+\lambda}\beta$, whereas OLS ($\lambda = 0$) is unbiased.
6.2 Variance
The variance of the ridge regression estimator is the following:
$$V[\hat\beta] = V[\Lambda X^TY] = V[\Lambda X^T(X\beta + \epsilon)] = V[\Lambda X^T\epsilon] = \Lambda X^T(\sigma^2 I)X\Lambda = \sigma^2\Lambda^2 = \frac{\sigma^2}{(1+\lambda)^2}\,I,$$
which for OLS ($\lambda = 0$) is simply $\sigma^2 I$.
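A small Monte Carlo sketch in R (assumed values: orthonormal design columns, σ = 1, λ = 2, and an arbitrary β) that checks both results numerically:

```r
# Under an orthogonal design, E[beta_hat] should be beta / (1 + lambda)
# and Var[beta_hat_i] should be sigma^2 / (1 + lambda)^2.
set.seed(42)
n <- 50; p <- 5; lambda <- 2; sigma <- 1
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))   # orthonormal columns: t(X) %*% X = I
beta <- c(3, 3, 0, 0, 0)
ridge <- replicate(10000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
})
rowMeans(ridge)        # close to beta / (1 + lambda)
apply(ridge, 1, var)   # close to sigma^2 / (1 + lambda)^2
```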
7. Prediction error
Suppose that for a given data set with 40 explanatory variables the residual sums of squares from the least squares method and from ridge regression are equal to 4.5 and 11.6, respectively. For the ridge regression, the trace of $X(X^TX + \lambda I)^{-1}X^T$ is equal to 32. We will now compute and compare the resulting prediction errors of the two methods.
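For reference, a worked computation under the assumption stated below (σ = 1), using the standard estimate $PE = RSS + 2\sigma^2\,\mathrm{df}$ with df equal to the number of variables (40) for OLS and to the given trace (32) for ridge regression:
$$PE_{OLS} = 4.5 + 2\cdot 40 = 84.5, \qquad PE_{ridge} = 11.6 + 2\cdot 32 = 75.6.$$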
7.3 Comparison
Assuming σ = 1, we conclude that $PE_{OLS}$ is greater than $PE_{ridge}$.
8. LASSO
8.1 False discovery rate and power
Under an orthogonal design, LASSO selects variable $X_i$ when $|\hat\beta_i| > \lambda$, where $\hat\beta_i \sim N(\beta_i, \sigma^2)$. The probability of falsely selecting a null variable is therefore
$$P(X_i \text{ selected} \mid \beta_i = 0) = 2\left(1 - \Phi\!\left(\tfrac{\lambda}{\sigma}\right)\right),$$
while the power is
$$P(X_i \text{ selected} \mid \beta_i \neq 0) = P(\hat\beta_i > \lambda) + P(\hat\beta_i < -\lambda) = 1 - \Phi\!\left(\tfrac{\lambda - \beta_i}{\sigma}\right) + \Phi\!\left(\tfrac{-\lambda - \beta_i}{\sigma}\right).$$
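A short simulation sketch in R (assumed values: σ = 1, λ = 2, signal strength βᵢ = 3) that checks both formulas by direct thresholding:

```r
# Under an orthogonal design the LASSO keeps X_i iff |beta_hat_i| > lambda,
# with beta_hat_i ~ N(beta_i, sigma^2).
set.seed(7)
sigma <- 1; lambda <- 2
beta_hat_null   <- rnorm(1e5, mean = 0, sd = sigma)
beta_hat_signal <- rnorm(1e5, mean = 3, sd = sigma)
mean(abs(beta_hat_null)   > lambda)   # ~ 2 * (1 - pnorm(lambda / sigma))
mean(abs(beta_hat_signal) > lambda)   # ~ 1 - pnorm((lambda - 3)/sigma) + pnorm((-lambda - 3)/sigma)
```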
8.2.1 Explanation
Let $\beta_j = \frac{1}{w_j}\,\beta^w_j$ and let $X^w$ denote the design matrix $X$ with its $j$-th column multiplied by $\frac{1}{w_j}$, so that $X\beta = X^w\beta^w$. Substituting into the weighted LASSO problem $\frac{1}{2}\|y - X\beta\|^2 + \lambda\sum_{j=1}^p w_j|\beta_j|$ we obtain:
$$\hat\beta^w = \arg\min_{\beta^w}\;\frac{1}{2}\left\|y - X^w\beta^w\right\|^2 + \lambda\sum_{j=1}^{p} w_j\left|\frac{1}{w_j}\,\beta^w_j\right|$$
Now, we can simply pull the factor $\frac{1}{w_j}$ out of the absolute value in the latter sum to cancel the weight $w_j$, so that it becomes:
$$\hat\beta^w = \arg\min_{\beta^w}\;\frac{1}{2}\left\|y - X^w\beta^w\right\|^2 + \lambda\sum_{j=1}^{p}\left|\beta^w_j\right|$$
Finally, since only a scaling of the input data was needed to account for this change of variables, we arrive at the regular LASSO formulation we are familiar with:
$$\hat\beta^w = \arg\min_{\beta^w}\;\frac{1}{2}\left\|y - X^w\beta^w\right\|_2^2 + \lambda\sum_{j=1}^{p}\left|\beta^w_j\right|,$$
whose solution yields the weighted LASSO estimate through $\hat\beta_j = \frac{1}{w_j}\,\hat\beta^w_j$.
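A minimal R sketch of this reparameterization, assuming hypothetical weights w and using glmnet as the plain LASSO solver (note that glmnet scales the residual sum of squares by 1/(2n), so its lambda is on a different scale than above):

```r
# Fit a plain LASSO on the rescaled design X^w (column j divided by w_j),
# then map the coefficients back via beta_j = beta^w_j / w_j.
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, 3, 3, rep(0, p - 3))
y <- drop(X %*% beta + rnorm(n))
w  <- runif(p, 0.5, 2)                       # penalty weights (assumed for illustration)
Xw <- sweep(X, 2, w, "/")                    # X^w: column j scaled by 1/w_j
fit_w <- glmnet(Xw, y, lambda = 0.1, standardize = FALSE, intercept = FALSE)
beta_hat <- as.matrix(coef(fit_w))[-1, 1] / w   # back-transform to the original parameterization
```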
8.2.2 Solution
Recall that under the orthogonal design the solution to the least squares problem is $\hat\beta = X^TY$, and so we can expand the objective and simplify further (dropping the terms that do not depend on $\beta$):
$$\sum_{i=1}^{p}\left(-\hat\beta_i\,\beta_i + \frac{1}{2}\beta_i^2 + \lambda_i|\beta_i|\right) = \sum_{i=1}^{p} Z_i$$
Minimizing the above amounts to minimizing each individual $Z_i$. Taking the subdifferential of $Z_i$ with respect to $\beta_i$ and setting it to zero, we arrive at:
$$\beta_i = \hat\beta_i - \lambda_i$$
This is only valid when $\hat\beta_i - \lambda_i \geq 0$, i.e. on the non-negative side. Adjusting for the general case by symmetry, we obtain the following closed-form (soft-thresholding) solution:
$$\hat\beta_i^{LASSO} = \operatorname{sign}(\hat\beta_i)\,\max\!\left(|\hat\beta_i| - \lambda_i,\, 0\right).$$
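The corresponding soft-thresholding operator, written as a small R helper:

```r
# Soft-thresholding operator: sign(beta_hat) * max(|beta_hat| - lambda, 0).
soft_threshold <- function(beta_hat, lambda) {
  sign(beta_hat) * pmax(abs(beta_hat) - lambda, 0)
}
soft_threshold(c(-3, -0.5, 0.2, 2.5), lambda = 1)   # -> -2.0  0.0  0.0  1.5
```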
9. Project 1
9.1 James-Stein Estimators
The most common estimation approach is the maximum likelihood estimator (MLE), which in many contexts corresponds to the sample mean. The sample mean is an unbiased estimator, meaning its expected value equals the true parameter value. However, when estimating the mean of a multivariate normal distribution in three or more dimensions, the sample mean is no longer the best estimator in terms of mean squared error (the Stein paradox). The James-Stein estimator improves upon the sample mean by "shrinking" it towards a central point (often the origin); this shrinkage reduces the mean squared error of the estimator.
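A minimal R sketch of this idea, assuming the single-observation form z ∼ N(μ, σ²I) with known σ² = 1 and using the positive-part variant of the James-Stein shrinkage towards the origin:

```r
# Positive-part James-Stein estimate: shrink the observation towards 0 by a
# data-dependent factor 1 - (p - 2) * sigma^2 / ||z||^2 (truncated at 0), p >= 3.
james_stein <- function(z, sigma2 = 1) {
  p <- length(z)
  shrink <- max(0, 1 - (p - 2) * sigma2 / sum(z^2))
  shrink * z
}
set.seed(3)
mu <- rep(0.5, 10)
z  <- rnorm(10, mean = mu, sd = 1)
sum((z - mu)^2)                 # squared error of the MLE (z itself) for this draw
sum((james_stein(z) - mu)^2)    # squared error of the James-Stein estimate for this draw
```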
9.1.3 Experiment
Here, we test the validity of the previous statements. We compute the James-Stein estimators on a standardized gene data set and compare them to the classical maximum likelihood estimator (MLE). The estimators are computed from the first 5 observations, while the remaining 205 observations are used for validation.
Figure 1: Estimators
five elements of β are equal to 3 and the rest are 0. The error term ϵ follows N(0, I). We fit the model for k = 2, 5, 10, 100, 500, 950 variables.
We consider the following prediction error estimates:
• $PE_1 = RSS + 2p\,s^2$
• $PE_2 = RSS + 2p\,\sigma^2$
• $PE_3 = \sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i}{1 - M_{ii}}\right)^2$, where $M = X(X^TX)^{-1}X^T$ is the hat matrix and $s^2$ is the estimated noise variance.
$PE_2$ is clearly the superior estimator (since it uses the true variance σ²), while $PE_3$ (the leave-one-out cross-validation error) appears to struggle for high-dimensional data.
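A small R sketch computing the three estimates above for one ordinary least squares fit (assumed toy setup: n = 100, k = 5 variables, true σ² = 1):

```r
# PE1 uses the estimated variance s^2, PE2 the true sigma^2, and PE3 is the
# leave-one-out cross-validation (PRESS) statistic based on the hat matrix M.
set.seed(11)
n <- 100; k <- 5
X <- matrix(rnorm(n * k), n, k)
y <- drop(X %*% rep(3, k) + rnorm(n))
fit <- lm(y ~ X - 1)
rss <- sum(resid(fit)^2)
s2  <- rss / (n - k)                         # estimated noise variance
M   <- X %*% solve(crossprod(X), t(X))       # hat matrix
PE1 <- rss + 2 * k * s2
PE2 <- rss + 2 * k * 1                       # true sigma^2 = 1 assumed known
PE3 <- sum((resid(fit) / (1 - diag(M)))^2)   # leave-one-out CV error
c(PE1 = PE1, PE2 = PE2, PE3 = PE3)
```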
Table 5: Rounded conditional prediction error estimate averages across thirty runs
The bivariate model achieves the best prediction error, although by a very small margin. Using prediction error as the only metric, it would therefore be the optimal choice.
10. Project 2
The following experiment is designed to investigate the efficacy of various regression
techniques, particularly those involving regularization, such as ridge regression, LASSO,
and SLOPE. We also compare the results with the mBIC2 model selection criterion. For each of the described methods we calculate the squared estimation errors $E_1 = \|\hat\beta - \beta\|^2$ and $E_2 = \|X(\hat\beta - \beta)\|^2$, as well as the false discovery proportion (FDP) and the true positive proportion (TPP).
When fitting a cross-validated lasso model using cv.glmnet, two lambda values are commonly reported (a short usage sketch follows the list below):
• lambda.min: The value of lambda that gives the minimum mean cross-validated error. This is often referred to as the "best" lambda because it directly minimizes the prediction error on the validation set.
• lambda.1se: The largest value of lambda for which the mean cross-validated error
is within one standard error of the minimum. This lambda value usually results in a
sparser model (fewer non-zero coefficients), potentially improving interpretability
and generalization by favoring simpler models.
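An illustrative cv.glmnet call on simulated data (hypothetical dimensions), extracting both lambdas and the number of variables selected at each:

```r
# Cross-validated LASSO: lambda.min minimizes the CV error, lambda.1se is the
# largest lambda within one standard error of that minimum (sparser model).
library(glmnet)
set.seed(2024)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(3, 5) + rnorm(n))
cvfit <- cv.glmnet(X, y)
cvfit$lambda.min
cvfit$lambda.1se
co_min <- as.matrix(coef(cvfit, s = "lambda.min"))
co_1se <- as.matrix(coef(cvfit, s = "lambda.1se"))
sum(co_min[-1, ] != 0)   # number of selected variables at lambda.min
sum(co_1se[-1, ] != 0)   # usually fewer at lambda.1se
```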
10.3 Results