
Machine Learning Course - CS-433

Regularization:
Ridge Regression and Lasso

Sept 25, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
We have seen that by augmenting the feature vector we can make linear models as powerful as we want. Unfortunately this leads to the problem of overfitting. Regularization is a way to mitigate this undesirable behavior.
We will discuss regularization in the context of linear models, but the same principle applies also to more complex models such as neural nets.

Regularization
Through regularization, we can penalize complex models and favor simpler ones:

    min_w  L(w) + Ω(w)

The second term Ω is a regularizer, measuring the complexity of the model given by w.
L2-Regularization: Ridge Regression
The most frequently used regularizer is the standard Euclidean norm (L2-norm), that is

    Ω(w) = λ ∥w∥_2^2

where ∥w∥_2^2 = ∑_i w_i^2. Here the main effect is that large model weights w_i will be penalized (avoided), since we consider them “unlikely”, while small ones are ok.
When L is MSE, this is called ridge regression:

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2 + λ ∥w∥_2^2

Least squares is a special case of this: set λ := 0.
Explicit solution for w: Differentiating and setting to zero:

    w_ridge = (X^⊤X + λ′I)^{−1} X^⊤ y

(here, for simpler notation, λ′ := 2Nλ)
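To make the closed-form solution concrete, here is a minimal NumPy sketch (the data, the function name ridge_closed_form, and the value of lam are made-up for illustration, not part of the lecture):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return w_ridge = (X^T X + lam' I)^{-1} X^T y, with lam' = 2 * N * lam."""
    N, D = X.shape
    lam_prime = 2 * N * lam
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam_prime * np.eye(D), X.T @ y)

# Toy data: lam = 0 recovers ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=0.0))  # close to the true weights
print(ridge_closed_form(X, y, lam=0.1))  # weights shrunk towards zero
```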
Ridge Regression to Fight Ill-Conditioning
The eigenvalues of (X^⊤X + λ′I) are all at least λ′, and so the inverse always exists. This is also referred to as lifting the eigenvalues.
Proof: Write the eigenvalue decomposition of X^⊤X as USU^⊤. We then have

    X^⊤X + λ′I = USU^⊤ + λ′UIU^⊤
               = U[S + λ′I]U^⊤.

We see now that every eigenvalue is “lifted” by an amount λ′.

Here is an alternative proof. Recall that for a symmetric matrix A we can also compute eigenvalues by looking at the so-called Rayleigh ratio,

    R(A, v) = (v^⊤ A v) / (v^⊤ v).

Note that if v is an eigenvector with eigenvalue λ then the Rayleigh coefficient indeed gives us λ. We can find the smallest and largest eigenvalue by minimizing and maximizing this coefficient. But note that if we apply this to the symmetric matrix X^⊤X + λ′I then for any vector v we have

    (v^⊤(X^⊤X + λ′I)v) / (v^⊤ v) ≥ (λ′ v^⊤ v) / (v^⊤ v) = λ′.
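A quick numerical sanity check of the lifting argument, using NumPy on an arbitrary made-up matrix (not part of the lecture notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
lam_prime = 0.5

eig_plain = np.linalg.eigvalsh(X.T @ X)                           # eigenvalues of X^T X (all >= 0)
eig_lifted = np.linalg.eigvalsh(X.T @ X + lam_prime * np.eye(4))  # eigenvalues of X^T X + lam' I

print(np.allclose(eig_lifted, eig_plain + lam_prime))  # True: each eigenvalue lifted by lam'
print(eig_lifted.min() >= lam_prime)                   # True: the inverse always exists
```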
L1-Regularization: The Lasso
As an alternative measure of the complexity of the model, we can use a different norm. A very important case is the L1-norm, leading to L1-regularization. In combination with the MSE cost function, this is known as the Lasso:

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2 + λ ∥w∥_1

where

    ∥w∥_1 := ∑_i |w_i|.

The figure above shows a “ball” of constant L1 norm. To keep things simple, assume that X^⊤X is invertible. We claim that in this case the set

    {w : ∥y − Xw∥^2 = α}    (1)

is an ellipsoid, and this ellipsoid simply scales around its origin as we change α. We claim that for L1-regularization the optimum solution is likely going to be sparse (only has a few non-zero components) compared to the case where we use L2-regularization.
Why is this the case? Assume that a genie tells you the L1-norm of the optimum solution. Draw the L1-ball with that norm value (think of 2D to visualize it). So now you know that the optimal point is somewhere on the surface of this “ball”. Further, you know that there are ellipsoids, all with the same mean and rotation, that describe the equal-error surfaces incurred by the first term. The optimum solution is where the “smallest” of these ellipsoids just touches the L1-ball. Due to the geometry of this ball, this point is more likely to be on one of the “corner” points. In turn, sparsity is desirable, since it leads to a “simple” model.
How do we see the claim that (1) describes an ellipsoid? First look at α = ∥Xw∥^2 = w^⊤X^⊤Xw. This is a quadratic form. Let A = X^⊤X. Note that A is a symmetric matrix and by assumption it has full rank. If A is a diagonal matrix with strictly positive elements a_i along the diagonal, then this describes the equation

    ∑_i a_i w_i^2 = α,

which is indeed the equation for an ellipsoid. In the general case, A can be written as (using the SVD) A = UBU^⊤, where B is a diagonal matrix with strictly positive entries. This then corresponds to an ellipsoid with rotated axes. If we now look at α = ∥y − Xw∥^2, where y is in the column space of X, then we can write it as α = ∥X(w_0 − w)∥^2 for a suitably chosen w_0, and so this corresponds to a shifted ellipsoid. Finally, for the general case, write y as y = y_∥ + y_⊥, where y_∥ is the component of y that lies in the subspace spanned by the columns of X and y_⊥ is the component that is orthogonal. In this case

    α = ∥y − Xw∥^2
      = ∥y_∥ + y_⊥ − Xw∥^2
      = ∥y_⊥∥^2 + ∥y_∥ − Xw∥^2
      = ∥y_⊥∥^2 + ∥X(w_0 − w)∥^2.

Hence this is then equivalent to the equation ∥X(w_0 − w)∥^2 = α − ∥y_⊥∥^2, proving the claim. From this we also see that if X^⊤X is not full rank, then what we get is not an ellipsoid but a cylinder with an ellipsoidal cross-section.
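As a sketch of this decomposition, the following NumPy snippet (with made-up X, y and w) splits y into y_∥ and y_⊥ and checks that ∥y − Xw∥^2 = ∥y_⊥∥^2 + ∥X(w_0 − w)∥^2 for an arbitrary w:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
w = rng.normal(size=3)

w0, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution, so that X @ w0 = y_parallel
y_par = X @ w0                              # component of y in the column space of X
y_perp = y - y_par                          # orthogonal component

lhs = np.linalg.norm(y - X @ w) ** 2
rhs = np.linalg.norm(y_perp) ** 2 + np.linalg.norm(X @ (w0 - w)) ** 2
print(np.isclose(lhs, rhs))  # True
```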
Additional Notes
Other Types of Regularization
Popular methods such as shrinkage, dropout and weight decay (in the context of neural networks), and early stopping of the optimization are all different forms of regularization.

Another view of regularization: The ridge regression formulation we have seen above is similar to the following constrained problem (for some τ > 0):

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2,   such that ∥w∥_2^2 ≤ τ

The following picture illustrates this.


[Figure: axes w_1 and w_2; the constrained optimum is marked w⋆.]
Figure 1: Geometric interpretation of Ridge Regression. Blue lines indicate the level sets of the MSE cost function.
For the case of using L1-regularization (known as the Lasso, when used with MSE) we analogously consider

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2,   such that ∥w∥_1 ≤ τ

This forces some of the elements of w to be strictly 0 and therefore enforces sparsity in the model (some features will not be used, since their coefficients are zero).

• Why does the L1 regularizer enforce sparsity? Hint: Draw a picture similar to the above, and locate the optimal solution (see also the numerical sketch below).
• Why is it good to have sparsity in the model? Is it going to be better than least-squares? When and why?
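To see the sparsity effect numerically, one possible sketch uses scikit-learn's Ridge and Lasso estimators on made-up data in which only two of ten features matter (scikit-learn and the specific alpha values are assumptions for illustration; alpha plays the role of λ up to scaling conventions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]                     # only the first two features matter
y = X @ true_w + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))  # all coefficients shrunk, but typically none exactly zero
print(np.round(lasso.coef_, 3))  # most coefficients are exactly zero (sparse solution)
```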
Ridge Regression as MAP estimator
Recall that classic least-squares linear regression can be interpreted as
the maximum likelihood estimator:
    w_lse = arg min_w  − log p(y, X | w)                                                      (a)
          = arg min_w  − log p(X | w) p(y | X, w)                                             (b)
          = arg min_w  − log p(X) p(y | X, w)                                                 (c)
          = arg min_w  − log p(y | X, w)                                                      (d)
          = arg min_w  − log [ ∏_{n=1}^{N} p(y_n | x_n, w) ]                                  (e)
          = arg min_w  − log [ ∏_{n=1}^{N} N(y_n | x_n^⊤ w, σ^2) ]                            (f)
          = arg min_w  − log [ ∏_{n=1}^{N} (1/√(2πσ^2)) exp(−(y_n − x_n^⊤ w)^2 / (2σ^2)) ]
          = arg min_w  −N log(1/√(2πσ^2)) + (1/(2σ^2)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2
          = arg min_w  (1/(2σ^2)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2
In step (a) on the right we wrote down the negative of the log of the likelihood. The maximum likelihood criterion chooses the parameter w that minimizes this quantity (i.e., maximizes the likelihood). In step (b) we factored the likelihood. The usual assumption is that the choice of the input samples x_n does not depend on the model parameter (which only influences the output given the input). Hence, in step (c) we removed the conditioning. Since the factor p(X) does not depend on w (i.e., is a constant w.r.t. w), we can remove it. This is done in step (d). In step (e) we used the assumption that the samples are iid. In step (f) we then used our assumption that the samples have the form y_n = x_n^⊤ w + Z_n, where Z_n is Gaussian noise with mean zero and variance σ^2. The rest is calculus.

Ridge regression has a very similar interpretation. Now we start with the posterior p(w | X, y) and choose the parameter w that maximizes this posterior. Hence this is called the maximum-a-posteriori (MAP) estimate. As before, we take the log, add a minus sign, and minimize instead. In order to compute the posterior we use Bayes' law, and we assume that the components of the weight vector are iid Gaussians with mean zero and variance 1/λ.

    w_ridge = arg min_w  − log p(w | X, y)
            = arg min_w  − log [ p(y, X | w) p(w) / p(y, X) ]                                 (a)
            = arg min_w  − log p(y, X | w) p(w)                                               (b)
            = arg min_w  − log p(y | X, w) p(w)                                               (c)
            = arg min_w  − log [ p(w) ∏_{n=1}^{N} p(y_n | x_n, w) ]
            = arg min_w  − log [ N(w | 0, (1/λ) I) ∏_{n=1}^{N} N(y_n | x_n^⊤ w, σ^2) ]
            = arg min_w  − log [ (1/(2π/λ)^{D/2}) e^{−(λ/2)∥w∥^2} ∏_{n=1}^{N} (1/√(2πσ^2)) e^{−(y_n − x_n^⊤ w)^2/(2σ^2)} ]
            = arg min_w  ∑_{n=1}^{N} (1/(2σ^2)) (y_n − x_n^⊤ w)^2 + (λ/2) ∥w∥^2.

In step (a) we used Bayes' law. In steps (b) and (c) we eliminated quantities that do not depend on w.
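As a sanity check of this correspondence, the sketch below (made-up data; SciPy is assumed) minimizes the negative log-posterior from the last line directly and compares the result with the earlier closed-form ridge solution, using λ′ = σ^2 λ, which follows from setting the gradient of that last expression to zero:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.5, -1.5]) + 0.2 * rng.normal(size=40)
sigma2, lam = 0.04, 2.0  # noise variance sigma^2 and prior precision lambda (made-up values)

def neg_log_posterior(w):
    # sum_n (y_n - x_n^T w)^2 / (2 sigma^2) + (lambda / 2) * ||w||^2, up to constants
    return np.sum((y - X @ w) ** 2) / (2 * sigma2) + 0.5 * lam * np.sum(w ** 2)

w_map = minimize(neg_log_posterior, x0=np.zeros(3)).x
w_closed = np.linalg.solve(X.T @ X + sigma2 * lam * np.eye(3), X.T @ y)  # lam' = sigma^2 * lambda
print(np.allclose(w_map, w_closed, atol=1e-4))  # True
```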
Regularization as prior information
Based on the previous derivation of ridge regression, we can more generally see regularization as encoding any kind of prior information we have; it can thus be understood as a compressed form of data. In this sense, using regularization is equivalent to adding data, which helps reduce overfitting.
