Regularization:
Ridge Regression and Lasso
Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation

We have seen that by augmenting the feature vector we can make linear models as powerful as we want. Unfortunately, this leads to the problem of overfitting. Regularization is a way to mitigate this undesirable behavior.

We will discuss regularization in the context of linear models, but the same principle applies also to more complex models such as neural nets.
Regularization

Through regularization, we can penalize complex models and favor simpler ones:
$$\Omega(\mathbf{w}) = \lambda \,\|\mathbf{w}\|_2^2,$$
where $\|\mathbf{w}\|_2^2 = \sum_i w_i^2$. Here the main effect is that large model weights $w_i$ will be penalized (avoided), since we consider them “unlikely”, while small ones are ok.
When L is the MSE, this is called ridge regression:
$$\min_{\mathbf{w}} \; \frac{1}{2N} \sum_{n=1}^{N} \bigl(y_n - \mathbf{x}_n^\top \mathbf{w}\bigr)^2 + \lambda \|\mathbf{w}\|_2^2.$$
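Setting the gradient of this objective to zero gives the closed-form solution $\mathbf{w} = (\mathbf{X}^\top\mathbf{X} + 2N\lambda\,\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$. As a minimal sketch of how one might compute it, assuming the data sit in a NumPy array X of shape (N, D) with targets y (the function name and signature are illustrative, not taken from the notes):

import numpy as np

def ridge_regression(X, y, lam):
    # Solve the normal equations of the regularized objective:
    # (X^T X + 2*N*lam*I) w = X^T y
    N, D = X.shape
    A = X.T @ X + 2 * N * lam * np.eye(D)
    return np.linalg.solve(A, X.T @ y)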
Replacing the L2 penalty by the L1 norm gives the Lasso:
$$\min_{\mathbf{w}} \; \frac{1}{2N} \sum_{n=1}^{N} \bigl(y_n - \mathbf{x}_n^\top \mathbf{w}\bigr)^2 + \lambda \|\mathbf{w}\|_1,
\quad\text{where}\quad \|\mathbf{w}\|_1 := \sum_i |w_i|.$$
The L1 penalty encourages sparse solutions, i.e., many entries of $\mathbf{w}$ end up exactly zero; the geometric picture after the sketch below gives an intuition why.
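Unlike ridge regression, the Lasso has no closed-form solution, since the L1 norm is not differentiable at zero. One common solver (not covered in these notes) is proximal gradient descent, which alternates a gradient step on the MSE term with soft-thresholding; a minimal sketch under the same NumPy setup as above, with illustrative parameter names and defaults:

import numpy as np

def lasso_ista(X, y, lam, n_iters=500):
    N, D = X.shape
    # Step size 1/L, where L = ||X||_2^2 / N is the Lipschitz constant
    # of the gradient of the MSE term.
    gamma = N / np.linalg.norm(X, 2) ** 2
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ w) / N         # gradient of the MSE term
        z = w - gamma * grad                  # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # soft-thresholding
    return w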
The figure above shows a “ball” of constant L1 norm. To keep things simple, assume that $\mathbf{X}^\top\mathbf{X}$ is invertible. We claim that in this case the set
$$\{\mathbf{w} : \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \alpha\} \qquad (1)$$
is an ellipsoid centered at the least-squares solution $\mathbf{w}_0 = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$. To see this, decompose $\mathbf{y} = \mathbf{y}_\parallel + \mathbf{y}_\perp$, where $\mathbf{y}_\parallel$ lies in the column space of $\mathbf{X}$ and $\mathbf{y}_\perp$ is orthogonal to it, so that $\mathbf{X}\mathbf{w}_0 = \mathbf{y}_\parallel$. Then
$$
\begin{aligned}
\alpha = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2
  &= \|\mathbf{y}_\parallel + \mathbf{y}_\perp - \mathbf{X}\mathbf{w}\|^2 \\
  &= \|\mathbf{y}_\perp\|^2 + \|\mathbf{y}_\parallel - \mathbf{X}\mathbf{w}\|^2 \\
  &= \|\mathbf{y}_\perp\|^2 + \|\mathbf{X}(\mathbf{w}_0 - \mathbf{w})\|^2 .
\end{aligned}
$$
If we constrain $\mathbf{w}$ to lie inside such an L1 ball (which is equivalent to the penalized formulation for a suitable $\lambda$), the solution is the point where the smallest such ellipsoid touches the ball; this typically happens at a corner of the ball, where some coordinates of $\mathbf{w}$ are exactly zero.
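A quick numerical sanity check of this decomposition (not part of the notes; it assumes NumPy and randomly generated data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # tall random X, so X^T X is invertible
y = rng.normal(size=50)
w = rng.normal(size=3)                  # an arbitrary candidate w

w0 = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares solution
y_par = X @ w0                          # component of y in the column space of X
y_perp = y - y_par                      # orthogonal component

lhs = np.sum((y - X @ w) ** 2)
rhs = np.sum(y_perp ** 2) + np.sum((X @ (w0 - w)) ** 2)
print(np.isclose(lhs, rhs))             # True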
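Ridge regression can also be derived as maximum a-posteriori (MAP) estimation. As a sketch, assume the Gaussian likelihood $p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) = \prod_n \mathcal{N}(y_n \mid \mathbf{x}_n^\top\mathbf{w}, \sigma^2)$ and the i.i.d. Gaussian prior $p(\mathbf{w}) = \prod_i \mathcal{N}(w_i \mid 0, \tau^2)$; this particular parametrization is chosen here for concreteness:

\begin{align*}
\arg\min_{\mathbf{w}} \; -\log p(\mathbf{w}\mid\mathbf{y},\mathbf{X})
  &\overset{(a)}{=} \arg\min_{\mathbf{w}} \; -\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) - \log p(\mathbf{w}) + \log p(\mathbf{y}\mid\mathbf{X}) \\
  &\overset{(b)}{=} \arg\min_{\mathbf{w}} \; -\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) - \log p(\mathbf{w}) \\
  &\overset{(c)}{=} \arg\min_{\mathbf{w}} \; \frac{1}{2\sigma^2}\sum_{n=1}^{N}\bigl(y_n - \mathbf{x}_n^\top\mathbf{w}\bigr)^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|_2^2,
\end{align*}

which is the ridge objective above with $\lambda = \sigma^2/(2N\tau^2)$.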
In step (a) we used Bayes’ law. In steps (b) and (c) we eliminated quantities that do not depend on $\mathbf{w}$.
Regularization as prior information

Based on the previous derivation of ridge regression, we can more generally see regularization as encoding any kind of prior information we have. Prior information can be understood as a compressed form of data; in this sense, using regularization is equivalent to adding data, which helps reduce overfitting.
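For ridge regression this equivalence can be made literal: plain least squares on a dataset augmented with D “virtual” observations (input $\sqrt{2N\lambda}\,\mathbf{e}_i$, target 0, one per feature) recovers exactly the ridge solution. This particular construction is an illustration rather than something stated in the notes; a minimal NumPy check:

import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 40, 3, 0.1
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Ridge solution via the normal equations: (X^T X + 2*N*lam*I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + 2 * N * lam * np.eye(D), X.T @ y)

# Plain least squares on the augmented dataset
X_aug = np.vstack([X, np.sqrt(2 * N * lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_ls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_ls))       # True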