
Machine Learning Course - CS-433

Regularization:
Ridge Regression and Lasso

Sept 25, 2024

Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
We have seen that by augmenting the feature vector we can make linear models as powerful as we want. Unfortunately this leads to the problem of overfitting. Regularization is a way to mitigate this undesirable behavior.
We will discuss regularization in the context of linear models, but the same principle applies also to more complex models such as neural nets.

Regularization
Through regularization, we can penalize complex models and favor simpler ones:

    min_w  L(w) + Ω(w)

The second term Ω is a regularizer, measuring the complexity of the model given by w.
L2-Regularization: Ridge Regression
The most frequently used regularizer is the standard Euclidean norm (L2-norm), that is

    Ω(w) = λ ∥w∥_2^2

where ∥w∥_2^2 = ∑_i w_i^2. Here the main effect is that large model weights w_i will be penalized (avoided), since we consider them “unlikely”, while small ones are ok.
When L is MSE, this is called ridge regression:

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2 + λ ∥w∥_2^2

Least squares is a special case of this: set λ := 0.
Explicit solution for w: Differentiating and setting to zero:

    w_ridge = (X^⊤X + λ′I)^{−1} X^⊤ y

(here, for simpler notation, λ′ := 2Nλ)
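To make the closed-form solution concrete, here is a minimal NumPy sketch (the data, the function name ridge_closed_form, and the value of lam are made-up for illustration, not part of the lecture):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return w_ridge = (X^T X + lam' I)^{-1} X^T y, with lam' = 2 * N * lam."""
    N, D = X.shape
    lam_prime = 2 * N * lam
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam_prime * np.eye(D), X.T @ y)

# Toy data: lam = 0 recovers ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=0.0))  # close to the true weights
print(ridge_closed_form(X, y, lam=0.1))  # weights shrunk towards zero
```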
Ridge Regression to Fight Ill-Conditioning
The eigenvalues of (X^⊤X + λ′I) are all at least λ′, and so the inverse always exists. This is also referred to as lifting the eigenvalues.
Proof: Write the eigenvalue decomposition of X^⊤X as USU^⊤. We then have

    X^⊤X + λ′I = USU^⊤ + λ′UIU^⊤
               = U[S + λ′I]U^⊤.

We see now that every eigenvalue is “lifted” by an amount λ′.

Here is an alternative proof. Recall that for a symmetric matrix A we can also compute eigenvalues by looking at the so-called Rayleigh ratio,

    R(A, v) = (v^⊤ A v) / (v^⊤ v).

Note that if v is an eigenvector with eigenvalue λ then the Rayleigh coefficient indeed gives us λ. We can find the smallest and largest eigenvalue by minimizing and maximizing this coefficient. But note that if we apply this to the symmetric matrix X^⊤X + λ′I then for any vector v we have

    (v^⊤(X^⊤X + λ′I)v) / (v^⊤ v) ≥ (λ′ v^⊤ v) / (v^⊤ v) = λ′.
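A quick numerical sanity check of the lifting argument, using NumPy on an arbitrary made-up matrix (not part of the lecture notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
lam_prime = 0.5

eig_plain = np.linalg.eigvalsh(X.T @ X)                           # eigenvalues of X^T X (all >= 0)
eig_lifted = np.linalg.eigvalsh(X.T @ X + lam_prime * np.eye(4))  # eigenvalues of X^T X + lam' I

print(np.allclose(eig_lifted, eig_plain + lam_prime))  # True: each eigenvalue lifted by lam'
print(eig_lifted.min() >= lam_prime)                   # True: the inverse always exists
```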
L1-Regularization: The Lasso
As an alternative measure of the complexity of the model, we can use a different norm. A very important case is the L1-norm, leading to L1-regularization. In combination with the MSE cost function, this is known as the Lasso:

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2 + λ ∥w∥_1

where

    ∥w∥_1 := ∑_i |w_i|.

The figure above shows a “ball” of constant L1 norm. To keep things simple, assume that X^⊤X is invertible. We claim that in this case the set

    {w : ∥y − Xw∥^2 = α}    (1)

is an ellipsoid, and this ellipsoid simply scales around its origin as we change α. We claim that for L1-regularization the optimum solution is likely going to be sparse (only has a few non-zero components) compared to the case where we use L2-regularization.
Why is this the case? Assume that a genie tells you the L1-norm of the optimum solution. Draw the L1-ball with that norm value (think of 2D to visualize it). So now you know that the optimal point is somewhere on the surface of this “ball”. Further, you know that there are ellipsoids, all with the same mean and rotation, that describe the equal-error surfaces incurred by the first term. The optimum solution is where the “smallest” of these ellipsoids just touches the L1-ball. Due to the geometry of this ball, this point is more likely to be on one of the “corner” points. In turn, sparsity is desirable, since it leads to a “simple” model.
How do we see the claim that (1) describes an ellipsoid? First look at α = ∥Xw∥^2 = w^⊤X^⊤Xw. This is a quadratic form. Let A = X^⊤X. Note that A is a symmetric matrix and by assumption it has full rank. If A is a diagonal matrix with strictly positive elements a_i along the diagonal, then this describes the equation

    ∑_i a_i w_i^2 = α,

which is indeed the equation for an ellipsoid. In the general case, A can be written as (using the SVD) A = UBU^⊤, where B is a diagonal matrix with strictly positive entries. This then corresponds to an ellipsoid with rotated axes. If we now look at α = ∥y − Xw∥^2, where y is in the column space of X, then we can write it as α = ∥X(w_0 − w)∥^2 for a suitably chosen w_0, and so this corresponds to a shifted ellipsoid. Finally, for the general case, write y as y = y_∥ + y_⊥, where y_∥ is the component of y that lies in the subspace spanned by the columns of X and y_⊥ is the component that is orthogonal. In this case

    α = ∥y − Xw∥^2
      = ∥y_∥ + y_⊥ − Xw∥^2
      = ∥y_⊥∥^2 + ∥y_∥ − Xw∥^2
      = ∥y_⊥∥^2 + ∥X(w_0 − w)∥^2.

Hence this is then equivalent to the equation ∥X(w_0 − w)∥^2 = α − ∥y_⊥∥^2, proving the claim. From this we also see that if X^⊤X is not full rank, then what we get is not an ellipsoid but a cylinder with an ellipsoidal cross-section.
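As a sketch of this decomposition, the following NumPy snippet (with made-up X, y and w) splits y into y_∥ and y_⊥ and checks that ∥y − Xw∥^2 = ∥y_⊥∥^2 + ∥X(w_0 − w)∥^2 for an arbitrary w:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
w = rng.normal(size=3)

w0, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution, so that X @ w0 = y_parallel
y_par = X @ w0                              # component of y in the column space of X
y_perp = y - y_par                          # orthogonal component

lhs = np.linalg.norm(y - X @ w) ** 2
rhs = np.linalg.norm(y_perp) ** 2 + np.linalg.norm(X @ (w0 - w)) ** 2
print(np.isclose(lhs, rhs))  # True
```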
Additional Notes
Other Types of Regularization
Popular methods such as shrinkage, dropout and weight decay (in the context of neural networks), and early stopping of the optimization are all different forms of regularization.

Another view of regularization: The ridge regression formulation we have seen above is similar to the following constrained problem (for some τ > 0):

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2,   such that ∥w∥_2^2 ≤ τ

The following picture illustrates this.


[Figure: axes w_1 and w_2; the constrained optimum is marked w⋆.]
Figure 1: Geometric interpretation of Ridge Regression. Blue lines indicate the level sets of the MSE cost function.
For the case of using L1-regularization (known as the Lasso, when used with MSE) we analogously consider

    min_w  (1/(2N)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2,   such that ∥w∥_1 ≤ τ

This forces some of the elements of w to be strictly 0 and therefore enforces sparsity in the model (some features will not be used, since their coefficients are zero).

• Why does the L1 regularizer enforce sparsity? Hint: Draw a picture similar to the above, and locate the optimal solution (see also the numerical sketch below).
• Why is it good to have sparsity in the model? Is it going to be better than least-squares? When and why?
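To see the sparsity effect numerically, one possible sketch uses scikit-learn's Ridge and Lasso estimators on made-up data in which only two of ten features matter (scikit-learn and the specific alpha values are assumptions for illustration; alpha plays the role of λ up to scaling conventions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]                     # only the first two features matter
y = X @ true_w + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))  # all coefficients shrunk, but typically none exactly zero
print(np.round(lasso.coef_, 3))  # most coefficients are exactly zero (sparse solution)
```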
Ridge Regression as MAP estimator
Recall that classic least-squares linear regression can be interpreted as
the maximum likelihood estimator:
    w_lse = arg min_w  − log p(y, X | w)                                                      (a)
          = arg min_w  − log p(X | w) p(y | X, w)                                             (b)
          = arg min_w  − log p(X) p(y | X, w)                                                 (c)
          = arg min_w  − log p(y | X, w)                                                      (d)
          = arg min_w  − log [ ∏_{n=1}^{N} p(y_n | x_n, w) ]                                  (e)
          = arg min_w  − log [ ∏_{n=1}^{N} N(y_n | x_n^⊤ w, σ^2) ]                            (f)
          = arg min_w  − log [ ∏_{n=1}^{N} (1/√(2πσ^2)) exp(−(y_n − x_n^⊤ w)^2 / (2σ^2)) ]
          = arg min_w  −N log(1/√(2πσ^2)) + (1/(2σ^2)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2
          = arg min_w  (1/(2σ^2)) ∑_{n=1}^{N} (y_n − x_n^⊤ w)^2
In step (a) on the right we wrote down the negative of the log of the likelihood. The maximum likelihood criterion chooses the parameter w that minimizes this quantity (i.e., maximizes the likelihood). In step (b) we factored the likelihood. The usual assumption is that the choice of the input samples x_n does not depend on the model parameter (which only influences the output given the input). Hence, in step (c) we removed the conditioning. Since the factor p(X) does not depend on w (i.e., is a constant w.r.t. w), we can remove it. This is done in step (d). In step (e) we used the assumption that the samples are iid. In step (f) we then used our assumption that the samples have the form y_n = x_n^⊤ w + Z_n, where Z_n is Gaussian noise with mean zero and variance σ^2. The rest is calculus.

Ridge regression has a very similar interpretation. Now we start with the posterior p(w | X, y) and choose the parameter w that maximizes this posterior. Hence this is called the maximum-a-posteriori (MAP) estimate. As before, we take the log, add a minus sign, and minimize instead. In order to compute the posterior we use Bayes' law, and we assume that the components of the weight vector are iid Gaussians with mean zero and variance 1/λ.

    w_ridge = arg min_w  − log p(w | X, y)
            = arg min_w  − log [ p(y, X | w) p(w) / p(y, X) ]                                 (a)
            = arg min_w  − log p(y, X | w) p(w)                                               (b)
            = arg min_w  − log p(y | X, w) p(w)                                               (c)
            = arg min_w  − log [ p(w) ∏_{n=1}^{N} p(y_n | x_n, w) ]
            = arg min_w  − log [ N(w | 0, (1/λ) I) ∏_{n=1}^{N} N(y_n | x_n^⊤ w, σ^2) ]
            = arg min_w  − log [ (1/(2π/λ)^{D/2}) e^{−(λ/2)∥w∥^2} ∏_{n=1}^{N} (1/√(2πσ^2)) e^{−(y_n − x_n^⊤ w)^2/(2σ^2)} ]
            = arg min_w  ∑_{n=1}^{N} (1/(2σ^2)) (y_n − x_n^⊤ w)^2 + (λ/2) ∥w∥^2.

In step (a) we used Bayes' law. In steps (b) and (c) we eliminated quantities that do not depend on w.
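As a sanity check of this correspondence, the sketch below (made-up data; SciPy is assumed) minimizes the negative log-posterior from the last line directly and compares the result with the earlier closed-form ridge solution, using λ′ = σ^2 λ, which follows from setting the gradient of that last expression to zero:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.5, -1.5]) + 0.2 * rng.normal(size=40)
sigma2, lam = 0.04, 2.0  # noise variance sigma^2 and prior precision lambda (made-up values)

def neg_log_posterior(w):
    # sum_n (y_n - x_n^T w)^2 / (2 sigma^2) + (lambda / 2) * ||w||^2, up to constants
    return np.sum((y - X @ w) ** 2) / (2 * sigma2) + 0.5 * lam * np.sum(w ** 2)

w_map = minimize(neg_log_posterior, x0=np.zeros(3)).x
w_closed = np.linalg.solve(X.T @ X + sigma2 * lam * np.eye(3), X.T @ y)  # lam' = sigma^2 * lambda
print(np.allclose(w_map, w_closed, atol=1e-4))  # True
```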
Regularization as prior information
Based on the previous derivation of ridge regression, we can more generally see regularization as encoding any kind of prior information we have; it can thus be understood as a compressed form of data. In this sense, using regularization is equivalent to adding data, which helps reduce overfitting.
