This document discusses regularized least squares algorithms for statistical learning problems. It introduces linear regression methods, namely ordinary least squares (OLS) and ridge regression. OLS finds the linear function that minimizes the empirical risk by solving a linear system via the pseudoinverse. Ridge regression adds a regularization term to the least squares objective, which introduces a bias towards certain solutions and reduces sensitivity to noise.


MIT 9.520/6.860, Fall 2018


Statistical Learning Theory and Applications

Class 03: Regularized Least Squares

Lorenzo Rosasco
Learning problem and algorithms

Solve
$$\min_{f \in \mathcal{F}} L(f), \qquad L(f) = \mathbb{E}_{(x,y)\sim P}[\ell(y, f(x))],$$
given only
$$S_n = (x_1, y_1), \ldots, (x_n, y_n) \sim P^n.$$

Learning algorithm:
$$S_n \mapsto \hat{f} = \hat{f}_{S_n},$$
where $\hat{f}$ estimates $f_P$ given the observed examples $S_n$.



How can we design a learning algorithm?



Algorithm design: complexity and regularization

The design of most algorithms proceeds as follows:

- Pick a (possibly large) class of functions $\mathcal{H}$, ideally such that
  $$\min_{f \in \mathcal{H}} L(f) = \min_{f \in \mathcal{F}} L(f).$$
- Define a procedure $A_\gamma(S_n) = \hat{f}_\gamma \in \mathcal{H}$ to explore the space $\mathcal{H}$.



Empirical risk minimization

A classical example (called M-estimation in statistics).

Consider a nested family $(\mathcal{H}_\gamma)_\gamma$ such that
$$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \ldots \subset \mathcal{H}_\gamma \subset \ldots \subset \mathcal{H}.$$

Then, let
$$\hat{f}_\gamma = \operatorname*{argmin}_{f \in \mathcal{H}_\gamma} \hat{L}(f), \qquad \hat{L}(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)).$$

This is the idea we discuss next.



Linear functions

Let $\mathcal{H}$ be the space of linear functions
$$f(x) = w^\top x.$$
Then,
- $f \leftrightarrow w$ is one to one,
- inner product: $\langle f, \bar{f} \rangle_{\mathcal{H}} := w^\top \bar{w}$,
- norm/metric: $\|f - \bar{f}\|_{\mathcal{H}} := \|w - \bar{w}\|$.

Linear functions are the conceptual building blocks of most function classes.



Linear least squares

ERM with the squared loss is also called ordinary least squares (OLS):
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\hat{L}(w)}.$$

- Statistics later...
- ...now computations.

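To make "now computations" concrete, here is a minimal NumPy sketch, not from the slides: the synthetic data and names such as `w_ols` are purely illustrative. It minimizes the empirical risk above with a standard least squares solver.

```python
import numpy as np

# Synthetic data: n = 200 points in d = 5 dimensions (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# OLS: minimize (1/n) * ||Y - X w||^2 over w.
w_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

empirical_risk = np.mean((Y - X @ w_ols) ** 2)
print(w_ols, empirical_risk)
```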


Matrices and linear systems

Let $\hat{X} \in \mathbb{R}^{n \times d}$ and $\hat{Y} \in \mathbb{R}^n$. Then
$$\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 = \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2.$$

This is the least squares problem associated to the linear system
$$\hat{X} w = \hat{Y}.$$



Overdetermined lin. syst.
$n > d$

[Figure: $\hat{X}$ maps $w \in \mathbb{R}^d$ to $\hat{X} w \in \mathbb{R}^n$; in general $\hat{Y}$ lies outside the range of $\hat{X}$.]

In general, $\nexists\, w$ such that $\hat{X} w = \hat{Y}$.



Least squares solutions

From the optimality conditions
$$\nabla_w\, \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 = 0$$
we can derive the normal equations
$$\hat{X}^\top \hat{X}\, w = \hat{X}^\top \hat{Y} \quad \Leftrightarrow \quad \hat{w} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top \hat{Y}.$$

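A quick numerical sanity check of the normal equations, assuming $\hat{X}^\top \hat{X}$ is invertible; the synthetic data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                      # overdetermined: n > d
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Solve the normal equations X^T X w = X^T Y directly.
w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Compare with a generic least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_normal, w_lstsq))   # True (up to numerical precision)
```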


Underdetermined lin. syst.
$n < d$

[Figure: $\hat{X}$ maps $w \in \mathbb{R}^d$ to $\hat{X} w \in \mathbb{R}^n$; the system $\hat{X} w = \hat{Y}$ has solutions.]

$\exists\, w$ such that $\hat{X} w = \hat{Y}$, possibly not unique...
Minimal norm solution

There can be many solutions:
$$\hat{X} w = \hat{Y} \quad \text{and} \quad \hat{X} w_0 = 0 \ \Rightarrow\ \hat{X}(w + w_0) = \hat{Y}.$$

Consider
$$\min_{w \in \mathbb{R}^d} \|w\|^2, \quad \text{subj. to } \hat{X} w = \hat{Y}.$$

Using the method of Lagrange multipliers, the solution is
$$\hat{w} = \hat{X}^\top (\hat{X}\hat{X}^\top)^{-1} \hat{Y}.$$



Pseudoinverse

$$\hat{w} = \hat{X}^\dagger \hat{Y}$$

For $n > d$ (independent columns),
$$\hat{X}^\dagger = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top.$$

For $n < d$ (independent rows),
$$\hat{X}^\dagger = \hat{X}^\top (\hat{X}\hat{X}^\top)^{-1}.$$

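A sketch checking that the two closed forms above agree with NumPy's pseudoinverse in both regimes; shapes and names are illustrative, and full rank is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: n > d, independent columns.
X_tall = rng.standard_normal((200, 5))
assert np.allclose(np.linalg.pinv(X_tall),
                   np.linalg.inv(X_tall.T @ X_tall) @ X_tall.T)

# Underdetermined case: n < d, independent rows.
X_wide = rng.standard_normal((5, 200))
assert np.allclose(np.linalg.pinv(X_wide),
                   X_wide.T @ np.linalg.inv(X_wide @ X_wide.T))

# In both cases w_hat = pinv(X) @ Y is the (minimal norm) least squares solution.
```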


Spectral view

Consider the SVD of $\hat{X}$:
$$\hat{X} = U S V^\top \quad \Leftrightarrow \quad \hat{X} w = \sum_{j=1}^r s_j (v_j^\top w)\, u_j,$$
where $r \le n \wedge d$ is the rank of $\hat{X}$.

Then,
$$\hat{w}^\dagger = \hat{X}^\dagger \hat{Y} = \sum_{j=1}^r \frac{1}{s_j} (u_j^\top \hat{Y})\, v_j.$$

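A sketch computing $\hat{w}^\dagger$ from the SVD expansion above and comparing it with the pseudoinverse solution; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Thin SVD: X = U @ diag(s) @ Vt, with min(n, d) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# w_dagger = sum_j (1/s_j) (u_j^T Y) v_j, keeping only nonzero singular values.
nonzero = s > 1e-12
w_svd = Vt[nonzero].T @ ((U[:, nonzero].T @ Y) / s[nonzero])

print(np.allclose(w_svd, np.linalg.pinv(X) @ Y))   # True
```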


Pseudoinverse and bias

$$\hat{w}^\dagger = \hat{X}^\dagger \hat{Y} = \sum_{j=1}^r \frac{1}{s_j} (u_j^\top \hat{Y})\, v_j.$$

The $(v_j)_j$ are the principal components of $\hat{X}$: OLS "likes" principal components.

Not all linear functions are the same for OLS!

The pseudoinverse introduces a bias towards certain solutions.



From OLS to ridge regression

Recall that it also holds
$$\hat{X}^\dagger = \lim_{\lambda \to 0^+} (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top = \lim_{\lambda \to 0^+} \hat{X}^\top (\hat{X}\hat{X}^\top + \lambda I)^{-1}.$$

Consider, for $\lambda > 0$,
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}.$$

This is called ridge regression.

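A minimal sketch of the ridge regression solve, in the un-rescaled form $(\hat{X}^\top\hat{X}+\lambda I)^{-1}\hat{X}^\top\hat{Y}$; the helper name `ridge` and the data are illustrative.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge regression: solve (X^T X + lam * I) w = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal(200)

w_ridge = ridge(X, Y, lam=0.1)

# As lam -> 0+, the ridge solution approaches the pseudoinverse solution.
print(np.allclose(ridge(X, Y, lam=1e-10), np.linalg.pinv(X) @ Y, atol=1e-6))
```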


Spectral view on ridge regression

$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}$$

Considering the SVD of $\hat{X}$,
$$\hat{w}_\lambda = \sum_{j=1}^r \frac{s_j}{s_j^2 + \lambda} (u_j^\top \hat{Y})\, v_j.$$

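A sketch verifying the spectral formula against the direct solve; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal(200)
lam = 0.1

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Spectral form: sum_j s_j / (s_j^2 + lam) * (u_j^T Y) v_j.
filt = s / (s**2 + lam)
w_spectral = Vt.T @ (filt * (U.T @ Y))

w_direct = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
print(np.allclose(w_spectral, w_direct))   # True
```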


Ridge regression as filtering

$$\hat{w}_\lambda = \sum_{j=1}^r \frac{s_j}{s_j^2 + \lambda} (u_j^\top \hat{Y})\, v_j$$

The function
$$F(s) = \frac{s}{s^2 + \lambda}$$
acts as a low pass filter (low frequencies = principal components).

- For $s$ small, $F(s) \approx s/\lambda$.
- For $s$ big, $F(s) \approx 1/s$.

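A tiny numeric check of the two regimes of the filter, with values chosen only for illustration.

```python
import numpy as np

lam = 1.0
F = lambda s: s / (s**2 + lam)

s_small, s_big = 1e-3, 1e3
print(F(s_small), s_small / lam)   # small singular values are damped towards zero
print(F(s_big), 1 / s_big)         # large singular values are inverted, as in OLS
```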


Ridge regression as ERM

$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}$$
is the solution of
$$\min_{w \in \mathbb{R}^d} \underbrace{\|\hat{Y} - \hat{X} w\|^2 + \lambda \|w\|^2}_{\hat{L}_\lambda(w)}.$$

It follows from setting the gradient to zero:
$$\nabla \hat{L}_\lambda(w) = -2\,\hat{X}^\top(\hat{Y} - \hat{X} w) + 2\lambda w = 2(\hat{X}^\top \hat{X} + \lambda I)w - 2\,\hat{X}^\top \hat{Y} = 0.$$

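A sketch confirming that the ridge solution makes the gradient above vanish; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal(200)
lam = 0.1

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)

# Gradient of ||Y - X w||^2 + lam * ||w||^2 at the ridge solution.
grad = -2 * X.T @ (Y - X @ w_ridge) + 2 * lam * w_ridge
print(np.allclose(grad, 0))   # True
```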


Ridge regression as ERM

The ERM interpretation suggests the rescaling
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + n\lambda I)^{-1} \hat{X}^\top \hat{Y},$$
since this is the solution of
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 + \lambda \|w\|^2}_{\hat{L}_\lambda(w)}.$$

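A sketch showing that the two parameterizations give the same estimator once $\lambda$ is rescaled by $n$; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
lam = 0.1

# (X^T X + n*lam*I)^{-1} X^T Y  vs  ((1/n) X^T X + lam*I)^{-1} (1/n) X^T Y
w_unscaled = np.linalg.solve(X.T @ X + (n * lam) * np.eye(d), X.T @ Y)
w_rescaled = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
print(np.allclose(w_unscaled, w_rescaled))   # True: same estimator, lam rescaled by n
```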


Related ideas

Tikhonov
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 + \lambda \|w\|^2$$

Morozov
$$\min_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{subj. to} \quad \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 \le \delta$$

Ivanov
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 \quad \text{subj. to} \quad \|w\|^2 \le R$$



Ridge regression and SRM

The constraint
$$\|w\|^2 \le R$$
- restricts the search for a solution,
- shrinks the solution coefficients.



Different views on regularization

OLS / minimal norm solution:
$$\hat{w} = \hat{X}^\dagger \hat{Y}, \qquad \min_{w \in \mathbb{R}^d} \|w\|^2 \ \text{ s.t. } \ \hat{X} w = \hat{Y}.$$

Ridge regression:
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}, \qquad \min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2.$$

- Introduces a bias towards certain solutions: small norm / principal components.
- Controls the stability of the solution.



Complexity of ridge regression

Back to computations.

Solving
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}$$
requires, essentially (using a direct solver),
- time $O(nd^2 + d^3)$,
- memory $O(nd \vee d^2)$.

What if $n \ll d$?



Representer theorem in disguise

A simple observation: using the SVD, one can see that
$$(\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top = \hat{X}^\top (\hat{X}\hat{X}^\top + \lambda I)^{-1}.$$



More on complexity

Then
$$\hat{w}_\lambda = \hat{X}^\top (\hat{X}\hat{X}^\top + \lambda I)^{-1} \hat{Y}$$
requires, essentially (using a direct solver),
- time $O(n^2 d + n^3)$,
- memory $O(nd \vee n^2)$.

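A sketch comparing the $d \times d$ solve with the $n \times n$ solve obtained from the identity above, in a regime where $d \gg n$ and the second form is cheaper; shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2000                     # high-dimensional regime: d >> n
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
lam = 0.1

# Primal form: solve a d x d system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Dual form: solve an n x n system, then map back with X^T.
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

print(np.allclose(w_primal, w_dual))   # True: same solution, different cost
```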


Representer theorem

Note that
$$\hat{w}_\lambda = \hat{X}^\top \underbrace{(\hat{X}\hat{X}^\top + \lambda I)^{-1} \hat{Y}}_{c\,\in\,\mathbb{R}^n} = \sum_{i=1}^n x_i\, c_i.$$
The solution is a linear combination of the input points, with coefficient vector $c \in \mathbb{R}^n$.

Then
$$\hat{f}_\lambda(x) = x^\top \hat{w}_\lambda = x^\top \hat{X}^\top c = \sum_{i=1}^n x^\top x_i\, c_i.$$
The function we obtain is a linear combination of inner products.

This will be the key to nonparametric learning.

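A sketch of prediction in this form: the coefficients $c$ and the predictions depend on the data only through inner products (illustrative data and names; this is the linear, "kernel-ready" formulation, not yet a kernel method).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
lam = 0.1

# Coefficients c in R^n: one per training point.
c = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

# Predict at new points using only inner products x^T x_i.
X_new = rng.standard_normal((5, d))
preds_inner = (X_new @ X.T) @ c          # sum_i <x, x_i> c_i

# Same predictions via the explicit weight vector w = X^T c.
w = X.T @ c
print(np.allclose(preds_inner, X_new @ w))   # True
```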


Summing up

- From OLS to ridge regression.
- Different views: (spectral) filtering and ERM.
- Regularization and bias.

TBD
- Beyond linear models.
- Optimization.
- Model selection.

