This document discusses regularized least squares algorithms for statistical learning problems. It introduces linear regression methods, namely ordinary least squares (OLS) and ridge regression. OLS finds the linear function that minimizes the empirical risk by solving a linear system via the pseudoinverse. Ridge regression adds a regularization term to the least squares objective, which introduces a bias towards certain solutions and reduces sensitivity to noise.


MIT 9.520/6.860, Fall 2018


Statistical Learning Theory and Applications

Class 03: Regularized Least Squares

Lorenzo Rosasco
Learning problem and algorithms

Solve
$$\min_{f \in \mathcal{F}} L(f), \qquad L(f) = \mathbb{E}_{(x,y)\sim P}[\ell(y, f(x))],$$
given only
$$S_n = (x_1, y_1), \ldots, (x_n, y_n) \sim P^n.$$

Learning algorithm:
$$S_n \mapsto \hat{f} = \hat{f}_{S_n},$$
where $\hat{f}$ estimates $f_P$ given the observed examples $S_n$.



How can we design a learning algorithm?



Algorithm design: complexity and regularization

The design of most algorithms proceeds as follows:

- Pick a (possibly large) class of functions $\mathcal{H}$, ideally such that
  $$\min_{f \in \mathcal{H}} L(f) = \min_{f \in \mathcal{F}} L(f).$$
- Define a procedure $A_\gamma(S_n) = \hat{f}_\gamma \in \mathcal{H}$ to explore the space $\mathcal{H}$.



Empirical risk minimization

A classical example (called M-estimation in statistics).

Consider a nested family $(\mathcal{H}_\gamma)_\gamma$ such that
$$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \ldots \subset \mathcal{H}_\gamma \subset \ldots \subset \mathcal{H}.$$

Then, let
$$\hat{f}_\gamma = \operatorname*{argmin}_{f \in \mathcal{H}_\gamma} \hat{L}(f), \qquad \hat{L}(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)).$$

This is the idea we discuss next.



Linear functions

Let $\mathcal{H}$ be the space of linear functions
$$f(x) = w^\top x.$$
Then,
- $f \leftrightarrow w$ is one to one,
- inner product: $\langle f, \bar{f} \rangle_{\mathcal{H}} := w^\top \bar{w}$,
- norm/metric: $\|f - \bar{f}\|_{\mathcal{H}} := \|w - \bar{w}\|$.

Linear functions are the conceptual building blocks of most function classes.



Linear least squares

ERM with the squared loss is also called ordinary least squares (OLS):
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\hat{L}(w)}.$$

- Statistics later...
- ...now computations.

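To make "now computations" concrete, here is a minimal NumPy sketch, not from the slides: the synthetic data and names such as `w_ols` are purely illustrative. It minimizes the empirical risk above with a standard least squares solver.

```python
import numpy as np

# Synthetic data: n = 200 points in d = 5 dimensions (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# OLS: minimize (1/n) * ||Y - X w||^2 over w.
w_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

empirical_risk = np.mean((Y - X @ w_ols) ** 2)
print(w_ols, empirical_risk)
```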


Matrices and linear systems

Let $\hat{X} \in \mathbb{R}^{n \times d}$ and $\hat{Y} \in \mathbb{R}^n$. Then
$$\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 = \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2.$$

This is the least squares problem associated to the linear system
$$\hat{X} w = \hat{Y}.$$



Overdetermined lin. syst.
$n > d$

[Figure: $\hat{X}$ maps $w \in \mathbb{R}^d$ to $\hat{X} w \in \mathbb{R}^n$; in general $\hat{Y}$ lies outside the range of $\hat{X}$.]

In general, $\nexists\, w$ such that $\hat{X} w = \hat{Y}$.



Least squares solutions

From the optimality conditions
$$\nabla_w\, \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 = 0$$
we can derive the normal equations
$$\hat{X}^\top \hat{X}\, w = \hat{X}^\top \hat{Y} \quad \Leftrightarrow \quad \hat{w} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top \hat{Y}.$$

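A quick numerical sanity check of the normal equations, assuming $\hat{X}^\top \hat{X}$ is invertible; the synthetic data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                      # overdetermined: n > d
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Solve the normal equations X^T X w = X^T Y directly.
w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Compare with a generic least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_normal, w_lstsq))   # True (up to numerical precision)
```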


Underdetermined lin. syst.
$n < d$

[Figure: $\hat{X}$ maps $w \in \mathbb{R}^d$ to $\hat{X} w \in \mathbb{R}^n$; the system $\hat{X} w = \hat{Y}$ has solutions.]

$\exists\, w$ such that $\hat{X} w = \hat{Y}$, possibly not unique...
Minimal norm solution

There can be many solutions:
$$\hat{X} w = \hat{Y} \quad \text{and} \quad \hat{X} w_0 = 0 \ \Rightarrow\ \hat{X}(w + w_0) = \hat{Y}.$$

Consider
$$\min_{w \in \mathbb{R}^d} \|w\|^2, \quad \text{subj. to } \hat{X} w = \hat{Y}.$$

Using the method of Lagrange multipliers, the solution is
$$\hat{w} = \hat{X}^\top (\hat{X}\hat{X}^\top)^{-1} \hat{Y}.$$



Pseudoinverse

$$\hat{w} = \hat{X}^\dagger \hat{Y}$$

For $n > d$ (independent columns),
$$\hat{X}^\dagger = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top.$$

For $n < d$ (independent rows),
$$\hat{X}^\dagger = \hat{X}^\top (\hat{X}\hat{X}^\top)^{-1}.$$

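A sketch checking that the two closed forms above agree with NumPy's pseudoinverse in both regimes; shapes and names are illustrative, and full rank is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: n > d, independent columns.
X_tall = rng.standard_normal((200, 5))
assert np.allclose(np.linalg.pinv(X_tall),
                   np.linalg.inv(X_tall.T @ X_tall) @ X_tall.T)

# Underdetermined case: n < d, independent rows.
X_wide = rng.standard_normal((5, 200))
assert np.allclose(np.linalg.pinv(X_wide),
                   X_wide.T @ np.linalg.inv(X_wide @ X_wide.T))

# In both cases w_hat = pinv(X) @ Y is the (minimal norm) least squares solution.
```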


Spectral view

Consider the SVD of $\hat{X}$:
$$\hat{X} = U S V^\top \quad \Leftrightarrow \quad \hat{X} w = \sum_{j=1}^r s_j (v_j^\top w)\, u_j,$$
where $r \le n \wedge d$ is the rank of $\hat{X}$.

Then,
$$\hat{w}^\dagger = \hat{X}^\dagger \hat{Y} = \sum_{j=1}^r \frac{1}{s_j} (u_j^\top \hat{Y})\, v_j.$$

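A sketch computing $\hat{w}^\dagger$ from the SVD expansion above and comparing it with the pseudoinverse solution; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Thin SVD: X = U @ diag(s) @ Vt, with min(n, d) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# w_dagger = sum_j (1/s_j) (u_j^T Y) v_j, keeping only nonzero singular values.
nonzero = s > 1e-12
w_svd = Vt[nonzero].T @ ((U[:, nonzero].T @ Y) / s[nonzero])

print(np.allclose(w_svd, np.linalg.pinv(X) @ Y))   # True
```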


Pseudoinverse and bias

$$\hat{w}^\dagger = \hat{X}^\dagger \hat{Y} = \sum_{j=1}^r \frac{1}{s_j} (u_j^\top \hat{Y})\, v_j.$$

The $(v_j)_j$ are the principal components of $\hat{X}$: OLS "likes" principal components.

Not all linear functions are the same for OLS!

The pseudoinverse introduces a bias towards certain solutions.



From OLS to ridge regression

Recall that it also holds
$$\hat{X}^\dagger = \lim_{\lambda \to 0^+} (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top = \lim_{\lambda \to 0^+} \hat{X}^\top (\hat{X}\hat{X}^\top + \lambda I)^{-1}.$$

Consider, for $\lambda > 0$,
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}.$$

This is called ridge regression.

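A minimal sketch of the ridge regression solve, in the un-rescaled form $(\hat{X}^\top\hat{X}+\lambda I)^{-1}\hat{X}^\top\hat{Y}$; the helper name `ridge` and the data are illustrative.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge regression: solve (X^T X + lam * I) w = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal(200)

w_ridge = ridge(X, Y, lam=0.1)

# As lam -> 0+, the ridge solution approaches the pseudoinverse solution.
print(np.allclose(ridge(X, Y, lam=1e-10), np.linalg.pinv(X) @ Y, atol=1e-6))
```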


Spectral view on ridge regression

$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}$$

Considering the SVD of $\hat{X}$,
$$\hat{w}_\lambda = \sum_{j=1}^r \frac{s_j}{s_j^2 + \lambda} (u_j^\top \hat{Y})\, v_j.$$

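A sketch verifying the spectral formula against the direct solve; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal(200)
lam = 0.1

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Spectral form: sum_j s_j / (s_j^2 + lam) * (u_j^T Y) v_j.
filt = s / (s**2 + lam)
w_spectral = Vt.T @ (filt * (U.T @ Y))

w_direct = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
print(np.allclose(w_spectral, w_direct))   # True
```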


Ridge regression as filtering

$$\hat{w}_\lambda = \sum_{j=1}^r \frac{s_j}{s_j^2 + \lambda} (u_j^\top \hat{Y})\, v_j$$

The function
$$F(s) = \frac{s}{s^2 + \lambda}$$
acts as a low pass filter (low frequencies = principal components).

- For $s$ small, $F(s) \approx s/\lambda$.
- For $s$ big, $F(s) \approx 1/s$.

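A tiny numeric check of the two regimes of the filter, with values chosen only for illustration.

```python
import numpy as np

lam = 1.0
F = lambda s: s / (s**2 + lam)

s_small, s_big = 1e-3, 1e3
print(F(s_small), s_small / lam)   # small singular values are damped towards zero
print(F(s_big), 1 / s_big)         # large singular values are inverted, as in OLS
```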


Ridge regression as ERM

$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}$$
is the solution of
$$\min_{w \in \mathbb{R}^d} \underbrace{\|\hat{Y} - \hat{X} w\|^2 + \lambda \|w\|^2}_{\hat{L}_\lambda(w)}.$$

It follows from setting the gradient to zero:
$$\nabla \hat{L}_\lambda(w) = -2\,\hat{X}^\top(\hat{Y} - \hat{X} w) + 2\lambda w = 2(\hat{X}^\top \hat{X} + \lambda I)w - 2\,\hat{X}^\top \hat{Y} = 0.$$

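A sketch confirming that the ridge solution makes the gradient above vanish; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal(200)
lam = 0.1

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)

# Gradient of ||Y - X w||^2 + lam * ||w||^2 at the ridge solution.
grad = -2 * X.T @ (Y - X @ w_ridge) + 2 * lam * w_ridge
print(np.allclose(grad, 0))   # True
```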


Ridge regression as ERM

The ERM interpretation suggests the rescaling
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + n\lambda I)^{-1} \hat{X}^\top \hat{Y},$$
since this is the solution of
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 + \lambda \|w\|^2}_{\hat{L}_\lambda(w)}.$$

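A sketch showing that the two parameterizations give the same estimator once $\lambda$ is rescaled by $n$; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
lam = 0.1

# (X^T X + n*lam*I)^{-1} X^T Y  vs  ((1/n) X^T X + lam*I)^{-1} (1/n) X^T Y
w_unscaled = np.linalg.solve(X.T @ X + (n * lam) * np.eye(d), X.T @ Y)
w_rescaled = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
print(np.allclose(w_unscaled, w_rescaled))   # True: same estimator, lam rescaled by n
```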


Related ideas

Tikhonov
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 + \lambda \|w\|^2$$

Morozov
$$\min_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{subj. to} \quad \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 \le \delta$$

Ivanov
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|\hat{Y} - \hat{X} w\|^2 \quad \text{subj. to} \quad \|w\|^2 \le R$$



Ridge regression and SRM

The constraint
$$\|w\|^2 \le R$$
- restricts the search for a solution,
- shrinks the solution coefficients.



Different views on regularization

OLS / minimal norm solution:
$$\hat{w} = \hat{X}^\dagger \hat{Y}, \qquad \min_{w \in \mathbb{R}^d} \|w\|^2 \ \text{ s.t. } \ \hat{X} w = \hat{Y}.$$

Ridge regression:
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}, \qquad \min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2.$$

- Introduces a bias towards certain solutions: small norm / principal components.
- Controls the stability of the solution.



Complexity of ridge regression

Back to computations.

Solving
$$\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}$$
requires, essentially (using a direct solver),
- time $O(nd^2 + d^3)$,
- memory $O(nd \vee d^2)$.

What if $n \ll d$?



Representer theorem in disguise

A simple observation: using the SVD, one can see that
$$(\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top = \hat{X}^\top (\hat{X}\hat{X}^\top + \lambda I)^{-1}.$$



More on complexity

Then
$$\hat{w}_\lambda = \hat{X}^\top (\hat{X}\hat{X}^\top + \lambda I)^{-1} \hat{Y}$$
requires, essentially (using a direct solver),
- time $O(n^2 d + n^3)$,
- memory $O(nd \vee n^2)$.

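A sketch comparing the $d \times d$ solve with the $n \times n$ solve obtained from the identity above, in a regime where $d \gg n$ and the second form is cheaper; shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2000                     # high-dimensional regime: d >> n
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
lam = 0.1

# Primal form: solve a d x d system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Dual form: solve an n x n system, then map back with X^T.
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

print(np.allclose(w_primal, w_dual))   # True: same solution, different cost
```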


Representer theorem

Note that
$$\hat{w}_\lambda = \hat{X}^\top \underbrace{(\hat{X}\hat{X}^\top + \lambda I)^{-1} \hat{Y}}_{c\,\in\,\mathbb{R}^n} = \sum_{i=1}^n x_i\, c_i.$$
The solution is a linear combination of the input points, with coefficient vector $c \in \mathbb{R}^n$.

Then
$$\hat{f}_\lambda(x) = x^\top \hat{w}_\lambda = x^\top \hat{X}^\top c = \sum_{i=1}^n x^\top x_i\, c_i.$$
The function we obtain is a linear combination of inner products.

This will be the key to nonparametric learning.

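A sketch of prediction in this form: the coefficients $c$ and the predictions depend on the data only through inner products (illustrative data and names; this is the linear, "kernel-ready" formulation, not yet a kernel method).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
lam = 0.1

# Coefficients c in R^n: one per training point.
c = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

# Predict at new points using only inner products x^T x_i.
X_new = rng.standard_normal((5, d))
preds_inner = (X_new @ X.T) @ c          # sum_i <x, x_i> c_i

# Same predictions via the explicit weight vector w = X^T c.
w = X.T @ c
print(np.allclose(preds_inner, X_new @ w))   # True
```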


Summing up

- From OLS to ridge regression.
- Different views: (spectral) filtering and ERM.
- Regularization and bias.

TBD
- Beyond linear models.
- Optimization.
- Model selection.

