Class03 RLS
Lorenzo Rosasco
Learning problem and algorithms
Solve
\[
\min_{f \in \mathcal{F}} L(f), \qquad L(f) = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))],
\]
given only
\[
S_n = (x_1, y_1), \dots, (x_n, y_n) \sim P^n.
\]
Learning algorithm:
\[
S_n \mapsto \hat{f} = \hat{f}_{S_n}.
\]
Consider a hypothesis space $\mathcal{H}$ such that
\[
\min_{f \in \mathcal{H}} L(f) = \min_{f \in \mathcal{F}} L(f),
\]
and a nested sequence
\[
\mathcal{H}_1 \subset \mathcal{H}_2 \subset \dots \subset \mathcal{H}_\gamma \subset \dots \subset \mathcal{H}.
\]
Then, let
\[
\hat{f}_\gamma = \operatorname*{arg\,min}_{f \in \mathcal{H}_\gamma} \hat{L}(f), \qquad \hat{L}(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)).
\]
Consider linear functions,
\[
f(x) = w^\top x.
\]
Then,
- $f \leftrightarrow w$ is one to one,
- inner product $\langle f, \bar{f} \rangle_{\mathcal{H}} := w^\top \bar{w}$,
- norm/metric $\| f - \bar{f} \|_{\mathcal{H}} := \| w - \bar{w} \|$.
ERM with least squares, also called ordinary least squares (OLS):
\[
\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2}_{\hat{L}(w)}.
\]
- Statistics later...
- ...now computations.
Let $\hat{X} \in \mathbb{R}^{n \times d}$ and $\hat{Y} \in \mathbb{R}^n$. Then
\[
\frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 = \frac{1}{n} \big\| \hat{Y} - \hat{X} w \big\|^2.
\]
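The two forms of the empirical risk are easy to check against each other numerically; below is a minimal NumPy sketch with synthetic data (all names and values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))   # \hat{X}: data matrix, one input per row
Y = rng.standard_normal(n)        # \hat{Y}: vector of outputs
w = rng.standard_normal(d)        # candidate weight vector

# Empirical risk as an average of pointwise squared errors.
risk_sum = np.mean([(Y[i] - w @ X[i]) ** 2 for i in range(n)])

# Same quantity in matrix form: (1/n) ||Y - Xw||^2.
risk_matrix = np.linalg.norm(Y - X @ w) ** 2 / n

assert np.isclose(risk_sum, risk_matrix)
```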
Consider the linear system
\[
\hat{X} w = \hat{Y}.
\]
[Figure: $\hat{X}$ maps $w \in \mathbb{R}^d$ to $\hat{X} w \in \mathbb{R}^n$; $\hat{Y} \in \mathbb{R}^n$ need not lie in the range of $\hat{X}$.]
If $\nexists\, w$ such that $\hat{X} w = \hat{Y}$, minimizing $\|\hat{Y} - \hat{X} w\|^2$ leads to the normal equations
\[
\hat{X}^\top \hat{X}\, w = \hat{X}^\top \hat{Y} \;\Leftrightarrow\; \hat{w} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top \hat{Y},
\]
assuming $\hat{X}^\top \hat{X}$ is invertible.
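A minimal sketch of the normal-equations solution in NumPy, checked against an SVD-based least squares solver (synthetic, overdetermined data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                       # overdetermined: n > d
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Normal equations: solve X^T X w = X^T Y (avoid forming an explicit inverse).
w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference: least squares via np.linalg.lstsq (SVD-based, more stable).
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```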
[Figure: underdetermined case; several $w \in \mathbb{R}^d$ are mapped by $\hat{X}$ onto $\hat{Y} \in \mathbb{R}^n$.]
If $\exists\, w$ such that $\hat{X} w = \hat{Y}$, it is possibly not unique...
Minimal norm solution
Note that
\[
\hat{X} \hat{w} = \hat{Y} \ \text{ and } \ \hat{X} w_0 = 0 \;\Rightarrow\; \hat{X} (\hat{w} + w_0) = \hat{Y}.
\]
Consider
\[
\min_{w \in \mathbb{R}^d} \|w\|^2, \quad \text{subj. to} \quad \hat{X} w = \hat{Y}.
\]
The solution is
\[
\hat{w} = \hat{X}^\top (\hat{X} \hat{X}^\top)^{-1} \hat{Y},
\]
that is, $\hat{w} = \hat{X}^\dagger \hat{Y}$ with the pseudoinverse
\[
\hat{X}^\dagger = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top \ \text{(overdetermined case)}, \qquad \hat{X}^\dagger = \hat{X}^\top (\hat{X} \hat{X}^\top)^{-1} \ \text{(underdetermined case)}.
\]
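Both expressions can be checked against NumPy's SVD-based pseudoinverse, each in its own regime; a small illustrative sketch, assuming full-rank random data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined (n > d): pinv(X) = (X^T X)^{-1} X^T.
X = rng.standard_normal((50, 5))
assert np.allclose(np.linalg.pinv(X), np.linalg.solve(X.T @ X, X.T))

# Underdetermined (n < d): pinv(X) = X^T (X X^T)^{-1}.
X = rng.standard_normal((5, 50))
assert np.allclose(np.linalg.pinv(X), X.T @ np.linalg.inv(X @ X.T))
```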
Then, writing the SVD of $\hat{X}$ as $\hat{X} = \sum_{j=1}^r s_j u_j v_j^\top$ (with $r$ the rank),
\[
\hat{w}^\dagger = \hat{X}^\dagger \hat{Y} = \sum_{j=1}^r \frac{1}{s_j} (u_j^\top \hat{Y})\, v_j.
\]
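A sketch of this expansion in NumPy, using the thin SVD of the data matrix and checking against the pseudoinverse solution (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Thin SVD: columns of U and V are u_j, v_j; s holds the singular values s_j.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# w = sum_j (1/s_j) (u_j^T Y) v_j, restricted to nonzero singular values.
r = np.sum(s > 1e-12)                      # numerical rank
w_svd = Vt[:r].T @ ((U[:, :r].T @ Y) / s[:r])

assert np.allclose(w_svd, np.linalg.pinv(X) @ Y)
```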
Moreover,
\[
\hat{X}^\dagger = \lim_{\lambda \to 0^+} (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top = \lim_{\lambda \to 0^+} \hat{X}^\top (\hat{X} \hat{X}^\top + \lambda I)^{-1}.
\]
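This limit is easy to observe numerically; a small illustrative sketch with a singular $\hat{X}^\top \hat{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))    # underdetermined, so X^T X is singular

# (X^T X + lam I)^{-1} X^T approaches pinv(X) as lam -> 0+.
for lam in [1e-2, 1e-6, 1e-10]:
    approx = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T)
    print(lam, np.max(np.abs(approx - np.linalg.pinv(X))))
```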
Consider, for $\lambda > 0$,
\[
\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}.
\]
Using the SVD,
\[
\hat{w}_\lambda = \sum_{j=1}^r \frac{s_j}{s_j^2 + \lambda} (u_j^\top \hat{Y})\, v_j.
\]
The function
\[
F(s) = \frac{s}{s^2 + \lambda}
\]
acts as a low-pass filter (low frequencies = principal components).
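The spectrally filtered expansion and the direct regularized solve agree; a minimal sketch (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Direct solve: w_lam = (X^T X + lam I)^{-1} X^T Y.
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Spectral filtering: replace 1/s_j with F(s_j) = s_j / (s_j^2 + lam).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_filter = Vt.T @ ((s / (s ** 2 + lam)) * (U.T @ Y))

assert np.allclose(w_direct, w_filter)
```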
Moreover,
\[
\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}
\]
is the solution of
\[
\min_{w \in \mathbb{R}^d} \underbrace{\big\| \hat{Y} - \hat{X} w \big\|^2 + \lambda \|w\|^2}_{\hat{L}_\lambda(w)}.
\]
It follows from
\[
\nabla \hat{L}_\lambda(w) = -\frac{2}{n} \hat{X}^\top (\hat{Y} - \hat{X} w) + 2 \lambda w = 2 \Big( \frac{1}{n} \hat{X}^\top \hat{X} + \lambda I \Big) w - \frac{2}{n} \hat{X}^\top \hat{Y},
\]
that setting the gradient to zero gives
\[
\hat{w}_\lambda = (\hat{X}^\top \hat{X} + n \lambda I)^{-1} \hat{X}^\top \hat{Y},
\]
since here we minimize the normalized objective
\[
\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n} \big\| \hat{Y} - \hat{X} w \big\|^2 + \lambda \|w\|^2}_{\hat{L}_\lambda(w)}.
\]
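The closed form can be verified by checking that the gradient vanishes there; a minimal sketch (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Closed form for the (1/n)-normalized objective: (X^T X + n lam I)^{-1} X^T Y.
w = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# The gradient -2/n X^T (Y - Xw) + 2 lam w should vanish at the minimizer.
grad = -2 / n * X.T @ (Y - X @ w) + 2 * lam * w
assert np.allclose(grad, 0)
```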
Tikhonov
\[
\min_{w \in \mathbb{R}^d} \frac{1}{n} \big\| \hat{Y} - \hat{X} w \big\|^2 + \lambda \|w\|^2
\]
Morozov
\[
\min_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{subj. to} \quad \frac{1}{n} \big\| \hat{Y} - \hat{X} w \big\|^2 \le \delta
\]
Ivanov
\[
\min_{w \in \mathbb{R}^d} \frac{1}{n} \big\| \hat{Y} - \hat{X} w \big\|^2, \quad \text{subj. to} \quad \|w\|^2 \le R
\]
The constraint $\|w\|^2 \le R$ explicitly constrains the size of the hypothesis space.
Minimal norm solution vs. regularized solution:
\[
\hat{w} = \hat{X}^\dagger \hat{Y} \qquad\qquad \hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}
\]
\[
\min_{w \in \mathbb{R}^d} \|w\|^2 \ \text{ s.t. } \ \hat{X} w = \hat{Y} \qquad\qquad \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2
\]
Back to computations.
Solving
\[
\hat{w}_\lambda = (\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top \hat{Y}
\]
requires essentially (using a direct solver)
- time $O(nd^2 + d^3)$,
- memory $O(nd \vee d^2)$.
What if $n \ll d$?
A simple observation
Using the SVD we can see that
\[
(\hat{X}^\top \hat{X} + \lambda I)^{-1} \hat{X}^\top = \hat{X}^\top (\hat{X} \hat{X}^\top + \lambda I)^{-1}.
\]
Then
\[
\hat{w}_\lambda = \hat{X}^\top (\hat{X} \hat{X}^\top + \lambda I)^{-1} \hat{Y}.
\]
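For $n \ll d$ the right-hand side only requires an $n \times n$ solve instead of a $d \times d$ one; an illustrative sketch comparing the two (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 500, 0.1            # n << d: the dual solve is much cheaper
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Primal: solve the d x d system (X^T X + lam I) w = X^T Y.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Dual: solve the n x n system (X X^T + lam I) c = Y, then w = X^T c.
c = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)
w_dual = X.T @ c

assert np.allclose(w_primal, w_dual)
```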
Note that
\[
\hat{w}_\lambda = \hat{X}^\top \underbrace{(\hat{X} \hat{X}^\top + \lambda I)^{-1} \hat{Y}}_{c \,\in\, \mathbb{R}^n} = \sum_{i=1}^n x_i\, c_i.
\]
The solution $\hat{w}_\lambda$ is a linear combination of the input points, with coefficient vector $c \in \mathbb{R}^n$.
Then
\[
\hat{f}_\lambda(x) = x^\top \hat{w}_\lambda = x^\top \hat{X}^\top c = \sum_{i=1}^n x^\top x_i\, c_i.
\]
The function we obtain is a linear combination of inner products.
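Equivalently, both training and prediction can be written purely in terms of inner products between inputs; a minimal sketch where only the $n \times n$ Gram matrix is used (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 500, 0.1
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
x_new = rng.standard_normal(d)       # a test point

# Train: only the n x n Gram matrix of inner products x_i^T x_j is needed.
K = X @ X.T
c = np.linalg.solve(K + lam * np.eye(n), Y)

# Predict: f(x) = sum_i (x^T x_i) c_i.
f_new = (X @ x_new) @ c

# Same prediction as with the explicit weight vector w = X^T c.
assert np.isclose(f_new, x_new @ (X.T @ c))
```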
TBD
- Beyond linear models.
- Optimization.
- Model selection.