
Mehryar Mohri

Foundations of Machine Learning 2015


Courant Institute of Mathematical Sciences
Homework assignment 3
November 24, 2015
Due: December 07, 2015

A. Boosting-type Algorithm

1. Show that for all $u \in \mathbb{R}$ and integer $p > 1$, $1_{u \le 0} \le \Phi_p(-u)$, where $\Phi_p(u) = \max((1+u)^p, 0)$. Show that $\Phi_p$ is convex and differentiable.

Solution: We first show that $1_{u \le 0} \le \Phi_p(-u)$ for all $u$. Observe that for $u > 0$, $1_{u \le 0} = 0 \le \Phi_p(-u)$ by definition of $\Phi_p$. For $u \le 0$, $-u \ge 0$ and $\Phi_p(-u) = (1-u)^p \ge 1 = 1_{u \le 0}$, which proves the desired statement.
Now we show that $\Phi_p$ is differentiable and convex. We consider two cases: $p$ even and $p$ odd. If $p$ is even, then $\Phi_p(u) = (1+u)^p$ for all $u$, since $(1+u)^p \ge 0$ for all $u$. Therefore, $\Phi_p'(u) = p(1+u)^{p-1}$ and $\Phi_p''(u) = p(p-1)(1+u)^{p-2}$. Moreover, $p(p-1)(1+u)^{p-2} \ge 0$ since $p-2$ is even and $p-1 > 0$. This shows that $\Phi_p$ is differentiable and convex in this case.
Now if $p$ is odd, then $\Phi_p(u) = 0$ for $u \in (-\infty, -1]$ and $\Phi_p(u) = (1+u)^p$ for $u \in (-1, \infty)$. Therefore, $\Phi_p'(u) = 0$ on $(-\infty, -1)$ and $\Phi_p'(u) = p(1+u)^{p-1}$ for $u \in (-1, \infty)$. To show that $\Phi_p$ is differentiable at $-1$, we consider the left and right derivatives:
$$\lim_{u \uparrow -1} \frac{\Phi_p(u) - \Phi_p(-1)}{u+1} = \lim_{u \uparrow -1} \frac{0}{u+1} = 0,
\qquad
\lim_{u \downarrow -1} \frac{\Phi_p(u) - \Phi_p(-1)}{u+1} = \lim_{u \downarrow -1} \frac{(1+u)^p}{1+u} = \lim_{u \downarrow -1} (1+u)^{p-1} = 0.$$
Similarly, we observe that $\Phi_p''(u) = 0$ on $(-\infty, -1)$ and $\Phi_p''(u) = p(p-1)(1+u)^{p-2}$ for $u \in (-1, \infty)$. Using the same arguments as for the first derivative and the fact that $p \ge 3$ since $p$ is odd, we get
$$\lim_{u \uparrow -1} \frac{\Phi_p'(u) - \Phi_p'(-1)}{u+1} = \lim_{u \uparrow -1} \frac{0}{u+1} = 0,
\qquad
\lim_{u \downarrow -1} \frac{\Phi_p'(u) - \Phi_p'(-1)}{u+1} = \lim_{u \downarrow -1} \frac{p(1+u)^{p-1}}{1+u} = \lim_{u \downarrow -1} p(1+u)^{p-2} = 0,$$
and hence $\Phi_p''(-1) = 0$. It follows that $\Phi_p''(u) \ge 0$ for all $u$ and $\Phi_p$ is convex.
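As a quick numerical sanity check (independent of the proof above), the short Python snippet below evaluates $\Phi_p$ on a grid and verifies both the domination of the 0-1 indicator and convexity via second differences; the grid and the choice $p = 3$ are arbitrary illustrative settings.

```python
import numpy as np

def phi_p(u, p):
    """Phi_p(u) = max((1 + u)^p, 0)."""
    return np.maximum((1.0 + u) ** p, 0.0)

p = 3                                   # any integer p > 1
u = np.linspace(-4.0, 4.0, 2001)

# 1_{u <= 0} <= Phi_p(-u) on the grid.
assert np.all((u <= 0).astype(float) <= phi_p(-u, p))

# Convexity: second differences of Phi_p are non-negative (up to rounding).
second_diff = phi_p(u[2:], p) - 2.0 * phi_p(u[1:-1], p) + phi_p(u[:-2], p)
assert np.all(second_diff >= -1e-9)
print("checks passed for p =", p)
```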

2. Use $\Phi_p$ to derive a boosting-type algorithm using coordinate descent. You should give a full description of your algorithm, including the pseudocode, details for the choice of the step and direction, as well as a generalization bound.

Solution: We assume that we have access to $N$ weak learners $h_1, \ldots, h_N$ and the goal is to learn an ensemble hypothesis $g = \sum_{j=1}^N \alpha_j h_j$ and predict according to $\operatorname{sgn}(g)$. Observe that
$$\frac{1}{m}\sum_{i=1}^m 1_{\operatorname{sgn}(g(x_i)) \ne y_i} = \frac{1}{m}\sum_{i=1}^m 1_{y_i g(x_i) \le 0} \le \frac{1}{m}\sum_{i=1}^m \Phi_p(-y_i g(x_i))$$

by the previous part of this question. Our boosting-type algorithm consists of applying coordinate descent to this convex and differentiable objective. If $F(\alpha_t) = \frac{1}{m}\sum_{i=1}^m \Phi_p(-y_i f_t(x_i))$ and $f_t = \sum_{s=1}^t \alpha_s h_s$ is the solution after $t$ iterations, then at iteration $t+1$ our algorithm picks the direction
$$h_k = \operatorname*{argmin}_{h_j\colon j=1,\ldots,N} F'(\alpha_t + \eta e_j)\Big|_{\eta=0},$$
where the derivative is taken with respect to $\eta$. That is,
$$h_k = \operatorname*{argmax}_{h_j\colon j=1,\ldots,N} \sum_{i=1}^m y_i h_j(x_i)\,(1 - y_i f_t(x_i))^{p-1}. \qquad (1)$$
Once the direction is determined, the step size $\alpha_{t+1}$ is set by solving
$$F'(\alpha_t + \eta e_k) = 0 \qquad (2)$$
for $\eta$. This solution can be found using line search. The pseudocode for this algorithm is given in Algorithm 1. Note that the $M_t(i)$ are used to avoid computing $y_i f_t(x_i)$ from scratch at every iteration $t$.
As an ensemble method, this algorithm enjoys the generalization bound of Corollary 6.1 from the textbook.

Algorithm 1 Boosting-type Algorithm.
Inputs: sample $((x_1, y_1), \ldots, (x_m, y_m))$.
for $i = 1$ to $m$ do
    $M_1(i) \leftarrow 0$
end for
for $t = 1$ to $T$ do
    $h_t \leftarrow$ solution of (1)
    $\alpha_t \leftarrow$ solution of (2)
    for $i = 1$ to $m$ do
        $M_{t+1}(i) \leftarrow M_t(i) + y_i \alpha_t h_t(x_i)$
    end for
end for
$g \leftarrow \sum_{t=1}^T \alpha_t h_t$
return: $h = \operatorname{sgn}(g)$.
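For concreteness, here is one possible Python rendering of Algorithm 1. The weak learners are assumed to be given through the matrix of their predictions on the sample, the direction is chosen by maximizing the sum in (1) (with the weight $(1 - y_i f_t(x_i))^{p-1}$ taken to be zero when $1 - y_i f_t(x_i) \le 0$, which is where $\Phi_p'$ vanishes for odd $p$), and the step is found by a bounded one-dimensional minimization of $F$. The value $p = 3$, the search interval, and the SciPy-based line search are illustrative choices, not part of the original pseudocode.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi_p(u, p):
    """Phi_p(u) = max((1 + u)^p, 0)."""
    return np.maximum((1.0 + u) ** p, 0.0)

def boost_phi_p(H, y, T=100, p=3):
    """Coordinate descent on F(alpha) = (1/m) sum_i Phi_p(-y_i f_t(x_i)).

    H : (m, N) matrix with H[i, j] = h_j(x_i) in {-1, +1}
    y : (m,) vector of labels in {-1, +1}
    Returns the weight vector alpha defining g = sum_j alpha_j h_j.
    """
    m, N = H.shape
    alpha = np.zeros(N)
    M = np.zeros(m)                        # M[i] tracks y_i f_t(x_i), as in Algorithm 1
    for _ in range(T):
        # Direction (1): argmax_j sum_i y_i h_j(x_i) (1 - y_i f_t(x_i))_+^{p-1}.
        weights = np.maximum(1.0 - M, 0.0) ** (p - 1)
        k = int(np.argmax((y * weights) @ H))
        # Step (2): choose eta minimizing F(alpha_t + eta e_k) by line search.
        def F(eta):
            return np.mean(phi_p(-(M + eta * y * H[:, k]), p))
        eta = minimize_scalar(F, bounds=(-10.0, 10.0), method="bounded").x
        alpha[k] += eta
        M += eta * y * H[:, k]
    return alpha
```

On the training sample, the returned ensemble predicts with `np.sign(H @ alpha)`; on a new point $x$, one computes $\operatorname{sgn}\big(\sum_j \alpha_j h_j(x)\big)$.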

B. L2-Regularized Maxent

This problem studies L2-regularized Maxent. We will use the notation introduced in class and will denote by $J_S$ the dual objective function minimized given a sample $S$:
$$J_S(w) = \frac{\lambda}{2}\|w\|_2^2 + \operatorname*{E}_{x\sim S}\big[-\log p_w[x]\big],$$
where $\lambda > 0$ is a regularization parameter. We will assume that the feature vector is bounded: $\|\Phi(x)\|_2 \le r$ for all $x \in \mathcal{X}$, for some $r > 0$.

1. Use McDiarmid's inequality to prove that for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds:
$$\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2 \le \sqrt{\frac{2r^2}{m}}\left(1 + \sqrt{\log\frac{1}{\delta}}\right).$$

Solution:
For any sample $S$, define $\Gamma(S) = \big\|\operatorname{E}_{x\sim D}[\Phi(x)] - \operatorname{E}_{x\sim S}[\Phi(x)]\big\|_2$. Let $S'$ be a sample differing from $S$ by one point, say $x_m$ in $S$ and $x_m'$ in $S'$. Then we can write
$$|\Gamma(S') - \Gamma(S)| \le \Big\|\operatorname*{E}_{x\sim S'}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2 \le \frac{1}{m}\|\Phi(x_m') - \Phi(x_m)\|_2 \le \frac{2r}{m}.$$
Thus, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1-\delta$,
$$\Gamma(S) \le \operatorname*{E}_{S\sim D^m}[\Gamma(S)] + \sqrt{\frac{2r^2}{m}\log\frac{1}{\delta}}.$$
Recall that $\operatorname{E}_{x\sim S}[\Phi(x)] = \frac{1}{m}\sum_{i=1}^m \Phi(x_i)$, and denote $X_i = \operatorname{E}_{x\sim D}[\Phi(x)] - \Phi(x_i)$, so that $\frac{1}{m}\sum_{i=1}^m X_i = \operatorname{E}_{x\sim D}[\Phi(x)] - \operatorname{E}_{x\sim S}[\Phi(x)]$.
Then, by Jensen's inequality,
$$
\begin{aligned}
\operatorname*{E}_{S\sim D^m}[\Gamma(S)]
&= \operatorname*{E}_{S\sim D^m}\Big[\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2\Big] \\
&\le \sqrt{\operatorname*{E}_{S\sim D^m}\Big[\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2^2\Big]} \\
&= \sqrt{\operatorname*{E}_{S\sim D^m}\Big[\frac{1}{m^2}\sum_{i,j=1}^m X_i \cdot X_j\Big]} \\
&= \sqrt{\frac{1}{m^2}\sum_{i=1}^m \operatorname{E}\big[\|X_i\|^2\big]}
&& (\text{for } i \ne j,\ \operatorname{E}[X_i \cdot X_j] = \operatorname{E}[X_i]\cdot\operatorname{E}[X_j] = 0) \\
&= \sqrt{\frac{1}{m}\operatorname{E}\big[\|X_1\|^2\big]}
&& (x_i\text{s drawn i.i.d.}) \\
&= \sqrt{\frac{\operatorname{E}[\|X_1\|^2] + \operatorname{E}[\|X_2\|^2]}{2m}} \\
&= \sqrt{\frac{\operatorname{E}[\|X_1 - X_2\|^2]}{2m}}
&& (\operatorname{E}[X_1 \cdot X_2] = \operatorname{E}[X_1]\cdot\operatorname{E}[X_2] = 0) \\
&\le \sqrt{\frac{(2r)^2}{2m}} = \sqrt{\frac{2r^2}{m}}.
\end{aligned}
$$
Combining the two inequalities yields the claim.
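The bound $\operatorname{E}_{S\sim D^m}[\Gamma(S)] \le \sqrt{2r^2/m}$ can also be checked numerically. The sketch below uses, purely for illustration, a distribution $D$ whose feature vectors lie on the sphere of radius $r$, and compares a Monte Carlo estimate of $\operatorname{E}[\Gamma(S)]$ to the bound; the dimension, sample size, and number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, trials = 5, 1.0, 100, 2000

def sample_features(n):
    """Draw n feature vectors uniformly from the sphere of radius r (illustrative D)."""
    v = rng.normal(size=(n, d))
    return r * v / np.linalg.norm(v, axis=1, keepdims=True)

mean_D = sample_features(200_000).mean(axis=0)        # proxy for E_{x~D}[Phi(x)]
gaps = [np.linalg.norm(mean_D - sample_features(m).mean(axis=0)) for _ in range(trials)]
print("Monte Carlo estimate of E[Gamma(S)]:", np.mean(gaps))
print("bound sqrt(2 r^2 / m)             :", np.sqrt(2 * r**2 / m))
```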

2. Let $\hat w$ be the L2-regularized maxent solution for a sample $S$ and $w_D$ the solution for an infinite sample:
$$\hat w = \operatorname*{argmin}_{w\in\mathbb{R}^N} J_S(w) \quad\text{and}\quad w_D = \operatorname*{argmin}_{w\in\mathbb{R}^N} J_D(w),$$
where $J_D(w) = \frac{\lambda}{2}\|w\|_2^2 + \operatorname*{E}_{x\sim D}\big[-\log p_w[x]\big]$. Use the definition of $\hat w$ and $w_D$ as minimizers (use gradients) to prove that the following inequality holds:
$$\|\hat w - w_D\|_2 \le \frac{\big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\big\|_2}{\lambda}.$$

Solution:
Define the function $Q$ for all $w$ by $Q(w) = \log Z = \log\big(\sum_x \exp(w\cdot\Phi(x))\big)$. $Q$ is convex as the composition of the log-sum-exp function with an affine function, and we can write for any $w$:
$$J_S(w) = \frac{\lambda}{2}\|w\|_2^2 - w\cdot\operatorname*{E}_{x\sim S}[\Phi(x)] + Q(w),
\qquad
J_D(w) = \frac{\lambda}{2}\|w\|_2^2 - w\cdot\operatorname*{E}_{x\sim D}[\Phi(x)] + Q(w).$$
Since the gradient of the objective function is zero at the minimum, we can write
$$\nabla J_S(\hat w) = 0 = \lambda\hat w - \operatorname*{E}_{x\sim S}[\Phi(x)] + \nabla Q(\hat w),
\qquad
\nabla J_D(w_D) = 0 = \lambda w_D - \operatorname*{E}_{x\sim D}[\Phi(x)] + \nabla Q(w_D).$$
Taking the difference yields:
$$\lambda(w_D - \hat w) = \operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)] + \nabla Q(\hat w) - \nabla Q(w_D).$$
Taking the inner product of each side with $w_D - \hat w$ gives:
$$
\begin{aligned}
\lambda\|w_D - \hat w\|_2^2
&= (w_D - \hat w)\cdot\Big[\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big] + \big(\nabla Q(\hat w) - \nabla Q(w_D)\big)\cdot(w_D - \hat w) \\
&\le (w_D - \hat w)\cdot\Big[\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big] \\
&\le \|w_D - \hat w\|_2\,\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2,
&& \text{(Cauchy-Schwarz inequality)}
\end{aligned}
$$
where we used $(\nabla Q(\hat w) - \nabla Q(w_D))\cdot(w_D - \hat w) \le 0$, which holds by the convexity of $Q$. Dividing both sides by $\lambda\|w_D - \hat w\|_2$ gives the result.
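On a small finite domain, both regularized maxent problems can be solved numerically, which makes this stability bound easy to verify. The sketch below runs plain gradient descent on the dual objectives $J_S$ and $J_D$; the domain size, feature map, regularization value, sample size, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, dim, lam, m = 20, 4, 0.5, 50

Phi = rng.normal(size=(n_points, dim))        # feature vectors Phi(x), one row per x in X
p_D = rng.dirichlet(np.ones(n_points))        # target distribution D over X
S = rng.choice(n_points, size=m, p=p_D)       # sample S drawn i.i.d. from D
mean_D = p_D @ Phi                            # E_{x~D}[Phi(x)]
mean_S = Phi[S].mean(axis=0)                  # E_{x~S}[Phi(x)]

def solve(mean_vec, steps=20_000, lr=0.02):
    """Gradient descent on J(w) = lam/2 ||w||^2 - w . mean_vec + log sum_x exp(w . Phi(x))."""
    w = np.zeros(dim)
    for _ in range(steps):
        z = Phi @ w
        p_w = np.exp(z - z.max())
        p_w /= p_w.sum()                      # Gibbs distribution p_w over X
        w -= lr * (lam * w - mean_vec + p_w @ Phi)
    return w

w_hat, w_D = solve(mean_S), solve(mean_D)
lhs = np.linalg.norm(w_hat - w_D)
rhs = np.linalg.norm(mean_S - mean_D) / lam
print(f"||w_hat - w_D|| = {lhs:.4f}  <=  ||E_S[Phi] - E_D[Phi]|| / lambda = {rhs:.4f}")
```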

3. For any $w$ and any distribution $Q$, define $L_Q(w)$ by $L_Q(w) = \operatorname{E}_{x\sim Q}[-\log p_w[x]]$. Show that
$$L_D(\hat w) - L_D(w_D) \le (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2.$$

Solution:
$$
\begin{aligned}
L_D(\hat w) - L_D(w_D)
&= L_D(\hat w) - L_S(\hat w) + L_S(\hat w) - L_D(w_D) \\
&= \hat w\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_S(\hat w) - L_D(w_D) \\
&= \hat w\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_S(\hat w) + \frac{\lambda}{2}\|\hat w\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 - L_D(w_D) \\
&\le \hat w\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_S(w_D) + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 - L_D(w_D) \\
&= (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2.
\end{aligned}
$$
The second equality uses $L_Q(w) = -w\cdot\operatorname{E}_{x\sim Q}[\Phi(x)] + Q(w)$, the inequality uses the fact that $\hat w$ minimizes $J_S$, and the last equality uses $L_S(w_D) - L_D(w_D) = w_D\cdot\big[\operatorname{E}_{x\sim D}[\Phi(x)] - \operatorname{E}_{x\sim S}[\Phi(x)]\big]$.

4. Use this to show that the following inequality holds for any $w$:
$$L_D(\hat w) \le \frac{1}{\lambda}\Big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big\|_2^2 + L_D(w) + \frac{\lambda}{2}\|w\|_2^2.$$

Solution:
In view of the previous inequality, for any $w$, we can write
$$
\begin{aligned}
L_D(\hat w)
&\le (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_D(w_D) + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 \\
&\le (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_D(w) + \frac{\lambda}{2}\|w\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 \\
&\le \frac{1}{\lambda}\Big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big\|_2^2 + L_D(w) + \frac{\lambda}{2}\|w\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 \\
&\le \frac{1}{\lambda}\Big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big\|_2^2 + L_D(w) + \frac{\lambda}{2}\|w\|_2^2.
\end{aligned}
$$
The second inequality holds because $w_D$ minimizes $J_D$, and the third combines the Cauchy-Schwarz inequality with the bound on $\|\hat w - w_D\|_2$ from question 2.

5. Conclude by proving that for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds:
$$L_D(\hat w) \le \inf_{w\in\mathbb{R}^N}\Big(L_D(w) + \frac{\lambda}{2}\|w\|_2^2\Big) + \frac{2r^2}{\lambda m}\left(1 + \sqrt{\log\frac{1}{\delta}}\right)^2.$$

Solution:
This follows immediately by applying the high-probability inequality of question 1 to bound the term $\frac{1}{\lambda}\big\|\operatorname{E}_{x\sim S}[\Phi(x)] - \operatorname{E}_{x\sim D}[\Phi(x)]\big\|_2^2$ in the inequality of question 4, and then taking the infimum over $w$.

C. Randomized Halving

In class, we showed that, in the realizable scenario (at least one expert is always correct), the number of mistakes made by Halving is upper bounded by $\log_2 N$. Here, we consider for the same realizable scenario a randomized version of Halving defined as follows.
As for Halving, let $H_t$ denote the set of remaining experts at the beginning of round $t$, with $H_1 = H$ the full set of $N$ experts. At each round, let $r_t$ be the fraction of experts in $H_t$ predicting 1. Then, the prediction $\hat y_t$ made by the algorithm is 1 with probability
$$p_t = \frac{1}{2}\log_2\Big(\frac{1}{1-r_t}\Big)\, 1_{r_t \le \frac{3}{4}} + 1_{r_t > \frac{3}{4}},$$
and 0 with probability $1 - p_t$. The true label $y_t$ is then received and $H_{t+1}$ is derived from $H_t$ by removing all experts who made a mistake.

1. Write the pseudocode of the algorithm.
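A minimal Python sketch of the randomized Halving procedure described above, together with a small simulation, is given below; the array-based expert-advice interface and the simulation parameters (one perfect expert among $N$ otherwise random ones) are illustrative assumptions.

```python
import math
import numpy as np

def randomized_halving(expert_preds, labels, rng):
    """Randomized Halving in the realizable setting (some expert never errs).

    expert_preds : (T, N) array of 0/1 predictions, expert_preds[t, j] = advice of expert j
    labels       : (T,) array of true 0/1 labels
    Returns the number of mistakes made over the T rounds.
    """
    T, N = expert_preds.shape
    active = np.ones(N, dtype=bool)                 # H_t: experts still alive
    mistakes = 0
    for t in range(T):
        r_t = expert_preds[t, active].mean()        # fraction of active experts predicting 1
        p_t = 1.0 if r_t > 0.75 else 0.5 * math.log2(1.0 / (1.0 - r_t))
        y_hat = int(rng.random() < p_t)             # predict 1 with probability p_t
        mistakes += int(y_hat != labels[t])
        active &= (expert_preds[t] == labels[t])    # drop every expert that erred
    return mistakes

# Small simulation: expert 0 is always correct, the other N - 1 answer at random.
rng = np.random.default_rng(0)
T, N = 200, 1024
labels = rng.integers(0, 2, size=T)
preds = rng.integers(0, 2, size=(T, N))
preds[:, 0] = labels
runs = [randomized_halving(preds, labels, np.random.default_rng(s)) for s in range(200)]
print("average mistakes:", np.mean(runs), " vs bound 0.5 * log2(N) =", 0.5 * math.log2(N))
```

The printed average can be compared with the bound proved in question 3 below.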

2. Define the potential function $\Phi_t = \log_2|H_t|$ and let $\mu_t = 1_{y_t \ne \hat y_t}$. Prove that for all $t \ge 1$, $\operatorname{E}[\mu_t] \le \frac{\Phi_t - \Phi_{t+1}}{2}$.

Solution:
If $y_t = 0$, then $\operatorname{E}[\mu_t] = p_t$. In round $t$, the $r_t|H_t|$ experts predicting 1 make a mistake and are removed. Thus, $|H_{t+1}| = (1-r_t)|H_t|$ and we can write
$$\frac{1}{2}(\Phi_t - \Phi_{t+1}) = \frac{1}{2}\log_2\frac{|H_t|}{|H_{t+1}|} = \frac{1}{2}\log_2\frac{1}{1-r_t} \ge \min\Big(\frac{1}{2}\log_2\frac{1}{1-r_t},\ 1\Big).$$
Observe that for $r_t > \frac{3}{4}$, $\frac{1}{2}\log_2\frac{1}{1-r_t} > \frac{1}{2}\log_2\frac{1}{1-3/4} = \frac{1}{2}\log_2 4 = 1$. Thus, $\min\big(\frac{1}{2}\log_2\frac{1}{1-r_t},\ 1\big) = p_t$ and $\frac{1}{2}(\Phi_t - \Phi_{t+1}) \ge p_t$.
If $y_t = 1$, then $\operatorname{E}[\mu_t] = 1 - p_t$. In round $t$, the $(1-r_t)|H_t|$ experts predicting 0 make a mistake and are removed (note that $r_t > 0$ in this case, since at least one active expert is always correct). Thus, $|H_{t+1}| = r_t|H_t|$ and we can write
$$\frac{1}{2}(\Phi_t - \Phi_{t+1}) = \frac{1}{2}\log_2\frac{1}{r_t} = -\frac{1}{2}\log_2 r_t = 1 - \frac{1}{2}\log_2(4r_t).$$
For $r_t > \frac{3}{4}$, we have $p_t = 1$ and, since $r_t \le 1$,
$$\frac{1}{2}(\Phi_t - \Phi_{t+1}) = 1 - \frac{1}{2}\log_2(4r_t) \ge 1 - \frac{1}{2}\log_2 4 = 0 = 1 - p_t.$$
For $r_t \le \frac{3}{4}$, using the fact that $x(1-x) \le \frac{1}{4}$ for all $x \in [0,1]$, we can write
$$1 - \frac{1}{2}\log_2(4r_t) = 1 - \frac{1}{2}\log_2\Big(\frac{4r_t(1-r_t)}{1-r_t}\Big) \ge 1 - \frac{1}{2}\log_2\Big(\frac{1}{1-r_t}\Big) = 1 - p_t.$$
In both cases, $\operatorname{E}[\mu_t] \le \frac{1}{2}(\Phi_t - \Phi_{t+1})$.

3. Show that the expected number of mistakes made by randomized Halving is at most $\frac{1}{2}\log_2 N$.

Solution:
In view of the previous question, and since $\Phi_t = \log_2|H_t| \ge 0$ for all $t$ in the realizable scenario,
$$\sum_{t\ge 1}\operatorname{E}[\mu_t] \le \frac{1}{2}\sum_{t\ge 1}(\Phi_t - \Phi_{t+1}) \le \frac{1}{2}\Phi_1 = \frac{1}{2}\log_2 N.$$

4. (Bonus question) Prove that no randomized algorithm makes fewer than $\big\lfloor\frac{1}{2}\log_2 N\big\rfloor$ mistakes, in expectation.
