Sol3 2015
A. Boosting-type Algorithm
1. Show that for all u ∈ R and integer p > 1, 1_{u≤0} ≤ Φ_p(−u), where
Φ_p(u) = max((1 + u)^p, 0). Show that Φ_p is convex and differentiable.
Solution: We first show that 1_{u≤0} ≤ Φ_p(−u) for all u. Observe that for
u > 0, 1_{u≤0} = 0 ≤ Φ_p(−u) by definition of Φ_p. For u ≤ 0, −u ≥ 0 and
Φ_p(−u) = (1 − u)^p ≥ 1 = 1_{u≤0}, which proves the desired statement.
Now we show that Φ_p(u) is differentiable and convex. We consider two
cases: p even and p odd. If p is even, then Φ_p(u) = (1 + u)^p for
all u since (1 + u)^p ≥ 0 for all u. Therefore, Φ'_p(u) = p(1 + u)^{p−1} and
Φ''_p(u) = p(p − 1)(1 + u)^{p−2}. Moreover, p(p − 1)(1 + u)^{p−2} ≥ 0 since
p − 2 is even and p − 1 > 0. This shows that Φ_p(u) is differentiable
and convex in this case.
Now if p is odd, then Φ_p(u) = 0 for u ∈ (−∞, −1] and Φ_p(u) = (1 + u)^p
for u ∈ (−1, ∞). Therefore, Φ'_p(u) = 0 on (−∞, −1) and Φ'_p(u) =
p(1 + u)^{p−1} for u ∈ (−1, ∞). To show that Φ_p is differentiable at −1
we consider left and right derivatives:

lim_{u↑−1} [Φ_p(u) − Φ_p(−1)]/(u + 1) = lim_{u↑−1} 0/(u + 1) = 0

lim_{u↓−1} [Φ_p(u) − Φ_p(−1)]/(u + 1) = lim_{u↓−1} (1 + u)^p/(1 + u) = lim_{u↓−1} (1 + u)^{p−1} = 0.
Similarly, we observe that Φ''_p(u) = 0 on (−∞, −1) and Φ''_p(u) =
p(p − 1)(1 + u)^{p−2} for u ∈ (−1, ∞). Using the same arguments as for the
first derivative and the fact that p ≥ 3 since p is odd, we get

lim_{u↑−1} [Φ'_p(u) − Φ'_p(−1)]/(u + 1) = lim_{u↑−1} 0/(u + 1) = 0

lim_{u↓−1} [Φ'_p(u) − Φ'_p(−1)]/(u + 1) = lim_{u↓−1} p(1 + u)^{p−1}/(1 + u) = lim_{u↓−1} p(1 + u)^{p−2} = 0
and hence Φ''_p(−1) = 0. It follows that Φ''_p(u) ≥ 0 for all u, so Φ_p is
convex.
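As a quick numerical sanity check of these two properties (not part of the proof, and with all names below chosen for illustration), the snippet verifies with finite differences that Φ'_p(−1) = 0 and tests midpoint convexity on a grid for the odd case p = 3; the even case is analogous.

import numpy as np

def phi(u, p):
    # Phi_p(u) = max((1 + u)^p, 0)
    return np.maximum((1.0 + u) ** p, 0.0)

p = 3          # odd exponent, the delicate case in the proof
h = 1e-6       # finite-difference step

# Left and right difference quotients at u = -1 should both vanish.
left = (phi(-1.0, p) - phi(-1.0 - h, p)) / h
right = (phi(-1.0 + h, p) - phi(-1.0, p)) / h
print(left, right)  # both ~0, consistent with Phi_p'(-1) = 0

# Convexity check: midpoint inequality on a grid of pairs (u, v).
us = np.linspace(-3.0, 2.0, 201)
U, V = np.meshgrid(us, us)
assert np.all(phi((U + V) / 2, p) <= (phi(U, p) + phi(V, p)) / 2 + 1e-12)
print("midpoint convexity holds on the grid")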
That is,

h_t = argmax_{h_j : j=1,...,N} Σ_{i=1}^m y_i h_j(x_i) (1 − y_i f_t(x_i))^{p−1}.    (1)
Once the direction is determined, the step size α_t is set by solving

(d/dη) Σ_{i=1}^m Φ_p(−y_i (f_t(x_i) + η h_t(x_i))) = 0    (2)

for η. This solution can be found using line search. The pseudocode
for this algorithm is given in Algorithm 1. Note that the M_t(i)'s are used
to avoid recomputing y_i f_t(x_i) from scratch at every iteration t.
As an ensemble method, this algorithm enjoys the generalization bound of
Corollary 6.1 from the textbook.
Algorithm 1 Boosting-type Algorithm.
Inputs: sample ((x_1, y_1), . . . , (x_m, y_m)).
for i = 1 to m do
    M_1(i) ← 0
end for
for t = 1 to T do
    h_t ← solution of (1)
    α_t ← solution of (2)
    for i = 1 to m do
        M_{t+1}(i) ← M_t(i) + y_i α_t h_t(x_i)
    end for
end for
g ← Σ_{t=1}^T α_t h_t
return: h = sgn(g).
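For concreteness, here is a minimal Python sketch of Algorithm 1 (an illustration only, not part of the solution): the base hypotheses are assumed to be given as ±1-valued functions, and the step size uses a bounded line search in place of an exact solution of (2); the function names and the line-search bracket are arbitrary choices.

import numpy as np
from scipy.optimize import minimize_scalar

def phi(u, p):
    # Phi_p(u) = max((1 + u)^p, 0)
    return np.maximum((1.0 + u) ** p, 0.0)

def boost(X, y, hypotheses, p=3, T=20):
    # X: (m, d) array of points; y: (m,) array of labels in {-1, +1};
    # hypotheses: list of N functions, each mapping X to a (m,) array in {-1, +1}.
    m = len(y)
    H = np.array([h(X) for h in hypotheses])   # (N, m) base predictions
    M = np.zeros(m)                            # M_t(i) = y_i f_t(x_i)
    alphas, chosen = [], []
    for t in range(T):
        w = (1.0 - M) ** (p - 1)               # per-point weights from (1)
        k = int(np.argmax(H @ (y * w)))        # direction h_t: equation (1)
        hk = H[k]
        # step size alpha_t: line search on eta for the Phi_p-loss,
        # standing in for (2); the bracket [0, 10] is an arbitrary assumption
        obj = lambda eta: phi(-(M + eta * y * hk), p).sum()
        alpha = minimize_scalar(obj, bounds=(0.0, 10.0), method="bounded").x
        alphas.append(alpha); chosen.append(k)
        M = M + alpha * y * hk                 # update M_{t+1}(i)
    # final hypothesis h = sgn(g), with g = sum_t alpha_t h_t
    return lambda Xn: np.sign(sum(a * hypotheses[k](Xn) for a, k in zip(alphas, chosen)))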
B. L2-Regularized Maxent
This problem studies L2-regularized Maxent. We will use the notation intro-
duced in class and will denote by J_S the dual objective function minimized
given a sample S:

J_S(w) = (λ/2)∥w∥₂² + E_{x∼S}[−log p_w[x]].
1. Use McDiarmid’s inequality to prove that for any δ > 0, with proba-
bility at least 1 − δ, the following inequality holds:
∥E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂ ≤ √(2r²/m) (1 + √(log(1/δ))).
Solution:
For any sample S, define Γ(S) = ∥E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂. Let
S′ be a sample differing from S by one point, say x_m in S and x′_m in
S′. Then, since ∥Φ(x)∥₂ ≤ r for all x, by the triangle inequality we can write

|Γ(S′) − Γ(S)| ≤ ∥E_{x∼S′}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂ = (1/m)∥Φ(x′_m) − Φ(x_m)∥₂ ≤ 2r/m.
Thus, by McDiarmid's inequality, for any δ > 0, with probability at least 1 − δ,

Γ(S) ≤ E_{S∼D^m}[Γ(S)] + √((2r²/m) log(1/δ)).
Recall that E_{x∼S}[Φ(x)] = (1/m) Σ_{i=1}^m Φ(x_i), and denote X_i = E_{x∼D}[Φ(x)] − Φ(x_i), so that (1/m) Σ_{i=1}^m X_i = E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)].
Then, by Jensen's inequality,

E_{S∼D^m}[Γ(S)] = E_{S∼D^m}[ ∥E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂ ]
    ≤ √( E_{S∼D^m}[ ∥E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂² ] )
    = √( (1/m²) Σ_{i,j=1}^m E_{S∼D^m}[X_i · X_j] )
    = √( (1/m²) Σ_{i=1}^m E[∥X_i∥²] )        (for i ≠ j, E[X_i · X_j] = E[X_i] · E[X_j] = 0)
    = √( (1/m) E[∥X_1∥²] )                   (the x_i's are drawn i.i.d.)
    = √( (E[∥X_1∥²] + E[∥X_2∥²]) / (2m) )
    = √( E[∥X_1 − X_2∥²] / (2m) )            (E[X_1 · X_2] = E[X_1] · E[X_2] = 0)
    ≤ √( (2r)² / (2m) ) = √(2r²/m).

Combining this bound on the expectation with the high-probability bound above proves the claimed inequality.
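This concentration result can be illustrated with a small Monte Carlo experiment (a sketch only, not part of the solution; the distribution, feature map, and parameter values below are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
r, m, delta, trials = 1.0, 500, 0.05, 2000

# Hypothetical setup: D uniform over a finite set of feature vectors
# rescaled so that ||Phi(x)||_2 = r.
Phi = rng.normal(size=(20, 5))
Phi *= r / np.linalg.norm(Phi, axis=1, keepdims=True)
mean_D = Phi.mean(axis=0)                    # E_{x~D}[Phi(x)]

devs = []
for _ in range(trials):
    S = rng.integers(0, len(Phi), size=m)    # i.i.d. sample of size m
    devs.append(np.linalg.norm(mean_D - Phi[S].mean(axis=0)))

bound = np.sqrt(2 * r**2 / m) * (1 + np.sqrt(np.log(1 / delta)))
print(f"(1 - delta)-quantile of deviation: {np.quantile(devs, 1 - delta):.4f}")
print(f"high-probability bound:            {bound:.4f}")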
2. Show that λ∥w_D − ŵ∥₂ ≤ ∥E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂, where
ŵ = argmin_{w∈R^N} J_S(w) and w_D = argmin_{w∈R^N} J_D(w).
Solution:
Define the function Q for all w by Q(w) = log Z = log( Σ_x exp(w · Φ(x)) ).
Q is convex as the composition of the log-sum-exp function with a linear
function, and we can write for any w:

J_S(w) = (λ/2)∥w∥₂² − w · E_{x∼S}[Φ(x)] + Q(w)
J_D(w) = (λ/2)∥w∥₂² − w · E_{x∼D}[Φ(x)] + Q(w).

Since ŵ minimizes J_S and w_D minimizes J_D, we have ∇J_S(ŵ) = 0 and
∇J_D(w_D) = 0; subtracting these two gradient equations yields

λ(w_D − ŵ) = E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)] + ∇Q(ŵ) − ∇Q(w_D).
Taking the inner product with w_D − ŵ gives

λ∥w_D − ŵ∥₂² = (w_D − ŵ) · [E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]] + (∇Q(ŵ) − ∇Q(w_D)) · (w_D − ŵ)
    ≤ (w_D − ŵ) · [E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]]        (convexity of Q: its gradient is monotone)
    ≤ ∥w_D − ŵ∥₂ ∥E_{x∼D}[Φ(x)] − E_{x∼S}[Φ(x)]∥₂.       (Cauchy-Schwarz ineq.)

Dividing both sides by ∥w_D − ŵ∥₂ (the inequality is trivial when w_D = ŵ) proves the claim.
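The inequality just proved can be checked numerically on a toy problem (a sketch only; the domain size, feature draws, λ, and sample size are arbitrary assumptions, and both duals are minimized with scipy):

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n, d, lam, m = 10, 3, 0.5, 50          # |X|, dim(Phi), lambda, sample size

Phi = rng.normal(size=(n, d))          # feature vector Phi(x) for each x in X
D = rng.dirichlet(np.ones(n))          # target distribution D over X
mean_D = D @ Phi                       # E_{x~D}[Phi(x)]
S = rng.choice(n, size=m, p=D)         # i.i.d. sample S ~ D^m
mean_S = Phi[S].mean(axis=0)           # E_{x~S}[Phi(x)]

def J(w, mean):                        # dual objective with Q(w) = log Z
    return lam / 2 * w @ w - w @ mean + logsumexp(Phi @ w)

w_hat = minimize(lambda w: J(w, mean_S), np.zeros(d)).x
w_D = minimize(lambda w: J(w, mean_D), np.zeros(d)).x

lhs = lam * np.linalg.norm(w_D - w_hat)
rhs = np.linalg.norm(mean_D - mean_S)
print(lhs <= rhs + 1e-6, lhs, rhs)     # stability bound from question 2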
3. For any w and any distribution Q, define L_Q(w) by L_Q(w) = E_{x∼Q}[−log p_w[x]].
Show that

L_D(ŵ) − L_D(w_D) ≤ (ŵ − w_D) · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + (λ/2)∥w_D∥₂² − (λ/2)∥ŵ∥₂².
Solution:
Since L_Q(w) = −w · E_{x∼Q}[Φ(x)] + Q(w) for any distribution Q, we have
L_D(ŵ) − L_S(ŵ) = ŵ · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]]. Thus, we can write

L_D(ŵ) − L_D(w_D)
    = L_D(ŵ) − L_S(ŵ) + L_S(ŵ) − L_D(w_D)
    = ŵ · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + L_S(ŵ) − L_D(w_D)
    = ŵ · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + L_S(ŵ) + (λ/2)∥ŵ∥₂² − (λ/2)∥ŵ∥₂² − L_D(w_D)
    ≤ ŵ · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + L_S(w_D) + (λ/2)∥w_D∥₂² − (λ/2)∥ŵ∥₂² − L_D(w_D)    (J_S(ŵ) ≤ J_S(w_D))
    = (ŵ − w_D) · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + (λ/2)∥w_D∥₂² − (λ/2)∥ŵ∥₂²,

where the last equality uses L_S(w_D) − L_D(w_D) = −w_D · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]].
4. Use this to show that the following inequality holds for any w:

L_D(ŵ) ≤ (1/λ)∥E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]∥₂² + L_D(w) + (λ/2)∥w∥₂².
Solution:
In view of the previous inequality, for any w, we can write

L_D(ŵ)
    ≤ (ŵ − w_D) · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + L_D(w_D) + (λ/2)∥w_D∥₂² − (λ/2)∥ŵ∥₂²
    ≤ (ŵ − w_D) · [E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]] + L_D(w) + (λ/2)∥w∥₂² − (λ/2)∥ŵ∥₂²    (w_D minimizes J_D)
    ≤ (1/λ)∥E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]∥₂² + L_D(w) + (λ/2)∥w∥₂² − (λ/2)∥ŵ∥₂²    (Cauchy-Schwarz and question 2)
    ≤ (1/λ)∥E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]∥₂² + L_D(w) + (λ/2)∥w∥₂².
5. Conclude by proving that for any δ > 0, with probability at least 1 − δ,
the following inequality holds:

L_D(ŵ) ≤ inf_{w∈R^N} { L_D(w) + (λ/2)∥w∥₂² } + (2r²/(λm)) (1 + √(log(1/δ)))².
Solution:
This follows immediately from applying the inequality of question 1 to
bound ∥E_{x∼S}[Φ(x)] − E_{x∼D}[Φ(x)]∥₂ in the inequality of question 4,
then taking the infimum over w on the right-hand side.
C. Randomized Halving
In class, we showed that, in the realizable scenario (at least one expert is
always correct), the number of mistakes made by Halving is upper bounded
by log2 N . Here, we consider for the same realizable scenario a randomized
version of Halving defined as follows.
As for Halving, let H_t denote the set of remaining experts at the begin-
ning of round t, with H_1 = H the full set of N experts. At each round, let r_t
be the fraction of experts in H_t predicting 1. Then, the prediction ŷ_t made
by the algorithm is 1 with probability

p_t = (1/2) log2(1/(1 − r_t)) 1_{r_t ≤ 3/4} + 1_{r_t > 3/4},

and 0 with probability 1 − p_t.
2. Define the potential function Φ_t = log2 |H_t| and let µ_t = 1_{y_t ≠ ŷ_t}.
Prove that for all t ≥ 1, E[µ_t] ≤ (Φ_t − Φ_{t+1})/2.
Solution:
If y_t = 0, then E[µ_t] = p_t. In round t, the r_t|H_t| experts predicting 1
make a mistake and are removed. Thus, |H_{t+1}| = (1 − r_t)|H_t| and we can write

(1/2)(Φ_t − Φ_{t+1}) = (1/2) log2(|H_t|/|H_{t+1}|) = (1/2) log2(1/(1 − r_t)) ≥ min((1/2) log2(1/(1 − r_t)), 1) = p_t = E[µ_t].
If y_t = 1, then E[µ_t] = 1 − p_t. In round t, the (1 − r_t)|H_t| experts
predicting 0 make a mistake and are removed, so |H_{t+1}| = r_t|H_t| and
(1/2)(Φ_t − Φ_{t+1}) = (1/2) log2(1/r_t) = 1 − (1/2) log2(4r_t). If r_t > 3/4,
then p_t = 1 and E[µ_t] = 0 ≤ (1/2)(Φ_t − Φ_{t+1}). Otherwise, since
4r_t(1 − r_t) ≤ 1 for all r_t, we can write

1 − (1/2) log2(4r_t) = 1 − (1/2) log2(4r_t(1 − r_t)/(1 − r_t))
    ≥ 1 − (1/2) log2(1/(1 − r_t))
    = 1 − p_t,

which proves E[µ_t] ≤ (Φ_t − Φ_{t+1})/2 in this case as well.
3. Conclude that the expected number of mistakes made by this randomized
algorithm is upper bounded by (1/2) log2 N.
Solution:
In view of the previous question, since the sum of potential differences
telescopes and Φ_t ≥ 0 for all t,

Σ_{t≥1} E[µ_t] ≤ (1/2) Σ_{t≥1} (Φ_t − Φ_{t+1}) ≤ (1/2) Φ_1 = (1/2) log2 N.
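To illustrate this bound empirically, here is a short simulation of the randomized variant (an illustration only: the expert pool is synthetic, with realizability enforced by making one expert always correct, and all parameter choices are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N, T, runs = 1024, 200, 2000                    # experts, rounds, Monte Carlo runs

def p_of(r):
    # p_t = (1/2) log2(1/(1 - r_t)) if r_t <= 3/4, else 1
    return 1.0 if r > 0.75 else 0.5 * np.log2(1.0 / (1.0 - r))

mistakes = []
for _ in range(runs):
    experts = rng.integers(0, 2, size=(N, T))   # each expert's predictions
    truth = experts[0]                          # realizable: expert 0 is always correct
    alive = np.ones(N, dtype=bool)              # membership in H_t
    mist = 0
    for t in range(T):
        preds = experts[:, t]
        r = preds[alive].mean()                 # fraction of H_t predicting 1
        yhat = int(rng.random() < p_of(r))      # predict 1 with probability p_t
        mist += int(yhat != truth[t])
        alive &= (preds == truth[t])            # remove mistaken experts
    mistakes.append(mist)

print(f"avg mistakes: {np.mean(mistakes):.3f}  bound: {0.5 * np.log2(N):.3f}")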