
Mehryar Mohri

Foundations of Machine Learning 2015


Courant Institute of Mathematical Sciences
Homework assignment 3
November 24, 2015
Due: December 07, 2015

A. Boosting-type Algorithm

1. Show that for all $u \in \mathbb{R}$ and integer $p > 1$, $1_{u \le 0} \le \Phi_p(-u)$, where $\Phi_p(u) = \max((1+u)^p, 0)$. Show that $\Phi_p$ is convex and differentiable.

Solution: We first show that $1_{u \le 0} \le \Phi_p(-u)$ for all $u$. Observe that for $u > 0$, $1_{u \le 0} = 0 \le \Phi_p(-u)$ by definition of $\Phi_p$. For $u \le 0$, $-u \ge 0$ and $\Phi_p(-u) = (1-u)^p \ge 1 = 1_{u \le 0}$, which proves the desired statement.
Now we show that $\Phi_p$ is differentiable and convex. We consider two cases: $p$ even and $p$ odd. If $p$ is even, then $\Phi_p(u) = (1+u)^p$ for all $u$, since $(1+u)^p \ge 0$ for all $u$. Therefore, $\Phi_p'(u) = p(1+u)^{p-1}$ and $\Phi_p''(u) = p(p-1)(1+u)^{p-2}$. Moreover, $p(p-1)(1+u)^{p-2} \ge 0$ since $p-2$ is even and $p-1 > 0$. This shows that $\Phi_p$ is differentiable and convex in this case.
Now if $p$ is odd, then $\Phi_p(u) = 0$ for $u \in (-\infty, -1]$ and $\Phi_p(u) = (1+u)^p$ for $u \in (-1, \infty)$. Therefore, $\Phi_p'(u) = 0$ on $(-\infty, -1)$ and $\Phi_p'(u) = p(1+u)^{p-1}$ for $u \in (-1, \infty)$. To show that $\Phi_p$ is differentiable at $-1$, we consider the left and right derivatives:
$$\lim_{u \uparrow -1} \frac{\Phi_p(u) - \Phi_p(-1)}{u+1} = \lim_{u \uparrow -1} \frac{0}{u+1} = 0,
\qquad
\lim_{u \downarrow -1} \frac{\Phi_p(u) - \Phi_p(-1)}{u+1} = \lim_{u \downarrow -1} \frac{(1+u)^p}{1+u} = \lim_{u \downarrow -1} (1+u)^{p-1} = 0.$$
Similarly, we observe that $\Phi_p''(u) = 0$ on $(-\infty, -1)$ and $\Phi_p''(u) = p(p-1)(1+u)^{p-2}$ for $u \in (-1, \infty)$. Using the same arguments as for the first derivative and the fact that $p \ge 3$ since $p$ is odd, we get
$$\lim_{u \uparrow -1} \frac{\Phi_p'(u) - \Phi_p'(-1)}{u+1} = \lim_{u \uparrow -1} \frac{0}{u+1} = 0,
\qquad
\lim_{u \downarrow -1} \frac{\Phi_p'(u) - \Phi_p'(-1)}{u+1} = \lim_{u \downarrow -1} \frac{p(1+u)^{p-1}}{1+u} = \lim_{u \downarrow -1} p(1+u)^{p-2} = 0,$$
and hence $\Phi_p''(-1) = 0$. It follows that $\Phi_p''(u) \ge 0$ for all $u$ and $\Phi_p$ is convex.
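As a quick numerical sanity check (independent of the proof above), the short Python snippet below evaluates $\Phi_p$ on a grid and verifies both the domination of the 0-1 indicator and convexity via second differences; the grid and the choice $p = 3$ are arbitrary illustrative settings.

```python
import numpy as np

def phi_p(u, p):
    """Phi_p(u) = max((1 + u)^p, 0)."""
    return np.maximum((1.0 + u) ** p, 0.0)

p = 3                                   # any integer p > 1
u = np.linspace(-4.0, 4.0, 2001)

# 1_{u <= 0} <= Phi_p(-u) on the grid.
assert np.all((u <= 0).astype(float) <= phi_p(-u, p))

# Convexity: second differences of Phi_p are non-negative (up to rounding).
second_diff = phi_p(u[2:], p) - 2.0 * phi_p(u[1:-1], p) + phi_p(u[:-2], p)
assert np.all(second_diff >= -1e-9)
print("checks passed for p =", p)
```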

2. Use $\Phi_p$ to derive a boosting-type algorithm using coordinate descent. You should give a full description of your algorithm, including the pseudocode, details for the choice of the step and direction, as well as a generalization bound.

Solution: We assume that we have access to $N$ weak learners $h_1, \ldots, h_N$ and the goal is to learn an ensemble hypothesis $g = \sum_{j=1}^N \alpha_j h_j$ and predict according to $\operatorname{sgn}(g)$. Observe that
$$\frac{1}{m}\sum_{i=1}^m 1_{\operatorname{sgn}(g(x_i)) \ne y_i} = \frac{1}{m}\sum_{i=1}^m 1_{y_i g(x_i) \le 0} \le \frac{1}{m}\sum_{i=1}^m \Phi_p(-y_i g(x_i))$$

by the previous part of this question. Our boosting-type algorithm consists of applying coordinate descent to this convex and differentiable objective. If $F(\alpha_t) = \frac{1}{m}\sum_{i=1}^m \Phi_p(-y_i f_t(x_i))$ and $f_t = \sum_{s=1}^t \alpha_s h_s$ is the solution after $t$ iterations, then at iteration $t+1$ our algorithm picks the direction
$$h_k = \operatorname*{argmin}_{h_j\colon j=1,\ldots,N} F'(\alpha_t + \eta e_j)\Big|_{\eta=0},$$
where the derivative is taken with respect to $\eta$. That is,
$$h_k = \operatorname*{argmax}_{h_j\colon j=1,\ldots,N} \sum_{i=1}^m y_i h_j(x_i)\,(1 - y_i f_t(x_i))^{p-1}. \qquad (1)$$
Once the direction is determined, the step size $\alpha_{t+1}$ is set by solving
$$F'(\alpha_t + \eta e_k) = 0 \qquad (2)$$
for $\eta$. This solution can be found using line search. The pseudocode for this algorithm is given in Algorithm 1. Note that the $M_t(i)$ are used to avoid computing $y_i f_t(x_i)$ from scratch at every iteration $t$.
As an ensemble method, this algorithm enjoys the generalization bound of Corollary 6.1 from the textbook.

Algorithm 1 Boosting-type Algorithm.
Inputs: sample $((x_1, y_1), \ldots, (x_m, y_m))$.
for $i = 1$ to $m$ do
    $M_1(i) \leftarrow 0$
end for
for $t = 1$ to $T$ do
    $h_t \leftarrow$ solution of (1)
    $\alpha_t \leftarrow$ solution of (2)
    for $i = 1$ to $m$ do
        $M_{t+1}(i) \leftarrow M_t(i) + y_i \alpha_t h_t(x_i)$
    end for
end for
$g \leftarrow \sum_{t=1}^T \alpha_t h_t$
return: $h = \operatorname{sgn}(g)$.
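For concreteness, here is one possible Python rendering of Algorithm 1. The weak learners are assumed to be given through the matrix of their predictions on the sample, the direction is chosen by maximizing the sum in (1) (with the weight $(1 - y_i f_t(x_i))^{p-1}$ taken to be zero when $1 - y_i f_t(x_i) \le 0$, which is where $\Phi_p'$ vanishes for odd $p$), and the step is found by a bounded one-dimensional minimization of $F$. The value $p = 3$, the search interval, and the SciPy-based line search are illustrative choices, not part of the original pseudocode.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi_p(u, p):
    """Phi_p(u) = max((1 + u)^p, 0)."""
    return np.maximum((1.0 + u) ** p, 0.0)

def boost_phi_p(H, y, T=100, p=3):
    """Coordinate descent on F(alpha) = (1/m) sum_i Phi_p(-y_i f_t(x_i)).

    H : (m, N) matrix with H[i, j] = h_j(x_i) in {-1, +1}
    y : (m,) vector of labels in {-1, +1}
    Returns the weight vector alpha defining g = sum_j alpha_j h_j.
    """
    m, N = H.shape
    alpha = np.zeros(N)
    M = np.zeros(m)                        # M[i] tracks y_i f_t(x_i), as in Algorithm 1
    for _ in range(T):
        # Direction (1): argmax_j sum_i y_i h_j(x_i) (1 - y_i f_t(x_i))_+^{p-1}.
        weights = np.maximum(1.0 - M, 0.0) ** (p - 1)
        k = int(np.argmax((y * weights) @ H))
        # Step (2): choose eta minimizing F(alpha_t + eta e_k) by line search.
        def F(eta):
            return np.mean(phi_p(-(M + eta * y * H[:, k]), p))
        eta = minimize_scalar(F, bounds=(-10.0, 10.0), method="bounded").x
        alpha[k] += eta
        M += eta * y * H[:, k]
    return alpha
```

On the training sample, the returned ensemble predicts with `np.sign(H @ alpha)`; on a new point $x$, one computes $\operatorname{sgn}\big(\sum_j \alpha_j h_j(x)\big)$.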

B. L2-Regularized Maxent

This problem studies L2-regularized Maxent. We will use the notation introduced in class and will denote by $J_S$ the dual objective function minimized given a sample $S$:
$$J_S(w) = \frac{\lambda}{2}\|w\|_2^2 + \operatorname*{E}_{x\sim S}\big[-\log p_w[x]\big],$$
where $\lambda > 0$ is a regularization parameter. We will assume that the feature vector is bounded: $\|\Phi(x)\|_2 \le r$ for all $x \in \mathcal{X}$, for some $r > 0$.

1. Use McDiarmid's inequality to prove that for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds:
$$\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2 \le \sqrt{\frac{2r^2}{m}}\left(1 + \sqrt{\log\frac{1}{\delta}}\right).$$

Solution:
For any sample $S$, define $\Gamma(S) = \big\|\operatorname{E}_{x\sim D}[\Phi(x)] - \operatorname{E}_{x\sim S}[\Phi(x)]\big\|_2$. Let $S'$ be a sample differing from $S$ by one point, say $x_m$ in $S$ and $x_m'$ in $S'$. Then we can write
$$|\Gamma(S') - \Gamma(S)| \le \Big\|\operatorname*{E}_{x\sim S'}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2 \le \frac{1}{m}\|\Phi(x_m') - \Phi(x_m)\|_2 \le \frac{2r}{m}.$$
Thus, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1-\delta$,
$$\Gamma(S) \le \operatorname*{E}_{S\sim D^m}[\Gamma(S)] + \sqrt{\frac{2r^2}{m}\log\frac{1}{\delta}}.$$
Recall that $\operatorname{E}_{x\sim S}[\Phi(x)] = \frac{1}{m}\sum_{i=1}^m \Phi(x_i)$, and denote $X_i = \operatorname{E}_{x\sim D}[\Phi(x)] - \Phi(x_i)$, so that $\frac{1}{m}\sum_{i=1}^m X_i = \operatorname{E}_{x\sim D}[\Phi(x)] - \operatorname{E}_{x\sim S}[\Phi(x)]$.
Then, by Jensen's inequality,
$$
\begin{aligned}
\operatorname*{E}_{S\sim D^m}[\Gamma(S)]
&= \operatorname*{E}_{S\sim D^m}\Big[\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2\Big] \\
&\le \sqrt{\operatorname*{E}_{S\sim D^m}\Big[\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2^2\Big]} \\
&= \sqrt{\operatorname*{E}_{S\sim D^m}\Big[\frac{1}{m^2}\sum_{i,j=1}^m X_i \cdot X_j\Big]} \\
&= \sqrt{\frac{1}{m^2}\sum_{i=1}^m \operatorname{E}\big[\|X_i\|^2\big]}
&& (\text{for } i \ne j,\ \operatorname{E}[X_i \cdot X_j] = \operatorname{E}[X_i]\cdot\operatorname{E}[X_j] = 0) \\
&= \sqrt{\frac{1}{m}\operatorname{E}\big[\|X_1\|^2\big]}
&& (x_i\text{s drawn i.i.d.}) \\
&= \sqrt{\frac{\operatorname{E}[\|X_1\|^2] + \operatorname{E}[\|X_2\|^2]}{2m}} \\
&= \sqrt{\frac{\operatorname{E}[\|X_1 - X_2\|^2]}{2m}}
&& (\operatorname{E}[X_1 \cdot X_2] = \operatorname{E}[X_1]\cdot\operatorname{E}[X_2] = 0) \\
&\le \sqrt{\frac{(2r)^2}{2m}} = \sqrt{\frac{2r^2}{m}}.
\end{aligned}
$$
Combining the two inequalities yields the claim.
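The bound $\operatorname{E}_{S\sim D^m}[\Gamma(S)] \le \sqrt{2r^2/m}$ can also be checked numerically. The sketch below uses, purely for illustration, a distribution $D$ whose feature vectors lie on the sphere of radius $r$, and compares a Monte Carlo estimate of $\operatorname{E}[\Gamma(S)]$ to the bound; the dimension, sample size, and number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, trials = 5, 1.0, 100, 2000

def sample_features(n):
    """Draw n feature vectors uniformly from the sphere of radius r (illustrative D)."""
    v = rng.normal(size=(n, d))
    return r * v / np.linalg.norm(v, axis=1, keepdims=True)

mean_D = sample_features(200_000).mean(axis=0)        # proxy for E_{x~D}[Phi(x)]
gaps = [np.linalg.norm(mean_D - sample_features(m).mean(axis=0)) for _ in range(trials)]
print("Monte Carlo estimate of E[Gamma(S)]:", np.mean(gaps))
print("bound sqrt(2 r^2 / m)             :", np.sqrt(2 * r**2 / m))
```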

2. Let $\hat w$ be the L2-regularized maxent solution for a sample $S$ and $w_D$ the solution for an infinite sample:
$$\hat w = \operatorname*{argmin}_{w\in\mathbb{R}^N} J_S(w) \quad\text{and}\quad w_D = \operatorname*{argmin}_{w\in\mathbb{R}^N} J_D(w),$$
where $J_D(w) = \frac{\lambda}{2}\|w\|_2^2 + \operatorname*{E}_{x\sim D}\big[-\log p_w[x]\big]$. Use the definition of $\hat w$ and $w_D$ as minimizers (use gradients) to prove that the following inequality holds:
$$\|\hat w - w_D\|_2 \le \frac{\big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\big\|_2}{\lambda}.$$

Solution:
Define the function $Q$ for all $w$ by $Q(w) = \log Z = \log\big(\sum_x \exp(w\cdot\Phi(x))\big)$. $Q$ is convex as the composition of the log-sum-exp function with an affine function, and we can write for any $w$:
$$J_S(w) = \frac{\lambda}{2}\|w\|_2^2 - w\cdot\operatorname*{E}_{x\sim S}[\Phi(x)] + Q(w),
\qquad
J_D(w) = \frac{\lambda}{2}\|w\|_2^2 - w\cdot\operatorname*{E}_{x\sim D}[\Phi(x)] + Q(w).$$
Since the gradient of the objective function is zero at the minimum, we can write
$$\nabla J_S(\hat w) = 0 = \lambda\hat w - \operatorname*{E}_{x\sim S}[\Phi(x)] + \nabla Q(\hat w),
\qquad
\nabla J_D(w_D) = 0 = \lambda w_D - \operatorname*{E}_{x\sim D}[\Phi(x)] + \nabla Q(w_D).$$
Taking the difference yields:
$$\lambda(w_D - \hat w) = \operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)] + \nabla Q(\hat w) - \nabla Q(w_D).$$
Taking the inner product of each side with $w_D - \hat w$ gives:
$$
\begin{aligned}
\lambda\|w_D - \hat w\|_2^2
&= (w_D - \hat w)\cdot\Big[\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big] + \big(\nabla Q(\hat w) - \nabla Q(w_D)\big)\cdot(w_D - \hat w) \\
&\le (w_D - \hat w)\cdot\Big[\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big] \\
&\le \|w_D - \hat w\|_2\,\Big\|\operatorname*{E}_{x\sim D}[\Phi(x)] - \operatorname*{E}_{x\sim S}[\Phi(x)]\Big\|_2,
&& \text{(Cauchy-Schwarz inequality)}
\end{aligned}
$$
where we used $(\nabla Q(\hat w) - \nabla Q(w_D))\cdot(w_D - \hat w) \le 0$, which holds by the convexity of $Q$. Dividing both sides by $\lambda\|w_D - \hat w\|_2$ gives the result.
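On a small finite domain, both regularized maxent problems can be solved numerically, which makes this stability bound easy to verify. The sketch below runs plain gradient descent on the dual objectives $J_S$ and $J_D$; the domain size, feature map, regularization value, sample size, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, dim, lam, m = 20, 4, 0.5, 50

Phi = rng.normal(size=(n_points, dim))        # feature vectors Phi(x), one row per x in X
p_D = rng.dirichlet(np.ones(n_points))        # target distribution D over X
S = rng.choice(n_points, size=m, p=p_D)       # sample S drawn i.i.d. from D
mean_D = p_D @ Phi                            # E_{x~D}[Phi(x)]
mean_S = Phi[S].mean(axis=0)                  # E_{x~S}[Phi(x)]

def solve(mean_vec, steps=20_000, lr=0.02):
    """Gradient descent on J(w) = lam/2 ||w||^2 - w . mean_vec + log sum_x exp(w . Phi(x))."""
    w = np.zeros(dim)
    for _ in range(steps):
        z = Phi @ w
        p_w = np.exp(z - z.max())
        p_w /= p_w.sum()                      # Gibbs distribution p_w over X
        w -= lr * (lam * w - mean_vec + p_w @ Phi)
    return w

w_hat, w_D = solve(mean_S), solve(mean_D)
lhs = np.linalg.norm(w_hat - w_D)
rhs = np.linalg.norm(mean_S - mean_D) / lam
print(f"||w_hat - w_D|| = {lhs:.4f}  <=  ||E_S[Phi] - E_D[Phi]|| / lambda = {rhs:.4f}")
```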

3. For any $w$ and any distribution $Q$, define $L_Q(w)$ by $L_Q(w) = \operatorname{E}_{x\sim Q}[-\log p_w[x]]$. Show that
$$L_D(\hat w) - L_D(w_D) \le (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2.$$

Solution:
$$
\begin{aligned}
L_D(\hat w) - L_D(w_D)
&= L_D(\hat w) - L_S(\hat w) + L_S(\hat w) - L_D(w_D) \\
&= \hat w\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_S(\hat w) - L_D(w_D) \\
&= \hat w\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_S(\hat w) + \frac{\lambda}{2}\|\hat w\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 - L_D(w_D) \\
&\le \hat w\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_S(w_D) + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 - L_D(w_D) \\
&= (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2.
\end{aligned}
$$
The second equality uses $L_Q(w) = -w\cdot\operatorname{E}_{x\sim Q}[\Phi(x)] + Q(w)$, the inequality uses the fact that $\hat w$ minimizes $J_S$, and the last equality uses $L_S(w_D) - L_D(w_D) = w_D\cdot\big[\operatorname{E}_{x\sim D}[\Phi(x)] - \operatorname{E}_{x\sim S}[\Phi(x)]\big]$.

4. Use this to show that the following inequality holds for any $w$:
$$L_D(\hat w) \le \frac{1}{\lambda}\Big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big\|_2^2 + L_D(w) + \frac{\lambda}{2}\|w\|_2^2.$$

Solution:
In view of the previous inequality, for any $w$, we can write
$$
\begin{aligned}
L_D(\hat w)
&\le (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_D(w_D) + \frac{\lambda}{2}\|w_D\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 \\
&\le (\hat w - w_D)\cdot\Big[\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big] + L_D(w) + \frac{\lambda}{2}\|w\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 \\
&\le \frac{1}{\lambda}\Big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big\|_2^2 + L_D(w) + \frac{\lambda}{2}\|w\|_2^2 - \frac{\lambda}{2}\|\hat w\|_2^2 \\
&\le \frac{1}{\lambda}\Big\|\operatorname*{E}_{x\sim S}[\Phi(x)] - \operatorname*{E}_{x\sim D}[\Phi(x)]\Big\|_2^2 + L_D(w) + \frac{\lambda}{2}\|w\|_2^2.
\end{aligned}
$$
The second inequality holds because $w_D$ minimizes $J_D$, and the third combines the Cauchy-Schwarz inequality with the bound on $\|\hat w - w_D\|_2$ from question 2.

5. Conclude by proving that for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds:
$$L_D(\hat w) \le \inf_{w\in\mathbb{R}^N}\Big(L_D(w) + \frac{\lambda}{2}\|w\|_2^2\Big) + \frac{2r^2}{\lambda m}\left(1 + \sqrt{\log\frac{1}{\delta}}\right)^2.$$

Solution:
This follows immediately by applying the high-probability inequality of question 1 to bound the term $\frac{1}{\lambda}\big\|\operatorname{E}_{x\sim S}[\Phi(x)] - \operatorname{E}_{x\sim D}[\Phi(x)]\big\|_2^2$ in the inequality of question 4, and then taking the infimum over $w$.

C. Randomized Halving

In class, we showed that, in the realizable scenario (at least one expert is always correct), the number of mistakes made by Halving is upper bounded by $\log_2 N$. Here, we consider for the same realizable scenario a randomized version of Halving defined as follows.
As for Halving, let $H_t$ denote the set of remaining experts at the beginning of round $t$, with $H_1 = H$ the full set of $N$ experts. At each round, let $r_t$ be the fraction of experts in $H_t$ predicting 1. Then, the prediction $\hat y_t$ made by the algorithm is 1 with probability
$$p_t = \frac{1}{2}\log_2\Big(\frac{1}{1-r_t}\Big)\, 1_{r_t \le \frac{3}{4}} + 1_{r_t > \frac{3}{4}},$$
and 0 with probability $1 - p_t$. The true label $y_t$ is then received and $H_{t+1}$ is derived from $H_t$ by removing all experts who made a mistake.

1. Write the pseudocode of the algorithm.
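A minimal Python sketch of the randomized Halving procedure described above, together with a small simulation, is given below; the array-based expert-advice interface and the simulation parameters (one perfect expert among $N$ otherwise random ones) are illustrative assumptions.

```python
import math
import numpy as np

def randomized_halving(expert_preds, labels, rng):
    """Randomized Halving in the realizable setting (some expert never errs).

    expert_preds : (T, N) array of 0/1 predictions, expert_preds[t, j] = advice of expert j
    labels       : (T,) array of true 0/1 labels
    Returns the number of mistakes made over the T rounds.
    """
    T, N = expert_preds.shape
    active = np.ones(N, dtype=bool)                 # H_t: experts still alive
    mistakes = 0
    for t in range(T):
        r_t = expert_preds[t, active].mean()        # fraction of active experts predicting 1
        p_t = 1.0 if r_t > 0.75 else 0.5 * math.log2(1.0 / (1.0 - r_t))
        y_hat = int(rng.random() < p_t)             # predict 1 with probability p_t
        mistakes += int(y_hat != labels[t])
        active &= (expert_preds[t] == labels[t])    # drop every expert that erred
    return mistakes

# Small simulation: expert 0 is always correct, the other N - 1 answer at random.
rng = np.random.default_rng(0)
T, N = 200, 1024
labels = rng.integers(0, 2, size=T)
preds = rng.integers(0, 2, size=(T, N))
preds[:, 0] = labels
runs = [randomized_halving(preds, labels, np.random.default_rng(s)) for s in range(200)]
print("average mistakes:", np.mean(runs), " vs bound 0.5 * log2(N) =", 0.5 * math.log2(N))
```

The printed average can be compared with the bound proved in question 3 below.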

2. Define the potential function $\Phi_t = \log_2|H_t|$ and let $\mu_t = 1_{y_t \ne \hat y_t}$. Prove that for all $t \ge 1$, $\operatorname{E}[\mu_t] \le \frac{\Phi_t - \Phi_{t+1}}{2}$.

Solution:
If $y_t = 0$, then $\operatorname{E}[\mu_t] = p_t$. In round $t$, the $r_t|H_t|$ experts predicting 1 make a mistake and are removed. Thus, $|H_{t+1}| = (1-r_t)|H_t|$ and we can write
$$\frac{1}{2}(\Phi_t - \Phi_{t+1}) = \frac{1}{2}\log_2\frac{|H_t|}{|H_{t+1}|} = \frac{1}{2}\log_2\frac{1}{1-r_t} \ge \min\Big(\frac{1}{2}\log_2\frac{1}{1-r_t},\ 1\Big).$$
Observe that for $r_t > \frac{3}{4}$, $\frac{1}{2}\log_2\frac{1}{1-r_t} > \frac{1}{2}\log_2\frac{1}{1-3/4} = \frac{1}{2}\log_2 4 = 1$. Thus, $\min\big(\frac{1}{2}\log_2\frac{1}{1-r_t},\ 1\big) = p_t$ and $\frac{1}{2}(\Phi_t - \Phi_{t+1}) \ge p_t$.
If $y_t = 1$, then $\operatorname{E}[\mu_t] = 1 - p_t$. In round $t$, the $(1-r_t)|H_t|$ experts predicting 0 make a mistake and are removed (note that $r_t > 0$ in this case, since at least one active expert is always correct). Thus, $|H_{t+1}| = r_t|H_t|$ and we can write
$$\frac{1}{2}(\Phi_t - \Phi_{t+1}) = \frac{1}{2}\log_2\frac{1}{r_t} = -\frac{1}{2}\log_2 r_t = 1 - \frac{1}{2}\log_2(4r_t).$$
For $r_t > \frac{3}{4}$, we have $p_t = 1$ and, since $r_t \le 1$,
$$\frac{1}{2}(\Phi_t - \Phi_{t+1}) = 1 - \frac{1}{2}\log_2(4r_t) \ge 1 - \frac{1}{2}\log_2 4 = 0 = 1 - p_t.$$
For $r_t \le \frac{3}{4}$, using the fact that $x(1-x) \le \frac{1}{4}$ for all $x \in [0,1]$, we can write
$$1 - \frac{1}{2}\log_2(4r_t) = 1 - \frac{1}{2}\log_2\Big(\frac{4r_t(1-r_t)}{1-r_t}\Big) \ge 1 - \frac{1}{2}\log_2\Big(\frac{1}{1-r_t}\Big) = 1 - p_t.$$
In both cases, $\operatorname{E}[\mu_t] \le \frac{1}{2}(\Phi_t - \Phi_{t+1})$.

3. Show that the expected number of mistakes made by randomized Halving is at most $\frac{1}{2}\log_2 N$.

Solution:
In view of the previous question, and since $\Phi_t = \log_2|H_t| \ge 0$ for all $t$ in the realizable scenario,
$$\sum_{t\ge 1}\operatorname{E}[\mu_t] \le \frac{1}{2}\sum_{t\ge 1}(\Phi_t - \Phi_{t+1}) \le \frac{1}{2}\Phi_1 = \frac{1}{2}\log_2 N.$$

4. (Bonus question) Prove that no randomized algorithm makes fewer than $\big\lfloor\frac{1}{2}\log_2 N\big\rfloor$ mistakes, in expectation.
