Problems Chap7
e-Chapter 7
Pierre Paquay
Problem 7.1
To solve this problem, we first separate the positive decision region into two components: the lower one corresponding to $x_2 \in [-1, 1]$ and the upper one corresponding to $x_2 \in [1, 2]$. To define the decision region, we need 7 perceptrons: three ($h_1$, $h_2$, $h_3$) for the horizontal lines and, for the vertical lines,

$$h_4(x) = \operatorname{sign}(x_1 + 2),\quad h_5(x) = \operatorname{sign}(x_1 + 1),\quad h_6(x) = \operatorname{sign}(x_1 - 1),\quad h_7(x) = \operatorname{sign}(x_1 - 2).$$

We are now able to define the lower decision region by $h_2h_3h_4h_7$ and the upper decision region by $h_1h_2h_5h_6$, which means that the total decision region is defined by

$$f = h_2h_3h_4h_7 + h_1h_2h_5h_6.$$
Problem 7.2
(a) Let $x$ and $x'$ be two points from the same region. If we consider a set of $M$ hyperplanes defined by $\{x : w_i^Tx = 0\}$, we have that $x$ and $x'$ lie on the same side of each hyperplane, or put more simply that $\operatorname{sign}(w_i^Tx) = \operatorname{sign}(w_i^Tx') = s_i$ for $i = 1, \cdots, M$, where $s_i = \pm 1$. We begin with the case where $s_i = 1$. Here, we know that $w_i^Tx > 0$ and $w_i^Tx' > 0$; consequently we have that, for $\lambda \in [0, 1]$,

$$w_i^T(\lambda x + (1-\lambda)x') = \lambda w_i^Tx + (1-\lambda)w_i^Tx' > 0$$

and

$$\operatorname{sign}(w_i^T(\lambda x + (1-\lambda)x')) = 1.$$

Now, we consider the case where $s_i = -1$. Here, we know that $w_i^Tx < 0$ and $w_i^Tx' < 0$; consequently we have that, for $\lambda \in [0, 1]$,

$$w_i^T(\lambda x + (1-\lambda)x') = \lambda w_i^Tx + (1-\lambda)w_i^Tx' < 0$$

and

$$\operatorname{sign}(w_i^T(\lambda x + (1-\lambda)x')) = -1.$$

So, in conclusion, every point on the segment joining $x$ and $x'$ lies in the same region, which means the region is convex.
(b) A region is defined as the following set

$$\{x : \operatorname{sign}(w_i^Tx) = s_i \text{ for } i = 1, \cdots, M\},$$

thus a region is characterized by a particular $M$-tuple $(s_1, \cdots, s_M)$. Since there are at most $2^M$ such $M$-tuples, we have at most $2^M$ different regions.
(c) Let $B(M, d)$ be the maximum number of regions created by $M$ hyperplanes in $d$-dimensional space; we claim that $B(M, d) \le \sum_{i=0}^{d}\binom{M}{i}$. Now, consider adding an $(M+1)$th hyperplane; this hyperplane can obviously be viewed as a $(d-1)$-dimensional space, so if we intersect the initial $M$ hyperplanes with it, we obtain $M$ hyperplanes in a $(d-1)$-dimensional space. These hyperplanes can create at most $B(M, d-1)$ regions in this space, and each of these regions splits one region of the original $d$-dimensional space into two. Thus, the $(M+1)$th hyperplane intersects at most $B(M, d-1)$ of the regions created by the $M$ hyperplanes in the $d$-dimensional space, and so

$$B(M+1, d) \le B(M, d) + B(M, d-1)$$

for all $d$. Now, we assume the statement is true for $M = M_0$ and all $d$; we will prove that the statement is still true for $M = M_0 + 1$ and all $d$. We have that

$$B(M_0+1, d) \le B(M_0, d) + B(M_0, d-1) \le \sum_{i=0}^{d}\binom{M_0}{i} + \sum_{i=0}^{d-1}\binom{M_0}{i} = \binom{M_0}{0} + \sum_{i=1}^{d}\left[\binom{M_0}{i} + \binom{M_0}{i-1}\right] = \sum_{i=0}^{d}\binom{M_0+1}{i}.$$

We have thus proved the induction step, so the statement is true for all $M$ and $d$.
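To get a feel for this bound, here is a small numpy sketch (my own illustration, not part of the original solution; the dimension, the number of hyperplanes, and the sampling box are arbitrary choices): it counts the distinct sign patterns realized by random points with respect to $M$ random hyperplanes and compares the count with $\sum_{i=0}^{d}\binom{M}{i}$ and with the cruder bound $2^M$.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, M = 3, 6                      # dimension and number of hyperplanes (arbitrary choice)
W = rng.normal(size=(M, d))      # hyperplane normals
b = rng.normal(size=M)           # offsets, so the hyperplanes are {x : W @ x + b = 0}

# Sample many points and record which side of each hyperplane they fall on.
X = rng.uniform(-10, 10, size=(200000, d))
signs = np.sign(X @ W.T + b)     # shape (n_points, M), entries in {-1, +1}
n_regions = len({tuple(s) for s in signs})

bound = sum(comb(M, i) for i in range(d + 1))   # B(M, d) <= sum_{i=0}^d C(M, i)
print(f"observed regions: {n_regions}, bound: {bound}, 2^M = {2**M}")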
Problem 7.3
Now the condition is also sufficient: since $h_m^{c_m}(x) = +1$ exactly when $h_m(x) = c_m$, we have

$$x \in r \iff (h_1(x), \cdots, h_M(x)) = (c_1, \cdots, c_M) \iff h_m^{c_m}(x) = +1, \ \forall m \iff \prod_{m=1}^{M}h_m^{c_m}(x) = +1 \iff t_r(x) = +1.$$

And if $x$ is in a negative region ($f(x) = -1$), we know that $x \notin r_i$ for all $i$, so $t_{r_i}(x) = -1$ for all $i$, which means that

$$t_{r_1}(x) + \cdots + t_{r_k}(x) = -1 = f(x),$$

where, as in Problem 7.1, $+$ denotes the OR of perceptrons.
Problem 7.4
From Problem 7.3, we may write $f = \operatorname{sign}(k - \frac12 + \sum_{i=1}^{k}t_{r_i})$, which characterizes the penultimate layer of our perceptron. For the layer before, we have that $t_{r_i} = h_1^{c_1^{(i)}}\cdots h_M^{c_M^{(i)}}$, and consequently

$$t_{r_i} = \operatorname{sign}\left(-M + \frac12 + \sum_{m=1}^{M}h_m^{c_m^{(i)}}\right);$$

moreover, the previous layer may be characterized with

$$h_m^{c_m^{(i)}} = \operatorname{sign}(c_m^{(i)}w_m^Tx).$$
Putting all this together, we obtain the following characterization of a 3-layer perceptron

$$f = \operatorname{sign}\left(k - \frac12 + \sum_{i=1}^{k}\operatorname{sign}\left(-M + \frac12 + \sum_{m=1}^{M}\operatorname{sign}(c_m^{(i)}w_m^Tx)\right)\right).$$
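To make the construction concrete, here is a small numpy sketch of the 3-layer sign network above (the hyperplanes, the single region pattern $c^{(1)}$, and the test points are my own toy choices; only the formula itself comes from the solution).

import numpy as np

def three_layer_perceptron(x, W, C):
    """W: (M, dim) hyperplane weights (x is assumed to carry a bias coordinate).
    C: (k, M) sign patterns of the k positive regions.
    Implements f = sign(k - 1/2 + sum_i sign(-M + 1/2 + sum_m sign(c_m^(i) w_m^T x)))."""
    k, M = C.shape
    first = np.sign(C * (W @ x))                    # shape (k, M): sign(c_m^(i) w_m^T x)
    second = np.sign(-M + 0.5 + first.sum(axis=1))  # one unit per region: AND of the M signs
    return np.sign(k - 0.5 + second.sum())          # OR over the k positive regions

# Toy example in 2D with a bias coordinate x = (1, x1, x2): the positive set is the
# unit square [0,1]^2, described by a single region with pattern (+1, +1, +1, +1).
W = np.array([[0., 1., 0.],    # x1 >= 0
              [1., -1., 0.],   # x1 <= 1
              [0., 0., 1.],    # x2 >= 0
              [1., 0., -1.]])  # x2 <= 1
C = np.array([[1., 1., 1., 1.]])
for p in [(0.5, 0.5), (1.5, 0.5), (-0.2, 0.3)]:
    x = np.array([1.0, *p])
    print(p, three_layer_perceptron(x, W, C))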
Problem 7.5
First, we decompose the unit hypercube $[0,1]^d$ into $1/\epsilon^d$ $\epsilon$-hypercubes (hypercubes whose sides have length equal to $\epsilon$); thus we get a grid-like structure on our unit hypercube. Now, if we consider a decision region (which may be composed of disconnected regions) whose boundary surfaces are smooth, this decision region partitions the unit hypercube into two regions: one labelled $+1$ and one labelled $-1$. We now have $k$ $\epsilon$-hypercubes labelled $+1$, each of which is formed by $2d$ hyperplanes defined by $h_m^{(i)} = \operatorname{sign}(w_m^{(i)T}x)$ where $m = 1, \cdots, 2d$ and $i = 1, \cdots, k$. So, the first layer, whose task is to activate the hyperplanes involved in the positive $\epsilon$-hypercubes, is characterized by

$$h_m^{(i)} = \operatorname{sign}(w_m^{(i)T}x).$$
Now, to activate the positive $\epsilon$-hypercubes $H_i$ themselves, we characterize the second layer by

$$t_{H_i} = (h_1^{(i)})^{c_1^{(i)}}\cdots(h_{2d}^{(i)})^{c_{2d}^{(i)}},$$

where the $c_m^{(i)}$ are defined as in Problems 7.3 and 7.4; or

$$t_{H_i} = \operatorname{sign}\left(-2d + \frac12 + \sum_{m=1}^{2d}(h_m^{(i)})^{c_m^{(i)}}\right).$$
And finally, to activate all the positive $\epsilon$-hypercubes, we define the MLP output $h$ by

$$h = t_{H_1} + \cdots + t_{H_k};$$

or

$$h = \operatorname{sign}\left(k - \frac12 + \sum_{i=1}^{k}t_{H_i}\right).$$

Putting all this together, we obtain the following characterization of a 3-layer perceptron

$$h = \operatorname{sign}\left(k - \frac12 + \sum_{i=1}^{k}\operatorname{sign}\left(-2d + \frac12 + \sum_{m=1}^{2d}\operatorname{sign}(c_m^{(i)}w_m^{(i)T}x)\right)\right).$$
Now, it remains to see that the above MLP can arbitrarily closely approximate the initial positive decision region $D^+$ (and consequently the negative decision region also); to do so, we first note that

$$\operatorname{Vol}(H_i) = \epsilon^d \to 0 \quad\text{and}\quad k \to \infty$$

when $\epsilon \to 0$. So, the $\epsilon$-hypercubes can be made arbitrarily small, which means that the total volume of the positive $\epsilon$-hypercubes can be made arbitrarily close to the volume of the positive decision region (because of its smoothness). Mathematically, we may write that

$$\operatorname{Vol}(H_1\cup\cdots\cup H_k) = \sum_{i=1}^{k}\epsilon^d \to \operatorname{Vol}(D^+)$$

when $\epsilon \to 0$. This means that the region where our 3-layer perceptron outputs $+1$ (resp. $-1$) converges to the positive (resp. negative) decision region in our unit hypercube.
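A quick numerical illustration of this volume argument (a sketch; the disk target in $[0,1]^2$ and the grid resolutions are my own choices): cover the unit square with $\epsilon$-squares, keep those entirely inside a smooth positive region, and watch their total area converge to the region's area as $\epsilon \to 0$.

import numpy as np

def covered_area(eps):
    """Total area of the eps-squares of [0,1]^2 entirely contained in the disk
    of radius 0.4 centered at (0.5, 0.5)."""
    n = int(round(1.0 / eps))
    count = 0
    for i in range(n):
        for j in range(n):
            # a square lies inside the (convex) disk iff all four of its corners do
            corners = [(i * eps, j * eps), ((i + 1) * eps, j * eps),
                       (i * eps, (j + 1) * eps), ((i + 1) * eps, (j + 1) * eps)]
            if all((cx - 0.5) ** 2 + (cy - 0.5) ** 2 <= 0.4 ** 2 for cx, cy in corners):
                count += 1
    return count * eps ** 2

for eps in [0.1, 0.05, 0.02, 0.01]:
    print(f"eps = {eps:5.2f}  covered area = {covered_area(eps):.4f}")
print(f"true area     = {np.pi * 0.4**2:.4f}")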
Problem 7.6
For a specific layer $l$, if we replace the weight $w_{ij}^{(l)}$ with $w_{ij}^{(l)} + \epsilon$, we need to recompute the corresponding node output of that layer and also the node outputs of the subsequent layers (which are the ones numbered from $l+1$ to $L$). Consequently, for each weight $w_{ij}^{(l)}$, we have

$$\sum_{k=l+1}^{L}d^{(k)}(d^{(k-1)}+1) \quad\text{and}\quad 1 + \sum_{k=l+1}^{L}d^{(k)}$$

multiplications and $\theta$-evaluations respectively; this means that the computational complexity of obtaining the partial derivatives is overall equal to

$$2\sum_{l=1}^{L}d^{(l)}(d^{(l-1)}+1)\left(\sum_{k=l+1}^{L}d^{(k)}(d^{(k-1)}+1) + 1 + \sum_{k=l+1}^{L}d^{(k)}\right) \le 2|W|\Bigg(\underbrace{\sum_{k=1}^{L}d^{(k)}(d^{(k-1)}+1)}_{=|W|} + 1 + \sum_{k=1}^{L}\underbrace{d^{(k)}}_{\le\, d^{(k)}(d^{(k-1)}+1)}\Bigg) \le 2|W|(2|W|+1) = O(|W|^2),$$

since we need to evaluate the network for both $w_{ij}^{(l)}+\epsilon$ and $w_{ij}^{(l)}-\epsilon$ (hence the overall factor of 2).
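The quadratic scaling is easy to observe numerically. The sketch below is my own illustration (not the text's scheme): it recomputes the whole forward pass for every perturbed weight, which is also $O(|W|^2)$, just with a larger constant than the partial recomputation above, and it counts the multiplications needed to form all central-difference partial derivatives for a few network sizes.

import numpy as np

def forward(x, weights):
    """Forward pass of a tanh MLP; returns the output and the number of multiplications used."""
    mults = 0
    for W in weights:
        x = np.concatenate(([1.0], x))   # add the bias coordinate
        s = W.T @ x
        mults += W.size
        x = np.tanh(s)
    return x, mults

def count_numerical_gradient_mults(x, weights, eps=1e-4):
    """Count the multiplications spent on the forward passes needed to form all
    central-difference partial derivatives (two perturbed passes per weight)."""
    total = 0
    for W in weights:
        for idx in np.ndindex(W.shape):
            for sign in (+1, -1):
                W[idx] += sign * eps
                _, m = forward(x, weights)   # naive full recomputation per perturbation
                total += m
                W[idx] -= sign * eps         # restore the weight
    return total

for sizes in [(5, 5, 1), (10, 10, 1), (20, 20, 1)]:
    dims = (3,) + sizes
    weights = [np.random.randn(dims[i] + 1, dims[i + 1]) for i in range(len(sizes))]
    n_weights = sum(W.size for W in weights)
    cost = count_numerical_gradient_mults(np.random.randn(3), weights)
    print(f"|W| = {n_weights:4d}  multiplications = {cost:8d}  2|W|^2 = {2 * n_weights**2}")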
Problem 7.7
We may write

$$E_{in} = \frac1N\operatorname{trace}(YY^T - ZVY^T - YV^TZ^T + ZVV^TZ^T) = \frac1N\operatorname{trace}(YY^T - 2ZVY^T + ZVV^TZ^T),$$

since $\operatorname{trace}(A) = \operatorname{trace}(A^T)$. We are now ready to compute the derivatives; we have

$$\frac{\partial\operatorname{trace}(AXB)}{\partial X} = A^TB^T \quad\text{and}\quad \frac{\partial\operatorname{trace}(AXX^TB)}{\partial X} = BAX + A^TB^TX.$$
We also have

$$E_{in} = \frac1N\operatorname{trace}\big(YY^T - 2(V_0 + \theta(XW)V_1)Y^T + (V_0 + \theta(XW)V_1)(V_0^T + V_1^T\theta(XW)^T)\big)$$
$$= \frac1N\operatorname{trace}\big(YY^T - 2V_0Y^T - 2\theta(XW)V_1Y^T + V_0V_0^T + V_0V_1^T\theta(XW)^T + \theta(XW)V_1V_0^T + \theta(XW)V_1V_1^T\theta(XW)^T\big)$$
$$= \frac1N\operatorname{trace}\big(YY^T - 2V_0Y^T - 2\theta(XW)V_1Y^T + V_0V_0^T + 2\theta(XW)V_1V_0^T + V_1V_1^T\theta(XW)^T\theta(XW)\big),$$

since the trace can be permuted in a cycle and $\operatorname{trace}(A) = \operatorname{trace}(A^T)$. The other derivative may be written as

$$\frac{\partial E_{in}}{\partial W} = \frac1N\left(-2\frac{\partial\operatorname{trace}(\theta(XW)V_1Y^T)}{\partial W} + 2\frac{\partial\operatorname{trace}(\theta(XW)V_1V_0^T)}{\partial W} + \frac{\partial\operatorname{trace}(V_1V_1^T\theta(XW)^T\theta(XW))}{\partial W}\right)$$
$$= \frac1N\left(-2X^T[\theta'(XW)\otimes YV_1^T] + 2X^T[\theta'(XW)\otimes V_0V_1^T] + X^T[\theta'(XW)\otimes 2\theta(XW)V_1V_1^T]\right)$$
$$= \frac2N X^T\left[\theta'(XW)\otimes(-YV_1^T + V_0V_1^T + \theta(XW)V_1V_1^T)\right].$$
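As a sanity check of this last formula (a sketch with my own random dimensions, taking $\theta = \tanh$ and reading $\otimes$ as the element-wise product), the snippet below compares the closed-form gradient with central finite differences.

import numpy as np

rng = np.random.default_rng(1)
N, p, h, q = 8, 3, 4, 2
X = rng.normal(size=(N, p))
W = rng.normal(size=(p, h))
V1 = rng.normal(size=(h, q))
V0 = np.ones((N, 1)) @ rng.normal(size=(1, q))   # bias term, replicated over the N rows
Y = rng.normal(size=(N, q))

def Ein(W):
    R = V0 + np.tanh(X @ W) @ V1 - Y
    return np.trace(R @ R.T) / N

# closed form: (2/N) X^T [ theta'(XW) o ( theta(XW) V1 V1^T + V0 V1^T - Y V1^T ) ]
S = X @ W
analytic = (2 / N) * X.T @ ((1 - np.tanh(S) ** 2) * ((np.tanh(S) @ V1 + V0 - Y) @ V1.T))

# central finite differences
numeric = np.zeros_like(W)
eps = 1e-6
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    numeric[idx] = (Ein(Wp) - Ein(Wm)) / (2 * eps)

print("max abs difference:", np.max(np.abs(analytic - numeric)))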
Problem 7.8
(a) By hypothesis, we know that $\{\eta_1, \eta_2, \eta_3\}$ with $\eta_1 < \eta_2 < \eta_3$ is a U-arrangement, which means that

$$E(\eta_1) \ge E(\eta_2) \quad\text{and}\quad E(\eta_2) \le E(\eta_3).$$

Since $E(\eta)$ is a quadratic curve, we know that it is decreasing (resp. increasing) to the left (resp. right) of its minimum $\bar\eta$. So if we assume that $\bar\eta < \eta_1$, we get that $E(\eta_1) \le E(\eta_2) \le E(\eta_3)$, which is impossible by definition of a U-arrangement; and if we assume that $\bar\eta > \eta_3$, we get that $E(\eta_1) \ge E(\eta_2) \ge E(\eta_3)$, which is also impossible by definition of a U-arrangement. Consequently, we have $\bar\eta \in [\eta_1, \eta_3]$.
(b) First, we solve the linear system in $a$, $b$, and $c$ below

$$\begin{cases}a\eta_1^2 + b\eta_1 + c = e_1\\ a\eta_2^2 + b\eta_2 + c = e_2\\ a\eta_3^2 + b\eta_3 + c = e_3.\end{cases}$$

By Cramer's rule, with

$$D = \begin{vmatrix}\eta_1^2 & \eta_1 & 1\\ \eta_2^2 & \eta_2 & 1\\ \eta_3^2 & \eta_3 & 1\end{vmatrix},$$

we get

$$a = \frac1D\begin{vmatrix}e_1 & \eta_1 & 1\\ e_2 & \eta_2 & 1\\ e_3 & \eta_3 & 1\end{vmatrix} = \frac{(e_1-e_2)(\eta_1-\eta_3) - (e_1-e_3)(\eta_1-\eta_2)}{D}$$

and

$$b = \frac1D\begin{vmatrix}\eta_1^2 & e_1 & 1\\ \eta_2^2 & e_2 & 1\\ \eta_3^2 & e_3 & 1\end{vmatrix} = \frac{-(e_1-e_2)(\eta_1^2-\eta_3^2) + (e_1-e_3)(\eta_1^2-\eta_2^2)}{D}.$$

Since the minimum of such a quadratic function is given by $-b/2a$, we finally get

$$\bar\eta = \frac12\left[\frac{(e_1-e_2)(\eta_1^2-\eta_3^2) - (e_1-e_3)(\eta_1^2-\eta_2^2)}{(e_1-e_2)(\eta_1-\eta_3) - (e_1-e_3)(\eta_1-\eta_2)}\right].$$
In this case, we can use this new $\eta_2'$ in place of $\eta_2$ and proceed with the algorithm.
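Here is a small sketch of this interpolation step (the test function and the bracketing triple are my own example; the formula for $\bar\eta$ is the one derived above). On an exact quadratic, the step lands on the true minimum in one shot.

import numpy as np

def parabolic_step(eta, e):
    """Given a U-arrangement eta = (eta1, eta2, eta3) with errors e = (e1, e2, e3),
    return the minimizer of the interpolating parabola."""
    (n1, n2, n3), (e1, e2, e3) = eta, e
    num = (e1 - e2) * (n1**2 - n3**2) - (e1 - e3) * (n1**2 - n2**2)
    den = (e1 - e2) * (n1 - n3) - (e1 - e3) * (n1 - n2)
    return 0.5 * num / den

E = lambda eta: 3 * (eta - 1.7)**2 + 0.5
etas = (0.0, 1.0, 4.0)                                    # a U-arrangement around 1.7
print(parabolic_step(etas, tuple(E(n) for n in etas)))    # -> 1.7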
Problem 7.9
(a) Since $w$ is uniformly sampled in the unit cube, we may write that

$$P[E(w) \le E(w^*) + \epsilon] = P\left[\frac12(w-w^*)^TH(w-w^*) \le \epsilon\right] = \int_{(w-w^*)^TH(w-w^*)\le 2\epsilon}dw_1\cdots dw_d = \int_{x^THx\le 2\epsilon}\underbrace{\left|\det\frac{\partial w}{\partial x}\right|}_{=1}dx_1\cdots dx_d,$$

where we have made the change of variables $x = w - w^*$. As $H$ is positive definite and symmetric, we know that there exists an orthogonal matrix $A$ such that $H = A\,\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)\,A^T$. Thus, if we use $y = A^Tx$ as a change of variables, we now get that

$$P[E(w) \le E(w^*) + \epsilon] = \int_{x^THx\le 2\epsilon}dx_1\cdots dx_d = \int_{y^T\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)y\le 2\epsilon}\underbrace{\left|\det\frac{\partial x}{\partial y}\right|}_{=|\det A|=1}dy_1\cdots dy_d.$$

We now use a third change of variables $z = \mathrm{diag}(\lambda_1,\cdots,\lambda_d)\,y$; in this case we obtain

$$P[E(w) \le E(w^*) + \epsilon] = \int_{y^T\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)y\le 2\epsilon}dy_1\cdots dy_d = \int_{z^Tz\le 2\epsilon}\underbrace{\left|\det\frac{\partial y}{\partial z}\right|}_{=\frac{1}{|\lambda_1\cdots\lambda_d|}=\frac{1}{\sqrt{\det H}}}dz_1\cdots dz_d = \frac{1}{\sqrt{\det H}}\int_{z^Tz\le 2\epsilon}dz_1\cdots dz_d = \frac{S_d(2\epsilon)}{\sqrt{\det H}}.$$
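A quick Monte Carlo check of this result in low dimension (a sketch under my own assumptions: $S_d(2\epsilon)$ is read as the volume of the ball $\{z : z^Tz \le 2\epsilon\}$, the quadratic expansion is exact, $w^*$ is well inside the unit cube, and $\epsilon$ is small enough that the whole ellipsoid fits in the cube):

import numpy as np
from math import gamma, pi

rng = np.random.default_rng(2)
d, eps = 3, 5e-3
w_star = np.full(d, 0.5)                 # minimizer, well inside the unit cube
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                  # a positive definite Hessian

# empirical estimate of P[ (1/2)(w - w*)^T H (w - w*) <= eps ] for w uniform in [0,1]^d
W = rng.uniform(0, 1, size=(2_000_000, d))
D = W - w_star
emp = np.mean(0.5 * np.einsum('ni,ij,nj->n', D, H, D) <= eps)

# volume of the ball {z : z^T z <= 2*eps}, i.e. radius sqrt(2*eps), divided by sqrt(det H)
ball = pi**(d / 2) * (2 * eps)**(d / 2) / gamma(d / 2 + 1)
print(f"empirical: {emp:.2e}   formula: {ball / np.sqrt(np.linalg.det(H)):.2e}")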
Moreover, since the points $w_1, \cdots, w_N$ are sampled independently, we have that

$$P[E(w_{\min}) > E(w^*) + \epsilon] = P[(E(w_1) > E(w^*)+\epsilon)\cap\cdots\cap(E(w_N) > E(w^*)+\epsilon)] = \prod_{i=1}^{N}P[E(w_i) > E(w^*)+\epsilon] = (1 - P[E(w_1) \le E(w^*)+\epsilon])^N = \left(1 - \frac{S_d(2\epsilon)}{\sqrt{\det H}}\right)^N.$$
Approximating the ratio $S_d(2\epsilon)/\sqrt{\det H}$, we get

$$P[E(w_{\min}) > E(w^*)+\epsilon] = \left(1-\frac{S_d(2\epsilon)}{\sqrt{\det H}}\right)^N \approx \left(1-\frac{1}{\sqrt{\pi d}}\underbrace{\left(\frac{8e\pi\epsilon}{\bar\lambda}\right)^{d/2}}_{\approx\,\mu^d}\frac{1}{d^{d/2}}\right)^N \approx \left(1-\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d\right)^N,$$

so that

$$P[E(w_{\min}) > E(w^*)+\epsilon] \approx \left(1-\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d\right)^N = e^{N\ln\left(1-\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d\right)} \approx e^{-\frac{N}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d} \approx e^{\frac{1}{\sqrt{\pi d}}\log\eta},$$

because we have

$$-\frac{N}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d \approx \frac{1}{\sqrt{\pi d}}\log\eta.$$

In conclusion, we get that

$$P[E(w_{\min}) > E(w^*)+\epsilon] \approx \eta^{\frac{1}{\sqrt{\pi d}}} \ge \eta$$

since $0 \le \eta \le 1$; thus we may now write that

$$P[E(w_{\min}) \le E(w^*)+\epsilon] \le 1-\eta.$$
Problem 7.10
If all the weights are initialized to zero, then every signal $s^{(l)} = (W^{(l)})^Tx^{(l-1)}$ is zero as well; so we get

$$x^{(l)} = \theta(s^{(l)}) = \theta(0) = \tanh(0) = 0$$

for $l = 1, \cdots, L$. This impacts the gradient in the following way: we may write

$$\frac{\partial e}{\partial W^{(l)}} = x^{(l-1)}(\delta^{(l)})^T = 0$$

for $l = 2, \cdots, L$. To see what happens when $l = 1$, we first note that

$$\delta_j^{(1)} = \theta'(s_j^{(1)})\sum_{k=1}^{d^{(2)}}\underbrace{w_{jk}^{(2)}}_{=0}\delta_k^{(2)} = 0$$

for all $j$, which means that $\partial e/\partial W^{(1)} = 0$. In conclusion, we have in this case that

$$\frac{\partial E_{in}}{\partial W^{(l)}} = \frac1N\sum_n\frac{\partial e_n}{\partial W^{(l)}} = 0$$

for every layer $l$, so gradient descent will not move the weights away from zero.
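A tiny numerical confirmation (a sketch; the bias-free two-layer tanh network and the data point are my own simplifications): with all weights at zero, every central-difference partial derivative of the squared error comes out zero, so gradient descent never leaves the all-zero point.

import numpy as np

def error(weights, x, y):
    """Squared error of a bias-free tanh MLP on a single example."""
    out = x
    for W in weights:
        out = np.tanh(W.T @ out)
    return float((out[0] - y) ** 2)

x, y = np.array([0.3, -0.7]), 1.0
weights = [np.zeros((2, 4)), np.zeros((4, 1))]   # all-zero initialization

eps, grads = 1e-5, []
for W in weights:
    for idx in np.ndindex(W.shape):
        W[idx] += eps; e_plus = error(weights, x, y)
        W[idx] -= 2 * eps; e_minus = error(weights, x, y)
        W[idx] += eps                             # restore the weight
        grads.append((e_plus - e_minus) / (2 * eps))

print("largest |partial derivative| at the zero initialization:", max(abs(g) for g in grads))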
Problem 7.12
From Problem 7.11, the gradient descent update step (here with a time-varying learning rate $\eta_t$) may be written as

$$w_{t+1} = w_t - \eta_tH(w_t - w^*),$$

so that

$$(w_{t+1} - w^*) = (w_t - w^*) - \eta_tH(w_t - w^*) \iff \epsilon_{t+1} = \epsilon_t - \eta_tH\epsilon_t \iff \epsilon_{t+1} = (I - \eta_tH)\epsilon_t,$$

where $\epsilon_t = w_t - w^*$. Since $H$ is symmetric, one can form an orthonormal basis with its eigenvectors. Projecting $\epsilon_t$ and $\epsilon_{t+1}$ onto this basis, we see that in this basis each component decouples from the others, and letting $\epsilon_t(\alpha)$ be the $\alpha$th component in this basis, we see that

$$\epsilon_{t+1}(\alpha) = (1 - \eta_t\lambda_\alpha)\epsilon_t(\alpha),$$
where $\lambda_\alpha$ is a positive eigenvalue of $H$ (which is positive definite). Now, by proceeding recursively and by using the Taylor expansion of $\ln(1-x)$, we are able to write that

$$\epsilon_{t+1}(\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}(1-\eta_i\lambda_\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}e^{\ln(1-\eta_i\lambda_\alpha)} = \epsilon_1(\alpha)e^{\sum_{i=1}^{t}\ln(1-\eta_i\lambda_\alpha)} \approx \epsilon_1(\alpha)e^{\sum_{i=1}^{t}\left(-\eta_i\lambda_\alpha - \frac12\lambda_\alpha^2\eta_i^2\right)} = \epsilon_1(\alpha)e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i - \frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2};$$

since $\eta_t \to 0$, we have that $1 - \eta_t\lambda_\alpha > 0$ for $t$ large enough, so the logarithms above are well defined. Moreover, since $\sum_t\eta_t = +\infty$ and $\sum_t\eta_t^2 < \infty$, we get that

$$e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i} \to 0 \quad\text{and}\quad e^{-\frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2} \le C,$$

which gives us

$$\prod_{i=1}^{t}(1-\eta_i\lambda_\alpha) \approx \underbrace{e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i}}_{\to 0}\ \underbrace{e^{-\frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2}}_{\le C} \to 0.$$

Consequently, $\epsilon_{t+1}(\alpha) \to 0$ for every $\alpha$, which means that $w_t \to w^*$.
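A short simulation of such a decay schedule (a sketch with my own random quadratic and the schedule $\eta_t = 1/t$, which satisfies $\eta_t \to 0$, $\sum_t\eta_t = \infty$ and $\sum_t\eta_t^2 < \infty$):

import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
H = A @ A.T / d + np.eye(d)      # positive definite Hessian with moderate eigenvalues
w_star = rng.normal(size=d)
w = np.zeros(d)

for t in range(1, 20001):
    eta_t = 1.0 / t              # eta_t -> 0, sum eta_t = inf, sum eta_t^2 < inf
    w = w - eta_t * H @ (w - w_star)
    if t in (1, 10, 100, 1000, 20000):
        print(f"t = {t:5d}   ||w_t - w*|| = {np.linalg.norm(w - w_star):.3e}")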
Problem 7.13
(a) In general, the finite difference approximation to the first order partial derivatives of a function $f(x, y)$ is given by

$$\frac{\partial f}{\partial x} \approx \frac{f(x+h, y) - f(x-h, y)}{2h} \quad\text{and}\quad \frac{\partial f}{\partial y} \approx \frac{f(x, y+h) - f(x, y-h)}{2h}.$$

If we apply the same idea to the function $E(w_1, w_2)$, we get

$$\frac{\partial E}{\partial w_1} \approx \frac{E(w_1+h, w_2) - E(w_1-h, w_2)}{2h} \quad\text{and}\quad \frac{\partial E}{\partial w_2} \approx \frac{E(w_1, w_2+h) - E(w_1, w_2-h)}{2h}.$$
If we now consider the second order partial derivatives, we may write that

$$\frac{\partial^2E}{\partial w_1^2} \approx \frac{E(w_1+h, w_2) - 2E(w_1, w_2) + E(w_1-h, w_2)}{h^2} \quad\text{and}\quad \frac{\partial^2E}{\partial w_2^2} \approx \frac{E(w_1, w_2+h) - 2E(w_1, w_2) + E(w_1, w_2-h)}{h^2}.$$

It remains to compute the last second order partial derivative; we have that

$$\frac{\partial^2E}{\partial w_1\partial w_2} \approx \frac{\frac{\partial E}{\partial w_2}(w_1+h, w_2) - \frac{\partial E}{\partial w_2}(w_1-h, w_2)}{2h} \approx \frac{E(w_1+h, w_2+h) + E(w_1-h, w_2-h) - E(w_1+h, w_2-h) - E(w_1-h, w_2+h)}{4h^2}.$$
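These formulas are easy to verify numerically; here is a sketch using my own test function $E(w_1, w_2) = \sin(w_1)e^{w_2}$, whose Hessian is known in closed form.

import numpy as np

E = lambda w1, w2: np.sin(w1) * np.exp(w2)
w1, w2, h = 0.8, -0.3, 1e-4

# finite-difference Hessian built from the formulas above
H_num = np.array([
    [(E(w1 + h, w2) - 2 * E(w1, w2) + E(w1 - h, w2)) / h**2,
     (E(w1 + h, w2 + h) + E(w1 - h, w2 - h) - E(w1 + h, w2 - h) - E(w1 - h, w2 + h)) / (4 * h**2)],
    [(E(w1 + h, w2 + h) + E(w1 - h, w2 - h) - E(w1 + h, w2 - h) - E(w1 - h, w2 + h)) / (4 * h**2),
     (E(w1, w2 + h) - 2 * E(w1, w2) + E(w1, w2 - h)) / h**2]])

# analytic Hessian of sin(w1) * exp(w2)
H_true = np.array([[-np.sin(w1) * np.exp(w2), np.cos(w1) * np.exp(w2)],
                   [np.cos(w1) * np.exp(w2),  np.sin(w1) * np.exp(w2)]])

print("max abs error:", np.max(np.abs(H_num - H_true)))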
Problem 7.14
The Lagrangian of this constrained minimization problem may be written as

$$\mathcal{L} = E_{in}(w_t) + g_t^T\Delta w + \frac12\Delta w^TH_t\Delta w + \alpha(\Delta w^T\Delta w - \eta^2) = E_{in}(w_t) + g_t^T\Delta w + \frac12\Delta w^T(H_t + 2\alpha I)\Delta w - \alpha\eta^2.$$

Setting the gradient with respect to $\Delta w$ to zero gives

$$\nabla_{\Delta w}\mathcal{L} = g_t + (H_t + 2\alpha I)\Delta w = 0,$$

which gives us

$$\Delta w = -(H_t + 2\alpha I)^{-1}g_t$$

since $H_t + 2\alpha I$ is positive definite (hence invertible); and also that

$$\nabla_\alpha\mathcal{L} = \Delta w^T\Delta w - \eta^2 = 0,$$

which gives us

$$\Delta w^T\Delta w = \eta^2.$$

Moreover,

$$(H_t + 2\alpha I)\Delta w = -g_t \;\Rightarrow\; \Delta w^T(H_t + 2\alpha I)\Delta w = -\Delta w^Tg_t \;\Rightarrow\; \Delta w^TH_t\Delta w + 2\alpha\underbrace{\Delta w^T\Delta w}_{=\eta^2} = -\Delta w^Tg_t.$$
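A numerical sketch of this constrained step (my own small example; $\alpha$ is found by bisection so that $\|\Delta w\| = \eta$, which is one simple way of satisfying the two conditions above when the unconstrained step $-H_t^{-1}g_t$ is longer than $\eta$):

import numpy as np

rng = np.random.default_rng(4)
d = 5
B = rng.normal(size=(d, d))
H = B @ B.T                      # H_t (positive semi-definite here)
g = rng.normal(size=d)           # g_t
eta = 0.1                        # step-size constraint ||dw|| = eta

def step_norm(alpha):
    return np.linalg.norm(np.linalg.solve(H + 2 * alpha * np.eye(d), -g))

# ||dw(alpha)|| decreases as alpha grows, so bisect on alpha >= 0
# (assumes the unconstrained step at alpha = 0 is longer than eta).
lo, hi = 0.0, 1.0
while step_norm(hi) > eta:       # grow hi until the step is short enough
    hi *= 2
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if step_norm(mid) > eta else (lo, mid)

alpha = (lo + hi) / 2
dw = np.linalg.solve(H + 2 * alpha * np.eye(d), -g)
print(f"alpha = {alpha:.4f}, ||dw|| = {np.linalg.norm(dw):.6f} (target {eta})")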
Problem 7.15
Problem 7.16
(a) If we assume that $r > N/2$, we get that the number of points classified as $-1$ is less than or equal to $N/2$. So, in this case it suffices to swap the roles of the $+1$ and $-1$ labels, and we may safely assume that $r \le N/2$.

(b) Since $r \le N/2$ and $d \ge 1$, we also have that

$$q = \left\lfloor\frac{r}{d}\right\rfloor \le \frac{N}{2}.$$
(c) Let us consider a subset of $k \le d$ points. If $k = d$, the hyperplane containing those points does not contain any other point, since if it were not the case, we would have $d+1$ points in a $(d-1)$-dimensional hyperplane, which is impossible by hypothesis. Now, if $k < d$, we can find an infinite number of hyperplanes containing those $k$ points $x_1, \cdots, x_k$; it remains to find one that does not contain any other point. If we consider $x_1, \cdots, x_k$ supplemented by $d-k$ other points $x_{k+1}, \cdots, x_d$, we can always find a hyperplane $w^Tx + b = 0$ such that

$$w^Tx_i + b = 0 \text{ for } i = 1, \cdots, k \quad\text{and}\quad w^Tx_j + b = 1 \text{ for } j = k+1, \cdots, d.$$

For each group $D_i$ of points lying on the hyperplane $w_i^Tx + b_i = 0$, we now define

$$h_i = \frac{\min_n|w_i^Tx_n + b_i|}{2},$$

where the minimum is over the $x_n \notin D_i$; in this case, for $x_n \in D_i$, we have

$$w_i^Tx_n + b_i = 0 \quad\text{and}\quad h_i > 0.$$

Consequently, we get

$$\operatorname{sign}(-w_i^Tx_n - b_i + h_i) + \operatorname{sign}(w_i^Tx_n + b_i + h_i) = 2.$$

Now, if we consider $x_n \notin D_i$, we have

$$|w_i^Tx_n + b_i| \ge 2h_i > h_i.$$

Consequently, we get

$$\operatorname{sign}(-w_i^Tx_n - b_i + h_i) + \operatorname{sign}(w_i^Tx_n + b_i + h_i) = 0.$$

It is easy to check that this MLP implements our arbitrary dichotomy. This shows that this MLP can classify any $N = md$ points, thus

$$d_{VC} \ge md.$$