Problems Chap7
e-Chapter 7
Pierre Paquay
Problem 7.1
To solve this problem, we first separate the positive decision region into two components: the lower one corresponding to $x_2 \in [-1, 1]$ and the upper one corresponding to $x_2 \in [1, 2]$. To define the decision region, we need 7 perceptrons: three ($h_1$, $h_2$, $h_3$) for the horizontal lines and, for the vertical lines,

$$h_4(x) = \operatorname{sign}(x_1 + 2),\quad h_5(x) = \operatorname{sign}(x_1 + 1),\quad h_6(x) = \operatorname{sign}(x_1 - 1),\quad h_7(x) = \operatorname{sign}(x_1 - 2).$$

We are now able to define the lower decision region by $h_2h_3h_4h_7$ and the upper decision region by $h_1h_2h_5h_6$, which means that the total decision region is defined by

$$f = h_2h_3h_4h_7 + h_1h_2h_5h_6.$$
Problem 7.2
(a) Let $x$ and $x'$ be two points from the same region. If we consider a set of $M$ hyperplanes defined by $\{x : w_i^Tx = 0\}$, we have that $x$ and $x'$ lie on the same side of each hyperplane, or put more simply that $\operatorname{sign}(w_i^Tx) = \operatorname{sign}(w_i^Tx') = s_i$ for $i = 1, \cdots, M$, where $s_i = \pm 1$. We begin with the case where $s_i = 1$. Here, we know that $w_i^Tx > 0$ and $w_i^Tx' > 0$; consequently we have that, for $\lambda \in [0, 1]$,

$$w_i^T(\lambda x + (1-\lambda)x') = \lambda w_i^Tx + (1-\lambda)w_i^Tx' > 0$$

and

$$\operatorname{sign}(w_i^T(\lambda x + (1-\lambda)x')) = 1.$$

Now, we consider the case where $s_i = -1$. Here, we know that $w_i^Tx < 0$ and $w_i^Tx' < 0$; consequently we have that, for $\lambda \in [0, 1]$,

$$w_i^T(\lambda x + (1-\lambda)x') = \lambda w_i^Tx + (1-\lambda)w_i^Tx' < 0$$

and

$$\operatorname{sign}(w_i^T(\lambda x + (1-\lambda)x')) = -1.$$

So, in conclusion, every point on the segment joining $x$ and $x'$ lies in the same region, which means the region is convex.
(b) A region is defined as the following set

$$\{x : \operatorname{sign}(w_i^Tx) = s_i \text{ for } i = 1, \cdots, M\},$$

thus a region is characterized by a particular $M$-tuple $(s_1, \cdots, s_M)$. Since there are at most $2^M$ such $M$-tuples, we have at most $2^M$ different regions.
(c) Let $B(M, d)$ be the maximum number of regions created by $M$ hyperplanes in $d$-dimensional space; we claim that $B(M, d) \le \sum_{i=0}^{d}\binom{M}{i}$. Now, consider adding an $(M+1)$th hyperplane; this hyperplane can obviously be viewed as a $(d-1)$-dimensional space, so if we intersect the initial $M$ hyperplanes with it, we obtain $M$ hyperplanes in a $(d-1)$-dimensional space. These hyperplanes can create at most $B(M, d-1)$ regions in this space, and each of these regions splits one region of the original $d$-dimensional space into two. Thus, the $(M+1)$th hyperplane intersects at most $B(M, d-1)$ of the regions created by the $M$ hyperplanes in the $d$-dimensional space, and so

$$B(M+1, d) \le B(M, d) + B(M, d-1)$$

for all $d$. Now, we assume the statement is true for $M = M_0$ and all $d$; we will prove that the statement is still true for $M = M_0 + 1$ and all $d$. We have that

$$B(M_0+1, d) \le B(M_0, d) + B(M_0, d-1) \le \sum_{i=0}^{d}\binom{M_0}{i} + \sum_{i=0}^{d-1}\binom{M_0}{i} = \binom{M_0}{0} + \sum_{i=1}^{d}\left[\binom{M_0}{i} + \binom{M_0}{i-1}\right] = \sum_{i=0}^{d}\binom{M_0+1}{i}.$$

We have thus proved the induction step, so the statement is true for all $M$ and $d$.
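To get a feel for this bound, here is a small numpy sketch (my own illustration, not part of the original solution; the dimension, the number of hyperplanes, and the sampling box are arbitrary choices): it counts the distinct sign patterns realized by random points with respect to $M$ random hyperplanes and compares the count with $\sum_{i=0}^{d}\binom{M}{i}$ and with the cruder bound $2^M$.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, M = 3, 6                      # dimension and number of hyperplanes (arbitrary choice)
W = rng.normal(size=(M, d))      # hyperplane normals
b = rng.normal(size=M)           # offsets, so the hyperplanes are {x : W @ x + b = 0}

# Sample many points and record which side of each hyperplane they fall on.
X = rng.uniform(-10, 10, size=(200000, d))
signs = np.sign(X @ W.T + b)     # shape (n_points, M), entries in {-1, +1}
n_regions = len({tuple(s) for s in signs})

bound = sum(comb(M, i) for i in range(d + 1))   # B(M, d) <= sum_{i=0}^d C(M, i)
print(f"observed regions: {n_regions}, bound: {bound}, 2^M = {2**M}")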
Problem 7.3
Now the condition is also sufficient: since $h_m^{c_m}(x) = +1$ exactly when $h_m(x) = c_m$, we have

$$x \in r \iff (h_1(x), \cdots, h_M(x)) = (c_1, \cdots, c_M) \iff h_m^{c_m}(x) = +1, \ \forall m \iff \prod_{m=1}^{M}h_m^{c_m}(x) = +1 \iff t_r(x) = +1.$$

And if $x$ is in a negative region ($f(x) = -1$), we know that $x \notin r_i$ for all $i$, so $t_{r_i}(x) = -1$ for all $i$, which means that

$$t_{r_1}(x) + \cdots + t_{r_k}(x) = -1 = f(x),$$

where, as in Problem 7.1, $+$ denotes the OR of perceptrons.
Problem 7.4
From Problem 7.3, we may write $f = \operatorname{sign}(k - \frac12 + \sum_{i=1}^{k}t_{r_i})$, which characterizes the penultimate layer of our perceptron. For the layer before, we have that $t_{r_i} = h_1^{c_1^{(i)}}\cdots h_M^{c_M^{(i)}}$, and consequently

$$t_{r_i} = \operatorname{sign}\left(-M + \frac12 + \sum_{m=1}^{M}h_m^{c_m^{(i)}}\right);$$

moreover, the previous layer may be characterized with

$$h_m^{c_m^{(i)}} = \operatorname{sign}(c_m^{(i)}w_m^Tx).$$
Putting all this together, we obtain the following characterization of a 3-layer perceptron

$$f = \operatorname{sign}\left(k - \frac12 + \sum_{i=1}^{k}\operatorname{sign}\left(-M + \frac12 + \sum_{m=1}^{M}\operatorname{sign}(c_m^{(i)}w_m^Tx)\right)\right).$$
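To make the construction concrete, here is a small numpy sketch of the 3-layer sign network above (the hyperplanes, the single region pattern $c^{(1)}$, and the test points are my own toy choices; only the formula itself comes from the solution).

import numpy as np

def three_layer_perceptron(x, W, C):
    """W: (M, dim) hyperplane weights (x is assumed to carry a bias coordinate).
    C: (k, M) sign patterns of the k positive regions.
    Implements f = sign(k - 1/2 + sum_i sign(-M + 1/2 + sum_m sign(c_m^(i) w_m^T x)))."""
    k, M = C.shape
    first = np.sign(C * (W @ x))                    # shape (k, M): sign(c_m^(i) w_m^T x)
    second = np.sign(-M + 0.5 + first.sum(axis=1))  # one unit per region: AND of the M signs
    return np.sign(k - 0.5 + second.sum())          # OR over the k positive regions

# Toy example in 2D with a bias coordinate x = (1, x1, x2): the positive set is the
# unit square [0,1]^2, described by a single region with pattern (+1, +1, +1, +1).
W = np.array([[0., 1., 0.],    # x1 >= 0
              [1., -1., 0.],   # x1 <= 1
              [0., 0., 1.],    # x2 >= 0
              [1., 0., -1.]])  # x2 <= 1
C = np.array([[1., 1., 1., 1.]])
for p in [(0.5, 0.5), (1.5, 0.5), (-0.2, 0.3)]:
    x = np.array([1.0, *p])
    print(p, three_layer_perceptron(x, W, C))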
Problem 7.5
First, we decompose the unit hypercube $[0,1]^d$ into $1/\epsilon^d$ $\epsilon$-hypercubes (hypercubes whose sides have length equal to $\epsilon$); thus we get a grid-like structure on our unit hypercube. Now, if we consider a decision region (which may be composed of disconnected regions) whose boundary surfaces are smooth, this decision region partitions the unit hypercube into two regions: one labelled $+1$ and one labelled $-1$. We now have $k$ $\epsilon$-hypercubes labelled $+1$, each of which is formed by $2d$ hyperplanes defined by $h_m^{(i)} = \operatorname{sign}(w_m^{(i)T}x)$ where $m = 1, \cdots, 2d$ and $i = 1, \cdots, k$. So, the first layer, whose task is to activate the hyperplanes involved in the positive $\epsilon$-hypercubes, is characterized by

$$h_m^{(i)} = \operatorname{sign}(w_m^{(i)T}x).$$
Now, to activate the positive $\epsilon$-hypercubes $H_i$ themselves, we characterize the second layer by

$$t_{H_i} = (h_1^{(i)})^{c_1^{(i)}}\cdots(h_{2d}^{(i)})^{c_{2d}^{(i)}},$$

where the $c_m^{(i)}$ are defined as in Problems 7.3 and 7.4; or

$$t_{H_i} = \operatorname{sign}\left(-2d + \frac12 + \sum_{m=1}^{2d}(h_m^{(i)})^{c_m^{(i)}}\right).$$
And finally, to activate all the positive $\epsilon$-hypercubes, we define the MLP output $h$ by

$$h = t_{H_1} + \cdots + t_{H_k};$$

or

$$h = \operatorname{sign}\left(k - \frac12 + \sum_{i=1}^{k}t_{H_i}\right).$$

Putting all this together, we obtain the following characterization of a 3-layer perceptron

$$h = \operatorname{sign}\left(k - \frac12 + \sum_{i=1}^{k}\operatorname{sign}\left(-2d + \frac12 + \sum_{m=1}^{2d}\operatorname{sign}(c_m^{(i)}w_m^{(i)T}x)\right)\right).$$
Now, it remains to see that the above MLP can arbitrarily closely approximate the initial positive decision region $D^+$ (and consequently the negative decision region also); to do so, we first note that

$$\operatorname{Vol}(H_i) = \epsilon^d \to 0 \quad\text{and}\quad k \to \infty$$

when $\epsilon \to 0$. So, the $\epsilon$-hypercubes can be made arbitrarily small, which means that the total volume of the positive $\epsilon$-hypercubes can be made arbitrarily close to the volume of the positive decision region (because of its smoothness). Mathematically, we may write that

$$\operatorname{Vol}(H_1\cup\cdots\cup H_k) = \sum_{i=1}^{k}\epsilon^d \to \operatorname{Vol}(D^+)$$

when $\epsilon \to 0$. This means that the region where our 3-layer perceptron outputs $+1$ (resp. $-1$) converges to the positive (resp. negative) decision region in our unit hypercube.
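A quick numerical illustration of this volume argument (a sketch; the disk target in $[0,1]^2$ and the grid resolutions are my own choices): cover the unit square with $\epsilon$-squares, keep those entirely inside a smooth positive region, and watch their total area converge to the region's area as $\epsilon \to 0$.

import numpy as np

def covered_area(eps):
    """Total area of the eps-squares of [0,1]^2 entirely contained in the disk
    of radius 0.4 centered at (0.5, 0.5)."""
    n = int(round(1.0 / eps))
    count = 0
    for i in range(n):
        for j in range(n):
            # a square lies inside the (convex) disk iff all four of its corners do
            corners = [(i * eps, j * eps), ((i + 1) * eps, j * eps),
                       (i * eps, (j + 1) * eps), ((i + 1) * eps, (j + 1) * eps)]
            if all((cx - 0.5) ** 2 + (cy - 0.5) ** 2 <= 0.4 ** 2 for cx, cy in corners):
                count += 1
    return count * eps ** 2

for eps in [0.1, 0.05, 0.02, 0.01]:
    print(f"eps = {eps:5.2f}  covered area = {covered_area(eps):.4f}")
print(f"true area     = {np.pi * 0.4**2:.4f}")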
Problem 7.6
For a specific layer $l$, if we replace the weight $w_{ij}^{(l)}$ with $w_{ij}^{(l)} + \epsilon$, we need to recompute the corresponding node output of that layer and also the node outputs of the subsequent layers (which are the ones numbered from $l+1$ to $L$). Consequently, for each weight $w_{ij}^{(l)}$, we have

$$\sum_{k=l+1}^{L}d^{(k)}(d^{(k-1)}+1) \quad\text{and}\quad 1 + \sum_{k=l+1}^{L}d^{(k)}$$

multiplications and $\theta$-evaluations respectively; this means that the computational complexity of obtaining the partial derivatives is overall equal to

$$2\sum_{l=1}^{L}d^{(l)}(d^{(l-1)}+1)\left(\sum_{k=l+1}^{L}d^{(k)}(d^{(k-1)}+1) + 1 + \sum_{k=l+1}^{L}d^{(k)}\right) \le 2|W|\Bigg(\underbrace{\sum_{k=1}^{L}d^{(k)}(d^{(k-1)}+1)}_{=|W|} + 1 + \sum_{k=1}^{L}\underbrace{d^{(k)}}_{\le\, d^{(k)}(d^{(k-1)}+1)}\Bigg) \le 2|W|(2|W|+1) = O(|W|^2),$$

since we need to evaluate the network for both $w_{ij}^{(l)}+\epsilon$ and $w_{ij}^{(l)}-\epsilon$ (hence the overall factor of 2).
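The quadratic scaling is easy to observe numerically. The sketch below is my own illustration (not the text's scheme): it recomputes the whole forward pass for every perturbed weight, which is also $O(|W|^2)$, just with a larger constant than the partial recomputation above, and it counts the multiplications needed to form all central-difference partial derivatives for a few network sizes.

import numpy as np

def forward(x, weights):
    """Forward pass of a tanh MLP; returns the output and the number of multiplications used."""
    mults = 0
    for W in weights:
        x = np.concatenate(([1.0], x))   # add the bias coordinate
        s = W.T @ x
        mults += W.size
        x = np.tanh(s)
    return x, mults

def count_numerical_gradient_mults(x, weights, eps=1e-4):
    """Count the multiplications spent on the forward passes needed to form all
    central-difference partial derivatives (two perturbed passes per weight)."""
    total = 0
    for W in weights:
        for idx in np.ndindex(W.shape):
            for sign in (+1, -1):
                W[idx] += sign * eps
                _, m = forward(x, weights)   # naive full recomputation per perturbation
                total += m
                W[idx] -= sign * eps         # restore the weight
    return total

for sizes in [(5, 5, 1), (10, 10, 1), (20, 20, 1)]:
    dims = (3,) + sizes
    weights = [np.random.randn(dims[i] + 1, dims[i + 1]) for i in range(len(sizes))]
    n_weights = sum(W.size for W in weights)
    cost = count_numerical_gradient_mults(np.random.randn(3), weights)
    print(f"|W| = {n_weights:4d}  multiplications = {cost:8d}  2|W|^2 = {2 * n_weights**2}")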
Problem 7.7
We may write

$$E_{in} = \frac1N\operatorname{trace}(YY^T - ZVY^T - YV^TZ^T + ZVV^TZ^T) = \frac1N\operatorname{trace}(YY^T - 2ZVY^T + ZVV^TZ^T),$$

since $\operatorname{trace}(A) = \operatorname{trace}(A^T)$. We are now ready to compute the derivatives; we have

$$\frac{\partial\operatorname{trace}(AXB)}{\partial X} = A^TB^T \quad\text{and}\quad \frac{\partial\operatorname{trace}(AXX^TB)}{\partial X} = BAX + A^TB^TX.$$
We also have

$$E_{in} = \frac1N\operatorname{trace}\big(YY^T - 2(V_0 + \theta(XW)V_1)Y^T + (V_0 + \theta(XW)V_1)(V_0^T + V_1^T\theta(XW)^T)\big)$$
$$= \frac1N\operatorname{trace}\big(YY^T - 2V_0Y^T - 2\theta(XW)V_1Y^T + V_0V_0^T + V_0V_1^T\theta(XW)^T + \theta(XW)V_1V_0^T + \theta(XW)V_1V_1^T\theta(XW)^T\big)$$
$$= \frac1N\operatorname{trace}\big(YY^T - 2V_0Y^T - 2\theta(XW)V_1Y^T + V_0V_0^T + 2\theta(XW)V_1V_0^T + V_1V_1^T\theta(XW)^T\theta(XW)\big),$$

since the trace can be permuted in a cycle and $\operatorname{trace}(A) = \operatorname{trace}(A^T)$. The other derivative may be written as

$$\frac{\partial E_{in}}{\partial W} = \frac1N\left(-2\frac{\partial\operatorname{trace}(\theta(XW)V_1Y^T)}{\partial W} + 2\frac{\partial\operatorname{trace}(\theta(XW)V_1V_0^T)}{\partial W} + \frac{\partial\operatorname{trace}(V_1V_1^T\theta(XW)^T\theta(XW))}{\partial W}\right)$$
$$= \frac1N\left(-2X^T[\theta'(XW)\otimes YV_1^T] + 2X^T[\theta'(XW)\otimes V_0V_1^T] + X^T[\theta'(XW)\otimes 2\theta(XW)V_1V_1^T]\right)$$
$$= \frac2N X^T\left[\theta'(XW)\otimes(-YV_1^T + V_0V_1^T + \theta(XW)V_1V_1^T)\right].$$
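As a sanity check of this last formula (a sketch with my own random dimensions, taking $\theta = \tanh$ and reading $\otimes$ as the element-wise product), the snippet below compares the closed-form gradient with central finite differences.

import numpy as np

rng = np.random.default_rng(1)
N, p, h, q = 8, 3, 4, 2
X = rng.normal(size=(N, p))
W = rng.normal(size=(p, h))
V1 = rng.normal(size=(h, q))
V0 = np.ones((N, 1)) @ rng.normal(size=(1, q))   # bias term, replicated over the N rows
Y = rng.normal(size=(N, q))

def Ein(W):
    R = V0 + np.tanh(X @ W) @ V1 - Y
    return np.trace(R @ R.T) / N

# closed form: (2/N) X^T [ theta'(XW) o ( theta(XW) V1 V1^T + V0 V1^T - Y V1^T ) ]
S = X @ W
analytic = (2 / N) * X.T @ ((1 - np.tanh(S) ** 2) * ((np.tanh(S) @ V1 + V0 - Y) @ V1.T))

# central finite differences
numeric = np.zeros_like(W)
eps = 1e-6
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    numeric[idx] = (Ein(Wp) - Ein(Wm)) / (2 * eps)

print("max abs difference:", np.max(np.abs(analytic - numeric)))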
Problem 7.8
(a) By hypothesis, we know that $\{\eta_1, \eta_2, \eta_3\}$ with $\eta_1 < \eta_2 < \eta_3$ is a U-arrangement, which means that

$$E(\eta_1) \ge E(\eta_2) \quad\text{and}\quad E(\eta_2) \le E(\eta_3).$$

Since $E(\eta)$ is a quadratic curve, we know that it is decreasing (resp. increasing) to the left (resp. right) of its minimum $\bar\eta$. So if we assume that $\bar\eta < \eta_1$, we get that $E(\eta_1) \le E(\eta_2) \le E(\eta_3)$, which is impossible by definition of a U-arrangement; and if we assume that $\bar\eta > \eta_3$, we get that $E(\eta_1) \ge E(\eta_2) \ge E(\eta_3)$, which is also impossible by definition of a U-arrangement. Consequently, we have $\bar\eta \in [\eta_1, \eta_3]$.
(b) First, we solve the linear system in $a$, $b$, and $c$ below

$$\begin{cases}a\eta_1^2 + b\eta_1 + c = e_1\\ a\eta_2^2 + b\eta_2 + c = e_2\\ a\eta_3^2 + b\eta_3 + c = e_3.\end{cases}$$

By Cramer's rule, with

$$D = \begin{vmatrix}\eta_1^2 & \eta_1 & 1\\ \eta_2^2 & \eta_2 & 1\\ \eta_3^2 & \eta_3 & 1\end{vmatrix},$$

we get

$$a = \frac1D\begin{vmatrix}e_1 & \eta_1 & 1\\ e_2 & \eta_2 & 1\\ e_3 & \eta_3 & 1\end{vmatrix} = \frac{(e_1-e_2)(\eta_1-\eta_3) - (e_1-e_3)(\eta_1-\eta_2)}{D}$$

and

$$b = \frac1D\begin{vmatrix}\eta_1^2 & e_1 & 1\\ \eta_2^2 & e_2 & 1\\ \eta_3^2 & e_3 & 1\end{vmatrix} = \frac{-(e_1-e_2)(\eta_1^2-\eta_3^2) + (e_1-e_3)(\eta_1^2-\eta_2^2)}{D}.$$

Since the minimum of such a quadratic function is given by $-b/2a$, we finally get

$$\bar\eta = \frac12\left[\frac{(e_1-e_2)(\eta_1^2-\eta_3^2) - (e_1-e_3)(\eta_1^2-\eta_2^2)}{(e_1-e_2)(\eta_1-\eta_3) - (e_1-e_3)(\eta_1-\eta_2)}\right].$$
In this case, we can use this new $\eta_2'$ in place of $\eta_2$ and proceed with the algorithm.
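Here is a small sketch of this interpolation step (the test function and the bracketing triple are my own example; the formula for $\bar\eta$ is the one derived above). On an exact quadratic, the step lands on the true minimum in one shot.

import numpy as np

def parabolic_step(eta, e):
    """Given a U-arrangement eta = (eta1, eta2, eta3) with errors e = (e1, e2, e3),
    return the minimizer of the interpolating parabola."""
    (n1, n2, n3), (e1, e2, e3) = eta, e
    num = (e1 - e2) * (n1**2 - n3**2) - (e1 - e3) * (n1**2 - n2**2)
    den = (e1 - e2) * (n1 - n3) - (e1 - e3) * (n1 - n2)
    return 0.5 * num / den

E = lambda eta: 3 * (eta - 1.7)**2 + 0.5
etas = (0.0, 1.0, 4.0)                                    # a U-arrangement around 1.7
print(parabolic_step(etas, tuple(E(n) for n in etas)))    # -> 1.7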
Problem 7.9
(a) Since $w$ is uniformly sampled in the unit cube, we may write that

$$P[E(w) \le E(w^*) + \epsilon] = P\left[\frac12(w-w^*)^TH(w-w^*) \le \epsilon\right] = \int_{(w-w^*)^TH(w-w^*)\le 2\epsilon}dw_1\cdots dw_d = \int_{x^THx\le 2\epsilon}\underbrace{\left|\det\frac{\partial w}{\partial x}\right|}_{=1}dx_1\cdots dx_d,$$

where we have made the change of variables $x = w - w^*$. As $H$ is positive definite and symmetric, we know that there exists an orthogonal matrix $A$ such that $H = A\,\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)\,A^T$. Thus, if we use $y = A^Tx$ as a change of variables, we now get that

$$P[E(w) \le E(w^*) + \epsilon] = \int_{x^THx\le 2\epsilon}dx_1\cdots dx_d = \int_{y^T\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)y\le 2\epsilon}\underbrace{\left|\det\frac{\partial x}{\partial y}\right|}_{=|\det A|=1}dy_1\cdots dy_d.$$

We now use a third change of variables $z = \mathrm{diag}(\lambda_1,\cdots,\lambda_d)\,y$; in this case we obtain

$$P[E(w) \le E(w^*) + \epsilon] = \int_{y^T\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)y\le 2\epsilon}dy_1\cdots dy_d = \int_{z^Tz\le 2\epsilon}\underbrace{\left|\det\frac{\partial y}{\partial z}\right|}_{=\frac{1}{|\lambda_1\cdots\lambda_d|}=\frac{1}{\sqrt{\det H}}}dz_1\cdots dz_d = \frac{1}{\sqrt{\det H}}\int_{z^Tz\le 2\epsilon}dz_1\cdots dz_d = \frac{S_d(2\epsilon)}{\sqrt{\det H}}.$$
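A quick Monte Carlo check of this result in low dimension (a sketch under my own assumptions: $S_d(2\epsilon)$ is read as the volume of the ball $\{z : z^Tz \le 2\epsilon\}$, the quadratic expansion is exact, $w^*$ is well inside the unit cube, and $\epsilon$ is small enough that the whole ellipsoid fits in the cube):

import numpy as np
from math import gamma, pi

rng = np.random.default_rng(2)
d, eps = 3, 5e-3
w_star = np.full(d, 0.5)                 # minimizer, well inside the unit cube
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                  # a positive definite Hessian

# empirical estimate of P[ (1/2)(w - w*)^T H (w - w*) <= eps ] for w uniform in [0,1]^d
W = rng.uniform(0, 1, size=(2_000_000, d))
D = W - w_star
emp = np.mean(0.5 * np.einsum('ni,ij,nj->n', D, H, D) <= eps)

# volume of the ball {z : z^T z <= 2*eps}, i.e. radius sqrt(2*eps), divided by sqrt(det H)
ball = pi**(d / 2) * (2 * eps)**(d / 2) / gamma(d / 2 + 1)
print(f"empirical: {emp:.2e}   formula: {ball / np.sqrt(np.linalg.det(H)):.2e}")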
Moreover, since the points $w_1, \cdots, w_N$ are sampled independently, we have that

$$P[E(w_{\min}) > E(w^*) + \epsilon] = P[(E(w_1) > E(w^*)+\epsilon)\cap\cdots\cap(E(w_N) > E(w^*)+\epsilon)] = \prod_{i=1}^{N}P[E(w_i) > E(w^*)+\epsilon] = (1 - P[E(w_1) \le E(w^*)+\epsilon])^N = \left(1 - \frac{S_d(2\epsilon)}{\sqrt{\det H}}\right)^N.$$
Approximating the ratio $S_d(2\epsilon)/\sqrt{\det H}$, we get

$$P[E(w_{\min}) > E(w^*)+\epsilon] = \left(1-\frac{S_d(2\epsilon)}{\sqrt{\det H}}\right)^N \approx \left(1-\frac{1}{\sqrt{\pi d}}\underbrace{\left(\frac{8e\pi\epsilon}{\bar\lambda}\right)^{d/2}}_{\approx\,\mu^d}\frac{1}{d^{d/2}}\right)^N \approx \left(1-\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d\right)^N,$$

so that

$$P[E(w_{\min}) > E(w^*)+\epsilon] \approx \left(1-\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d\right)^N = e^{N\ln\left(1-\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d\right)} \approx e^{-\frac{N}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d} \approx e^{\frac{1}{\sqrt{\pi d}}\log\eta},$$

because we have

$$-\frac{N}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^d \approx \frac{1}{\sqrt{\pi d}}\log\eta.$$

In conclusion, we get that

$$P[E(w_{\min}) > E(w^*)+\epsilon] \approx \eta^{\frac{1}{\sqrt{\pi d}}} \ge \eta$$

since $0 \le \eta \le 1$; thus we may now write that

$$P[E(w_{\min}) \le E(w^*)+\epsilon] \le 1-\eta.$$
Problem 7.10
If all the weights are initialized to zero, then every signal $s^{(l)} = (W^{(l)})^Tx^{(l-1)}$ is zero as well; so we get

$$x^{(l)} = \theta(s^{(l)}) = \theta(0) = \tanh(0) = 0$$

for $l = 1, \cdots, L$. This impacts the gradient in the following way: we may write

$$\frac{\partial e}{\partial W^{(l)}} = x^{(l-1)}(\delta^{(l)})^T = 0$$

for $l = 2, \cdots, L$. To see what happens when $l = 1$, we first note that

$$\delta_j^{(1)} = \theta'(s_j^{(1)})\sum_{k=1}^{d^{(2)}}\underbrace{w_{jk}^{(2)}}_{=0}\delta_k^{(2)} = 0$$

for all $j$, which means that $\partial e/\partial W^{(1)} = 0$. In conclusion, we have in this case that

$$\frac{\partial E_{in}}{\partial W^{(l)}} = \frac1N\sum_n\frac{\partial e_n}{\partial W^{(l)}} = 0$$

for every layer $l$, so gradient descent will not move the weights away from zero.
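A tiny numerical confirmation (a sketch; the bias-free two-layer tanh network and the data point are my own simplifications): with all weights at zero, every central-difference partial derivative of the squared error comes out zero, so gradient descent never leaves the all-zero point.

import numpy as np

def error(weights, x, y):
    """Squared error of a bias-free tanh MLP on a single example."""
    out = x
    for W in weights:
        out = np.tanh(W.T @ out)
    return float((out[0] - y) ** 2)

x, y = np.array([0.3, -0.7]), 1.0
weights = [np.zeros((2, 4)), np.zeros((4, 1))]   # all-zero initialization

eps, grads = 1e-5, []
for W in weights:
    for idx in np.ndindex(W.shape):
        W[idx] += eps; e_plus = error(weights, x, y)
        W[idx] -= 2 * eps; e_minus = error(weights, x, y)
        W[idx] += eps                             # restore the weight
        grads.append((e_plus - e_minus) / (2 * eps))

print("largest |partial derivative| at the zero initialization:", max(abs(g) for g in grads))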
Problem 7.12
From Problem 7.11, the gradient descent update step (here with a time-varying learning rate $\eta_t$) may be written as

$$w_{t+1} = w_t - \eta_tH(w_t - w^*),$$

so that

$$(w_{t+1} - w^*) = (w_t - w^*) - \eta_tH(w_t - w^*) \iff \epsilon_{t+1} = \epsilon_t - \eta_tH\epsilon_t \iff \epsilon_{t+1} = (I - \eta_tH)\epsilon_t,$$

where $\epsilon_t = w_t - w^*$. Since $H$ is symmetric, one can form an orthonormal basis with its eigenvectors. Projecting $\epsilon_t$ and $\epsilon_{t+1}$ onto this basis, we see that in this basis each component decouples from the others, and letting $\epsilon_t(\alpha)$ be the $\alpha$th component in this basis, we see that

$$\epsilon_{t+1}(\alpha) = (1 - \eta_t\lambda_\alpha)\epsilon_t(\alpha),$$
where $\lambda_\alpha$ is a positive eigenvalue of $H$ (which is positive definite). Now, by proceeding recursively and by using the Taylor expansion of $\ln(1-x)$, we are able to write that

$$\epsilon_{t+1}(\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}(1-\eta_i\lambda_\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}e^{\ln(1-\eta_i\lambda_\alpha)} = \epsilon_1(\alpha)e^{\sum_{i=1}^{t}\ln(1-\eta_i\lambda_\alpha)} \approx \epsilon_1(\alpha)e^{\sum_{i=1}^{t}\left(-\eta_i\lambda_\alpha - \frac12\lambda_\alpha^2\eta_i^2\right)} = \epsilon_1(\alpha)e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i - \frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2};$$

since $\eta_t \to 0$, we have that $1 - \eta_t\lambda_\alpha > 0$ for $t$ large enough, so the logarithms above are well defined. Moreover, since $\sum_t\eta_t = +\infty$ and $\sum_t\eta_t^2 < \infty$, we get that

$$e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i} \to 0 \quad\text{and}\quad e^{-\frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2} \le C,$$

which gives us

$$\prod_{i=1}^{t}(1-\eta_i\lambda_\alpha) \approx \underbrace{e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i}}_{\to 0}\ \underbrace{e^{-\frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2}}_{\le C} \to 0.$$

Consequently, $\epsilon_{t+1}(\alpha) \to 0$ for every $\alpha$, which means that $w_t \to w^*$.
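A short simulation of such a decay schedule (a sketch with my own random quadratic and the schedule $\eta_t = 1/t$, which satisfies $\eta_t \to 0$, $\sum_t\eta_t = \infty$ and $\sum_t\eta_t^2 < \infty$):

import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
H = A @ A.T / d + np.eye(d)      # positive definite Hessian with moderate eigenvalues
w_star = rng.normal(size=d)
w = np.zeros(d)

for t in range(1, 20001):
    eta_t = 1.0 / t              # eta_t -> 0, sum eta_t = inf, sum eta_t^2 < inf
    w = w - eta_t * H @ (w - w_star)
    if t in (1, 10, 100, 1000, 20000):
        print(f"t = {t:5d}   ||w_t - w*|| = {np.linalg.norm(w - w_star):.3e}")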
Problem 7.13
(a) In general, the finite difference approximation to the first order partial derivatives of a function $f(x, y)$ is given by

$$\frac{\partial f}{\partial x} \approx \frac{f(x+h, y) - f(x-h, y)}{2h} \quad\text{and}\quad \frac{\partial f}{\partial y} \approx \frac{f(x, y+h) - f(x, y-h)}{2h}.$$

If we apply the same idea to the function $E(w_1, w_2)$, we get

$$\frac{\partial E}{\partial w_1} \approx \frac{E(w_1+h, w_2) - E(w_1-h, w_2)}{2h} \quad\text{and}\quad \frac{\partial E}{\partial w_2} \approx \frac{E(w_1, w_2+h) - E(w_1, w_2-h)}{2h}.$$
If we now consider the second order partial derivatives, we may write that

$$\frac{\partial^2E}{\partial w_1^2} \approx \frac{E(w_1+h, w_2) - 2E(w_1, w_2) + E(w_1-h, w_2)}{h^2} \quad\text{and}\quad \frac{\partial^2E}{\partial w_2^2} \approx \frac{E(w_1, w_2+h) - 2E(w_1, w_2) + E(w_1, w_2-h)}{h^2}.$$

It remains to compute the last second order partial derivative; we have that

$$\frac{\partial^2E}{\partial w_1\partial w_2} \approx \frac{\frac{\partial E}{\partial w_2}(w_1+h, w_2) - \frac{\partial E}{\partial w_2}(w_1-h, w_2)}{2h} \approx \frac{E(w_1+h, w_2+h) + E(w_1-h, w_2-h) - E(w_1+h, w_2-h) - E(w_1-h, w_2+h)}{4h^2}.$$
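These formulas are easy to verify numerically; here is a sketch using my own test function $E(w_1, w_2) = \sin(w_1)e^{w_2}$, whose Hessian is known in closed form.

import numpy as np

E = lambda w1, w2: np.sin(w1) * np.exp(w2)
w1, w2, h = 0.8, -0.3, 1e-4

# finite-difference Hessian built from the formulas above
H_num = np.array([
    [(E(w1 + h, w2) - 2 * E(w1, w2) + E(w1 - h, w2)) / h**2,
     (E(w1 + h, w2 + h) + E(w1 - h, w2 - h) - E(w1 + h, w2 - h) - E(w1 - h, w2 + h)) / (4 * h**2)],
    [(E(w1 + h, w2 + h) + E(w1 - h, w2 - h) - E(w1 + h, w2 - h) - E(w1 - h, w2 + h)) / (4 * h**2),
     (E(w1, w2 + h) - 2 * E(w1, w2) + E(w1, w2 - h)) / h**2]])

# analytic Hessian of sin(w1) * exp(w2)
H_true = np.array([[-np.sin(w1) * np.exp(w2), np.cos(w1) * np.exp(w2)],
                   [np.cos(w1) * np.exp(w2),  np.sin(w1) * np.exp(w2)]])

print("max abs error:", np.max(np.abs(H_num - H_true)))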
Problem 7.14
The Lagrangian of this constrained minimization problem may be written as

$$\mathcal{L} = E_{in}(w_t) + g_t^T\Delta w + \frac12\Delta w^TH_t\Delta w + \alpha(\Delta w^T\Delta w - \eta^2) = E_{in}(w_t) + g_t^T\Delta w + \frac12\Delta w^T(H_t + 2\alpha I)\Delta w - \alpha\eta^2.$$

Setting the gradient with respect to $\Delta w$ to zero gives

$$\nabla_{\Delta w}\mathcal{L} = g_t + (H_t + 2\alpha I)\Delta w = 0,$$

which gives us

$$\Delta w = -(H_t + 2\alpha I)^{-1}g_t$$

since $H_t + 2\alpha I$ is positive definite (hence invertible); and also that

$$\nabla_\alpha\mathcal{L} = \Delta w^T\Delta w - \eta^2 = 0,$$

which gives us

$$\Delta w^T\Delta w = \eta^2.$$

Moreover,

$$(H_t + 2\alpha I)\Delta w = -g_t \;\Rightarrow\; \Delta w^T(H_t + 2\alpha I)\Delta w = -\Delta w^Tg_t \;\Rightarrow\; \Delta w^TH_t\Delta w + 2\alpha\underbrace{\Delta w^T\Delta w}_{=\eta^2} = -\Delta w^Tg_t.$$
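A numerical sketch of this constrained step (my own small example; $\alpha$ is found by bisection so that $\|\Delta w\| = \eta$, which is one simple way of satisfying the two conditions above when the unconstrained step $-H_t^{-1}g_t$ is longer than $\eta$):

import numpy as np

rng = np.random.default_rng(4)
d = 5
B = rng.normal(size=(d, d))
H = B @ B.T                      # H_t (positive semi-definite here)
g = rng.normal(size=d)           # g_t
eta = 0.1                        # step-size constraint ||dw|| = eta

def step_norm(alpha):
    return np.linalg.norm(np.linalg.solve(H + 2 * alpha * np.eye(d), -g))

# ||dw(alpha)|| decreases as alpha grows, so bisect on alpha >= 0
# (assumes the unconstrained step at alpha = 0 is longer than eta).
lo, hi = 0.0, 1.0
while step_norm(hi) > eta:       # grow hi until the step is short enough
    hi *= 2
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if step_norm(mid) > eta else (lo, mid)

alpha = (lo + hi) / 2
dw = np.linalg.solve(H + 2 * alpha * np.eye(d), -g)
print(f"alpha = {alpha:.4f}, ||dw|| = {np.linalg.norm(dw):.6f} (target {eta})")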
Problem 7.15
Problem 7.16
(a) If we assume that $r > N/2$, we get that the number of points classified as $-1$ is less than or equal to $N/2$. So, in this case it suffices to swap the roles of the $+1$ and $-1$ labels, and we may safely assume that $r \le N/2$.

(b) Since $r \le N/2$ and $d \ge 1$, we also have that

$$q = \left\lfloor\frac{r}{d}\right\rfloor \le \frac{N}{2}.$$
(c) Let us consider a subset of $k \le d$ points. If $k = d$, the hyperplane containing those points does not contain any other point, since if it were not the case, we would have $d+1$ points in a $(d-1)$-dimensional hyperplane, which is impossible by hypothesis. Now, if $k < d$, we can find an infinite number of hyperplanes containing those $k$ points $x_1, \cdots, x_k$; it remains to find one that does not contain any other point. If we consider $x_1, \cdots, x_k$ supplemented by $d-k$ other points $x_{k+1}, \cdots, x_d$, we can always find a hyperplane $w^Tx + b = 0$ such that

$$w^Tx_i + b = 0 \text{ for } i = 1, \cdots, k \quad\text{and}\quad w^Tx_j + b = 1 \text{ for } j = k+1, \cdots, d.$$

For each group $D_i$ of points lying on the hyperplane $w_i^Tx + b_i = 0$, we now define

$$h_i = \frac{\min_n|w_i^Tx_n + b_i|}{2},$$

where the minimum is over the $x_n \notin D_i$; in this case, for $x_n \in D_i$, we have

$$w_i^Tx_n + b_i = 0 \quad\text{and}\quad h_i > 0.$$

Consequently, we get

$$\operatorname{sign}(-w_i^Tx_n - b_i + h_i) + \operatorname{sign}(w_i^Tx_n + b_i + h_i) = 2.$$

Now, if we consider $x_n \notin D_i$, we have

$$|w_i^Tx_n + b_i| \ge 2h_i > h_i.$$

Consequently, we get

$$\operatorname{sign}(-w_i^Tx_n - b_i + h_i) + \operatorname{sign}(w_i^Tx_n + b_i + h_i) = 0.$$

It is easy to check that this MLP implements our arbitrary dichotomy. This shows that this MLP can classify any $N = md$ points, thus

$$d_{VC} \ge md.$$