
Problem Solutions

e-Chapter 7
Pierre Paquay

Problem 7.1

To solve this problem, we begin by separating the positive decision region into two components: the lower one, corresponding to $x_2 \in [-1, 1]$, and the upper one, corresponding to $x_2 \in [1, 2]$. To define the decision region, we need 7 perceptrons, namely

$$h_1(x) = \mathrm{sign}(x_2 - 2), \quad h_2(x) = \mathrm{sign}(x_2 - 1), \quad h_3(x) = \mathrm{sign}(x_2 + 1)$$

for the horizontal lines, and

$$h_4(x) = \mathrm{sign}(x_1 + 2), \quad h_5(x) = \mathrm{sign}(x_1 + 1), \quad h_6(x) = \mathrm{sign}(x_1 - 1), \quad h_7(x) = \mathrm{sign}(x_1 - 2)$$

for the vertical lines. We are now able to define the lower decision region by $\bar h_2\, h_3\, h_4\, \bar h_7$ and the upper decision region by $\bar h_1\, h_2\, h_5\, \bar h_6$ (where $\bar h$ denotes the negation $-h$, a product of $\pm 1$-valued hypotheses denotes AND, and $+$ denotes OR), which means that the total decision region is defined by

$$f = \bar h_2\, h_3\, h_4\, \bar h_7 + \bar h_1\, h_2\, h_5\, \bar h_6,$$

which actually characterizes a 3-layer perceptron.
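
The construction is easy to check numerically. Below is a minimal sketch (not part of the original solution), assuming the target is the region described above (the band $x_2 \in [-1,1]$, $x_1 \in [-2,2]$ together with the band $x_2 \in [1,2]$, $x_1 \in [-1,1]$); AND and OR of $\pm 1$ values are implemented with the sign-of-a-sum trick used in Problems 7.4 and 7.5.

```python
import numpy as np

sign = lambda z: np.where(z >= 0, 1, -1)

# the seven first-layer perceptrons
def h(x1, x2):
    return [sign(x2 - 2), sign(x2 - 1), sign(x2 + 1),                   # h1, h2, h3
            sign(x1 + 2), sign(x1 + 1), sign(x1 - 1), sign(x1 - 2)]     # h4 .. h7

def AND(*args):   # +1 iff all arguments are +1
    return sign(sum(args) - len(args) + 0.5)

def OR(*args):    # +1 iff at least one argument is +1
    return sign(sum(args) + len(args) - 0.5)

def f(x1, x2):
    h1, h2, h3, h4, h5, h6, h7 = h(x1, x2)
    lower = AND(-h2, h3, h4, -h7)      # -1 <= x2 <= 1 and -2 <= x1 <= 2
    upper = AND(-h1, h2, h5, -h6)      #  1 <= x2 <= 2 and -1 <= x1 <= 1
    return OR(lower, upper)

# the assumed target region, checked on a grid that avoids the boundary lines
def target(x1, x2):
    low = (-1 < x2) & (x2 < 1) & (-2 < x1) & (x1 < 2)
    up  = (1 < x2) & (x2 < 2) & (-1 < x1) & (x1 < 1)
    return np.where(low | up, 1, -1)

x1, x2 = np.meshgrid(np.linspace(-2.95, 2.95, 60), np.linspace(-2.95, 2.95, 60))
print(np.all(f(x1, x2) == target(x1, x2)))   # True
```

The grid deliberately avoids the boundary lines, where the sign convention at $0$ would make the comparison ambiguous.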

Problem 7.2

(a) Let $x$ and $x'$ be two points from the same region. If we consider a set of $M$ hyperplanes defined by $\{x : w_i^Tx = 0\}$, we have that

$$(\mathrm{sign}(w_1^Tx), \cdots, \mathrm{sign}(w_M^Tx)) = (\mathrm{sign}(w_1^Tx'), \cdots, \mathrm{sign}(w_M^Tx')),$$

or, put more simply, that $\mathrm{sign}(w_i^Tx) = \mathrm{sign}(w_i^Tx') = s_i$ for $i = 1, \cdots, M$, where $s_i = \pm 1$. We begin with the case where $s_i = +1$. Here, we know that $w_i^Tx > 0$ and $w_i^Tx' > 0$; consequently we have that, for $\lambda \in [0, 1]$,

$$w_i^T(\lambda x + (1 - \lambda)x') = \lambda w_i^Tx + (1 - \lambda)w_i^Tx' > 0$$

and

$$\mathrm{sign}\big(w_i^T(\lambda x + (1 - \lambda)x')\big) = +1.$$

Now, we consider the case where $s_i = -1$. Here, we know that $w_i^Tx < 0$ and $w_i^Tx' < 0$; consequently we have that, for $\lambda \in [0, 1]$,

$$w_i^T(\lambda x + (1 - \lambda)x') = \lambda w_i^Tx + (1 - \lambda)w_i^Tx' < 0$$

and

$$\mathrm{sign}\big(w_i^T(\lambda x + (1 - \lambda)x')\big) = -1.$$

In either case, every point on the segment between $x$ and $x'$ has the same sign pattern; so, in conclusion, the region is convex.
(b) A region is defined as the set

$$\{x : (\mathrm{sign}(w_1^Tx), \cdots, \mathrm{sign}(w_M^Tx)) = (s_1, \cdots, s_M)\}, \qquad s_i \in \{-1, +1\};$$

thus a region is characterized by a particular $M$-tuple $(s_1, \cdots, s_M)$. Since there are at most $2^M$ such $M$-tuples, there are at most $2^M$ different regions.
(c) Let $B(M, d)$ be the maximum number of regions created by $M$ hyperplanes in $d$-dimensional space. Now, consider adding an $(M+1)$th hyperplane; this hyperplane is itself a $(d-1)$-dimensional space, and intersecting it with the initial $M$ hyperplanes gives at most $M$ hyperplanes inside this $(d-1)$-dimensional space. These can create at most $B(M, d-1)$ regions within the new hyperplane, so the $(M+1)$th hyperplane passes through at most $B(M, d-1)$ of the regions created by the $M$ hyperplanes in the $d$-dimensional space, and each region it passes through is split into two. Hence

$$B(M + 1, d) \le B(M, d) + B(M, d - 1).$$

Now, we will prove by induction that

$$B(M, d) \le \sum_{i=0}^{d}\binom{M}{i}.$$

We begin by evaluating the boundary conditions: we have

$$B(M, 1) = M + 1 \le \sum_{i=0}^{1}\binom{M}{i} = \binom{M}{0} + \binom{M}{1} = M + 1$$

for all $M$, and

$$B(1, d) = 2 \le \sum_{i=0}^{d}\binom{1}{i} = \binom{1}{0} + \binom{1}{1} = 2$$

for all $d$. Now, we assume the statement is true for $M = M_0$ and all $d$, and we prove that it still holds for $M = M_0 + 1$ and all $d$. We have that

$$\begin{aligned}
B(M_0 + 1, d) &\le B(M_0, d) + B(M_0, d - 1)\\
&\le \sum_{i=0}^{d}\binom{M_0}{i} + \sum_{i=0}^{d-1}\binom{M_0}{i}\\
&= \binom{M_0}{0} + \sum_{i=1}^{d}\binom{M_0}{i} + \sum_{i=1}^{d}\binom{M_0}{i-1}\\
&= 1 + \sum_{i=1}^{d}\underbrace{\left[\binom{M_0}{i} + \binom{M_0}{i-1}\right]}_{=\binom{M_0+1}{i}}\\
&= \sum_{i=0}^{d}\binom{M_0 + 1}{i}.
\end{aligned}$$
We have thus proved the induction step, so the statement is true for all M and d.
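
As an empirical sanity check of parts (b) and (c) (a sketch, not part of the original solution): draw $M$ random affine hyperplanes in $d$ dimensions, estimate the number of distinct sign-pattern regions by sampling points, and compare with $2^M$ and $\sum_{i=0}^{d}\binom{M}{i}$. The sampled count can only undercount, so it should never exceed the bound.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, M = 2, 6                                      # dimension and number of hyperplanes

# M random affine hyperplanes  w_i^T x + b_i = 0
W = rng.normal(size=(M, d))
b = rng.normal(size=M)

# estimate the number of distinct regions by sampling points and
# collecting their sign patterns (s_1, ..., s_M)
X = rng.uniform(-10, 10, size=(200_000, d))
patterns = np.sign(X @ W.T + b)                  # shape (n_points, M), entries +-1
n_regions = len(np.unique(patterns, axis=0))     # sampling can only undercount

bound = sum(comb(M, i) for i in range(d + 1))    # sum_{i=0}^{d} C(M, i)
print(n_regions, "<=", bound, "<", 2 ** M)       # sampled count <= bound < 2^M
```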

Problem 7.3

We begin by proving the following equivalence:

$$h_m(x) = c_m \iff h_m^{c_m}(x) = +1,$$

where $h_m^{c_m}(x) = c_m\,h_m(x)$ (so $h_m^{+1} = h_m$ and $h_m^{-1} = -h_m$).

The condition is necessary because if $c_m = +1$, we have

$$h_m^{c_m}(x) = h_m(x) = c_m = +1;$$

and if $c_m = -1$, we have

$$h_m^{c_m}(x) = -h_m(x) = -c_m = +1.$$

Now, the condition is also sufficient because if $c_m = +1$, we have

$$+1 = h_m^{c_m}(x) = h_m(x),$$

which means that $h_m(x) = +1 = c_m$; and if $c_m = -1$, we have

$$+1 = h_m^{c_m}(x) = -h_m(x),$$

which implies that $h_m(x) = -1 = c_m$.


Now we are able to write that

$$x \in r \iff (h_1(x), \cdots, h_M(x)) = (c_1, \cdots, c_M) \iff h_m^{c_m}(x) = +1 \;\;\forall m \iff h_1^{c_1}(x)\cdots h_M^{c_M}(x) = +1 \iff t_r(x) = +1.$$

The above relation also implies that

$$x \notin r \iff t_r(x) = -1.$$

Now, if $x$ is in a positive region ($f(x) = +1$), we know that there exists $i$ such that $x \in r_i$, and consequently that $t_{r_i}(x) = +1$, which means that

$$t_{r_1}(x) + \cdots + t_{r_k}(x) = +1 = f(x)$$

(where, as before, $+$ denotes OR). And if $x$ is in a negative region ($f(x) = -1$), we know that $x \notin r_i$ for all $i$, so $t_{r_i}(x) = -1$ for all $i$, which means that

$$t_{r_1}(x) + \cdots + t_{r_k}(x) = -1 = f(x).$$
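
A tiny brute-force check of the equivalence (a sketch, not part of the original text), using the AND implementation of Problem 7.4 for $t_r$:

```python
from itertools import product

M = 4
def t_r(h_vals, c):
    # AND of the h_m^{c_m} = c_m * h_m, written as sign(-M + 1/2 + sum) as in Problem 7.4
    s = sum(cm * hm for cm, hm in zip(c, h_vals))
    return 1 if (-M + 0.5 + s) > 0 else -1

for c in product([-1, 1], repeat=M):            # every possible region label (c_1, ..., c_M)
    for h_vals in product([-1, 1], repeat=M):   # every possible sign pattern of x
        assert (t_r(h_vals, c) == 1) == (h_vals == c)
print("t_r(x) = +1  <=>  (h_1(x), ..., h_M(x)) = (c_1, ..., c_M)")
```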

Problem 7.4

Since $f = t_{r_1} + \cdots + t_{r_k}$, we may write that

$$f = \mathrm{sign}\Big(k - \frac{1}{2} + \sum_{i=1}^{k}t_{r_i}\Big),$$

which characterizes the penultimate layer of our perceptron. For the layer before, we have that $t_{r_i} = h_1^{c_1^{(i)}}\cdots h_M^{c_M^{(i)}}$, and consequently

$$t_{r_i} = \mathrm{sign}\Big(-M + \frac{1}{2} + \sum_{m=1}^{M}h_m^{c_m^{(i)}}\Big);$$

moreover, the previous layer may be characterized by

$$h_m^{c_m^{(i)}} = \mathrm{sign}\big(c_m^{(i)}w_m^Tx\big).$$

Putting all this together, we obtain the following characterization of a 3-layer perceptron:

$$f = \mathrm{sign}\Bigg(k - \frac{1}{2} + \sum_{i=1}^{k}\mathrm{sign}\Big(-M + \frac{1}{2} + \sum_{m=1}^{M}\mathrm{sign}\big(c_m^{(i)}w_m^Tx\big)\Big)\Bigg),$$

whose structure is given by $[d, kM, k, 1]$.
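
The final formula translates directly into a small feed-forward computation. The sketch below (an illustration with an arbitrary toy configuration, not part of the original solution) builds random perceptrons $w_m$, picks $k$ realizable sign patterns $c^{(i)}$, and checks that the three nested sign layers agree with the direct definition of $f$, i.e. $f(x) = +1$ exactly when the sign pattern of $x$ matches one of the $c^{(i)}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, k = 3, 5, 4
W = rng.normal(size=(M, d))                    # rows are the w_m (as in w_m^T x)

# pick k target sign patterns c^(i) that actually occur, by probing random points
X = rng.normal(size=(1000, d))
patterns = np.unique(np.sign(X @ W.T).astype(int), axis=0)
C = patterns[:k]                               # the k regions r_1, ..., r_k

sign = lambda z: np.where(z >= 0, 1, -1)

def f_mlp(x):
    first = sign(C * (W @ x))                  # first layer: k*M units  sign(c_m^(i) w_m^T x)
    second = sign(-M + 0.5 + first.sum(axis=1))  # second layer: k units (AND per region)
    return sign(k - 0.5 + second.sum())          # output: OR over the k regions

def f_direct(x):
    s = np.sign(W @ x).astype(int)
    return 1 if any(np.array_equal(s, c) for c in C) else -1

X_test = rng.normal(size=(2000, d))
print(all(f_mlp(x) == f_direct(x) for x in X_test))   # True
```

The first layer has $kM$ units, the second $k$, and the output $1$, matching the structure $[d, kM, k, 1]$.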

Problem 7.5

First, we decompose the unit hypercube $[0, 1]^d$ into $1/\epsilon^d$ $\epsilon$-hypercubes (hypercubes whose sides have length equal to $\epsilon$); thus we get a grid-like structure on our unit hypercube. Now, if we consider a decision region (which may be composed of disconnected pieces) whose boundary surfaces are smooth, this decision region partitions the unit hypercube into two regions: one labelled $+1$ and one labelled $-1$. We now have $k$ $\epsilon$-hypercubes labelled $+1$, each formed by $2d$ hyperplanes defined by $h_m^{(i)} = \mathrm{sign}(w_m^{(i)T}x)$ where $m = 1, \cdots, 2d$ and $i = 1, \cdots, k$. So, the first layer, whose task is to activate the hyperplanes involved in the positive $\epsilon$-hypercubes, is characterized by

$$h_m^{(i)}(x) = \mathrm{sign}\big(w_m^{(i)T}x\big).$$

Now, to activate the positive $\epsilon$-hypercubes $H_i$ themselves, we characterize the second layer by

$$t_{H_i} = \big(h_1^{(i)}\big)^{c_1^{(i)}}\cdots\big(h_{2d}^{(i)}\big)^{c_{2d}^{(i)}},$$

where the $c_m^{(i)}$ are defined as in Problems 7.3 and 7.4; or

$$t_{H_i} = \mathrm{sign}\Big(-2d + \frac{1}{2} + \sum_{m=1}^{2d}\big(h_m^{(i)}\big)^{c_m^{(i)}}\Big).$$

And finally, to activate all the positive $\epsilon$-hypercubes, we define the MLP output $h$ by

$$h = t_{H_1} + \cdots + t_{H_k};$$

or

$$h = \mathrm{sign}\Big(k - \frac{1}{2} + \sum_{i=1}^{k}t_{H_i}\Big).$$

Putting all this together, we obtain the following characterization of a 3-layer perceptron:

$$h = \mathrm{sign}\Bigg(k - \frac{1}{2} + \sum_{i=1}^{k}\mathrm{sign}\Big(-2d + \frac{1}{2} + \sum_{m=1}^{2d}\mathrm{sign}\big(c_m^{(i)}w_m^{(i)T}x\big)\Big)\Bigg).$$

Now, it remains to see that the above MLP can arbitrarily closely approximate the initial positive decision region $D_+$ (and consequently the negative decision region as well); to do so, we first note that

$$\mathrm{Vol}(H_i) = \epsilon^d \to 0 \quad\text{and}\quad k \to \infty$$

when $\epsilon \to 0$. So, the $\epsilon$-hypercubes can be made arbitrarily small, which means that the total volume of the positive $\epsilon$-hypercubes can be made arbitrarily close to the volume of the positive decision region (because of the smoothness of its boundary). Mathematically, we may write that

$$\mathrm{Vol}(H_1 \cup \cdots \cup H_k) = \sum_{i=1}^{k}\epsilon^d \to \mathrm{Vol}(D_+)$$

when $\epsilon \to 0$. This means that the region where our 3-layer perceptron outputs $+1$ (resp. $-1$) converges to the positive (resp. negative) decision region in our unit hypercube.
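
The volume argument can be visualized with a concrete $D_+$. A sketch (not part of the original solution), taking $d = 2$ and $D_+$ a disk inside the unit square, and counting the $\epsilon$-squares whose centres are labelled $+1$:

```python
import numpy as np

# D_+ : a disk of radius 0.3 centred at (0.5, 0.5) inside the unit square
inside = lambda x, y: (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.3 ** 2
true_vol = np.pi * 0.3 ** 2

for eps in [0.1, 0.05, 0.01, 0.002]:
    n = int(round(1 / eps))
    centres = (np.arange(n) + 0.5) * eps          # centres of the eps-cells along one axis
    xx, yy = np.meshgrid(centres, centres)
    k = np.count_nonzero(inside(xx, yy))          # number of positive eps-hypercubes
    print(f"eps = {eps:5}:  k * eps^2 = {k * eps**2:.4f}   (Vol(D_+) = {true_vol:.4f})")
```

As $\epsilon$ shrinks, $k\,\epsilon^2$ approaches $\mathrm{Vol}(D_+) = \pi\cdot 0.3^2 \approx 0.2827$.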

Problem 7.6
For a specific layer $l$, if we replace the weight $w_{ij}^{(l)}$ with $w_{ij}^{(l)} + \epsilon$, we need to recompute the corresponding node output of that layer and also the node outputs of the subsequent layers (those numbered $l + 1$ to $L$). Consequently, for each weight $w_{ij}^{(l)}$, we need approximately

$$\sum_{k=l+1}^{L}d^{(k)}\big(d^{(k-1)} + 1\big) + 1 \quad\text{and}\quad \sum_{k=l+1}^{L}d^{(k)}$$

multiplications and $\theta$-evaluations respectively; this means that the computational complexity of obtaining all the partial derivatives is overall

$$2\sum_{l=1}^{L}d^{(l)}\big(d^{(l-1)} + 1\big)\left(\sum_{k=l+1}^{L}d^{(k)}\big(d^{(k-1)} + 1\big) + 1 + \sum_{k=l+1}^{L}d^{(k)}\right) \le 2|W|\left(\sum_{k=1}^{L}d^{(k)}\big(d^{(k-1)} + 1\big) + 1 + \sum_{k=1}^{L}d^{(k)}\right) \le 2|W|\big(2|W| + 1\big) = O(|W|^2),$$

using $\sum_{l=1}^{L}d^{(l)}(d^{(l-1)} + 1) = |W|$ and $d^{(k)} \le d^{(k)}(d^{(k-1)} + 1)$; the factor $2$ comes from the fact that we need to evaluate the network for both $w_{ij}^{(l)} + \epsilon$ and $w_{ij}^{(l)} - \epsilon$.
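
The counting can be made concrete for a specific architecture. A small sketch (not part of the original solution; the layer sizes are arbitrary) tallies the per-weight recomputation cost used above and compares the total with the $2|W|(2|W|+1)$ bound.

```python
# layer sizes d^(0), ..., d^(L); d^(0) is the input dimension
d = [10, 8, 6, 1]
L = len(d) - 1

# |W| = sum_l d^(l) (d^(l-1) + 1)   (each layer has a bias input)
W_total = sum(d[l] * (d[l - 1] + 1) for l in range(1, L + 1))

# cost of re-evaluating the network after perturbing one weight in layer l:
# roughly sum_{k>l} d^(k)(d^(k-1)+1) + 1 multiplications and sum_{k>l} d^(k) theta-evaluations
def cost_per_weight(l):
    mults = sum(d[k] * (d[k - 1] + 1) for k in range(l + 1, L + 1)) + 1
    thetas = sum(d[k] for k in range(l + 1, L + 1))
    return mults + thetas

# two perturbations (w + eps and w - eps) for every weight
total = 2 * sum(d[l] * (d[l - 1] + 1) * cost_per_weight(l) for l in range(1, L + 1))
bound = 2 * W_total * (2 * W_total + 1)

print(f"|W| = {W_total},  total cost = {total},  bound 2|W|(2|W|+1) = {bound}")
print(total <= bound)   # True
```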

Problem 7.7

(a) We know that

$$E_{in} = \frac{1}{N}\sum_{i=1}^{N}\|y_i - \hat y_i\|^2,$$

and also that

$$(Y - \hat Y)(Y - \hat Y)^T = \begin{pmatrix} y_1^T - \hat y_1^T \\ \vdots \\ y_N^T - \hat y_N^T \end{pmatrix}\big(y_1 - \hat y_1, \cdots, y_N - \hat y_N\big) = \begin{pmatrix} \|y_1 - \hat y_1\|^2 & & * \\ & \ddots & \\ * & & \|y_N - \hat y_N\|^2 \end{pmatrix}.$$

Consequently, we get that

$$E_{in} = \frac{1}{N}\mathrm{trace}\big((Y - \hat Y)(Y - \hat Y)^T\big).$$

(b) We may write that

$$E_{in} = \frac{1}{N}\mathrm{trace}\big(YY^T - ZVY^T - YV^TZ^T + ZVV^TZ^T\big) = \frac{1}{N}\mathrm{trace}\big(YY^T - 2ZVY^T + ZVV^TZ^T\big),$$

since $\mathrm{trace}(A) = \mathrm{trace}(A^T)$. We are now ready to compute the derivatives; we have

$$\frac{\partial E_{in}}{\partial V} = \frac{1}{N}\left(-2\,\underbrace{\frac{\partial\,\mathrm{trace}(ZVY^T)}{\partial V}}_{=Z^TY} + \underbrace{\frac{\partial\,\mathrm{trace}(ZVV^TZ^T)}{\partial V}}_{=Z^TZV + Z^TZV = 2Z^TZV}\right) = \frac{1}{N}\big(2Z^TZV - 2Z^TY\big),$$

because of the following identities:

$$\frac{\partial\,\mathrm{trace}(AXB)}{\partial X} = A^TB^T \quad\text{and}\quad \frac{\partial\,\mathrm{trace}(AXX^TB)}{\partial X} = BAX + A^TB^TX.$$

We also have

$$\begin{aligned}
E_{in} &= \frac{1}{N}\mathrm{trace}\Big(YY^T - 2\big(V_0 + \theta(XW)V_1\big)Y^T + \big(V_0 + \theta(XW)V_1\big)\big(V_0^T + V_1^T\theta(XW)^T\big)\Big)\\
&= \frac{1}{N}\mathrm{trace}\Big(YY^T - 2V_0Y^T - 2\theta(XW)V_1Y^T + V_0V_0^T + V_0V_1^T\theta(XW)^T + \theta(XW)V_1V_0^T + \theta(XW)V_1V_1^T\theta(XW)^T\Big)\\
&= \frac{1}{N}\mathrm{trace}\Big(YY^T - 2V_0Y^T - 2\theta(XW)V_1Y^T + V_0V_0^T + 2\theta(XW)V_1V_0^T + V_1V_1^T\theta(XW)^T\theta(XW)\Big),
\end{aligned}$$

since the trace is invariant under cyclic permutations and $\mathrm{trace}(A) = \mathrm{trace}(A^T)$. The other derivative may then be written as

$$\begin{aligned}
\frac{\partial E_{in}}{\partial W} &= \frac{1}{N}\left(-2\,\frac{\partial\,\mathrm{trace}(\theta(XW)V_1Y^T)}{\partial W} + 2\,\frac{\partial\,\mathrm{trace}(\theta(XW)V_1V_0^T)}{\partial W} + \frac{\partial\,\mathrm{trace}(V_1V_1^T\theta(XW)^T\theta(XW))}{\partial W}\right)\\
&= \frac{1}{N}\Big(-2X^T\big[\theta'(XW)\otimes(YV_1^T)\big] + 2X^T\big[\theta'(XW)\otimes(V_0V_1^T)\big] + X^T\big[\theta'(XW)\otimes\big(2\,\theta(XW)V_1V_1^T\big)\big]\Big)\\
&= \frac{2}{N}X^T\Big[\theta'(XW)\otimes\Big(\big(V_0 + \theta(XW)V_1 - Y\big)V_1^T\Big)\Big],
\end{aligned}$$

because of the following identities (where $\otimes$ denotes the element-wise product):

$$\frac{\partial\,\mathrm{trace}(\theta(BX)A)}{\partial X} = B^T\big[\theta'(BX)\otimes A^T\big] \quad\text{and}\quad \frac{\partial\,\mathrm{trace}\big(A\,\theta(BX)^T\theta(BX)\big)}{\partial X} = B^T\big[\theta'(BX)\otimes\big(\theta(BX)(A + A^T)\big)\big].$$
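
The $W$-gradient above is easy to sanity-check numerically. The sketch below (not part of the original solution; it takes $\theta = \tanh$, random data, and $V_0$ as a bias row broadcast over the $N$ examples) compares the closed-form expression with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_hidden, d_out = 20, 3, 4, 2
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, d_hidden))
V0 = rng.normal(size=(1, d_out))            # bias row, broadcast over the N rows
V1 = rng.normal(size=(d_hidden, d_out))
Y = rng.normal(size=(N, d_out))
theta, dtheta = np.tanh, lambda S: 1.0 - np.tanh(S) ** 2

def Ein(W, V0, V1):
    Yhat = V0 + theta(X @ W) @ V1
    return np.sum((Y - Yhat) ** 2) / N

# analytic gradient with respect to W from the derivation above
S = X @ W
G_analytic = (2.0 / N) * X.T @ (dtheta(S) * ((V0 + theta(S) @ V1 - Y) @ V1.T))

# numerical gradient by central differences
h = 1e-6
G_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        G_numeric[i, j] = (Ein(Wp, V0, V1) - Ein(Wm, V0, V1)) / (2 * h)

print(np.max(np.abs(G_analytic - G_numeric)))   # should be ~1e-8 or smaller
```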

Problem 7.8

(a) By hypothesis, we know that $\{\eta_1, \eta_2, \eta_3\}$ with $\eta_1 < \eta_2 < \eta_3$ is a U-arrangement, which means that

$$E(\eta_2) < \min\{E(\eta_1), E(\eta_3)\}.$$

Since $E(\eta)$ is a quadratic curve, we know that it is decreasing (resp. increasing) to the left (resp. right) of its minimum $\bar\eta$. So if we assume that $\bar\eta < \eta_1$, we get that $E(\eta_1) \le E(\eta_2) \le E(\eta_3)$, which is impossible by definition of a U-arrangement; and if we assume that $\bar\eta > \eta_3$, we get that $E(\eta_1) \ge E(\eta_2) \ge E(\eta_3)$, which is also impossible by definition of a U-arrangement. Consequently, we have $\bar\eta \in [\eta_1, \eta_3]$.
(b) First, we solve the following linear system in $a$, $b$, and $c$:

$$\begin{cases} E(\eta_1) = a\eta_1^2 + b\eta_1 + c = e_1\\ E(\eta_2) = a\eta_2^2 + b\eta_2 + c = e_2\\ E(\eta_3) = a\eta_3^2 + b\eta_3 + c = e_3. \end{cases}$$

Let $D$ be the determinant of the system, which is

$$D = \begin{vmatrix} \eta_1^2 & \eta_1 & 1\\ \eta_2^2 & \eta_2 & 1\\ \eta_3^2 & \eta_3 & 1 \end{vmatrix},$$

where $D \neq 0$ since $\eta_1 < \eta_2 < \eta_3$; now we easily get that

$$a = \begin{vmatrix} e_1 & \eta_1 & 1\\ e_2 & \eta_2 & 1\\ e_3 & \eta_3 & 1 \end{vmatrix}\Big/ D = \frac{(e_1 - e_2)(\eta_1 - \eta_3) - (e_1 - e_3)(\eta_1 - \eta_2)}{D}$$

and

$$b = \begin{vmatrix} \eta_1^2 & e_1 & 1\\ \eta_2^2 & e_2 & 1\\ \eta_3^2 & e_3 & 1 \end{vmatrix}\Big/ D = \frac{-(e_1 - e_2)(\eta_1^2 - \eta_3^2) + (e_1 - e_3)(\eta_1^2 - \eta_2^2)}{D}.$$

Since the minimum of such a quadratic function is attained at $-b/(2a)$, we finally get

$$\bar\eta = \frac{1}{2}\left[\frac{(e_1 - e_2)(\eta_1^2 - \eta_3^2) - (e_1 - e_3)(\eta_1^2 - \eta_2^2)}{(e_1 - e_2)(\eta_1 - \eta_3) - (e_1 - e_3)(\eta_1 - \eta_2)}\right].$$

(c) We enumerate the four cases below.


1. If η̄ < η2 :
• If E(η̄) < E(η2 ), then {η1 , η̄, η2 } is a new U-arrangement.
• If E(η̄) > E(η2 ), then {η̄, η2 , η3 } is a new U-arrangement.
2. If η̄ > η2 :
• If E(η̄) < E(η2 ), then {η2 , η̄, η3 } is a new U-arrangement.
• If E(η̄) > E(η2 ), then {η1 , η2 , η̄} is a new U-arrangement.
(d) If $\bar\eta = \eta_2$, by continuity we are always able to find another $\eta_2'$ close to $\eta_2$ such that

$$E(\eta_2') < \min\{E(\eta_1), E(\eta_3)\}.$$

In this case, we can use this new $\eta_2'$ in place of $\eta_2$ and proceed with the algorithm.
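
The formula for $\bar\eta$ in part (b) translates directly into code. A minimal sketch (not part of the original solution): on an exactly quadratic $E$ it recovers the true minimizer in one step; on a general $E$, the cases of part (c) then shrink the U-arrangement around it.

```python
def parabola_min(eta1, e1, eta2, e2, eta3, e3):
    """Vertex of the parabola through (eta_i, e_i), as derived in part (b)."""
    num = (e1 - e2) * (eta1**2 - eta3**2) - (e1 - e3) * (eta1**2 - eta2**2)
    den = (e1 - e2) * (eta1 - eta3) - (e1 - e3) * (eta1 - eta2)
    return 0.5 * num / den

# check on an exactly quadratic E: the minimum of 3*(eta - 1.7)^2 + 5 is at eta = 1.7
E = lambda eta: 3 * (eta - 1.7) ** 2 + 5
eta1, eta2, eta3 = 0.0, 1.0, 4.0          # a U-arrangement: E(1.0) < min(E(0.0), E(4.0))
print(parabola_min(eta1, E(eta1), eta2, E(eta2), eta3, E(eta3)))   # 1.7
```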

Problem 7.9

(a) Since $w$ is uniformly sampled in the unit cube, we may write that

$$P[E(w) \le E(w^*) + \epsilon] = P\Big[\tfrac{1}{2}(w - w^*)^TH(w - w^*) \le \epsilon\Big] = \int_{(w - w^*)^TH(w - w^*) \le 2\epsilon}dw_1\cdots dw_d = \int_{x^THx \le 2\epsilon}\underbrace{\Big|\det\tfrac{\partial w}{\partial x}\Big|}_{=1}\,dx_1\cdots dx_d,$$

where we have made the change of variables $x = w - w^*$. As $H$ is positive definite and symmetric, we know that there exists an orthogonal matrix $A$ such that $H = A\,\mathrm{diag}(\lambda_1^2, \cdots, \lambda_d^2)\,A^T$. Thus, if we use $y = A^Tx$ as a change of variables, we now get that

$$P[E(w) \le E(w^*) + \epsilon] = \int_{x^THx \le 2\epsilon}dx_1\cdots dx_d = \int_{y^T\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)\,y \le 2\epsilon}\underbrace{\Big|\det\tfrac{\partial x}{\partial y}\Big|}_{=|\det A| = 1}\,dy_1\cdots dy_d.$$

We now use a third change of variables, $z = \mathrm{diag}(\lambda_1, \cdots, \lambda_d)\,y$; in this case we obtain

$$P[E(w) \le E(w^*) + \epsilon] = \int_{y^T\mathrm{diag}(\lambda_1^2,\cdots,\lambda_d^2)\,y \le 2\epsilon}dy_1\cdots dy_d = \int_{z^Tz \le 2\epsilon}\underbrace{\Big|\det\tfrac{\partial y}{\partial z}\Big|}_{=\frac{1}{|\lambda_1\cdots\lambda_d|} = \frac{1}{\sqrt{\det H}}}\,dz_1\cdots dz_d = \frac{1}{\sqrt{\det H}}\int_{z^Tz \le 2\epsilon}dz_1\cdots dz_d = \frac{S_d(2\epsilon)}{\sqrt{\det H}}.$$

(b) It is clear that

$$P[E(w_{min}) > E(w^*) + \epsilon] = P\big[(E(w_1) > E(w^*) + \epsilon) \cap \cdots \cap (E(w_N) > E(w^*) + \epsilon)\big] = \prod_{i=1}^{N}P[E(w_i) > E(w^*) + \epsilon] = \big(1 - P[E(w_1) \le E(w^*) + \epsilon]\big)^N = \left(1 - \frac{S_d(2\epsilon)}{\sqrt{\det H}}\right)^N.$$

We may write that

$$S_d(2\epsilon) = \frac{\pi^{d/2}(2\epsilon)^{d/2}}{\Gamma(d/2 + 1)} \approx \frac{1}{\sqrt{\pi d}}\left(\frac{8e\pi\epsilon}{d}\right)^{d/2};$$

moreover, we also have that $\bar\lambda^d = \det H$. Consequently, we may write that

$$P[E(w_{min}) > E(w^*) + \epsilon] = \left(1 - \frac{S_d(2\epsilon)}{\sqrt{\det H}}\right)^N \approx \left(1 - \frac{1}{\sqrt{\pi d}}\,\frac{1}{d^{d/2}}\underbrace{\left(\frac{8e\pi\epsilon}{\bar\lambda}\right)^{d/2}}_{\approx\,\mu^d}\right)^N \approx \left(1 - \frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^{d}\right)^N.$$

(c) From point (b), we know that

$$P[E(w_{min}) > E(w^*) + \epsilon] \approx \left(1 - \frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^{d}\right)^N = e^{N\ln\left(1 - \frac{1}{\sqrt{\pi d}}(\mu/\sqrt{d})^d\right)} \approx e^{-N\frac{1}{\sqrt{\pi d}}(\mu/\sqrt{d})^d} \approx e^{\frac{1}{\sqrt{\pi d}}\log\eta},$$

because we have

$$-N\frac{1}{\sqrt{\pi d}}\left(\frac{\mu}{\sqrt{d}}\right)^{d} \approx \frac{1}{\sqrt{\pi d}}\log\eta.$$

In conclusion, we get that

$$P[E(w_{min}) > E(w^*) + \epsilon] \approx \eta^{\frac{1}{\sqrt{\pi d}}} \ge \eta$$

since $0 \le \eta \le 1$; thus we may now write that

$$P[E(w_{min}) \le E(w^*) + \epsilon] \le 1 - \eta.$$
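
Part (a) can be checked by simulation. A sketch (not part of the original solution), for a small $d$ and an $H$ scaled so that the level set $\{w : E(w) \le E(w^*) + \epsilon\}$ fits inside the unit cube:

```python
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(1)
d, eps, n_samples = 2, 0.01, 400_000

# a symmetric positive definite H, shifted so the eps-level set fits inside the cube
A = rng.normal(size=(d, d))
H = A @ A.T + 10.0 * np.eye(d)
w_star = np.full(d, 0.5)                           # minimum at the centre of [0,1]^d

# E(w) - E(w*) = (1/2) (w - w*)^T H (w - w*);  w uniform in the unit cube
W = rng.uniform(size=(n_samples, d))
diff = W - w_star
vals = 0.5 * np.einsum('ni,ij,nj->n', diff, H, diff)
p_mc = np.mean(vals <= eps)

# closed form S_d(2 eps) / sqrt(det H), with S_d the volume of the ball of radius sqrt(2 eps)
S_d = pi ** (d / 2) * (2 * eps) ** (d / 2) / gamma(d / 2 + 1)
p_formula = S_d / np.sqrt(np.linalg.det(H))

print(p_mc, p_formula)    # the two values should agree up to Monte Carlo noise
```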

Problem 7.10

If we initialize all weights to $0$, we have $W^{(l)} = 0$ for $l = 1, \cdots, L$. Consequently, we have that

$$s^{(l)} = \big(W^{(l)}\big)^Tx^{(l-1)} = 0$$

as well; so we get

$$x^{(l)} = \theta\big(s^{(l)}\big) = \theta(0) = \tanh(0) = 0$$

for $l = 1, \cdots, L$. This impacts the gradient in the following way: we may write

$$\frac{\partial e}{\partial W^{(l)}} = x^{(l-1)}\big(\delta^{(l)}\big)^T = 0$$

for $l = 2, \cdots, L$. To see what happens when $l = 1$, we first note that

$$\delta_j^{(1)} = \theta'\big(s_j^{(1)}\big)\sum_{k=1}^{d^{(2)}}\underbrace{w_{jk}^{(2)}}_{=0}\,\delta_k^{(2)} = 0$$

for all $j$, which means that $\partial e/\partial W^{(1)} = 0$. In conclusion, we have in this case that

$$\frac{\partial E_{in}}{\partial W^{(l)}} = \frac{1}{N}\sum_n\frac{\partial e_n}{\partial W^{(l)}} = 0$$

for $l = 1, \cdots, L$. If we use gradient descent to update the weights, we have that

$$W^{(l)} \leftarrow W^{(l)} - \eta\frac{\partial E_{in}}{\partial W^{(l)}} = W^{(l)};$$

and if we use stochastic gradient descent to update the weights, we have that

$$W^{(l)} \leftarrow W^{(l)} - \eta\frac{\partial e_n}{\partial W^{(l)}} = W^{(l)}.$$

In each case, the weights remain constant (equal to $0$), which is definitely not what we want when we are searching for an optimum.
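
A quick numerical illustration (a sketch, not the book's code; bias inputs are omitted so that the argument above applies to every weight): with zero initialization and $\tanh$ units, every finite-difference partial derivative of the point-wise error vanishes, so neither GD nor SGD ever moves the weights.

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=5), 1.0               # one training example
W1 = np.zeros((5, 4))                        # layer 1 (no bias inputs in this sketch)
W2 = np.zeros((4, 1))                        # layer 2 (output)

def e(W1, W2):
    x1 = np.tanh(W1.T @ x)                   # x^(1) = theta(s^(1)) = 0 when W1 = 0
    out = np.tanh(W2.T @ x1)[0]              # x^(2) = 0 when W2 = 0
    return (out - y) ** 2

# finite-difference derivative of e with respect to every single weight
h = 1e-6
grads = []
for W in (W1, W2):
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h; Wm[idx] -= h
        args_p = (Wp, W2) if W is W1 else (W1, Wp)
        args_m = (Wm, W2) if W is W1 else (W1, Wm)
        grads.append((e(*args_p) - e(*args_m)) / (2 * h))

print(max(abs(g) for g in grads))            # 0.0: the gradient vanishes at W = 0,
                                             # so GD/SGD leaves the zero weights unchanged
```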

Problem 7.12

From Problem 7.11, the gradient descent update step may be written as

$$w_{t+1} = w_t - \eta_tH(w_t - w^*);$$

if we subtract $w^*$ from both sides, we see that

$$(w_{t+1} - w^*) = (w_t - w^*) - \eta_tH(w_t - w^*) \iff \epsilon_{t+1} = \epsilon_t - \eta_tH\epsilon_t \iff \epsilon_{t+1} = (I - \eta_tH)\epsilon_t,$$

where $\epsilon_t = w_t - w^*$. Since $H$ is symmetric, one can form an orthonormal basis with its eigenvectors. Projecting $\epsilon_t$ and $\epsilon_{t+1}$ onto this basis, we see that in this basis each component decouples from the others, and letting $\epsilon_t(\alpha)$ be the $\alpha$th component in this basis, we see that

$$\epsilon_{t+1}(\alpha) = (1 - \eta_t\lambda_\alpha)\,\epsilon_t(\alpha),$$

where $\lambda_\alpha$ is a positive eigenvalue of $H$ (which is positive definite). Now, by proceeding recursively and by using the Taylor expansion of $\ln(1 - x)$, we are able to write that

$$\epsilon_{t+1}(\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}(1 - \eta_i\lambda_\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}e^{\ln(1 - \eta_i\lambda_\alpha)} = \epsilon_1(\alpha)\,e^{\sum_{i=1}^{t}\ln(1 - \eta_i\lambda_\alpha)} \approx \epsilon_1(\alpha)\,e^{\sum_{i=1}^{t}\left(-\eta_i\lambda_\alpha - \frac12\lambda_\alpha^2\eta_i^2\right)} = \epsilon_1(\alpha)\,e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i - \frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2},$$

since $\eta_t \to 0$ guarantees that $1 - \eta_t\lambda_\alpha > 0$ (for $t$ large enough). However, since $\sum_t\eta_t = +\infty$ and $\sum_t\eta_t^2 < \infty$, we get that

$$e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i} \to 0 \quad\text{and}\quad e^{-\frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2} \le C,$$

which gives us

$$\prod_{i=1}^{t}(1 - \eta_i\lambda_\alpha) \approx \underbrace{e^{-\lambda_\alpha\sum_{i=1}^{t}\eta_i}}_{\to 0}\,\underbrace{e^{-\frac12\lambda_\alpha^2\sum_{i=1}^{t}\eta_i^2}}_{\le C} \to 0.$$

In conclusion, we have that

$$w_{t+1}(\alpha) - w^*(\alpha) = \epsilon_1(\alpha)\prod_{i=1}^{t}(1 - \eta_i\lambda_\alpha) \to 0$$

for all $\alpha$.
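
A short simulation of the conclusion (a sketch, not part of the original solution; the learning-rate schedule $\eta_t = \eta_0/t^{0.7}$ is one arbitrary choice satisfying $\sum_t\eta_t = \infty$ and $\sum_t\eta_t^2 < \infty$):

```python
import numpy as np

H = np.diag([1.0, 2.0, 4.0])                 # positive definite Hessian (eigenvalues 1, 2, 4)
w_star = np.array([0.3, -1.2, 0.7])
w = np.zeros(3)                              # w_1

eta0 = 0.2                                   # keeps 1 - eta_t * lambda_alpha > 0 (lambda_max = 4)
for t in range(1, 20001):
    eta_t = eta0 / t ** 0.7                  # sum eta_t diverges, sum eta_t^2 converges
    w = w - eta_t * H @ (w - w_star)         # the update w_{t+1} = w_t - eta_t H (w_t - w*)

print(np.linalg.norm(w - w_star))            # small: every component epsilon_t(alpha) -> 0
```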

Problem 7.13

(a) In general, the central finite-difference approximation to the first-order partial derivatives of a function $f(x, y)$ is given by

$$\frac{\partial f}{\partial x} \approx \frac{f(x + h, y) - f(x - h, y)}{2h} \quad\text{and}\quad \frac{\partial f}{\partial y} \approx \frac{f(x, y + h) - f(x, y - h)}{2h}.$$

If we apply the same idea to the function $E(w_1, w_2)$, we get

$$\frac{\partial E}{\partial w_1} \approx \frac{E(w_1 + h, w_2) - E(w_1 - h, w_2)}{2h} \quad\text{and}\quad \frac{\partial E}{\partial w_2} \approx \frac{E(w_1, w_2 + h) - E(w_1, w_2 - h)}{2h}.$$

If we now consider the second-order partial derivatives, we may write that

$$\frac{\partial^2E}{\partial w_1^2} \approx \frac{\frac{\partial E}{\partial w_1}(w_1 + h, w_2) - \frac{\partial E}{\partial w_1}(w_1 - h, w_2)}{2h} \approx \frac{\frac{E(w_1 + 2h, w_2) - E(w_1, w_2)}{2h} - \frac{E(w_1, w_2) - E(w_1 - 2h, w_2)}{2h}}{2h} = \frac{E(w_1 + 2h, w_2) + E(w_1 - 2h, w_2) - 2E(w_1, w_2)}{4h^2};$$

and, by the same reasoning, that

$$\frac{\partial^2E}{\partial w_2^2} \approx \frac{E(w_1, w_2 + 2h) + E(w_1, w_2 - 2h) - 2E(w_1, w_2)}{4h^2}.$$

It remains to compute the mixed second-order partial derivative; we have that

$$\frac{\partial^2E}{\partial w_1\partial w_2} \approx \frac{\frac{E(w_1 + h, w_2 + h) - E(w_1 - h, w_2 + h)}{2h} - \frac{E(w_1 + h, w_2 - h) - E(w_1 - h, w_2 - h)}{2h}}{2h} = \frac{E(w_1 + h, w_2 + h) + E(w_1 - h, w_2 - h) - E(w_1 + h, w_2 - h) - E(w_1 - h, w_2 + h)}{4h^2}.$$
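
These finite-difference formulas are easy to verify on a function whose second derivatives are known in closed form. A sketch (not part of the original solution), using the arbitrary test function $E(w_1, w_2) = \sin(w_1)\,w_2^2 + w_1^2w_2$:

```python
import numpy as np

E = lambda w1, w2: np.sin(w1) * w2**2 + w1**2 * w2
w1, w2, h = 0.7, -1.3, 1e-4

# second-order finite differences from Problem 7.13(a)
d2_w1w1 = (E(w1 + 2*h, w2) + E(w1 - 2*h, w2) - 2*E(w1, w2)) / (4 * h**2)
d2_w2w2 = (E(w1, w2 + 2*h) + E(w1, w2 - 2*h) - 2*E(w1, w2)) / (4 * h**2)
d2_w1w2 = (E(w1 + h, w2 + h) + E(w1 - h, w2 - h)
           - E(w1 + h, w2 - h) - E(w1 - h, w2 + h)) / (4 * h**2)

# exact second derivatives for this particular E
exact = (-np.sin(w1) * w2**2 + 2*w2,      # d^2 E / d w1^2
         2 * np.sin(w1),                  # d^2 E / d w2^2
         2 * np.cos(w1) * w2 + 2 * w1)    # d^2 E / d w1 d w2

errors = np.abs(np.array([d2_w1w1, d2_w2w2, d2_w1w2]) - np.array(exact))
print(errors.max())   # small (finite-difference truncation plus roundoff)
```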

Problem 7.14

(a) The Lagrangian for this constrained optimization problem is

$$\mathcal{L} = E_{in}(w_t) + g_t^T\Delta w + \frac12\Delta w^TH_t\Delta w + \alpha\big(\Delta w^T\Delta w - \eta^2\big) = E_{in}(w_t) + g_t^T\Delta w + \frac12\Delta w^T(H_t + 2\alpha I)\Delta w - \alpha\eta^2,$$

where $\alpha$ is the Lagrange multiplier.

(b) If we solve the previous expression for $\Delta w$ and $\alpha$, we get that

$$\nabla_{\Delta w}\mathcal{L} = g_t + (H_t + 2\alpha I)\Delta w = 0,$$

which gives us

$$\Delta w = -(H_t + 2\alpha I)^{-1}g_t$$

since $H_t + 2\alpha I$ is positive definite (hence invertible); and also that

$$\nabla_\alpha\mathcal{L} = \Delta w^T\Delta w - \eta^2 = 0,$$

which gives us

$$\Delta w^T\Delta w = \eta^2.$$

(c) We know that $\alpha$ satisfies the following chain of equations:

$$\begin{aligned}
(H_t + 2\alpha I)\Delta w = -g_t \;&\Rightarrow\; \Delta w^T(H_t + 2\alpha I)\Delta w = -\Delta w^Tg_t\\
&\Rightarrow\; \Delta w^TH_t\Delta w + 2\alpha\underbrace{\Delta w^T\Delta w}_{=\eta^2} = -\Delta w^Tg_t\\
&\Rightarrow\; 2\alpha\eta^2 = -\Delta w^TH_t\Delta w - \Delta w^Tg_t\\
&\Rightarrow\; \alpha = -\frac{1}{2\eta^2}\big(\Delta w^Tg_t + \Delta w^TH_t\Delta w\big).
\end{aligned}$$
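
Parts (b) and (c) suggest a simple way to compute the step in practice: choose $\alpha$ so that $\|\Delta w(\alpha)\| = \eta$. A sketch (not part of the original solution), assuming $H_t$ is positive definite and the unconstrained step $-H_t^{-1}g_t$ is longer than $\eta$, so a suitable $\alpha \ge 0$ exists and can be found by bisection ($\|\Delta w(\alpha)\|$ decreases with $\alpha$); the result also verifies the expression for $\alpha$ from part (c).

```python
import numpy as np

rng = np.random.default_rng(5)
d, eta = 4, 0.01
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                       # H_t, positive definite
g = rng.normal(size=d)                        # g_t

step = lambda alpha: -np.linalg.solve(H + 2 * alpha * np.eye(d), g)   # Delta w(alpha)

# bisection on alpha >= 0 so that ||Delta w(alpha)|| = eta
lo, hi = 0.0, 1.0
while np.linalg.norm(step(hi)) > eta:         # grow hi until the step is short enough
    hi *= 2
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.linalg.norm(step(mid)) > eta else (lo, mid)
alpha = 0.5 * (lo + hi)
dw = step(alpha)

print(np.linalg.norm(dw))                                    # ~ eta
print(alpha, -(dw @ g + dw @ H @ dw) / (2 * eta**2))         # the two values agree (part (c))
```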

Problem 7.15

(a) If we use the matrix inversion formula, we immediately get that

$$H_{k+1}^{-1} = \big(H_k + g_{k+1}g_{k+1}^T\big)^{-1} = H_k^{-1} - \frac{H_k^{-1}g_{k+1}g_{k+1}^TH_k^{-1}}{1 + g_{k+1}^TH_k^{-1}g_{k+1}}.$$
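
This is the Sherman-Morrison identity; a quick numerical check (a sketch, not part of the original solution):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
A = rng.normal(size=(d, d))
H_k = A @ A.T + np.eye(d)                     # an invertible (here SPD) H_k
g = rng.normal(size=d)                        # g_{k+1}

H_k_inv = np.linalg.inv(H_k)
lhs = np.linalg.inv(H_k + np.outer(g, g))                                   # (H_k + g g^T)^{-1}
rhs = H_k_inv - (H_k_inv @ np.outer(g, g) @ H_k_inv) / (1 + g @ H_k_inv @ g)

print(np.max(np.abs(lhs - rhs)))              # essentially machine precision
```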

Problem 7.16

(a) If we assume that $r > N/2$, then the number of points labelled $-1$ is less than or equal to $N/2$. So, in this case, it suffices to swap the roles of the $+1$ and $-1$ labels, and we may safely assume that $r \le N/2$.

(b) Since $r \le N/2$ and $d \ge 1$, we also have that

$$q = \left\lfloor\frac{r}{d}\right\rfloor \le \frac{N}{2}.$$

(c) Let us consider a subset of $k \le d$ points. If $k = d$, the hyperplane containing those points does not contain any other point: if it did, we would have $d + 1$ points in a $(d-1)$-dimensional hyperplane, which is impossible by hypothesis. Now, if $k < d$, we can find an infinite number of hyperplanes containing those $k$ points $x_1, \cdots, x_k$; it remains to find one that does not contain any other point. If we supplement $x_1, \cdots, x_k$ with $d - k$ other points $x_{k+1}, \cdots, x_d$, we can always find a hyperplane $w^Tx + b = 0$ such that

$$\begin{cases} w^Tx_i + b = 0, & i = 1, \cdots, k\\ w^Tx_j + b = 1, & j = k+1, \cdots, d, \end{cases}$$

since this is a linear system of $d$ equations in $d + 1$ unknowns. In conclusion, $w^Tx + b = 0$ is a hyperplane containing $x_1, \cdots, x_k$ and not $x_{k+1}, \cdots, x_d$; thus it cannot contain any other points by hypothesis.
(d) Let $w_i^Tx + b_i = 0$ be the hyperplane containing the points in $D_i$ and no others. We define $h_i$ by

$$h_i = \frac{\min_n|w_i^Tx_n + b_i|}{2},$$

where the minimum is taken over the $x_n \notin D_i$; in this case, for $x_n \notin D_i$, we have

$$|w_i^Tx_n + b_i| \ge \min_n|w_i^Tx_n + b_i| > h_i,$$

and for $x_n \in D_i$, we have

$$|w_i^Tx_n + b_i| = 0 < h_i.$$

(e) We consider $x_n \in D_i$; in this case we have

$$-h_i < w_i^Tx_n + b_i < h_i \iff -w_i^Tx_n - b_i + h_i > 0 \;\text{ and }\; w_i^Tx_n + b_i + h_i > 0.$$

Consequently, we get

$$\mathrm{sign}(-w_i^Tx_n - b_i + h_i) + \mathrm{sign}(w_i^Tx_n + b_i + h_i) = 2.$$

Now, we consider $x_n \notin D_i$; we have

$$w_i^Tx_n + b_i > h_i \;\text{ or }\; w_i^Tx_n + b_i < -h_i \iff -w_i^Tx_n - b_i + h_i < 0 \;\text{ or }\; w_i^Tx_n + b_i + h_i < 0.$$

Consequently, we get

$$\mathrm{sign}(-w_i^Tx_n - b_i + h_i) + \mathrm{sign}(w_i^Tx_n + b_i + h_i) = 0.$$

(f) The explicit formula for our 2-layer MLP is given by

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{q}\big[\mathrm{sign}(w_i^Tx + b_i + h_i) + \mathrm{sign}(-w_i^Tx - b_i + h_i)\big] - 1\right).$$

It is easy to check that this MLP implements our arbitrary dichotomy. This shows that this MLP can classify any set of $N = md$ points, thus

$$d_{VC} \ge md.$$
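
For $d = 2$ the whole construction can be carried out explicitly. A sketch (not part of the original solution): random points in general position, $r$ of them labelled $+1$ and grouped into pairs $D_i$; each pair gets the line through it and the margin $h_i$ of part (d), and the output is the formula of part (f).

```python
import numpy as np

rng = np.random.default_rng(7)
d, m = 2, 6
N = m * d                                     # N = md points
X = rng.normal(size=(N, d))                   # in general position with probability 1

labels = -np.ones(N, dtype=int)               # an arbitrary dichotomy with r = 4 <= N/2
pos = [1, 4, 7, 10]
labels[pos] = 1

groups = [pos[i:i + d] for i in range(0, len(pos), d)]   # D_1, ..., D_q, each of size d

planes = []
for D in groups:
    p, p2 = X[D[0]], X[D[1]]
    w = np.array([-(p2 - p)[1], (p2 - p)[0]])            # normal to the segment p -> p2
    b = -w @ p                                            # so w^T x + b = 0 on the line through D
    others = [n for n in range(N) if n not in D]
    h = min(abs(w @ X[n] + b) for n in others) / 2        # the margin h_i of part (d)
    planes.append((w, b, h))

def f(x):                                                 # the 2-layer MLP of part (f)
    s = sum(np.sign(w @ x + b + h) + np.sign(-w @ x - b + h) for w, b, h in planes)
    return int(np.sign(s - 1))

print(all(f(X[n]) == labels[n] for n in range(N)))        # True: the dichotomy is implemented
```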
