NNLS1 2019 HW4 Solutions
Prayag
Neural networks and learning systems-I
April 23, 2019
Problem 6.1.
Solution. The given data x is linearly separable and the separating hyperplane is given by w^T x + b = 0, where w denotes the weight vector and b denotes the bias. The hyperplane is said to correspond to a canonical pair (w, b) if, for the set of input patterns {x_i}_{i=1}^N, it satisfies
\[
\min_{i=1,\ldots,N} \left| w^T x_i + b \right| = 1. \qquad (1)
\]
Let g(x_i) = w^T x_i + b, where g(x_i) gives an algebraic measure of the distance of the input data x_i from the separating hyperplane. We know that any point x_i can be decomposed into two components as given below:
\[
x_i = x_p + r\, \frac{w}{\|w\|},
\]
where x_p is the normal projection of the point x_i onto the hyperplane and r is the distance of the data point from the hyperplane.
\[
\begin{aligned}
g(x_i) &= w^T \left( x_p + r\, \frac{w}{\|w\|} \right) + b \\
&= w^T x_p + r\, \frac{w^T w}{\|w\|} + b \\
&= w^T x_p + b + r \|w\| \\
&= g(x_p) + r \|w\| \\
&= r \|w\| \qquad \text{since } g(x_p) = 0 \\
\implies \quad r &= \frac{g(x_i)}{\|w\|}.
\end{aligned}
\]
From (1), we know that there exists at least one x_i such that w^T x_i + b = 1 or w^T x_i + b = −1, so that g(x_i) = ±1 for those points. Therefore
\[
r =
\begin{cases}
\dfrac{1}{\|w\|} & \text{if Class 1,} \\[2mm]
-\dfrac{1}{\|w\|} & \text{if Class } -1.
\end{cases}
\]
The optimal separation between the two classes is given by
\[
2r = \frac{2}{\|w\|}.
\]
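This result can be checked numerically. The sketch below (assuming a toy separable data set and scikit-learn's SVC with a large C to emulate the hard margin; none of these values come from the question) compares 2/||w|| with twice the distance of the closest training point to the learned hyperplane.

import numpy as np
from sklearn.svm import SVC

# two linearly separable clusters (assumed toy data, not from the question)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.3, size=(20, 2)),
               rng.normal([+2, +2], 0.3, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin
w, b = clf.coef_.ravel(), clf.intercept_[0]

# signed distance r = g(x) / ||w|| for every point; the closest points sit at |r| = 1/||w||
r = (X @ w + b) / np.linalg.norm(w)
print("2 / ||w||       =", 2 / np.linalg.norm(w))
print("2 * min_i |r_i| =", 2 * np.abs(r).min())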
Problem 6.3.
Solution. Given problem:
\[
\begin{aligned}
\min_{w,\,b,\,\zeta} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i \\
\text{subject to} \quad & \zeta_i \ge 0 \qquad \forall i = 1, 2, \ldots, N, \\
& d_i (w^T x_i + b) \ge 1 - \zeta_i \qquad \forall i = 1, 2, \ldots, N.
\end{aligned}
\]
Writing this in the standard form needed to set up the Lagrangian, we get
\[
\begin{aligned}
\min_{w,\,b,\,\zeta} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i \\
\text{subject to} \quad & \zeta_i \ge 0 \qquad \forall i = 1, 2, \ldots, N, \\
& d_i (w^T x_i + b) - 1 + \zeta_i \ge 0 \qquad \forall i = 1, 2, \ldots, N.
\end{aligned}
\]
The Lagrangian can now be written as follows, using the Lagrange multipliers λ_i and α_i:
\[
L = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \lambda_i \zeta_i - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 + \zeta_i \right]
\]
\[
L = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \lambda_i \zeta_i - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - \sum_{i=1}^{N} \alpha_i d_i b + \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \alpha_i \zeta_i.
\]
Setting the derivatives of L with respect to w, b and ζ_i to zero gives w = Σ_{i=1}^{N} α_i d_i x_i, Σ_{i=1}^{N} α_i d_i = 0 and C − λ_i − α_i = 0. Substituting these back,
\[
\begin{aligned}
L &= -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j\, x_i^T x_j
   + \sum_{i=1}^{N} \underbrace{(C - \lambda_i - \alpha_i)}_{=\,0} \zeta_i
   - b \underbrace{\sum_{i=1}^{N} \alpha_i d_i}_{=\,0}
   + \sum_{i=1}^{N} \alpha_i \\
  &= -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j\, x_i^T x_j + \sum_{i=1}^{N} \alpha_i.
\end{aligned}
\]
From this, the dual can be written as follows:
\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j\, x_i^T x_j \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i d_i = 0, \\
& C - \lambda_i - \alpha_i = 0 \qquad \forall i = 1, 2, \ldots, N, \\
& \alpha_i \ge 0, \\
& \lambda_i \ge 0.
\end{aligned}
\]
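The dual constraints can be illustrated numerically. The sketch below (assuming toy blob data and scikit-learn's SVC; dual_coef_ stores d_i α_i for the support vectors, while α_i = 0 elsewhere) checks that Σ_i α_i d_i = 0 and that every α_i lies in [0, C].

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# overlapping classes so that the slack variables actually matter (assumed toy data)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
d = np.where(y == 0, -1, 1)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, d)

# dual_coef_ holds d_i * alpha_i for the support vectors (alpha_i = 0 elsewhere)
alpha_d = clf.dual_coef_.ravel()
print("sum_i alpha_i d_i =", alpha_d.sum())                        # ~ 0
print("all alpha_i <= C  :", bool(np.all(np.abs(alpha_d) <= C + 1e-9)))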
Problem 6.11.
Solution. It is given that a joint probability density function p_{X1,X2}(x1, x2) over an H-by-H product space is said to be a P-matrix provided it satisfies the finitely positive semidefinite property. The matrix P is positive semidefinite if z^T P z ≥ 0 for every non-zero column vector z.
Let us consider the simple case of a two-element set X = [X1, X2] of random variables.
Case 1: Are all P -kernels joint distributions?
From a given P-kernel P(x, y), we can generate a normalized kernel, the P̂-kernel, provided P satisfies
\[
\sum_{x \in X} \sum_{y \in X} P(x, y) = C,
\]
where C is some constant such that C < ∞. We define the P̂-kernel as P̂(x, y) = (1/C) P(x, y). This definition still satisfies the properties of a P-matrix, since we have only scaled the elements. Since P̂(x, y) is also a joint distribution, we can say that every P-kernel is, up to this normalization, a joint distribution.
Case 2: Are all joint distributions P -kernels?
Considering the two-element case, let us create a joint distribution and verify whether it satisfies the properties of a P-kernel. The joint probability matrix for the two-element case would be given by
\[
P_{X,Y} =
\begin{bmatrix}
p(x_1, x_1) & p(x_1, x_2) \\
p(x_2, x_1) & p(x_2, x_2)
\end{bmatrix}.
\]
Considering a particular case where p(x_1, x_1) = 0, p(x_1, x_2) = 0.5, p(x_2, x_1) = 0.5, p(x_2, x_2) = 0, we get
\[
P_{X,Y} =
\begin{bmatrix}
0 & 0.5 \\
0.5 & 0
\end{bmatrix}.
\]
Solving for the eigenvalues, we get λ = ±0.5. From the given definition, the P-matrix must
be positive semidefinite, but an eigenvalue in the above case is negative. Therefore not all
joint distributions are P -kernels.
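A quick numerical check of this counterexample (a sketch using numpy; the test vector z is an arbitrary choice):

import numpy as np

# joint distribution from the counterexample above
P = np.array([[0.0, 0.5],
              [0.5, 0.0]])

print("eigenvalues:", np.linalg.eigvalsh(P))   # [-0.5, 0.5] -> not positive semidefinite

# equivalently, z^T P z < 0 for some z
z = np.array([1.0, -1.0])
print("z^T P z =", z @ P @ z)                  # -1.0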
Problem 6.21.
Solution. Given that k(x_i, ·) and k(x_j, ·) denote a pair of kernels, where i, j = 1, 2, . . . , N and the vectors have the same dimensionality, we need to show that
\[
\langle k(x_i, \cdot),\, k(x_j, \cdot) \rangle = k(x_i, x_j). \qquad (2)
\]
Let f(·) and g(·) be two functions defined over a vector space F such that
\[
f(\cdot) = \sum_{i=1}^{N} a_i\, k(x_i, \cdot), \qquad (3)
\]
\[
g(\cdot) = \sum_{j=1}^{N} b_j\, k(x_j, \cdot). \qquad (4)
\]
Evaluating (3) at x_j and (4) at x_i gives
\[
f(x_j) = \sum_{i=1}^{N} a_i\, k(x_i, x_j), \qquad (5)
\]
\[
g(x_i) = \sum_{j=1}^{N} b_j\, k(x_j, x_i). \qquad (6)
\]
We know that
\[
k(x_i, x_j) = \phi^T(x_i)\, \phi(x_j). \qquad (7)
\]
Using (7) in (5) and (6), we get
\[
f(x_j) = \sum_{i=1}^{N} a_i\, \phi^T(x_i)\, \phi(x_j), \qquad (8)
\]
\[
g(x_i) = \sum_{j=1}^{N} b_j\, \phi^T(x_i)\, \phi(x_j). \qquad (9)
\]
In other words, f and g can be represented in the feature space as
\[
f = \sum_{i=1}^{N} a_i\, \phi(x_i), \qquad (10)
\]
\[
g = \sum_{j=1}^{N} b_j\, \phi(x_j). \qquad (11)
\]
Taking the inner product using (10) and (11), we get
\[
\begin{aligned}
\langle f, g \rangle &= \left( \sum_{i=1}^{N} a_i\, \phi(x_i) \right)^{T} \left( \sum_{j=1}^{N} b_j\, \phi(x_j) \right) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} a_i b_j\, \phi^T(x_i)\, \phi(x_j) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} a_i b_j\, k(x_i, x_j). \qquad (12)
\end{aligned}
\]
On the other hand, using the definitions (3) and (4) directly,
\[
\begin{aligned}
\langle f, g \rangle &= \left\langle \sum_{i=1}^{N} a_i\, k(x_i, \cdot),\; \sum_{j=1}^{N} b_j\, k(x_j, \cdot) \right\rangle \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} a_i b_j\, \langle k(x_i, \cdot),\, k(x_j, \cdot) \rangle. \qquad (13)
\end{aligned}
\]
Comparing (12) and (13), we get ⟨k(x_i, ·), k(x_j, ·)⟩ = k(x_i, x_j), which is (2).
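Equation (7) is the step that ties the abstract inner product back to a concrete computation, so it is worth a numerical check. The sketch below assumes the homogeneous quadratic kernel k(x, y) = (x^T y)^2 in two dimensions, whose explicit feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) is known, and verifies k(x_i, x_j) = φ^T(x_i) φ(x_j).

import numpy as np

# explicit feature map of the homogeneous quadratic kernel k(x, y) = (x^T y)^2 in 2-D
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    return (x @ y) ** 2

rng = np.random.default_rng(1)
xi, xj = rng.standard_normal(2), rng.standard_normal(2)

print("k(xi, xj)         =", k(xi, xj))
print("phi(xi)^T phi(xj) =", phi(xi) @ phi(xj))   # identical up to floating-point error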
Problem 6.25.
Solution. (a) A data set of three concentric circles was generated with the radii mentioned in the question. The generated data is shown below.
(b) The support vector machine was trained with C = 500 and the decision boundary obtained is as given below.
(c) The network was tested and an accuracy of 68% was obtained. We could argue that
the value of C might play a role in the accuracy of the SVM.
(d) The network was trained with C = 100 and C = 2500. The decision boundaries
obtained are as given below.
It is observed that for the case of C = 100, the network accuracy was 62% and for
C = 2500, the network accuracy was 64%.
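A minimal sketch of how this experiment can be reproduced, assuming radii of 1, 2 and 3 for the three circles, Gaussian noise on the radius, and an RBF-kernel SVC (the exact radii, noise level and kernel used in the runs above are not restated here, so the accuracies will not match those numbers exactly):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def ring(radius, n=300, noise=0.1):
    """Sample n points around a circle of the given radius."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + noise * rng.standard_normal(n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# three concentric circles, one class per circle (radii are assumed values)
X = np.vstack([ring(1.0), ring(2.0), ring(3.0)])
y = np.repeat([0, 1, 2], 300)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
for C in (100, 500, 2500):
    acc = SVC(kernel="rbf", C=C).fit(Xtr, ytr).score(Xte, yte)
    print(f"C = {C:5d}: test accuracy = {acc:.2f}")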
Problem 2.
Therefore, for the kernel to be valid, the condition to be satisfied is β1 > −β0 x^T x. However, we cannot comment on whether it satisfies Mercer's theorem or not.
We know that polynomial kernels of the type (x^T x + 1)^p always satisfy Mercer's theorem. Let us consider the Maclaurin series for the tanh function:
\[
\tanh(x) = x - \frac{x^3}{3} + \frac{2x^5}{15} - \frac{17x^7}{315} + \cdots
\]
Using the above, we get
\[
\tanh(\beta_0 x^T x + \beta_1) = (\beta_0 x^T x + \beta_1) - \frac{(\beta_0 x^T x + \beta_1)^3}{3} + \frac{2(\beta_0 x^T x + \beta_1)^5}{15} - \cdots
\]
Assuming β0 x^T x + β1 is small, we take the first-order approximation of the function to get
\[
\tanh(\beta_0 x^T x + \beta_1) \approx \beta_0 x^T x + \beta_1.
\]
Comparing this with the polynomial kernel, we see that for the values β0 = 1, β1 = 1 and p = 1, the kernel satisfies Mercer's theorem:
\[
\tanh(x^T x + 1) \approx x^T x + 1.
\]
Using the above idea, for positive values of β0, we can define a new variable x̃ = √β0 x. Taking the inner product, we see that
\[
\tilde{x}^T \tilde{x} = (\sqrt{\beta_0}\, x)^T (\sqrt{\beta_0}\, x) = \sqrt{\beta_0}\, \sqrt{\beta_0}\, x^T x = \beta_0\, x^T x.
\]
Therefore, we can absorb any positive β0 into the data using the above method. To see how this works, let us consider an example where x is a two-element vector:
\[
\beta_0\, x^T x = \beta_0 \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \beta_0 (x_1^2 + x_2^2),
\]
\[
\tilde{x}^T \tilde{x} = \begin{bmatrix} \sqrt{\beta_0}\, x_1 & \sqrt{\beta_0}\, x_2 \end{bmatrix} \begin{bmatrix} \sqrt{\beta_0}\, x_1 \\ \sqrt{\beta_0}\, x_2 \end{bmatrix} = \beta_0 x_1^2 + \beta_0 x_2^2 = \beta_0 (x_1^2 + x_2^2),
\]
\[
\tanh(\beta_0 x^T x + 1) \approx \tilde{x}^T \tilde{x} + 1.
\]
The above solution relies on the assumption that β0 x^T x + β1 is small. Therefore, for β0 > 0 and β1 < 0, it gives a better approximation of the kernel than the case of β0 > 0 and β1 > 0. The observations are summarized in the table below.
These observations are in line with the theoretical results in the paper "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods" by Lin, H.-T. and Lin, C.-J. Their results are given in the second table below.
Therefore, we see that the tanh(·) kernel satisfies Mercer's theorem better when β0 > 0 and β1 < 0.
β0   β1   Observations
+    −    A good approximation of the Mercer kernel, since β0 x^T x + β1 is small
+    +    Not as good an approximation, since β0 x^T x + β1 is larger than in the case above
−    +    Valid kernel only when β1 > −β0 x^T x, otherwise invalid
−    −    Not a valid kernel
β0   β1   Results
+    −    Kernel is conditionally positive semidefinite for small β1, and is similar to the RBF kernel for small β0
+    +    In general not as good as the (+, −) case
−    +    The objective value becomes −∞ once β1 is large
−    −    The objective value easily becomes −∞
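These sign combinations can also be probed empirically. The sketch below (random data and assumed values of β0 and β1, chosen only for illustration) builds the Gram matrix K_ij = tanh(β0 x_i^T x_j + β1) and reports its smallest eigenvalue, which indicates how far each case is from being positive semidefinite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))          # random inputs (assumed data)

for beta0, beta1 in [(0.1, -1.0), (0.1, 1.0), (-0.1, 1.0), (-0.1, -1.0)]:
    K = np.tanh(beta0 * (X @ X.T) + beta1)      # Gram matrix of the tanh kernel
    min_eig = np.linalg.eigvalsh(K).min()       # < 0 means K is not PSD
    print(f"beta0 = {beta0:+.1f}, beta1 = {beta1:+.1f}: "
          f"smallest eigenvalue = {min_eig:+.4f}")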
Problem 3.
Substituting y_i = w^T φ(x_i) and expressing the ε-insensitive loss L(d_i, y_i) through the slack variables ζ_i and ζ_i′, the problem in the standard form needed to set up the Lagrangian becomes
\[
\begin{aligned}
\min_{w,\,\zeta,\,\zeta'} \quad & \frac{1}{N} \sum_{i=0}^{N-1} \left( \zeta_i + \zeta_i' \right) \\
\text{subject to} \quad & c_0 - \|w\|^2 \ge 0, \\
& \epsilon + \zeta_i - d_i + w^T \phi(x_i) \ge 0, \\
& \epsilon + \zeta_i' - w^T \phi(x_i) + d_i \ge 0, \\
& \zeta_i \ge 0, \\
& \zeta_i' \ge 0, \qquad \forall i = 0, 1, 2, \ldots, N-1.
\end{aligned}
\]
The Lagrangian for the primal can now be set up as
\[
\begin{aligned}
L = {} & \frac{1}{N} \sum_{i=0}^{N-1} \left( \zeta_i + \zeta_i' \right) - \alpha \left( c_0 - \|w\|^2 \right) - \sum_{i=0}^{N-1} \beta_i \left( \epsilon + \zeta_i - d_i + w^T \phi(x_i) \right) \\
& - \sum_{i=0}^{N-1} \beta_i' \left( \epsilon + \zeta_i' - w^T \phi(x_i) + d_i \right) - \sum_{i=0}^{N-1} \gamma_i \zeta_i - \sum_{i=0}^{N-1} \gamma_i' \zeta_i', \qquad (14)
\end{aligned}
\]
where α, β_i, β_i′, γ_i, and γ_i′ are the Lagrange multipliers. Differentiating the Lagrangian and equating to zero, we get
\[
\frac{\partial L}{\partial w} = 0 \implies 2\alpha w - \sum_{i=0}^{N-1} \beta_i \phi(x_i) + \sum_{i=0}^{N-1} \beta_i' \phi(x_i) = 0 \implies w = \frac{1}{2\alpha} \sum_{i=0}^{N-1} \left( \beta_i - \beta_i' \right) \phi(x_i),
\]
\[
\frac{\partial L}{\partial \zeta_i} = 0 \implies \frac{1}{N} - \beta_i - \gamma_i = 0 \implies \beta_i + \gamma_i = \frac{1}{N},
\]
\[
\frac{\partial L}{\partial \zeta_i'} = 0 \implies \frac{1}{N} - \beta_i' - \gamma_i' = 0 \implies \beta_i' + \gamma_i' = \frac{1}{N}.
\]
Substituting the above values, we get
\[
\begin{aligned}
L = {} & \sum_{i=0}^{N-1} \left( \beta_i + \gamma_i - \beta_i - \gamma_i \right) \zeta_i + \sum_{i=0}^{N-1} \left( \beta_i' + \gamma_i' - \beta_i' - \gamma_i' \right) \zeta_i' - \alpha c_0 \\
& + \frac{\alpha}{4\alpha^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( \beta_i - \beta_i' \right) \left( \beta_j - \beta_j' \right) \phi(x_i)^T \phi(x_j)
  - \frac{1}{2\alpha} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( \beta_i - \beta_i' \right) \left( \beta_j - \beta_j' \right) \phi(x_i)^T \phi(x_j) \\
& - \epsilon \sum_{i=0}^{N-1} \left( \beta_i + \beta_i' \right) + \sum_{i=0}^{N-1} \left( \beta_i - \beta_i' \right) d_i \\
= {} & \underbrace{\sum_{i=0}^{N-1} \left( \beta_i + \gamma_i - \beta_i - \gamma_i \right) \zeta_i}_{=\,0} + \underbrace{\sum_{i=0}^{N-1} \left( \beta_i' + \gamma_i' - \beta_i' - \gamma_i' \right) \zeta_i'}_{=\,0} - \alpha c_0 \\
& + \left( \frac{1}{4\alpha} - \frac{1}{2\alpha} \right) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( \beta_i - \beta_i' \right) \left( \beta_j - \beta_j' \right) \phi(x_i)^T \phi(x_j)
  - \epsilon \sum_{i=0}^{N-1} \left( \beta_i + \beta_i' \right) + \sum_{i=0}^{N-1} \left( \beta_i - \beta_i' \right) d_i \\
= {} & -\alpha c_0 - \frac{1}{4\alpha} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( \beta_i - \beta_i' \right) \left( \beta_j - \beta_j' \right) \phi(x_i)^T \phi(x_j)
  - \epsilon \sum_{i=0}^{N-1} \left( \beta_i + \beta_i' \right) + \sum_{i=0}^{N-1} \left( \beta_i - \beta_i' \right) d_i.
\end{aligned}
\]
We observe that the Lagrange multipliers γ_i and γ_i′ appear only in the constraints β_i + γ_i = 1/N and β_i′ + γ_i′ = 1/N respectively. For γ_i ≥ 0 to hold we need 1/N − β_i ≥ 0, i.e., β_i ≤ 1/N, and for γ_i′ ≥ 0 to hold we need 1/N − β_i′ ≥ 0, i.e., β_i′ ≤ 1/N. Combining this with the constraints β_i ≥ 0 and β_i′ ≥ 0, we get 0 ≤ β_i ≤ 1/N and 0 ≤ β_i′ ≤ 1/N respectively. The dual problem can now be written as
\[
\begin{aligned}
\max_{\alpha,\,\beta,\,\beta'} \quad & -\alpha c_0 - \frac{1}{4\alpha} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( \beta_i - \beta_i' \right) \left( \beta_j - \beta_j' \right) \phi(x_i)^T \phi(x_j)
 - \epsilon \sum_{i=0}^{N-1} \left( \beta_i + \beta_i' \right) + \sum_{i=0}^{N-1} \left( \beta_i - \beta_i' \right) d_i \\
\text{subject to} \quad & \alpha \ge 0, \\
& 0 \le \beta_i \le \frac{1}{N} \qquad \forall i = 0, 1, 2, \ldots, N-1, \\
& 0 \le \beta_i' \le \frac{1}{N}.
\end{aligned}
\]
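The key substitution step above can be verified numerically. The sketch below uses arbitrary random values (Phi, beta, beta' and alpha are not solutions of the dual, just placeholders) and checks that, with w = (1/2α) Σ_i (β_i − β_i') φ(x_i), the terms α||w||² − w^T Σ_i (β_i − β_i') φ(x_i) collapse to −(1/4α) Σ_{i,j} (β_i − β_i')(β_j − β_j') φ(x_i)^T φ(x_j).

import numpy as np

# numerical check of the substitution step with arbitrary values (assumed, not optimal)
rng = np.random.default_rng(0)
N, dim = 6, 3
Phi = rng.standard_normal((N, dim))        # rows play the role of phi(x_i)
beta = rng.uniform(0, 1 / N, N)
betap = rng.uniform(0, 1 / N, N)
alpha = 0.7                                # any alpha > 0
delta = beta - betap

w = (Phi.T @ delta) / (2 * alpha)          # w = (1/2a) sum_i (b_i - b_i') phi(x_i)
lhs = alpha * (w @ w) - w @ (Phi.T @ delta)
rhs = -(delta @ (Phi @ Phi.T) @ delta) / (4 * alpha)
print(lhs, rhs)                            # identical up to floating-point error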