SVM Problems
Assignment 2 Solutions
1. (a) Construct a support vector machine that computes the XOR function. Use values of +1 and −1
(instead of 1 and 0) for both inputs and outputs, so that an example looks like ([−1, 1], 1) or
([−1, −1], −1). Map the input [x1 , x2 ] into a space consisting of x1 and x1 x2 . Draw the four
input points in this space, and the maximal margin separator. What is the margin? Now draw the
separating line back in the original Euclidean input space.
The examples map from [x1 , x2 ] to [x1 , x1 x2 ] coordinates as follows:
[−1, −1] (negative) maps to [−1, +1]
[−1, +1] (positive) maps to [−1, −1]
[+1, −1] (positive) maps to [+1, −1]
[+1, +1] (negative) maps to [+1, +1]
Thus, the positive examples have x1 x2 = − 1 and the negative examples have x1 x2 = + 1. The
maximum margin separator is the line x1 x2 = 0, with a margin of 1. The separator corresponds to
the x1 = 0 and x2 = 0 axes in the original space—this can be thought of as the limit of a hyperbolic
separator with two branches.
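As a quick numerical sanity check, here is a minimal sketch (assuming NumPy is available; the weight vector w = (0, −1) below is just one canonical choice representing the separator x1 x2 = 0):

```python
import numpy as np

# The four XOR examples and their +/-1 labels.
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]])
y = np.array([-1, +1, +1, -1])

# Feature map [x1, x2] -> [x1, x1*x2].
F = np.column_stack([X[:, 0], X[:, 0] * X[:, 1]])
print(F)

# The separator x1*x2 = 0 corresponds to w = (0, -1), b = 0 in the mapped
# space (sign chosen so positive examples get positive scores).
w, b = np.array([0.0, -1.0]), 0.0
print(y * (F @ w + b))   # all entries are 1, so the margin is 1/||w|| = 1
```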
(b) Recall that the equation of the circle in the 2-dimensional plane is (x1 − a)^2 + (x2 − b)^2 − r^2 = 0.
Expand out the formula and show that every circular region is linearly separable from the rest of
the plane in the feature space (x1, x2, x1^2, x2^2).
The circle equation expands into five terms,

x1^2 + x2^2 − 2a x1 − 2b x2 + (a^2 + b^2 − r^2) = 0,

which is a linear equation in the feature space (x1, x2, x1^2, x2^2) with weights w = (−2a, −2b, 1, 1) and intercept a^2 + b^2 − r^2. This shows that any circular boundary is linear in this feature space, so every circular region is linearly separable from the rest of the plane.
In fact, the three features x1, x2, x1^2 + x2^2 suffice.
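A small sketch of this expansion (assuming NumPy; the circle parameters a, b, r below are arbitrary), confirming that the sign of the linear function in (x1, x2, x1^2, x2^2) with the weights above agrees with the inside/outside test for the circle:

```python
import numpy as np

a, b, r = 1.0, -2.0, 1.5                       # an arbitrary circle

def phi(x):                                    # feature map (x1, x2, x1^2, x2^2)
    return np.array([x[0], x[1], x[0]**2, x[1]**2])

w = np.array([-2 * a, -2 * b, 1.0, 1.0])       # weights from the expansion
c = a**2 + b**2 - r**2                         # intercept

rng = np.random.default_rng(0)
for x in rng.uniform(-4, 4, size=(5, 2)):
    inside = (x[0] - a)**2 + (x[1] - b)**2 < r**2
    print(inside, w @ phi(x) + c < 0)          # the two booleans always match
```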
(c) Recall that the equation of an ellipse in the 2-dimensional plane is c(x1 − a)^2 + d(x2 − b)^2 − 1 = 0.
Show that an SVM using the polynomial kernel of degree 2, K(u, v) = (1 + u · v)^2, is equivalent
to a linear SVM in the feature space (1, x1, x2, x1^2, x2^2, x1 x2), and hence that SVMs with this kernel
can separate any elliptic region from the rest of the plane.
The (axis-aligned) ellipse equation expands into six terms,

c x1^2 + d x2^2 − 2ac x1 − 2bd x2 + 0 · x1 x2 + (ca^2 + db^2 − 1) = 0,

which is linear in the feature space (1, x1, x2, x1^2, x2^2, x1 x2). Since the degree-2 polynomial kernel K(u, v) = (1 + u · v)^2 is the inner product of exactly this set of features (up to constant scalings of individual features, which do not affect separability), an SVM with this kernel is a linear SVM in that feature space and can therefore separate any elliptic region from the rest of the plane.
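The kernel/feature-space equivalence can also be checked numerically. Below is a sketch (assuming NumPy) using the standard factorization (1 + u · v)^2 = φ(u) · φ(v) with φ(x) = (1, √2 x1, √2 x2, x1^2, x2^2, √2 x1 x2); the √2 scalings are not part of the feature list in the problem but do not change which regions are linearly separable:

```python
import numpy as np

def K(u, v):
    return (1 + u @ v) ** 2                    # degree-2 polynomial kernel

def phi(x):                                    # explicit feature map for 2-D inputs
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(1)
u, v = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(K(u, v), phi(u) @ phi(v)))    # True
```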
2. (12 pts) Logistic regression is a method of fitting a probabilistic classifier that gives soft linear thresholds.
(See Russell & Norvig, Section 18.6.4.) It is common to use logistic regression with an objective function
consisting of the negative log probability of the data plus an L2 regularizer:
L(w) = −Σ_{i=1}^{N} log [ 1 / (1 + e^{−y_i (w^T x_i + b)}) ] + λ||w||_2^2
(a) Find the partial derivatives ∂L/∂w_j.

First define the sigmoid function g(z) = 1/(1 + e^{−z}). Note that ∂g(z)/∂z = g(z)(1 − g(z)) and also ∂ log g(z)/∂z = (1/g(z)) · g(z)(1 − g(z)) = 1 − g(z). Then we get

∂L/∂w_j = −Σ_{i=1}^{N} y_i x_{ij} (1 − g(y_i (w^T x_i + b))) + 2λw_j
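A finite-difference check of this derivative; a minimal sketch assuming NumPy and the sigmoid convention g(z) = 1/(1 + e^{−z}) above, with random data used only for testing (the names loss and grad_w are illustrative):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))            # the sigmoid used above

def loss(w, b, X, y, lam):
    # L(w) = -sum_i log g(y_i (w^T x_i + b)) + lam * ||w||^2
    return -np.sum(np.log(g(y * (X @ w + b)))) + lam * (w @ w)

def grad_w(w, b, X, y, lam):
    # dL/dw_j = -sum_i y_i x_ij (1 - g(y_i (w^T x_i + b))) + 2 lam w_j
    return -(X.T @ (y * (1 - g(y * (X @ w + b))))) + 2 * lam * w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.choice([-1, 1], size=20)
w, b, lam, eps = rng.normal(size=3), 0.3, 0.1, 1e-6

numeric = np.array([(loss(w + eps * e, b, X, y, lam) -
                     loss(w - eps * e, b, X, y, lam)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad_w(w, b, X, y, lam), atol=1e-5))   # True
```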
(b) Find the second partial derivatives ∂²L/∂w_j∂w_k.

∂²L/∂w_j∂w_k = Σ_{i=1}^{N} y_i^2 x_{ij} x_{ik} g(y_i (w^T x_i + b)) (1 − g(y_i (w^T x_i + b))) + 2λδ_{jk}

where δ_{jk} = 1 if j = k and 0 otherwise.
(c) From these results, show that L(w) is a convex function.
Hint: A function L is convex if its Hessian (the matrix H of second derivatives with elements H_{jk} = ∂²L/∂w_j∂w_k) is positive semi-definite (PSD). A matrix H is PSD if and only if

a^T H a ≡ Σ_{j,k} a_j a_k H_{jk} ≥ 0 for every vector a.

Define P_i = g(y_i (w^T x_i + b)) (1 − g(y_i (w^T x_i + b))) and ρ_{ij} = y_i x_{ij} √P_i. Then

a^T H a = Σ_{i,j,k} a_j a_k x_{ij} x_{ik} y_i^2 P_i + 2λ Σ_j a_j^2 = Σ_i a^T ρ_i ρ_i^T a + 2λ Σ_j a_j^2 = Σ_i (a^T ρ_i)^2 + 2λ Σ_j a_j^2 ≥ 0

for λ ≥ 0.
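The same conclusion can be spot-checked numerically. A sketch (assuming NumPy and random test data) that builds the Hessian above and confirms its smallest eigenvalue is non-negative:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(w, b, X, y, lam):
    # H_jk = sum_i y_i^2 x_ij x_ik g_i (1 - g_i) + 2 lam delta_jk
    P = (y ** 2) * g(y * (X @ w + b)) * (1 - g(y * (X @ w + b)))
    return X.T @ (P[:, None] * X) + 2 * lam * np.eye(X.shape[1])

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), rng.choice([-1, 1], size=50)
H = hessian(rng.normal(size=4), 0.0, X, y, lam=0.1)
print(np.linalg.eigvalsh(H).min() >= 0)        # True: H is PSD (here even PD)
```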
3. Consider the following training data:

class   x1   x2
  +      1    1
  +      2    2
  +      2    0
  −      0    0
  −      1    0
  −      0    1
(a) Plot these six training points. Are the classes {+, −} linearly separable?
As seen in the plot (Figure 1), the classes are linearly separable.
Figure 1: Question 3
(b) Construct the weight vector of the maximum margin hyperplane by inspection and identify the
support vectors.
The maximum margin hyperplane has slope −1 and passes through the point (x1, x2) = (3/2, 0).
Therefore its equation is x1 + x2 = 3/2, and the weight vector is (1, 1)^T (up to scaling). The support vectors are (1, 1) and (2, 0) from the positive class and (1, 0) and (0, 1) from the negative class.
(c) If you remove one of the support vectors, does the size of the optimal margin decrease, stay the
same, or increase? In this specific dataset the optimal margin increases when we remove the
support vector (1, 0) or (1, 1), and stays the same when we remove either of the other two, (2, 0) or (0, 1).
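Parts (b) and (c) can be confirmed numerically. The sketch below assumes scikit-learn is installed and uses a very large C to approximate the hard-margin SVM; it is only a check, not part of the required solution:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)        # ~ (2, 2) and -3, i.e. x1 + x2 = 3/2
print(clf.support_vectors_)             # (1,0), (0,1), (1,1), (2,0)
print(1 / np.linalg.norm(clf.coef_))    # original margin 1/(2*sqrt(2)) ~ 0.354

# Part (c): drop one support vector at a time and refit.
for sv in [(1, 0), (1, 1), (2, 0), (0, 1)]:
    keep = [i for i, x in enumerate(X) if tuple(x) != sv]
    m = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])
    print(sv, 1 / np.linalg.norm(m.coef_))   # 0.5 for (1,0) and (1,1), ~0.354 otherwise
```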
(d) (Extra Credit) Is your answer to (c) also true for any dataset? Provide a counterexample or give
a short proof.
When we drop some constraints in a constrained maximization problem, we get an optimal value
that is at least as good as the previous one. This is because the set of candidates satisfying the original
(larger, stronger) set of constraints is a subset of the candidates satisfying the new (smaller, weaker)
set of constraints. So, under the weaker constraints, the old optimal solution is still available and
there may be additional solutions that are even better. In mathematical form: if S′ ⊆ S are two
constraint sets with feasible regions F(S) ⊆ F(S′), then max_{x ∈ F(S′)} f(x) ≥ max_{x ∈ F(S)} f(x).
Finally, note that in SVM problems we are maximizing the margin subject to the constraints
given by the training points. When we drop any of the constraints, the margin can increase or stay
the same, depending on the dataset. In general problems with realistic datasets, the margin is
expected to increase when we drop a support vector. The data in this problem is constructed
to demonstrate that, when removing some constraints, the margin can stay the same or increase
depending on the geometry.
Figure 2: Question 4
4. Consider the following one-dimensional training data:

class    x
  +      0
  −     −1
  −     +1
(c) Using the method of Lagrange multipliers, show that the solution is ŵ = (0, 0, −2)^T, b = 1 and the
margin is 1/||ŵ||_2.
For optimization problems with inequality constraints such as the above, we should apply the KKT
conditions, which generalize the method of Lagrange multipliers. However, this problem can be solved
more easily by noting that we have three vectors in the 3-dimensional feature space and all of them are support
vectors. Hence all 3 constraints hold with equality. Therefore we can apply the method of
Lagrange multipliers to
min_{w,b}  (1/2)||w||_2^2    (5)
s.t.  y_i (w^T φ(x_i) + b) = 1,  i = 1, 2, 3.    (6)
We have 3 constraints, and should have 3 Lagrange multipliers. We first form the Lagrangian
function L(w, λ) where λ = (λ1 , λ2 , λ3 ) as follows
L(w, λ) = (1/2)||w||_2^2 + Σ_{i=1}^{3} λ_i (y_i (w^T φ(x_i) + b) − 1)    (7)
and differentiate with respect to the optimization variables w and b and equate to zero:

∂L(w, λ)/∂w = w + Σ_{i=1}^{3} λ_i y_i φ(x_i) = 0    (8)

∂L(w, λ)/∂b = Σ_{i=1}^{3} λ_i y_i = 0 .    (9)
Using the data points φ(x_i), which for the quadratic feature map φ(x) = (1, √2 x, x^2)^T are φ(0) = (1, 0, 0)^T, φ(−1) = (1, −√2, 1)^T and φ(+1) = (1, √2, 1)^T, we get the following equations from the above lines:
w_1 + λ_1 − λ_2 − λ_3 = 0    (10)
w_2 + √2 λ_2 − √2 λ_3 = 0    (11)
w_3 − λ_2 − λ_3 = 0    (12)
λ_1 − λ_2 − λ_3 = 0    (13)
Using (10) and (13) we get w_1 = 0. Plugging this into the equality constraints of the optimization
problem, we get
b = 1    (15)
−√2 w_2 + w_3 + b = −1    (16)
+√2 w_2 + w_3 + b = −1    (17)
(16) and (17) imply that w_2 = 0 and w_3 = −2. Therefore the optimal weights are ŵ = (0, 0, −2)^T
and b = 1, and the margin is 1/||ŵ||_2 = 1/2.
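As a sanity check, here is a small sketch (assuming NumPy; the feature map φ(x) = (1, √2 x, x^2) used below is not stated in this excerpt but is the one implied by equations (10)-(12)):

```python
import numpy as np

def phi(x):
    # Quadratic feature map implied by equations (10)-(12): (1, sqrt(2) x, x^2).
    return np.array([1.0, np.sqrt(2) * x, x ** 2])

X, y = np.array([0.0, -1.0, 1.0]), np.array([+1, -1, -1])
w_hat, b_hat = np.array([0.0, 0.0, -2.0]), 1.0

scores = np.array([w_hat @ phi(x) + b_hat for x in X])
print(y * scores)                     # [1. 1. 1.]: all constraints are tight,
                                      # so every point is a support vector
print(1 / np.linalg.norm(w_hat))      # margin 1/||w_hat||_2 = 0.5
```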
(d) Show that the solution remains the same if the constraints are changed to
y_i (w^T φ(x_i) + b) ≥ ρ,   i = 1, 2, 3
for any ρ ≥ 1.
Note that changing the constraints in the solution of part (c) only changes equations (15)-(17), and we
get b = ρ and ŵ = (0, 0, −2ρ)^T. However, the hyperplane described by the equation ŵ^T x + b = 0
remains the same as before: {x : −2ρ x_3 + ρ = 0} ≡ {x : −2x_3 + 1 = 0}. Hence we have the same
classifier in both cases: assign class label + if ŵ^T x + b ≥ 0 and assign class − otherwise.
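A quick numerical illustration of this on the same dataset, a sketch assuming NumPy and reusing the same hypothetical feature map φ(x) = (1, √2 x, x^2): the scaled solution ŵ = (0, 0, −2ρ), b = ρ gives identical predictions for every ρ ≥ 1.

```python
import numpy as np

def phi(x):
    return np.array([1.0, np.sqrt(2) * x, x ** 2])   # same hypothetical feature map

X = np.array([0.0, -1.0, 1.0])
for rho in [1.0, 2.5, 10.0]:
    w, b = np.array([0.0, 0.0, -2.0 * rho]), rho
    print(rho, np.sign([w @ phi(x) + b for x in X]))  # always [ 1. -1. -1.]
```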
(e) (Extra Credit) Is your answer to (d) also true for any dataset and ρ ≥ 1? Provide a counterexample
or give a short proof.
This is true for any dataset and any ρ ≥ 1, and it follows from the homogeneity of the optimization problem. For
constraints y_i (w^T φ(x_i) + b) ≥ ρ, we can define new variables w̃ = w/ρ and b̃ = b/ρ, so
that the constraints in the new variables are y_i (w̃^T φ(x_i) + b̃) ≥ 1, and equivalently optimize the
following:

min_{w̃,b̃}  (1/2) ρ^2 ||w̃||_2^2   s.t.   y_i (w̃^T φ(x_i) + b̃) ≥ 1,  i = 1, . . . , N    (19)

Since ρ^2 is a positive constant, this problem has the same minimizer (w̃, b̃) as the standard problem with ρ = 1, and w = ρw̃, b = ρb̃ define the same separating hyperplane and hence the same classifier.