Lecture 7 - Kernels and Support Vector Machines
2DV516/2DT916
Jonas Nordqvist
jonas.nordqvist@lnu.se
Department of Mathematics
Agenda
Example: an ill-defined problem
Find the separating hyperplane
Hyperplanes
A hyperplane always divides the space into two disjoint parts, for example the regions above and below a line.
Given $p$ points in general position in the ambient space, the hyperplane through these points is unique up to a constant factor, e.g. $2Y + 6X + 4 = 0$ and $Y + 3X + 2 = 0$ describe the same hyperplane.
Hyperplanes
There are essentially three scenarios for a point $x^0 = (x_1^0, x_2^0, \dots, x_p^0)$: either
▶ $x^0$ lies in the hyperplane and $\beta_0 + \beta_1 x_1^0 + \cdots + \beta_p x_p^0 = 0$,
▶ $x^0$ lies on one side of the hyperplane and $\beta_0 + \beta_1 x_1^0 + \cdots + \beta_p x_p^0 > 0$, or
▶ $x^0$ lies on the other side and $\beta_0 + \beta_1 x_1^0 + \cdots + \beta_p x_p^0 < 0$.
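As a small illustration (the coefficients and points are made up, using the earlier example hyperplane $Y + 3X + 2 = 0$), the three cases can be checked numerically by looking at the sign of $\beta_0 + \beta^\top x^0$:

import numpy as np

# Hyperplane Y + 3X + 2 = 0, i.e. beta0 = 2 and beta = (3, 1) for the point (X, Y)
beta0, beta = 2.0, np.array([3.0, 1.0])

points = np.array([[0.0, -2.0],   # lies in the hyperplane
                   [1.0,  1.0],   # lies on one side
                   [-2.0, 0.0]])  # lies on the other side

values = beta0 + points @ beta    # beta0 + beta1*x1 + beta2*x2 for each point
print(values)                     # [ 0.  6. -4.]
print(np.sign(values))            # [ 0.  1. -1.]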
Geometric motivation
Consider the hyperplane
$\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0,$
with normal vector $\beta = (\beta_1, \dots, \beta_p)^\top$.
Assume that $x = (x_1, \dots, x_p)$ lies ‘above’ the hyperplane; then we will show that
$\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p > 0.$
For any point $x^*$ in the hyperplane we have $\beta_0 + \beta^\top x^* = 0$, and hence
$\beta_0 + \beta^\top x = (x - x^*)^\top \beta = \|x - x^*\| \, \|\beta\| \cos(\theta),$
where $\theta$ is the angle between $x - x^*$ and $\beta$. The sign of this quantity is determined completely by $\cos(\theta)$. Thus, $\beta_0 + \beta^\top x > 0$ if and only if $\theta \in (-\pi/2, \pi/2)$.
[Figure: the point $x$ and the angle to the hyperplane, with $\pm\pi/2$ marked.]
Back to the problem
We want to find a hyperplane that separates our data.
Is this always possible given any data? No!
Why is the problem ill-defined? Because there is no unique solution!
Maximal margin classifier
We want to choose the hyperplane which separates the data and which has the largest margin (or cushion, or slab) separating the two classes.
Note that only three points contribute to the computation of the slab, namely the ones lying on the margin. These points are called support vectors.
Formulating the (hard) problem
Remark
We will consider the binary classification case. For this problem it is convenient to use $+1$ for positive and $-1$ for negative labels in $y$, as this implies
$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq 0,$
for all $1 \leq i \leq n$.
Denote by $M$ the distance from the hyperplane to the two classes.¹
The main objective is the following:
$\max_{\beta_0, \beta_1, \dots, \beta_p} M \quad (2)$
subject to
$\sum_{i=1}^{p} \beta_i^2 = \|\beta\|^2 = 1 \quad (3)$
$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M. \quad (4)$
¹ By distance to a class we mean the shortest distance from any point in the class to the hyperplane.
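This hard-margin problem is rarely solved in exactly this form in practice; a common sketch is to approximate it with a soft-margin linear SVC and a very large C (the synthetic data and the value C = 1e10 below are assumptions for illustration only):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable point clouds (synthetic data)
X = np.vstack([rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A huge C makes margin violations prohibitively expensive, so the fit
# approximates the maximal margin classifier.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

print(clf.coef_, clf.intercept_)      # the estimated beta and beta_0
print(clf.support_vectors_)           # the points lying on the margin
print(2 / np.linalg.norm(clf.coef_))  # width of the slab, 2M = 2 / ||beta||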
Distance formula
Lemma
The distance between the point $x_i$ and the hyperplane is given by
$\frac{1}{\|\beta\|} \, y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}).$
The hyperplane
$\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0, \quad (5)$
has normal $\beta = (\beta_1, \dots, \beta_p)^\top$.
Hence, the shortest path from a point $x_i$ to the hyperplane goes along the line described by $\beta$ and the point $x_i$. So, the equation of the line is given by $x_i + t\beta$, $t \in \mathbb{R}$. Inserting the line into the hyperplane equation gives
$\beta_0 + \beta_1(x_{i1} + t\beta_1) + \cdots + \beta_p(x_{ip} + t\beta_p) = 0.$
Distance formula
Solving for $t$ yields
$t = -\frac{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}{\beta_1^2 + \cdots + \beta_p^2} = -\frac{1}{\|\beta\|^2}\,(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}).$
Denote by $x$ the point in the hyperplane which is the intersection between the hyperplane and the line. Then the smallest distance between $x_i$ and $x$, and thus to the plane, is given by
$\|x - x_i\| = \|t\beta\| = |t| \, \|\beta\| = \frac{1}{\|\beta\|}\,|\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}|,$
and we obtain the formula of the lemma, since for a correctly classified point multiplying by the label $y_i \in \{-1, +1\}$ gives exactly this absolute value.
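The lemma is easy to sanity-check numerically: below, the distance from the formula is compared with the distance to the explicit projection $x_i + t\beta$ (all numbers are made up):

import numpy as np

beta0, beta = 2.0, np.array([3.0, 1.0])
xi, yi = np.array([1.0, 1.0]), 1  # a point labelled +1 on the positive side

# Distance according to the lemma
d_lemma = yi * (beta0 + beta @ xi) / np.linalg.norm(beta)

# Distance via the projection onto the hyperplane: x = xi + t*beta
t = -(beta0 + beta @ xi) / (beta @ beta)
x_proj = xi + t * beta
d_proj = np.linalg.norm(x_proj - xi)

print(d_lemma, d_proj)  # both are approximately 1.897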
Reformulating the problem
Denote by $\beta := (\beta_1, \dots, \beta_p)$, and
$\|\beta\| = \sqrt{\beta_1^2 + \cdots + \beta_p^2}.$
The distance from the hyperplane to any point $x_i$ is given by $y_i(x_i^\top\beta + \beta_0)$, by (3), and in particular if $\|\beta\|$ is no longer necessarily equal to 1 we have the distance
$\frac{1}{\|\beta\|}\, y_i(x_i^\top\beta + \beta_0).$
Put $M = 1/\|\beta\|$, and hence maximizing the margin $M$ implies minimizing $\|\beta\|$. This is further equivalent to minimizing $\|\beta\|^2$. So, our problem can instead be formulated as
$\min_{\beta_0, \beta} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(\beta_0 + x_i^\top\beta) \geq 1, \quad i = 1, \dots, n.$
Maximal margin classifier is non-robust
The maximal margin classifier performs well on very special problems, but it is very non-robust. Here this means: minor changes in the input data may yield major changes in the decision boundary.
Decision boundary examples
Soften the margin
Formulating the support vector classifier
To soften the margin, we introduce a slack variable $\varepsilon_i \geq 0$ for each instance.
The slack $\varepsilon_i$ measures how much $x_i$ may violate the margin. Our two objectives are now
▶ make the slack variables as small as possible
▶ minimize $\|\beta\|$
$\min_{\beta_0, \beta, \varepsilon} \; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\varepsilon_i$
subject to $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq (1 - \varepsilon_i)$ and $\varepsilon_i \geq 0.$
Note that if
▶ $\varepsilon_i = 0$, $x_i$ is on the correct side of the margin
▶ $\varepsilon_i > 0$, $x_i$ has violated the margin
▶ $\varepsilon_i > 1$, $x_i$ is on the wrong side of the hyperplane
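As a sketch (synthetic data, an arbitrary C), the slack values of a fitted soft-margin classifier can be recovered as $\varepsilon_i = \max(0,\, 1 - y_i f(x_i))$, which makes the three cases above easy to count:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-1, -1], 1.0, (30, 2)),
               rng.normal([1, 1], 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

f = clf.decision_function(X)      # f(x_i) = beta_0 + beta^T x_i
eps = np.maximum(0.0, 1 - y * f)  # slack variables

print((eps == 0).sum())                # on the correct side of the margin
print(((eps > 0) & (eps <= 1)).sum())  # violating the margin, still correctly classified
print((eps > 1).sum())                 # on the wrong side of the hyperplane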
Formulating the support vector classifier
The parameter $C$ can be seen as a regularization parameter.
$\min_{\beta_0, \beta, \varepsilon} \; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\varepsilon_i$
subject to $y_i(\beta_0 + x_i^\top\beta) \geq (1 - \varepsilon_i)$ and $\varepsilon_i \geq 0.$
$C$ will serve as a trade-off between
▶ making the margin large
▶ making sure that most examples have margin at least $1/\|\beta\|$
If $y_i(\beta_0 + x_i^\top\beta) \geq 1$ then there is no cost of misclassification, but if not, then we pay a cost that grows with the violation.
This perhaps looks like a familiar setting for ML and we’ll revisit it in a couple of slides.
Formal explanation
We may consider the Lagrangian relaxation of the problem and obtain the following Lagrangian primal function
$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\varepsilon_i + \sum_{i=1}^{n}\alpha_i\big((1 - \varepsilon_i) - y_i(\beta_0 + x_i^\top\beta)\big) - \sum_{i=1}^{n}\mu_i\varepsilon_i.$
Setting the derivatives of $L_P$ with respect to $\beta$, $\beta_0$ and $\varepsilon_i$ to zero yields
$\frac{\partial L_P}{\partial \beta} = 0 \iff \beta = \sum_{i=1}^{n}\alpha_i y_i x_i$
$\frac{\partial L_P}{\partial \beta_0} = 0 \iff 0 = \sum_{i=1}^{n}\alpha_i y_i$
$\frac{\partial L_P}{\partial \varepsilon_i} = 0 \iff \alpha_i = C - \mu_i, \quad \forall i$
Support vector classifier
Using the $L_P$-formulation and the partial derivatives of the previous slide yields the dual formulation
$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^\top x_j.$
The solution for $\beta$ takes the form
$\hat{\beta} = \sum_{i=1}^{n}\alpha_i y_i x_i, \quad \text{implying} \quad f(x) = \beta_0 + x^\top\beta = \beta_0 + \sum_{i=1}^{n}\alpha_i y_i \, x^\top x_i.$
Some of the $\alpha_i$ are zero, and for the ones that are not, the corresponding $x_i$ is a support vector. Hence, classification is done by computing the dot product between the test point $x$ and all the support vectors $x_i$.
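A sketch of this in sklearn (synthetic data assumed): a fitted SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_, so $f(x)$ can be reproduced using dot products with the support vectors only:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-1, -1], 1.0, (30, 2)),
               rng.normal([1, 1], 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.3, -0.2])
# dual_coef_[0, k] holds alpha_k * y_k for the k-th support vector
f_manual = clf.intercept_[0] + clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new)
f_sklearn = clf.decision_function(x_new.reshape(1, -1))[0]

print(f_manual, f_sklearn)  # agree up to floating point noise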
Support vector classifier
The function which we fit (again) is the hypothesis
$f(x) = \beta_0 + x^\top\beta = \beta_0 + \sum_{i=1}^{n}\alpha_i y_i \, x^\top x_i.$
SVM in the cost + regularization paradigm
$\min_{\beta_0, \beta_1, \dots, \beta_p} \; \sum_{i=1}^{n}\max[0,\, 1 - y_i f(x_i)] + \frac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2.$
The cost function here is known as the hinge loss. It is convex but not (everywhere) differentiable, thus requiring some extra work in order to perform gradient descent.
In this manner we see that the support vector classifier behaves quite similarly to logistic regression with ridge regularization.
Note that
hinge + quadratic = convex.
Hinge loss and binomial deviance
Presented below are the cost functions for support vector classifiers and logistic regression, respectively.
Convex functions
A function $f$ is said to be convex if for every $x, y$ in its domain and every $\lambda$ in the unit interval we have
$f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda) f(y).$
Hinge + quadratic
Theorem
Any local minimum of a convex function is a global minimum.
▶ Good news, since this implies that gradient descent yields the minimum.
▶ The bad news is, as previously announced, that the loss function is not everywhere differentiable.
▶ A solution is to use sub-gradients.
▶ This, however, is out of scope for this course and we only introduce it for the sake of completeness.
Gradient descent version of the problem
$J(\beta) = \sum_{i=1}^{n}\max[0,\, 1 - y_i f(x_i)] + \frac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2.$
Thus we have
$\nabla J(\beta) = \lambda\beta + \sum_{i=1}^{n} s(x_i, y_i),$
where
$s(x, y) = \begin{cases} -yx, & \text{if } y f(x) < 1 \\ 0, & \text{otherwise.} \end{cases}$
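A minimal (sub)gradient descent sketch for this objective, leaving out the intercept $\beta_0$ for brevity (the synthetic data, step size and $\lambda$ are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([-1, -1], 1.0, (30, 2)),
               rng.normal([1, 1], 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

lam, lr, n_iter = 0.1, 0.01, 500
beta = np.zeros(X.shape[1])

for _ in range(n_iter):
    f = X @ beta                  # f(x_i) = beta^T x_i (no intercept here)
    viol = y * f < 1              # points with non-zero hinge loss
    # subgradient: lambda * beta plus s(x_i, y_i) = -y_i x_i over the violations
    grad = lam * beta - (y[viol, None] * X[viol]).sum(axis=0)
    beta -= lr * grad

print(beta)
print(np.mean(np.sign(X @ beta) == y))  # training accuracy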
Extending further – solving non-linear problems
Perform a transformation of the data in order to make it linearly separable. Here is an example where $x \mapsto (x, x^2)$.
Implicit enlargement of the feature space
We’ve already seen how to solve non-linear problems, for instance by enlarging the feature space. However, there are some problems associated with this task. What if there are 100 features and we want to find all polynomial combinations of these of degree less than or equal to 5? Then we would extend our feature space with more than 79 million(!) features $\Rightarrow$ computational issues.
Our aim now is to find some other suitable way to implicitly enlarge the feature space.
The main idea is to enlarge the feature space to allow for non-linear decision boundaries, without having to add pre-defined new features.
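A quick check of that count, assuming it refers to all products of up to five distinct features out of the 100 (the exact counting convention is my assumption):

from math import comb

p, d = 100, 5
n_features = sum(comb(p, k) for k in range(1, d + 1))
print(n_features)  # 79375495, i.e. more than 79 million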
Kernels
The feature space can implicitly be extended by the use of kernels. The support vector classifier above uses the linear kernel, i.e. the ordinary inner product
$K(x_i, x_j) = \langle x_i, x_j \rangle.$
▶ polynomial kernel:
$K(x_i, x_j) = (1 + \langle x_i, x_j \rangle)^d$
▶ Gaussian (RBF) kernel:
$K(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^2) = e^{-\gamma \sum_{k=1}^{p}(x_{ki} - x_{kj})^2}$
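As a sketch, these kernels can be computed directly or with sklearn.metrics.pairwise; the data, the degree d and $\gamma$ below are arbitrary:

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))
d, gamma = 2, 0.5

K_poly = (1 + X @ X.T) ** d  # (1 + <x_i, x_j>)^d
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq_dists)  # exp(-gamma * ||x_i - x_j||^2)

# The same kernel matrices via sklearn
print(np.allclose(K_poly, polynomial_kernel(X, degree=d, gamma=1.0, coef0=1.0)))
print(np.allclose(K_rbf, rbf_kernel(X, gamma=gamma)))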
Gaussian kernel is a kernel
Gaussian kernel is very flexible
The strength of kernels
Consider the polynomial kernel $K(x, y) = (1 + \langle x, y \rangle)^2$. Let $x = (x_1, x_2)$ and $y = (y_1, y_2)$. Then
$K(x, y) = (1 + \langle x, y \rangle)^2 = (1 + x_1 y_1 + x_2 y_2)^2 = 1 + (x_1 y_1)^2 + (x_2 y_2)^2 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 y_1 x_2 y_2.$
However, note that
$K(x, y) = \big\langle (1, x_1^2, x_2^2, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2),\; (1, y_1^2, y_2^2, \sqrt{2}y_1, \sqrt{2}y_2, \sqrt{2}y_1 y_2) \big\rangle.$
Hence, evaluating the kernel amounts to taking an inner product in a 6-dimensional feature space, without ever forming the transformed features explicitly.
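The identity is easy to verify numerically; the function phi below is the explicit 6-dimensional feature map written out above (the name phi and the test points are mine):

import numpy as np

def phi(v):
    # Explicit feature map for K(x, y) = (1 + <x, y>)^2 in two dimensions
    v1, v2 = v
    return np.array([1, v1**2, v2**2,
                     np.sqrt(2) * v1, np.sqrt(2) * v2, np.sqrt(2) * v1 * v2])

x, y = np.array([0.7, -1.2]), np.array([2.0, 0.5])

lhs = (1 + x @ y) ** 2  # kernel evaluated directly, O(p) work
rhs = phi(x) @ phi(y)   # inner product in the enlarged feature space
print(lhs, rhs, np.isclose(lhs, rhs))  # both approximately 3.24, True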
Kernel Ridge Regression
Linear (ridge) regression is given by solving
$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n}(\theta^\top x_i - y_i)^2 + \lambda\|\theta\|^2,$
which has the closed-form solution
$\hat{\theta} = (X^\top X + n\lambda I)^{-1} X^\top y.$
With a feature map $\Phi$ the corresponding problem is
$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n}(\theta^\top \Phi(x_i) - y_i)^2 + \lambda\|\theta\|^2,$
solved by
$\hat{\theta} = (\Phi(X)^\top \Phi(X) + n\lambda I)^{-1} \Phi(X)^\top y.$
If $\Phi(x) = (1, x, x^2, x^3, \dots, x^d)$ then the above is equivalent to polynomial regression. This problem simplifies due to the kernel trick.
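A sketch of the kernel trick for ridge regression (the data and hyperparameters are made up): sklearn's KernelRidge works entirely from the kernel matrix, and its prediction coincides with the hand-computed dual solution $(K + \alpha I)^{-1} y$:

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(-3, 3, size=80))
y = np.sin(x) + 0.1 * rng.normal(size=80)
X = x.reshape(-1, 1)
X_new = np.array([[0.0], [1.5]])

model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)
print(model.predict(X_new))

# Equivalent dual/kernel form computed by hand
K = rbf_kernel(X, X, gamma=0.5)
dual = np.linalg.solve(K + 0.1 * np.eye(len(X)), y)  # (K + alpha*I)^{-1} y
print(rbf_kernel(X_new, X, gamma=0.5) @ dual)        # matches model.predict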
Support vector machines
When combining support vector classifiers with kernels we obtain what is usually called the support vector machine (SVM).
SVM hyperparameters
There are several hyperparameters of the support vector machine which have to be tuned in order to obtain good performance.
General advice: hyperparameter tuning may be very time-consuming, so if you have many training points, use only a subset of them for tuning the hyperparameters; this can still give good estimates.
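A sketch of that advice using GridSearchCV on a random subset of a synthetic data set (the grid values and the subset size of 1000 are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Tune on a random subset to keep the grid search cheap
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=1000, replace=False)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X[idx], y[idx])
print(search.best_params_)

# Refit on the full training set with the chosen hyperparameters
final = SVC(kernel="rbf", **search.best_params_).fit(X, y)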
Binarization techniques
There are two very common and widely used binarization techniques, i.e. methods for handling multiclass classification.
Previously you’ve seen the one-vs-all technique (Lecture 3). In sklearn, the built-in technique for multi-class SVM is one-vs-one.
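For reference, a small sketch of both strategies in sklearn (using the iris data set as an example): SVC applies one-vs-one internally, while one-vs-all can be added explicitly with a wrapper:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three classes

ovo = SVC(kernel="linear").fit(X, y)                       # one-vs-one, built into SVC
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # explicit one-vs-all wrapper

print(ovo.predict(X[:5]))
print(ova.predict(X[:5]))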
OVA vs OVO
Consider a three-class classification task with only three training examples, as seen below. Suppose that a maximal margin classifier is used and trained on the mentioned training examples, together with either (i) OVO binarization or (ii) OVA binarization.
a) In each of the cases (i) and (ii), how many classifiers need to be trained?
b) Are the same decision boundaries produced by (i) and (ii)?
SVM in sklearn
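A typical sklearn SVM workflow, as a sketch (the data set and all parameter values are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs since the RBF kernel is distance based
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))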
Assignment summary