
Kernels and support vector machines

2DV516/2DT916

Jonas Nordqvist

jonas.nordqvist@lnu.se

May 16, 2025

Department of Mathematics
Agenda

▶ Maximal margin classifier
▶ Support vector classifier
▶ Kernels
▶ Support vector machines
▶ Kernel Ridge Regression

Reading Instructions
Chapter 8

Example: an ill-defined problem
Find the separating hyperplane

But first... what is a hyperplane?

Hyperplanes

A hyperplane is a geometrical structure, in particular a subspace of dimension one less than its ambient space, i.e. the space in which the hyperplane is defined.

If the ambient space is of dimension:

▶ 1 ⇒ a hyperplane is a point.
▶ 2 ⇒ a hyperplane is a line.
▶ 3 ⇒ a hyperplane is a plane.
▶ 4 ⇒ a hyperplane is a 3-dimensional plane.
▶ p ⇒ a hyperplane is a (p − 1)-dimensional plane.

A hyperplane always divides the space into two disjoint parts, for example the regions above and below a line.

Given p points in the ambient space, a hyperplane through these points is unique up to a constant factor, e.g. 2Y + 6X + 4 = 0 and Y + 3X + 2 = 0 describe the same hyperplane.

Hyperplanes

A hyperplane in a p-dimensional ambient space can be described by an equation of the form

    β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p = 0.    (1)

Any point x which lies in the hyperplane satisfies (1).

There are essentially three scenarios for a point x⁰ = (x_1⁰, x_2⁰, …, x_p⁰): either

▶ x⁰ lies in the hyperplane, and β_0 + β_1 x_1⁰ + ⋯ + β_p x_p⁰ = 0,
▶ x⁰ lies 'above' the hyperplane, and β_0 + β_1 x_1⁰ + ⋯ + β_p x_p⁰ > 0,
▶ or x⁰ lies 'below' the hyperplane, and β_0 + β_1 x_1⁰ + ⋯ + β_p x_p⁰ < 0.

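To make the three cases concrete, here is a minimal NumPy sketch (my own illustration, not from the slides); the hyperplane coefficients below are made-up example values.

import numpy as np

# Hypothetical hyperplane in R^2: 1 + 2*X1 - 3*X2 = 0
beta0 = 1.0
beta = np.array([2.0, -3.0])

points = np.array([[1.0, 1.0],    # 1 + 2 - 3 =  0 -> in the hyperplane
                   [2.0, 1.0],    # 1 + 4 - 3 =  2 -> 'above'
                   [0.0, 1.0]])   # 1 + 0 - 3 = -2 -> 'below'

values = beta0 + points @ beta    # beta_0 + beta^T x for each point
print(values, np.sign(values))    # the sign tells which side each point lies on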
Geometric motivation

Let H be a hyperplane in a p-dimensional space, and put β := (β_1, …, β_p)⊺. For simplicity we assume β_0 := 0; then the equation of H is given by

    H : β_1 X_1 + ⋯ + β_p X_p = 0.

Assume that x* = (x_1*, …, x_p*) lies 'above' H; then we will show that

    β_0 + β_1 x_1* + ⋯ + β_p x_p* > 0.

Recall that since β_0 = 0 we have β_0 + β_1 x_1* + ⋯ + β_p x_p* = (x*)⊺β.

By the definition of the scalar product we have

    (x*)⊺β = ‖x*‖ · ‖β‖ · cos(θ),

where θ is the angle between x* and β. The sign of this quantity is determined completely by cos(θ). Thus, (x*)⊺β > 0 if and only if −π/2 < θ < π/2.

Geometric motivation

(Figure: a point x above the hyperplane and the angle θ between x and the normal β, lying between −π/2 and π/2; plot not reproduced here.)
Back to the problem
We want to find a hyperplane that separates our data.

Is this always possible given any data? No! Why is the problem ill-defined? No
unique solution!
Maximal margin classifier
We want to choose the hyperplane which separates the data and which has the largest margin (or cushion, or slab) separating the two classes.

Note that only the points lying on the margin contribute to the computation of the slab; in the illustrated example there are three of them. These points are called support vectors.

Formulating the (hard) problem

Remark
We will consider the binary classification case. For this problem it is convenient to use +1 for positive and −1 for negative labels in y, since for a separating hyperplane this implies

    y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}) ≥ 0,

for all 1 ≤ i ≤ n.

Denote by M the distance from the hyperplane to the two classes.¹ The main objective is the following:

    max_{β_0, β_1, …, β_p}  M                                          (2)

    subject to   Σ_{j=1}^p β_j² = ‖β‖² = 1,                            (3)

                 y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}) ≥ M.          (4)

¹ By distance to a class we mean the shortest distance from any point in the class to the hyperplane.
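As a practical aside (my own sketch, not from the slides): the maximal margin classifier can be approximated in scikit-learn by a linear SVC with a very large C, which makes any margin violation prohibitively expensive. The toy data below is made up and linearly separable.

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[1, 1], [2, 2], [2, 0],
              [4, 4], [5, 5], [5, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# A huge C approximates the hard-margin (maximal margin) classifier
clf = SVC(kernel="linear", C=1e10).fit(X, y)

print("beta:", clf.coef_[0], "beta_0:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||beta||:", 2 / np.linalg.norm(clf.coef_))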
Distance formula

Lemma
The distance between the point x_i and the hyperplane is given by

    (1/‖β‖) · y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}).

A hyperplane with the equation

    β_0 + β_1 X_1 + ⋯ + β_p X_p = 0,    (5)

has normal β = (β_1, …, β_p)⊺.

Hence, the shortest path from a point x_i to the hyperplane goes along the line through x_i in the direction of β. So, the equation of the line is given by x_i + tβ, t ∈ ℝ.

The line satisfies (5) in exactly one point:

    β_0 + β_1 (x_{i1} + t β_1) + ⋯ + β_p (x_{ip} + t β_p) = 0.

Distance formula

Solving for t yields

    t = −(β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}) / (β_1² + ⋯ + β_p²) = −(1/‖β‖²) · (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}).

Denote by x* the point in the hyperplane which is the intersection between the hyperplane and the line. Then the smallest distance between x_i and x*, and thus to the plane, is given by

    ‖x* − x_i‖ = ‖tβ‖ = |t| · ‖β‖,

and we obtain

    |t| · ‖β‖ = (1/‖β‖) · |β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}| = (1/‖β‖) · y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}),

where the last equality holds for a correctly classified point, since then y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}) = |β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}|.

Reformulating the problem
Denote by β := (β_1, …, β_p), and

    ‖β‖ = √(β_1² + ⋯ + β_p²).

By (3), the distance from the hyperplane to any point x_i is given by y_i (β⊺x_i + β_0), and in particular, if ‖β‖ is no longer necessarily equal to 1, we have

    (1/‖β‖) · y_i (β⊺x_i + β_0) ≥ M   ⟺   y_i (β⊺x_i + β_0) ≥ M‖β‖.

Put M = 1/‖β‖; hence maximizing the margin M amounts to minimizing ‖β‖. This is further equivalent to minimizing ‖β‖². So, our problem can instead be formulated as

    min ½‖β‖²,   subject to   y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}) ≥ 1,

or equivalently

    min ½ β⊺β,   subject to   y_i (β_0 + β⊺x_i) ≥ 1.

Note that the points which give equality in the above constraint are the support vectors.

Maximal margin classifier is non-robust

The maximal margin classifier has good performance on very special problems, but it is very non-robust. Here this means that minor changes in the input data may yield major changes in the decision boundary.

Decision boundary examples

(Figure: example decision boundaries; plots not reproduced here.)
Soften the margin

A natural extension of the maximal margin classifier is the support vector classifier, which allows for some violation of the margin while keeping most instances on the correct side of it.

Formulating the support vector classifier

To soften the margin, we introduce a slack variable ε_i ≥ 0 for each instance.

The slack ε_i measures how much x_i may violate the margin. Our two objectives are now:

▶ make the slack variables as small as possible,
▶ minimize ‖β‖.

    min_{β_0, β, ε}  ½‖β‖² + C Σ_{i=1}^n ε_i

    subject to   y_i (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}) ≥ 1 − ε_i   and   ε_i ≥ 0.

Note that if
▶ ε_i = 0, then x_i is on the correct side of the margin,
▶ ε_i > 0, then x_i has violated the margin,
▶ ε_i > 1, then x_i is on the wrong side of the hyperplane.

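As a small illustration (my own sketch, not from the slides), the slack values of a fitted linear support vector classifier can be recovered as ε_i = max(0, 1 − y_i f(x_i)); the synthetic data and C value below are arbitrary.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up, slightly overlapping two-class data
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)          # f(x_i) = beta_0 + beta^T x_i
slack = np.maximum(0, 1 - y * f)      # epsilon_i for each training point

print("correct side of the margin (eps = 0):", np.sum(slack == 0))
print("margin violations (0 < eps <= 1):    ", np.sum((slack > 0) & (slack <= 1)))
print("misclassified (eps > 1):             ", np.sum(slack > 1))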
Formulating the support vector classifier
The parameter C can be seen as a regularization parameter.

    min_{β_0, β, ε}  ½‖β‖² + C Σ_{i=1}^n ε_i

    subject to   y_i (β_0 + β⊺x_i) ≥ 1 − ε_i   and   ε_i ≥ 0.

C will serve as a tradeoff between
▶ making the margin large,
▶ making sure that most examples have margin at least 1/‖β‖.

If y_i (β_0 + β⊺x_i) ≥ 1 then there is no cost of misclassification, but if not, then we have to 'pay' the price ε_i = 1 − y_i (β_0 + β⊺x_i). Hence, the 'price' is given by

    ε_i = max{0, 1 − y_i (β_0 + β⊺x_i)},

and replacing ε_i by this expression in our minimization scheme yields

    min_{β_0, β}  ½‖β‖² + C Σ_{i=1}^n max{0, 1 − y_i (β_0 + β⊺x_i)}.

This perhaps looks like a familiar setting for ML, and we'll revisit it in a couple of slides.

Formal explanation
We may consider the Lagrangian relaxation of the problem and obtain the following Lagrangian primal function

    L_P = ½‖β‖² + C Σ_{i=1}^n ε_i + Σ_{i=1}^n α_i ((1 − ε_i) − y_i (β_0 + β⊺x_i)) − Σ_{i=1}^n μ_i ε_i,

where α_i ≥ 0 and μ_i ≥ 0 are the Lagrange multipliers of the two sets of constraints. Setting the derivatives of the primal Lagrangian L_P with respect to β, β_0 and ε to zero yields

    ∂L_P/∂β = 0    ⟺   β = Σ_{i=1}^n α_i y_i x_i,

    ∂L_P/∂β_0 = 0  ⟺   0 = Σ_{i=1}^n α_i y_i,

    ∂L_P/∂ε_i = 0  ⟺   μ_i = C − α_i,   for all i.

Substituting back into L_P we obtain the dual L_D, which we want to maximize. We get

    L_D = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i⊺x_j.

Support vector classifier

Using the LP -formulation and the partials of the previous slide yields the dual
formulation LD
X
n
1 XX
n n
LD = i
2
i

j yi yj xi xj :
i =1 i =1 j =1

Pto be maximized subject to the constraints


which is supposed
i 2
[0; C ] and ni=1 i yi = 0.
Our final take on this is that the support vector classifier can be written as

X
n
X
n
^= i yi x i ; implying f (x ) = 0 + x⊺ = 0 + i yi x

xi :
i =1 i =1

Some of the i are zero and for the ones it is not, the corresponding xi is a
support vector. Hence, classification done by computing the dot product between
the test point x and all the support vectors xi .

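A sketch of this observation in code (my own, assuming a linear kernel): scikit-learn's fitted SVC exposes exactly these dual quantities, with dual_coef_ holding α_i·y_i for the support vectors, so the decision function can be rebuilt from dot products with the support vectors alone.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([1.5, 1.5])
# f(x) = beta_0 + sum_i alpha_i y_i <x, x_i>, summing over support vectors only
f_manual = clf.intercept_[0] + np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_new))

print(f_manual, clf.decision_function([x_new])[0])   # the two values agree
print("prediction:", np.sign(f_manual))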
Support vector classifier
The function which we fit (again) is the hypothesis

    f(x) = β_0 + x⊺β = β_0 + Σ_{i=1}^n α_i y_i x⊺x_i,

and the resulting classifier is

    G(x) = sgn[f(x)].

A very important take for the remaining part of our SVM journey is that classification of a point x only depends on the dot products between x and the support vectors x_i:

    G(x) = sgn[f(x)] = sgn[ β_0 + Σ_{i=1}^n α_i y_i x⊺x_i ].

The dot product is an example of a so-called inner product, which is typically denoted by ⟨·,·⟩. Henceforth we adopt this notation, which will turn out to be convenient for us. Hence,

    f(x) = β_0 + Σ_{i=1}^n α_i y_i ⟨x, x_i⟩.

SVM in the cost + regularization paradigm

It is possible to formulate the support vector classifier in terms of a more familiar optimization setting, namely

    min_{β_0, β_1, …, β_p}  Σ_{i=1}^n max[0, 1 − y_i f(x_i)] + λ Σ_{j=1}^p β_j²,

i.e. cost + regularization penalty.

The cost function here is known as the hinge loss. It is convex but not (everywhere) differentiable, thus requiring some extra work in order to perform gradient descent.

In this manner we see that the support vector classifier behaves quite similarly to logistic regression with ridge regularization.

Note that

    hinge + quadratic = convex.

Hinge loss and binomial deviance
Presented below are the cost functions for support vector classifiers and logistic regression, respectively (figure not reproduced here).

Convex functions
A function f is said to be convex if for every x, y in its domain and every λ in the unit interval we have

    f(λx + (1 − λ)y) ≤ λ·f(x) + (1 − λ)·f(y).

Hinge + quadratic

Theorem
Any local minimum of a convex function is a global minimum.

▶ Good news, since this implies that gradient descent yields the minimum.
▶ Bad news: as previously announced, the loss function is not everywhere differentiable.
▶ A solution is to use sub-gradients.
▶ This is however outside the scope of the course, and we'll only introduce it for the sake of completeness.

Gradient descent version of the problem

Our cost function is

    J(β) = Σ_{i=1}^n max[0, 1 − y_i f(x_i)] + (λ/2) Σ_{j=1}^p β_j².

Note that if g(x) = max(x, 0), then

    g'(x) = 1 if x > 0,   and 0 otherwise.

Thus we have

    ∇J(β) = λβ + Σ_{i=1}^n s(x_i, y_i),

where

    s(x, y) = −y·x if y·β⊺x < 1,   and 0 otherwise.

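A minimal NumPy sketch of this (sub)gradient descent (my own illustration; the learning rate, iteration count and synthetic data are arbitrary, and the intercept is folded into β via a constant feature).

import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
X = np.hstack([np.ones((100, 1)), X])     # constant feature, so beta[0] acts as intercept
y = np.array([-1] * 50 + [1] * 50)

lam, lr, n_iter = 0.01, 0.01, 500
beta = np.zeros(X.shape[1])

for _ in range(n_iter):
    margins = y * (X @ beta)
    viol = margins < 1                    # points with y * f(x) < 1
    # subgradient: lambda * beta + sum_i s(x_i, y_i), where s = -y_i * x_i on violations
    grad = lam * beta - (y[viol, None] * X[viol]).sum(axis=0)
    beta -= lr * grad

print("beta:", beta)
print("training accuracy:", np.mean(np.sign(X @ beta) == y))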
Extending further – solving non-linear problems
Perform a transformation of the data in order to make it linearly separable. An example is the map x ↦ (x, x²).

Problem: there are many ways to transform the data.

Implicit enlargement of the feature space

We've already seen how to solve non-linear problems, for instance by enlarging the feature space. However, there are some problems associated with this approach. What if there are 100 features and we want to add all polynomial combinations of these of degree less than or equal to 5? Then we extend our feature space to more than 79 million(!) features ⇒ computational issues.

Our aim now is to find some other suitable way to implicitly enlarge the feature space.

The main idea is to enlarge the feature space, allowing for non-linear decision boundaries, without having to add pre-defined new features.

Kernels
The feature space can implicitly be extended by the use of kernels.

Kernels can be thought of as a generalization of the dot product in some implicit feature space, which might be significantly larger (or even infinite-dimensional). A kernel can also be considered a measure of similarity.

Examples of kernels K to use are:

▶ the linear kernel (this gives the support vector classifier):
    K(x_i, x_j) = ⟨x_i, x_j⟩;
▶ the polynomial kernel:
    K(x_i, x_j) = (1 + ⟨x_i, x_j⟩)^d;
▶ the Gaussian (RBF) kernel:
    K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) = e^{−γ Σ_{k=1}^p (x_{ki} − x_{kj})²}.

Why is this better? It requires (significantly) less computation, as we do not have to explicitly extend the feature space.

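For concreteness, here is a small sketch (my own) of these three kernels as plain NumPy functions; gamma and d are the hyperparameters discussed later in the lecture.

import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=2):
    return (1 + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=3), rbf_kernel(x, z, gamma=0.5))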
Kernels

As we said before, kernels are dot products in extended feature spaces, i.e. another formulation of a kernel is

    K(x, y) = ⟨φ(x), φ(y)⟩,

where φ denotes a map into the extended feature space.

New kernels can be constructed from other kernels in the following way (the list is not complete). Assume that c > 0 is a constant, f is any function, q is a polynomial with non-negative coefficients, and that k_1, k_2 are valid kernels. Then the following are valid kernels:

▶ K(x, y) = c·k_1(x, y)
▶ K(x, y) = f(x)·k_1(x, y)·f(y)
▶ K(x, y) = q(k_1(x, y))
▶ K(x, y) = exp(k_1(x, y))
▶ K(x, y) = k_1(x, y) + k_2(x, y)
▶ K(x, y) = k_1(x, y)·k_2(x, y)

Gaussian kernel is a kernel

We recall that the Gaussian kernel was given by

    K(x, y) = exp(−γ‖x − y‖²).

First we note that

    ‖x − y‖² = ⟨x, x⟩ + ⟨y, y⟩ − 2⟨x, y⟩.

Denote f(x) = ⟨x, x⟩. Then we have

    exp(−γ‖x − y‖²) = exp(−γ f(x)) · exp(2γ⟨x, y⟩) · exp(−γ f(y)).

By the results from the last slide this is in fact a kernel: the middle factor is the exponential of a scaled linear kernel, and the outer factors have the form g(x)·k_1(x, y)·g(y) with g(x) = exp(−γ f(x)).

Gaussian kernel is very flexible

The Gaussian kernel captures local information, since K(x, y) → 0 as ‖x − y‖ → ∞. This also allows for capturing highly complex class boundaries.

Figure: Left: RBF kernel, Right: Linear kernel

The strength of kernels

Consider the polynomial kernel K(x, y) = (1 + ⟨x, y⟩)². Let x = (x_1, x_2) and y = (y_1, y_2). Then

    K(x, y) = (1 + ⟨x, y⟩)² = (1 + x_1y_1 + x_2y_2)²
            = 1 + (x_1y_1)² + (x_2y_2)² + 2x_1y_1 + 2x_2y_2 + 2x_1y_1x_2y_2.

However, note that

    K(x, y) = ⟨(1, x_1², x_2², √2·x_1, √2·x_2, √2·x_1x_2), (1, y_1², y_2², √2·y_1, √2·y_2, √2·y_1y_2)⟩.

Hence, the kernel implicitly computes a scalar product in a higher-dimensional space, without having to map x and y to the higher-dimensional features. This is known as the kernel trick.

In this case the map φ(x) = (1, x_1², x_2², √2·x_1, √2·x_2, √2·x_1x_2) only goes to a 6-dimensional feature space, so the trick is not of much use on its own, but technically we may let the degree d → ∞, which is not feasible if we need explicit computations.

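A quick numerical check (my own sketch) that the degree-2 polynomial kernel equals the dot product of the explicit feature maps above:

import numpy as np

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel (1 + <x, y>)^2
    v1, v2 = v
    s = np.sqrt(2)
    return np.array([1, v1**2, v2**2, s * v1, s * v2, s * v1 * v2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = (1 + x @ y) ** 2       # implicit computation, O(p) work
explicit_value = phi(x) @ phi(y)      # explicit dot product in 6 dimensions
print(kernel_value, explicit_value)   # both equal 4.0 for this x and y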
Kernel Ridge Regression
Linear (ridge) regression is given by solving

    θ̂ = argmin_θ (1/n) Σ_{i=1}^n (θ⊺x_i − y_i)² + λ‖θ‖².

This is solved by the normal equations

    θ̂ = (X⊺X + nλI)⁻¹ X⊺y.

By applying a transformation φ to x as a preprocessing step we obtain

    θ̂ = argmin_θ (1/n) Σ_{i=1}^n (θ⊺φ(x_i) − y_i)² + λ‖θ‖²,

solved by

    θ̂ = (φ(X)⊺φ(X) + nλI)⁻¹ φ(X)⊺y.

If φ(x) = (1, x, x², x³, …, x^d) then the above is equivalent to polynomial regression. This problem simplifies due to the kernel trick.

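A sketch of how the kernel trick simplifies this (my own illustration, not from the slides): instead of forming φ(X) explicitly, one solves the dual system α = (K + nλI)⁻¹y and predicts with f(x) = Σ_i α_i K(x_i, x). The data and regularization value below are arbitrary; scikit-learn's KernelRidge implements the same model, with its alpha argument playing the role of nλ here.

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma, reg = 0.5, 0.1                          # reg plays the role of n*lambda
K = rbf_kernel(X, X, gamma=gamma)

# Dual solution: alpha = (K + reg*I)^{-1} y, prediction f(x) = sum_i alpha_i K(x_i, x)
alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
X_test = np.array([[0.0], [1.0]])
f_manual = rbf_kernel(X_test, X, gamma=gamma) @ alpha

# The same model via scikit-learn
krr = KernelRidge(alpha=reg, kernel="rbf", gamma=gamma).fit(X, y)
print(f_manual, krr.predict(X_test))           # the two predictions agree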
Support vector machines

When combining support vector classifiers with kernels we obtain what is usually called the support vector machine.

The SVM (dual) formulation is

    max_α  Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j φ(x_i)⊺φ(x_j),

where φ(x_i)⊺φ(x_j) is represented using a kernel.

Advantages of support vector machines:

▶ very flexible (illustrated further in Assignment 3)
▶ robust (cf. the maximal margin classifier)
▶ built-in regularization C (which has to be tuned)

SVM hyperparameters

There are several hyperparameters of the support vector machine which have to be tuned in order to obtain good performance.

Some examples of hyperparameters are:

▶ the penalty factor C,
▶ the scale parameter γ for the Gaussian kernel,
▶ the degree d for polynomial kernels.

In order to achieve optimal results these hyperparameters should be optimized, and unfortunately it can be quite costly to find good values. Also note that a poor choice of hyperparameters can radically shift the results, so don't be scared off by poor results right away.

General advice: hyperparameter tuning may be very time consuming; hence, if you have many training points, use only a subset of them for tuning the hyperparameters, which can still give good estimates.

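A sketch (my own) of a typical way to tune C and gamma with cross-validation in scikit-learn; the grid values and the synthetic dataset are placeholders.

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)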
Binarization techniques

There are two very common and much used binarization techniques, i.e. methods for handling multiclass classification.

Previously you've seen the one-vs-all technique (Lecture 3). For multi-class SVM in sklearn, the built-in technique is one-vs-one.

Algorithm: one-vs-one classification

(X, y) is the dataset, k is the number of classes.
1. For i = 1, …, k
   1.1 For j = 1, …, i − 1
       1.1.1 Put X_t := X[(y == i) ∨ (y == j), :]
       1.1.2 Put y_t := y[(y == i) ∨ (y == j)]
       1.1.3 Train a (binary) classifier using (X_t, y_t)
2. Store each of the k(k − 1)/2 classifiers and let majority voting decide predictions.

Numpy note: the OR operation ∨ may in numpy be performed with, for instance, np.logical_or.

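A minimal NumPy/scikit-learn sketch (my own) of the one-vs-one loop above, using np.logical_or to build each pairwise subset; prediction by majority voting is shown for a single test point, and the blob data is made up.

import numpy as np
from itertools import combinations
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # classes 0, 1, 2

classifiers = {}
for i, j in combinations(np.unique(y), 2):
    mask = np.logical_or(y == i, y == j)       # keep only classes i and j
    classifiers[(i, j)] = SVC(kernel="rbf").fit(X[mask], y[mask])

def predict_ovo(x):
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers.values()]
    return np.bincount(votes).argmax()          # majority voting over k(k-1)/2 classifiers

print("predicted:", predict_ovo(X[0]), "true:", y[0])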
OVA vs OVO

Consider a three-class classification task with only three training examples, as seen in the accompanying figure (not reproduced here). Suppose that a maximal margin classifier is trained on these examples, together with either (i) OVO binarization or (ii) OVA binarization.
a) In each of the cases (i) and (ii), how many classifiers need to be trained?
b) Do (i) and (ii) produce the same decision boundaries?

SVM in sklearn

In scikit-learn the command for creating a support vector machine is

▶ svm.SVC()

after importing svm from sklearn.

The typical keyword arguments which you may have to pass are

▶ 'kernel': 'rbf' (default), 'linear' or 'poly'
▶ 'gamma': for instance γ for the Gaussian kernel, and the kernel coefficient for 'poly'
▶ 'C': simply the C parameter
▶ 'degree': simply the degree for the polynomial kernel

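A small usage sketch (my own) of these arguments on a synthetic dataset:

from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rbf_clf = svm.SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_tr, y_tr)
poly_clf = svm.SVC(kernel="poly", C=1.0, degree=3, gamma="scale").fit(X_tr, y_tr)

print("rbf test accuracy: ", rbf_clf.score(X_te, y_te))
print("poly test accuracy:", poly_clf.score(X_te, y_te))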
Assignment summary

This lecture comes with 3 exercises in Assignment 3.


1. Various kernels: In this exercise you will compare various kernels for SVM, on
a dummy dataset.
2. Implicit versus explicit (VG exercise): In this exercise you should compare the training time using implicit and explicit mappings to higher-dimensional spaces.
3. One versus all MNIST (only 2DV516): In this exercise you will train a support vector machine with an RBF kernel to recognize handwritten digits in the MNIST dataset. You will also implement your own version of the one-versus-all scheme for multiclass problems, to be compared to the built-in one-vs-one scheme.

