
03 - Know your Classics

Supervised Learning: k-NN, decision trees, SVM and kernel trick

François Pitié

Assistant Professor in Media Signal Processing


Department of Electronic & Electrical Engineering, Trinity College Dublin
[4C16/5C16] Deep Learning and its Applications — 2024/2025

1
Before we dive into Neural Networks, keep in mind that Neural Nets
have been around for a while and, until recently, they were not the
method of choice for Machine Learning.
A zoo of algorithms exists out there, and we'll briefly introduce here
some of the classic methods for supervised learning.

2
k-nearest neighbours

k-nearest neighbours is a very simple yet powerful technique. For an
input x, you retrieve the k nearest neighbours in the training data,
then return the majority class among the k retrieved labels. You can
also return a confidence, as the proportion of the majority class among
those k labels.
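
A minimal sketch of this with scikit-learn (assuming it is installed; the
dataset here is synthetic and not the one used in the figures that follow):

    # k-NN sketch: majority vote among the k nearest training points.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_moons(n_samples=400, noise=0.3, random_state=0)   # toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=3)        # k = 3
    knn.fit(X_train, y_train)

    print("accuracy:  ", knn.score(X_test, y_test))
    # The proportion of the majority class among the k neighbours
    # can be read as a confidence:
    print("confidence:", knn.predict_proba(X_test[:5]))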

3
k-nearest neighbours

[Figure: input data and decision boundaries for 1-NN, 3-NN and 10-NN on 3 problems;
accuracies per row: 97.5%/97.5%/95.0%, 87.5%/92.5%/87.5%, 92.5%/92.5%/92.5%.]

Decision boundaries on 3 problems. The intensity of the shades indicates
the certainty we have about the prediction.
4
k-nearest neighbours

pros:

• It is a non-parametric technique.
• It works surprisingly well, and you can obtain high accuracy if
  the training set is large enough.

cons:

• Finding the nearest neighbours is computationally expensive and
  scales poorly with the size of the training set.
• It may generalise very badly if your training set is small.
• You don't learn much about the features themselves.

5
Decision Trees

In decision trees (Breiman et al., 1984) and their many variants, each
node of the decision tree is associated with a region of the input
space, and internal nodes partition that region into sub-regions (in
a divide-and-conquer fashion).

The regions are split along the axes of the input space (e.g. at each
node you take a decision according to a binary test such as x2 < 3).
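
A minimal scikit-learn sketch of such axis-aligned splits (synthetic data,
illustrative settings only):

    from sklearn.datasets import make_moons
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Every internal node is a binary test on a single feature
    # (e.g. x2 <= 0.25), i.e. an axis-aligned split of the input space.
    print(export_text(tree, feature_names=["x1", "x2"]))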

6
Decision Trees

[Figure: input data and decision boundaries for a single Decision Tree, a Random Forest
and AdaBoost on the same 3 problems; accuracies per row: 95.0%/92.5%/92.5%,
77.5%/82.5%/82.5%, 95.0%/92.5%/95.0%.]

In AdaBoost and Random Forests, multiple decision trees are aggregated
to produce a probability for each prediction.
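
A minimal sketch of this aggregation with scikit-learn (synthetic data,
default settings; not the exact configuration behind these figures):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

    X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

    rf  = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Averaging the votes of many trees gives a probability per prediction.
    print(rf.predict_proba(X[:3]))
    print(ada.predict_proba(X[:3]))
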
7
Decision Trees

Random Forests gained a lot of popularity before the rise of Neural
Nets, as they can be computed very efficiently.
For instance, they were used for body part identification in the
Microsoft Kinect.

[1] Real-Time Human Pose Recognition in Parts from a Single Depth Image
J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, B. Moore, T. Sharp, 2011
[https://goo.gl/UTM6s1]

8
Decision Trees

pros:
• It is fast.
cons:
• Decisions are taken along the axes (e.g. x1 < 3), but it could be more
  efficient to split the classes along a diagonal (e.g. x1 < x2).
9
Decision Trees

SEE ALSO:

AdaBoost, Random Forests, XGBoost.

LINKS:

https://www.youtube.com/watch?v=p17C9q2M00Q

10
SVM

Until recently, Support Vector Machines were the most popular
technique around.
Like logistic regression, SVM starts as a linear classifier:

y = [x⊺ w > 0]

The difference with logistic regression lies in the choice of the loss
function.

11
SVM

Whereas in logistic regression the loss function was based on the
cross-entropy, the loss function in SVM is based on the Hinge loss:

LSVM(w) = ∑_{i=1}^{N} [yi = 0] max(0, 1 + xi⊺w) + [yi = 1] max(0, 1 − xi⊺w)
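
As a concrete reading of this formula, here is a minimal NumPy sketch (the
function name hinge_loss is ours, and labels are assumed to be in {0, 1}
as above):

    import numpy as np

    def hinge_loss(w, X, y):
        scores = X @ w                       # x_i^T w for every observation
        s = np.where(y == 1, 1.0, -1.0)      # map {0, 1} labels to {-1, +1}
        # max(0, 1 - s_i * score_i) reproduces both branches of the sum
        return np.sum(np.maximum(0.0, 1.0 - s * scores))

At w = 0 every score is 0, so the loss is simply N, the number of observations.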

12
SVM

From a geometrical point of view, SVM seeks to find the hyperplane
that maximises the separation between the two classes.

[Figure: maximum-margin hyperplane separating two classes.]

13
SVM

There is a lot more to SVM, but it will not be covered in this course.

14
No Free Lunch Theorem

Note that there is, a priori, no advantage to using a linear SVM over
logistic regression in terms of performance alone. It all depends on the
type of data you have.
Recall that the choice of loss function directly relates to the assumptions
you make about the distribution of the prediction errors, and thus
about the dataset of your problem.

15
No Free Lunch Theorem

This is formalised in the “no free lunch” theorem (Wolpert, 1996), which
tells us that classifiers perform equally well when averaged over all
possible problems. In other words: your choice of classifier should
depend on the problem at hand.

[Figure: performance of classifiers A, B and C plotted against different
problems/datasets.]

16
SVM

SVM gained popularity when it became associated with the kernel trick.

17
Kernel Trick

Recall that in linear regression, we managed to fit non-linear functions
by augmenting the feature space with higher-order polynomials of the
observations, e.g. x, x², x³, etc.
What we've done is to map the original features into a higher-dimensional
feature space: ϕ : x ↦ ϕ(x). In our case we had:

ϕ(x) = [1, x, x², x³, …]⊺

18
Kernel Trick

The idea here is the same: we want to find a feature map x ↦ ϕ(x) that
transforms the input data into a new representation in which the problem
can be solved using a linear classifier.

19
Transforming the original features into more complex ones is a key
ingredient of machine learning, and something that we’ll see again
with Deep Learning.
The collected features are usually not optimal for linearly separating
the classes and it is often unclear how these should be transformed.
We would like the machine learning technique to learn how to best
recombine the features so as to yield optimal class separation.

20
So our first problem is to find a useful feature transformation ϕ.
Another problem is that the size of the new feature vectors ϕ(x) could
potentially grow very large.
Consider the following polynomial augmentations:

ϕ([x1, x2]⊺) = [1, x1, x2, x1x2, x1², x2²]⊺

ϕ([x1, x2, x3]⊺) = [1, x1, x2, x3, x1x3, x1x2, x2x3, x1², x2², x3²]⊺

The new feature vectors have significantly increased in size.


It can be shown that for input features of dimension p and a polynomial
of degree d, the expanded features are of dimension (p + d)! / (p! d!).

21
For example, if you have p = 100 features per observation and you are
looking at a polynomial of order 5, the resulting feature vector is of
dimension about 100 million!
Now, recall that Least-Squares solutions are given by

ŵ = (X⊺X)⁻¹ X⊺y

If ϕ(x) is of dimension 100 million, then X⊺X is of size 10⁸ × 10⁸. This
is totally impractical.
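
A quick way to check the 100-million figure (a minimal sketch, assuming
Python 3.8+ for math.comb):

    from math import comb

    p, d = 100, 5
    print(comb(p + d, d))    # 96560646, i.e. roughly 100 million features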

22
So, we want to transform the original features into higher-level features,
but we do not want this to come at the cost of greatly increasing the
dimension of the original problem.
The kernel trick offers an elegant solution to this problem and allows
us to use very complex mapping functions ϕ without having to ever
explicitly compute them.

23
Kernel Trick

We start from the observation that most loss functions only operate
on the scores x⊺w, e.g.:

ŵ = arg min_w E(w) = ∑_{i=1}^{n} e(xi⊺w)

We can show (see lecture notes) that, for any x, the score at the
optimum, x⊺ŵ, can be re-expressed as:

x⊺ŵ = ∑_{i=1}^{n} αi x⊺xi ,

where the scalars x⊺xi are dot products between feature vectors.

The new weights α = [α1, ⋯, αn] can be seen as a re-parametrisation
of the p×1 vector ŵ into an n×1 vector α, with E(w) being re-expressed
as E(α). These are often called the dual coefficients in SVM.

24
Kernel Trick

Things get interesting when using our expanded features:

ϕ(x)⊺ŵ = ∑_{i=1}^{n} αi ϕ(x)⊺ϕ(xi)

To compute the score, we only ever need to know how to compute the
dot products ϕ(x)⊺ϕ(xi), not the actual high-dimensional feature
vector ϕ(xi).
Introducing the kernel function:

(u, v) ↦ κ(u, v) = ϕ(u)⊺ ϕ(v) ,

which allows us to rewrite the score as:


ϕ(x)⊺ŵ = ∑_{i=1}^{n} αi κ(x, xi).
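
A minimal NumPy sketch of this dual evaluation (α and the training points
are assumed to be given; np.dot plays the role of κ for the linear case,
and any other kernel function can be plugged in):

    import numpy as np

    def kernel_score(x, X_train, alphas, kernel=np.dot):
        # score(x) = sum_i alpha_i * kappa(x, x_i); phi is never evaluated.
        return sum(a * kernel(x, xi) for a, xi in zip(alphas, X_train))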

25
Kernel Trick

The kernel trick builds on the Theory of Reproducing Kernels, which
says that for a whole class of kernel functions κ we can find a
mapping ϕ such that κ(u, v) = ϕ(u)⊺ϕ(v).
The key is that we can define κ without having to explicitly define ϕ.

26
Kernel Trick

Many kernel functions are possible. For instance, the polynomial kernel
is defined as:

κ(u, v) = (r + γ u⊺v)^d

and one can show that this is equivalent to using a polynomial mapping
as proposed earlier, except that instead of requiring hundreds of
millions of dimensions, we only need to take scalar products between
vectors of dimension p.
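
As a sanity check of this equivalence, here is a minimal NumPy sketch for
p = 2, d = 2, r = γ = 1 (the √2 factors in phi differ from the unscaled
listing given earlier, but they span the same feature space):

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

    u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
    k_trick    = (1.0 + u @ v) ** 2   # kernel: a single 2-d dot product
    k_explicit = phi(u) @ phi(v)      # explicit 6-dimensional expansion
    print(np.isclose(k_trick, k_explicit))   # True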

27
Kernel Trick

The most commonly used kernel is probably the Radial Basis Function
(RBF) kernel:

κ(u, v) = e^(−γ ∥u − v∥²)

The induced mapping ϕ is infinite-dimensional, but that's OK because
we never need to evaluate ϕ(x).

28
Kernel Trick: Intuition (pt1)

To get some intuition about these kernels, consider the kernel trick
for an RBF kernel. The score for a particular observation x is:

score(x) = ∑_{i=1}^{n} αi κ(x, xi)

The kernel function κ(u, v) = e^(−γ ∥u − v∥²) is a measure of similarity
between observations: if the two observations are close, κ(u, v) ≈ 1;
if they are very different, κ(u, v) ≈ 0. We can see it as a soft
neighbourhood indicator function, with the scale of the neighbourhood
controlled by γ.
(As you can imagine, this is less intuitive for other kernels.)

29
Kernel Trick: Intuition (pt2)

Let's choose αi = +1 for positive observations and αi = −1 for negative
observations. This is obviously not optimal, but it is in fact close
to what happens in SVM. We now have something resembling k-NN.
Indeed, look at the score:
score(x) = ∑_{i=1}^{n} αi κ(x, xi)
         ≈ ∑_{i ∈ neighbours of x} (+1 if yi is positive, −1 if yi is negative)
         ≈ (# positive neighbours of x) − (# negative neighbours of x)

This makes sense: if x has more positive than negative neighbours in
the dataset, then its score should be high and its prediction positive.
We thus have something similar to k-NN. The main difference is that
instead of retrieving a fixed number k of closest neighbours, we
consider all the neighbours within some radius (controlled by γ).
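
A minimal NumPy sketch of this ±1 intuition (γ and the data are assumed
given; this is not the SVM optimisation, just the hand-set weights
described above):

    import numpy as np

    def rbf_score(x, X_train, y_train, gamma=5.0):
        alphas = np.where(y_train == 1, 1.0, -1.0)   # +1 / -1 per observation
        k = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
        # ~ (# positive neighbours) - (# negative neighbours), softly weighted
        return np.sum(alphas * k)
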
30
Kernel Trick: Intuition (pt3)

In SVM, the actual values of α̂i are estimated by minimising the
Hinge loss.
The optimisation falls outside of the scope of this course material. We
could use Gradient Descent, but, as it turns out, the Hinge loss makes
this problem a constrained optimisation problem and we can use a
solver for that. The good news is that we can find the global minimum
without having to worry about convergence issues.
We find after optimisation that, indeed, −1 ≤ α̂i ≤ 1, with the sign of
α̂i indicating the class membership, thus following a similar idea to
what was proposed in the previous slide.
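
A minimal scikit-learn sketch of what this looks like in practice
(synthetic data, illustrative settings; SVC exposes the signed dual
coefficients and the support vector indices):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
    svm = SVC(kernel="rbf", gamma=5.0, C=1.0).fit(X, y)

    print(len(svm.support_))        # number of support vectors (non-zero alphas)
    print(svm.dual_coef_[:, :5])    # signed alpha_i values for a few of them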

31
Kernel Trick: Intuition (pt4)
[Figure: scatter of the training set with each datapoint annotated by its αi value
(e.g. α12 = +0.9, α5 = −0.8, α23 = 0).]

SVM-RBF example with score contour lines. The thickness of each
observation's outer circle is proportional to ∣αi∣ (no outer circle means
αi = 0). Only a subset of datapoints, called support vectors, have
non-zero αi. They lie near the class boundary and are the only
datapoints used in making predictions.
32
SVM results with polynomial kernel

[Figure: input data and decision boundaries for a linear SVM and polynomial SVMs of
degree d = 2 and d = 3 on the same 3 problems; accuracies per row: 87.5%/45.0%/85.0%,
40.0%/95.0%/40.0%, 95.0%/45.0%/85.0%.]

Decision boundaries for SVM using linear and polynomial kernels.

33
SVM results with RBF kernel

[Figure: input data and decision boundaries for RBF SVMs with γ = 1, 5 and 10 on the
same 3 problems; accuracies per row: 95.0%/97.5%/95.0%, 85.0%/87.5%/82.5%,
95.0%/92.5%/90.0%.]

Decision Boundaries for SVM using Gaussian kernels. The value of γ controls
the smoothness of the boundary by setting the size of the neighbourhood.
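
A minimal scikit-learn sketch of this γ sweep (synthetic data; the
accuracies will not match the figures above):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for gamma in (1, 5, 10):
        svm = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
        print(f"gamma={gamma}: accuracy={svm.score(X_test, y_test):.3f}")
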
34
Other Kernel Methods Exist

Support vector machines are not the only algorithms that can avail of
the kernel trick. Many other linear models (including logistic regression)
can be enhanced in this way. They are known as kernel methods.

35
Kernel Methods Drawbacks

A major drawback of kernel methods is that the cost of evaluating the
decision function is proportional to the number of training examples,
because the i-th observation contributes a term αi κ(x, xi) to the
decision function.
As we have seen, SVM mitigates this somewhat by learning which examples
contribute the most (the support vectors).
The cost of training is, however, still high for large datasets (e.g. with
tens of thousands of datapoints).

36
Kernel Methods and Neural Networks

Evidence that deep learning could outperform kernel SVM on large
datasets emerged in 2006, when a team led by G. Hinton demonstrated
a neural network outperforming SVMs on the MNIST benchmark. The real
tipping point occurred with the 2012 paper by A. Krizhevsky, I. Sutskever
and G. Hinton (see handout-00).

37
References

SEE ALSO:

Gaussian Processes,
Reproducing kernel Hilbert spaces,
kernel Logistic Regression

LINKS:

Laurent El Ghaoui’s lecture at Berkeley: https://goo.gl/hY1Bpn


Eric Kim’s python tutorial on SVM: https://goo.gl/73iBdx

38
Take Away

Neural Nets have existed for a while, but it is only recently (2012) that
they have started to surpass all other techniques.
Kernel-based techniques were very popular until recently, as they offer
an elegant way of transforming input features into more complex features
that can then be linearly separated.
The problem with kernel techniques is that they cannot deal efficiently
with large datasets (e.g. more than tens of thousands of observations).

39
