Lecture 7 - SVM
Boyu Wang
Department of Computer Science
University of Western Ontario
Intuitions
What is a support vector machine (SVM)?
▶ A supervised learning technique (mostly used for classification), e.g. f(photo) = female or f(photo) = male.
▶ The learning algorithm is trained on a set of examples (x_1, y_1), ..., (x_m, y_m), with x ∈ R^n and y ∈ {−1, 1}.
Linear Separator w^T x + b = 0
[Figure: 2-D data with x1 = height and x2 = length of hair; y = +1 (female) and y = −1 (male). Many different lines w^T x + b = 0 can split the data.]
Support vector machines (SVMs)
Problem Formulation: Margin
Illustration of the geometry of hyperplane
[Figure: the line w^T x + b = 0 in the (x1, x2) plane.]
▶ The hyperplane lies at a (signed) distance −b/||w|| from the origin along the direction of w.
▶ w^T x + b > 0 ⇒ x is above the line, so such a point is predicted as y = 1.
Illustration of the geometry of margin
[Figure: a point x at distance γ from the hyperplane w^T x + b = 0; x′ is its projection onto the hyperplane and w is the normal direction.]
▶ Moving from x against the unit normal w/||w|| by a distance γ reaches the hyperplane:
    x′ = x − γ w/||w||
▶ Since x′ lies on the hyperplane, w^T x′ + b = 0, i.e.
    w^T (x − γ w/||w||) + b = 0   ⇒   γ = (w^T x + b)/||w||
▶ For a point on the negative side of the hyperplane, γ = −(w^T x + b)/||w||.
▶ In general, γ = y (w^T x + b)/||w||.
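The formula above is easy to check numerically. Below is a minimal NumPy sketch (the toy hyperplane and points are made up for illustration) that computes the geometric margin γ_i = y_i (w^T x_i + b)/||w|| for a few points.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Geometric margin gamma_i = y_i * (w^T x_i + b) / ||w|| for each row of X."""
    return y * (X @ w + b) / np.linalg.norm(w)

# toy hyperplane x1 + x2 - 1 = 0 and two correctly classified points
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0],
              [0.0, 0.0]])
y = np.array([1.0, -1.0])

print(geometric_margin(w, b, X, y))   # both margins are positive
```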
Large margin intuition
[Figure: a separating hyperplane with margin γ to the closest point on each side.]
Large margin formulation
▶ Objective function:
    max_{w,b} γ
    s.t. y_i (w^T x_i + b)/||w|| ≥ γ,  ∀ i = 1, ..., m
[Figure: hyperplane w^T x + b = 0 with margin γ on each side.]
Large margin formulation
▶ Option 1: fix the scale of w by requiring ||w|| = 1, then maximize γ directly:
    max_{w,b} γ
    s.t. y_i (w^T x_i + b) ≥ γ,  ∀ i = 1, ..., m
         ||w|| = 1
Large margin formulation
▶ Option 2: fix the scale of w so that the closest points satisfy y_i (w^T x_i + b) = 1, i.e. the margin hyperplanes are w^T x + b = 1 and w^T x + b = −1. The margin is then γ = 1/||w||, and the problem becomes
    max_{w,b} 1/||w||
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
  Maximizing 1/||w|| is equivalent to minimizing ||w||, so this can be rewritten as
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
[Figure: hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = ±1.]
Support vector machines (SVMs)
Optimization formulation
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
▶ Maximizing the margin (= 1/||w||) ⇔ minimizing ||w||_2^2 ⇒ we prefer a simple model.
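As a quick sanity check of this formulation, the sketch below (assuming scikit-learn is available; the toy data are made up) fits a linear SVM with a very large C, which approximates the hard-margin problem on separable data, and reports the resulting w, b, and margin 1/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# two linearly separable clusters
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.5], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# a very large C approximates the hard-margin SVM above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```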
Optimization
Optimization formulation
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
This is a constrained optimization problem.
Digression: constrained optimization
Optimization with inequality constraints
    min_w f(w)
    s.t. g_i(w) ≤ 0,  ∀ i = 1, ..., m
Lagrangian primal problem
▶ Lagrangian: L(w, α) = f(w) + Σ_{i=1}^m α_i g_i(w), with multipliers α_i ≥ 0.
▶ Primal problem: min_w max_{α ≥ 0} L(w, α).
Lagrangian dual problem
▶ Dual problem: max_{α ≥ 0} min_w L(w, α).
▶ At the optimum (w*, α*), the KKT conditions hold:
    ∂/∂w L(w*, α*) = 0                    (C1)
    α_i* g_i(w*) = 0,   i = 1, ..., m     (C2)
    g_i(w*) ≤ 0,        i = 1, ..., m     (C3)
    α_i* ≥ 0,           i = 1, ..., m     (C4)
Back to SVM
Original (primal) optimization problem
    min_{w,b} (1/2)||w||_2^2
    s.t. 1 − y_i (w^T x_i + b) ≤ 0,  ∀ i = 1, ..., m
▶ Lagrangian function:
    L(w, b, α) = (1/2)||w||_2^2 + Σ_{i=1}^m α_i [1 − y_i (w^T x_i + b)]
▶ KKT conditions:
    ∂L/∂w = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
    ∂L/∂b = 0  ⇒  Σ_{i=1}^m α_i y_i = 0
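To make the dual on the next slide explicit, here is the intermediate substitution step, written in LaTeX for readability: plug w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 back into the Lagrangian above.

```latex
\begin{aligned}
L(w,b,\alpha)
 &= \tfrac{1}{2}\Big\|\sum_{i} \alpha_i y_i x_i\Big\|_2^2
    + \sum_{i} \alpha_i
    - \sum_{i} \alpha_i y_i \Big(\sum_{j} \alpha_j y_j x_j\Big)^{\!\top} x_i
    - b \sum_{i} \alpha_i y_i \\
 &= \sum_{i} \alpha_i
    - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^{\top} x_j .
\end{aligned}
```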
Back to SVM
Plugging the KKT conditions back into L gives the
Dual optimization problem
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0  and  α_i ≥ 0,  i = 1, ..., m
Having found α:
▶ By the KKT condition,
    w = Σ_{i=1}^m α_i y_i x_i
▶ If x_i is not on the margin, then y_i (w^T x_i + b) > 1, and the complementary slackness condition (C2) forces α_i = 0.
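The sparsity of α can be seen directly from a fitted model. A minimal sketch (assuming scikit-learn; the toy data are made up): scikit-learn stores α_i y_i for the support vectors only, so w = Σ_i α_i y_i x_i can be recomputed by hand and compared with the solver's w.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for the support vectors (the points with alpha_i > 0)
alpha_times_y = clf.dual_coef_[0]
w_from_dual = alpha_times_y @ clf.support_vectors_

print("support vector indices:", clf.support_)
print("w from dual:", w_from_dual)
print("w from solver:", clf.coef_[0])
```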
Two key observations from dual
▶ Sparsity: only the support vectors (the points on the margin) get α_i > 0; for all other points α_i = 0.
▶ Inner products: both the dual objective and the prediction depend on the data only through inner products x_i^T x_j, which is what enables the kernel trick.
Kernel Trick
Kernel functions
▶ A kernel function computes an inner product in some feature space, k(x_i, x_j) = φ(x_i)^T φ(x_j), without ever computing φ(x) explicitly.
Dual formulation of SVMs
Training
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0  and  α_i ≥ 0,  i = 1, ..., m
Prediction
    f(x) = sgn( Σ_{i=1}^m α_i y_i x_i^T x + b )
The kernel trick
Training
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0  and  α_i ≥ 0,  i = 1, ..., m
Prediction
    f(x) = sgn( Σ_{i=1}^m α_i y_i k(x_i, x) + b )
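A minimal sketch of the kernelized prediction rule above (pure NumPy; the kernel choice, dual variables, and data are placeholders you would obtain from training):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def predict(x, X_train, y_train, alpha, b, kernel=gaussian_kernel):
    """f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + b )."""
    s = sum(a_i * y_i * kernel(x_i, x)
            for a_i, y_i, x_i in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```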
Commonly used kernels
▶ Gaussian kernels:
    k(x_i, x_j) = exp( −||x_i − x_j||_2^2 / (2σ^2) )
▶ Polynomial kernels:
    k(x_i, x_j) = (1 + x_i^T x_j)^d
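A minimal NumPy sketch of the two kernels above as Gram-matrix computations (σ and d are the hyperparameters from the slide; the function names are my own):

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def polynomial_gram(X, Z, d=3):
    """K[i, j] = (1 + x_i^T z_j)^d."""
    return (1.0 + X @ Z.T) ** d
```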
Effect of kernel functions
[Figures: decision boundaries on a first training set with a linear kernel, a Gaussian kernel, and a polynomial kernel, and on a second training set with a linear kernel and a Gaussian kernel.]
Non-separable Case
Non-separable case
[Figure: 2-D data that no single line can separate without error.]
Introducing slack variables
[Figure: hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = ±1; points that violate their margin constraint are marked with slack variables ξ_i and ξ_j.]
Introducing slack variables (primal)
Primal optimization problem for the separable case
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
Primal optimization problem with slack variables
    min_{w,b,ξ} (1/2)||w||_2^2 + C Σ_{i=1}^m ξ_i
    s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i,  ∀ i = 1, ..., m
         ξ_i ≥ 0,  ∀ i = 1, ..., m
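The parameter C weights the slack term: a small C tolerates many margin violations, while a large C penalizes them heavily. A minimal sketch (assuming scikit-learn; the random toy data are made up) that shows how the number of support vectors changes with C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)),
               rng.normal(1.0, 1.0, (20, 2))])   # two overlapping clusters
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: number of support vectors = {clf.support_.size}")
```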
Introducing slack variables (dual)
Dual optimization problem for the separable case
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0
         α_i ≥ 0,  i = 1, ..., m
Dual optimization problem with slack variables: the objective is unchanged; the only difference is that each α_i is now box-constrained:
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0
         0 ≤ α_i ≤ C,  i = 1, ..., m
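A minimal sketch (assuming SciPy; the solver choice, tolerances, and toy data are my own, and this is only to make the formulation concrete, not how production SVM solvers work) that feeds the box-constrained dual above to a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def fit_dual(X, y, C=1.0):
    """Solve the soft-margin dual for a linear kernel and recover (w, b, alpha)."""
    m = X.shape[0]
    Q = (X @ X.T) * np.outer(y, y)              # Q[i, j] = y_i y_j x_i^T x_j

    def neg_dual(a):
        # minimize the negative of the dual objective
        return 0.5 * a @ Q @ a - a.sum()

    cons = [{"type": "eq", "fun": lambda a: a @ y}]
    res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                   bounds=[(0.0, C)] * m, constraints=cons)
    alpha = res.x
    w = (alpha * y) @ X
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)   # margin support vectors
    b = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
    return w, b, alpha

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = fit_dual(X, y, C=10.0)
print("w =", w, "b =", b, "alpha =", np.round(alpha, 3))
```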
Introducing slack variables (dual)
[Figure: the soft-margin solution illustrated on 2-D data in the (x1, x2) plane.]
Effect of C
[Figures: decision boundaries on the same training data with C = 1000 and with C = 1.]
Effect of kernel functions
[Figures: decision boundaries on the same training data with a linear kernel, Gaussian kernels with σ = 1 and σ = 10, and polynomial kernels with d = 1, 3, and 10.]
Effect of model parameters for Gaussian kernels
[Figures: decision boundaries for Gaussian kernels on the same training data over the grid σ² ∈ {10, 1, 0.1} and C ∈ {1, 100, 10000}.]
Effect of model parameters
[Figure: 3 × 3 grid of decision boundaries for Gaussian kernels with C ∈ {1, 100, 10000} (columns) and σ² ∈ {10, 1, 0.1} (rows). Caption: Gaussian kernels with different parameters.]
Support Vector Regression
Support vector regression (SVR)
▶ The same large-margin machinery can be used for regression, where the targets y are real-valued rather than labels in {−1, 1}.
Loss function for SVR
▶ Intuition: we should allow some errors (as long as they are not large) so that the algorithm is more robust to noise – we only care about the fitting error when it is larger than ε:
    min_{w,b} (1/2)||w||_2^2 + C Σ_{i=1}^m V(x_i, y_i)
  where
    V(x_i, y_i) = 0,                          if |w^T x_i + b − y_i| ≤ ε
    V(x_i, y_i) = |w^T x_i + b − y_i| − ε,    otherwise
ε-insensitive loss
    V(x_i, y_i) = 0,                          if |w^T x_i + b − y_i| ≤ ε
    V(x_i, y_i) = |w^T x_i + b − y_i| − ε,    otherwise
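A minimal NumPy sketch of the ε-insensitive loss above (vectorized over a batch of predictions):

```python
import numpy as np

def eps_insensitive_loss(y_pred, y_true, eps=0.1):
    """Zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_pred - y_true) - eps)

print(eps_insensitive_loss(np.array([0.0, 0.5, 2.0]),
                           np.array([0.05, 0.0, 0.0]),
                           eps=0.1))   # -> [0.  0.4 1.9]
```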
Effect of ε
As ε increases, the fitted function is allowed to move farther away from the data points, the number of support vectors decreases, the fitted curve becomes smoother, and the fit on the training set gets worse.
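This behaviour is easy to reproduce. A minimal sketch (assuming scikit-learn; the noisy sine data are made up) that counts support vectors as ε grows:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, (60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 60)

for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", epsilon=eps).fit(X, y)
    print(f"epsilon = {eps}: number of support vectors = {svr.support_.size}")
```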
Summary
What you should know
SVMs Summary
▶ Advantages
▶ Disadvantages