
Artificial Intelligence II (CS4442 & CS9542)

Classification: Support Vector Machines

Boyu Wang
Department of Computer Science
University of Western Ontario
Intuitions

- So far, the supervised learning algorithms we have seen (e.g.,
  linear regression, logistic regression, Gaussian discriminant
  analysis) are based on probabilistic assumptions:
  - We assume that the conditional probability p(y|x) follows
    some distribution or function parameterized by w.
  - Then, we choose the parameters by the principle of
    maximum likelihood.
- Is there another principle for choosing the parameters (e.g.,
  one without any probabilistic assumption)?
- A support vector machine (SVM) finds the model parameters
  by the large margin principle.

What is a support vector machine (SVM)?

A supervised learning technique (mostly used for classification).

- Goal: learn a mapping $f : \mathcal{X} \to \mathcal{Y}$, e.g., mapping a face photo to the
  label "female" or "male".
- The learning algorithm is trained on a set of examples
  $(x_1, y_1), \ldots, (x_m, y_m)$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$.

Linear separator $w^\top x + b = 0$

[Figure: 2-D training data with height on the $x_1$ axis and length of hair on the $x_2$ axis; the two classes are $y = +1$ (female) and $y = -1$ (male). A sequence of candidate lines $w^\top x + b = 0$ is drawn, each splitting the data.]

Support vector machines (SVMs)

- An SVM classifier finds a (linear) separator (hyperplane) that
  maximizes the margin between the two classes of examples.
- The SVM is a type of large margin classifier (other examples
  include boosting algorithms and the voted perceptron).
- Theoretical justification: a large margin leads to good
  generalization (i.e., high accuracy on test data).

Problem Formulation: Margin

Illustration of the geometry of a hyperplane

[Figure: the line $w^\top x + b = 0$ in the $(x_1, x_2)$ plane; its offset from the origin along $w$ is $-b/\|w\|$.]

- $w^\top x + b = 0$ ⇒ on the line
- $w^\top x + b > 0$ ⇒ above the line (predict $y = +1$)
- $w^\top x + b < 0$ ⇒ below the line (predict $y = -1$)

Illustration of the geometry of the margin

[Figure: a point $x$ above the hyperplane, its projection $x'$ onto the hyperplane, the distance $\gamma$ between them, and the normal vector $w$.]

Let $x'$ be the projection of $x$ onto the hyperplane, so that

$$x' = x - \gamma \frac{w}{\|w\|}, \qquad w^\top x' + b = 0.$$

Substituting the first equation into the second,

$$w^\top\!\left(x - \gamma \frac{w}{\|w\|}\right) + b = 0 \quad\Rightarrow\quad \gamma = \frac{w^\top x + b}{\|w\|}.$$

For a point on the other side of the hyperplane, $\gamma = -\dfrac{w^\top x + b}{\|w\|}$. In general, for a labeled example $(x, y)$ with $y \in \{-1, 1\}$,

$$\gamma = y\,\frac{w^\top x + b}{\|w\|}.$$

Large margin intuition

- The margin is the distance between an example and the decision
  hyperplane, denoted by $\gamma$.
- Maximize the margin of the data points that are closest to the
  hyperplane.
- Given a training set $S = \{(x_i, y_i);\ i = 1, \ldots, m\}$, learn a
  hyperplane that maximizes the margin $\gamma$, where

  $$\gamma = \min_{i=1,\ldots,m} \gamma_i = \min_{i=1,\ldots,m} \frac{y_i(w^\top x_i + b)}{\|w\|}$$

  (a numerical sketch of this quantity follows below).

[Figure: two parallel dashed lines at distance $\gamma$ on either side of the separator, touching the closest points of each class.]

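A minimal numpy sketch of this definition, assuming a weight vector `w`, bias `b`, and arrays `X`, `y` of training examples (all names are hypothetical, not part of the lecture):

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Margin of a labeled data set (X, y) w.r.t. the hyperplane w^T x + b = 0:
    gamma = min_i y_i (w^T x_i + b) / ||w||."""
    gamma_i = y * (X @ w + b) / np.linalg.norm(w)
    return gamma_i.min()

# Example: two points on either side of the line x1 + x2 = 0.
X = np.array([[1.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])
# Per-point margins are sqrt(2) ~ 1.414 and 3/sqrt(2) ~ 2.121; prints the minimum.
print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))
```
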
Large margin formulation

- Objective function:

  $$\max_{w,b}\ \gamma \qquad \text{s.t.}\quad \frac{y_i(w^\top x_i + b)}{\|w\|} \ge \gamma,\ \forall i = 1, \ldots, m$$

- The decision hyperplane and $\gamma$ are invariant to rescaling of the
  parameters $(w, b)$, so additional constraints are needed.

[Figure: separator $w^\top x + b = 0$ with margin $\gamma$ on both sides.]

Large margin formulation

- Option 1: fix $\|w\| = 1$.

  $$\max_{w,b}\ \gamma \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge \gamma,\ \forall i = 1, \ldots, m, \qquad \|w\| = 1$$

  This is a non-convex optimization problem and hard to solve!

Large margin formulation

- Option 2: fix the scale so that the closest points satisfy $w^\top x + b = \pm 1$.

[Figure: separator $w^\top x + b = 0$ with the margin hyperplanes $w^\top x + b = 1$ and $w^\top x + b = -1$ at distance $\gamma$ on either side.]

  For a point on the margin, $\gamma = \dfrac{y(w^\top x + b)}{\|w\|} = \dfrac{1}{\|w\|}$, so the problem becomes

  $$\max_{w,b}\ \frac{1}{\|w\|} \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \forall i = 1, \ldots, m$$

  which is equivalent to

  $$\min_{w,b}\ \frac{1}{2}\|w\|_2^2 \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \forall i = 1, \ldots, m.$$

Support vector machines (SVMs)

Optimization formulation

$$\min_{w,b}\ \frac{1}{2}\|w\|_2^2 \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \forall i = 1, \ldots, m$$

- Maximize the margin while classifying all examples correctly.
- Maximizing the margin ($= 1/\|w\|$) ⇔ minimizing $\|w\|_2^2$ ⇒ prefer a simple model.

Optimization

Optimization formulation

Primal optimization problem

$$\min_{w,b}\ \frac{1}{2}\|w\|_2^2 \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \forall i = 1, \ldots, m$$

- An instance of convex optimization – admits a unique solution.
- Minimizes a quadratic function subject to $m$ linear constraints.
- An instance of quadratic programming (QP) – "off-the-shelf" packages
  are available (e.g., quadprog in MATLAB, CVXOPT in Python); a sketch
  follows below.

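For illustration, here is a minimal sketch of handing the hard-margin primal QP above to CVXOPT. The function name `hard_margin_svm_primal` and the tiny regularizer on $b$ are my own choices, not part of the lecture:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_primal(X, y):
    """Solve min_{w,b} 0.5*||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
    X is (m, n); y is (m,) with entries in {-1, +1}. Returns (w, b)."""
    m, n = X.shape
    y = y.astype(float)
    # Decision variable z = [w_1, ..., w_n, b].
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)   # quadratic term 0.5 * ||w||^2
    P[n, n] = 1e-8          # tiny ridge on b so the QP stays well conditioned (not in the lecture)
    q = np.zeros(n + 1)
    # y_i (w^T x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1, i.e. G z <= h.
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]
```

In practice one usually solves the (soft-margin) dual instead, since only the dual admits the kernel trick introduced later in the lecture.
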
Digression: constrained optimization

Optimization with inequality constraints

$$\min_w\ f(w) \qquad \text{s.t.}\quad g_i(w) \le 0,\ \forall i = 1, \ldots, m$$

Method of Lagrange multipliers:

$$\text{Lagrangian:}\quad L(w, \alpha) = f(w) + \sum_{i=1}^m \alpha_i g_i(w), \qquad \text{s.t.}\ \alpha \succeq 0$$

Let $\theta_P(w)$ be defined as

$$\theta_P(w) = \max_{\alpha \succeq 0} L(w, \alpha) = \begin{cases} f(w), & \text{if } w \text{ satisfies the primal constraints} \\ \infty, & \text{otherwise.} \end{cases}$$

Lagrangian primal problem

Original (primal) problem:

$$\min_w\ f(w) \qquad \text{s.t.}\quad g_i(w) \le 0,\ \forall i = 1, \ldots, m \tag{1}$$

Lagrangian primal problem:

$$\min_w\ \theta_P(w) = \min_w \max_{\alpha \succeq 0} L(w, \alpha) \tag{2}$$

Claim: problem (1) is equivalent to problem (2).

Lagrangian dual problem

- Consider a slightly different quantity:

$$\theta_D(\alpha) = \min_w L(w, \alpha)$$

Lagrangian dual problem:

$$\max_{\alpha \succeq 0}\ \theta_D(\alpha) = \max_{\alpha \succeq 0} \min_w L(w, \alpha)$$

Lagrangian primal problem:

$$\min_w\ \theta_P(w) = \min_w \max_{\alpha \succeq 0} L(w, \alpha)$$

In general,

$$d^* = \max_{\alpha \succeq 0} \min_w L(w, \alpha) \ \le\ \min_w \max_{\alpha \succeq 0} L(w, \alpha) = p^*$$

For the SVM, $d^* = p^*$.

Relation between primal and dual

- For $d^* = p^* = L(w^*, \alpha^*)$, certain conditions must be satisfied,
  known as the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial}{\partial w} L(w^*, \alpha^*) = 0 \tag{C1}$$
$$\alpha_i^* g_i(w^*) = 0, \quad i = 1, \ldots, m \tag{C2}$$
$$g_i(w^*) \le 0, \quad i = 1, \ldots, m \tag{C3}$$
$$\alpha_i^* \ge 0, \quad i = 1, \ldots, m \tag{C4}$$

- Eq. (C2) is called the KKT dual complementarity condition – if
  $\alpha_i^* > 0$, then $g_i(w^*) = 0$ (the constraint is active).
- Roadmap: primal problem ⇒ Lagrangian function $L(w, \alpha)$
  ⇒ $\min_w L(w, \alpha)$ ⇒ dual problem.

Back to SVM

Original (primal) optimization problem

$$\min_{w,b}\ \frac{1}{2}\|w\|_2^2 \qquad \text{s.t.}\quad 1 - y_i(w^\top x_i + b) \le 0,\ \forall i = 1, \ldots, m$$

- Lagrangian function:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^m \alpha_i \left[1 - y_i(w^\top x_i + b)\right]$$

- KKT conditions:

$$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^m \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^m \alpha_i y_i = 0$$

Back to SVM

Plugging the KKT conditions back into $L$ gives the

Dual optimization problem

$$\max_\alpha\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \qquad \text{s.t.}\quad \sum_{i=1}^m \alpha_i y_i = 0,\quad \alpha_i \ge 0,\ i = 1, \ldots, m$$

Having found $\alpha$ (see the sketch below):

- By the KKT condition, $w = \sum_{i=1}^m \alpha_i y_i x_i$.
- For any support vector $x_i$ we have $y_i(w^\top x_i + b) = 1$, so
  $b = y_i - w^\top x_i$ for any $i$ with $\alpha_i > 0$.

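A small numpy sketch of this recovery step, assuming `alpha` has already been obtained from a dual QP solver (the array and function names are hypothetical):

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-6):
    """Recover the primal solution from the dual variables:
    w = sum_i alpha_i y_i x_i,  b = y_i - w^T x_i for any i with alpha_i > 0."""
    w = (alpha * y) @ X                 # sum_i alpha_i y_i x_i
    sv = np.where(alpha > tol)[0]       # indices of support vectors
    i = sv[0]                           # any support vector works
    b = y[i] - X[i] @ w
    return w, b
```
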
Two key observations from the dual

Observation 1: most $\alpha_i$'s are 0

- KKT condition (C2): $\alpha_i\left[y_i(w^\top x_i + b) - 1\right] = 0$.
- If $x_i$ is not on the margin, $y_i(w^\top x_i + b) > 1$, so $\alpha_i = 0$.
- $\alpha_i > 0$ only for the $x_i$ on the margin.
- These are the support vectors – only the support vectors matter
  (see the sketch below).

[Figure: separator with margin hyperplanes $w^\top x + b = \pm 1$; the points lying on the margin are the support vectors.]

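A one-liner illustrating this observation, assuming `alpha` comes from a dual solver (numerically, "zero" means below a small tolerance):

```python
import numpy as np

def support_vector_indices(alpha, tol=1e-6):
    """Indices i with alpha_i > 0 (up to numerical tolerance); only these
    training points contribute to w = sum_i alpha_i y_i x_i and to predictions."""
    return np.where(alpha > tol)[0]
```
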
Two key observations from the dual

Observation 2: only inner products matter

- Dual optimization problem:

$$\max_\alpha\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \qquad \text{s.t.}\quad \sum_{i=1}^m \alpha_i y_i = 0,\quad \alpha_i \ge 0,\ i = 1, \ldots, m$$

- Primal solution: $w = \sum_{i=1}^m \alpha_i y_i x_i$.
- For prediction:

$$f(x) = \operatorname{sgn}\left(w^\top x + b\right) = \operatorname{sgn}\left(\sum_{i=1}^m \alpha_i y_i\, x_i^\top x + b\right)$$

- Replace the inner products with kernel functions, $x_i^\top x_j \Rightarrow k(x_i, x_j)$,
  to produce non-linear SVMs.

Kernel Trick

Kernel functions

- Whenever a learning algorithm can be written in terms of inner
  products, there is a kernel version of that algorithm.
- Given a (nonlinear) feature mapping $\phi : x \to \phi(x)$, its kernel is
  the function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ with

  $$k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$$

- Conversely, by choosing the kernel function $k$, we implicitly
  choose a feature mapping $\phi$.
- A kernel function is the inner product in the space defined by $\phi$ –
  a notion of similarity.

Dual formulation of SVMs

Training

$$\max_\alpha\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \qquad \text{s.t.}\quad \sum_{i=1}^m \alpha_i y_i = 0,\quad \alpha_i \ge 0,\ i = 1, \ldots, m$$

Prediction

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^m \alpha_i y_i\, x_i^\top x + b\right)$$

Kernel trick: choose a feature mapping $\phi : x \to \phi(x)$ and replace

$$x_i^\top x_j \ \to\ \phi(x_i)^\top \phi(x_j) \ \Leftrightarrow\ k(x_i, x_j)$$

In the dual formulation, the feature mapping $\phi$ is never needed explicitly,
either to learn or to make predictions!

The kernel trick

Training

$$\max_\alpha\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \qquad \text{s.t.}\quad \sum_{i=1}^m \alpha_i y_i = 0,\quad \alpha_i \ge 0,\ i = 1, \ldots, m$$

Prediction

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^m \alpha_i y_i\, k(x_i, x) + b\right)$$

Only kernel evaluations appear: the feature mapping $\phi$ is not needed
either to learn or to make predictions (a prediction sketch follows below).

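A minimal sketch of the kernelized prediction rule above; `alpha`, `X`, `y`, and `b` are assumed to come from training, and `kernel` is any kernel function such as those sketched after the next slide:

```python
import numpy as np

def svm_predict(x, X, y, alpha, b, kernel):
    """f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + b ).
    In practice only the support vectors (alpha_i > 0) contribute to the sum."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
    return np.sign(s + b)
```
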
Commonly used kernels

- Linear kernel – linear SVMs:
  $$k(x_i, x_j) = x_i^\top x_j$$
- Gaussian kernel:
  $$k(x_i, x_j) = e^{-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}}$$
- Polynomial kernel:
  $$k(x_i, x_j) = \left(1 + x_i^\top x_j\right)^d$$
- We can impose prior knowledge by designing problem-dependent
  kernels (e.g., $k(x_i, x_j) = \exp\!\left(-\frac{\sin\!\left(\|x_i - x_j\|_2^2 / T\right)}{2\sigma^2}\right)$ for periodic time series)!

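Straightforward implementations of these kernels (a sketch; the helper `gram_matrix` is my own addition for building the $m \times m$ kernel matrix that appears in the dual):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def gaussian_kernel(xi, xj, sigma=1.0):
    # k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(xi, xj, d=3):
    # k(xi, xj) = (1 + xi^T xj)^d
    return (1.0 + xi @ xj) ** d

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K with K[i, j] = k(x_i, x_j)."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K
```
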
Effect of kernel functions

[Figures: decision boundaries learned on two example training sets with a linear kernel, a Gaussian kernel, and a polynomial kernel.]

Non-separable Case

Non-separable case

- The derivation of SVMs as presented so far assumed that the
  data is linearly separable.
- What if the data is not separable?

[Figure: a 2-D data set in which no line separates the two classes; some points of each class lie on the wrong side of any candidate separator.]

Introducing slack variables

- Non-separable ⇒ for any separator $w^\top x + b = 0$, there exists some $x_i$
  in the training set such that $y_i(w^\top x_i + b) \not\ge 1$.
- Solution: introduce slack variables $\xi_i \ge 0$, which permit the SVM to
  make mistakes:

  $$y_i(w^\top x_i + b) \ge 1 - \xi_i$$

[Figure: separator with margin hyperplanes $w^\top x + b = \pm 1$; two violating points are shown with their slacks $\xi_i$ and $\xi_j$ measuring how far they fall on the wrong side of their margin.]

Introducing slack variables (primal)

Primal optimization problem for the separable case

$$\min_{w,b}\ \frac{1}{2}\|w\|_2^2 \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \forall i = 1, \ldots, m$$

- New constraints with slack variables:
  $y_i(w^\top x_i + b) \ge 1 - \xi_i$, with $\xi_i \ge 0$, $i = 1, \ldots, m$.

Primal optimization problem for the non-separable case

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\ \forall i = 1, \ldots, m$$

Introducing slack variables (dual)

Dual optimization problem for the separable case

$$\max_\alpha\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \qquad \text{s.t.}\quad \sum_{i=1}^m \alpha_i y_i = 0,\quad \alpha_i \ge 0,\ i = 1, \ldots, m$$

- New constraints with slack variables:
  $y_i(w^\top x_i + b) \ge 1 - \xi_i$, with $\xi_i \ge 0$, $i = 1, \ldots, m$.

Dual optimization problem for the non-separable case

$$\max_\alpha\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \qquad \text{s.t.}\quad \sum_{i=1}^m \alpha_i y_i = 0,\quad 0 \le \alpha_i \le C,\ i = 1, \ldots, m$$

The only change is the upper bound $C$ on each $\alpha_i$ (a QP sketch follows below).

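A sketch of solving this box-constrained dual with CVXOPT, given a precomputed kernel matrix `K` (e.g., from the `gram_matrix` helper sketched earlier); the function name and tolerances are my own choices:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_svm_dual(K, y, C=1.0):
    """Solve  max_a sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K_ij
       s.t.   sum_i a_i y_i = 0,  0 <= a_i <= C,
    where K is the m x m kernel matrix and y in {-1, +1}^m."""
    solvers.options['show_progress'] = False
    m = len(y)
    y = y.astype(float)
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(m))
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))          # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1))                             # y^T a = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    # Recover the bias from any "margin" support vector with 0 < alpha_i < C
    # (assumes at least one such point exists).
    i = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))
    bias = y[i] - np.sum(alpha * y * K[:, i])
    return alpha, bias
```

With the kernel trick, `K[i, j] = k(x_i, x_j)`, and the returned `alpha` and `bias` plug directly into the kernelized prediction rule sketched earlier.
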
Introducing slack variables (dual)

By the KKT dual-complementarity conditions:

- $\alpha_i = 0$ ⇒ $y_i(w^\top x_i + b) > 1$ ⇒ outside the margin
- $0 < \alpha_i < C$ ⇒ $y_i(w^\top x_i + b) = 1$ ⇒ on the margin
- $\alpha_i = C$ ⇒ $y_i(w^\top x_i + b) < 1$ ⇒ inside the margin / misclassified

[Figure: soft-margin separator with points outside, on, and inside the margin, marked according to the three cases above.]

Effect of C

[Figures: decision boundaries learned on the same training data with C = 1000 and with C = 1.]

Effect of kernel functions

[Figures: decision boundaries on the same training data with a linear kernel; Gaussian kernels with σ = 1, σ = 0.1, and σ = 10; and polynomial kernels with d = 1, d = 3, and d = 10.]

Effect of model parameters for Gaussian kernels

[Figures: decision boundaries on the same training data for every combination of σ² ∈ {10, 1, 0.1} and C ∈ {1, 100, 10000}.]

Effect of model parameters

[Figure: a 3 × 3 grid of decision boundaries for Gaussian kernels with different parameters – columns C = 1, 100, 10000 and rows σ² = 10, 1, 0.1.]

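To reproduce this kind of parameter sweep in practice, one might use scikit-learn's SVC; note that its RBF kernel is parameterized by gamma, which corresponds to 1/(2σ²) in the lecture's notation. A sketch with a made-up toy data set:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # toy non-linear labels

for sigma2 in (10.0, 1.0, 0.1):
    for C in (1.0, 100.0, 10000.0):
        clf = SVC(kernel='rbf', C=C, gamma=1.0 / (2.0 * sigma2))
        clf.fit(X, y)
        print(f"sigma^2={sigma2}, C={C}: "
              f"train acc={clf.score(X, y):.2f}, #SV={len(clf.support_)}")
```
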
Support Vector Regression

Support vector regression (SVR)

- Linear regression minimizes the squared error:

$$\min_{w,b}\ \sum_{i=1}^m \left(w^\top x_i + b - y_i\right)^2$$

- Another option is to minimize the absolute error:

$$\min_{w,b}\ \sum_{i=1}^m \left|w^\top x_i + b - y_i\right|$$

- The absolute error is more robust to outliers than the squared loss.
- But we should not require every point to be fit exactly – that leads to
  overfitting!

Loss function for SVR

- Intuition: we should tolerate some errors (as long as they are not
  large) so that the algorithm is more robust to noise – we only care
  about the fitting error when it is larger than $\epsilon$.
- This is analogous to the large margin intuition in SVMs (i.e., only
  support vectors matter).
- We therefore introduce the $\epsilon$-insensitive loss:

$$J = \sum_{i=1}^m V_\epsilon(x_i, y_i),$$

where

$$V_\epsilon(x_i, y_i) = \begin{cases} 0, & \text{if } |w^\top x_i + b - y_i| \le \epsilon \\ |w^\top x_i + b - y_i| - \epsilon, & \text{otherwise.} \end{cases}$$

$\epsilon$-insensitive loss

$$V_\epsilon(x_i, y_i) = \begin{cases} 0, & \text{if } |w^\top x_i + b - y_i| \le \epsilon \\ |w^\top x_i + b - y_i| - \epsilon, & \text{otherwise} \end{cases}$$

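The same loss written as a small numpy function (a sketch, vectorized over the whole training set):

```python
import numpy as np

def eps_insensitive_loss(w, b, X, y, eps):
    """Sum over the data of V_eps(x_i, y_i) = max(0, |w^T x_i + b - y_i| - eps)."""
    residuals = np.abs(X @ w + b - y)
    return np.sum(np.maximum(0.0, residuals - eps))
```
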
Effect of $\epsilon$

As $\epsilon$ increases, the function is allowed to move away from the data
points: the number of support vectors decreases, the fitted curve becomes
smoother, and the fit on the training set gets worse.

Figure credit: Andrew Zisserman

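For completeness, a scikit-learn sketch showing how the number of support vectors shrinks as epsilon grows (the toy data set is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)   # noisy sine curve

for eps in (0.01, 0.1, 0.5):
    model = SVR(kernel='rbf', C=10.0, epsilon=eps)
    model.fit(X, y)
    print(f"epsilon={eps}: #support vectors = {len(model.support_)}")
```
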
Summary

What you should know

From this lecture:

- The large margin intuition and formulation of SVMs.
- The use of Lagrange multipliers to transform optimization problems.
- The primal and dual optimization problems for SVMs.
- The kernel trick for nonlinear SVMs.
- SVMs for the non-separable case.
- Support vector regression (SVR).

SVMs summary

- Advantages
  - Maximize the margin – this regularizes model complexity and makes
    the model less sensitive to outliers.
  - Work well on small, high-dimensional data sets.
  - Use the kernel trick – produces nonlinear classifiers and lets us
    impose prior knowledge.
  - Have nice theoretical properties.
- Disadvantages
  - Computationally expensive for large data sets.
  - Choosing a good kernel function and model parameters is not easy.

