Lecture 7 - SVM
Boyu Wang
Department of Computer Science
University of Western Ontario
Intuitions
What is a support vector machine (SVM)?
▶ A supervised learning technique (mostly used for classification), e.g. f(photo) = female or f(photo) = male.
▶ The learning algorithm is trained on a set of examples (x_1, y_1), ..., (x_m, y_m), with x ∈ R^n and y ∈ {−1, 1}.
Linear Separator w^T x + b = 0
[Figure: 2-D data with x1 = height and x2 = length of hair; y = +1 (female) and y = −1 (male). Many different lines w^T x + b = 0 can split the data.]
Support vector machines (SVMs)
Problem Formulation: Margin
Illustration of the geometry of hyperplane
[Figure: the line w^T x + b = 0 in the (x1, x2) plane.]
▶ The hyperplane lies at a (signed) distance −b/||w|| from the origin along the direction of w.
▶ w^T x + b > 0 ⇒ x is above the line, so such a point is predicted as y = 1.
Illustration of the geometry of margin
[Figure: a point x at distance γ from the hyperplane w^T x + b = 0; x′ is its projection onto the hyperplane and w is the normal direction.]
▶ Moving from x against the unit normal w/||w|| by a distance γ reaches the hyperplane:
    x′ = x − γ w/||w||
▶ Since x′ lies on the hyperplane, w^T x′ + b = 0, i.e.
    w^T (x − γ w/||w||) + b = 0   ⇒   γ = (w^T x + b)/||w||
▶ For a point on the negative side of the hyperplane, γ = −(w^T x + b)/||w||.
▶ In general, γ = y (w^T x + b)/||w||.
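The formula above is easy to check numerically. Below is a minimal NumPy sketch (the toy hyperplane and points are made up for illustration) that computes the geometric margin γ_i = y_i (w^T x_i + b)/||w|| for a few points.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Geometric margin gamma_i = y_i * (w^T x_i + b) / ||w|| for each row of X."""
    return y * (X @ w + b) / np.linalg.norm(w)

# toy hyperplane x1 + x2 - 1 = 0 and two correctly classified points
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0],
              [0.0, 0.0]])
y = np.array([1.0, -1.0])

print(geometric_margin(w, b, X, y))   # both margins are positive
```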
Large margin intuition
[Figure: a separating hyperplane with margin γ to the closest point on each side.]
Large margin formulation
▶ Objective function:
    max_{w,b} γ
    s.t. y_i (w^T x_i + b)/||w|| ≥ γ,  ∀ i = 1, ..., m
[Figure: hyperplane w^T x + b = 0 with margin γ on each side.]
Large margin formulation
▶ Option 1: fix the scale of w by requiring ||w|| = 1, then maximize γ directly:
    max_{w,b} γ
    s.t. y_i (w^T x_i + b) ≥ γ,  ∀ i = 1, ..., m
         ||w|| = 1
Large margin formulation
▶ Option 2: fix the scale of w so that the closest points satisfy y_i (w^T x_i + b) = 1, i.e. the margin hyperplanes are w^T x + b = 1 and w^T x + b = −1. The margin is then γ = 1/||w||, and the problem becomes
    max_{w,b} 1/||w||
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
  Maximizing 1/||w|| is equivalent to minimizing ||w||, so this can be rewritten as
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
[Figure: hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = ±1.]
Support vector machines (SVMs)
Optimization formulation
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
▶ Maximizing the margin (= 1/||w||) ⇔ minimizing ||w||_2^2 ⇒ we prefer a simple model.
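As a quick sanity check of this formulation, the sketch below (assuming scikit-learn is available; the toy data are made up) fits a linear SVM with a very large C, which approximates the hard-margin problem on separable data, and reports the resulting w, b, and margin 1/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# two linearly separable clusters
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.5], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# a very large C approximates the hard-margin SVM above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```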
Optimization
Optimization formulation
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
This is a constrained optimization problem.
Digression: constrained optimization
Optimization with inequality constraints
    min_w f(w)
    s.t. g_i(w) ≤ 0,  ∀ i = 1, ..., m
Lagrangian primal problem
▶ Lagrangian: L(w, α) = f(w) + Σ_{i=1}^m α_i g_i(w), with multipliers α_i ≥ 0.
▶ Primal problem: min_w max_{α ≥ 0} L(w, α).
Lagrangian dual problem
▶ Dual problem: max_{α ≥ 0} min_w L(w, α).
▶ At the optimum (w*, α*), the KKT conditions hold:
    ∂/∂w L(w*, α*) = 0                    (C1)
    α_i* g_i(w*) = 0,   i = 1, ..., m     (C2)
    g_i(w*) ≤ 0,        i = 1, ..., m     (C3)
    α_i* ≥ 0,           i = 1, ..., m     (C4)
Back to SVM
Original (primal) optimization problem
    min_{w,b} (1/2)||w||_2^2
    s.t. 1 − y_i (w^T x_i + b) ≤ 0,  ∀ i = 1, ..., m
▶ Lagrangian function:
    L(w, b, α) = (1/2)||w||_2^2 + Σ_{i=1}^m α_i [1 − y_i (w^T x_i + b)]
▶ KKT conditions:
    ∂L/∂w = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
    ∂L/∂b = 0  ⇒  Σ_{i=1}^m α_i y_i = 0
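To make the dual on the next slide explicit, here is the intermediate substitution step, written in LaTeX for readability: plug w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 back into the Lagrangian above.

```latex
\begin{aligned}
L(w,b,\alpha)
 &= \tfrac{1}{2}\Big\|\sum_{i} \alpha_i y_i x_i\Big\|_2^2
    + \sum_{i} \alpha_i
    - \sum_{i} \alpha_i y_i \Big(\sum_{j} \alpha_j y_j x_j\Big)^{\!\top} x_i
    - b \sum_{i} \alpha_i y_i \\
 &= \sum_{i} \alpha_i
    - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^{\top} x_j .
\end{aligned}
```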
Back to SVM
Plugging the KKT conditions back into L gives the
Dual optimization problem
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0  and  α_i ≥ 0,  i = 1, ..., m
Having found α:
▶ By the KKT condition,
    w = Σ_{i=1}^m α_i y_i x_i
▶ If x_i is not on the margin, then y_i (w^T x_i + b) > 1, and the complementary slackness condition (C2) forces α_i = 0.
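The sparsity of α can be seen directly from a fitted model. A minimal sketch (assuming scikit-learn; the toy data are made up): scikit-learn stores α_i y_i for the support vectors only, so w = Σ_i α_i y_i x_i can be recomputed by hand and compared with the solver's w.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for the support vectors (the points with alpha_i > 0)
alpha_times_y = clf.dual_coef_[0]
w_from_dual = alpha_times_y @ clf.support_vectors_

print("support vector indices:", clf.support_)
print("w from dual:", w_from_dual)
print("w from solver:", clf.coef_[0])
```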
Two key observations from dual
▶ Sparsity: only the support vectors (the points on the margin) get α_i > 0; for all other points α_i = 0.
▶ Inner products: both the dual objective and the prediction depend on the data only through inner products x_i^T x_j, which is what enables the kernel trick.
Kernel Trick
Kernel functions
▶ A kernel function computes an inner product in some feature space, k(x_i, x_j) = φ(x_i)^T φ(x_j), without ever computing φ(x) explicitly.
Dual formulation of SVMs
Training
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0  and  α_i ≥ 0,  i = 1, ..., m
Prediction
    f(x) = sgn( Σ_{i=1}^m α_i y_i x_i^T x + b )
The kernel trick
Training
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0  and  α_i ≥ 0,  i = 1, ..., m
Prediction
    f(x) = sgn( Σ_{i=1}^m α_i y_i k(x_i, x) + b )
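A minimal sketch of the kernelized prediction rule above (pure NumPy; the kernel choice, dual variables, and data are placeholders you would obtain from training):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def predict(x, X_train, y_train, alpha, b, kernel=gaussian_kernel):
    """f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + b )."""
    s = sum(a_i * y_i * kernel(x_i, x)
            for a_i, y_i, x_i in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```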
Commonly used kernels
▶ Gaussian kernels:
    k(x_i, x_j) = exp( −||x_i − x_j||_2^2 / (2σ^2) )
▶ Polynomial kernels:
    k(x_i, x_j) = (1 + x_i^T x_j)^d
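A minimal NumPy sketch of the two kernels above as Gram-matrix computations (σ and d are the hyperparameters from the slide; the function names are my own):

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def polynomial_gram(X, Z, d=3):
    """K[i, j] = (1 + x_i^T z_j)^d."""
    return (1.0 + X @ Z.T) ** d
```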
Effect of kernel functions
[Figures: decision boundaries on a first training set with a linear kernel, a Gaussian kernel, and a polynomial kernel, and on a second training set with a linear kernel and a Gaussian kernel.]
Non-separable Case
Non-separable case
[Figure: 2-D data that no single line can separate without error.]
Introducing slack variables
[Figure: hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = ±1; points that violate their margin constraint are marked with slack variables ξ_i and ξ_j.]
Introducing slack variables (primal)
Primal optimization problem for the separable case
    min_{w,b} (1/2)||w||_2^2
    s.t. y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., m
Primal optimization problem with slack variables
    min_{w,b,ξ} (1/2)||w||_2^2 + C Σ_{i=1}^m ξ_i
    s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i,  ∀ i = 1, ..., m
         ξ_i ≥ 0,  ∀ i = 1, ..., m
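The parameter C weights the slack term: a small C tolerates many margin violations, while a large C penalizes them heavily. A minimal sketch (assuming scikit-learn; the random toy data are made up) that shows how the number of support vectors changes with C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)),
               rng.normal(1.0, 1.0, (20, 2))])   # two overlapping clusters
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: number of support vectors = {clf.support_.size}")
```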
Introducing slack variables (dual)
Dual optimization problem for the separable case
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0
         α_i ≥ 0,  i = 1, ..., m
Dual optimization problem with slack variables: the objective is unchanged; the only difference is that each α_i is now box-constrained:
    max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j (x_i^T x_j)
    s.t. Σ_{i=1}^m α_i y_i = 0
         0 ≤ α_i ≤ C,  i = 1, ..., m
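A minimal sketch (assuming SciPy; the solver choice, tolerances, and toy data are my own, and this is only to make the formulation concrete, not how production SVM solvers work) that feeds the box-constrained dual above to a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def fit_dual(X, y, C=1.0):
    """Solve the soft-margin dual for a linear kernel and recover (w, b, alpha)."""
    m = X.shape[0]
    Q = (X @ X.T) * np.outer(y, y)              # Q[i, j] = y_i y_j x_i^T x_j

    def neg_dual(a):
        # minimize the negative of the dual objective
        return 0.5 * a @ Q @ a - a.sum()

    cons = [{"type": "eq", "fun": lambda a: a @ y}]
    res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                   bounds=[(0.0, C)] * m, constraints=cons)
    alpha = res.x
    w = (alpha * y) @ X
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)   # margin support vectors
    b = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
    return w, b, alpha

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = fit_dual(X, y, C=10.0)
print("w =", w, "b =", b, "alpha =", np.round(alpha, 3))
```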
Introducing slack variables (dual)
[Figure: the soft-margin solution illustrated on 2-D data in the (x1, x2) plane.]
Effect of C
[Figures: decision boundaries on the same training data with C = 1000 and with C = 1.]
Effect of kernel functions
[Figures: decision boundaries on the same training data with a linear kernel, Gaussian kernels with σ = 1 and σ = 10, and polynomial kernels with d = 1, 3, and 10.]
Effect of model parameters for Gaussian kernels
[Figures: decision boundaries for Gaussian kernels on the same training data over the grid σ² ∈ {10, 1, 0.1} and C ∈ {1, 100, 10000}.]
Effect of model parameters
[Figure: 3 × 3 grid of decision boundaries for Gaussian kernels with C ∈ {1, 100, 10000} (columns) and σ² ∈ {10, 1, 0.1} (rows). Caption: Gaussian kernels with different parameters.]
Support Vector Regression
Support vector regression (SVR)
▶ The same large-margin machinery can be used for regression, where the targets y are real-valued rather than labels in {−1, 1}.
Loss function for SVR
▶ Intuition: we should allow some errors (as long as they are not large) so that the algorithm is more robust to noise – we only care about the fitting error when it is larger than ε:
    min_{w,b} (1/2)||w||_2^2 + C Σ_{i=1}^m V(x_i, y_i)
  where
    V(x_i, y_i) = 0,                          if |w^T x_i + b − y_i| ≤ ε
    V(x_i, y_i) = |w^T x_i + b − y_i| − ε,    otherwise
ε-insensitive loss
    V(x_i, y_i) = 0,                          if |w^T x_i + b − y_i| ≤ ε
    V(x_i, y_i) = |w^T x_i + b − y_i| − ε,    otherwise
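A minimal NumPy sketch of the ε-insensitive loss above (vectorized over a batch of predictions):

```python
import numpy as np

def eps_insensitive_loss(y_pred, y_true, eps=0.1):
    """Zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_pred - y_true) - eps)

print(eps_insensitive_loss(np.array([0.0, 0.5, 2.0]),
                           np.array([0.05, 0.0, 0.0]),
                           eps=0.1))   # -> [0.  0.4 1.9]
```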
Effect of ε
As ε increases, the fitted function is allowed to move farther away from the data points, the number of support vectors decreases, the fitted curve becomes smoother, and the fit on the training set gets worse.
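This behaviour is easy to reproduce. A minimal sketch (assuming scikit-learn; the noisy sine data are made up) that counts support vectors as ε grows:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, (60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 60)

for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", epsilon=eps).fit(X, y)
    print(f"epsilon = {eps}: number of support vectors = {svr.support_.size}")
```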
Summary
What you should know
SVMs Summary
▶ Advantages
▶ Disadvantages