Data Mining and Machine Learning: Fundamental Concepts and Algorithms
Chapter 21: Support Vector Machines

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Hyperplanes
Let $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a classification dataset, with $n$ points in a $d$-dimensional space. We assume that there are only two class labels, that is, $y_i \in \{+1, -1\}$, denoting the positive and negative classes.

A hyperplane in $d$ dimensions is given as the set of all points $\mathbf{x} \in \mathbb{R}^d$ that satisfy the equation $h(\mathbf{x}) = 0$, where $h(\mathbf{x})$ is the hyperplane function:

$$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$

Here, $\mathbf{w}$ is a $d$-dimensional weight vector and $b$ is a scalar, called the bias. For points that lie on the hyperplane, we have

$$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = 0$$

The weight vector $\mathbf{w}$ specifies the direction that is orthogonal or normal to the hyperplane, which fixes the orientation of the hyperplane, whereas the bias $b$ fixes the offset of the hyperplane in the $d$-dimensional space, i.e., where the hyperplane intersects each of the axes:

$$w_i x_i = -b \quad \text{or} \quad x_i = \frac{-b}{w_i}$$
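As a concrete illustration, here is a minimal NumPy sketch that evaluates the hyperplane function and the axis intercepts; the weight vector and bias below are illustrative values chosen for this sketch, not taken from the text.

```python
import numpy as np

def h(x, w, b):
    """Hyperplane function h(x) = w^T x + b."""
    return w @ x + b

# Illustrative (assumed) 2D hyperplane: x1 + 2*x2 - 4 = 0
w = np.array([1.0, 2.0])
b = -4.0

print(h(np.array([4.0, 0.0]), w, b))  # 0.0  -> the point lies on the hyperplane
print(h(np.array([1.0, 1.0]), w, b))  # -1.0 -> the point lies off the hyperplane
print(-b / w)                         # axis intercepts: [4. 2.]
```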
Separating Hyperplane

[Figure: an example of a separating hyperplane in two dimensions, with the positive and negative points on either side.]
Geometry of a Hyperplane: Distance
Consider a point $\mathbf{x} \in \mathbb{R}^d$ that does not lie on the hyperplane. Let $\mathbf{x}_p$ be the orthogonal projection of $\mathbf{x}$ on the hyperplane, and let $\mathbf{r} = \mathbf{x} - \mathbf{x}_p$. Then we can write $\mathbf{x}$ as

$$\mathbf{x} = \mathbf{x}_p + \mathbf{r} = \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|}$$

where $r$ is the directed distance of $\mathbf{x}$ from the hyperplane. Substituting into $h$ and using $h(\mathbf{x}_p) = 0$ gives $r = \frac{h(\mathbf{x})}{\|\mathbf{w}\|}$. Multiplying by the class label makes the distance non-negative for correctly classified points, so the distance of a point $\mathbf{x}_i$ from the hyperplane is

$$\delta_i = \frac{y_i \, h(\mathbf{x}_i)}{\|\mathbf{w}\|}$$
Geometry of a Hyperplane in 2D
Consider the hyperplane (a line in 2D) that passes through the points $\mathbf{p} = (4, 0)^T$ and $\mathbf{q} = (2, 5)^T$. Since a line $w_1 x_1 + w_2 x_2 + b = 0$ has slope $-\frac{w_1}{w_2}$, we have

$$-\frac{w_1}{w_2} = \frac{q_2 - p_2}{q_1 - p_1} = \frac{5 - 0}{2 - 4} = -\frac{5}{2}$$

so we may take $\mathbf{w} = (5, 2)^T$. Given the point $(4, 0)^T$ on the line, the offset $b$ is:

$$b = -5 x_1 - 2 x_2 = -5 \cdot 4 - 2 \cdot 0 = -20$$

Given $\mathbf{w} = \begin{pmatrix} 5 \\ 2 \end{pmatrix}$ and $b = -20$:

$$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \begin{pmatrix} 5 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - 20 = 0$$

The directed distance of the origin (which belongs to the negative class, $y = -1$) from the hyperplane is

$$\delta = y \, r = -1 \cdot \frac{b}{\|\mathbf{w}\|} = \frac{-(-20)}{\sqrt{29}} = 3.71$$
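A quick NumPy check of the example above (a sketch; the values $\mathbf{w} = (5, 2)^T$, $b = -20$, and the negatively labeled origin are taken from the example):

```python
import numpy as np

w, b = np.array([5.0, 2.0]), -20.0

def h(x):
    return w @ x + b

def directed_distance(x, y):
    """delta = y * h(x) / ||w||"""
    return y * h(x) / np.linalg.norm(w)

# The origin belongs to the negative class (y = -1) in this example.
print(directed_distance(np.array([0.0, 0.0]), -1))  # ~3.714, i.e., 20 / sqrt(29)
```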
Margin and Support Vectors
The distance of a point $\mathbf{x}$ from the hyperplane $h(\mathbf{x}) = 0$ is thus given as

$$\delta = y \, r = \frac{y \, h(\mathbf{x})}{\|\mathbf{w}\|}$$

The margin is the minimum distance of a point from the separating hyperplane:

$$\delta^* = \min_{\mathbf{x}_i} \left\{ \frac{y_i (\mathbf{w}^T \mathbf{x}_i + b)}{\|\mathbf{w}\|} \right\}$$

All the points (or vectors) that achieve the minimum distance are called support vectors for the hyperplane. They satisfy the condition:

$$\delta^* = \frac{y^* (\mathbf{w}^T \mathbf{x}^* + b)}{\|\mathbf{w}\|}$$

where $y^*$ is the class label of the support vector $\mathbf{x}^*$. Because rescaling $\mathbf{w}$ and $b$ does not change the hyperplane, we can choose the canonical hyperplane, i.e., the scaling for which the support vectors satisfy $y^* (\mathbf{w}^T \mathbf{x}^* + b) = 1$, so that the margin becomes

$$\delta^* = \frac{y^* (\mathbf{w}^T \mathbf{x}^* + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$$

For the canonical hyperplane, for each support vector $\mathbf{x}_i^*$ (with label $y_i^*$), we have $y_i^* h(\mathbf{x}_i^*) = 1$, and for any point that is not a support vector we have $y_i h(\mathbf{x}_i) > 1$. Over all points, we have

$$y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad \text{for all points } \mathbf{x}_i \in D$$
Separating Hyperplane: Margin and Support Vectors
Shaded points are support vectors
Consider the hyperplane $5 x_1 + 2 x_2 - 20 = 0$ from the earlier example, with the support vector $\mathbf{x}^* = (2, 2)^T$ and $y^* = -1$. To obtain the canonical hyperplane, we rescale by

$$s = \frac{1}{y^* h(\mathbf{x}^*)} = \frac{1}{-1 \left( \begin{pmatrix} 5 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ 2 \end{pmatrix} - 20 \right)} = \frac{1}{6}$$

which gives

$$\mathbf{w} = \frac{1}{6} \begin{pmatrix} 5 \\ 2 \end{pmatrix} = \begin{pmatrix} 5/6 \\ 2/6 \end{pmatrix} \qquad b = \frac{-20}{6}$$

$$h(\mathbf{x}) = \begin{pmatrix} 5/6 \\ 2/6 \end{pmatrix}^T \mathbf{x} - \frac{20}{6} = \begin{pmatrix} 0.833 \\ 0.333 \end{pmatrix}^T \mathbf{x} - 3.33$$

The margin of the canonical hyperplane is

$$\delta^* = \frac{y^* h(\mathbf{x}^*)}{\|\mathbf{w}\|} = \frac{1}{\sqrt{\left(\frac{5}{6}\right)^2 + \left(\frac{2}{6}\right)^2}} = \frac{6}{\sqrt{29}} = 1.114$$
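A sketch of this rescaling in NumPy, using the support vector $\mathbf{x}^* = (2, 2)^T$ with $y^* = -1$ from the example:

```python
import numpy as np

w, b = np.array([5.0, 2.0]), -20.0
x_star, y_star = np.array([2.0, 2.0]), -1

s = 1.0 / (y_star * (w @ x_star + b))   # scaling factor: 1/6
w_c, b_c = s * w, s * b                 # canonical hyperplane: ~(0.833, 0.333), ~-3.33

print(w_c, b_c)
print(1.0 / np.linalg.norm(w_c))        # margin ~1.114 = 6 / sqrt(29)
```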
SVM: Linear and Separable Case
Assume that the points are linearly separable, that is, there exists a separating
hyperplane that perfectly classifies each point.
The goal of SVMs is to choose the canonical hyperplane, $h^*$, that yields the maximum margin among all possible separating hyperplanes:

$$h^* = \arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\|\mathbf{w}\|} \right\}$$

Maximizing $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimizing $\frac{\|\mathbf{w}\|^2}{2}$, which gives the following optimization problem:

Objective Function: $\min_{\mathbf{w}, b} \left\{ \frac{\|\mathbf{w}\|^2}{2} \right\}$

Linear Constraints: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \ \forall \mathbf{x}_i \in D$
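This is a small quadratic program. As a hedged sketch (not the training algorithm developed later in the chapter), it can be handed directly to a general-purpose solver such as scipy.optimize.minimize, with one margin constraint per point:

```python
import numpy as np
from scipy.optimize import minimize

def fit_hard_margin_svm(X, y):
    """Minimize ||w||^2 / 2 subject to y_i (w^T x_i + b) >= 1 for all i.
    The variables are packed as theta = (w_1, ..., w_d, b)."""
    n, d = X.shape
    objective = lambda theta: 0.5 * theta[:d] @ theta[:d]
    constraints = [
        {"type": "ineq",
         "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:d] + theta[d]) - 1.0}
        for i in range(n)
    ]
    res = minimize(objective, np.zeros(d + 1), method="SLSQP", constraints=constraints)
    return res.x[:d], res.x[d]   # returns (w, b)
```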
SVM: Linear and Separable Case
We turn the constrained SVM optimization into an unconstrained one by introducing a
Lagrange multiplier αi for each constraint. The new objective function, called the
Lagrangian, then becomes
$$\min_{\mathbf{w}, b} \ L = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)$$

Setting the partial derivatives of $L$ with respect to $\mathbf{w}$ and $b$ to zero, we can see that $\mathbf{w}$ can be expressed as a linear combination of the data points $\mathbf{x}_i$, with the signed Lagrange multipliers, $\alpha_i y_i$, serving as the coefficients. Further, the sum of the signed Lagrange multipliers, $\alpha_i y_i$, must be zero.
SVM: Linear and Separable Case
Incorporating $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$ into the Lagrangian, we obtain the new dual Lagrangian objective function, which is specified purely in terms of the Lagrange multipliers:

Objective Function: $\max_{\boldsymbol{\alpha}} \ L_{dual} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$

Linear Constraints: $\alpha_i \ge 0, \ \forall i = 1, \ldots, n$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$
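A minimal sketch that evaluates $L_{dual}$ for a given $\boldsymbol{\alpha}$ (a direct transcription of the formula; how to find the maximizing $\boldsymbol{\alpha}$ is addressed by the training algorithms later in the chapter):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """L_dual = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j"""
    Y = y[:, None] * X                # rows y_i * x_i
    G = Y @ Y.T                       # G_ij = y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha
```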
SVM: Linear and Separable Case
Once we have obtained the αi values for i = 1, . . . , n, we can solve for the weight
vector w and the bias b. Each of the Lagrange multipliers αi satisfies the KKT
conditions at the optimal solution:
$$\alpha_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) = 0$$
Linear and Separable Case: Weight Vector and Bias
Once we know αi for all points, we can compute the weight vector w by taking
the summation only for the support vectors:
$$\mathbf{w} = \sum_{i, \alpha_i > 0} \alpha_i y_i \mathbf{x}_i$$

The bias can be obtained from any support vector via the KKT condition $y_i (\mathbf{w}^T \mathbf{x}_i + b) = 1$, that is, $b_i = y_i - \mathbf{w}^T \mathbf{x}_i$; in practice, $b$ is taken as the average over all support vectors, $b = \text{avg}\{b_i\}$.
SVM Classifier
Given the optimal hyperplane function $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$, for any new point $\mathbf{z}$ we predict its class as

$$\hat{y} = \text{sign}(h(\mathbf{z})) = \text{sign}(\mathbf{w}^T \mathbf{z} + b)$$

where the $\text{sign}(\cdot)$ function returns $+1$ if its argument is positive, and $-1$ if its argument is negative.
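Putting the last few slides together, a sketch that recovers $\mathbf{w}$ and $b$ from the learned $\alpha_i$ values and then classifies a new point ($b$ is averaged over the support vectors, as in the example that follows):

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    sv = alpha > tol                      # support vectors have alpha_i > 0
    w = (alpha[sv] * y[sv]) @ X[sv]       # w = sum_i alpha_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)        # b_i = y_i - w^T x_i, averaged over support vectors
    return w, b

def predict(z, w, b):
    return np.sign(w @ z + b)             # yhat = sign(h(z))
```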
Example Dataset: Separable Case
x_i    x_i1   x_i2   y_i
x_1    3.5    4.25   +1
x_2    4      3      +1
x_3    4      4      +1
x_4    4.5    1.75   +1
x_5    4.9    4.5    +1
x_6    5      4      +1
x_7    5.5    2.5    +1
x_8    5.5    3.5    +1
x_9    0.5    1.5    -1
x_10   1      2.5    -1
x_11   1.25   0.5    -1
x_12   1.5    1.5    -1
x_13   2      2      -1
x_14   2.5    0.75   -1
Optimal Separating Hyperplane
Solving the dual for the separable dataset yields $\alpha_i > 0$ only for the support vectors; for example, $\alpha_{13} = 0.3589$ for $\mathbf{x}_{13} = (2, 2)^T$ and $\alpha_{14} = 0.0437$ for $\mathbf{x}_{14} = (2.5, 0.75)^T$.

The weight vector and bias are:

$$\mathbf{w} = \sum_{i, \alpha_i > 0} \alpha_i y_i \mathbf{x}_i = \begin{pmatrix} 0.833 \\ 0.334 \end{pmatrix} \qquad b = \text{avg}\{b_i\} = -3.332$$

The optimal hyperplane is given as follows:

$$h(\mathbf{x}) = \begin{pmatrix} 0.833 \\ 0.334 \end{pmatrix}^T \mathbf{x} - 3.332 = 0$$
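As a cross-check, here is a sketch that fits a linear SVM on the 14-point dataset with scikit-learn (an assumption made here: a very large $C$ is used to approximate the hard-margin setting); it should recover roughly $\mathbf{w} = (0.833, 0.334)^T$ and $b = -3.332$.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.5, 4.25], [4, 3], [4, 4], [4.5, 1.75], [4.9, 4.5], [5, 4],
              [5.5, 2.5], [5.5, 3.5], [0.5, 1.5], [1, 2.5], [1.25, 0.5],
              [1.5, 1.5], [2, 2], [2.5, 0.75]])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin
print(clf.coef_, clf.intercept_)              # expect roughly [[0.833, 0.334]] and [-3.332]
```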
Soft Margin SVM: Linear and Nonseparable Case
In the linearly nonseparable case, the separability constraints are relaxed by introducing a slack variable for each point:

$$y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$$

where $\xi_i \ge 0$ is the slack variable for point $\mathbf{x}_i$, which indicates how much the point violates the separability condition, that is, the point may no longer be at least $1/\|\mathbf{w}\|$ away from the hyperplane.

The slack values indicate three types of points. If $\xi_i = 0$, then the corresponding point $\mathbf{x}_i$ is at least $\frac{1}{\|\mathbf{w}\|}$ away from the hyperplane.

If $0 < \xi_i < 1$, then the point is within the margin but still correctly classified, that is, it is on the correct side of the hyperplane.

However, if $\xi_i \ge 1$, then the point is misclassified and appears on the wrong side of the hyperplane.
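A small sketch that computes each point's slack for a given hyperplane and reports which of the three cases it falls into:

```python
import numpy as np

def slack_report(X, y, w, b):
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = max(0, 1 - y_i (w^T x_i + b))
    for i, s in enumerate(xi, start=1):
        if s == 0:
            status = "on or beyond the margin, correctly classified"
        elif s < 1:
            status = "within the margin, still correctly classified"
        else:
            status = "misclassified (wrong side of the hyperplane)"
        print(f"x_{i}: slack = {s:.3f} -> {status}")
```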
Soft Margin Hyperplane
Shaded points are the support vectors
SVM: Soft Margin or Linearly Non-separable Case
In the nonseparable case, also called the soft margin case, the SVM objective function is

Objective Function: $\min_{\mathbf{w}, b, \xi_i} \left\{ \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{n} (\xi_i)^k \right\}$

Linear Constraints: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \ \forall \mathbf{x}_i \in D$, and $\xi_i \ge 0, \ \forall \mathbf{x}_i \in D$

where $C$ controls the trade-off between margin width and slack, and $k$ selects the loss (hinge loss for $k = 1$, quadratic loss for $k = 2$). For hinge loss, the dual objective is

$$\max_{\boldsymbol{\alpha}} \ L_{dual} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$

whereas for quadratic loss the dual objective becomes

$$\max_{\boldsymbol{\alpha}} \ L_{dual} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left( \mathbf{x}_i^T \mathbf{x}_j + \frac{1}{2C} \delta_{ij} \right)$$

where $\delta_{ij}$ is the Kronecker delta ($\delta_{ij} = 1$ if $i = j$, and $0$ otherwise).
Example Dataset: Linearly Non-separable Case

The dataset below adds four new points, x_15 to x_18, to the 14 points of the separable example:

x_i    x_i1   x_i2   y_i
x_1    3.5    4.25   +1
x_2    4      3      +1
x_3    4      4      +1
x_4    4.5    1.75   +1
x_5    4.9    4.5    +1
x_6    5      4      +1
x_7    5.5    2.5    +1
x_8    5.5    3.5    +1
x_9    0.5    1.5    -1
x_10   1      2.5    -1
x_11   1.25   0.5    -1
x_12   1.5    1.5    -1
x_13   2      2      -1
x_14   2.5    0.75   -1
x_15   4      2      +1
x_16   2      3      +1
x_17   3      2      -1
x_18   5      3      -1
Example Dataset: Linearly Non-separable Case
Let k = 1 and C = 1; solving $L_{dual}$ then yields the following support vectors and Lagrange multiplier values $\alpha_i$:
x_i    x_i1   x_i2   y_i   α_i
x_1    3.5    4.25   +1    0.0271
x_2    4      3      +1    0.2162
x_4    4.5    1.75   +1    0.9928
x_13   2      2      -1    0.9928
x_14   2.5    0.75   -1    0.2434
x_15   4      2      +1    1
x_16   2      3      +1    1
x_17   3      2      -1    1
x_18   5      3      -1    1
Example Dataset: Linearly Non-separable Case
The slack $\xi_i = 0$ for all points that are not support vectors, and also for those support vectors that are on the margin. Slack is positive only for the remaining support vectors, and it can be computed as $\xi_i = 1 - y_i (\mathbf{w}^T \mathbf{x}_i + b)$.

Thus, for all support vectors not on the margin, we have

x_i    w^T x_i   w^T x_i + b   ξ_i = 1 − y_i (w^T x_i + b)
x_15   4.001     0.667         0.333
x_16   2.667     -0.667        1.667
x_17   3.167     -0.167        0.833
x_18   5.168     1.834         2.834

The slack variable $\xi_i > 1$ for those points that are misclassified (i.e., are on the wrong side of the hyperplane), namely $\mathbf{x}_{16} = (2, 3)^T$ and $\mathbf{x}_{18} = (5, 3)^T$. The other two points are correctly classified, but lie within the margin, and thus satisfy $0 < \xi_i < 1$.
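A sketch that reconstructs $\mathbf{w}$ and $b$ from the $\alpha_i$ values in the previous table ($b$ is averaged over the support vectors that lie exactly on the margin, i.e., those with $0 < \alpha_i < C$) and then recomputes the slack values above:

```python
import numpy as np

# Support vectors and alpha values from the table (hinge loss, C = 1).
X_sv = np.array([[3.5, 4.25], [4, 3], [4.5, 1.75], [2, 2], [2.5, 0.75],
                 [4, 2], [2, 3], [3, 2], [5, 3]])
y_sv = np.array([1, 1, 1, -1, -1, 1, 1, -1, -1])
alpha = np.array([0.0271, 0.2162, 0.9928, 0.9928, 0.2434, 1, 1, 1, 1])
C = 1.0

w = (alpha * y_sv) @ X_sv                            # w = sum_i alpha_i y_i x_i
on_margin = (alpha > 1e-8) & (alpha < C - 1e-8)      # margin support vectors have xi_i = 0
b = np.mean(y_sv[on_margin] - X_sv[on_margin] @ w)

xi = np.maximum(0.0, 1.0 - y_sv * (X_sv @ w + b))
print(w, b)          # roughly (0.833, 0.333) and -3.33
print(xi[-4:])       # slacks for x_15 ... x_18, roughly matching the table
```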
Kernel SVM: Nonlinear Case
The linear SVM approach can be used for datasets with a nonlinear decision
boundary via the kernel trick.
Conceptually, the idea is to map the original d-dimensional points $\mathbf{x}_i$ in the input space to points $\phi(\mathbf{x}_i)$ in a high-dimensional feature space via some nonlinear transformation $\phi$.

Given the extra flexibility, it is more likely that the points $\phi(\mathbf{x}_i)$ might be linearly separable in the feature space.
A linear decision surface in feature space actually corresponds to a nonlinear
decision surface in the input space.
Further, the kernel trick allows us to carry out all operations via the kernel function
computed in input space, rather than having to map the points into feature space.
Nonlinear SVM
There is no linear classifier that can discriminate between the points. However,
there exists a perfect quadratic classifier that can separate the two classes.
$$\phi(\mathbf{x}) = \left( \sqrt{2}\, x_1, \ \sqrt{2}\, x_2, \ x_1^2, \ x_2^2, \ \sqrt{2}\, x_1 x_2 \right)^T$$
[Figure: scatter plot of the two classes in the input space.]
Nonlinear SVMs: Kernel Trick
To apply the kernel trick for nonlinear SVM classification, we have to show that
all operations require only the kernel function:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

Applying $\phi$ to each point, we can obtain the new dataset in the feature space, $D_\phi = \{ (\phi(\mathbf{x}_i), y_i) \}_{i=1}^{n}$.

The SVM objective function in feature space is given as

Objective Function: $\min_{\mathbf{w}, b, \xi_i} \left\{ \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{n} (\xi_i)^k \right\}$

where $\mathbf{w}$ is the weight vector, $b$ is the bias, and $\xi_i$ are the slack variables, all in feature space.
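A sketch verifying the kernel trick for the inhomogeneous quadratic kernel used later in the chapter: computing $K(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^T \mathbf{z})^2$ in input space gives the same value as the dot product $\phi(\mathbf{x})^T \phi(\mathbf{z})$ in the 6-dimensional feature space.

```python
import numpy as np

def K(x, z):
    """Inhomogeneous quadratic kernel, computed entirely in input space."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map for the same kernel (2D input -> 6D feature space)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([4.0, 1.0])
print(K(x, z), phi(x) @ phi(z))   # both print 49.0 (up to floating point)
```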
Nonlinear SVMs: Weight Vector and Bias
We cannot directly obtain the weight vector without transforming the points, since

$$\mathbf{w} = \sum_{\alpha_i > 0} \alpha_i y_i \phi(\mathbf{x}_i)$$

However, we can still classify a new point $\mathbf{z}$ using only kernel evaluations:

$$\hat{y} = \text{sign}(\mathbf{w}^T \phi(\mathbf{z}) + b) = \text{sign}\left( \sum_{\alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{z}) + b \right)$$

All SVM operations can be carried out in terms of the kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. Thus, any nonlinear kernel function can be used to do nonlinear classification in the input space.
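A sketch of kernel-based prediction: only the support vectors, their $\alpha_i$ and $y_i$ values, the bias, and the kernel function are needed, never $\mathbf{w}$ or $\phi$ explicitly.

```python
import numpy as np

def predict_kernel(z, X_sv, y_sv, alpha_sv, b, kernel):
    """yhat = sign( sum_i alpha_i y_i K(x_i, z) + b ), summed over the support vectors."""
    k = np.array([kernel(x_i, z) for x_i in X_sv])
    return np.sign((alpha_sv * y_sv) @ k + b)

# Example usage with the quadratic kernel from the previous sketch:
# yhat = predict_kernel(z, X_sv, y_sv, alpha_sv, b, kernel=lambda x, z: (1 + x @ z) ** 2)
```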
Nonlinear SVM: Inhomogeneous Quadratic Kernel
For the inhomogeneous quadratic kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$, the corresponding feature map is $\phi(\mathbf{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2)^T$. The support vectors and Lagrange multipliers are:

x_i   (x_i1, x_i2)^T   φ(x_i)                                 y_i   α_i
x_1   (1, 2)^T         (1, 1.41, 2.83, 1, 4, 2.83)^T          +1    0.6198
x_2   (4, 1)^T         (1, 5.66, 1.41, 16, 1, 5.66)^T         +1    2.069
x_3   (6, 4.5)^T       (1, 8.49, 6.36, 36, 20.25, 38.18)^T    +1    3.803
x_4   (7, 2)^T         (1, 9.90, 2.83, 49, 4, 19.80)^T        +1    0.3182
x_5   (4, 4)^T         (1, 5.66, 5.66, 16, 16, 15.91)^T       -1    2.9598
x_6   (6, 3)^T         (1, 8.49, 4.24, 36, 9, 25.46)^T        -1    3.8502
SVM Training Algorithms
Instead of dealing explicitly with the bias $b$, we map each point $\mathbf{x}_i \in \mathbb{R}^d$ to the point $\mathbf{x}_i' \in \mathbb{R}^{d+1}$ by appending a constant value 1 as the $(d+1)$th coordinate, $\mathbf{x}_i' = (x_{i1}, \ldots, x_{id}, 1)^T$, and fold the bias into the weight vector:

$$\mathbf{w} = (w_1, \ldots, w_d, b)^T$$

$$h(\mathbf{x}') : \ \mathbf{w}^T \mathbf{x}' = w_1 x_{i1} + \cdots + w_d x_{id} + b = 0$$

After the mapping, the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$ does not apply in the SVM dual formulations. The new set of constraints is given as

$$y_i \, \mathbf{w}^T \mathbf{x}_i' \ge 1 - \xi_i$$
Dual Optimization: Gradient Ascent
The dual optimization objective for hinge loss is given as
$$\max_{\boldsymbol{\alpha}} \ J(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
Stochastic Gradient Ascent
Starting from an initial $\boldsymbol{\alpha}$, the gradient ascent approach successively updates it by moving in the direction of the gradient $\nabla J(\boldsymbol{\alpha})$:

$$\boldsymbol{\alpha}_{t+1} = \boldsymbol{\alpha}_t + \eta_t \, \nabla J(\boldsymbol{\alpha}_t)$$

where $\boldsymbol{\alpha}_t$ is the estimate at the $t$th step, and $\eta_t$ is the step size.

Instead of updating the entire $\boldsymbol{\alpha}$ vector in each step, in the stochastic gradient ascent approach we update each component $\alpha_k$ independently and immediately use the new value to update the other components. The update rule for the $k$th component is given as

$$\alpha_k = \alpha_k + \eta_k \, \frac{\partial J(\boldsymbol{\alpha})}{\partial \alpha_k} = \alpha_k + \eta_k \left( 1 - y_k \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_k) \right)$$

The optimal step size for the $k$th component is

$$\eta_k = \frac{1}{K(\mathbf{x}_k, \mathbf{x}_k)}$$
Algorithm SVM-Dual
SVM-Dual (D, K, C, ε):
1   foreach x_i ∈ D do x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
2   if loss = hinge then
3       K ← {K(x_i, x_j)}_{i,j=1,...,n}                            // kernel matrix, hinge loss
4   else if loss = quadratic then
5       K ← {K(x_i, x_j) + (1/2C) δ_ij}_{i,j=1,...,n}              // kernel matrix, quadratic loss
6   for k = 1, ..., n do η_k ← 1 / K(x_k, x_k)                     // set step sizes
7   t ← 0
8   α_0 ← (0, ..., 0)^T
9   repeat
10      α ← α_t
11      for k = 1 to n do
            // update kth component of α
12          α_k ← α_k + η_k ( 1 − y_k Σ_{i=1}^{n} α_i y_i K(x_i, x_k) )
13          if α_k < 0 then α_k ← 0
14          if α_k > C then α_k ← C
15      α_{t+1} ← α
16      t ← t + 1
17  until ‖α_t − α_{t−1}‖ ≤ ε
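A hedged Python sketch of SVM-Dual (stochastic gradient ascent on the dual, with the bias folded into the weight vector by appending a 1 to each point); the kernel function, C, and the convergence threshold are supplied by the caller.

```python
import numpy as np

def svm_dual(X, y, kernel, C=1.0, loss="hinge", eps=1e-4, max_iter=1000):
    """Sketch of the SVM-Dual algorithm: returns the Lagrange multipliers alpha."""
    n = X.shape[0]
    Xa = np.hstack([X, np.ones((n, 1))])                 # map x_i -> (x_i, 1) in R^{d+1}
    K = np.array([[kernel(Xa[i], Xa[j]) for j in range(n)] for i in range(n)])
    if loss == "quadratic":
        K = K + np.eye(n) / (2.0 * C)                    # add (1/2C) * delta_ij
    eta = 1.0 / np.diag(K)                               # step sizes eta_k = 1 / K(x_k, x_k)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        for k in range(n):
            alpha[k] += eta[k] * (1.0 - y[k] * np.sum(alpha * y * K[:, k]))
            alpha[k] = min(max(alpha[k], 0.0), C)        # clip to [0, C], as in the listing
        if np.linalg.norm(alpha - alpha_prev) <= eps:
            break
    return alpha

# Example usage with a linear kernel:
# alpha = svm_dual(X, y, kernel=lambda a, b: a @ b, C=1.0)
```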
SVM Dual Algorithm: Iris Data – Linear Kernel
c1 : Iris-setosa (circles) and c2 : other types of Iris flowers (triangles)
[Figure: Iris data plotted on attributes X1 and X2, with the two separating hyperplanes h10 and h1000.]
Hyperplane h10 uses C = 10 and h1000 uses C = 1000:
h10 has a larger margin, but also a larger slack; h1000 has a smaller margin, but it
minimizes the slack.
SVM Dual Algorithm: Quadratic versus Linear Kernel
c1 : Iris-versicolor (circles) and c2 : other types of Iris flowers (triangles)
[Figure: Iris data plotted on the principal components u1 and u2, with the quadratic-kernel hyperplane hq and the linear-kernel hyperplane hl.]
Primal Solution: Newton Optimization
Consider the primal optimization function for soft margin SVMs. With $\mathbf{w}, \mathbf{x}_i \in \mathbb{R}^{d+1}$, we have to minimize the objective function:

$$\min_{\mathbf{w}} \ J(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i)^k$$
Primal Solution: Newton Optimization, Quadratic Loss
The objective function can be rewritten as

$$J(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \max\left( 0, \ 1 - y_i (\mathbf{w}^T \mathbf{x}_i) \right)^k = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{y_i (\mathbf{w}^T \mathbf{x}_i) < 1} \left( 1 - y_i (\mathbf{w}^T \mathbf{x}_i) \right)^k$$

For quadratic loss, we have $k = 2$, and the gradient or rate of change of the objective function at $\mathbf{w}$ is given as the partial derivative of $J(\mathbf{w})$ with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}} = \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{w} - 2C \, \mathbf{v} + 2C \, \mathbf{S} \mathbf{w}$$

where the vector $\mathbf{v}$ and the matrix $\mathbf{S}$ are given as

$$\mathbf{v} = \sum_{y_i (\mathbf{w}^T \mathbf{x}_i) < 1} y_i \mathbf{x}_i \qquad \mathbf{S} = \sum_{y_i (\mathbf{w}^T \mathbf{x}_i) < 1} \mathbf{x}_i \mathbf{x}_i^T$$
Primal Solution: Newton Optimization, Quadratic Loss
The Hessian of the objective is

$$\mathbf{H}_{\mathbf{w}} = \frac{\partial \nabla_{\mathbf{w}}}{\partial \mathbf{w}} = \mathbf{I} + 2C \, \mathbf{S}$$

Because we want to minimize the objective function $J(\mathbf{w})$, we should move in the direction opposite to the gradient. The Newton optimization update rule for $\mathbf{w}$ is given as

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \, \mathbf{H}_{\mathbf{w}_t}^{-1} \nabla_{\mathbf{w}_t}$$
Primal SVM Algorithm
SVM-Primal (D, C, ε):
1   foreach x_i ∈ D do
2       x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
3   t ← 0
4   w_0 ← (0, ..., 0)^T                             // initialize w_t ∈ R^{d+1}
5   repeat
6       v ← Σ_{y_i (w_t^T x_i) < 1} y_i x_i
7       S ← Σ_{y_i (w_t^T x_i) < 1} x_i x_i^T
8       ∇ ← (I + 2C S) w_t − 2C v                   // gradient
9       H ← I + 2C S                                // Hessian
10      w_{t+1} ← w_t − η_t H^{-1} ∇                // Newton update rule
11      t ← t + 1
12  until ‖w_t − w_{t−1}‖ ≤ ε
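A hedged Python sketch of the primal Newton iteration with quadratic loss (bias folded into $\mathbf{w}$; the step size $\eta_t$ is fixed to 1 here, which is an assumption made for simplicity rather than part of the listing):

```python
import numpy as np

def svm_primal_quadratic(X, y, C=1.0, eps=1e-6, eta=1.0, max_iter=100):
    """Newton descent on J(w) = ||w||^2 / 2 + C * sum max(0, 1 - y_i w^T x_i)^2 (sketch)."""
    n, d = X.shape
    Xa = np.hstack([X, np.ones((n, 1))])              # map x_i -> (x_i, 1) in R^{d+1}
    w = np.zeros(d + 1)
    for _ in range(max_iter):
        viol = y * (Xa @ w) < 1                       # points with y_i (w^T x_i) < 1
        Xv, yv = Xa[viol], y[viol]
        v = (yv[:, None] * Xv).sum(axis=0)            # v = sum y_i x_i over violators
        S = Xv.T @ Xv                                 # S = sum x_i x_i^T over violators
        grad = (np.eye(d + 1) + 2 * C * S) @ w - 2 * C * v
        H = np.eye(d + 1) + 2 * C * S                 # Hessian
        w_new = w - eta * np.linalg.solve(H, grad)    # Newton update
        if np.linalg.norm(w_new - w) <= eps:
            return w_new                              # last entry of w is the bias b
        w = w_new
    return w
```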
[Figure: Iris data plotted on attributes X1 and X2, with the hyperplanes hd (dual solution) and hp (primal solution).]
SVM Primal Kernel Algorithm: Newton Optimization
The linear soft margin primal algorithm, with quadratic loss, can easily be
extended to work on any kernel matrix K :
SVM-Primal-Kernel (D, K, C, ε):
1   foreach x_i ∈ D do
2       x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
3   K ← {K(x_i, x_j)}_{i,j=1,...,n}                 // compute kernel matrix
4   t ← 0
5   β_0 ← (0, ..., 0)^T                             // initialize β_t ∈ R^n
6   repeat
7       v ← Σ_{y_i (K_i^T β_t) < 1} y_i K_i
8       S ← Σ_{y_i (K_i^T β_t) < 1} K_i K_i^T
9       ∇ ← (K + 2C S) β_t − 2C v                   // gradient
10      H ← K + 2C S                                // Hessian
11      β_{t+1} ← β_t − η_t H^{-1} ∇                // Newton update rule
12      t ← t + 1
13  until ‖β_t − β_{t−1}‖ ≤ ε

where K_i denotes the ith column of the kernel matrix K.
SVM Quadratic Kernel: Dual and Primal Solutions
c1 : Iris-versicolor (circles) and c2 : other types of Iris flowers (triangles)
[Figure: Iris data plotted on the principal components u1 and u2, with the quadratic-kernel hyperplanes hd (dual solution) and hp (primal solution).]
Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info