
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹   Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 21: Support Vector Machines

Hyperplanes
Let D = {(x_i, y_i)}_{i=1}^n be a classification dataset, with n points in a d-dimensional
space. We assume that there are only two class labels, that is, y_i ∈ {+1, −1},
denoting the positive and negative classes.
A hyperplane in d dimensions is given as the set of all points x ∈ R^d that satisfy
the equation h(x) = 0, where h(x) is the hyperplane function:

h(x) = w^T x + b = w_1 x_1 + w_2 x_2 + · · · + w_d x_d + b

Here, w is a d-dimensional weight vector and b is a scalar, called the bias.
For points that lie on the hyperplane, we have

h(x) = w^T x + b = 0

The weight vector w specifies the direction that is orthogonal or normal to the
hyperplane, which fixes the orientation of the hyperplane, whereas the bias b fixes
the offset of the hyperplane in the d-dimensional space, i.e., where the hyperplane
intersects each of the axes:

w_i x_i = −b   or   x_i = −b / w_i
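A minimal NumPy sketch (not part of the original slides) of the hyperplane function and its axis intercepts; the numbers reuse the 2D example w = (5, 2)^T, b = −20 that appears later in this chapter:

import numpy as np

def h(x, w, b):
    # Hyperplane function h(x) = w^T x + b.
    return np.dot(w, x) + b

w = np.array([5.0, 2.0])
b = -20.0

print(h(np.array([4.0, 0.0]), w, b))   # 0.0  -> this point lies on the hyperplane
print(h(np.array([2.0, 2.0]), w, b))   # -6.0 -> negative half-space
print(-b / w)                          # axis intercepts x_i = -b / w_i, here [4., 10.]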
Separating Hyperplane

A hyperplane splits the d-dimensional data space into two half-spaces.


A dataset is said to be linearly separable if each half-space has points only from a
single class.
If the input dataset is linearly separable, then we can find a separating hyperplane
h(x) = 0, such that for all points labeled y_i = −1, we have h(x_i) < 0, and for all
points labeled y_i = +1, we have h(x_i) > 0.
The hyperplane function h(x) thus serves as a linear classifier or a linear
discriminant, which predicts the class y for any given point x according to the
decision rule:

y = +1 if h(x) > 0,   and   y = −1 if h(x) < 0

Geometry of a Hyperplane: Distance
Consider a point x ∈ R^d that does not lie on the hyperplane. Let x_p be the orthogonal
projection of x on the hyperplane, and let r = x − x_p. Then we can write x as

x = x_p + r = x_p + r (w/‖w‖)

where the scalar r is the directed distance of the point x from x_p along the unit
direction w/‖w‖.

To obtain an expression for r, consider the value h(x). Since h(x_p) = w^T x_p + b = 0
and w^T w = ‖w‖², we have:

h(x) = h(x_p + r (w/‖w‖)) = w^T (x_p + r (w/‖w‖)) + b = r ‖w‖

The directed distance r of point x to the hyperplane is thus:

r = h(x) / ‖w‖

To obtain a distance, which must be non-negative, we multiply r by the class label y_i of
the point x_i, because when h(x_i) < 0 the class is −1, and when h(x_i) > 0 the class is +1:

δ_i = y_i h(x_i) / ‖w‖
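A short NumPy sketch (my own illustration, not from the slides) of the directed distance r and the non-negative distance δ:

import numpy as np

def directed_distance(x, w, b):
    # r = h(x) / ||w||
    return (np.dot(w, x) + b) / np.linalg.norm(w)

def distance(x, y, w, b):
    # delta = y * h(x) / ||w|| for a labeled point (x, y)
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([5.0, 2.0]), -20.0
print(directed_distance(np.array([2.0, 2.0]), w, b))   # negative: x lies in the h(x) < 0 half-space
print(distance(np.array([2.0, 2.0]), -1, w, b))        # ≈ 1.114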
Geometry of a Hyperplane in 2D

Consider two points on the hyperplane in 2D: p = (p_1, p_2) = (4, 0) and q = (q_1, q_2) = (2, 5).

[Figure: the hyperplane h(x) = 0 in 2D, showing the weight vector w, a point x with its
projection x_p, the directed distance r = h(x)/‖w‖, and the distance b/‖w‖ of the origin
from the hyperplane.]

The direction of the weight vector is fixed by the slope of the line through p and q:

−w_1 / w_2 = (q_2 − p_2) / (q_1 − p_1) = (5 − 0) / (2 − 4) = −5/2

so we can take w = (5, 2)^T. Given the point (4, 0) on the hyperplane, the offset b is:

b = −5 x_1 − 2 x_2 = −5 · 4 − 2 · 0 = −20

Given w = (5, 2)^T and b = −20:

h(x) = w^T x + b = 5 x_1 + 2 x_2 − 20 = 0

The distance of the origin (which lies in the negative region, y = −1) from the hyperplane is:

δ = y r = (−1) r = −b / ‖w‖ = −(−20) / √29 = 3.71
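A NumPy check of this example (my own sketch; variable names are not from the slides):

import numpy as np

p = np.array([4.0, 0.0])            # two points known to lie on the hyperplane
q = np.array([2.0, 5.0])

d = q - p                           # direction along the hyperplane
w = np.array([d[1], -d[0]])         # orthogonal to d, gives (5, 2)
b = -np.dot(w, p)                   # -20, since h(p) = w^T p + b = 0

origin = np.zeros(2)
r = (np.dot(w, origin) + b) / np.linalg.norm(w)   # ≈ -3.71
delta = -1 * r                                    # class y = -1, so delta ≈ 3.71
print(w, b, r, delta)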
Margin and Support Vectors
The distance of a point x from the hyperplane h(x) = 0 is thus given as

δ = y r = y h(x) / ‖w‖

The margin is the minimum distance of a point from the separating hyperplane:

δ* = min_{x_i} { y_i (w^T x_i + b) / ‖w‖ }

All the points (or vectors) that achieve this minimum distance are called support
vectors for the hyperplane. They satisfy the condition:

δ* = y* (w^T x* + b) / ‖w‖

where y* is the class label for the support vector x*.

Canonical Hyperplane
Multiplying the hyperplane equation on both sides by some scalar s yields an
equivalent hyperplane:

s h(x) = s w^T x + s b = (s w)^T x + (s b) = 0

To obtain the unique or canonical hyperplane, we choose the scalar

s = 1 / (y* (w^T x* + b))

so that the absolute distance of a support vector from the hyperplane is 1, i.e., the
margin is

δ* = y* (w^T x* + b) / ‖w‖ = 1 / ‖w‖

For the canonical hyperplane, for each support vector x*_i (with label y*_i), we have
y*_i h(x*_i) = 1, and for any point that is not a support vector we have y_i h(x_i) > 1.
Over all points, we have

y_i (w^T x_i + b) ≥ 1, for all points x_i ∈ D
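A small NumPy sketch (my own, assuming a support vector is already known) of the canonical rescaling:

import numpy as np

def canonical(w, b, x_sv, y_sv):
    # Rescale (w, b) so that the support vector satisfies y* h(x*) = 1.
    s = 1.0 / (y_sv * (np.dot(w, x_sv) + b))
    return s * w, s * b

w, b = np.array([5.0, 2.0]), -20.0
x_sv, y_sv = np.array([2.0, 2.0]), -1       # support vector used in the next slide
w_c, b_c = canonical(w, b, x_sv, y_sv)
print(w_c, b_c)                             # ≈ [0.833 0.333], -3.333
print(1.0 / np.linalg.norm(w_c))            # margin ≈ 1.114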
Separating Hyperplane: Margin and Support Vectors
Shaded points are support vectors

[Figure: the canonical hyperplane with the margin 1/‖w‖ marked on either side; the shaded
points are the support vectors.]

The hyperplane from the 2D example is

h(x) = (5, 2) x − 20 = 0

Given the support vector x* = (2, 2)^T with y* = −1, the rescaling factor is

s = 1 / (y* h(x*)) = 1 / (−1 · ((5, 2) (2, 2)^T − 20)) = 1/6

so that

w = (1/6) (5, 2)^T = (5/6, 2/6)^T    and    b = −20/6

h(x) = (5/6, 2/6) x − 20/6 = (0.833, 0.333) x − 3.33 = 0

The margin is

δ* = y* h(x*) / ‖w‖ = 1 / √((5/6)² + (2/6)²) = 6 / √29 = 1.114
SVM: Linear and Separable Case

Assume that the points are linearly separable, that is, there exists a separating
hyperplane that perfectly classifies each point.
The goal of SVMs is to choose the canonical hyperplane, h*, that yields the
maximum margin among all possible separating hyperplanes:

h* = arg max_{w,b} { 1 / ‖w‖ }

We can obtain an equivalent minimization formulation:

Objective Function: min_{w,b} { ‖w‖² / 2 }
Linear Constraints: y_i (w^T x_i + b) ≥ 1, for all x_i ∈ D
SVM: Linear and Separable Case
We turn the constrained SVM optimization into an unconstrained one by introducing a
Lagrange multiplier α_i for each constraint. The new objective function, called the
Lagrangian, then becomes

min L = (1/2) ‖w‖² − Σ_{i=1}^n α_i (y_i (w^T x_i + b) − 1)

L should be minimized with respect to w and b, and maximized with respect to α_i.

Taking the derivative of L with respect to w and b, and setting those to zero, we obtain

∂L/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0,   or   w = Σ_{i=1}^n α_i y_i x_i
∂L/∂b = Σ_{i=1}^n α_i y_i = 0

We can see that w can be expressed as a linear combination of the data points x_i, with
the signed Lagrange multipliers, α_i y_i, serving as the coefficients.
Further, the sum of the signed Lagrange multipliers, α_i y_i, must be zero.
SVM: Linear and Separable Case

Incorporating w = Σ_{i=1}^n α_i y_i x_i and Σ_{i=1}^n α_i y_i = 0 into the Lagrangian, we obtain the
new dual Lagrangian objective function, which is specified purely in terms of the
Lagrange multipliers:

Objective Function: max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
Linear Constraints: α_i ≥ 0 for all i = 1, ..., n,   and   Σ_{i=1}^n α_i y_i = 0

where α = (α_1, α_2, ..., α_n)^T is the vector comprising the Lagrange multipliers.

L_dual is a convex quadratic programming problem (note the α_i α_j terms), which
admits a unique optimal solution.
SVM: Linear and Separable Case

Once we have obtained the α_i values for i = 1, ..., n, we can solve for the weight
vector w and the bias b. Each of the Lagrange multipliers α_i satisfies the KKT
conditions at the optimal solution:

α_i (y_i (w^T x_i + b) − 1) = 0

which gives rise to two cases:

(1) α_i = 0, or
(2) y_i (w^T x_i + b) − 1 = 0, which implies y_i (w^T x_i + b) = 1

This is a very important result because if α_i > 0, then y_i (w^T x_i + b) = 1, and thus
the point x_i must be a support vector.
On the other hand, if y_i (w^T x_i + b) > 1, then α_i = 0; that is, if a point is not a
support vector, then α_i = 0.
Linear and Separable Case: Weight Vector and Bias
Once we know α_i for all points, we can compute the weight vector w by taking
the summation only over the support vectors:

w = Σ_{i, α_i > 0} α_i y_i x_i

Only the support vectors determine w, since α_i = 0 for the other points.

To compute the bias b, we first compute one solution b_i per support vector, as
follows:

y_i (w^T x_i + b) = 1,   which implies   b_i = 1/y_i − w^T x_i = y_i − w^T x_i

The bias b is taken as the average value:

b = avg_{α_i > 0} { b_i }
SVM Classifier

Given the optimal hyperplane function h(x) = w^T x + b, for any new point z we
predict its class as

ŷ = sign(h(z)) = sign(w^T z + b)

where the sign(·) function returns +1 if its argument is positive, and −1 if its
argument is negative.

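A minimal NumPy sketch (my own; the names alphas, X, y are assumed to hold a dual solution and the training data) of recovering w, b and classifying new points:

import numpy as np

def recover_w_b(alphas, X, y):
    # w = sum over support vectors of alpha_i y_i x_i; b = average of y_i - w^T x_i.
    sv = alphas > 1e-8
    w = ((alphas[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b

def predict(Z, w, b):
    # y_hat = sign(w^T z + b) for each row z of Z.
    return np.sign(Z @ w + b)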
Example Dataset: Separable Case

x_i     x_i1   x_i2   y_i
x_1     3.5    4.25   +1
x_2     4      3      +1
x_3     4      4      +1
x_4     4.5    1.75   +1
x_5     4.9    4.5    +1
x_6     5      4      +1
x_7     5.5    2.5    +1
x_8     5.5    3.5    +1
x_9     0.5    1.5    −1
x_10    1      2.5    −1
x_11    1.25   0.5    −1
x_12    1.5    1.5    −1
x_13    2      2      −1
x_14    2.5    0.75   −1

[Figure: scatter plot of the dataset; circles mark the positive class and triangles the negative class.]
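For use in the sketches below, the same dataset as NumPy arrays (my own encoding of the table above):

import numpy as np

X = np.array([
    [3.5, 4.25], [4, 3], [4, 4], [4.5, 1.75], [4.9, 4.5], [5, 4], [5.5, 2.5],
    [5.5, 3.5], [0.5, 1.5], [1, 2.5], [1.25, 0.5], [1.5, 1.5], [2, 2], [2.5, 0.75],
])
y = np.array([+1, +1, +1, +1, +1, +1, +1, +1, -1, -1, -1, -1, -1, -1])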
Optimal Separating Hyperplane

Solving the L_dual quadratic program yields the following support vectors and
Lagrange multipliers α_i (all other points have α_i = 0):

x_i     x_i1   x_i2   y_i   α_i
x_1     3.5    4.25   +1    0.0437
x_2     4      3      +1    0.2162
x_4     4.5    1.75   +1    0.1427
x_13    2      2      −1    0.3589
x_14    2.5    0.75   −1    0.0437

[Figure: the separating hyperplane h(x) = 0 with margin 1/‖w‖ on either side; the shaded
points are the support vectors.]

The weight vector and bias are:

w = Σ_{i, α_i > 0} α_i y_i x_i = (0.833, 0.334)^T
b = avg{ b_i } = −3.332

The optimal hyperplane is given as follows:

h(x) = (0.833, 0.334) x − 3.332 = 0
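A quick NumPy check of these numbers (my own sketch):

import numpy as np

X_sv = np.array([[3.5, 4.25], [4, 3], [4.5, 1.75], [2, 2], [2.5, 0.75]])  # x1, x2, x4, x13, x14
y_sv = np.array([+1, +1, +1, -1, -1])
alpha = np.array([0.0437, 0.2162, 0.1427, 0.3589, 0.0437])

w = ((alpha * y_sv)[:, None] * X_sv).sum(axis=0)
b = np.mean(y_sv - X_sv @ w)          # average of b_i = y_i - w^T x_i
print(w, b)                           # ≈ [0.833 0.334], ≈ -3.33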
Soft Margin SVM: Linear and Nonseparable Case

The assumption that the dataset be perfectly linearly separable is unrealistic.

SVMs can handle non-separable points by introducing slack variables ξ_i as follows:

y_i (w^T x_i + b) ≥ 1 − ξ_i

where ξ_i ≥ 0 is the slack variable for point x_i, which indicates how much the point
violates the separability condition, that is, the point may no longer be at least
1/‖w‖ away from the hyperplane.
The slack values indicate three types of points. If ξ_i = 0, then the corresponding
point x_i is at least 1/‖w‖ away from the hyperplane.
If 0 < ξ_i < 1, then the point is within the margin but still correctly classified, that
is, it is on the correct side of the hyperplane.
However, if ξ_i ≥ 1, then the point is misclassified and appears on the wrong side of
the hyperplane.
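A one-line NumPy sketch (my own) of the per-point slack, using ξ_i = max(0, 1 − y_i (w^T x_i + b)), the closed form derived in the primal-solution slides later in this chapter:

import numpy as np

def slack(X, y, w, b):
    # Slack xi_i = max(0, 1 - y_i (w^T x_i + b)) for each point.
    return np.maximum(0.0, 1.0 - y * (X @ w + b))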
Soft Margin Hyperplane
Shaded points are the support vectors

[Figure: the soft margin separating hyperplane h(x) = 0, with the margin 1/‖w‖ marked on either side.]
SVM: Soft Margin or Linearly Non-separable Case
In the nonseparable case, also called the soft margin case, the SVM objective function is

Objective Function: min_{w,b,ξ_i} { ‖w‖²/2 + C Σ_{i=1}^n (ξ_i)^k }
Linear Constraints: y_i (w^T x_i + b) ≥ 1 − ξ_i, for all x_i ∈ D
                    ξ_i ≥ 0, for all x_i ∈ D

where C and k are constants that incorporate the cost of misclassification.

The term Σ_{i=1}^n (ξ_i)^k gives the loss, that is, an estimate of the deviation from the
separable case.
The scalar C is a regularization constant that controls the trade-off between
maximizing the margin and minimizing the loss. For example, if C → 0, then the
loss component essentially disappears, and the objective defaults to maximizing
the margin. On the other hand, if C → ∞, then the margin ceases to have much
effect, and the objective function tries to minimize the loss.
SVM: Soft Margin Loss Function
The constant k governs the form of the loss. When k = 1, called hinge loss, the
goal is to minimize the sum of the slack variables, whereas when k = 2, called
quadratic loss, the goal is to minimize the sum of the squared slack variables.

Hinge Loss: Assuming k = 1, the SVM dual Lagrangian is given as

max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j

The only difference from the separable case is that the constraint becomes 0 ≤ α_i ≤ C.

Quadratic Loss: Assuming k = 2, the dual objective is:

max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i^T x_j + (1/2C) δ_ij)

where δ is the Kronecker delta function, defined as δ_ij = 1 if and only if i = j (and δ_ij = 0 otherwise).
Example Dataset: Linearly Non-separable Case

x_i     x_i1   x_i2   y_i
x_1     3.5    4.25   +1
x_2     4      3      +1
x_3     4      4      +1
x_4     4.5    1.75   +1
x_5     4.9    4.5    +1
x_6     5      4      +1
x_7     5.5    2.5    +1
x_8     5.5    3.5    +1
x_9     0.5    1.5    −1
x_10    1      2.5    −1
x_11    1.25   0.5    −1
x_12    1.5    1.5    −1
x_13    2      2      −1
x_14    2.5    0.75   −1
x_15    4      2      +1
x_16    2      3      +1
x_17    3      2      −1
x_18    5      3      −1

[Figure: scatter plot of the dataset; circles mark the positive class and triangles the
negative class. The added points x_15–x_18 make the data linearly non-separable.]
Example Dataset: Linearly Non-separable Case
Let k = 1 and C = 1. Solving the L_dual yields the following support vectors
and Lagrangian values α_i:

x_i     x_i1   x_i2   y_i   α_i
x_1     3.5    4.25   +1    0.0271
x_2     4      3      +1    0.2162
x_4     4.5    1.75   +1    0.9928
x_13    2      2      −1    0.9928
x_14    2.5    0.75   −1    0.2434
x_15    4      2      +1    1
x_16    2      3      +1    1
x_17    3      2      −1    1
x_18    5      3      −1    1

The optimal hyperplane is given as follows:

h(x) = (0.834, 0.333) x − 3.334 = 0
Example Dataset: Linearly Non-separable Case
The slack ξ_i = 0 for all points that are not support vectors, and also for those
support vectors that are on the margin. Slack is positive only for the remaining
support vectors, and it can be computed as ξ_i = 1 − y_i (w^T x_i + b).
Thus, for all support vectors not on the margin, we have:

x_i     w^T x_i   w^T x_i + b   ξ_i = 1 − y_i (w^T x_i + b)
x_15    4.001     0.667         0.333
x_16    2.667     −0.667        1.667
x_17    3.167     −0.167        0.833
x_18    5.168     1.834         2.834

The total slack is given as

Σ_i ξ_i = ξ_15 + ξ_16 + ξ_17 + ξ_18 = 0.333 + 1.667 + 0.833 + 2.834 = 5.667

The slack variable ξ_i > 1 for those points that are misclassified (i.e., are on the
wrong side of the hyperplane), namely x_16 = (2, 3)^T and x_18 = (5, 3)^T. The other
two points are correctly classified, but lie within the margin, and thus satisfy
0 < ξ_i < 1.
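A quick NumPy check of the slack values above (my own sketch):

import numpy as np

w, b = np.array([0.834, 0.333]), -3.334
X_sv = np.array([[4, 2], [2, 3], [3, 2], [5, 3]])    # x15, x16, x17, x18
y_sv = np.array([+1, +1, -1, -1])

xi = np.maximum(0.0, 1.0 - y_sv * (X_sv @ w + b))
print(xi, xi.sum())                                  # ≈ [0.333 1.667 0.833 2.834], total ≈ 5.667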
Kernel SVM: Nonlinear Case

The linear SVM approach can be used for datasets with a nonlinear decision
boundary via the kernel trick.
Conceptually, the idea is to map the original d-dimensional points x_i in the input
space to points φ(x_i) in a high-dimensional feature space via some nonlinear
transformation φ.
Given the extra flexibility, it is more likely that the points φ(x_i) might be linearly
separable in the feature space.
A linear decision surface in feature space actually corresponds to a nonlinear
decision surface in the input space.
Further, the kernel trick allows us to carry out all operations via the kernel function
computed in input space, rather than having to map the points into feature space.

Nonlinear SVM
There is no linear classifier that can discriminate between the points. However,
there exists a perfect quadratic classifier that can separate the two classes, using,
for example, the mapping

φ(x) = (√2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)^T

[Figure: a 2D dataset that is not linearly separable; the two classes (circles and triangles)
are separated by a quadratic (elliptical) decision boundary in the input space.]
Nonlinear SVMs: Kernel Trick

To apply the kernel trick for nonlinear SVM classification, we have to show that
all operations require only the kernel function:

K(x_i, x_j) = φ(x_i)^T φ(x_j)

Applying φ to each point, we can obtain the new dataset in the feature space,
D_φ = {(φ(x_i), y_i)}_{i=1}^n.
The SVM objective function in feature space is given as

Objective Function: min_{w,b,ξ_i} { ‖w‖²/2 + C Σ_{i=1}^n (ξ_i)^k }
Linear Constraints: y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, and ξ_i ≥ 0, for all x_i ∈ D

where w is the weight vector, b is the bias, and ξ_i are the slack variables, all in
feature space.
Nonlinear SVMs: Kernel Trick

For hinge loss, the dual Lagrangian in feature space is given as

max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j)
             = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)

subject to the constraints 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0.
The dual Lagrangian depends only on the dot product between two vectors in
feature space, φ(x_i)^T φ(x_j) = K(x_i, x_j), and thus we can solve the optimization
problem using the kernel matrix K = {K(x_i, x_j)}_{i,j=1,...,n}.
For quadratic loss, the dual Lagrangian corresponds to the use of a new kernel

K_q(x_i, x_j) = φ(x_i)^T φ(x_j) + (1/2C) δ_ij = K(x_i, x_j) + (1/2C) δ_ij
Nonlinear SVMs: Weight Vector and Bias
We cannot directly obtain the weight vector without transforming the points, since

w = Σ_{α_i > 0} α_i y_i φ(x_i)

However, we can compute the bias via kernel operations, since

b_i = y_i − w^T φ(x_i) = y_i − Σ_{α_j > 0} α_j y_j K(x_j, x_i)

Likewise, we can predict the class for a new point z as follows:

ŷ = sign(w^T φ(z) + b) = sign( Σ_{α_i > 0} α_i y_i K(x_i, z) + b )

All SVM operations can be carried out in terms of the kernel function
K(x_i, x_j) = φ(x_i)^T φ(x_j). Thus, any nonlinear kernel function can be used to do
nonlinear classification in the input space.
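A kernel-only prediction sketch (my own illustration; the support vectors, multipliers, bias, and kernel are assumed to come from a trained model):

import numpy as np

def kernel_predict(z, X_sv, y_sv, alpha_sv, b, kernel):
    # y_hat = sign( sum_i alpha_i y_i K(x_i, z) + b ), using only kernel evaluations.
    s = sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)

# Example kernel: the inhomogeneous quadratic kernel used in the next example.
quad_kernel = lambda xi, xj: (1.0 + np.dot(xi, xj)) ** 2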
Nonlinear SVM: Inhomogeneous Quadratic Kernel
[Figure: the same 2D dataset as before, now with the elliptical decision boundary found by
the kernel SVM drawn in the input space.]

The optimal quadratic hyperplane is obtained by setting C = 4, and using an
inhomogeneous polynomial kernel of degree q = 2:

K(x_i, x_j) = φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)²
Nonlinear SVM: Inhomogeneous Quadratic Kernel
φ maps x_i into feature space as follows:

φ(x = (x_1, x_2)^T) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)^T

For example, x_1 = (1, 2)^T is transformed into

φ(x_1) = (1, √2 · 1, √2 · 2, 1², 2², √2 · 1 · 2)^T = (1, 1.41, 2.83, 1, 4, 2.83)^T

Solving L_dual, we found the following six support vectors:

x_i   (x_i1, x_i2)^T   φ(x_i)                               y_i   α_i
x_1   (1, 2)^T         (1, 1.41, 2.83, 1, 4, 2.83)^T        +1    0.6198
x_2   (4, 1)^T         (1, 5.66, 1.41, 16, 1, 5.66)^T       +1    2.069
x_3   (6, 4.5)^T       (1, 8.49, 6.36, 36, 20.25, 38.18)^T  +1    3.803
x_4   (7, 2)^T         (1, 9.90, 2.83, 49, 4, 19.80)^T      +1    0.3182
x_5   (4, 4)^T         (1, 5.66, 5.66, 16, 16, 15.91)^T     −1    2.9598
x_6   (6, 3)^T         (1, 8.49, 4.24, 36, 9, 25.46)^T      −1    3.8502
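A NumPy sketch (my own) of the feature map, checking that it reproduces the kernel value:

import numpy as np

def phi(x):
    # Feature map for the inhomogeneous quadratic kernel in d = 2.
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

x, z = np.array([1.0, 2.0]), np.array([4.0, 1.0])
print(phi(x))                                             # ≈ [1, 1.41, 2.83, 1, 4, 2.83]
print(np.dot(phi(x), phi(z)), (1 + np.dot(x, z)) ** 2)    # both 49.0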
Nonlinear SVM: Inhomogeneous Quadratic Kernel

We compute the weight vector for the hyperplane in feature space:

w = Σ_{α_i > 0} α_i y_i φ(x_i) = (0, −1.413, −3.298, 0.256, 0.82, −0.018)^T

as well as the bias:

b = −8.841

The decision boundary in input space corresponds to an ellipse, centered at
(4.046, 2.907), with axis lengths 2.78 and 1.55.
Notice that we explicitly transformed all the points into the feature space just for
illustration purposes.
The kernel trick allows us to achieve the same goal using only the kernel function.
SVM Training Algorithms
Instead of dealing explicitly with the bias b, we map each point x_i ∈ R^d to the
point x'_i ∈ R^{d+1} as follows:

x'_i = (x_i1, ..., x_id, 1)^T

We also map the weight vector to R^{d+1}, with w_{d+1} = b, so that

w = (w_1, ..., w_d, b)^T

The equation of the hyperplane is then given as follows:

h(x') : w^T x' = w_1 x_1 + · · · + w_d x_d + b = 0

After the mapping, the constraint Σ_{i=1}^n α_i y_i = 0 no longer applies in the SVM dual
formulations. The new set of constraints is given as

y_i w^T x'_i ≥ 1 − ξ_i
Dual Optimization: Gradient Ascent
The dual optimization objective for hinge loss is given as

max_α J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)

subject to the constraints 0 ≤ α_i ≤ C for all i = 1, ..., n. Here
α = (α_1, α_2, ..., α_n)^T ∈ R^n.
The gradient, or the rate of change of the objective function at α, is given as the
vector of partial derivatives of J(α) with respect to each α_k:

∇J(α) = ( ∂J(α)/∂α_1, ∂J(α)/∂α_2, ..., ∂J(α)/∂α_n )^T

where the kth component of the gradient is obtained by differentiating J(α) with
respect to α_k:

∂J(α)/∂α_k = 1 − y_k Σ_{i=1}^n α_i y_i K(x_i, x_k)
Stochastic Gradient Ascent
Starting from an initial α, the gradient ascent approach successively updates it by
moving in the direction of the gradient ∇J(α):

α_{t+1} = α_t + η_t ∇J(α_t)

where α_t is the estimate at the tth step, and η_t is the step size.
For the kth component, the optimal step size is:

η_k = 1 / K(x_k, x_k)

Instead of updating the entire α vector in each step, in the stochastic gradient
ascent approach we update each component α_k independently and immediately
use the new value to update the other components. The update rule for the kth
component is given as

α_k = α_k + η_k ∂J(α)/∂α_k = α_k + η_k ( 1 − y_k Σ_{i=1}^n α_i y_i K(x_i, x_k) )
Algorithm SVM-Dual
SVM-Dual (D, K, C, ε):
 1  foreach x_i ∈ D do x_i ← (x_i^T, 1)^T            // map to R^{d+1}
 2  if loss = hinge then
 3      K ← {K(x_i, x_j)}_{i,j=1,...,n}              // kernel matrix, hinge loss
 4  else if loss = quadratic then
 5      K ← {K(x_i, x_j) + (1/2C) δ_ij}_{i,j=1,...,n}    // kernel matrix, quadratic loss
 6  for k = 1, ..., n do η_k ← 1 / K(x_k, x_k)       // step sizes
 7  t ← 0
 8  α_0 ← (0, ..., 0)^T
 9  repeat
10      α ← α_t
11      for k = 1 to n do
12          α_k ← α_k + η_k ( 1 − y_k Σ_{i=1}^n α_i y_i K(x_i, x_k) )   // update kth component of α
13          if α_k < 0 then α_k ← 0
14          if α_k > C then α_k ← C
15      α_{t+1} ← α
16      t ← t + 1
17  until ‖α_t − α_{t−1}‖ ≤ ε
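A compact NumPy rendering of this pseudocode (a sketch of my own, assuming y is a ±1 array and kernel is any function of two vectors):

import numpy as np

def svm_dual(X, y, kernel, C, eps=1e-4, loss="hinge"):
    # Stochastic gradient ascent on the SVM dual, following the pseudocode above.
    Xh = np.hstack([X, np.ones((len(X), 1))])        # map each point to (x_i, 1)
    n = len(Xh)
    K = np.array([[kernel(Xh[i], Xh[j]) for j in range(n)] for i in range(n)])
    if loss == "quadratic":
        K = K + np.eye(n) / (2 * C)                  # quadratic-loss kernel matrix
    eta = 1.0 / np.diag(K)                           # step sizes eta_k = 1 / K(x_k, x_k)
    alpha = np.zeros(n)
    while True:
        alpha_prev = alpha.copy()
        for k in range(n):
            alpha[k] += eta[k] * (1 - y[k] * np.sum(alpha * y * K[:, k]))
            alpha[k] = min(max(alpha[k], 0.0), C)    # clip alpha_k to [0, C]
        if np.linalg.norm(alpha - alpha_prev) <= eps:
            return alpha

# Hypothetical usage with a linear kernel:
# alpha = svm_dual(X, y, kernel=np.dot, C=1.0)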
SVM Dual Algorithm: Iris Data – Linear Kernel
c1 : Iris-setosa (circles) and c2 : other types of Iris flowers (triangles)
[Figure: the Iris data in the (X1, X2) plane, with circles for c1 and triangles for c2, and the
two hyperplanes h10 and h1000.]
Hyperplane h10 uses C = 10 and h1000 uses C = 1000:

h10(x) : 2.74 x_1 − 3.74 x_2 − 3.09 = 0
h1000(x) : 8.56 x_1 − 7.14 x_2 − 23.12 = 0

h10 has a larger margin, but also a larger slack; h1000 has a smaller margin, but it
minimizes the slack.
SVM Dual Algorithm: Quadratic versus Linear Kernel
c1 : Iris-versicolor (circles) and c2 : other types of Iris flowers (triangles)

[Figure: the Iris data in the (u1, u2) plane, with circles for c1 and triangles for c2; hl is the
linear kernel decision boundary and hq is the quadratic kernel decision boundary.]
Primal Solution: Newton Optimization

Consider the primal optimization function for soft margin SVMs. With
w, x_i ∈ R^{d+1}, we have to minimize the objective function:

min_w J(w) = (1/2) ‖w‖² + C Σ_{i=1}^n (ξ_i)^k

subject to the linear constraints:

y_i (w^T x_i) ≥ 1 − ξ_i and ξ_i ≥ 0, for all i = 1, ..., n

Rearranging the above, we obtain an expression for ξ_i:

ξ_i ≥ 1 − y_i (w^T x_i) and ξ_i ≥ 0, which implies that

ξ_i = max{ 0, 1 − y_i (w^T x_i) }
Primal Solution: Newton Optimization, Quadratic Loss
The objective function can be rewritten as

J(w) = (1/2) ‖w‖² + C Σ_{i=1}^n max{ 0, 1 − y_i (w^T x_i) }^k
     = (1/2) ‖w‖² + C Σ_{y_i (w^T x_i) < 1} ( 1 − y_i (w^T x_i) )^k

For quadratic loss, we have k = 2, and the gradient or the rate of change of the
objective function at w is given as the partial derivative of J(w) with respect to w:

∇_w = ∂J(w)/∂w = w − 2C v + 2C S w

where the vector v and the matrix S are given as

v = Σ_{y_i (w^T x_i) < 1} y_i x_i        S = Σ_{y_i (w^T x_i) < 1} x_i x_i^T
Primal Solution: Newton Optimization, Quadratic Loss

The Hessian matrix is defined as the matrix of second-order partial derivatives of
J(w) with respect to w, which is given as

H_w = ∂∇_w/∂w = I + 2C S

Because we want to minimize the objective function J(w), we should move in the
direction opposite to the gradient. The Newton optimization update rule for w is
given as

w_{t+1} = w_t − η_t H_{w_t}^{−1} ∇_{w_t}

where η_t > 0 is a scalar value denoting the step size at iteration t.
Primal SVM Algorithm
SVM-Primal (D, C, ε):
 1  foreach x_i ∈ D do
 2      x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
 3  t ← 0
 4  w_0 ← (0, ..., 0)^T                             // initialize w_t ∈ R^{d+1}
 5  repeat
 6      v ← Σ_{y_i (w_t^T x_i) < 1} y_i x_i
 7      S ← Σ_{y_i (w_t^T x_i) < 1} x_i x_i^T
 8      ∇ ← (I + 2C S) w_t − 2C v                   // gradient
 9      H ← I + 2C S                                // Hessian
10      w_{t+1} ← w_t − η_t H^{−1} ∇                // Newton update rule
11      t ← t + 1
12  until ‖w_t − w_{t−1}‖ ≤ ε
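A NumPy rendering of the primal Newton iteration with quadratic loss (my own sketch; the fixed step size eta is an assumption, since the slides leave η_t unspecified):

import numpy as np

def svm_primal(X, y, C, eta=1.0, eps=1e-6, max_iter=100):
    # Newton's method on the soft margin primal with quadratic loss (see pseudocode above).
    Xh = np.hstack([X, np.ones((len(X), 1))])       # map each point to (x_i, 1)
    n, d1 = Xh.shape
    w = np.zeros(d1)
    for _ in range(max_iter):
        viol = y * (Xh @ w) < 1                     # points with y_i w^T x_i < 1
        v = (y[viol][:, None] * Xh[viol]).sum(axis=0)
        S = Xh[viol].T @ Xh[viol]
        grad = (np.eye(d1) + 2 * C * S) @ w - 2 * C * v
        H = np.eye(d1) + 2 * C * S
        w_new = w - eta * np.linalg.solve(H, grad)  # Newton update
        if np.linalg.norm(w_new - w) <= eps:
            return w_new
        w = w_new
    return w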
SVMs: Dual and Primal Solutions
c1 : Iris-setosa (circles) and c2 : other types of Iris flowers (triangles)

[Figure: the Iris data in the (X1, X2) plane, with circles for c1 and triangles for c2; the dual
solution hd and the primal solution hp essentially coincide.]
SVM Primal Kernel Algorithm: Newton Optimization
The linear soft margin primal algorithm, with quadratic loss, can easily be
extended to work on any kernel matrix K (here K_i denotes the ith row of the
kernel matrix, and β ∈ R^n plays the role of w):

SVM-Primal-Kernel (D, K, C, ε):
 1  foreach x_i ∈ D do
 2      x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
 3  K ← {K(x_i, x_j)}_{i,j=1,...,n}                 // compute kernel matrix
 4  t ← 0
 5  β_0 ← (0, ..., 0)^T                             // initialize β_t ∈ R^n
 6  repeat
 7      v ← Σ_{y_i (K_i^T β_t) < 1} y_i K_i
 8      S ← Σ_{y_i (K_i^T β_t) < 1} K_i K_i^T
 9      ∇ ← (K + 2C S) β_t − 2C v                   // gradient
10      H ← K + 2C S                                // Hessian
11      β_{t+1} ← β_t − η_t H^{−1} ∇                // Newton update rule
12      t ← t + 1
13  until ‖β_t − β_{t−1}‖ ≤ ε
SVM Quadratic Kernel: Dual and Primal Solutions
c1 : Iris-versicolor (circles) and c2 : other types of Iris flowers (triangles)

[Figure: the Iris data in the (u1, u2) plane, with circles for c1 and triangles for c2; hd and hp
are the quadratic kernel decision boundaries obtained from the dual and primal solutions.]

