
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹   Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 21: Support Vector Machines

Hyperplanes
Let D = {(x_i, y_i)}_{i=1}^n be a classification dataset, with n points in a d-dimensional
space. We assume that there are only two class labels, that is, y_i ∈ {+1, −1},
denoting the positive and negative classes.
A hyperplane in d dimensions is given as the set of all points x ∈ R^d that satisfy
the equation h(x) = 0, where h(x) is the hyperplane function:

h(x) = w^T x + b = w_1 x_1 + w_2 x_2 + · · · + w_d x_d + b

Here, w is a d-dimensional weight vector and b is a scalar, called the bias.
For points that lie on the hyperplane, we have

h(x) = w^T x + b = 0

The weight vector w specifies the direction that is orthogonal or normal to the
hyperplane, which fixes the orientation of the hyperplane, whereas the bias b fixes
the offset of the hyperplane in the d-dimensional space, i.e., where the hyperplane
intersects each of the axes:

w_i x_i = −b   or   x_i = −b / w_i
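A minimal NumPy sketch (not part of the original slides) of the hyperplane function and its axis intercepts; the numbers reuse the 2D example w = (5, 2)^T, b = −20 that appears later in this chapter:

import numpy as np

def h(x, w, b):
    # Hyperplane function h(x) = w^T x + b.
    return np.dot(w, x) + b

w = np.array([5.0, 2.0])
b = -20.0

print(h(np.array([4.0, 0.0]), w, b))   # 0.0  -> this point lies on the hyperplane
print(h(np.array([2.0, 2.0]), w, b))   # -6.0 -> negative half-space
print(-b / w)                          # axis intercepts x_i = -b / w_i, here [4., 10.]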
Separating Hyperplane

A hyperplane splits the d-dimensional data space into two half-spaces.


A dataset is said to be linearly separable if each half-space has points only from a
single class.
If the input dataset is linearly separable, then we can find a separating hyperplane
h(x) = 0, such that for all points labeled y_i = −1, we have h(x_i) < 0, and for all
points labeled y_i = +1, we have h(x_i) > 0.
The hyperplane function h(x) thus serves as a linear classifier or a linear
discriminant, which predicts the class y for any given point x according to the
decision rule:

y = +1 if h(x) > 0,   and   y = −1 if h(x) < 0

Geometry of a Hyperplane: Distance
Consider a point x ∈ R^d that does not lie on the hyperplane. Let x_p be the orthogonal
projection of x on the hyperplane, and let r = x − x_p. Then we can write x as

x = x_p + r = x_p + r (w/‖w‖)

where the scalar r is the directed distance of the point x from x_p along the unit
direction w/‖w‖.

To obtain an expression for r, consider the value h(x). Since h(x_p) = w^T x_p + b = 0
and w^T w = ‖w‖², we have:

h(x) = h(x_p + r (w/‖w‖)) = w^T (x_p + r (w/‖w‖)) + b = r ‖w‖

The directed distance r of point x to the hyperplane is thus:

r = h(x) / ‖w‖

To obtain a distance, which must be non-negative, we multiply r by the class label y_i of
the point x_i, because when h(x_i) < 0 the class is −1, and when h(x_i) > 0 the class is +1:

δ_i = y_i h(x_i) / ‖w‖
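A short NumPy sketch (my own illustration, not from the slides) of the directed distance r and the non-negative distance δ:

import numpy as np

def directed_distance(x, w, b):
    # r = h(x) / ||w||
    return (np.dot(w, x) + b) / np.linalg.norm(w)

def distance(x, y, w, b):
    # delta = y * h(x) / ||w|| for a labeled point (x, y)
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([5.0, 2.0]), -20.0
print(directed_distance(np.array([2.0, 2.0]), w, b))   # negative: x lies in the h(x) < 0 half-space
print(distance(np.array([2.0, 2.0]), -1, w, b))        # ≈ 1.114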
Geometry of a Hyperplane in 2D

Consider two points on the hyperplane in 2D: p = (p_1, p_2) = (4, 0) and q = (q_1, q_2) = (2, 5).

[Figure: the hyperplane h(x) = 0 in 2D, showing the weight vector w, a point x with its
projection x_p, the directed distance r = h(x)/‖w‖, and the distance b/‖w‖ of the origin
from the hyperplane.]

The direction of the weight vector is fixed by the slope of the line through p and q:

−w_1 / w_2 = (q_2 − p_2) / (q_1 − p_1) = (5 − 0) / (2 − 4) = −5/2

so we can take w = (5, 2)^T. Given the point (4, 0) on the hyperplane, the offset b is:

b = −5 x_1 − 2 x_2 = −5 · 4 − 2 · 0 = −20

Given w = (5, 2)^T and b = −20:

h(x) = w^T x + b = 5 x_1 + 2 x_2 − 20 = 0

The distance of the origin (which lies in the negative region, y = −1) from the hyperplane is:

δ = y r = (−1) r = −b / ‖w‖ = −(−20) / √29 = 3.71
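A NumPy check of this example (my own sketch; variable names are not from the slides):

import numpy as np

p = np.array([4.0, 0.0])            # two points known to lie on the hyperplane
q = np.array([2.0, 5.0])

d = q - p                           # direction along the hyperplane
w = np.array([d[1], -d[0]])         # orthogonal to d, gives (5, 2)
b = -np.dot(w, p)                   # -20, since h(p) = w^T p + b = 0

origin = np.zeros(2)
r = (np.dot(w, origin) + b) / np.linalg.norm(w)   # ≈ -3.71
delta = -1 * r                                    # class y = -1, so delta ≈ 3.71
print(w, b, r, delta)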
Margin and Support Vectors
The distance of a point x from the hyperplane h(x) = 0 is thus given as

δ = y r = y h(x) / ‖w‖

The margin is the minimum distance of a point from the separating hyperplane:

δ* = min_{x_i} { y_i (w^T x_i + b) / ‖w‖ }

All the points (or vectors) that achieve this minimum distance are called support
vectors for the hyperplane. They satisfy the condition:

δ* = y* (w^T x* + b) / ‖w‖

where y* is the class label for the support vector x*.

Canonical Hyperplane
Multiplying the hyperplane equation on both sides by some scalar s yields an
equivalent hyperplane:

s h(x) = s w^T x + s b = (s w)^T x + (s b) = 0

To obtain the unique or canonical hyperplane, we choose the scalar

s = 1 / (y* (w^T x* + b))

so that the absolute distance of a support vector from the hyperplane is 1, i.e., the
margin is

δ* = y* (w^T x* + b) / ‖w‖ = 1 / ‖w‖

For the canonical hyperplane, for each support vector x*_i (with label y*_i), we have
y*_i h(x*_i) = 1, and for any point that is not a support vector we have y_i h(x_i) > 1.
Over all points, we have

y_i (w^T x_i + b) ≥ 1, for all points x_i ∈ D
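A small NumPy sketch (my own, assuming a support vector is already known) of the canonical rescaling:

import numpy as np

def canonical(w, b, x_sv, y_sv):
    # Rescale (w, b) so that the support vector satisfies y* h(x*) = 1.
    s = 1.0 / (y_sv * (np.dot(w, x_sv) + b))
    return s * w, s * b

w, b = np.array([5.0, 2.0]), -20.0
x_sv, y_sv = np.array([2.0, 2.0]), -1       # support vector used in the next slide
w_c, b_c = canonical(w, b, x_sv, y_sv)
print(w_c, b_c)                             # ≈ [0.833 0.333], -3.333
print(1.0 / np.linalg.norm(w_c))            # margin ≈ 1.114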
Separating Hyperplane: Margin and Support Vectors
Shaded points are support vectors

[Figure: the canonical hyperplane with the margin 1/‖w‖ marked on either side; the shaded
points are the support vectors.]

The hyperplane from the 2D example is

h(x) = (5, 2) x − 20 = 0

Given the support vector x* = (2, 2)^T with y* = −1, the rescaling factor is

s = 1 / (y* h(x*)) = 1 / (−1 · ((5, 2) (2, 2)^T − 20)) = 1/6

so that

w = (1/6) (5, 2)^T = (5/6, 2/6)^T    and    b = −20/6

h(x) = (5/6, 2/6) x − 20/6 = (0.833, 0.333) x − 3.33 = 0

The margin is

δ* = y* h(x*) / ‖w‖ = 1 / √((5/6)² + (2/6)²) = 6 / √29 = 1.114
SVM: Linear and Separable Case

Assume that the points are linearly separable, that is, there exists a separating
hyperplane that perfectly classifies each point.
The goal of SVMs is to choose the canonical hyperplane, h*, that yields the
maximum margin among all possible separating hyperplanes:

h* = arg max_{w,b} { 1 / ‖w‖ }

We can obtain an equivalent minimization formulation:

Objective Function: min_{w,b} { ‖w‖² / 2 }
Linear Constraints: y_i (w^T x_i + b) ≥ 1, for all x_i ∈ D
SVM: Linear and Separable Case
We turn the constrained SVM optimization into an unconstrained one by introducing a
Lagrange multiplier α_i for each constraint. The new objective function, called the
Lagrangian, then becomes

min L = (1/2) ‖w‖² − Σ_{i=1}^n α_i (y_i (w^T x_i + b) − 1)

L should be minimized with respect to w and b, and maximized with respect to α_i.

Taking the derivative of L with respect to w and b, and setting those to zero, we obtain

∂L/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0,   or   w = Σ_{i=1}^n α_i y_i x_i
∂L/∂b = Σ_{i=1}^n α_i y_i = 0

We can see that w can be expressed as a linear combination of the data points x_i, with
the signed Lagrange multipliers, α_i y_i, serving as the coefficients.
Further, the sum of the signed Lagrange multipliers, α_i y_i, must be zero.
SVM: Linear and Separable Case

Incorporating w = Σ_{i=1}^n α_i y_i x_i and Σ_{i=1}^n α_i y_i = 0 into the Lagrangian, we obtain the
new dual Lagrangian objective function, which is specified purely in terms of the
Lagrange multipliers:

Objective Function: max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
Linear Constraints: α_i ≥ 0 for all i = 1, ..., n,   and   Σ_{i=1}^n α_i y_i = 0

where α = (α_1, α_2, ..., α_n)^T is the vector comprising the Lagrange multipliers.

L_dual is a convex quadratic programming problem (note the α_i α_j terms), which
admits a unique optimal solution.
SVM: Linear and Separable Case

Once we have obtained the α_i values for i = 1, ..., n, we can solve for the weight
vector w and the bias b. Each of the Lagrange multipliers α_i satisfies the KKT
conditions at the optimal solution:

α_i (y_i (w^T x_i + b) − 1) = 0

which gives rise to two cases:

(1) α_i = 0, or
(2) y_i (w^T x_i + b) − 1 = 0, which implies y_i (w^T x_i + b) = 1

This is a very important result because if α_i > 0, then y_i (w^T x_i + b) = 1, and thus
the point x_i must be a support vector.
On the other hand, if y_i (w^T x_i + b) > 1, then α_i = 0; that is, if a point is not a
support vector, then α_i = 0.
Linear and Separable Case: Weight Vector and Bias
Once we know α_i for all points, we can compute the weight vector w by taking
the summation only over the support vectors:

w = Σ_{i, α_i > 0} α_i y_i x_i

Only the support vectors determine w, since α_i = 0 for the other points.

To compute the bias b, we first compute one solution b_i per support vector, as
follows:

y_i (w^T x_i + b) = 1,   which implies   b_i = 1/y_i − w^T x_i = y_i − w^T x_i

The bias b is taken as the average value:

b = avg_{α_i > 0} { b_i }
SVM Classifier

Given the optimal hyperplane function h(x) = w^T x + b, for any new point z we
predict its class as

ŷ = sign(h(z)) = sign(w^T z + b)

where the sign(·) function returns +1 if its argument is positive, and −1 if its
argument is negative.

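A minimal NumPy sketch (my own; the names alphas, X, y are assumed to hold a dual solution and the training data) of recovering w, b and classifying new points:

import numpy as np

def recover_w_b(alphas, X, y):
    # w = sum over support vectors of alpha_i y_i x_i; b = average of y_i - w^T x_i.
    sv = alphas > 1e-8
    w = ((alphas[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b

def predict(Z, w, b):
    # y_hat = sign(w^T z + b) for each row z of Z.
    return np.sign(Z @ w + b)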
Example Dataset: Separable Case

x_i     x_i1   x_i2   y_i
x_1     3.5    4.25   +1
x_2     4      3      +1
x_3     4      4      +1
x_4     4.5    1.75   +1
x_5     4.9    4.5    +1
x_6     5      4      +1
x_7     5.5    2.5    +1
x_8     5.5    3.5    +1
x_9     0.5    1.5    −1
x_10    1      2.5    −1
x_11    1.25   0.5    −1
x_12    1.5    1.5    −1
x_13    2      2      −1
x_14    2.5    0.75   −1

[Figure: scatter plot of the dataset; circles mark the positive class and triangles the negative class.]
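For use in the sketches below, the same dataset as NumPy arrays (my own encoding of the table above):

import numpy as np

X = np.array([
    [3.5, 4.25], [4, 3], [4, 4], [4.5, 1.75], [4.9, 4.5], [5, 4], [5.5, 2.5],
    [5.5, 3.5], [0.5, 1.5], [1, 2.5], [1.25, 0.5], [1.5, 1.5], [2, 2], [2.5, 0.75],
])
y = np.array([+1, +1, +1, +1, +1, +1, +1, +1, -1, -1, -1, -1, -1, -1])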
Optimal Separating Hyperplane

Solving the L_dual quadratic program yields the following support vectors and
Lagrange multipliers α_i (all other points have α_i = 0):

x_i     x_i1   x_i2   y_i   α_i
x_1     3.5    4.25   +1    0.0437
x_2     4      3      +1    0.2162
x_4     4.5    1.75   +1    0.1427
x_13    2      2      −1    0.3589
x_14    2.5    0.75   −1    0.0437

[Figure: the separating hyperplane h(x) = 0 with margin 1/‖w‖ on either side; the shaded
points are the support vectors.]

The weight vector and bias are:

w = Σ_{i, α_i > 0} α_i y_i x_i = (0.833, 0.334)^T
b = avg{ b_i } = −3.332

The optimal hyperplane is given as follows:

h(x) = (0.833, 0.334) x − 3.332 = 0
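A quick NumPy check of these numbers (my own sketch):

import numpy as np

X_sv = np.array([[3.5, 4.25], [4, 3], [4.5, 1.75], [2, 2], [2.5, 0.75]])  # x1, x2, x4, x13, x14
y_sv = np.array([+1, +1, +1, -1, -1])
alpha = np.array([0.0437, 0.2162, 0.1427, 0.3589, 0.0437])

w = ((alpha * y_sv)[:, None] * X_sv).sum(axis=0)
b = np.mean(y_sv - X_sv @ w)          # average of b_i = y_i - w^T x_i
print(w, b)                           # ≈ [0.833 0.334], ≈ -3.33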
Soft Margin SVM: Linear and Nonseparable Case

The assumption that the dataset be perfectly linearly separable is unrealistic.

SVMs can handle non-separable points by introducing slack variables ξ_i as follows:

y_i (w^T x_i + b) ≥ 1 − ξ_i

where ξ_i ≥ 0 is the slack variable for point x_i, which indicates how much the point
violates the separability condition, that is, the point may no longer be at least
1/‖w‖ away from the hyperplane.
The slack values indicate three types of points. If ξ_i = 0, then the corresponding
point x_i is at least 1/‖w‖ away from the hyperplane.
If 0 < ξ_i < 1, then the point is within the margin but still correctly classified, that
is, it is on the correct side of the hyperplane.
However, if ξ_i ≥ 1, then the point is misclassified and appears on the wrong side of
the hyperplane.
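A one-line NumPy sketch (my own) of the per-point slack, using ξ_i = max(0, 1 − y_i (w^T x_i + b)), the closed form derived in the primal-solution slides later in this chapter:

import numpy as np

def slack(X, y, w, b):
    # Slack xi_i = max(0, 1 - y_i (w^T x_i + b)) for each point.
    return np.maximum(0.0, 1.0 - y * (X @ w + b))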
Soft Margin Hyperplane
Shaded points are the support vectors

[Figure: the soft margin separating hyperplane h(x) = 0, with the margin 1/‖w‖ marked on either side.]
SVM: Soft Margin or Linearly Non-separable Case
In the nonseparable case, also called the soft margin case, the SVM objective function is

Objective Function: min_{w,b,ξ_i} { ‖w‖²/2 + C Σ_{i=1}^n (ξ_i)^k }
Linear Constraints: y_i (w^T x_i + b) ≥ 1 − ξ_i, for all x_i ∈ D
                    ξ_i ≥ 0, for all x_i ∈ D

where C and k are constants that incorporate the cost of misclassification.

The term Σ_{i=1}^n (ξ_i)^k gives the loss, that is, an estimate of the deviation from the
separable case.
The scalar C is a regularization constant that controls the trade-off between
maximizing the margin and minimizing the loss. For example, if C → 0, then the
loss component essentially disappears, and the objective defaults to maximizing
the margin. On the other hand, if C → ∞, then the margin ceases to have much
effect, and the objective function tries to minimize the loss.
SVM: Soft Margin Loss Function
The constant k governs the form of the loss. When k = 1, called hinge loss, the
goal is to minimize the sum of the slack variables, whereas when k = 2, called
quadratic loss, the goal is to minimize the sum of the squared slack variables.

Hinge Loss: Assuming k = 1, the SVM dual Lagrangian is given as

max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j

The only difference from the separable case is that the constraint becomes 0 ≤ α_i ≤ C.

Quadratic Loss: Assuming k = 2, the dual objective is:

max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i^T x_j + (1/2C) δ_ij)

where δ is the Kronecker delta function, defined as δ_ij = 1 if and only if i = j (and δ_ij = 0 otherwise).
Example Dataset: Linearly Non-separable Case

x_i     x_i1   x_i2   y_i
x_1     3.5    4.25   +1
x_2     4      3      +1
x_3     4      4      +1
x_4     4.5    1.75   +1
x_5     4.9    4.5    +1
x_6     5      4      +1
x_7     5.5    2.5    +1
x_8     5.5    3.5    +1
x_9     0.5    1.5    −1
x_10    1      2.5    −1
x_11    1.25   0.5    −1
x_12    1.5    1.5    −1
x_13    2      2      −1
x_14    2.5    0.75   −1
x_15    4      2      +1
x_16    2      3      +1
x_17    3      2      −1
x_18    5      3      −1

[Figure: scatter plot of the dataset; circles mark the positive class and triangles the
negative class. The added points x_15–x_18 make the data linearly non-separable.]
Example Dataset: Linearly Non-separable Case
Let k = 1 and C = 1. Solving the L_dual yields the following support vectors
and Lagrangian values α_i:

x_i     x_i1   x_i2   y_i   α_i
x_1     3.5    4.25   +1    0.0271
x_2     4      3      +1    0.2162
x_4     4.5    1.75   +1    0.9928
x_13    2      2      −1    0.9928
x_14    2.5    0.75   −1    0.2434
x_15    4      2      +1    1
x_16    2      3      +1    1
x_17    3      2      −1    1
x_18    5      3      −1    1

The optimal hyperplane is given as follows:

h(x) = (0.834, 0.333) x − 3.334 = 0
Example Dataset: Linearly Non-separable Case
The slack ξ_i = 0 for all points that are not support vectors, and also for those
support vectors that are on the margin. Slack is positive only for the remaining
support vectors, and it can be computed as ξ_i = 1 − y_i (w^T x_i + b).
Thus, for all support vectors not on the margin, we have:

x_i     w^T x_i   w^T x_i + b   ξ_i = 1 − y_i (w^T x_i + b)
x_15    4.001     0.667         0.333
x_16    2.667     −0.667        1.667
x_17    3.167     −0.167        0.833
x_18    5.168     1.834         2.834

The total slack is given as

Σ_i ξ_i = ξ_15 + ξ_16 + ξ_17 + ξ_18 = 0.333 + 1.667 + 0.833 + 2.834 = 5.667

The slack variable ξ_i > 1 for those points that are misclassified (i.e., are on the
wrong side of the hyperplane), namely x_16 = (2, 3)^T and x_18 = (5, 3)^T. The other
two points are correctly classified, but lie within the margin, and thus satisfy
0 < ξ_i < 1.
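A quick NumPy check of the slack values above (my own sketch):

import numpy as np

w, b = np.array([0.834, 0.333]), -3.334
X_sv = np.array([[4, 2], [2, 3], [3, 2], [5, 3]])    # x15, x16, x17, x18
y_sv = np.array([+1, +1, -1, -1])

xi = np.maximum(0.0, 1.0 - y_sv * (X_sv @ w + b))
print(xi, xi.sum())                                  # ≈ [0.333 1.667 0.833 2.834], total ≈ 5.667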
Kernel SVM: Nonlinear Case

The linear SVM approach can be used for datasets with a nonlinear decision
boundary via the kernel trick.
Conceptually, the idea is to map the original d-dimensional points x_i in the input
space to points φ(x_i) in a high-dimensional feature space via some nonlinear
transformation φ.
Given the extra flexibility, it is more likely that the points φ(x_i) might be linearly
separable in the feature space.
A linear decision surface in feature space actually corresponds to a nonlinear
decision surface in the input space.
Further, the kernel trick allows us to carry out all operations via the kernel function
computed in input space, rather than having to map the points into feature space.

Nonlinear SVM
There is no linear classifier that can discriminate between the points. However,
there exists a perfect quadratic classifier that can separate the two classes, using,
for example, the mapping

φ(x) = (√2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)^T

[Figure: a 2D dataset that is not linearly separable; the two classes (circles and triangles)
are separated by a quadratic (elliptical) decision boundary in the input space.]
Nonlinear SVMs: Kernel Trick

To apply the kernel trick for nonlinear SVM classification, we have to show that
all operations require only the kernel function:

K(x_i, x_j) = φ(x_i)^T φ(x_j)

Applying φ to each point, we can obtain the new dataset in the feature space,
D_φ = {(φ(x_i), y_i)}_{i=1}^n.
The SVM objective function in feature space is given as

Objective Function: min_{w,b,ξ_i} { ‖w‖²/2 + C Σ_{i=1}^n (ξ_i)^k }
Linear Constraints: y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, and ξ_i ≥ 0, for all x_i ∈ D

where w is the weight vector, b is the bias, and ξ_i are the slack variables, all in
feature space.
Nonlinear SVMs: Kernel Trick

For hinge loss, the dual Lagrangian in feature space is given as

max_α L_dual = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j)
             = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)

subject to the constraints 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0.
The dual Lagrangian depends only on the dot product between two vectors in
feature space, φ(x_i)^T φ(x_j) = K(x_i, x_j), and thus we can solve the optimization
problem using the kernel matrix K = {K(x_i, x_j)}_{i,j=1,...,n}.
For quadratic loss, the dual Lagrangian corresponds to the use of a new kernel

K_q(x_i, x_j) = φ(x_i)^T φ(x_j) + (1/2C) δ_ij = K(x_i, x_j) + (1/2C) δ_ij
Nonlinear SVMs: Weight Vector and Bias
We cannot directly obtain the weight vector without transforming the points, since

w = Σ_{α_i > 0} α_i y_i φ(x_i)

However, we can compute the bias via kernel operations, since

b_i = y_i − w^T φ(x_i) = y_i − Σ_{α_j > 0} α_j y_j K(x_j, x_i)

Likewise, we can predict the class for a new point z as follows:

ŷ = sign(w^T φ(z) + b) = sign( Σ_{α_i > 0} α_i y_i K(x_i, z) + b )

All SVM operations can be carried out in terms of the kernel function
K(x_i, x_j) = φ(x_i)^T φ(x_j). Thus, any nonlinear kernel function can be used to do
nonlinear classification in the input space.
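A kernel-only prediction sketch (my own illustration; the support vectors, multipliers, bias, and kernel are assumed to come from a trained model):

import numpy as np

def kernel_predict(z, X_sv, y_sv, alpha_sv, b, kernel):
    # y_hat = sign( sum_i alpha_i y_i K(x_i, z) + b ), using only kernel evaluations.
    s = sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)

# Example kernel: the inhomogeneous quadratic kernel used in the next example.
quad_kernel = lambda xi, xj: (1.0 + np.dot(xi, xj)) ** 2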
Nonlinear SVM: Inhomogeneous Quadratic Kernel
[Figure: the same 2D dataset as before, now with the elliptical decision boundary found by
the kernel SVM drawn in the input space.]

The optimal quadratic hyperplane is obtained by setting C = 4, and using an
inhomogeneous polynomial kernel of degree q = 2:

K(x_i, x_j) = φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)²
Nonlinear SVM: Inhomogeneous Quadratic Kernel
φ maps x_i into feature space as follows:

φ(x = (x_1, x_2)^T) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)^T

For example, x_1 = (1, 2)^T is transformed into

φ(x_1) = (1, √2 · 1, √2 · 2, 1², 2², √2 · 1 · 2)^T = (1, 1.41, 2.83, 1, 4, 2.83)^T

Solving L_dual, we found the following six support vectors:

x_i   (x_i1, x_i2)^T   φ(x_i)                               y_i   α_i
x_1   (1, 2)^T         (1, 1.41, 2.83, 1, 4, 2.83)^T        +1    0.6198
x_2   (4, 1)^T         (1, 5.66, 1.41, 16, 1, 5.66)^T       +1    2.069
x_3   (6, 4.5)^T       (1, 8.49, 6.36, 36, 20.25, 38.18)^T  +1    3.803
x_4   (7, 2)^T         (1, 9.90, 2.83, 49, 4, 19.80)^T      +1    0.3182
x_5   (4, 4)^T         (1, 5.66, 5.66, 16, 16, 15.91)^T     −1    2.9598
x_6   (6, 3)^T         (1, 8.49, 4.24, 36, 9, 25.46)^T      −1    3.8502
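A NumPy sketch (my own) of the feature map, checking that it reproduces the kernel value:

import numpy as np

def phi(x):
    # Feature map for the inhomogeneous quadratic kernel in d = 2.
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

x, z = np.array([1.0, 2.0]), np.array([4.0, 1.0])
print(phi(x))                                             # ≈ [1, 1.41, 2.83, 1, 4, 2.83]
print(np.dot(phi(x), phi(z)), (1 + np.dot(x, z)) ** 2)    # both 49.0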
Nonlinear SVM: Inhomogeneous Quadratic Kernel

We compute the weight vector for the hyperplane in feature space:

w = Σ_{α_i > 0} α_i y_i φ(x_i) = (0, −1.413, −3.298, 0.256, 0.82, −0.018)^T

as well as the bias:

b = −8.841

The decision boundary in input space corresponds to an ellipse, centered at
(4.046, 2.907), with axis lengths 2.78 and 1.55.
Notice that we explicitly transformed all the points into the feature space just for
illustration purposes.
The kernel trick allows us to achieve the same goal using only the kernel function.
SVM Training Algorithms
Instead of dealing explicitly with the bias b, we map each point x_i ∈ R^d to the
point x'_i ∈ R^{d+1} as follows:

x'_i = (x_i1, ..., x_id, 1)^T

We also map the weight vector to R^{d+1}, with w_{d+1} = b, so that

w = (w_1, ..., w_d, b)^T

The equation of the hyperplane is then given as follows:

h(x') : w^T x' = w_1 x_1 + · · · + w_d x_d + b = 0

After the mapping, the constraint Σ_{i=1}^n α_i y_i = 0 no longer applies in the SVM dual
formulations. The new set of constraints is given as

y_i w^T x'_i ≥ 1 − ξ_i
Dual Optimization: Gradient Ascent
The dual optimization objective for hinge loss is given as

max_α J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)

subject to the constraints 0 ≤ α_i ≤ C for all i = 1, ..., n. Here
α = (α_1, α_2, ..., α_n)^T ∈ R^n.
The gradient, or the rate of change of the objective function at α, is given as the
vector of partial derivatives of J(α) with respect to each α_k:

∇J(α) = ( ∂J(α)/∂α_1, ∂J(α)/∂α_2, ..., ∂J(α)/∂α_n )^T

where the kth component of the gradient is obtained by differentiating J(α) with
respect to α_k:

∂J(α)/∂α_k = 1 − y_k Σ_{i=1}^n α_i y_i K(x_i, x_k)
Stochastic Gradient Ascent
Starting from an initial α, the gradient ascent approach successively updates it by
moving in the direction of the gradient ∇J(α):

α_{t+1} = α_t + η_t ∇J(α_t)

where α_t is the estimate at the tth step, and η_t is the step size.
For the kth component, the optimal step size is:

η_k = 1 / K(x_k, x_k)

Instead of updating the entire α vector in each step, in the stochastic gradient
ascent approach we update each component α_k independently and immediately
use the new value to update the other components. The update rule for the kth
component is given as

α_k = α_k + η_k ∂J(α)/∂α_k = α_k + η_k ( 1 − y_k Σ_{i=1}^n α_i y_i K(x_i, x_k) )
Algorithm SVM-Dual
SVM-Dual (D, K, C, ε):
 1  foreach x_i ∈ D do x_i ← (x_i^T, 1)^T            // map to R^{d+1}
 2  if loss = hinge then
 3      K ← {K(x_i, x_j)}_{i,j=1,...,n}              // kernel matrix, hinge loss
 4  else if loss = quadratic then
 5      K ← {K(x_i, x_j) + (1/2C) δ_ij}_{i,j=1,...,n}    // kernel matrix, quadratic loss
 6  for k = 1, ..., n do η_k ← 1 / K(x_k, x_k)       // step sizes
 7  t ← 0
 8  α_0 ← (0, ..., 0)^T
 9  repeat
10      α ← α_t
11      for k = 1 to n do
12          α_k ← α_k + η_k ( 1 − y_k Σ_{i=1}^n α_i y_i K(x_i, x_k) )   // update kth component of α
13          if α_k < 0 then α_k ← 0
14          if α_k > C then α_k ← C
15      α_{t+1} ← α
16      t ← t + 1
17  until ‖α_t − α_{t−1}‖ ≤ ε
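A compact NumPy rendering of this pseudocode (a sketch of my own, assuming y is a ±1 array and kernel is any function of two vectors):

import numpy as np

def svm_dual(X, y, kernel, C, eps=1e-4, loss="hinge"):
    # Stochastic gradient ascent on the SVM dual, following the pseudocode above.
    Xh = np.hstack([X, np.ones((len(X), 1))])        # map each point to (x_i, 1)
    n = len(Xh)
    K = np.array([[kernel(Xh[i], Xh[j]) for j in range(n)] for i in range(n)])
    if loss == "quadratic":
        K = K + np.eye(n) / (2 * C)                  # quadratic-loss kernel matrix
    eta = 1.0 / np.diag(K)                           # step sizes eta_k = 1 / K(x_k, x_k)
    alpha = np.zeros(n)
    while True:
        alpha_prev = alpha.copy()
        for k in range(n):
            alpha[k] += eta[k] * (1 - y[k] * np.sum(alpha * y * K[:, k]))
            alpha[k] = min(max(alpha[k], 0.0), C)    # clip alpha_k to [0, C]
        if np.linalg.norm(alpha - alpha_prev) <= eps:
            return alpha

# Hypothetical usage with a linear kernel:
# alpha = svm_dual(X, y, kernel=np.dot, C=1.0)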
SVM Dual Algorithm: Iris Data – Linear Kernel
c1 : Iris-setosa (circles) and c2 : other types of Iris flowers (triangles)
[Figure: the Iris data in the (X1, X2) plane, with circles for c1 and triangles for c2, and the
two hyperplanes h10 and h1000.]
Hyperplane h10 uses C = 10 and h1000 uses C = 1000:

h10(x) : 2.74 x_1 − 3.74 x_2 − 3.09 = 0
h1000(x) : 8.56 x_1 − 7.14 x_2 − 23.12 = 0

h10 has a larger margin, but also a larger slack; h1000 has a smaller margin, but it
minimizes the slack.
SVM Dual Algorithm: Quadratic versus Linear Kernel
c1 : Iris-versicolor (circles) and c2 : other types of Iris flowers (triangles)

[Figure: the Iris data in the (u1, u2) plane, with circles for c1 and triangles for c2; hl is the
linear kernel decision boundary and hq is the quadratic kernel decision boundary.]
Primal Solution: Newton Optimization

Consider the primal optimization function for soft margin SVMs. With
w, x_i ∈ R^{d+1}, we have to minimize the objective function:

min_w J(w) = (1/2) ‖w‖² + C Σ_{i=1}^n (ξ_i)^k

subject to the linear constraints:

y_i (w^T x_i) ≥ 1 − ξ_i and ξ_i ≥ 0, for all i = 1, ..., n

Rearranging the above, we obtain an expression for ξ_i:

ξ_i ≥ 1 − y_i (w^T x_i) and ξ_i ≥ 0, which implies that

ξ_i = max{ 0, 1 − y_i (w^T x_i) }
Primal Solution: Newton Optimization, Quadratic Loss
The objective function can be rewritten as

J(w) = (1/2) ‖w‖² + C Σ_{i=1}^n max{ 0, 1 − y_i (w^T x_i) }^k
     = (1/2) ‖w‖² + C Σ_{y_i (w^T x_i) < 1} ( 1 − y_i (w^T x_i) )^k

For quadratic loss, we have k = 2, and the gradient or the rate of change of the
objective function at w is given as the partial derivative of J(w) with respect to w:

∇_w = ∂J(w)/∂w = w − 2C v + 2C S w

where the vector v and the matrix S are given as

v = Σ_{y_i (w^T x_i) < 1} y_i x_i        S = Σ_{y_i (w^T x_i) < 1} x_i x_i^T
Primal Solution: Newton Optimization, Quadratic Loss

The Hessian matrix is defined as the matrix of second-order partial derivatives of
J(w) with respect to w, which is given as

H_w = ∂∇_w/∂w = I + 2C S

Because we want to minimize the objective function J(w), we should move in the
direction opposite to the gradient. The Newton optimization update rule for w is
given as

w_{t+1} = w_t − η_t H_{w_t}^{−1} ∇_{w_t}

where η_t > 0 is a scalar value denoting the step size at iteration t.
Primal SVM Algorithm
SVM-Primal (D, C, ε):
 1  foreach x_i ∈ D do
 2      x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
 3  t ← 0
 4  w_0 ← (0, ..., 0)^T                             // initialize w_t ∈ R^{d+1}
 5  repeat
 6      v ← Σ_{y_i (w_t^T x_i) < 1} y_i x_i
 7      S ← Σ_{y_i (w_t^T x_i) < 1} x_i x_i^T
 8      ∇ ← (I + 2C S) w_t − 2C v                   // gradient
 9      H ← I + 2C S                                // Hessian
10      w_{t+1} ← w_t − η_t H^{−1} ∇                // Newton update rule
11      t ← t + 1
12  until ‖w_t − w_{t−1}‖ ≤ ε
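A NumPy rendering of the primal Newton iteration with quadratic loss (my own sketch; the fixed step size eta is an assumption, since the slides leave η_t unspecified):

import numpy as np

def svm_primal(X, y, C, eta=1.0, eps=1e-6, max_iter=100):
    # Newton's method on the soft margin primal with quadratic loss (see pseudocode above).
    Xh = np.hstack([X, np.ones((len(X), 1))])       # map each point to (x_i, 1)
    n, d1 = Xh.shape
    w = np.zeros(d1)
    for _ in range(max_iter):
        viol = y * (Xh @ w) < 1                     # points with y_i w^T x_i < 1
        v = (y[viol][:, None] * Xh[viol]).sum(axis=0)
        S = Xh[viol].T @ Xh[viol]
        grad = (np.eye(d1) + 2 * C * S) @ w - 2 * C * v
        H = np.eye(d1) + 2 * C * S
        w_new = w - eta * np.linalg.solve(H, grad)  # Newton update
        if np.linalg.norm(w_new - w) <= eps:
            return w_new
        w = w_new
    return w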
SVMs: Dual and Primal Solutions
c1 : Iris-setosa (circles) and c2 : other types of Iris flowers (triangles)

[Figure: the Iris data in the (X1, X2) plane, with circles for c1 and triangles for c2; the dual
solution hd and the primal solution hp essentially coincide.]
SVM Primal Kernel Algorithm: Newton Optimization
The linear soft margin primal algorithm, with quadratic loss, can easily be
extended to work on any kernel matrix K (here K_i denotes the ith row of the
kernel matrix, and β ∈ R^n plays the role of w):

SVM-Primal-Kernel (D, K, C, ε):
 1  foreach x_i ∈ D do
 2      x_i ← (x_i^T, 1)^T                          // map to R^{d+1}
 3  K ← {K(x_i, x_j)}_{i,j=1,...,n}                 // compute kernel matrix
 4  t ← 0
 5  β_0 ← (0, ..., 0)^T                             // initialize β_t ∈ R^n
 6  repeat
 7      v ← Σ_{y_i (K_i^T β_t) < 1} y_i K_i
 8      S ← Σ_{y_i (K_i^T β_t) < 1} K_i K_i^T
 9      ∇ ← (K + 2C S) β_t − 2C v                   // gradient
10      H ← K + 2C S                                // Hessian
11      β_{t+1} ← β_t − η_t H^{−1} ∇                // Newton update rule
12      t ← t + 1
13  until ‖β_t − β_{t−1}‖ ≤ ε
SVM Quadratic Kernel: Dual and Primal Solutions
c1 : Iris-versicolor (circles) and c2 : other types of Iris flowers (triangles)

[Figure: the Iris data in the (u1, u2) plane, with circles for c1 and triangles for c2; hd and hp
are the quadratic kernel decision boundaries obtained from the dual and primal solutions.]

