
Working with Support Vector Machines (Lecture 15)

Machine Learning for Real-World Applications

Date 20-Sep-2021

Copyright © 2021 Tata Consultancy Services Limited

Binary Classification

• Consider an input space 𝒳 ⊆ ℝN with N ≥ 1, and the output or target space 𝒴 = {−1, +1}.

• Let f : 𝒳 → 𝒴 be the target function.

• Given a hypothesis set H of functions mapping 𝒳 to 𝒴.

• The learner uses a training set and selects a hypothesis from H.

(Figure: Training set → Learner → selects a hypothesis from the hypothesis set H)
Input to the binary classification task

• Consider an input space 𝒳 ⊆ ℝN with N ≥ 1, and the output or target space 𝒴 = {−1, +1}.

• Let f : 𝒳 → 𝒴 be the target function.

• The learner receives a training set S of m examples drawn i.i.d. from 𝒳 according to some unknown distribution D:

  S = {(x1, y1), (x2, y2), …, (xm, ym)} ∈ (𝒳 × 𝒴)m,  with yi = f(xi) for all i ∈ [1, m].
Choosing a hypothesis for binary classification

• The learner chooses a hypothesis h ∈ H, a binary classifier, with small generalisation error.

• The generalisation error is formulated as the error rate under the actual data-generating distribution D (which is unknown):

  P_{x∼D}[h(x) ≠ f(x)]
Linear Classification

• Different hypothesis sets H can be selected for this task.


• The simplest hypothesis class is that of linear classifiers, or hyperplanes
H = {x ↦ sign(w ⋅ x + b) : w ∈ ℝN , b ∈ ℝ}

• The hypothesis h(x) ≡ sign(w ⋅ x + b) labels the points falling on one side of the hyperplane w ⋅ x + b = 0 as positive and the points on the other side as negative (a minimal sketch of this decision rule follows below).
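As an illustration (not part of the slides), the decision rule of such a linear classifier takes only a few lines of NumPy; the function name predict and the toy numbers are assumptions.

```python
import numpy as np

def predict(X, w, b):
    """Label each row of X as +1 or -1 according to sign(w . x + b)."""
    scores = X @ w + b              # proportional to the signed distance to the hyperplane
    return np.where(scores >= 0.0, 1, -1)

# Example: a hyperplane in R^2
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 1.0], [0.0, 3.0]])
print(predict(X, w, b))             # [ 1 -1]
```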

Linear Classification

(Figure: a separating hyperplane w ⋅ x + b = 0, with positively labelled points on one side and negatively labelled points on the other.)
SVMs − The separable case

SVMs − The separable case

• The general equation of a hyperplane in ℝN is

  w ⋅ x + b = 0,  where w ∈ ℝN, x ∈ ℝN and b ∈ ℝ is a scalar.

• Multiplying this equation by a non-zero scalar does not change the hyperplane.

• So we can scale w and b appropriately such that  min_{(x,y)∈S} |w ⋅ x + b| = 1.

• We call this representation (w, b) of the hyperplane the canonical hyperplane. For a canonical hyperplane, |w ⋅ xi + b| ≥ 1 for all i ∈ [1, m].
SVMs − The separable case

• The distance of any point x0 ∈ ℝN to the hyperplane is given by

  |w ⋅ x0 + b| / ∥w∥

• Thus, for a canonical hyperplane, the margin ρ is given by

  ρ = min_{(x,y)∈S} |w ⋅ x + b| / ∥w∥ = 1 / ∥w∥
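A quick numeric check of the distance formula above (the numbers are illustrative, not from the slides):

```python
import numpy as np

def distance_to_hyperplane(x0, w, b):
    """|w . x0 + b| / ||w||: distance of x0 to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0
print(distance_to_hyperplane(np.array([0.0, 0.0]), w, b))   # |-5| / 5 = 1.0
```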
SVMs − The separable case

• When a training point xi is correctly classified by the hyperplane defined by (w, b), then (w ⋅ xi + b) has the same sign as yi.

• Combined with the canonical-hyperplane condition |w ⋅ xi + b| ≥ 1, correct classification of every training point can be written as yi(w ⋅ xi + b) ≥ 1.

• Maximizing the margin ρ = 1/∥w∥ of a canonical hyperplane is therefore equivalent to minimizing ∥w∥, or equivalently (1/2)∥w∥².
SVMs − The separable case

• The SVM solution is the hyperplane that maximises the margin while correctly classifying all the training points:

  min_{w,b}  (1/2)∥w∥²  subject to  yi(w ⋅ xi + b) ≥ 1, ∀i ∈ [1, m]

• The objective function F : w ↦ (1/2)∥w∥² is infinitely differentiable.

• We have ∇F(w) = w and ∇²F(w) = I.
SVMs − The separable case

• Since the identity matrix is positive definite, its eigenvalues are strictly positive and therefore the function F is strictly convex.
• The constraints are affine functions: gi(w, b) ≤ 0, where gi ≡ 1 − yi(w ⋅ xi + b).
• Therefore, the optimisation problem admits a unique solution.
• The optimisation problem is a specific instance of quadratic programming (QP).
• Special QP solvers, such as the block coordinate descent algorithm, can be used; a toy sanity check with a general-purpose solver is sketched below.
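The lecture points to dedicated QP solvers; purely as an assumed sanity check, the primal problem can also be handed to a general-purpose constrained optimiser. The toy data, variable names and solver choice (SciPy's SLSQP) below are illustrative, not the method the lecture describes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):
    # z = (w, b); minimise (1/2) * ||w||^2
    w = z[:-1]
    return 0.5 * np.dot(w, w)

# One inequality constraint y_i (w . x_i + b) - 1 >= 0 per training point.
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:-1] + z[-1]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b)   # maximum-margin hyperplane parameters for the toy data
```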

SVMs − The Lagrangian

• Since the constraints are convex and differentiable, we can introduce Lagrange variables αi ≥ 0, i ∈ [1, m], one for each of the m constraints, and denote by α the vector α = (α1, α2, …, αm)⊤.

• The Lagrangian can then be defined, for all w ∈ ℝN, b ∈ ℝ, and α ∈ ℝm with αi ≥ 0, by

  ℒ(w, b, α) = (1/2)∥w∥² − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1]
SVMs − The Support Vectors

• The KKT (Karush-Kuhn-Tucker) conditions apply at the optimum point.

• The KKT conditions are obtained by


- setting the gradient of the Lagrangian with respect to the
primal variables w and b to zero, and
- by writing the complementarity conditions:

SVMs − The Support Vectors

  ∇w ℒ = w − Σ_{i=1}^{m} αi yi xi = 0   ⟹   w = Σ_{i=1}^{m} αi yi xi

  ∇b ℒ = − Σ_{i=1}^{m} αi yi = 0   ⟹   Σ_{i=1}^{m} αi yi = 0

  ∀i,  αi [yi(w ⋅ xi + b) − 1] = 0   ⟹   αi = 0  ∨  yi(w ⋅ xi + b) = 1
SVMs − The Support Vectors

• The weight vector w solving the SVM problem is a linear combination of the training set vectors x1, …, xm:

  w = Σ_{i=1}^{m} αi yi xi

• A vector xi appears in that expansion iff αi ≠ 0. Such vectors are called support vectors.

• By the complementarity condition, if αi ≠ 0, then yi(w ⋅ xi + b) = 1. Thus, support vectors lie on the marginal hyperplanes w ⋅ xi + b = ±1.
SVMs − The Dual Formulation

• Recall the Lagrangian, defined for all w ∈ ℝN, b ∈ ℝ, and α ∈ ℝm with αi ≥ 0 (to be minimized over w and b):

  ℒ(w, b, α) = (1/2)∥w∥² − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1]

• Substituting w = Σ_{i=1}^{m} αi yi xi and rearranging, we get

  max_α  Σ_{i=1}^{m} αi − (1/2) Σ_{i,j} yi yj αi αj ⟨xi, xj⟩

• This is the dual formulation of the SVM.
SVMs − The Dual Formulation

  max_α  Σ_{i=1}^{m} αi − (1/2) Σ_{i,j} yi yj αi αj ⟨xi, xj⟩

• The dual formulation of the SVM is parameterised by the unknown vector α.

• Also note that the inputs occur only through the inner products ⟨xi, xj⟩.

• The optimization can be expressed as a standard quadratic programming problem. Let H be the matrix with entries Hij = yi yj ⟨xi, xj⟩.
SVMs − The Dual Formulation

  max_α  Σ_{i=1}^{m} αi − (1/2) Σ_{i,j} yi yj αi αj ⟨xi, xj⟩

• With H defined by Hij = yi yj ⟨xi, xj⟩, the optimization becomes

  max_α  Σ_{i=1}^{m} αi − (1/2) α⊤ H α

  such that  αi ≥ 0 ∀i  and  Σ_i αi yi = 0
SVMs − The Dual Formulation
• Converting the dual maximisation into a minimisation puts it in the standard QP form:

  SVM dual:                          Equivalent minimisation:           Standard QP form:
  max_α  Σ_i αi − (1/2) α⊤ H α       min_α  (1/2) α⊤ H α − 1⊤α          min_x  (1/2) x⊤ P x + q⊤x
  such that                          such that                          such that
    αi ≥ 0  ∀i                         −α ≤ 0  (i.e. −αi ≤ 0 ∀i)          G x ≤ h
    Σ_i αi yi = 0                      y⊤α = 0                            A x = b
SVMs − The Dual Formulation

  SVM dual (minimisation form):      Standard QP form:
  min_α  (1/2) α⊤ H α − 1⊤α          min_x  (1/2) x⊤ P x + q⊤x
  such that                          such that
    −α ≤ 0  (i.e. −αi ≤ 0 ∀i)          G x ≤ h
    y⊤α = 0                            A x = b

  Mapping to the standard form:
    P ≡ H            (size m × m)
    q ≡ −1           (vector of −1s, size m × 1)
    G ≡ −diag[1]     (diagonal matrix of −1s, size m × m)
    h ≡ 0            (size m × 1)
    A ≡ y⊤           (size 1 × m)
    b ≡ 0            (scalar)
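As a concrete illustration (not part of the slides), the mapping above can be handed directly to a generic QP solver. The sketch below uses the cvxopt package; the toy data and variable names are assumptions made for the example.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy, linearly separable training set (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# H_ij = y_i y_j <x_i, x_j>
H = (y[:, None] * X) @ (y[:, None] * X).T

# Standard QP form: min (1/2) x^T P x + q^T x  s.t.  G x <= h, A x = b
P = matrix(H)
q = matrix(-np.ones(m))
G = matrix(-np.eye(m))            # encodes -alpha_i <= 0
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, -1))      # encodes y^T alpha = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h, A, b)
alpha = np.ravel(sol["x"])
print(alpha)                      # nonzero entries mark the support vectors
```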
SVMs − The Dual Formulation

• The quadratic programming solver gives us the value of the unknown vector α.

• Using α we compute the hyperplane parameters w and b:

  w = Σ_{i=1}^{m} αi yi xi

  yi = ⟨w, xi⟩ + b  (any support vector xi satisfies this equation, since it lies on a marginal hyperplane)

  therefore  b = yi − ⟨w, xi⟩
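Continuing the illustrative cvxopt sketch above (the 1e-6 tolerance for picking support vectors is an assumption):

```python
# Support vectors: training points with alpha_i significantly greater than zero.
sv = alpha > 1e-6
w = ((alpha * y)[:, None] * X).sum(axis=0)
# Each support vector lies on a marginal hyperplane: y_i = <w, x_i> + b.
b = np.mean(y[sv] - X[sv] @ w)     # averaging improves numerical stability
print(w, b)
```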
Mapping from Input space to Feature space

(Figure: a mapping ϕ from the input space to the feature space.)
Kernel Function

• A kernel function computes an inner product in the feature space, i.e., the similarity of points after they have been mapped to the feature space.

• Different kernel functions induce different notions of similarity in the feature space.

• Example kernels: linear, polynomial, RBF (Radial Basis Function), etc. (see the sketch below).
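A minimal sketch of the three kernels named above, using plain NumPy; the parameterisations (degree, c, gamma) follow common conventions and are not taken from the slides.

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, c=1.0):
    # (<x, z> + c) ** degree
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```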
Non-Separable Case for Binary Classification
SVMs − Non-separable Case

• In most practical settings, the training data is not linearly separable: for any hyperplane w ⋅ x + b = 0, there exists xi ∈ S such that

  yi(w ⋅ xi + b) ≱ 1

• Thus, the constraints imposed in the linearly separable case, i.e. yi(w ⋅ xi + b) ≥ 1, do not all hold.

• However, a relaxed version of these constraints can still hold:

  yi(w ⋅ xi + b) ≥ 1 − ξi  for each i ∈ [1, m], with ξi ≥ 0
SVMs − The Non-Separable Case

(Figure: a separating hyperplane with two outliers at distances ξi and ξj beyond their marginal hyperplanes.)
SVMs − Non-separable case

• The variables ξi are known as slack variables.

• They measure the distance by which the vector xi violates the desired inequality.

• That is, the formulation allows certain outlier points, those with ξi > 0; these are the points placed on the wrong side of their marginal hyperplane (a small numeric sketch follows below).
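As a small assumed illustration, the slack required by each training point for a given (w, b) is exactly max(0, 1 − yi(w ⋅ xi + b)):

```python
import numpy as np

def slacks(X, y, w, b):
    """Slack xi_i = max(0, 1 - y_i (w . x_i + b)) for each training point."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

# Points with zero slack satisfy the hard-margin constraint;
# points with positive slack are the outliers described above.
```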

SVMs − Non-separable case
• A vector that is correctly classified by the separating hyperplane can still be an outlier if it lies on the wrong side of its marginal hyperplane.

• For the separable case, we say that the training data is separated by a hard margin; for the non-separable case, we say that it is separated by a soft margin.

(Figure: outliers at distances ξi and ξj from their marginal hyperplanes.)
SVMs − Optimization in the non-separable case

• How should we select the hyperplane in the non-separable case?

• Objective 1: We seek to limit the total amount of slack due to outliers, which can be measured by Σ_{i=1}^{m} ξi.

• Objective 2: We seek a hyperplane with a large margin, though a larger margin can lead to more outliers and thus a larger amount of slack.

• These are two conflicting objectives.
SVMs − Formulation of Optimization
• The objective function:

  min_{w,b,ξ}  (1/2)∥w∥² + C Σ_{i=1}^{m} (ξi)^p

  subject to  yi(w ⋅ xi + b) ≥ 1 − ξi  ∧  ξi ≥ 0,  i ∈ [1, m]

• The parameter C is typically determined by n-fold cross-validation.

• This is a convex optimisation problem, since the constraints are affine (and thus convex) and the objective function is convex for any p ≥ 1: the sum Σ_{i=1}^{m} (ξi)^p = ∥ξ∥_p^p is convex in view of the convexity of the norm ∥ ⋅ ∥_p.
SVMs − The Support Vectors

• The loss function corresponding to p = 1 is called the hinge loss.

• The loss function corresponding to p = 2 is called the quadratic hinge loss.

• Both hinge losses are convex upper bounds on the zero-one loss, thus making them well suited for optimisation (a small numerical sketch follows below).
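A minimal sketch (illustrative, not from the slides) of the three losses shown in the figure on the next slide, as functions of the score yi(w ⋅ xi + b):

```python
import numpy as np

def zero_one_loss(margin):
    return np.where(margin <= 0.0, 1.0, 0.0)

def hinge_loss(margin):                   # p = 1
    return np.maximum(0.0, 1.0 - margin)

def quadratic_hinge_loss(margin):         # p = 2
    return np.maximum(0.0, 1.0 - margin) ** 2

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge_loss(margins))                # [2.  1.  0.5 0.  0. ]
```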

SVMs − The Support Vectors

(Figure: the 0/1 loss, the hinge loss ξ (p = 1), and the quadratic hinge loss ξ² (p = 2), plotted against w ⋅ x + b.)
SVMs − The Non-Separable Case

• The objective function as well as the affine constraints are convex and differentiable.

• Thus, the KKT conditions apply at the optimum.

• We introduce Lagrange variables αi ≥ 0, i ∈ [1, m], associated with the first m constraints, and βi ≥ 0, i ∈ [1, m], associated with the non-negativity constraints on the slack variables.

• The Lagrangian can be defined as:

  ℒ(w, b, ξ, α, β) = (1/2)∥w∥² + C Σ_{i=1}^{m} ξi − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1 + ξi] − Σ_{i=1}^{m} βi ξi
SVMs − The Support Vectors

  ℒ(w, b, ξ, α, β) = (1/2)∥w∥² + C Σ_{i=1}^{m} ξi − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1 + ξi] − Σ_{i=1}^{m} βi ξi

• The KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables w, b and ξi to zero, and by writing the complementarity conditions:
SVMs − The Support Vectors

  ∇w ℒ = w − Σ_{i=1}^{m} αi yi xi = 0   ⟹   w = Σ_{i=1}^{m} αi yi xi

  ∇b ℒ = − Σ_{i=1}^{m} αi yi = 0   ⟹   Σ_{i=1}^{m} αi yi = 0

  ∇ξi ℒ = C − αi − βi = 0   ⟹   αi + βi = C

  ∀i,  αi [yi(w ⋅ xi + b) − 1 + ξi] = 0   ⟹   αi = 0  ∨  yi(w ⋅ xi + b) = 1 − ξi

  ∀i,  βi ξi = 0   ⟹   βi = 0  ∨  ξi = 0
SVMs − The Support Vectors

• Thus, the weight vector w solving the SVM problem is a linear combination of the training set vectors x1, …, xm:

  w = Σ_{i=1}^{m} αi yi xi

• A vector xi appears in that expansion iff αi ≠ 0. Such vectors are called support vectors.
SVMs − The Support Vectors

• Here there are two types of support vectors.

• By the complementarity conditions, if αi ≠ 0, then yi(w ⋅ xi + b) = 1 − ξi:

  if ξi = 0, then yi(w ⋅ xi + b) = 1 and xi lies on a marginal hyperplane;

  otherwise ξi ≠ 0, so βi = 0 (and hence αi = C), and xi is an outlier.
SVMs − The Support Vectors

• Thus, the support vectors are either outliers, in which case αi = C, or vectors
lying on the marginal hyperplanes.
• The solution vector w is unique, while the support vectors are not unique.

(Figure: outliers at distances ξi and ξj from their marginal hyperplanes.)
SVMs − Simplification of Lagrangian

  ℒ(w, b, ξ, α, β) = (1/2)∥w∥² + C Σ_{i=1}^{m} ξi − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1 + ξi] − Σ_{i=1}^{m} βi ξi

       = (1/2)∥w∥² + Σ_{i=1}^{m} ξi (C − αi − βi) − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1]

  Applying the condition αi + βi = C for all i (which, together with αi, βi ≥ 0, yields the box constraints 0 ≤ αi ≤ C), the middle term vanishes:

       = (1/2)∥w∥² − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1]
SVMs − Simplification of Lagrangian

  Substituting w = Σ_{i=1}^{m} αi yi xi into (1/2)∥w∥² − Σ_{i=1}^{m} αi [yi(w ⋅ xi + b) − 1]

  gives the equivalent maximisation

  max_α  Σ_{i=1}^{m} αi − (1/2) α⊤ H α

  subject to  0 ≤ αi ≤ C  ∀i  and  Σ_i αi yi = 0
SVMs − The Dual Formulation solved using a standard quadratic program

  SVM dual (minimisation form):      Standard QP form:
  min_α  (1/2) α⊤ H α − 1⊤α          min_x  (1/2) x⊤ P x + q⊤x
  such that                          such that
    −α ≤ 0  and  αi ≤ C ∀i             G x ≤ h
    y⊤α = 0                            A x = b

  Mapping to the standard form:
    P ≡ H                                                (size m × m)
    q ≡ −1                                               (vector of −1s, size m × 1)
    G ≡ −diag[1] stacked vertically on diag[1]           (size 2m × m)
    h ≡ 0 (m zeros) stacked vertically on C (m copies)   (size 2m × 1)
    A ≡ y⊤                                               (size 1 × m)
    b ≡ 0                                                (scalar)
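Again as an assumed illustration, the soft-margin mapping plugs into the same generic QP solver as before; relative to the hard-margin sketch, only G and h change. The toy data, names and the value of C are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])    # last point sits close to the boundary
m, C = len(y), 1.0

H = (y[:, None] * X) @ (y[:, None] * X).T    # H_ij = y_i y_j <x_i, x_j>

P = matrix(H)
q = matrix(-np.ones(m))
G = matrix(np.vstack([-np.eye(m), np.eye(m)]))         # -alpha <= 0 and alpha <= C
h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
A = matrix(y.reshape(1, -1))                           # y^T alpha = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print(alpha)                                           # box-constrained: 0 <= alpha_i <= C
```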
SVMs − The Dual Formulation

• The quadratic programming solver gives us the value of the unknown vector α.

• Using α we compute the hyperplane parameters w and b:

  w = Σ_{i=1}^{m} αi yi xi

  yi = ⟨w, xi⟩ + b  (any support vector with 0 < αi < C satisfies this, since it lies on a marginal hyperplane with ξi = 0)

  therefore  b = yi − ⟨w, xi⟩
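Continuing the illustrative soft-margin sketch, b is recovered only from the margin support vectors, those with 0 < αi < C; the tolerance used to pick them is an assumption.

```python
# Margin support vectors: 0 < alpha_i < C (they lie exactly on a marginal hyperplane).
eps = 1e-6
on_margin = (alpha > eps) & (alpha < C - eps)

w = ((alpha * y)[:, None] * X).sum(axis=0)
b = np.mean(y[on_margin] - X[on_margin] @ w)
print(w, b)
```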
Mapping from Input space to Feature space

(Figure: a mapping ϕ from the input space to the feature space.)
Thank You
