ML TCS Lecture 15
Date 20-Sep-2021
Binary Classification
[Figure: the learner receives a training set of labelled examples from 𝒳 × 𝒴 and selects a hypothesis h : 𝒳 → 𝒴 from the hypothesis class H.]
Input to the binary classification task
• The learner receives a training set S of m examples drawn i.i.d. according to some unknown distribution D
• The generalisation error is formulated as the error rate on the actual data
generating distribution D (which is unknown).
P_{x∼D}(h(x) ≠ f(x))
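Since D is unknown, in practice this error is estimated on held-out data; a minimal sketch (the names h, X_test, and y_test are illustrative, not from the slides):

```python
import numpy as np

# Empirical estimate of P_{x~D}(h(x) != f(x)) on a held-out test set.
# h: a trained classifier returning +/-1 labels; X_test, y_test: hypothetical test data.
def empirical_error(h, X_test, y_test):
    return np.mean(h(X_test) != y_test)
```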
Linear Classification
• The hypothesis h(x) ≡ sign(w ⋅ x + b) labels the points falling on one side of the hyperplane w ⋅ x + b = 0 as positive and points on the other side as negative.
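A minimal NumPy sketch of this decision rule (w, b, and the sample points below are made-up values for illustration):

```python
import numpy as np

def predict(w, b, X):
    """Label each row of X by the side of the hyperplane w.x + b = 0 it falls on."""
    return np.sign(X @ w + b)

# Hypothetical 2-D example: hyperplane x1 + x2 - 1 = 0.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.0, 0.0]])
print(predict(w, b, X))  # [ 1. -1.]
```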
Linear Classification
[Figure: points in the plane separated by the hyperplane w ⋅ x + b = 0.]
SVMs − The separable case
• The distance of a point x₀ to the hyperplane w ⋅ x + b = 0 is |w ⋅ x₀ + b| / ∥w∥.
• So we can scale w and b appropriately such that min_{(x,y)∈S} |w ⋅ x + b| = 1 (a canonical hyperplane).
• The margin is then ρ = min_{(x,y)∈S} |w ⋅ x + b| / ∥w∥ = 1 / ∥w∥.
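A small sketch computing this geometric margin for given parameters (the function name and inputs are illustrative):

```python
import numpy as np

def margin(w, b, X):
    """Geometric margin: min over the sample of |w.x + b| / ||w||."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)
```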
SVMs − The separable case
• Maximizing the margin ρ = 1/∥w∥ of a canonical hyperplane is equivalent to minimizing ∥w∥, or equivalently (1/2)∥w∥².
SVMs − The separable case
• The SVM solution is a hyperplane which maximises the margin while correctly
classifying all the training points
min_{w,b} (1/2)∥w∥² subject to yᵢ(w ⋅ xᵢ + b) ≥ 1, ∀i ∈ [1,m]
• The objective function F : w ↦ (1/2)∥w∥² is infinitely differentiable.
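For intuition, this primal can be handed to a generic constrained solver; a minimal sketch using scipy.optimize.minimize with SLSQP, assuming linearly separable NumPy arrays X (m × d) and y with ±1 labels (names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def svm_primal(X, y):
    m, d = X.shape
    # z packs the primal variables: z[:d] = w, z[d] = b.
    obj = lambda z: 0.5 * np.dot(z[:d], z[:d])           # (1/2)||w||^2
    cons = {"type": "ineq",                               # y_i (w.x_i + b) - 1 >= 0
            "fun": lambda z: y * (X @ z[:d] + z[d]) - 1}
    res = minimize(obj, np.zeros(d + 1), method="SLSQP", constraints=[cons])
    return res.x[:d], res.x[d]   # w, b
```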
SVMs − The separable case
• The Hessian of F is the identity matrix, which is positive definite (its eigenvalues are strictly positive); therefore F is strictly convex.
• The constraints are affine functions: gᵢ(w, b) ≤ 0 where gᵢ ≡ 1 − yᵢ(w ⋅ xᵢ + b).
• Therefore the optimisation problem admits a unique solution.
• The optimisation problem is a specific instance of quadratic programming (QP).
• Special QP solvers such as the block coordinate descent algorithm can be used.
SVMs − The Lagrangian
• Since the constraints are convex and differentiable, we can introduce Lagrange
variables αi ≥ 0, i ∈ [1,m] for the m constraints and denote by α the vector
α = (α₁, α₂, …, α_m)⊤
• The KKT (Karush Kuhn Tucker) conditions apply at the optimum point.
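For reference, the Lagrangian of the separable-case problem takes the standard form (shown here in LaTeX; this formula is supplied for completeness, not taken from the slides):

```latex
\mathcal{L}(w, b, \alpha)
  = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]
```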
SVMs − The Support Vectors
∇_w ℒ = w − ∑_{i=1}^m αᵢyᵢxᵢ = 0 ⟹ w = ∑_{i=1}^m αᵢyᵢxᵢ
∇_b ℒ = −∑_{i=1}^m αᵢyᵢ = 0 ⟹ ∑_{i=1}^m αᵢyᵢ = 0
SVMs − The Support Vectors
• The weight vector w solution of the SVM is a linear combination of the training set vectors x₁, …, x_m: w = ∑_{i=1}^m αᵢyᵢxᵢ
SVMs − The Dual Formulation
• Substituting w = ∑_{i=1}^m αᵢyᵢxᵢ and after some rearrangement, we get
max_α ∑_{i=1}^m αᵢ − (1/2) ∑_{i,j} yᵢyⱼαᵢαⱼ⟨xᵢ, xⱼ⟩
• Also note that the inputs occur only through the inner product ⟨xᵢ, xⱼ⟩
SVMs − The Dual Formulation
max_α ∑_{i=1}^m αᵢ − (1/2) ∑_{i,j} yᵢyⱼαᵢαⱼ⟨xᵢ, xⱼ⟩
such that αᵢ ≥ 0, ∀i, and ∑ᵢ αᵢyᵢ = 0
SVMs − The Dual Formulation
• Writing Hᵢⱼ ≡ yᵢyⱼ⟨xᵢ, xⱼ⟩, the dual max_α ∑_{i=1}^m αᵢ − (1/2)α⊤Hα is equivalent to min_α (1/2)α⊤Hα − 1⊤α, which matches the standard QP form min_x (1/2)x⊤Px + q⊤x subject to Gx ≤ h and Ax = b.
SVMs − The Dual Formulation
The correspondence between min_α (1/2)α⊤Hα − 1⊤α and the standard form min_x (1/2)x⊤Px + q⊤x:
• P ≡ H (size m × m)
• q ≡ −1 (size m × 1)
• G ≡ −diag[1] (diagonal m × m matrix of −1s), so that Gx ≤ h encodes −αᵢ ≤ 0 ∀i, i.e. −α ≤ 0
• h ≡ 0 (size m × 1)
• A ≡ y⊤ (size 1 × m) and b ≡ 0 (scalar), so that Ax = b encodes y⊤α = 0
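A minimal sketch of this mapping using the cvxopt QP solver, assuming NumPy arrays X of shape m × d and y of ±1 labels (the function name is illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard(X, y):
    m = X.shape[0]
    H = np.outer(y, y) * (X @ X.T)              # H_ij = y_i y_j <x_i, x_j>
    P = matrix(H)
    q = matrix(-np.ones(m))                     # q = -1
    G = matrix(-np.eye(m))                      # encodes -alpha <= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))  # encodes y^T alpha = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
    sv = np.argmax(alpha)                       # index of a support vector
    return w, y[sv] - X[sv] @ w, alpha          # b from y_i = <w, x_i> + b
```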
SVMs − The Dual Formulation
w = ∑_{i=1}^m αᵢyᵢxᵢ
yᵢ = ⟨w, xᵢ⟩ + b, so b = yᵢ − ⟨w, xᵢ⟩ (any of the support vectors xᵢ satisfies this equation)
Mapping from Input space to Feature space
[Figure: a map ϕ from the input space to the feature space.]
Kernel Function
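Since the dual depends on the data only through inner products ⟨xᵢ, xⱼ⟩, a kernel function K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩ can replace them without computing ϕ explicitly. A sketch of two common kernels (degree, c, and gamma are illustrative parameters):

```python
import numpy as np

def polynomial_kernel(X1, X2, degree=2, c=1.0):
    """K(x, x') = (<x, x'> + c)^degree."""
    return (X1 @ X2.T + c) ** degree

def rbf_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

# In the dual, H_ij = y_i y_j K(x_i, x_j) replaces y_i y_j <x_i, x_j>.
```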
Non-Separable Case for Binary Classification
SVMs − Non-separable Case
• In most practical settings, the training data is not linearly separable, i.e., for any hyperplane w ⋅ x + b = 0 there exists xᵢ ∈ S such that yᵢ(w ⋅ xᵢ + b) ≱ 1.
• Thus, the constraints imposed in the linearly separable case, i.e. yᵢ(w ⋅ xᵢ + b) ≥ 1, do not hold.
• However, a relaxed version of these constraints can still hold: yᵢ(w ⋅ xᵢ + b) ≥ 1 − ξᵢ for some slack variable ξᵢ ≥ 0.
SVMs − The Non-Separable Case
[Figure: points violating the margin, with slack variables ξᵢ and ξⱼ marking the violations.]
SVMs − Non-separable case
• The slack variables ξᵢ measure the distance by which the vector xᵢ violates the desired inequality.
• That is, they allow certain outlier points with ξᵢ > 0: points that are placed on the wrong side of the marginal hyperplane.
SVMs − Non-separable case
• A vector that is correctly classified by the separating hyperplane can still be an outlier if it lies on the wrong side of the marginal hyperplane.
• For the separable case, we say that the training data is separated by a hard margin, while for the non-separable case it is separated by a soft margin.
SVMs − Optimization in the non-separable case
• The total amount by which the margin constraints are violated is measured by ∑ᵢ ξᵢ, which gives the soft-margin optimisation problem:
min_{w,b,ξ} (1/2)∥w∥² + C ∑_{i=1}^m ξᵢ subject to yᵢ(w ⋅ xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, ∀i ∈ [1,m]
SVMs − The Support Vectors
• Both the hinge loss and the quadratic hinge loss are convex upper bounds on the zero-one loss, thus making them well suited for optimisation.
SVMs − The Support Vectors
[Figure: the hinge loss and the 0/1 loss as functions of w ⋅ x + b; the hinge loss is a convex upper bound on the 0/1 loss, and ξ measures the violation.]
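A one-line NumPy sketch of the standard hinge loss for ±1 labels, for concreteness (the function name is illustrative):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y_i (w.x_i + b)) over the sample."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))
```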
SVMs − The Non-Separable Case
• The objective function as well as the affine constraints are convex and
differentiable.
• Thus, KKT conditions apply at the optimum.
• We introduce Lagrange variables αᵢ ≥ 0, i ∈ [1,m], associated with the first m constraints, and βᵢ ≥ 0, i ∈ [1,m], associated with the non-negativity constraints on the slack variables.
• The Lagrangian can be defined as:
ℒ(w, b, ξ, α, β) = (1/2)∥w∥² + C ∑_{i=1}^m ξᵢ − ∑_{i=1}^m αᵢ[yᵢ(w ⋅ xᵢ + b) − 1 + ξᵢ] − ∑_{i=1}^m βᵢξᵢ
SVMs − The Support Vectors
ℒ(w, b, ξ, α, β) = (1/2)∥w∥² + C ∑_{i=1}^m ξᵢ − ∑_{i=1}^m αᵢ[yᵢ(w ⋅ xᵢ + b) − 1 + ξᵢ] − ∑_{i=1}^m βᵢξᵢ
• The KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables w, b and the ξᵢ to zero, and by writing the complementarity conditions:
SVMs − The Support Vectors
∇_w ℒ = w − ∑_{i=1}^m αᵢyᵢxᵢ = 0 ⟹ w = ∑_{i=1}^m αᵢyᵢxᵢ
∇_b ℒ = −∑_{i=1}^m αᵢyᵢ = 0 ⟹ ∑_{i=1}^m αᵢyᵢ = 0
∇_{ξᵢ} ℒ = C − αᵢ − βᵢ = 0 ⟹ αᵢ + βᵢ = C
∀i, αᵢ[yᵢ(w ⋅ xᵢ + b) − 1 + ξᵢ] = 0 ⟹ αᵢ = 0 ∨ yᵢ(w ⋅ xᵢ + b) = 1 − ξᵢ
∀i, βᵢξᵢ = 0 ⟹ βᵢ = 0 ∨ ξᵢ = 0
Note that since βᵢ = C − αᵢ ≥ 0, each αᵢ is bounded: 0 ≤ αᵢ ≤ C.
SVMs − The Support Vectors
• Thus, the weight vector w solution of the SVM problem is a linear combination of the training set vectors x₁, …, x_m: w = ∑_{i=1}^m αᵢyᵢxᵢ
• A vector xi appears in that expansion iff αi ≠ 0. Such vectors are called support
vectors.
SVMs − The Support Vectors
• Thus, the support vectors are either outliers (ξᵢ ≠ 0, which forces βᵢ = 0 and hence αᵢ = C) or vectors lying on the marginal hyperplanes.
• The solution vector w is unique, while the support vectors are not unique.
SVMs − Simplification of Lagrangian
ℒ(w, b, ξ, α, β) = (1/2)∥w∥² + C ∑_{i=1}^m ξᵢ − ∑_{i=1}^m αᵢ[yᵢ(w ⋅ xᵢ + b) − 1 + ξᵢ] − ∑_{i=1}^m βᵢξᵢ
= (1/2)∥w∥² + ∑_{i=1}^m ξᵢ(C − αᵢ − βᵢ) − ∑_{i=1}^m αᵢ[yᵢ(w ⋅ xᵢ + b) − 1]
Since C − αᵢ − βᵢ = 0 at the optimum, the middle term vanishes. Substituting w = ∑_{i=1}^m αᵢyᵢxᵢ in
(1/2)∥w∥² − ∑_{i=1}^m αᵢ[yᵢ(w ⋅ xᵢ + b) − 1]
gives an equivalent maximisation
max_α ∑ᵢ αᵢ − (1/2)α⊤Hα
subject to 0 ≤ αᵢ ≤ C, ∀i, and ∑ᵢ αᵢyᵢ = 0
SVMs − The Dual Formulation solved using a standard quadratic program
Rewriting the dual as min_α (1/2)α⊤Hα − 1⊤α, the correspondence with the standard form min_x (1/2)x⊤Px + q⊤x subject to Gx ≤ h, Ax = b is:
• P ≡ H (size m × m)
• q ≡ −1 (size m × 1)
• G ≡ −diag[1] stacked vertically on diag[1] (matrix of size 2m × m), so that Gx ≤ h encodes both −α ≤ 0 (i.e. −αᵢ ≤ 0 ∀i) and α ≤ C (i.e. αᵢ ≤ C ∀i)
• h ≡ 0 stacked vertically on C (size 2m × 1)
• A ≡ y⊤ (size 1 × m) and b ≡ 0 (scalar), encoding y⊤α = 0
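Relative to the hard-margin sketch earlier, only G and h change to encode the box constraints; a minimal soft-margin sketch in cvxopt (C and the function name are illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft(X, y, C=1.0):
    m = X.shape[0]
    H = np.outer(y, y) * (X @ X.T)
    P, q = matrix(H), matrix(-np.ones(m))
    # Box constraints 0 <= alpha_i <= C: stack -I over I, and 0 over C.
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    return np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
```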
SVMs − The Dual Formulation
w = ∑_{i=1}^m αᵢyᵢxᵢ
yᵢ = ⟨w, xᵢ⟩ + b, so b = yᵢ − ⟨w, xᵢ⟩ (any support vector xᵢ with 0 < αᵢ < C, i.e. lying on a marginal hyperplane, satisfies this equation)
Thank You