
MACHINE LEARNING

(20BT60501)

COURSE DESCRIPTION:
Concept learning, General to specific ordering, Decision tree
learning, Support vector machine, Artificial neural networks,
Multilayer neural networks, Bayesian learning, Instance based
learning, Reinforcement learning.
Subject: MACHINE LEARNING (20BT60501)

Topic: Unit II – DECISION TREE LEARNING AND KERNEL MACHINES

Prepared By:
Dr.J.Avanija
Professor
Dept. of CSE
Sree Vidyanikethan Engineering College
Tirupati.
Unit II – DECISION TREE LEARNING AND KERNEL MACHINES

Decision Tree Learning:


 Decision tree representation
 Problems for decision tree learning
 Decision tree learning algorithm
 Hypothesis space search
 Inductive bias in decision tree learning
 Issues in decision tree learning
Kernel Machines:
 Support vector machines
 SVMs for regression
 SVMs for classification
 Choosing C
 A probabilistic interpretation of SVMs
Kernel Methods in Machine Learning
 Kernels, or kernel methods (also called kernel functions), are a family of algorithms used for pattern analysis.
 They are used to solve a non-linear problem by using a linear classifier.
 The kernel trick lets a learning algorithm capture non-linear patterns without the high-dimensional mapping ever being explicitly computed.
 The input dataset can be placed into a higher-dimensional space with the help of a kernel method or trick, and then any of the available classification algorithms can be used in this higher-dimensional space.
 This derives a hyperplane that linearly separates the two categories.
 A kernel in machine learning is a measure of similarity between two points.
Kernel Methods in Machine Learning

 In the real world, almost all data is randomly distributed, which makes it hard to separate the different classes linearly.
 The remedy is to map the data from the 2-dimensional space into a 3-dimensional (or higher-dimensional) space, where a linear separator can be found.
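To make the idea concrete, here is a minimal sketch (using NumPy; the two 2-D points are a made-up toy example) showing that an explicit mapping to a higher-dimensional space and the corresponding kernel give the same inner products:

import numpy as np

# Two 2-D points from a made-up toy dataset
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
# lifts each 2-D point into a 3-D space.
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

# Inner product computed in the 3-D feature space ...
lifted = np.dot(phi(x), phi(z))

# ... equals the polynomial kernel k(x, z) = (x . z)^2 computed
# directly in the original 2-D space (the "kernel trick").
kernel = np.dot(x, z) ** 2

print(lifted, kernel)  # both print 2.25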
Kernels in SVM

 An interesting feature of SVM is that it can work even with a non-linear dataset; for this we use the "Kernel Trick", which makes it easier to classify the points.

 Different Kernel Functions:
  Polynomial Kernel
  Sigmoid Kernel
  RBF Kernel
  Bessel function Kernel
  ANOVA Kernel
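As a rough usage sketch (assuming scikit-learn is available; the half-moons dataset is just a convenient stand-in for non-linear data, and note that only the polynomial, sigmoid and RBF kernels are built in, not Bessel or ANOVA):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy non-linear dataset: two interleaving half-moons
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try a few of the kernels listed above
for kernel in ["poly", "sigmoid", "rbf"]:
    clf = SVC(kernel=kernel, degree=2, gamma="scale", C=1.0)
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))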
Kernels in SVM

Polynomial Kernel
Following is the formula for the polynomial kernel:

  K(x, y) = (xᵗy + c)ᵈ

 Here d is the degree of the polynomial, which we need to specify manually (c is a constant term, often set to 1).

 An SVM using a polynomial kernel (degree 2) can separate data that is not linearly separable in the original space.
Kernels in SVM

Sigmoid Kernel
 It is closely related to neural networks, since it uses the same activation as a sigmoid unit:

  K(x, y) = tanh(γ·xᵗy + r)

 Intuitively, it takes your inputs and maps them towards values of 0 and 1 so that they can be separated by a simple straight line.
Kernels in SVM

RBF Kernel (Radial Basis Function)
 It creates non-linear combinations of the features to lift the samples onto a higher-dimensional feature space, where a linear decision boundary can separate the classes.
 It is the most used kernel in SVM classification. The following formula describes it mathematically:

  K(x, y) = exp(-γ‖x - y‖²), where γ = 1/(2σ²)
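As a quick sanity check of the formula, a minimal NumPy sketch (the three points and the value of gamma are made up purely for illustration):

import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Three made-up 2-D points
points = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])

# Gram (kernel) matrix of pairwise similarities:
# identical points get similarity 1; distant points approach 0.
K = np.array([[rbf_kernel(a, b) for b in points] for a in points])
print(np.round(K, 3))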
Support Vector Machines
 A support vector machine (SVM) is a machine learning algorithm that analyzes data for classification and regression analysis. SVM is a supervised learning method that looks at data and sorts it into one of two categories.
 It is trained with a series of data already classified into two categories, building the model as it is trained. The task of an SVM algorithm is to determine which category a new data point belongs to.
 SVM is fundamentally a binary linear classifier; non-linear problems are handled through kernels.
Support Vector Machines
Important Terminologies
 Hyperplane
 Support Vectors
 Marginal Distance
 Linearly Separable
 Non-linearly Separable
Support Vector Machines

[Figure: a separating hyperplane, the marginal distance (margin), and the support vectors]
Support Vector Machines

Applications of SVM
 Text and hypertext classification
 Image classification
 Recognizing handwritten characters
 Biological sciences, including protein classification

 The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
Support Vector Machines
 SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
 The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors.
 The two different categories are separated using a decision boundary, or hyperplane:
Support Vector Machines

The red coloured dashed line is the optimal hyperplane. The green coloured dashed lines define the boundary for each class, and the data points with a thick green outline that lie on the boundary of their class are called support vectors; hence the name Support Vector Machine. The support vectors are used to determine the optimal hyperplane.
Support Vector Machines
SVM can be of two types:
 Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

 Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
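To illustrate the difference, a rough sketch (assuming scikit-learn; the concentric-circles dataset is a made-up example of non-linearly separable data):

from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Made-up non-linearly separable data: one class inside the other
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# A linear SVM cannot draw a single straight line between the circles...
linear_clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print("Linear SVM accuracy:", linear_clf.score(X, y))   # roughly chance level

# ...while a non-linear (RBF-kernel) SVM separates them easily.
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("Non-linear SVM accuracy:", rbf_clf.score(X, y))  # close to 1.0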
Support Vector Machines

[Figure: Linear SVM vs. Non-Linear SVM decision boundaries]
Support Vector Machines
Mathematical Modeling:
 Given a training set {(Xᵢ, Yᵢ), i = 1, 2, 3, …, n}, Xᵢ ∈ ℜᵐ, Yᵢ ∈ {+1, -1}. Here, Xᵢ is the feature vector for the iᵗʰ data point and Yᵢ is the label for the iᵗʰ data point. The label is '+1' for the positive class or '-1' for the negative class; the value '1' is chosen for mathematical convenience.
 Let W be a vector perpendicular to the decision boundary (the optimal hyperplane) and Xᵢ be an unknown vector. Then the projection of Xᵢ onto the unit vector of W determines whether that unknown point belongs to the positive class or the negative class.
Support Vector Machines
Mathematical Modeling:
Y = WᵗXᵢ + b – Equation of the hyperplane
The dot product between the two vectors W and X is the same as the matrix multiplication between Wᵗ and X.
Support Vector Machines
Mathematical Modeling:
Let X⁺ be a support vector in the positive class and X⁻ be a support vector in the negative class. Then WX⁺ + b = 1 ⇒ WX⁺ = 1 - b. Similarly, WX⁻ + b = -1 ⇒ WX⁻ = -1 - b. The projection of the vector (X⁺ - X⁻) onto the unit vector of W gives the width of the separation gap, i.e. the margin between the support vectors of the two classes. The width of the margin is 2/‖W‖.
Support Vector Machines
Mathematical Modeling:
The objective of SVM is to maximise the width of the separation gap, i.e. to maximise 2/‖W‖, which is the same as minimising ‖W‖, which is the same as minimising ‖W‖², which in turn is the same as minimising (1/2)‖W‖²; the latter can be written as (1/2)WᵗW.
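Collecting the last two slides into standard notation (the explicit constraint form below is the usual hard-margin statement; the slides do not write it out, so it is included here for reference):

W^{T}(X^{+} - X^{-}) = (1 - b) - (-1 - b) = 2
\quad\Rightarrow\quad \text{margin width} = \frac{W^{T}}{\lVert W \rVert}\,(X^{+} - X^{-}) = \frac{2}{\lVert W \rVert}

\min_{W,\,b}\ \tfrac{1}{2}\lVert W \rVert^{2}
\quad \text{subject to} \quad Y_{i}\,(W^{T}X_{i} + b) \ \ge\ 1, \qquad i = 1, \dots, n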
Support Vector Machines
Hard Margin and Soft Margin
 Hard-margin SVM is used for linearly separable data.
 Soft-margin SVM is used for non-linearly separable data.
Support Vector Machines

Soft Margin Constraints

Support Vector Machines
Mathematical Modeling:
 In real scenarios, the data is not strictly linearly separable. Thus, the problem is modified by introducing slack variables ξᵢ and a penalty term C. Here, C is a kind of regularisation parameter.
 The slack variable ξᵢ is the distance between a data point and the margin of its class, measured from the other side.
 If ξᵢ for a data point is small, then the mistake is less bad and C·ξᵢ is small.
 If ξᵢ for a data point is high, then the mistake is worse and C·ξᵢ is large.
Support Vector Machines
Mathematical Modeling:
With the slack variables, the optimisation problem becomes:

  minimise (1/2)‖W‖² + C·Σᵢ ξᵢ
  subject to Yᵢ(WᵗXᵢ + b) ≥ 1 - ξᵢ and ξᵢ ≥ 0 for all i.
SVM for Classification
 Hinge Loss
 The hinge loss is a specific type of cost function that incorporates a margin or
distance from the classification boundary into the cost calculation.
 The hinge loss increases linearly.
 Associated with soft-margin support vector machines.
 The distance from the hyperplane can be regarded as a measure of confidence.

SVM for Classification
 Hinge Loss
 Given input features "X" and target "y", the goal of the SVM algorithm is to predict a value ('predicted y') close to the target ('actual y') for each observation.
 The equation that calculates 'predicted y' depends on some weighted values of the input X. It can be written as:

  predicted y = f(weighted values of X)   (weights denoted as w)

 The job of the loss function is to quantify the error between the 'predicted y' and the 'actual y'.
 It defines the amount by which you want to penalize the mis-classified observations.

  Total cost = ‖w‖²/2 + C·(sum of all losses for each observation)

 where C is the hyper-parameter that controls the amount of regularization.
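A minimal NumPy sketch of this cost (the weights, labels and scores below are made up purely for illustration):

import numpy as np

# Made-up labels in {-1, +1} and raw SVM scores w.x + b for 4 observations
actual_y    = np.array([+1, +1, -1, -1])
predicted_y = np.array([2.3, 0.4, -1.7, 0.2])   # the last point is misclassified

w = np.array([0.5, -1.2])   # made-up weight vector
C = 1.0                     # regularization hyper-parameter

# Hinge loss per observation: max(0, 1 - actual_y * predicted_y)
losses = np.maximum(0.0, 1.0 - actual_y * predicted_y)

# Total cost = ||w||^2 / 2 + C * sum of losses
total_cost = 0.5 * np.dot(w, w) + C * np.sum(losses)
print(losses, total_cost)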
SVM for Classification
 Hinge Loss
 The hinge loss is a loss function used for training classifiers in SVM. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).

[Figure: a decision boundary classifying positive and negative points; the points marked in red are misclassified]
SVM for Classification
 Hinge Loss
 Plot yf(x) against the loss function. Under the simple 0-1 loss, points with yf(x) > 0 are assigned a loss of '0', and points with yf(x) < 0 are assigned a loss of '1'.
SVM for Classification
Consider a positive sample X. The penalty is 1 - t·y, where t is the actual output and y is the predicted output of the SVM.
 Beyond the positive plane (t·y > 1): penalty = 0.
 Between the positive plane and the decision boundary (0 < t·y < 1): penalty is between 0 and 1.
 On the decision boundary (t·y = 0): penalty = 1.
 Beyond the negative plane (misclassified): penalty > 1.
SVMs for Classification

• The hinge loss function penalty increases linearly.
• Hinge loss is defined as max(0, 1 - t·y), where t is the actual outcome (either +1 or -1) and y is the predicted output of the SVM.
• The objective to minimise is:

  min (1/2)‖W‖² + C·Σᵢ₌₁ⁿ max(0, 1 - tᵢyᵢ)

  where n is the number of samples.
SVM for Classification
 Hinge Loss
 Hinge loss = max(0, 1 - yf(x)).
 For yf(x) ≥ 1, the hinge loss is '0'.
 For yf(x) < 1, the hinge loss increases linearly as yf(x) decreases; the further a misclassified point lies from the boundary, the larger the loss 1 - yf(x) becomes.

 The xᵢ for which αᵢ > 0 are called support vectors: the points which are either incorrectly classified, or correctly classified but lying on (or inside) the margin.
SVM for Classification
 Large Margin Principle
 The margin is the distance between the two boundaries. The support vectors are the instances at the boundaries (where WᵗX = 1 or -1), or within the boundaries if the data is not linearly separable.
 The goal of SVMs is to learn the boundaries that make the margin as large as possible (Large Margin Classification).
 The size of the margin is 2/‖w‖, where ‖w‖ is the L2 norm of the weight vector.
 Learning goal:
  Maximise 2/‖w‖, subject to the constraints that all instances are correctly classified.
  Turn it into a minimisation problem by taking the inverse: ½‖w‖.
  Can also square the L2 norm (it makes the calculus easier), just like with L2 regularization: ½‖w‖².
SVM for Classification
 Choosing C
 C is chosen by cross validation.
 A Support Vector Machine always looks for two things:
  setting a larger margin
  lowering the misclassification rate
 Increasing the margin tends to raise the misclassification rate; decreasing the margin tends to lower it.
 The priority should be getting a lower misclassification rate, which can be achieved through the parameter C.
SVM for Classification
 Cross-Validation (k-fold CV)

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is a statistical method used to estimate the skill of machine learning models.
SVM for Classification
 Choosing C
 A large value of the parameter C ⇒ a small margin.
 A small value of the parameter C ⇒ a large margin.
 Choosing C depends on held-out data: try different C values and choose the value which gives the lowest misclassification rate on the validation/testing data.
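A rough sketch of this selection procedure with k-fold cross-validation (assuming scikit-learn; the candidate C values and the synthetic dataset are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Arbitrary synthetic dataset standing in for real training data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Try several values of C; 5-fold cross-validation estimates the
# misclassification rate (via accuracy) for each candidate.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf", gamma="scale"), param_grid, cv=5)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Cross-validated accuracy:", search.best_score_)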
SVM for Classification
 SVMs for Multiclass Classification

• Upgrading an SVM to the multi-class case is not so easy, since the outputs are not on a calibrated scale and hence are hard to compare to each other.
• Two common strategies, described on the following slides, are the one-versus-the-rest (OVR) approach (also called one-vs-all) and the one-versus-one (OVO) approach.
SVM for Classification
 SVMs for Multiclass Classification

• The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in which we train C binary classifiers, f_c(x), where the data from class c is treated as positive and the data from all the other classes is treated as negative.
• However, this can result in regions of input space which are ambiguously labeled.
• [Figure: the green region is predicted to be both class 1 and class 2.]
SVM for Classification
 SVMs for Multiclass Classification

• Another approach is to use the one-versus-one (OVO) approach, also called all pairs, in which we train C(C-1)/2 classifiers to discriminate between all pairs f_{c,c'}.
• We then classify a point into the class which has the highest number of votes. However, this can also result in ambiguities.
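A short sketch of both strategies (assuming scikit-learn, whose SVC in fact uses OVO internally for multi-class problems; the iris data is just a convenient 3-class example):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes, so C = 3

# One-versus-the-rest: trains C = 3 binary classifiers
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

# One-versus-one: trains C(C-1)/2 = 3 pairwise classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and 3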
SVM for Regression
 Regression analysis consists of a set of machine learning methods
that allow us to predict a continuous outcome variable (y) based
on the value of one or multiple predictor variables (x).
 Goal of regression model is to build a mathematical equation that
defines y as a function of the x variables.
 This equation can be used to predict the outcome (y) on the basis of
new values of the predictor variables (x).
 It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them.
SVM for Regression

 ŷᵢ = β₀ + β₁Xᵢ, where ŷᵢ represents the estimated output for the input Xᵢ.
 In this equation, β₀ is the bias and β₁ is the weight of the model. If the model is charted so that the output is on the y-axis and the input is on the x-axis, β₀ refers to the y-intercept and β₁ represents the slope.
SVM for Regression
• The problem with kernelized ridge regression is that the solution vector 𝒘 depends on all the training inputs.
• We now seek a method to produce a sparse estimate.
• Consider the epsilon-insensitive loss function (often discussed alongside the Huber loss).
• This means that any point lying inside an 𝜖-tube around the prediction is not penalized.
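For reference, the standard form of the 𝜖-insensitive loss (written out here because the slide's formula image is not reproduced; the notation follows the usual textbook convention):

L_{\epsilon}(y, \hat{y}) =
\begin{cases}
0 & \text{if } |y - \hat{y}| < \epsilon \\
|y - \hat{y}| - \epsilon & \text{otherwise}
\end{cases}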
SVM for Regression
• Mean Square Error: MSE = (1/n)·Σᵢ₌₁ⁿ (tᵢ - yᵢ)², where n is the number of samples, t is the actual output, and y is the predicted output.
• Mean Absolute Error: MAE = (1/n)·Σᵢ₌₁ⁿ |tᵢ - yᵢ|.
• Huber Loss: quadratic, (1/2)(t - y)², for small errors |t - y| ≤ δ, and linear, δ|t - y| - δ²/2, for larger errors.
SVMs for Regression

[Figure: (a) Illustration of the ℓ2, Huber and 𝜖-insensitive loss functions, where 𝜖 = 1.5. (b) Illustration of the 𝜖-tube used in SVM regression.]
SVMs for Regression
• The corresponding objective function is:

  J = C·Σᵢ₌₁ⁿ L𝜖(yᵢ, ŷᵢ) + (1/2)‖w‖²

  where C is the regularisation parameter.
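A minimal usage sketch (assuming scikit-learn; the noisy sine-wave data is made up, and the epsilon and C values are arbitrary):

import numpy as np
from sklearn.svm import SVR

# Made-up 1-D regression data: a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# epsilon sets the width of the epsilon-tube (errors inside it are not
# penalised); C trades off fitting errors against model flatness.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

print("Number of support vectors:", len(model.support_))
print("Prediction at x=2.5:", model.predict([[2.5]]))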
SVMs Pros and Cons
– It works really well with a clear margin of separation.
– It is effective in high-dimensional spaces.
– It is effective in cases where the number of dimensions is greater than the number of samples.
– It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
– This classifier is heavily reliant on the support vectors and changes as the support vectors change; as a result, it can tend to overfit, so kernel functions and regularization are important.
– It does not provide probability estimates directly.
– It doesn't perform well with large datasets because the required training time is high.
– It also doesn't perform very well when the data set has more noise, i.e. when the target classes are overlapping.
