T07 IDS - Classification

Classification Algorithms

 k-Nearest Neighbour (KNN)
 Decision Tree
 Naïve Bayes

Introduction to Classification: k-Nearest Neighbour

• Mainly used when all attribute values are continuous

• It can be modified to deal with categorical attributes

• The idea is to estimate the classification of an unseen instance using the classification of the instance or instances closest to it, in some sense that we need to define, i.e. it classifies new cases based on a similarity measure
Nearest Neighbour

Consider a training set of just two instances, each described by six attribute values and a classification, together with an unseen instance to be classified. What should its classification be? Even without knowing what the six attributes represent, it seems intuitively obvious that the unseen instance is nearer to the first instance than to the second.

Nearest Neighbour

• A training set with 20 instances, each giving the values of two attributes and an associated classification

• How can we estimate the classification for an ‘unseen’ instance where the first and second attributes are 9.1 and 11.0, respectively?
Nearest Neighbour

A circle has been added to enclose the five nearest neighbours of the unseen instance, which is shown as a small circle close to the centre of the larger one.

The five nearest neighbours are labelled with three + signs and two − signs.

So a basic 5-NN classifier would classify the unseen instance as ‘positive’ by a form of majority voting.
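A minimal sketch of the majority-voting step in Python, assuming the labels of the five nearest neighbours have already been found:

from collections import Counter

# Labels of the five nearest neighbours in the example above:
# three positive, two negative.
neighbour_labels = ["+", "+", "+", "-", "-"]

# Majority vote: the most frequent label among the k neighbours wins.
prediction = Counter(neighbour_labels).most_common(1)[0][0]
print(prediction)  # '+', i.e. 'positive'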
Distance Measures: Euclidean Distance

• If we denote an instance in the training set by (a1, a2) and the unseen instance by (b1, b2), the length of the straight line joining the points is

  √((a1 − b1)² + (a2 − b2)²)

• If there are two points (a1, a2, a3) and (b1, b2, b3) in a three-dimensional space, the corresponding formula is

  √((a1 − b1)² + (a2 − b2)² + (a3 − b3)²)

• The formula for the Euclidean distance between points (a1, a2, . . . , an) and (b1, b2, . . . , bn) in n-dimensional space is a generalisation of these two results:

  √((a1 − b1)² + (a2 − b2)² + . . . + (an − bn)²)
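A sketch of a complete k-NN classifier built directly from this distance formula; the function names and the (attributes, label) data layout are illustrative, not from the slides:

import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_set, unseen, k=5):
    """Classify an unseen instance by majority vote among its k nearest neighbours.

    training_set is a list of (attribute_tuple, label) pairs.
    """
    by_distance = sorted(training_set, key=lambda inst: euclidean(inst[0], unseen))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# With the 20-instance training set from the earlier slide loaded as
# (attributes, label) pairs, the unseen instance would be classified by:
# knn_classify(training_set, (9.1, 11.0), k=5)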
Estimating the Predictive Accuracy of a Classifier

• Any algorithm which assigns a classification to unseen instances is called a classifier.


• Predictive accuracy: the proportion of a set of unseen instances that the classifier correctly classifies.
Estimating the Predictive Accuracy of a Classifier

Three main strategies (training/test):

• Dividing the data into a training set and a test set
• k-fold cross-validation
• N-fold (leave-one-out) cross-validation
Method 1: Separate Training and Test Sets

The data is split into two parts, called a training set and a test set.

The training set is used to construct a classifier.

The classifier is then used to predict the classification for the instances in the test set.

If the test set contains N instances, of which C are correctly classified, then the predictive accuracy is P = C/N.
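One way to carry this out in Python, using scikit-learn as an illustrative tool; the iris dataset and the 70/30 split ratio are stand-in choices, not from the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Any labelled dataset works; iris is used here purely as a stand-in.
X, y = load_iris(return_X_y=True)

# Split the data into a training set and a test set (70/30 here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Use the training set to construct a classifier...
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# ...then measure predictive accuracy P = C/N on the unseen test instances.
print(clf.score(X_test, y_test))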
Method 2: k-fold Cross-Validation

The dataset comprises N instances.

It is divided into k equal parts, k typically being a small number such as 5 or 10.

Each of the k parts in turn is used as a test set and the other k − 1 parts are used as a training set.

Usually k = 5 to 10.
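A sketch of k-fold cross-validation via scikit-learn's cross_val_score (k = 10 here, matching the typical choice above; the dataset is again just a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each of the k = 10 parts serves once as the test set,
# with the other k - 1 parts forming the training set.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)

# Average the k accuracy estimates to get the overall figure.
print(scores.mean())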
Method 3: N-fold Cross-Validation

 N-fold cross-validation is an extreme case of k-fold cross-validation

 Often known as ‘leave-one-out’ cross-validation or jack-knifing

 The dataset is divided into as many parts as there are instances, each instance effectively forming a test set of one

 k = N

 The N accuracy estimates e1, e2, …, eN are averaged

 Computationally expensive, but it makes the fullest possible use of the data for validation
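A sketch of leave-one-out cross-validation with scikit-learn; note that one classifier is built per instance, which is where the computational cost comes from:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k = N: every instance is, in turn, a test set of size one.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=LeaveOneOut())

# Each estimate e_i is 0 or 1 (wrong/right); their average is the accuracy estimate.
print(scores.mean())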


Experimental Results - I

 Predictive accuracy of classifiers generated for four datasets.


 All the results in this section were obtained using the TDIDT tree induction algorithm, with information
gain used for attribute selection.
Experimental Results - I

• The following results were obtained using 10-fold and N-fold cross-validation for the four datasets.
Confusion Matrix

• As well as the overall predictive accuracy on unseen instances, it is often helpful to see a breakdown of the classifier’s performance, i.e. how frequently instances of class X were correctly classified as class X or misclassified as some other class.
• This information is given in a confusion matrix.
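A sketch of building a confusion matrix with scikit-learn; the true and predicted labels below are invented for illustration:

from sklearn.metrics import confusion_matrix

# Invented labels for a two-class problem with classes '+' and '-'.
y_true = ["+", "+", "+", "-", "-", "-", "+", "-"]
y_pred = ["+", "-", "+", "-", "-", "+", "+", "-"]

# Rows = true class, columns = predicted class: the diagonal counts
# correct classifications, off-diagonal cells count misclassifications.
print(confusion_matrix(y_true, y_pred, labels=["+", "-"]))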
