T07 IDS - Classification
What should its classification be? Even without knowing what the six attributes
represent, it seems intuitively obvious that the unseen instance is nearer to the
first instance than to the second.
Nearest Neighbour
• If we denote an instance in the training set by (a1, a2) and the unseen instance by (b1, b2), the length of the straight line joining the points is
$$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$$
• If there are two points (a1, a2, a3) and (b1, b2, b3) in a three-dimensional space, the corresponding formula is
$$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2}$$
• The formula for Euclidean distance between points (a1, a2, . . . , an) and (b1, b2, . . . , bn) in n-dimensional space is a generalisation of these two results. The Euclidean distance is given by the formula
$$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}$$
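The distance calculation and the nearest-neighbour rule can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the two training instances are made up for the example, and it assumes all attributes are numeric and that only the single nearest neighbour is used.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of attributes."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_neighbour(training_set, unseen):
    """Return the class of the training instance closest to `unseen`.

    `training_set` is a list of (attributes, label) pairs.
    """
    _, label = min(training_set, key=lambda pair: euclidean_distance(pair[0], unseen))
    return label


# Hypothetical data, invented purely for illustration.
training = [((1.0, 1.0), "A"), ((5.0, 5.0), "B")]

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(nearest_neighbour(training, (2.0, 1.5)))     # A
```

Because the function sums over all attribute pairs, the same code covers the two-dimensional, three-dimensional and n-dimensional cases.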
Estimating the Predictive Accuracy of a Classifier
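Method 1: Separate Training and Test Sets
The data is split into two parts, called a training set and a test set. The training set is used to construct the classifier.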
The classifier is then used to predict the classification for the instances in the
test set.
If the test set contains N instances, of which C are correctly classified, the predictive accuracy is P = C/N.
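As a small worked example of P = C/N, the sketch below counts the correctly classified test instances and divides by the size of the test set. The function name and the label lists are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predictive_accuracy(predicted, actual):
    """Return C/N: the fraction of test instances that were classified correctly."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the test set must not be empty")
    correct = sum(1 for p, t in zip(predicted, actual) if p == t)  # C
    return correct / len(actual)                                   # C / N


# Hypothetical test set of N = 5 instances, of which C = 4 are classified correctly.
actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no"]

print(predictive_accuracy(predicted, actual))  # 0.8
```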
Method 2: K-fold Cross Validation
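The dataset is divided into k equal parts, k typically being a small number such as 5 or 10.
Each of the k parts in turn is used as a test set and the other k − 1 parts are used as a training set.
The total number of instances correctly classified over all k runs, divided by the total number of instances, gives the overall estimate of predictive accuracy.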
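The procedure can be sketched as follows. This is an illustrative outline rather than a definitive implementation: `k_fold_splits`, `cross_validate` and `majority_class_trainer` are hypothetical names, the data is invented, and the folds are taken in the order the data is stored (in practice the instances would usually be shuffled first). When the dataset does not divide evenly, the first few folds receive one extra instance.

```python
from collections import Counter


def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(data, k, train_fn):
    """Estimate predictive accuracy: total correct over all folds / total instances.

    `data` is a list of (attributes, label) pairs; `train_fn` takes a training
    set and returns a function mapping attributes to a predicted label.
    """
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(data), k):
        classify = train_fn([data[i] for i in train_idx])
        correct += sum(1 for i in test_idx if classify(data[i][0]) == data[i][1])
    return correct / len(data)


def majority_class_trainer(training):
    """Stand-in classifier: always predicts the most common training label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda attributes: majority


# Hypothetical dataset of 9 instances; the attribute values are not used here.
labels = ["A", "A", "B", "A", "A", "B", "A", "A", "B"]
data = [((float(i),), label) for i, label in enumerate(labels)]

# With k = 3, each fold predicts "A" and gets 2 of its 3 test instances right.
accuracy = cross_validate(data, 3, majority_class_trainer)
print(accuracy)
```

Calling `cross_validate(data, len(data), majority_class_trainer)` uses every instance as a test set of one, which is the N-fold case.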
Method 3: N-fold Cross Validation
The dataset is divided into as many parts as there are instances, each instance effectively forming a test set of one.
This is k-fold cross-validation with k = N, often known as "leave-one-out".
e1, e2, ..., en
• The results below were obtained using 10-fold and N-fold cross-validation for the four datasets.
Confusion Matrix
• As well as the overall predictive accuracy on unseen instances, it is often helpful to see a breakdown of the
classifier’s performance, i.e. how frequently instances of class X were correctly classified as class X or
misclassified as some other class.
• This information is given in a confusion matrix.
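A confusion matrix can be built by counting (actual class, predicted class) pairs. The sketch below assumes the common convention that rows are the actual classes and columns are the predicted classes; the function name and the labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Return a matrix where entry [i][j] counts instances of actual class
    classes[i] that were classified as classes[j]."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


# Hypothetical test results for three classes.
classes = ["X", "Y", "Z"]
actual = ["X", "X", "X", "Y", "Y", "Z"]
predicted = ["X", "Y", "X", "Y", "Y", "X"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)

# Correct classifications lie on the main diagonal, so the overall
# predictive accuracy is the diagonal sum divided by the number of instances.
diagonal = sum(matrix[i][i] for i in range(len(classes)))
print(diagonal / len(actual))
```

Off-diagonal entries show which classes are being confused with one another: here one instance of X was misclassified as Y and the single instance of Z was misclassified as X.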