Machine Learning
Politics
With the help of the KNN algorithm, we can classify a potential voter into
classes such as "Will Vote", "Will Not Vote", "Will Vote for Party
'Congress'", or "Will Vote for Party 'BJP'".
Points   X1 (Acid Durability)   X2 (Strength)   Y (Classification)
P1       7                      7               BAD
P2       7                      4               BAD
P3       3                      4               GOOD
P4       1                      4               GOOD
P5       3                      7               ?
Procedure
• Step 1: Determine K, the number of neighbours.
• Let us select K = 3.
• Step 2: Calculate the distance between the query instance
and all the training samples.
• Step 3: Sort the distances and take the K training samples
with the smallest distances as the nearest neighbours.
• Step 4: Gather the class labels of those nearest neighbours.
• Step 5: Use the simple majority of those class labels as the
predicted class of the query instance.
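The five steps above can be sketched in a few lines of Python, applied to the acid durability / strength table with P5 = (3, 7) as the query instance. The slides do not name a distance metric, so Euclidean distance is an assumption here, and the names `training`, `query`, and `euclidean` are illustrative only.

```python
import math
from collections import Counter

# Training samples from the table: name -> ((X1, X2), class)
training = {
    "P1": ((7, 7), "BAD"),
    "P2": ((7, 4), "BAD"),
    "P3": ((3, 4), "GOOD"),
    "P4": ((1, 4), "GOOD"),
}
query = (3, 7)  # P5, the point to classify

# Step 1: choose the number of neighbours
K = 3

def euclidean(a, b):
    """Euclidean distance between two points (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Step 2: distance from the query to every training sample
distances = [
    (euclidean(point, query), name, label)
    for name, (point, label) in training.items()
]

# Step 3: sort by distance and keep the K closest samples
nearest = sorted(distances)[:K]

# Step 4: gather the class labels of those neighbours
labels = [label for _, _, label in nearest]

# Step 5: simple majority vote
prediction = Counter(labels).most_common(1)[0][0]

for dist, name, label in nearest:
    print(f"{name}: distance={dist:.3f}, class={label}")
print("Predicted class for P5:", prediction)
```

Under these assumptions the three nearest neighbours are P3, P4, and P1 (distances 3, about 3.606, and 4), so the majority class for P5 is GOOD.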
KNN
With K=3, two of the three closest neighbors have Default=Y and one has
Default=N, so the prediction for the unknown case is again Default=Y.
Standardized Distance
One major drawback of calculating distance measures directly from the training set
arises when variables have different measurement scales or when there is a
mixture of numerical and categorical variables. A variable measured on a larger
scale then dominates the distance.
Using the standardized distance on the same training set, the unknown case
returned a different nearest neighbor, which is not a good sign of robustness.
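To see how standardization can change the neighbour ranking, here is a minimal sketch on the acid durability / strength table from these slides. The slides do not say which standardization is meant, so min-max scaling to [0, 1] is an assumption, as is the Euclidean metric. The helper `min_max_scaler` is a hypothetical name.

```python
import math

# Training samples from the table: name -> ((X1, X2), class)
training = {
    "P1": ((7, 7), "BAD"),
    "P2": ((7, 4), "BAD"),
    "P3": ((3, 4), "GOOD"),
    "P4": ((1, 4), "GOOD"),
}
query = (3, 7)  # P5

def min_max_scaler(points):
    """Build a function that rescales each feature to [0, 1].

    Assumes every feature has a non-zero range in `points`.
    """
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]

    def scale(p):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))

    return scale

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

scale = min_max_scaler([point for point, _ in training.values()])

# Neighbour ranking on the raw features
raw = sorted((euclidean(point, query), name)
             for name, (point, _) in training.items())

# Neighbour ranking after scaling both the training points and the query
scaled = sorted((euclidean(scale(point), scale(query)), name)
                for name, (point, _) in training.items())

raw_order = [name for _, name in raw]
scaled_order = [name for _, name in scaled]
print("Raw order:   ", raw_order)
print("Scaled order:", scaled_order)
```

On this data the closest neighbour changes from P3 (raw) to P1 (scaled), because X2 spans a smaller range than X1 and gains weight once both features are rescaled. The three nearest neighbours are the same set in both cases, so the K=3 vote is unchanged here, but a K=1 prediction would flip.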
How to select the value of K in the K-NN Algorithm?
• There is no fixed rule for determining the best
value of K, so we need to try several values and
keep the one that performs best. A commonly
used starting value for K is 5.
• A very low value of K, such as K=1 or K=2, can
be noisy and makes the model sensitive to
outliers.
• Larger values of K smooth out noise, but they
can blur the boundaries between classes and let
the more frequent class dominate the vote.
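One way to "try some values" is leave-one-out validation: classify each training sample using all the others and count how often the prediction is correct for each candidate K. The sketch below reuses the four labelled points from the acid durability / strength table. The function names `knn_predict` and `loo_accuracy`, the Euclidean metric, and the candidate values 1 and 3 are assumptions for illustration.

```python
import math
from collections import Counter

# Labelled samples from the table: ((X1, X2), class)
samples = [
    ((7, 7), "BAD"),
    ((7, 4), "BAD"),
    ((3, 4), "GOOD"),
    ((1, 4), "GOOD"),
]

def knn_predict(train, query, k):
    """Majority class among the k training samples closest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: hold out each sample once and predict it."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

# Try odd candidate values to avoid tied votes between two classes
scores = {k: loo_accuracy(samples, k) for k in (1, 3)}
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

With only four samples the scores are not meaningful in themselves (holding one point out leaves just three, so K=3 always votes with the other class's majority). The same loop applied to a realistically sized dataset is what gives a usable comparison between values of K.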
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data when K is sufficiently large.
• It can be more effective when the training data is large.
Disadvantages of KNN Algorithm:
• The value of K always has to be determined, which can
sometimes be complex.
• The computation cost is high because the distance from
the query point to every training sample has to be
calculated.