Clustering - KNN
K-NN
K-nearest neighbors (KNN) is a type of
supervised learning algorithm used for both
regression and classification.
• KNN tries to predict the correct class for the test data by calculating the
distance between the test data and all the training points.
• It then selects the K points that are closest to the test data.
• The KNN algorithm estimates the probability of the test data belonging to
each of the classes of the ‘K’ training points, and the class with the highest
probability is selected.
• In the case of regression, the predicted value is the mean of the ‘K’ selected
training points (both cases are sketched below).
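A minimal sketch of both uses, assuming scikit-learn and its built-in Iris dataset (neither is mentioned in the slides; K=5 and the feature split for the regression example are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification: the predicted label is the majority class among the
# K nearest training points.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Regression: the prediction is the mean target value of the K nearest
# training points (here we regress the last feature on the first three).
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X_train[:, :3], X_train[:, 3])
print("test R^2:", reg.score(X_test[:, :3], X_test[:, 3]))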
KNN
How does it work?
The working of K-NN can be explained on the basis of the algorithm below:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to all the
training points.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these K neighbors, count the number of data points in
each category.
• Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
• Step-6: Our model is ready (a from-scratch sketch of these steps follows below).
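The steps map directly onto a short from-scratch sketch; NumPy and the tiny two-class toy data are assumptions made purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Step-2: Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step-3: indices of the K nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step-4 and Step-5: count labels among the neighbors and return the majority.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Step-1 is the choice k=3 below; the query point lies near the class-0 cluster.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # prints 0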
How to choose the value of K?
• There are no pre-defined statistical methods to find the most
favorable value of K.
• Initialize a random K value and start computing.
• Choosing a small value of K leads to unstable decision boundaries.
• A larger value of K is better for classification as it smooths
the decision boundaries.
• Plot the error rate against K over a defined range, then choose the K
value with the minimum error rate (a sketch of this follows below).
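One common way to draw that plot, sketched here with scikit-learn, 5-fold cross-validation, and the Iris data (all of these are assumptions; any labelled dataset and validation scheme would do):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = list(range(1, 31))
# Error rate = 1 - cross-validated accuracy, computed for each candidate K.
error_rates = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
               for k in k_values]

plt.plot(k_values, error_rates, marker="o")
plt.xlabel("K")
plt.ylabel("cross-validated error rate")
plt.show()

print("K with minimum error rate:", k_values[int(np.argmin(error_rates))])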
Value of K?
1. As we decrease the value of K to 1, our predictions become less stable. Imagine
K=1 and a query point surrounded by several reds and one green, where the green
happens to be the single nearest neighbor. Reasonably, we would think the query
point is most likely red, but because K=1, KNN incorrectly predicts that the query
point is green.
2. Conversely, as we increase the value of K, our predictions become more stable due
to majority voting / averaging, and thus more likely to be accurate (up to a certain
point). Eventually, we begin to witness an increasing number of errors; it is at this
point that we know we have pushed the value of K too far.
3. In cases where we are taking a majority vote (e.g. picking the mode in a
classification problem) among labels, we usually make K an odd number to have a
tiebreaker, as illustrated below.
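A tiny illustration of the tie problem, counting votes with Python's Counter (the colour labels are invented for the example):

from collections import Counter

# With an even K the vote can tie in a two-class problem...
neighbors_k4 = ["red", "red", "green", "green"]          # K = 4: tied 2-2
print(Counter(neighbors_k4).most_common())               # [('red', 2), ('green', 2)]

# ...whereas an odd K always produces a clear majority.
neighbors_k5 = ["red", "red", "green", "green", "red"]   # K = 5: red wins 3-2
print(Counter(neighbors_k5).most_common(1)[0][0])        # red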
Pros and Cons
• Advantages
1. The algorithm is simple and easy to implement.
2. There's no need to build a model, tune several parameters, or make
additional assumptions.
3. The algorithm is versatile. It can be used for classification, regression,
and search.
• Disadvantages
1. The algorithm gets significantly slower as the number of examples
and/or predictors/independent variables increases.
Conclusion
• KNN works by finding the distances between a query and all the examples in
the data, selecting the specified number of examples (K) closest to the query,
and then voting for the most frequent label (in the case of classification) or
averaging the labels (in the case of regression).