k-nearest neighbors algorithm - Wikipedia
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification,[7] but make boundaries between classes less distinct. A good k can
be selected by various heuristic techniques (see hyperparameter optimization). The special
case where the class is predicted to be the class of the closest training sample (i.e. when k = 1)
is called the nearest neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or
irrelevant features, or if the feature scales are not consistent with their importance. Much
research effort has been put into selecting or scaling features to improve classification. A
particularly popular approach is the use of evolutionary algorithms to optimize feature
scaling.[8] Another popular approach is to scale features by the mutual information of the
training data with the training classes.
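As a hedged illustration of the mutual-information approach, the sketch below scales each feature by an estimate of its mutual information with the class labels; the use of scikit-learn's mutual_info_classif estimator and the stand-in dataset are assumptions for illustration, not part of the cited work.

# Sketch: scale each feature by its estimated mutual information with the class labels.
# Assumes scikit-learn is available; the dataset is a stand-in for illustration.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
X_scaled = X * mi                                # more informative features get a larger scale
knn = KNeighborsClassifier(n_neighbors=5).fit(X_scaled, y)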
In binary (two class) classification problems, it is helpful to choose k to be an odd number as
this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is
via the bootstrap method.[9]
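As a hedged sketch of empirical selection of k, the example below uses a cross-validated grid search over odd values of k; the cited reference selects k with a bootstrap procedure instead, and the dataset here is a stand-in.

# Sketch: choose an empirically good k by cross-validated grid search over odd values
# (illustrative; the cited reference selects k with a bootstrap procedure instead).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)       # a binary (two class) problem

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 40, 2))},   # odd k avoids tied votes
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                       # the selected k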
The 1-nearest neighbor classifier
The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier
that assigns a point x to the class of its closest neighbour in the feature space, that is, $C^{1\mathrm{nn}}_{n}(x) = Y_{(1)}$, where $Y_{(1)}$ is the label of the training point nearest to x.
As the size of the training data set approaches infinity, the one nearest neighbour classifier
guarantees an error rate of no worse than twice the Bayes error rate (the minimum achievable
error rate given the distribution of the data).
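A minimal sketch of the 1-nearest-neighbour rule follows (plain NumPy with Euclidean distance; the function name and toy data are illustrative).

# Sketch: 1-nearest-neighbour classification with Euclidean distance (illustrative only).
import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    # distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # the predicted class is the class of the single closest training sample
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(one_nn_predict(X_train, y_train, np.array([4.0, 4.5])))   # -> 1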
The weighted nearest neighbour classifier
The k-nearest neighbour classifier can be viewed as assigning the k nearest neighbours a
weight $1/k$ and all others 0 weight. This can be generalised to weighted nearest neighbour classifiers. That is, the ith nearest neighbour is assigned a weight $w_{ni}$, with $\sum_{i=1}^{n} w_{ni} = 1$. An analogous result on the strong consistency of weighted nearest
neighbour classifiers also holds.[10]
Let $C^{wnn}_{n}$ denote the weighted nearest neighbour classifier with weights $\{w_{ni}\}_{i=1}^{n}$. Subject to regularity conditions on the class distributions, the excess risk has the following asymptotic expansion[11]
$$\mathcal{R}(C^{wnn}_{n}) - \mathcal{R}(C^{\mathrm{Bayes}}) = \left( B_{1} s_{n}^{2} + B_{2} t_{n}^{2} \right) \{ 1 + o(1) \},$$
for constants $B_{1}$ and $B_{2}$, where $s_{n}^{2} = \sum_{i=1}^{n} w_{ni}^{2}$ and $t_{n} = n^{-2/d} \sum_{i=1}^{n} w_{ni} \left\{ i^{1+2/d} - (i-1)^{1+2/d} \right\}$.
With optimal weights the dominant term in the asymptotic expansion of the excess risk is $\mathcal{O}\!\left(n^{-4/(d+4)}\right)$. Similar results are true when using a bagged nearest neighbour classifier.
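The following sketch assigns the ordered neighbours user-supplied weights w_1, ..., w_k summing to 1 and predicts the class with the largest total weight; it is an illustrative implementation, not the optimally weighted scheme of the reference.

# Sketch: weighted nearest-neighbour classification with user-supplied weights
# w_1..w_k that sum to 1 (illustrative; not the optimal weights from the reference).
import numpy as np

def weighted_nn_predict(X_train, y_train, x_query, weights):
    order = np.argsort(np.linalg.norm(X_train - x_query, axis=1))
    votes = {}
    for w, idx in zip(weights, order[:len(weights)]):   # i-th nearest neighbour gets weight w_i
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
    return max(votes, key=votes.get)                    # class with the largest total weight

# uniform weights 1/k recover the ordinary k-NN classifier
X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])
y = np.array([0, 0, 1, 1])
print(weighted_nn_predict(X, y, np.array([0.2, 0.2]), weights=[1/3, 1/3, 1/3]))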
Properties
For multi-class classification, the asymptotic error rate satisfies
$$R^{*} \leq R_{k\mathrm{NN}} \leq R^{*} \left( 2 - \frac{M R^{*}}{M - 1} \right),$$
where $R^{*}$ is the Bayes error rate (which is the minimal error rate possible), $R_{k\mathrm{NN}}$ is the asymptotic k-NN error rate, and M is the number of classes in the problem. This bound is tight in the sense that both the lower and upper bounds are achievable by some distribution.[15] For $M = 2$ and as the Bayesian error rate $R^{*}$ approaches zero, this limit reduces to "not more than twice the Bayesian error rate".
Error rates
There are many results on the error rate of the k nearest neighbour classifiers.[16] The k-nearest neighbour classifier is strongly (that is, for any joint distribution on $(X, Y)$) consistent provided $k := k_{n}$ diverges and $k_{n}/n$ converges to zero as $n \to \infty$.
Let $C^{knn}_{n}$ denote the k nearest neighbour classifier based on a training set of size n. Under certain regularity conditions, the excess risk yields the following asymptotic expansion[11]
$$\mathcal{R}(C^{knn}_{n}) - \mathcal{R}(C^{\mathrm{Bayes}}) = \left\{ B_{1} \frac{1}{k} + B_{2} \left( \frac{k}{n} \right)^{4/d} \right\} \{ 1 + o(1) \},$$
for some constants $B_{1}$ and $B_{2}$.
Feature extraction
When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g. the same measurement in both feet and meters), the input data are transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into this set of features is called feature extraction. If the extracted features are carefully chosen, the feature set is expected to capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Feature extraction is performed on raw data prior to applying the k-NN algorithm to the transformed data in feature space.
An example of a typical computer vision pipeline for face recognition using k-NN, including feature extraction and dimension-reduction pre-processing steps (usually implemented with OpenCV), is the following (a code sketch follows the list):
1. Haar face detection
2. Mean-shift tracking analysis
3. PCA or Fisher LDA projection into feature space, followed by k-NN classification
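A hedged sketch of such a pipeline is given below; the cascade file, the image size, and the use of scikit-learn for the PCA and k-NN stages are assumptions for illustration, and the tracking step is omitted.

# Sketch: face detection -> PCA projection -> k-NN classification
# (illustrative; assumes OpenCV and scikit-learn, and omits the tracking step).
import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_vector(image_bgr, size=(64, 64)):
    """Detect the first face and return it as a flattened grey-level vector."""
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return cv2.resize(grey[y:y + h, x:x + w], size).ravel()

# train_images / train_labels are assumed to exist: BGR images and person identifiers.
# vectors = np.array([face_vector(img) for img in train_images])
# pca = PCA(n_components=50).fit(vectors)                        # dimension reduction
# knn = KNeighborsClassifier(n_neighbors=3).fit(pca.transform(vectors), train_labels)
# prediction = knn.predict(pca.transform([face_vector(query_image)]))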
Dimension reduction
For high-dimensional data (e.g., with number of dimensions more than 10) dimension
reduction is usually performed prior to applying the k-NN algorithm in order to avoid the
effects of the curse of dimensionality.[17]
The curse of dimensionality in the k-NN context basically means that Euclidean distance is
unhelpful in high dimensions because all vectors are almost equidistant to the search query
vector (imagine multiple points lying more or less on a circle with the query point at the
center; the distance from the query to all data points in the search space is almost the same).
Feature extraction and dimension reduction can be combined in one step using principal
component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis
(CCA) techniques as a pre-processing step, followed by clustering by k-NN on feature vectors
in reduced-dimension space. This process is also called low-dimensional embedding.[18]
For very-high-dimensional datasets (e.g. when performing a similarity search on live video
streams, DNA data or high-dimensional time series) running a fast approximate k-NN search
using locality sensitive hashing, "random projections",[19] "sketches"[20] or other high-
dimensional similarity search techniques from the VLDB toolbox might be the only feasible
option.
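As a hedged sketch of the random-projection idea listed above (the projected dimension and the synthetic data are arbitrary assumptions), neighbours can be searched by brute force in a randomly projected, much lower-dimensional space:

# Sketch: approximate nearest-neighbour search via Gaussian random projection
# (illustrative; the projected dimension 32 and the synthetic data are arbitrary assumptions).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 1_024))          # a high-dimensional dataset
query = rng.standard_normal(1_024)

d_proj = 32
R = rng.standard_normal((1_024, d_proj)) / np.sqrt(d_proj)   # random projection matrix

X_low = X @ R                                    # project the data once, offline
q_low = query @ R                                # project the query

# brute-force search in the low-dimensional space approximates the true neighbours
approx_idx = np.argsort(np.linalg.norm(X_low - q_low, axis=1))[:5]
print(approx_idx)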
Decision boundary
Nearest neighbor rules in effect implicitly compute the decision boundary. It is also possible to
compute the decision boundary explicitly, and to do so efficiently, so that the computational
complexity is a function of the boundary complexity.[21]
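As an illustrative sketch (the toy data and grid resolution are assumptions), the implicit boundary can be made visible by classifying every point of a dense grid:

# Sketch: render the implicit 1-NN decision boundary by classifying a dense grid
# (toy data and grid resolution are illustrative assumptions).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

xs, ys = np.meshgrid(np.linspace(-3, 6, 200), np.linspace(-3, 6, 200))
grid = np.column_stack([xs.ravel(), ys.ravel()])
labels = clf.predict(grid).reshape(xs.shape)     # label map; where its value changes
                                                 # is the (implicit) decision boundary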
Data reduction
Data reduction is one of the most important problems when working with huge data sets. Usually, only some of the data points are needed for accurate classification. Those points are called the prototypes and can be found as follows (a sketch of step 2 follows the list):
1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for a
given k)
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the
classification decisions and (ii) the absorbed points that can be correctly classified by k-
NN using prototypes. The absorbed points can then be removed from the training set.
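A hedged sketch of step 2, in the spirit of a condensed nearest neighbour (CNN) scan, is given below; the seed point and scan order are illustrative choices.

# Sketch: separate prototypes from absorbed points with a CNN-style scan
# (illustrative; assumes class-outliers have already been removed, scan order is arbitrary).
import numpy as np

def condense(X, y):
    prototypes = [0]                              # seed the prototype set with one point
    changed = True
    while changed:                                # repeat until no point is added
        changed = False
        for i in range(len(X)):
            if i in prototypes:
                continue
            P = np.array(prototypes)
            nearest = P[np.argmin(np.linalg.norm(X[P] - X[i], axis=1))]
            if y[nearest] != y[i]:                # misclassified by 1-NN on the prototypes
                prototypes.append(i)              # -> promote it to a prototype
                changed = True
    absorbed = [i for i in range(len(X)) if i not in prototypes]
    return prototypes, absorbed                   # absorbed points can be dropped

# prototypes, absorbed = condense(X_clean, y_clean)   # X_clean: outlier-free training data (assumed)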
Selection of class-outliers
A training example surrounded by examples of other classes is called a class outlier. Causes of
class outliers include:
random error
insufficient training examples of this class (an isolated example appears instead of a cluster)
missing important features (the classes are separated in other dimensions which we don't
know)
too many training examples of other classes (unbalanced classes) that create a "hostile"
background for the given small class
Class outliers with k-NN produce noise. They can be detected and separated for future
analysis. Given two natural numbers, k>r>0, a training example is called a (k,r)NN class-outlier
if its k nearest neighbors include more than r examples of other classes.
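A hedged sketch of this rule (plain NumPy; the function name is illustrative):

# Sketch: flag (k, r)NN class-outliers -- points whose k nearest neighbours include
# more than r examples of other classes (illustrative implementation).
import numpy as np

def kr_class_outliers(X, y, k, r):
    outliers = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                         # exclude the point itself
        neighbours = np.argsort(dists)[:k]
        if np.sum(y[neighbours] != y[i]) > r:     # more than r foreign neighbours
            outliers.append(i)
    return outliers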
Calculation of the border ratio
For a training point x, let y be the closest point to x with a different label (a point of a different label than x is called external to x), and let x' be the closest point to y carrying the same label as x. The border ratio of x is a(x) = ‖x'-y‖ / ‖x-y‖. The border ratio lies in the interval [0,1] because ‖x'-y‖ never exceeds ‖x-y‖. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. The calculation is illustrated by the figure on the right: the data points are labeled by colors, the initial point x is red, external points are blue and green, the closest external point to x is y, and the closest red point to y is x'.
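A hedged sketch of the border ratio computation described above (plain NumPy; the function name is illustrative, and at least one point of another class is assumed to exist):

# Sketch: border ratio a(x) = ||x' - y|| / ||x - y||, where y is the nearest point to x
# with a different label and x' is the nearest point to y with the same label as x.
import numpy as np

def border_ratio(X, labels, i):
    d_to_x = np.linalg.norm(X - X[i], axis=1)
    external = np.where(labels != labels[i])[0]
    y_idx = external[np.argmin(d_to_x[external])]          # nearest external point y
    same = np.where(labels == labels[i])[0]
    d_to_y = np.linalg.norm(X - X[y_idx], axis=1)
    x_prime = same[np.argmin(d_to_y[same])]                # nearest same-label point to y
    return d_to_y[x_prime] / d_to_x[y_idx]                 # always in [0, 1]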
Below is an illustration of CNN in a series of figures. There are three classes (red, green and
blue). Fig. 1: initially there are 60 points in each class. Fig. 2 shows the 1NN classification map:
each pixel is classified by 1NN using all the data. Fig. 3 shows the 5NN classification map.
White areas correspond to the unclassified regions, where 5NN voting is tied (for example, if
there are two green, two red and one blue points among 5 nearest neighbors). Fig. 4 shows
the reduced data set. The crosses are the class-outliers selected by the (3,2)NN rule (all the
three nearest neighbors of these instances belong to other classes); the squares are the
prototypes, and the empty circles are the absorbed points. The bottom left corner shows the numbers of class-outliers, prototypes and absorbed points for all three classes. The proportion of prototypes varies from 15% to 20% across the classes in this example. Fig. 5
shows that the 1NN classification map with the prototypes is very similar to that with the
initial data set. The figures were produced using the Mirkes applet.[23]
Figures (CNN model reduction for k-NN classifiers): Fig. 1, the dataset; Fig. 2, the 1NN classification map.
k-NN regression
In k-NN regression, also known as k-NN smoothing, the k-NN algorithm is used for estimating
continuous variables. One such algorithm uses a weighted average of the k nearest neighbors,
weighted by the inverse of their distance. This algorithm works as follows:
1. Compute the Euclidean or Mahalanobis distance from the query example to the labeled
examples.
2. Order the labeled examples by increasing distance.