d(a, b) = [ ∑_{d=1}^{D} (a_d − b_d)² ]^{1/2}        (3.1)
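To make Equation (3.1) concrete, here is a minimal Python sketch of the Euclidean distance; the function name and the assertion are my additions, not from the text.

    import math

    def euclidean_distance(a, b):
        # Equation (3.1): square the per-dimension differences, sum them, take the root.
        assert len(a) == len(b), "distances are only defined when dimensions match"
        return math.sqrt(sum((a_d - b_d) ** 2 for a_d, b_d in zip(a, b)))

    # For example, euclidean_distance([1, 2.5, -6], [2, -2.5, 3]) is about 10.34.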
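Building on that distance, here is a hedged sketch of K-nearest-neighbor prediction by majority vote; the function name, the numpy usage, and the tie-breaking toward +1 are my choices, and labels are assumed to be ±1 as in the binary classification setting.

    import numpy as np

    def knn_predict(X, y, K, x_hat):
        # Distance (3.1) from the query x_hat to every training example.
        dists = np.linalg.norm(X - x_hat, axis=1)
        nearest = np.argsort(dists)[:K]      # indices of the K closest training points
        vote = y[nearest].sum()              # sum of the +1/-1 labels of those neighbors
        return 1 if vote >= 0 else -1        # majority vote; ties go to +1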
One aspect of inductive bias that we’ve seen for KNN is that it assumes that nearby points should have the same label. Another aspect, which is quite different from decision trees, is that all features are equally important! Recall that for decision trees, the key question was: which features are most useful for classification? The whole learning algorithm for a decision tree hinged on finding a small set of good features.

Figure 3.6: Classification data for ski vs snowboard in 2d
Vector sums are computed pointwise, and are only defined when dimensions match, so ⟨1, 2.5, −6⟩ + ⟨2, −2.5, 3⟩ = ⟨3, 0, −3⟩. In general, if c = a + b then c_d = a_d + b_d for all d. Vector addition can be viewed geometrically as taking a vector a, then tacking on b to the end of it; the new end point is exactly c.
Vectors can be scaled by real values; for instance 2⟨1, 2.5, −6⟩ = ⟨2, 5, −12⟩; this is called scalar multiplication. In general, ax = ⟨ax_1, ax_2, . . . , ax_D⟩.
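These pointwise operations are exactly what numpy arrays provide; a quick sketch using the same example values as above:

    import numpy as np

    a = np.array([1.0, 2.5, -6.0])
    b = np.array([2.0, -2.5, 3.0])
    print(a + b)    # pointwise sum: [ 3.  0. -3.]
    print(2 * a)    # scalar multiplication: [  2.   5. -12.]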
The standard way that we’ve been thinking about learning algo-
rithms up to now is in the query model. Based on training data, you
learn something. I then give you a query example and you have to
guess its label.
An alternative, less passive, way to think about a learned model is to ask: what sort of test examples will it classify as positive, and what sort will it classify as negative? In Figure 3.9, we have a set of training data. The background of the image is colored blue in regions that would be classified as positive (if a query were issued there).

Figure 3.9: decision boundary for 1nn.
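One way to see such a decision boundary is to classify every point on a dense grid with 1-NN and look at where the predicted label flips. The sketch below does this with plain numpy; the function name, the grid resolution, and the one-unit margin around the data are my choices, and labels are assumed numeric (e.g. ±1).

    import numpy as np

    def one_nn_decision_grid(X, y, resolution=200):
        # Build a grid of query points covering the 2d training data (plus a margin).
        xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, resolution)
        ys = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, resolution)
        grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
        # Give each grid point the label of its nearest training example.
        dists = np.linalg.norm(grid[:, None, :] - X[None, :, :], axis=2)
        labels = y[dists.argmin(axis=1)]
        return labels.reshape(resolution, resolution)

Coloring this array (say, blue where the label is +1) reproduces the kind of picture described for Figure 3.9.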
Up through this point, you have learned all about supervised learning (in particular, binary classification). As another example of the use of geometric intuitions and data, we are going to temporarily consider an unsupervised learning problem. In unsupervised learning, our data consists only of examples xn and does not contain corresponding labels. Your job is to make sense of this data, even though no one has provided you with correct labels. The particular notion of “making sense of” that we will talk about now is the clustering task.

Figure 3.12: decision boundary for dt in previous figure

What sort of data might yield a very simple decision boundary with a decision tree and very complex decision boundary with 1-nearest neighbor? What about the other way around?
Consider the data shown in Figure 3.13. Since this is unsupervised
learning and we do not have access to labels, the data points are
simply drawn as black dots. Your job is to split this data set into
three clusters. That is, you should label each data point as A, B or C
in whatever way you want.
For this data set, it’s pretty clear what you should do. You prob-
ably labeled the upper-left set of points A, the upper-right set of
points B and the bottom set of points C. Or perhaps you permuted
these labels. But chances are your clusters were the same as mine.
The K-means clustering algorithm is a particularly simple and effective approach to producing clusters on data like you see in Figure 3.13. The idea is to represent each cluster by its cluster center. Given cluster centers, we can simply assign each point to its nearest center.
A first question about this algorithm is: does it converge? A second question is: how long does it take to converge? The first question is actually easy to answer. Yes, it does. And in practice, it usually converges quite quickly (often in fewer than 20 iterations). In
Chapter 15, we will actually prove that it converges. The question of
how long it takes to converge is actually a really interesting question.
Even though the K-means algorithm dates back to the mid 1950s, the
best known convergence rates were terrible for a long time. Here, ter-
rible means exponential in the number of data points. This was a sad
situation because empirically we knew that it converged very quickly.
New algorithm analysis techniques called “smoothed analysis” were
invented in 2001 and have been used to show very fast convergence
for K-means (among other algorithms). These techniques are well
beyond the scope of this book (and this author!) but suffice it to say
that K-means is fast in practice and is provably fast in theory.
It is important to note that although K-means is guaranteed to
converge and guaranteed to converge quickly, it is not guaranteed to
converge to the “right answer.” The key problem with unsupervised
learning is that we have no way of knowing what the “right answer”
is. Convergence to a bad solution is usually due to poor initialization.

What is the difference between unsupervised and supervised learning that means that we know what the “right answer” is for supervised learning but not for unsupervised learning?
Algorithm 4 K-Means(D, K)
1: for k = 1 to K do
2: µk ← some random location // randomly initialize mean for kth cluster
3: end for
4: repeat
5: for n = 1 to N do
6: zn ← argmink ||µk − xn || // assign example n to closest center
7: end for
8: for k = 1 to K do
9: Xk ← { xn : zn = k } // points assigned to cluster k
10: µk ← mean(Xk ) // re-estimate center of cluster k
11: end for
12: until µs stop changing
13: return z // return cluster assignments
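A compact Python version of Algorithm 4, for reference; the iteration cap, the random seeding, and the empty-cluster fallback are my additions and are not part of the pseudocode above.

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # Lines 1-3: initialize the K centers (here: K distinct training points).
        mu = X[rng.choice(N, size=K, replace=False)].astype(float)
        z = np.zeros(N, dtype=int)
        for _ in range(max_iters):
            # Lines 5-7: assign each example to its closest center.
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            z = dists.argmin(axis=1)
            # Lines 8-11: re-estimate each center as the mean of its points.
            new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):      # line 12: until the centers stop changing
                break
            mu = new_mu
        return z, mu                         # line 13: cluster assignments (and centers)

On data like Figure 3.13, kmeans(X, 3) should recover the three visible clusters, up to a relabeling.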
over to high dimensions. We will consider two effects, but there are
countless others. The first is that high dimensional spheres look more
like porcupines than like balls.2 The second is that distances between
points in high dimensions are all approximately the same.
Let’s start in two dimensions as in Figure 3.16. We’ll start with four green spheres, each of radius one and each touching exactly two other green spheres. (Remember that in two dimensions a “sphere” is just a “circle.”) We’ll place a blue sphere in the middle so that it touches all four green spheres. We can easily compute the radius of this small sphere. The Pythagorean theorem says that 1² + 1² = (1 + r)², so solving for r we get r = √2 − 1 ≈ 0.41. Thus, by calculation, the blue sphere lies entirely within the cube (cube = square) that contains the green spheres. (Yes, this is also obvious from the picture, but perhaps you can see where this is going.)

² This result was related to me by Mark Reid, who heard about it from Marcus Hutter.

Figure 3.16: 2d spheres in spheres
Now we can do the same experiment in three dimensions, as shown in Figure 3.17. Again, we can use the Pythagorean theorem to compute the radius of the blue sphere. Now, we get 1² + 1² + 1² = (1 + r)², so r = √3 − 1 ≈ 0.73. This is still entirely enclosed in the cube of width four that holds all eight green spheres.

Figure 3.17: 3d spheres in spheres
At this point it becomes difficult to produce figures, so you’ll have to apply your imagination. In four dimensions, we would have 16 green spheres (called hyperspheres), each of radius one. They would still be inside a cube (called a hypercube) of width four. The blue hypersphere would have radius r = √4 − 1 = 1. Continuing to five dimensions, the blue hypersphere embedded in 32 green hyperspheres would have radius r = √5 − 1 ≈ 1.23 and so on.
In general, in D-dimensional space, there will be 2^D green hyperspheres of radius one. Each green hypersphere will touch exactly D-many other hyperspheres. The blue hypersphere in the middle will touch them all and will have radius r = √D − 1.
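It is worth plugging numbers into r = √D − 1. The cube in this construction has width four, so half-width two; once √D − 1 > 2 (that is, D > 9) the middle sphere actually pokes outside the cube that contains the green hyperspheres. A quick check, with dimensions chosen by me:

    import math

    for D in (2, 3, 4, 9, 10, 100):
        r = math.sqrt(D) - 1                 # radius of the middle hypersphere
        status = "pokes outside the width-4 cube" if r > 2 else "fits inside the cube"
        print(f"D = {D:3d}   r = {r:5.2f}   {status}")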
all of these distances begin to concentrate around 0.4√D, even for moderate D.

[Figure: histogram of pairwise distances; x-axis: distance / sqrt(dimensionality)]
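The concentration effect is easy to reproduce numerically. The sketch below assumes pairs of points drawn uniformly from the unit hypercube (my assumption, consistent with the distance / sqrt(dimensionality) axis): per coordinate, E[(x_d − y_d)²] = 1/6, so distances pile up near √(D/6) ≈ 0.41·√D while their spread shrinks as D grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for D in (2, 8, 32, 128, 512):
        x, y = rng.random((2, 10000, D))                     # 10,000 random pairs in [0,1]^D
        scaled = np.linalg.norm(x - y, axis=1) / np.sqrt(D)  # distance / sqrt(dimensionality)
        print(f"D = {D:3d}   mean = {scaled.mean():.3f}   std = {scaled.std():.3f}")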
Algorithm 9 KNN-Train-LOO(D)
1: errk ← 0, ∀1 ≤ k ≤ N − 1 // errk stores how well you do with kNN
2: for n = 1 to N do
14: return argmink errk // return the K that achieved lowest error
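For completeness, here is a hedged numpy sketch of the same leave-one-out idea: for every held-out example, let the k closest remaining points vote, and keep a running error count for each k. Labels are assumed to be ±1, and the function name and the tie handling (a zero vote counts as a mistake against a nonzero label) are mine.

    import numpy as np

    def knn_train_loo(X, y):
        N = len(X)
        err = np.zeros(N - 1)                    # err[k-1]: leave-one-out mistakes of k-NN
        for n in range(N):
            dists = np.linalg.norm(X - X[n], axis=1)
            dists[n] = np.inf                    # leave example n out
            order = np.argsort(dists)[:N - 1]    # remaining points, closest first
            votes = np.cumsum(y[order])          # running vote of the k closest points
            err += (np.sign(votes) != y[n])      # one more error for each k that misclassifies
        return int(np.argmin(err)) + 1           # the K with the lowest leave-one-out error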